.. image:: doc/pynndescent_logo.png
    :width: 600
    :align: center
    :alt: PyNNDescent Logo

.. image:: https://dev.azure.com/TutteInstitute/build-pipelines/_apis/build/status%2Flmcinnes.pynndescent?branchName=master
    :target: https://dev.azure.com/TutteInstitute/build-pipelines/_build?definitionId=17
    :alt: Azure Pipelines Build Status

.. image:: https://readthedocs.org/projects/pynndescent/badge/?version=latest
    :target: https://pynndescent.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

===========
PyNNDescent
===========
PyNNDescent is a Python library for approximate nearest neighbor search.
It provides a Python implementation of Nearest Neighbor
Descent for k-neighbor-graph construction and approximate nearest neighbor
search, as per the paper:
Dong, Wei, Moses Charikar, and Kai Li.
*"Efficient k-nearest neighbor graph construction for generic similarity
measures."*
Proceedings of the 20th international conference on World wide web. ACM, 2011.
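
The core idea of nearest neighbor descent is that a neighbor of a neighbor is
likely to be a neighbor as well. The following is a minimal pure-Python sketch
of that idea, not the implementation used by this library: the function name
``nn_descent``, its parameters, and the random test points are all assumptions
made for illustration.

.. code:: python

    import math
    import random


    def nn_descent(points, k, max_iters=10, seed=0):
        """Approximate k-nearest-neighbor graph via a simplified NN-descent loop."""
        rng = random.Random(seed)
        n = len(points)

        def dist(i, j):
            return math.dist(points[i], points[j])

        # Start from k random neighbors per point, stored as {neighbor: distance}.
        graph = []
        for i in range(n):
            others = rng.sample([j for j in range(n) if j != i], k)
            graph.append({j: dist(i, j) for j in others})

        for _ in range(max_iters):
            # Candidates for each point: its neighbors plus its reverse neighbors.
            candidates = [set(nbrs) for nbrs in graph]
            for i, nbrs in enumerate(graph):
                for j in nbrs:
                    candidates[j].add(i)

            updates = 0
            for cand in candidates:
                # Local join: two points that share a neighbor may be neighbors.
                for a in cand:
                    for b in cand:
                        if a == b or b in graph[a]:
                            continue
                        d = dist(a, b)
                        worst = max(graph[a], key=graph[a].get)
                        if d < graph[a][worst]:
                            # Replace the current farthest neighbor of a with b.
                            del graph[a][worst]
                            graph[a][b] = d
                            updates += 1
            if updates == 0:
                break

        # Return neighbor indices sorted from nearest to farthest.
        return [sorted(nbrs, key=nbrs.get) for nbrs in graph]


    rng = random.Random(42)
    points = [(rng.random(), rng.random()) for _ in range(100)]
    knn = nn_descent(points, k=5)
    print(knn[0])  # indices of the 5 approximate nearest neighbors of point 0

The sketch omits the sampling and incremental-search refinements described in
the paper, so it is only suitable for small inputs.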
This library supplements that approach with the use of random projection trees for
initialisation. This can be particularly useful for the metrics that are
amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph
diversification is also performed, pruning the longest edges of any triangles in the
graph.
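
One way to read the diversification step is sketched below: an edge from a
node to a neighbor ``c`` is dropped when some closer, already-kept neighbor
``b`` is nearer to ``c`` than the node is, which makes that edge the longest
side of the triangle. The helper ``diversify`` and the sample points are
hypothetical, and the library's own pruning may differ in detail.

.. code:: python

    import math


    def diversify(node, neighbors, points):
        """Drop an edge node->c when it is the longest edge of a triangle (node, b, c)."""
        kept = []
        # Visit neighbors from nearest to farthest.
        for c in sorted(neighbors, key=lambda j: math.dist(points[node], points[j])):
            d_node_c = math.dist(points[node], points[c])
            # If an already-kept neighbor b is closer to c than node is, then
            # node->c is the longest side of the triangle (node, b, c): prune it.
            if all(math.dist(points[b], points[c]) >= d_node_c for b in kept):
                kept.append(c)
        return kept


    points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.5)]
    print(diversify(0, [1, 2, 3], points))  # prints [1, 3]

Point 2 is pruned because it can still be reached from point 0 through
point 1, which lies closer to it.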
Currently this library targets relatively high accuracy
(80%-100% accuracy rate) approximate nearest neighbor searches.
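
Assuming "accuracy rate" here means recall (the fraction of true nearest
neighbors that the approximate search returns, as measured by ann-benchmarks),
it can be computed as in the short sketch below. The function name
``recall_at_k`` and the toy neighbor lists are hypothetical.

.. code:: python

    def recall_at_k(approx, exact):
        """Mean fraction of true nearest neighbors present in the approximate lists."""
        hits = sum(len(set(a) & set(e)) for a, e in zip(approx, exact))
        total = sum(len(e) for e in exact)
        return hits / total


    exact = [[1, 2, 3], [0, 2, 4]]    # true neighbors of two query points
    approx = [[1, 2, 5], [0, 2, 4]]   # approximate results for the same queries
    print(recall_at_k(approx, exact))  # 5 of the 6 true neighbors were found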
--------------------
Why use PyNNDescent?
--------------------
PyNNDescent provides fast approximate nearest neighbor queries. The
`ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_ system puts it
solidly in the mix of top performing ANN libraries:
**SIFT-128 Euclidean**
.. image:: https://pynndescent.readthedocs.io/en/latest/_images/sift.png
:alt: ANN benchmark performance for SIFT 128 dataset
**NYTimes-256 Angular**
.. image:: https://pynndescent.readthedocs.io/en/latest/_images/nytimes.png
:alt: ANN benchmark performance for NYTimes 256 dataset
While PyNNDescent is among the fastest ANN libraries, it is also easy to install (pip
and conda installable) with no platform or compilation issues, and is very flexible,
supporting a wide variety of distance metrics by default:
**Minkowski style metrics**
- euclidean
- manhattan
- chebyshev
- minkowski
**Miscellaneous spatial metrics**
- canberra
- braycurtis
- haversine
**Normalized spatial metrics**
- mahalanobis
- wminkowski
- seuclidean
**Angular and correlation metrics**
- cosine
- dot
- correlation
- spearmanr
- tsss
- true_angular
**Probability metrics**
- hellinger
- wasserstein
**Metrics for binary data**
- hamming
- jaccard
- dice
- russellrao
- kulsinski
- rogerstanimoto
- sokalmichener
- sokalsneath
- yule
and also custom user-defined distance metrics, while still retaining performance.
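
As a rough illustration of a custom metric, the sketch below passes a
Numba-compiled function as the ``metric`` argument. The metric
``scaled_manhattan`` and the random data are hypothetical, and the example
assumes that a jitted function taking two 1-D arrays and returning a float is
accepted in place of a metric name.

.. code:: python

    import numba
    import numpy as np
    from pynndescent import NNDescent


    @numba.njit(fastmath=True)
    def scaled_manhattan(x, y):
        """Hypothetical custom metric: Manhattan distance scaled by 0.5."""
        result = 0.0
        for i in range(x.shape[0]):
            result += np.abs(x[i] - y[i])
        return 0.5 * result


    rng = np.random.default_rng(42)
    data = rng.random((1000, 16)).astype(np.float32)  # hypothetical data

    index = NNDescent(data, metric=scaled_manhattan, n_neighbors=10, random_state=42)
    indices, distances = index.neighbor_graph  # k-neighbor graph of the data

Compiling the metric with Numba is what lets a user-defined function run at a
speed comparable to the built-in metrics.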
PyNNDescent also integrates well with scikit-learn, providing a
``PyNNDescentTransformer`` that can serve as a drop-in replacement for
``KNeighborsTransformer`` in algorithms that make use of nearest neighbor computations.
----------------------
How to use PyNNDescent
----------------------
PyNNDescent aims to have a very simple interface. It is similar to (but more
limited than) KDTrees and BallTrees in ``sklearn``. In practice there are
only two operations -- index construction, and querying an index for nearest
neighbors.
To build a new search index on some training data ``data`` you can do something
like
.. code:: python

    from pynndescent import NNDescent

    # data: array of shape (n_samples, n_features)
    index = NNDescent(data)

You can then use the index for searching (and can pickle it to disk if you
wish). To search a pynndescent index for the 15 nearest neighbors of a test data
set ``query_data`` you can do something like
.. code:: python

    # Returns neighbor indices and distances, each of shape (n_queries, 15)
    indices, distances = index.query(query_data, k=15)

and that is pretty much all there is to it. You can find more details in the
`documentation <https://pynndescent.readthedocs.org>`_.
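
Putting the two operations together, a small end-to-end sketch might look like
the following. The random arrays stand in for real training and query data, and
the shapes, seeds, and in-memory pickle round trip are choices made for
illustration only.

.. code:: python

    import pickle

    import numpy as np
    from pynndescent import NNDescent

    rng = np.random.default_rng(0)
    data = rng.random((2000, 20)).astype(np.float32)      # hypothetical training data
    query_data = rng.random((50, 20)).astype(np.float32)  # hypothetical query points

    # Build the index, then look up the 15 nearest neighbors of each query point.
    index = NNDescent(data, metric="euclidean", random_state=0)
    indices, distances = index.query(query_data, k=15)
    print(indices.shape, distances.shape)  # (50, 15) (50, 15)

    # The index can be pickled, so it can be saved and restored later.
    restored = pickle.loads(pickle.dumps(index))

Row ``i`` of ``indices`` holds positions in ``data`` of the approximate
neighbors of ``query_data[i]``, with the matching distances in ``distances``.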
----------
Installing
----------
PyNNDescent is designed to be easy to install, being a pure Python module with
relatively light requirements:
* numpy
* scipy
* scikit-learn >= 0.22
* numba >= 0.51
all of which should be pip or conda installable. The easiest way to install should be
via conda:
.. code:: bash

    conda install -c conda-forge pynndescent

or via pip:
.. code:: bash

    pip install pynndescent

To manually install this package:
.. code:: bash

    wget https://github.com/lmcinnes/pynndescent/archive/master.zip
    unzip master.zip
    rm master.zip
    cd pynndescent-master
    pip install .

----------------
Help and Support
----------------
This project is still young. The documentation is still growing. In the meantime please
`open an issue <https://github.com/lmcinnes/pynndescent/issues/new>`_
and I will try to provide any help and guidance that I can. Please also check
the docstrings on the code, which provide some descriptions of the parameters.
-------
License
-------
The pynndescent package is 2-clause BSD licensed. Enjoy.
------------
Contributing
------------
Contributions are more than welcome! There are lots of opportunities
for potential projects, so please get in touch if you would like to
help out. Everything from code to notebooks to
examples and documentation is *equally valuable*, so please don't feel
you can't contribute. To contribute, please `fork the project <https://github.com/lmcinnes/pynndescent/issues#fork-destination-box>`_, make your changes, and
submit a pull request. We will do our best to work through any issues with
you and get your code merged into the main branch.