File: levenshtein.rst

package info (click to toggle)
python-sphobjinv 2.3.1.3-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 2,328 kB
  • sloc: python: 4,194; makefile: 29; sh: 20
file content (144 lines) | stat: -rw-r--r-- 6,085 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
.. Info on speedups from python-Levenshtein

Speeding up "suggest" with python-Levenshtein (DEPRECATED)
==========================================================

|soi| uses |fuzzywuzzy|_ for fuzzy-match searching of object
names/domains/roles as part of the
:meth:`Inventory.suggest() <sphobjinv.inventory.Inventory.suggest>` functionality,
also implemented as the CLI :doc:`suggest <cli/suggest>` subcommand.

|fuzzywuzzy|_ uses :class:`difflib.SequenceMatcher`
from the Python standard library for its fuzzy searching.
While earlier versions of |soi| were able to make use of
|fuzzywuzzy|_\ 's optional link to |python-Levenshtein|_,
a Python C extension providing similar functionality,
due to a licensing conflict this is no longer possible.
|soi| now uses a vendored copy of |fuzzywuzzy|_ from an
era when it was released under the MIT License.

Formally:

.. versionremoved:: 2.2

    Acceleration of the |soi| "suggest" mode via |python-Levenshtein|_
    has been deprecated and is no longer available. 

The discussion of performance benchmarks and variations in matching
behavior is kept below for historical interest.


Performance Benchmark
---------------------

The chart below presents one dataset illustrating the performance enhancement
that can be obtained by installing |python-Levenshtein|_.
The timings plotted here are from execution of 
:func:`timeit.repeat` around a 
:meth:`~sphobjinv.inventory.Inventory.suggest` call,
searching for the term "function", for a number of
|objects.inv| files from different projects (see
`here <https://github.com/bskinn/sphobjinv/tree/
6c1f22e40dc3d129485462aec05adbed2ff40ab8/sphobjinv/test/resource>`__).

The timings were collected using the following code::

    import sphobjinv as soi

    durations = {}
    obj_counts = {}

    for fn in os.listdir():
        if fn.endswith('.inv'):
            inv = soi.Inventory(fn)

            # Execute the 'suggest' operation 20 times for each file
            timings = timeit.repeat("inv.suggest('function')", repeat=20, number=1, globals=globals())

            # Store the average timing for each file
            durations.update({fn: sum(timings) / len(timings)})

            # Store the number of objects
            obj_counts.update({fn: inv.count})


As can be seen, the fifty-two |objects.inv| files in this dataset 
contain widely varying numbers of objects :math:`n`:

.. image:: /_static/suggest_timing.png

Unsurprisingly, larger inventories require more time to search.
Also relatively unsurprisingly, the time required appears to be
roughly :math:`O(n)`, since the fuzzy search must be performed once on the
:attr:`~sphobjinv.data.SuperDataObj.as_rst` representation of each object.

For this specific search, using |python-Levenshtein|_ instead of
:mod:`difflib` decreases the time required from 0.90 seconds per thousand objects
down to 0.15 seconds per thousand objects,
representing a performance improvement of almost exactly six-fold.
Other searches will likely exhibit somewhat better or worse
improvement from the use of |python-Levenshtein|_,
depending on the average length of the reST-like representations
of the objects in an |objects.inv|
and the length of the search term.


Variations in Matching Behavior 
-------------------------------

Note that the matching scores calculated by
:mod:`difflib` and |python-Levenshtein|_ can often
differ appreciably.  (This is illustrated in
`issue #128 <https://github.com/seatgeek/fuzzywuzzy/issues/128>`__
of the |fuzzywuzzy|_ GitHub repo.)
This difference in behavior doesn't have much practical significance,
save for the potential of causing some confusion between users with/without
|python-Levenshtein|_ installed.


As an example, the following shows an excerpt of the results of a representative
CLI :doc:`suggest </cli/suggest>` call **without**
|python-Levenshtein|_::

    $ sphobjinv suggest objects_scipy.inv surface -asit 40

      Name                                                   Score    Index
    ------------------------------------------------------  -------  -------
    :py:function:`scipy.misc.face`                            64      1018
    :py:function:`scipy.misc.source`                          64      1032
    :std:doc:`generated/scipy.misc.face`                      64      4042
    :std:doc:`generated/scipy.misc.source`                    64      4056
    :py:data:`scipy.stats.rice`                               56      2688
    :std:label:`continuous-rice`                              56      2896
    :py:method:`scipy.integrate.complex_ode.successful`       51       156
    :py:method:`scipy.integrate.ode.successful`               51       171
    :py:function:`scipy.linalg.lu_factor`                     51       967

    ... more with score 51 ...

    :py:attribute:`scipy.LowLevelCallable.signature`          50        5
    :py:function:`scipy.constants.convert_temperature`        50       53
    :py:function:`scipy.integrate.quadrature`                 50       176

    ... more with score 50 and below ...

This is a similar excerpt **with** |python-Levenshtein|_::

      Name                                                   Score    Index
    ------------------------------------------------------  -------  -------
    :py:function:`scipy.misc.face`                            64      1018
    :py:function:`scipy.misc.source`                          64      1032
    :std:doc:`generated/scipy.misc.face`                      64      4042
    :std:doc:`generated/scipy.misc.source`                    64      4056
    :py:method:`scipy.integrate.ode.successful`               51       171
    :py:function:`scipy.linalg.lu_factor`                     51       967
    :py:function:`scipy.linalg.subspace_angles`               51      1003

    ... more with score 51 ...

    :py:function:`scipy.cluster.hierarchy.fcluster`           49       23
    :py:function:`scipy.cluster.hierarchy.fclusterdata`       49       24
    :py:method:`scipy.integrate.complex_ode.successful`       49       156

    ... more with score 49 and below ...