File: PKG-INFO

package info (click to toggle)
python-datrie 0.8.3-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 1,036 kB
  • sloc: ansic: 3,794; python: 756; sh: 13; makefile: 9
file content (498 lines) | stat: -rw-r--r-- 15,364 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
Metadata-Version: 2.4
Name: datrie
Version: 0.8.3
Summary: Super-fast, efficiently stored Trie for Python.
Home-page: https://github.com/kmike/datrie
Author: Mikhail Korobov
Author-email: kmike84@gmail.com
License: LGPLv2+
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Lesser General Public License v2 or later (LGPLv2+)
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*
License-File: COPYING
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-python
Dynamic: summary

datrie |travis| |appveyor|
==========================

.. |travis| image:: https://travis-ci.org/pytries/datrie.svg
   :target: https://travis-ci.org/pytries/datrie

.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/6bpvhllpjhlau7x0?svg=true
   :target: https://ci.appveyor.com/project/superbobry/datrie

Super-fast, efficiently stored Trie for Python (2.x and 3.x).
Uses `libdatrie`_.

.. _libdatrie: https://linux.thai.net/~thep/datrie/datrie.html

Installation
============

::

    pip install datrie

Usage
=====

Create a new trie capable of storing items with lower-case ascii keys::

    >>> import string
    >>> import datrie
    >>> trie = datrie.Trie(string.ascii_lowercase)

``trie`` variable is a dict-like object that can have unicode keys of
certain ranges and Python objects as values.

In addition to implementing the mapping interface, tries facilitate
finding the items for a given prefix, and vice versa, finding the
items whose keys are prefixes of a given string. As a common special
case, finding the longest-prefix item is also supported.

.. warning::

    For efficiency you must define allowed character range(s) while
    creating trie. ``datrie`` doesn't check if keys are in allowed
    ranges at runtime, so be careful! Invalid keys are OK at lookup time
    but values won't be stored correctly for such keys.

Add some values to it (datrie keys must be unicode; the examples
are for Python 2.x)::

    >>> trie[u'foo'] = 5
    >>> trie[u'foobar'] = 10
    >>> trie[u'bar'] = 'bar value'
    >>> trie.setdefault(u'foobar', 15)
    10

Check if u'foo' is in trie::

    >>> u'foo' in trie
    True

Get a value::

    >>> trie[u'foo']
    5

Find all prefixes of a word::

    >>> trie.prefixes(u'foobarbaz')
    [u'foo', u'foobar']

    >>> trie.prefix_items(u'foobarbaz')
    [(u'foo', 5), (u'foobar', 10)]

    >>> trie.iter_prefixes(u'foobarbaz')
    <generator object ...>

    >>> trie.iter_prefix_items(u'foobarbaz')
    <generator object ...>

Find the longest prefix of a word::

    >>> trie.longest_prefix(u'foo')
    u'foo'

    >>> trie.longest_prefix(u'foobarbaz')
    u'foobar'

    >>> trie.longest_prefix(u'gaz')
    KeyError: u'gaz'

    >>> trie.longest_prefix(u'gaz', default=u'vasia')
    u'vasia'

    >>> trie.longest_prefix_item(u'foobarbaz')
    (u'foobar', 10)

Check if the trie has keys with a given prefix::

    >>> trie.has_keys_with_prefix(u'fo')
    True

    >>> trie.has_keys_with_prefix(u'FO')
    False

Get all items with a given prefix from a trie::

    >>> trie.keys(u'fo')
    [u'foo', u'foobar']

    >>> trie.items(u'ba')
    [(u'bar', 'bar value')]

    >>> trie.values(u'foob')
    [10]

Get all suffixes of certain word starting with a given prefix from a trie::

    >>> trie.suffixes()
    [u'pro', u'producer', u'producers', u'product', u'production', u'productivity', u'prof']
    >>> trie.suffixes(u'prod')
    [u'ucer', u'ucers', u'uct', u'uction', u'uctivity']


Save & load a trie (values must be picklable)::

    >>> trie.save('my.trie')
    >>> trie2 = datrie.Trie.load('my.trie')



Trie and BaseTrie
=================

There are two Trie classes in datrie package: ``datrie.Trie`` and
``datrie.BaseTrie``. ``datrie.BaseTrie`` is slightly faster and uses less
memory but it can store only integer numbers -2147483648 <= x <= 2147483647.
``datrie.Trie`` is a bit slower but can store any Python object as a value.

If you don't need values or integer values are OK then use ``datrie.BaseTrie``::

    import datrie
    import string
    trie = datrie.BaseTrie(string.ascii_lowercase)

Custom iteration
================

If the built-in trie methods don't fit you can use ``datrie.State`` and
``datrie.Iterator`` to implement custom traversal.

.. note::

    If you use ``datrie.BaseTrie`` you need ``datrie.BaseState`` and
    ``datrie.BaseIterator`` for custom traversal.


For example, let's find all suffixes of ``'fo'`` for our trie and get
the values::

    >>> state = datrie.State(trie)
    >>> state.walk(u'foo')
    >>> it = datrie.Iterator(state)
    >>> while it.next():
    ...     print(it.key())
    ...     print(it.data))
    o
    5
    obar
    10

Performance
===========

Performance is measured for ``datrie.Trie`` against Python's dict with
100k unique unicode words (English and Russian) as keys and '1' numbers
as values.

``datrie.Trie`` uses about 5M memory for 100k words; Python's dict
uses about 22M for this according to my unscientific tests.

This trie implementation is 2-6 times slower than python's dict
on __getitem__. Benchmark results (macbook air i5 1.8GHz,
"1.000M ops/sec" == "1 000 000 operations per second")::

    Python 2.6:
    dict __getitem__: 7.107M ops/sec
    trie __getitem__: 2.478M ops/sec

    Python 2.7:
    dict __getitem__: 6.550M ops/sec
    trie __getitem__: 2.474M ops/sec

    Python 3.2:
    dict __getitem__: 8.185M ops/sec
    trie __getitem__: 2.684M ops/sec

    Python 3.3:
    dict __getitem__: 7.050M ops/sec
    trie __getitem__: 2.755M ops/sec

Looking for prefixes of a given word is almost as fast as
``__getitem__`` (results are for Python 3.3)::

    trie.iter_prefix_items (hits):      0.461M ops/sec
    trie.prefix_items (hits):           0.743M ops/sec
    trie.prefix_items loop (hits):      0.629M ops/sec
    trie.iter_prefixes (hits):          0.759M ops/sec
    trie.iter_prefixes (misses):        1.538M ops/sec
    trie.iter_prefixes (mixed):         1.359M ops/sec
    trie.has_keys_with_prefix (hits):   1.896M ops/sec
    trie.has_keys_with_prefix (misses): 2.590M ops/sec
    trie.longest_prefix (hits):         1.710M ops/sec
    trie.longest_prefix (misses):       1.506M ops/sec
    trie.longest_prefix (mixed):        1.520M ops/sec
    trie.longest_prefix_item (hits):    1.276M ops/sec
    trie.longest_prefix_item (misses):  1.292M ops/sec
    trie.longest_prefix_item (mixed):   1.379M ops/sec

Looking for all words starting with a given prefix is mostly limited
by overall result count (this can be improved in future because a
lot of time is spent decoding strings from utf_32_le to Python's
unicode)::

    trie.items(prefix="xxx"), avg_len(res)==415:        0.609K ops/sec
    trie.keys(prefix="xxx"), avg_len(res)==415:         0.642K ops/sec
    trie.values(prefix="xxx"), avg_len(res)==415:       4.974K ops/sec
    trie.items(prefix="xxxxx"), avg_len(res)==17:       14.781K ops/sec
    trie.keys(prefix="xxxxx"), avg_len(res)==17:        15.766K ops/sec
    trie.values(prefix="xxxxx"), avg_len(res)==17:      96.456K ops/sec
    trie.items(prefix="xxxxxxxx"), avg_len(res)==3:     75.165K ops/sec
    trie.keys(prefix="xxxxxxxx"), avg_len(res)==3:      77.225K ops/sec
    trie.values(prefix="xxxxxxxx"), avg_len(res)==3:    320.755K ops/sec
    trie.items(prefix="xxxxx..xx"), avg_len(res)==1.4:  173.591K ops/sec
    trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4:   180.678K ops/sec
    trie.values(prefix="xxxxx..xx"), avg_len(res)==1.4: 503.392K ops/sec
    trie.items(prefix="xxx"), NON_EXISTING:             2023.647K ops/sec
    trie.keys(prefix="xxx"), NON_EXISTING:              1976.928K ops/sec
    trie.values(prefix="xxx"), NON_EXISTING:            2060.372K ops/sec

Random insert time is very slow compared to dict, this is the limitation
of double-array tries; updates are quite fast. If you want to build a trie,
consider sorting keys before the insertion::

    dict __setitem__ (updates):            6.497M ops/sec
    trie __setitem__ (updates):            2.633M ops/sec
    dict __setitem__ (inserts, random):    5.808M ops/sec
    trie __setitem__ (inserts, random):    0.053M ops/sec
    dict __setitem__ (inserts, sorted):    5.749M ops/sec
    trie __setitem__ (inserts, sorted):    0.624M ops/sec
    dict setdefault (updates):             3.455M ops/sec
    trie setdefault (updates):             1.910M ops/sec
    dict setdefault (inserts):             3.466M ops/sec
    trie setdefault (inserts):             0.053M ops/sec

Other results (note that ``len(trie)`` is currently implemented
using trie traversal)::

    dict __contains__ (hits):    6.801M ops/sec
    trie __contains__ (hits):    2.816M ops/sec
    dict __contains__ (misses):  5.470M ops/sec
    trie __contains__ (misses):  4.224M ops/sec
    dict __len__:                334336.269 ops/sec
    trie __len__:                22.900 ops/sec
    dict values():               406.507 ops/sec
    trie values():               20.864 ops/sec
    dict keys():                 189.298 ops/sec
    trie keys():                 2.773 ops/sec
    dict items():                48.734 ops/sec
    trie items():                2.611 ops/sec

Please take this benchmark results with a grain of salt; this
is a very simple benchmark and may not cover your use case.

Current Limitations
===================

* keys must be unicode (no implicit conversion for byte strings
  under Python 2.x, sorry);
* there are no iterator versions of keys/values/items (this is not
  implemented yet);
* it is painfully slow and maybe buggy under pypy;
* library is not tested with narrow Python builds.

Contributing
============

Development happens at github: https://github.com/pytries/datrie.

Feel free to submit ideas, bugs, pull requests.

Running tests and benchmarks
----------------------------

Make sure `tox`_ is installed and run

::

    $ tox

from the source checkout. Tests should pass under Python 2.7 and 3.4+.

::

    $ tox -c tox-bench.ini

runs benchmarks.

If you've changed anything in the source code then
make sure `cython`_ is installed and run

::

    $ update_c.sh

before each ``tox`` command.

Please note that benchmarks are not included in the release
tar.gz's because benchmark data is large and this
saves a lot of bandwidth; use source checkouts from
github or bitbucket for the benchmarks.

.. _cython: https://cython.org/
.. _tox: https://tox.readthedocs.io/

Authors & Contributors
----------------------

See https://github.com/pytries/datrie/graphs/contributors.

This module is based on `libdatrie`_ C library by Theppitak Karoonboonyanan
and is inspired by `fast_trie`_ Ruby bindings, `PyTrie`_ pure
Python implementation and `Tree::Trie`_ Perl implementation;
some docs and API ideas are borrowed from these projects.

.. _fast_trie: https://github.com/tyler/trie
.. _PyTrie: https://github.com/gsakkis/pytrie
.. _Tree::Trie: https://metacpan.org/pod/release/AVIF/Tree-Trie-1.9/Trie.pm

License
=======

Licensed under LGPL v2.1.
CHANGES
=======

0.8.2 (2020-03-25)
------------------
* Future-proof Python support by making cython a build time dependency and
  removing cython generated c files from the repo (and sdist).
* Fix collections.abc.MutableMapping import
* CI and test updates
* Adjust library name to unbreak some linkers

0.8.1 (skipped)
---------------
This version intentionally skipped

0.8 (2019-07-03)
----------------
* Python 3.7 compatibility; extension is rebuilt with Cython 0.29.11.
* Trie.get function;
* Python 2.6 and 3.3 support is dropped;
* removed patch to libdatrie which is no longer required;
* testing and CI fixes.

0.7.1 (2016-03-12)
------------------

* updated the bundled C library to version 0.2.9;
* implemented ``Trie.__len__`` in terms of ``trie_enumerate``;
* rebuilt Cython wrapper with Cython 0.23.4;
* changed ``Trie`` to implement ``collections.abc.MutableMapping``;
* fixed ``Trie`` pickling, which segfaulted on Python2.X.

0.7 (2014-02-18)
----------------

* bundled libdatrie C library is updated to version 0.2.8;
* new `.suffixes()` method (thanks Ahmed T. Youssef);
* wrapper is rebuilt with Cython 0.20.1.

0.6.1 (2013-09-21)
------------------

* fixed build for Visual Studio (thanks Gabi Davar).

0.6 (2013-07-09)
----------------

* datrie is rebuilt with Cython 0.19.1;
* ``iter_prefix_values``, ``prefix_values`` and ``longest_prefix_value``
  methods for ``datrie.BaseTrie`` and ``datrie.Trie`` (thanks Jared Suttles).

0.5.1 (2013-01-30)
------------------

* Recently introduced memory leak in ``longest_prefix``
  and ``longest_prefix_item`` is fixed.

0.5 (2013-01-29)
----------------

* ``longest_prefix`` and ``longest_prefix_item`` methods are fixed;
* datrie is rebuilt with Cython 0.18;
* misleading benchmark results in README are fixed;
* State._walk is renamed to State.walk_char.

0.4.2 (2012-09-02)
------------------

* Update to latest libdatrie; this makes ``.keys()`` method a bit slower but
  removes a keys length limitation.

0.4.1 (2012-07-29)
------------------

* cPickle is used for saving/loading ``datrie.Trie`` if it is available.

0.4 (2012-07-27)
----------------

* ``libdatrie`` improvements and bugfixes, including C iterator API support;
* custom iteration support using ``datrie.State`` and ``datrie.Iterator``.
* speed improvements: ``__length__``, ``keys``, ``values`` and
  ``items`` methods should be up to 2x faster.
* keys longer than 32768 are not supported in this release.


0.3 (2012-07-21)
----------------

There are no new features or speed improvements in this release.

* ``datrie.new`` is deprecated; use ``datrie.Trie`` with the same arguments;
* small test & benchmark improvements.

0.2 (2012-07-16)
----------------

* ``datrie.Trie`` items can have any Python object as a value
  (``Trie`` from 0.1.x becomes ``datrie.BaseTrie``);
* ``longest_prefix`` and ``longest_prefix_items`` are fixed;
* ``save`` & ``load`` are rewritten;
* ``setdefault`` method.


0.1.1 (2012-07-13)
------------------

* Windows support (upstream libdatrie changes are merged);
* license is changed from LGPL v3 to LGPL v2.1 to match the libdatrie license.

0.1 (2012-07-12)
----------------

Initial release.