datrie |travis| |appveyor|
==========================
.. |travis| image:: https://travis-ci.org/pytries/datrie.svg
:target: https://travis-ci.org/pytries/datrie
.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/6bpvhllpjhlau7x0?svg=true
:target: https://ci.appveyor.com/project/superbobry/datrie
Super-fast, efficiently stored Trie for Python (2.x and 3.x).
Uses `libdatrie`_.
.. _libdatrie: https://linux.thai.net/~thep/datrie/datrie.html
Installation
============
::
pip install datrie
Usage
=====
Create a new trie capable of storing items with lower-case ascii keys::
>>> import string
>>> import datrie
>>> trie = datrie.Trie(string.ascii_lowercase)
The ``trie`` variable is a dict-like object that can have unicode keys within
certain ranges and arbitrary Python objects as values.
In addition to implementing the mapping interface, tries facilitate
finding the items for a given prefix, and vice versa, finding the
items whose keys are prefixes of a given string. As a common special
case, finding the longest-prefix item is also supported.
.. warning::

    For efficiency you must define the allowed character range(s) when
    creating the trie. ``datrie`` doesn't check that keys are within the
    allowed ranges at runtime, so be careful! Invalid keys are accepted at
    lookup time, but values won't be stored correctly for such keys.
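If your keys are not limited to lower-case ASCII, you can pass explicit
character ranges instead of an alphabet string. A minimal sketch, assuming
the ``ranges`` keyword argument of ``datrie.Trie`` (the chosen ranges and the
``cyr_trie`` name are only an illustration; this trie is separate from the
one used in the examples below)::

    import datrie

    # A separate trie that allows lower-case ASCII plus lower-case Cyrillic.
    cyr_trie = datrie.Trie(ranges=[(u'a', u'z'), (u'\u0430', u'\u044f')])
    cyr_trie[u'foo'] = 1
    cyr_trie[u'\u043f\u0440\u0438\u0432\u0435\u0442'] = 2  # u'привет'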
Add some values to it (datrie keys must be unicode; the examples
are for Python 2.x)::
>>> trie[u'foo'] = 5
>>> trie[u'foobar'] = 10
>>> trie[u'bar'] = 'bar value'
>>> trie.setdefault(u'foobar', 15)
10
Check if u'foo' is in trie::
>>> u'foo' in trie
True
Get a value::
>>> trie[u'foo']
5
Find all prefixes of a word::
>>> trie.prefixes(u'foobarbaz')
[u'foo', u'foobar']
>>> trie.prefix_items(u'foobarbaz')
[(u'foo', 5), (u'foobar', 10)]
>>> trie.iter_prefixes(u'foobarbaz')
<generator object ...>
>>> trie.iter_prefix_items(u'foobarbaz')
<generator object ...>
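The iterator variants are lazy and can be consumed like any other generator,
for example::

    >>> list(trie.iter_prefixes(u'foobarbaz'))
    [u'foo', u'foobar']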
Find the longest prefix of a word::
>>> trie.longest_prefix(u'foo')
u'foo'
>>> trie.longest_prefix(u'foobarbaz')
u'foobar'
>>> trie.longest_prefix(u'gaz')
KeyError: u'gaz'
>>> trie.longest_prefix(u'gaz', default=u'vasia')
u'vasia'
>>> trie.longest_prefix_item(u'foobarbaz')
(u'foobar', 10)
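Longest-prefix matching is handy for greedy tokenization. A minimal sketch
(the ``segment`` helper is hypothetical, not part of datrie)::

    def segment(trie, text):
        # Greedily split text into the longest known keys, falling
        # back to a single character when nothing matches.
        parts = []
        while text:
            word = trie.longest_prefix(text, default=text[0])
            parts.append(word)
            text = text[len(word):]
        return parts

    >>> segment(trie, u'foobarbar')
    [u'foobar', u'bar']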
Check if the trie has keys with a given prefix::
>>> trie.has_keys_with_prefix(u'fo')
True
>>> trie.has_keys_with_prefix(u'FO')
False
Get all items with a given prefix from a trie::
>>> trie.keys(u'fo')
[u'foo', u'foobar']
>>> trie.items(u'ba')
[(u'bar', 'bar value')]
>>> trie.values(u'foob')
[10]
Get the suffixes of all keys that start with a given prefix (the output
below comes from a different, larger trie than the one built above)::
>>> trie.suffixes()
[u'pro', u'producer', u'producers', u'product', u'production', u'productivity', u'prof']
>>> trie.suffixes(u'prod')
[u'ucer', u'ucers', u'uct', u'uction', u'uctivity']
Save & load a trie (values must be picklable)::
>>> trie.save('my.trie')
>>> trie2 = datrie.Trie.load('my.trie')
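A round-trip sketch using a temporary directory (the path handling here is
only illustrative)::

    import os
    import tempfile

    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, 'my.trie')
    trie.save(path)
    restored = datrie.Trie.load(path)
    assert restored[u'foobar'] == trie[u'foobar']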
Trie and BaseTrie
=================
There are two trie classes in the ``datrie`` package: ``datrie.Trie`` and
``datrie.BaseTrie``. ``datrie.BaseTrie`` is slightly faster and uses less
memory, but it can only store integers in the range -2147483648 <= x <= 2147483647.
``datrie.Trie`` is a bit slower but can store any Python object as a value.
If you don't need values, or integer values are enough, use ``datrie.BaseTrie``::
import datrie
import string
trie = datrie.BaseTrie(string.ascii_lowercase)
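A minimal sketch of ``BaseTrie`` holding integer values (values outside the
signed 32-bit range are not supported)::

    import string
    import datrie

    base = datrie.BaseTrie(string.ascii_lowercase)
    base[u'foo'] = 42        # fits into a signed 32-bit integer
    print(base[u'foo'])      # prints 42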
Custom iteration
================
If the built-in trie methods don't fit your needs, you can use ``datrie.State``
and ``datrie.Iterator`` to implement custom traversal.
.. note::
If you use ``datrie.BaseTrie`` you need ``datrie.BaseState`` and
``datrie.BaseIterator`` for custom traversal.
For example, let's find all suffixes of ``'fo'`` for our trie and get
the values::
    >>> state = datrie.State(trie)
    >>> state.walk(u'fo')
    >>> it = datrie.Iterator(state)
    >>> while it.next():
    ...     print(it.key())
    ...     print(it.data())
    o
    5
    obar
    10
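The same pattern can be wrapped in a generator to iterate lazily over all
keys under a prefix; a minimal sketch (``iter_suffix_items`` is a hypothetical
helper, not part of datrie)::

    def iter_suffix_items(trie, prefix):
        # Lazily yield (full_key, value) pairs for keys starting with
        # ``prefix``, without building the whole result list up front.
        state = datrie.State(trie)
        state.walk(prefix)
        it = datrie.Iterator(state)
        while it.next():
            yield prefix + it.key(), it.data()

    >>> list(iter_suffix_items(trie, u'fo'))
    [(u'foo', 5), (u'foobar', 10)]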
Performance
===========
Performance is measured for ``datrie.Trie`` against Python's dict with
100k unique unicode words (English and Russian) as keys and the number ``1``
as the value for every key.
``datrie.Trie`` uses about 5MB of memory for 100k words, while Python's dict
uses about 22MB, according to my unscientific tests.

This trie implementation is 2-6 times slower than Python's dict
on ``__getitem__``. Benchmark results (MacBook Air, i5 1.8GHz;
"1.000M ops/sec" means 1,000,000 operations per second)::
Python 2.6:
dict __getitem__: 7.107M ops/sec
trie __getitem__: 2.478M ops/sec
Python 2.7:
dict __getitem__: 6.550M ops/sec
trie __getitem__: 2.474M ops/sec
Python 3.2:
dict __getitem__: 8.185M ops/sec
trie __getitem__: 2.684M ops/sec
Python 3.3:
dict __getitem__: 7.050M ops/sec
trie __getitem__: 2.755M ops/sec
Looking for prefixes of a given word is almost as fast as
``__getitem__`` (results are for Python 3.3)::
trie.iter_prefix_items (hits): 0.461M ops/sec
trie.prefix_items (hits): 0.743M ops/sec
trie.prefix_items loop (hits): 0.629M ops/sec
trie.iter_prefixes (hits): 0.759M ops/sec
trie.iter_prefixes (misses): 1.538M ops/sec
trie.iter_prefixes (mixed): 1.359M ops/sec
trie.has_keys_with_prefix (hits): 1.896M ops/sec
trie.has_keys_with_prefix (misses): 2.590M ops/sec
trie.longest_prefix (hits): 1.710M ops/sec
trie.longest_prefix (misses): 1.506M ops/sec
trie.longest_prefix (mixed): 1.520M ops/sec
trie.longest_prefix_item (hits): 1.276M ops/sec
trie.longest_prefix_item (misses): 1.292M ops/sec
trie.longest_prefix_item (mixed): 1.379M ops/sec
Looking up all words starting with a given prefix is mostly limited
by the overall result count (this can be improved in the future, because a
lot of time is spent decoding strings from ``utf_32_le`` to Python's
unicode)::
trie.items(prefix="xxx"), avg_len(res)==415: 0.609K ops/sec
trie.keys(prefix="xxx"), avg_len(res)==415: 0.642K ops/sec
trie.values(prefix="xxx"), avg_len(res)==415: 4.974K ops/sec
trie.items(prefix="xxxxx"), avg_len(res)==17: 14.781K ops/sec
trie.keys(prefix="xxxxx"), avg_len(res)==17: 15.766K ops/sec
trie.values(prefix="xxxxx"), avg_len(res)==17: 96.456K ops/sec
trie.items(prefix="xxxxxxxx"), avg_len(res)==3: 75.165K ops/sec
trie.keys(prefix="xxxxxxxx"), avg_len(res)==3: 77.225K ops/sec
trie.values(prefix="xxxxxxxx"), avg_len(res)==3: 320.755K ops/sec
trie.items(prefix="xxxxx..xx"), avg_len(res)==1.4: 173.591K ops/sec
trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 180.678K ops/sec
trie.values(prefix="xxxxx..xx"), avg_len(res)==1.4: 503.392K ops/sec
trie.items(prefix="xxx"), NON_EXISTING: 2023.647K ops/sec
trie.keys(prefix="xxx"), NON_EXISTING: 1976.928K ops/sec
trie.values(prefix="xxx"), NON_EXISTING: 2060.372K ops/sec
Random insert time is very slow compared to dict; this is a limitation
of double-array tries. Updates are quite fast. If you are building a trie
from scratch, consider sorting the keys before insertion (see the sketch
after the numbers below)::
dict __setitem__ (updates): 6.497M ops/sec
trie __setitem__ (updates): 2.633M ops/sec
dict __setitem__ (inserts, random): 5.808M ops/sec
trie __setitem__ (inserts, random): 0.053M ops/sec
dict __setitem__ (inserts, sorted): 5.749M ops/sec
trie __setitem__ (inserts, sorted): 0.624M ops/sec
dict setdefault (updates): 3.455M ops/sec
trie setdefault (updates): 1.910M ops/sec
dict setdefault (inserts): 3.466M ops/sec
trie setdefault (inserts): 0.053M ops/sec
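A minimal sketch of building a trie from pre-sorted keys (``words`` is a
placeholder for your own key list, and ``sorted_trie`` is just an example
name)::

    import string
    import datrie

    words = [u'foobar', u'bar', u'foo']          # unsorted input
    sorted_trie = datrie.Trie(string.ascii_lowercase)
    for i, word in enumerate(sorted(words)):     # sorted inserts are much faster
        sorted_trie[word] = i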
Other results (note that ``len(trie)`` is currently implemented
using trie traversal)::
dict __contains__ (hits): 6.801M ops/sec
trie __contains__ (hits): 2.816M ops/sec
dict __contains__ (misses): 5.470M ops/sec
trie __contains__ (misses): 4.224M ops/sec
dict __len__: 334336.269 ops/sec
trie __len__: 22.900 ops/sec
dict values(): 406.507 ops/sec
trie values(): 20.864 ops/sec
dict keys(): 189.298 ops/sec
trie keys(): 2.773 ops/sec
dict items(): 48.734 ops/sec
trie items(): 2.611 ops/sec
Please take these benchmark results with a grain of salt; this
is a very simple benchmark that may not cover your use case.
Current Limitations
===================
* keys must be unicode (no implicit conversion of byte strings
  under Python 2.x, sorry);
* there are no iterator versions of keys/values/items (not
  implemented yet);
* it is painfully slow and may be buggy under PyPy;
* the library is not tested with narrow Python builds.
Contributing
============
Development happens at github: https://github.com/pytries/datrie.
Feel free to submit ideas, bugs, pull requests.
Running tests and benchmarks
----------------------------
Make sure `tox`_ is installed and run
::
$ tox
from the source checkout. Tests should pass under Python 2.7 and 3.4+.
::
$ tox -c tox-bench.ini
runs benchmarks.
If you've changed anything in the source code then
make sure `cython`_ is installed and run
::
$ update_c.sh
before each ``tox`` command.
Please note that benchmarks are not included in the release
tarballs because the benchmark data is large and leaving it out
saves a lot of bandwidth; use a source checkout from GitHub or
Bitbucket for the benchmarks.
.. _cython: https://cython.org/
.. _tox: https://tox.readthedocs.io/
Authors & Contributors
----------------------
See https://github.com/pytries/datrie/graphs/contributors.
This module is based on the `libdatrie`_ C library by Theppitak Karoonboonyanan
and is inspired by the `fast_trie`_ Ruby bindings, the `PyTrie`_ pure-Python
implementation and the `Tree::Trie`_ Perl implementation;
some docs and API ideas are borrowed from these projects.
.. _fast_trie: https://github.com/tyler/trie
.. _PyTrie: https://github.com/gsakkis/pytrie
.. _Tree::Trie: https://metacpan.org/pod/release/AVIF/Tree-Trie-1.9/Trie.pm
License
=======
Licensed under LGPL v2.1.