iter_long(string, [start, [end]])
----------------------------------------------------------------------

Perform a modified Aho-Corasick search that matches only the longest
words from the set.

Return an iterator of tuples (``end_index``, ``value``) for keys found in
``string``, where:

- ``end_index`` is the index in the input string of the last character of
  the matched trie key.
- ``value`` is the value associated with the matched key.
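
Because ``end_index`` points at the last character of a match, the start of
the match is not reported directly. The following is a minimal sketch of how
it could be recovered, assuming the stored value is the key itself (as in the
example of this section). The helper name ``match_span`` is hypothetical and
not part of the library, and the ``matches`` list is copied from the documented
output rather than computed.

.. code:: python

    # Result of iter_long for "he here her", copied from this section's example.
    needle = "he here her"
    matches = [(1, "he"), (6, "here"), (10, "her")]

    def match_span(end_index, key):
        """Return a (start, stop) pair usable as needle[start:stop].

        Hypothetical helper: assumes end_index is the inclusive index of
        the last character of key in the searched string.
        """
        start = end_index - len(key) + 1
        return start, end_index + 1

    spans = [match_span(end_index, key) for end_index, key in matches]
    print(spans)  # [(0, 2), (3, 7), (8, 11)]

    # Slicing with each span gives back the matched key.
    print([needle[start:stop] for start, stop in spans])  # ['he', 'here', 'her']

If the stored values are not the keys, the key length has to be kept some
other way, for example by storing a ``(key, payload)`` tuple as the value.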

The optional ``start`` and ``end`` arguments limit the search to a slice of
the input string, as in ``string[start:end]``.


Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The default Aho-Corasick algorithm returns all occurrences of the words stored
in the automaton, including words that are substrings of other stored words.
The ``iter_long`` method reports only the longest match.

For the set of words {"he", "her", "here"} and the needle "he here her", the
default algorithm finds the following words: "he", "he", "her", "here", "he",
"her", while the modified one yields only "he", "here", "her".

.. code:: python

    >>> import ahocorasick
    >>> A = ahocorasick.Automaton()
    >>> A.add_word("he", "he")
    True
    >>> A.add_word("her", "her")
    True
    >>> A.add_word("here", "here")
    True
    >>> A.make_automaton()
    >>> needle = "he here her"
    >>> list(A.iter_long(needle))
    [(1, 'he'), (6, 'here'), (10, 'her')]
    >>> list(A.iter(needle))
    [(1, 'he'), (4, 'he'), (5, 'her'), (6, 'here'), (9, 'he'), (10, 'her')]