Advanced Search
===============

The Charset Normalizer functions ``from_bytes``, ``from_fp`` and ``from_path`` accept several
optional parameters that can be tuned.

For example::

    from charset_normalizer import from_bytes

    my_byte_str = 'Всеки човек има право на образование.'.encode('cp1251')

    results = from_bytes(
        my_byte_str,
        steps=10,  # Number of blocks (steps) to extract from my_byte_str
        chunk_size=512,  # Size of each extracted block
        threshold=0.2,  # Maximum amount of chaos allowed on the first pass
        cp_isolation=None,  # Finite list of encodings to use when searching for a match
        cp_exclusion=None,  # Finite list of encodings to avoid when searching for a match
        preemptive_behaviour=True,  # Whether to look into my_byte_str (ASCII mode) for a pre-defined encoding
        explain=False,  # Print what is happening while searching for a match
        language_threshold=0.1,  # Minimum coherence ratio / language ratio accepted for a match
    )
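
The same keyword arguments can be passed to the file-based helpers. The sketch below is illustrative only:
the temporary file, the sample text and the ``allowed`` list are assumptions made for the example. It restricts
the search with ``cp_isolation`` and runs it through both ``from_path`` and ``from_fp``::

    import os
    import tempfile

    from charset_normalizer import from_fp, from_path

    payload = 'Всеки човек има право на образование.'.encode('cp1251')

    # Write the sample to a temporary file so both file-based helpers can read it.
    with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
        tmp.write(payload)
        sample_path = tmp.name

    try:
        # Only these encodings are considered during the search.
        allowed = ['cp1251', 'utf_8']
        from_path_results = from_path(sample_path, cp_isolation=allowed)
        with open(sample_path, 'rb') as fp:
            from_fp_results = from_fp(fp, cp_isolation=allowed)
    finally:
        os.remove(sample_path)

    for label, found in (('from_path', from_path_results), ('from_fp', from_fp_results)):
        print(label, [match.encoding for match in found])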


Using CharsetMatches
------------------------------

Here, ``results`` is a ``CharsetMatches`` object. It behaves like a list but does not implement every list method.
It is sorted on creation, so calling ``best()`` is enough to extract the most probable result.

.. autoclass:: charset_normalizer.CharsetMatches
    :members:

List behaviour
--------------

As noted earlier, a ``CharsetMatches`` object behaves like a list::

    # len() and truthiness checks also work on results
    if not results:
        print('No match for your sequence')

    # Iterate over results like a list
    for match in results:
        print(match.encoding, 'can properly decode your sequence using', match.alphabets, 'and language', match.language)

    # Using index to access results
    if results:
        print(str(results[0]))
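
Because the object is iterable, one way to get the list operations it lacks is to copy it into a plain list.
This is a minimal sketch; the sample text and the variable names are chosen for illustration::

    from charset_normalizer import from_bytes

    payload = 'Всеки човек има право на образование.'.encode('cp1251')
    results = from_bytes(payload)

    # Copying keeps the existing order, most probable match first.
    as_list = list(results)

    # Slicing is a plain-list operation, used here to keep the top three candidates.
    for match in as_list[:3]:
        print(match.encoding, match.language)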

Using best()
------------

As noted above, a ``CharsetMatches`` object behaves like a list and is already sorted when it is returned by
``from_bytes``, ``from_fp`` or ``from_path``.

``best()`` returns the most probable result, which is the first entry of the list (index 0).
It returns a ``CharsetMatch`` object, or ``None`` if there are no results::

    result = results.best()
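
Since ``best()`` may return ``None``, it is worth checking the value before using it. The sketch below makes two
assumptions: that an empty ``CharsetMatches()`` can be built directly to stand in for a search that found nothing,
and that the hypothetical helper ``describe`` is an acceptable way to summarise a match::

    from charset_normalizer import CharsetMatches, from_bytes

    def describe(matches):
        """Return a one-line summary of the most probable match, if any."""
        result = matches.best()
        if result is None:
            return 'No match for your sequence'
        # str() on a match gives the decoded text.
        return '{}: {}'.format(result.encoding, str(result))

    payload = 'Всеки човек има право на образование.'.encode('cp1251')
    print(describe(from_bytes(payload)))

    # An empty container simulates the no-result case.
    print(describe(CharsetMatches()))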

Calling first()
---------------

This is exactly equivalent to calling ``best()``.
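
A quick way to confirm this on your own data, as a sketch with an illustrative sample::

    from charset_normalizer import from_bytes

    payload = 'Всеки човек има право на образование.'.encode('cp1251')
    results = from_bytes(payload)

    # Both calls are expected to return the same object (or None for both).
    same = results.first() is results.best()
    print(same)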

Class aliases
-------------

``CharsetMatches`` is also known as ``CharsetDetector``, ``CharsetDoctor`` and ``CharsetNormalizerMatches``.
These aliases are useful if you prefer a shorter class name.

Verbose output
--------------

You may want to understand why charset_normalizer did not pick a specific encoding. All you have to do is set
``explain`` to ``True`` when calling ``from_bytes``, ``from_fp`` or ``from_path``.
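
A minimal sketch of the flag in use. The sample text is illustrative, and the comparison at the end rests on the
assumption that ``explain`` only changes what is reported, not which matches are returned::

    from charset_normalizer import from_bytes

    payload = 'Всеки човек има право на образование.'.encode('cp1251')

    # Emits the detection trace while searching.
    explained = from_bytes(payload, explain=True)

    # Same search without the trace, for comparison.
    quiet = from_bytes(payload)

    print([m.encoding for m in explained] == [m.encoding for m in quiet])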