Class Hierarchy for chardet
===========================

UniversalDetector
-----------------
Holds a list of probers and feeds the input to each of them.

CharSetProber
-------------
Mostly abstract parent class.

CharSetGroupProber
------------------
Runs a bunch of related probers at the same time and decides which is best.
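
The idea of a group prober can be sketched in a few lines.  This is only an
illustration and not chardet's actual code: ``ToyProber``, ``ToyGroupProber``,
``best_guess`` and the byte "signatures" are all hypothetical names made up for
the example.

.. code-block:: python

    class ToyProber:
        """Hypothetical stand-in for a prober: scores bytes against a signature."""

        def __init__(self, charset_name, signature):
            self.charset_name = charset_name
            self._signature = signature
            self._hits = 0
            self._total = 0

        def feed(self, data):
            # Count how many input bytes belong to this prober's signature.
            self._hits += sum(1 for byte in data if byte in self._signature)
            self._total += len(data)

        def get_confidence(self):
            return self._hits / self._total if self._total else 0.0


    class ToyGroupProber:
        """Feeds the same data to every child prober and reports the best one."""

        def __init__(self, probers):
            self._probers = list(probers)

        def feed(self, data):
            for prober in self._probers:
                prober.feed(data)

        def best_guess(self):
            best = max(self._probers, key=lambda p: p.get_confidence())
            return best.charset_name, best.get_confidence()


    group = ToyGroupProber([
        ToyProber("toy-ascii", signature=set(range(0x20, 0x7F))),
        ToyProber("toy-high", signature=set(range(0x80, 0x100))),
    ])
    group.feed(b"plain text \xe9")
    name, confidence = group.best_guess()

Here ``name`` ends up as ``"toy-ascii"`` because eleven of the twelve input
bytes fall in that prober's signature.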

SBCSGroupProber
---------------
SBCS = Single-Byte Character Set. Runs a bunch of SingleByteCharSetProbers.
Always contains the same set of SingleByteCharSetProbers.

SingleByteCharSetProber
-----------------------
A CharSetProber that is used for detecting single-byte encodings by using
a "precedence matrix" (i.e., a character bigram model).
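
To make the bigram idea concrete, here is a toy sketch.  It assumes a much
simpler model than the real language models: the "matrix" is just a count of
adjacent byte pairs from a tiny training sample, and ``train_bigrams`` and
``bigram_confidence`` are hypothetical helpers, not part of chardet.

.. code-block:: python

    from collections import Counter


    def train_bigrams(sample):
        """Count adjacent byte pairs in a sample (a toy 'precedence matrix')."""
        return Counter(zip(sample, sample[1:]))


    def bigram_confidence(data, model):
        """Return the fraction of adjacent byte pairs in data seen in the model."""
        pairs = list(zip(data, data[1:]))
        if not pairs:
            return 0.0
        seen = sum(1 for pair in pairs if pair in model)
        return seen / len(pairs)


    model = train_bigrams(b"the cat sat on the mat")
    likely = bigram_confidence(b"the hat", model)
    unlikely = bigram_confidence(b"qzxjvk", model)

Text that resembles the training sample shares many of its bigrams and scores
higher (``likely``) than text that shares none (``unlikely``).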

MBCSGroupProber
---------------
Runs a bunch of MultiByteCharSetProbers. It also uses a UTF8Prober, which is
essentially a MultiByteCharSetProber that only has a state machine.  Always
contains the same MultiByteCharSetProbers.

MultiByteCharSetProber
----------------------
A CharSetProber that uses both a character unigram model (or "character
distribution analysis") and an independent state machine for trying to
detect an encoding.

CodingStateMachine
------------------
Used for "coding scheme" detection, where we just look for either invalid
byte sequences or sequences that only occur for that particular encoding.
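
As a rough sketch of the "invalid byte sequence" half of this, the following
toy state machine rejects input that cannot be UTF-8.  It is deliberately
simplified (it does not reject overlong forms or surrogates, and it tolerates
a truncated final sequence), and ``next_state`` and ``is_plausible_utf8`` are
hypothetical names rather than the tables chardet actually uses.

.. code-block:: python

    START, ERROR = 0, -1


    def next_state(state, byte):
        """Advance a simplified UTF-8 state machine by one byte.

        A positive state is the number of continuation bytes still expected.
        """
        if state == START:
            if byte < 0x80:
                return START      # single-byte (ASCII) character
            if 0xC2 <= byte <= 0xDF:
                return 1          # lead byte of a 2-byte sequence
            if 0xE0 <= byte <= 0xEF:
                return 2          # lead byte of a 3-byte sequence
            if 0xF0 <= byte <= 0xF4:
                return 3          # lead byte of a 4-byte sequence
            return ERROR          # this byte cannot start a UTF-8 sequence
        if 0x80 <= byte <= 0xBF:
            return state - 1      # valid continuation byte
        return ERROR              # expected a continuation byte


    def is_plausible_utf8(data):
        """Return False as soon as the state machine reaches ERROR."""
        state = START
        for byte in data:
            state = next_state(state, byte)
            if state == ERROR:
                return False
        return True


    valid = is_plausible_utf8("héllo".encode("utf-8"))
    invalid = is_plausible_utf8(b"h\xe9llo")  # Latin-1 bytes, not UTF-8

A single error is enough to rule an encoding out, which is why this kind of
check is cheap compared with the statistical models.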

CharDistributionAnalysis
------------------------
Used for character unigram distribution encoding detection.  Takes a mapping
from characters to a "frequency order" (i.e., the frequency rank of that
character in the given encoding) and a "typical distribution ratio", which is
the number of occurrences of the 512 most frequently used characters divided by
the number of occurrences of the rest of the characters for a typical document.

The "characters" in this case are 2-byte sequences, and they are first converted
to an "order" (the name comes from the ``ord()`` function, I believe). This
"order" is used to index into the frequency order table to determine the
frequency rank of that byte sequence.  The reason this extra step is necessary
is that the frequency rank table is language-specific (and not
encoding-specific).
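
The ratio described above can be turned into a confidence value roughly as
follows.  This is a sketch under stated assumptions: the "orders" are plain
letters instead of 2-byte sequences, ``top_n`` is 2 instead of 512, the 0.99
cap is an arbitrary choice for the example, and ``distribution_confidence`` is
a hypothetical function name.

.. code-block:: python

    def distribution_confidence(orders, freq_order, top_n, typical_ratio):
        """Estimate confidence from how often the 'frequent' characters appear.

        orders:        sequence of character "orders" seen in the input
        freq_order:    mapping from order to frequency rank (0 = most frequent)
        top_n:         how many ranks count as "frequent"
        typical_ratio: frequent/rest ratio expected for a typical document
        """
        ranked = [freq_order[o] for o in orders if o in freq_order]
        if not ranked:
            return 0.0
        frequent = sum(1 for rank in ranked if rank < top_n)
        rest = len(ranked) - frequent
        if rest == 0:
            return 0.99           # assumed cap: never report full certainty
        observed_ratio = frequent / rest
        return min(observed_ratio / typical_ratio, 0.99)


    toy_freq_order = {"a": 0, "b": 1, "c": 2, "d": 3}
    confidence = distribution_confidence(
        "aababcad", toy_freq_order, top_n=2, typical_ratio=4.0
    )

In the toy input, six characters are "frequent" and two are not, so the
observed ratio is 3.0 against a typical ratio of 4.0.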


What's where
============


Bigram files
------------

- ``hebrewprober.py``
- ``jpcntx.py``
- ``langbulgarianmodel.py``
- ``langcyrillicmodel.py``
- ``langgreekmodel.py``
- ``langhebrewmodel.py``
- ``langhungarianmodel.py``
- ``langthaimodel.py``
- ``latin1prober.py``
- ``sbcharsetprober.py``
- ``sbcsgroupprober.py``


Coding Scheme files
-------------------

- ``escprober.py``
- ``escsm.py``
- ``utf8prober.py``
- ``codingstatemachine.py``
- ``mbcssm.py``


Unigram files
-------------

- ``big5freq.py``
- ``chardistribution.py``
- ``euckrfreq.py``
- ``euctwfreq.py``
- ``gb2312freq.py``
- ``jisfreq.py``

Multibyte probers
-----------------

- ``big5prober.py``
- ``cp949prober.py``
- ``eucjpprober.py``
- ``euckrprober.py``
- ``euctwprober.py``
- ``gb2312prober.py``
- ``mbcharsetprober.py``
- ``mbcsgroupprober.py``
- ``sjisprober.py``

Misc files
----------

- ``__init__.py`` (currently contains the ``detect`` function)
- ``compat.py``
- ``enums.py``
- ``universaldetector.py``
- ``version.py``


Useful links
============

This is just a collection of information that I've found useful or thought
might be useful in the future:

- `BOM by Encoding`_

- `A Composite Approach to Language/Encoding Detection`_

- `What Every Programmer Absolutely...`_

- The actual `source`_


.. _BOM by Encoding:
    https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
.. _A Composite Approach to Language/Encoding Detection:
    http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
.. _What Every Programmer Absolutely...: http://kunststube.net/encoding/
.. _source: https://dxr.mozilla.org/mozilla/source/intl/chardet/