1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
|
=================
Support
=================
**If you are running:**
- Python >=2.7,<3.5: Unsupported
- Python 3.5: charset-normalizer < 2.1
- Python 3.6: charset-normalizer < 3.1
Upgrade your Python interpreter as soon as possible.
-------------------
Supported Encodings
-------------------
Here are a list of supported encoding and supported language with latest update. Also this list
may change depending of your python version.
Charset Normalizer is able to detect any of those encoding. This list is NOT static and depends heavily on what your
current cPython version is shipped with. See https://docs.python.org/3/library/codecs.html#standard-encodings
=============== ===============================================================================================================================
IANA Code Page Aliases
=============== ===============================================================================================================================
ascii 646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii
big5 big5_tw, csbig5, x_mac_trad_chinese
big5hkscs big5_hkscs, hkscs
cp037 037, csibm037, ebcdic_cp_ca, ebcdic_cp_nl, ebcdic_cp_us, ebcdic_cp_wt, ibm037, ibm039
cp1026 1026, csibm1026, ibm1026
cp1125 1125, ibm1125, cp866u, ruscii
cp1140 1140, ibm1140
cp1250 1250, windows_1250
cp1251 1251, windows_1251
cp1252 1252, windows_1252
cp1253 1253, windows_1253
cp1254 1254, windows_1254
cp1255 1255, windows_1255
cp1256 1256, windows_1256
cp1257 1257, windows_1257
cp1258 1258, windows_1258
cp273 273, ibm273, csibm273
cp424 424, csibm424, ebcdic_cp_he, ibm424
cp437 437, cspc8codepage437, ibm437
cp500 500, csibm500, ebcdic_cp_be, ebcdic_cp_ch, ibm500
cp775 775, cspc775baltic, ibm775
cp850 850, cspc850multilingual, ibm850
cp852 852, cspcp852, ibm852
cp855 855, csibm855, ibm855
cp857 857, csibm857, ibm857
cp858 858, csibm858, ibm858
cp860 860, csibm860, ibm860
cp861 861, cp_is, csibm861, ibm861
cp862 862, cspc862latinhebrew, ibm862
cp863 863, csibm863, ibm863
cp864 864, csibm864, ibm864
cp865 865, csibm865, ibm865
cp866 866, csibm866, ibm866
cp869 869, cp_gr, csibm869, ibm869
cp932 932, ms932, mskanji, ms_kanji
cp949 949, ms949, uhc
cp950 950, ms950
euc_jis_2004 jisx0213, eucjis2004, euc_jis2004
euc_jisx0213 eucjisx0213
euc_jp eucjp, ujis, u_jis
euc_kr euckr, korean, ksc5601, ks_c_5601, ks_c_5601_1987, ksx1001, ks_x_1001, x_mac_korean
gb18030 gb18030_2000
gb2312 chinese, csiso58gb231280, euc_cn, euccn, eucgb2312_cn, gb2312_1980, gb2312_80, iso_ir_58, x_mac_simp_chinese
gbk 936, cp936, ms936
hp_roman8 roman8, r8, csHPRoman8
hz hzgb, hz_gb, hz_gb_2312
iso2022_jp csiso2022jp, iso2022jp, iso_2022_jp
iso2022_jp_1 iso2022jp_1, iso_2022_jp_1
iso2022_jp_2 iso2022jp_2, iso_2022_jp_2
iso2022_jp_3 iso2022jp_3, iso_2022_jp_3
iso2022_jp_ext iso2022jp_ext, iso_2022_jp_ext
iso2022_kr csiso2022kr, iso2022kr, iso_2022_kr
iso8859_10 csisolatin6, iso_8859_10, iso_8859_10_1992, iso_ir_157, l6, latin6
iso8859_11 thai, iso_8859_11, iso_8859_11_2001
iso8859_13 iso_8859_13, l7, latin7
iso8859_14 iso_8859_14, iso_8859_14_1998, iso_celtic, iso_ir_199, l8, latin8
iso8859_15 iso_8859_15, l9, latin9
iso8859_16 iso_8859_16, iso_8859_16_2001, iso_ir_226, l10, latin10
iso8859_2 csisolatin2, iso_8859_2, iso_8859_2_1987, iso_ir_101, l2, latin2
iso8859_3 csisolatin3, iso_8859_3, iso_8859_3_1988, iso_ir_109, l3, latin3
iso8859_4 csisolatin4, iso_8859_4, iso_8859_4_1988, iso_ir_110, l4, latin4
iso8859_5 csisolatincyrillic, cyrillic, iso_8859_5, iso_8859_5_1988, iso_ir_144
iso8859_6 arabic, asmo_708, csisolatinarabic, ecma_114, iso_8859_6, iso_8859_6_1987, iso_ir_127
iso8859_7 csisolatingreek, ecma_118, elot_928, greek, greek8, iso_8859_7, iso_8859_7_1987, iso_ir_126
iso8859_8 csisolatinhebrew, hebrew, iso_8859_8, iso_8859_8_1988, iso_ir_138
iso8859_9 csisolatin5, iso_8859_9, iso_8859_9_1989, iso_ir_148, l5, latin5
iso2022_jp_2004 iso_2022_jp_2004, iso2022jp_2004
johab cp1361, ms1361
koi8_r cskoi8r
kz1048 kz_1048, rk1048, strk1048_2002
latin_1 8859, cp819, csisolatin1, ibm819, iso8859, iso8859_1, iso_8859_1, iso_8859_1_1987, iso_ir_100, l1, latin, latin1
mac_cyrillic maccyrillic
mac_greek macgreek
mac_iceland maciceland
mac_latin2 maccentraleurope, maclatin2
mac_roman macintosh, macroman
mac_turkish macturkish
ptcp154 csptcp154, pt154, cp154, cyrillic_asian
shift_jis csshiftjis, shiftjis, sjis, s_jis, x_mac_japanese
shift_jis_2004 shiftjis2004, sjis_2004, s_jis_2004
shift_jisx0213 shiftjisx0213, sjisx0213, s_jisx0213
tis_620 tis620, tis_620_0, tis_620_2529_0, tis_620_2529_1, iso_ir_166
utf_16 u16, utf16
utf_16_be unicodebigunmarked, utf_16be
utf_16_le unicodelittleunmarked, utf_16le
utf_32 u32, utf32
utf_32_be utf_32be
utf_32_le utf_32le
utf_8 u8, utf, utf8, utf8_ucs2, utf8_ucs4 (+utf_8_sig)
utf_7* u7, unicode-1-1-utf-7
cp720 N.A.
cp737 N.A.
cp856 N.A.
cp874 N.A.
cp875 N.A.
cp1006 N.A.
koi8_r N.A.
koi8_t N.A.
koi8_u N.A.
=============== ===============================================================================================================================
*: Only if a SIG/mark is found.
-------------------
Supported Languages
-------------------
Those language can be detected inside your content. All of these are specified in ./charset_normalizer/assets/__init__.py .
| English,
| German,
| French,
| Dutch,
| Italian,
| Polish,
| Spanish,
| Russian,
| Japanese,
| Portuguese,
| Swedish,
| Chinese,
| Ukrainian,
| Norwegian,
| Finnish,
| Vietnamese,
| Czech,
| Hungarian,
| Korean,
| Indonesian,
| Turkish,
| Romanian,
| Farsi,
| Arabic,
| Danish,
| Serbian,
| Lithuanian,
| Slovene,
| Slovak,
| Malay,
| Hebrew,
| Bulgarian,
| Croatian,
| Hindi,
| Estonian,
| Thai,
| Greek,
| Tamil.
----------------------------
Incomplete Sequence / Stream
----------------------------
It is not (yet) officially supported. If you feed an incomplete byte sequence (eg. truncated multi-byte sequence) the detector will
most likely fail to return a proper result.
If you are purposely feeding part of your payload for performance concerns, you may stop doing it as this package is fairly optimized.
We are working on a dedicated way to handle streams.
|