Unicode to 8-bit charset transliteration codec.

This package contains codecs for transliterating ISO 10646 texts into
best-effort representations using smaller coded character sets (ASCII,
ISO 8859, etc.).  The translation tables used by the codecs are from
the ``transtab`` collection by Markus Kuhn.

Three types of transliterating codecs are provided:

  "long", using as many characters as needed to make a natural
  replacement.  For example, \u00e4 LATIN SMALL LETTER A WITH
  DIAERESIS ``ä`` will be replaced with ``ae``.

  "short", using the minimum number of characters to make a
  replacement.  For example, \u00e4 LATIN SMALL LETTER A WITH
  DIAERESIS ``ä`` will be replaced with ``a``.

  "one", only performing single character replacements.  Characters
  that cannot be transliterated with a single character are passed
  through unchanged. For example, \u2639 WHITE FROWNING FACE ``☹``
  will be passed through unchanged.
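
The difference between the three types can be sketched with plain
``str.translate``.  The tables below are tiny hypothetical stand-ins
built from the examples in this README, not the ``transtab`` data, and
deriving the "one" table by filtering the "short" table is an
assumption made purely for illustration:

```python
# Toy tables for illustration only; the real tables come from transtab.
LONG = {"ä": "ae", "€": "EUR", "☺": ":-)"}
SHORT = {"ä": "a", "€": "E", "☺": ":-)"}
# Assumption: "one" keeps only replacements that are exactly one character.
ONE = {src: dst for src, dst in SHORT.items() if len(dst) == 1}


def transliterate(text, table):
    """Replace characters found in table; pass everything else through."""
    return text.translate(str.maketrans(table))


sample = "Bär € ☺"
long_result = transliterate(sample, LONG)    # 'Baer EUR :-)'
short_result = transliterate(sample, SHORT)  # 'Bar E :-)'
one_result = transliterate(sample, ONE)      # 'Bar E ☺'
```

In this sketch ``☺`` survives the "one" table because its only
replacement is three characters long.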

Using the codecs is simple::

  >>> import translitcodec
  >>> import codecs
  >>> codecs.encode('fácil € ☺', 'translit/long')
  'facil EUR :-)'
  >>> codecs.encode('fácil € ☺', 'translit/short')
  'facil E :-)'

The codecs return Unicode by default.  To receive a bytestring back,
either chain the output of encode() to another codec, or append the
name of the desired byte encoding to the codec name::

  >>> codecs.encode('fácil € ☺', 'translit/one').encode('ascii', 'replace')
  b'facil E ?'
  >>> 'fácil € ☺'.encode('translit/one/ascii', 'replace')
  b'facil E ?'

The package also supplies a 'transliterate' codec, an alias for
'translit/long'.

Another way to use the library is through an error handler.
The following error handlers are available:

  * 'strict/translit/long', 'strict/translit/short', 'strict/translit/one' - similar to 'strict'
  * 'ignore/translit/long', 'ignore/translit/short', 'ignore/translit/one' - similar to 'ignore'
  * 'replace/translit/long', 'replace/translit/short', 'replace/translit/one' - similar to 'replace'

These error handlers work like Python's built-in ones, except that
transliteration is attempted first::

  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/long').decode('ISO-8859-2')
  'Zażółć gęślą jaźń EUR :-)?!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/short').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E :-)?!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/one').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E ??!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/long').decode('ISO-8859-2')
  'Zażółć gęślą jaźń EUR :-)!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/short').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E :-)!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/one').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E !@#'
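
The "transliterate first, then fall back" behaviour can be sketched
with ``codecs.register_error`` from the standard library.  The handler
name ``replace/translit/toy`` and the two-entry table are hypothetical;
this is a minimal illustration of the idea, not the package's
implementation:

```python
import codecs

# Hypothetical table for illustration; the package uses the transtab data.
TOY_LONG = {"€": "EUR", "☺": ":-)"}


def replace_translit_toy(exc):
    """Try the toy table first, then fall back to '?' like 'replace'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object[exc.start:exc.end] holds the characters that failed.
    failed = exc.object[exc.start:exc.end]
    out = "".join(TOY_LONG.get(ch, "?") for ch in failed)
    # Return the replacement text and the position to resume encoding at.
    return out, exc.end


codecs.register_error("replace/translit/toy", replace_translit_toy)

encoded = "Zażółć € ☺另!".encode("ISO-8859-2", "replace/translit/toy")
decoded = encoded.decode("ISO-8859-2")  # 'Zażółć EUR :-)?!'
```

The Polish letters are encoded directly because ISO-8859-2 contains
them, so the handler only sees ``€``, ``☺`` and ``另``.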