File: NOTES-unicode.txt

package info (click to toggle)
grabserial 2.1.0-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, bullseye
  • size: 188 kB
  • sloc: python: 988; sh: 89; makefile: 7
file content (79 lines) | stat: -rw-r--r-- 3,381 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
Here are some miscelaneous notes on unicode and byte strings.

Since grabserial tries to run with either python2 or python3,
and these languages handle strings, bytes, and unicode strings
differently, the distinctions matter.

Lots of this info comes from 2 resources:
 * https://nedbatchelder.com/text/unipain.html - talk by Net Batchhelder
   at pycon 2012

 * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ - by Joel Spolsky

= facts =
 * unicode must be converted to bytes for storage or transmittal
 * when dealing with bytes (something you got from somewhere) you have
   to know the encoding
   * this is hard for grabserial

= assumptions =
grabserial assumes that:
 * commands entered at the Linux command line are utf-8
   * this means arguments to grabserial are utf-8
 * commands entered at the Windows command line are utf-8
   * this means arguments to grabserial are utf-8
 * user input typed into grabserial (to its stdin) on Linux are utf-8
   * but we use sys.stdin.encoding to get it right
 * user input typed into grabserial (to its stdin) on Windows has unknown encoding
   * but we use sys.stdin.encoding to get it right
 * bytes coming from a Linux serial port are utf-8 (for the most part)
   but could be anything
   * grabserial treats these as byte strings
   * grabserial converts the byte strings to unicode, with the
     encoding of sys.stdout (this is usually utf8 in Linux)
     (on my Windows 7 system, this is cp437)
     for output to it's own stdout
   * errors are ignored - this ends up truncating weird chars from the
   serial port, when they are sent to stdout.
 * if the user specifies an output file, it is openend "bw", and the
    bytes from the serial port are never decoded or encoded, in order
    to preserve the exact data that was received from the serial port

= info about python unicode handling =
encode is used to convert unicode into a byte string
 * must specify encoding
 * will get errors if unicode can't be put into requested encoding
   * can specify how to handle errors, with following strings as
   second argument:
     * ignore - don't output anything for the char
     * replace - replace with ?
     * xmlcharrefreplace - replace with XML/HTML-compatible character
       * it is up to the browser to do something sensible with these
     * backslashreplace - replace with backslashed escape sequences

 * default encoding in python2 is ascii, which can't handle most unicode

decode is used to convert a byte string into unicode code points
 * must specify encoding
 * will get errors if the byte string has invalid characters based on stated encoding
   * can specify how to handle errors:
     * ignore - don't output anything for the char
     * replace - replace with U+FFFD

= python 2 =
strings are by default byte strings (type 'str')
unicode strings can be declared with 'u' prefix
allows the use of 'b' prefix to declare byte strings

will automatically convert bytes strings to unicode, using
  an ascii encoding, when python system feels like it

this may raise exceptions that can be hard to understand

Need to avoid this by doing explicit conversions

= python 3 =
strings are by default unicode strings

There is no automatic type conversion between byte strings and unicode