File: NOTES-unicode.txt

Here are some miscellaneous notes on unicode and byte strings.

Since grabserial tries to run with either python2 or python3,
and these two versions handle strings, bytes, and unicode strings
differently, the distinctions matter.

Lots of this info comes from 2 resources:
 * https://nedbatchelder.com/text/unipain.html - talk by Ned Batchelder
   at PyCon 2012

 * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ - by Joel Spolsky

= facts = 
 * unicode must be converted to bytes for storage or transmittal
 * when dealing with bytes (something you got from somewhere) you have
   to know the encoding
   * this is hard for grabserial

= assumptions =
grabserial assumes that:
 * commands entered at the Linux command line are utf-8
   * this means arguments to grabserial are utf-8
 * commands entered at the Windows command line are utf-8
   * this means arguments to grabserial are utf-8
 * user input typed into grabserial (to its stdin) on Linux is utf-8
   * but we use sys.stdin.encoding to get it right
 * user input typed into grabserial (to its stdin) on Windows has an
   unknown encoding
   * but we use sys.stdin.encoding to get it right
 * bytes coming from a Linux serial port are utf-8 (for the most part)
   but could be anything
   * grabserial treats these as byte strings
   * for output to its own stdout, grabserial converts the byte strings
     to unicode using the encoding of sys.stdout (this is usually utf-8
     on Linux; on my Windows 7 system, it is cp437)
 * if the user specifies an output file, it is opened "bw" to preserve
   the exact data that was received from the serial port
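
The assumptions above can be sketched roughly as follows. This is only an
illustration, not grabserial's actual code: the helper names
(bytes_for_display, input_to_bytes, save_raw), the utf-8 fallback, and the
use of "replace" error handling are all assumptions made for the example.

```python
import sys


def bytes_for_display(data, encoding=None):
    """Decode raw serial bytes into unicode text for printing to stdout."""
    # sys.stdout.encoding can be None (e.g. when piped under python2),
    # so fall back to utf-8. The fallback is an assumption of this sketch.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return data.decode(encoding, "replace")


def input_to_bytes(text, encoding=None):
    """Encode unicode text typed by the user into bytes for the serial port."""
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    return text.encode(encoding, "replace")


def save_raw(path, data):
    """Write the bytes exactly as received; binary mode does no translation."""
    with open(path, "wb") as out:
        out.write(data)
```

Passing the encoding explicitly, as in bytes_for_display(b"\x80", "cp437"),
shows how the same byte is displayed differently depending on the console.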

= info about python unicode handling =
encode is used to convert unicode into a byte string
 * must specify encoding
 * raises UnicodeEncodeError if the unicode can't be represented in the
   requested encoding
 * default encoding in python2 is ascii, which can't handle most unicode

decode is used to convert a byte string into unicode code points
 * must specify encoding
 * raises UnicodeDecodeError if the byte string contains bytes that are
   invalid in the stated encoding
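
A small sketch of both directions, including the error cases. The helper
names try_encode and try_decode are made up for this example; they return
None on failure only to keep the demonstration short.

```python
def try_encode(text, encoding):
    """unicode -> bytes; returns None if the text doesn't fit the encoding."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


def try_decode(data, encoding):
    """bytes -> unicode; returns None if the bytes are invalid in the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


cafe = u"caf\u00e9"                        # 4 code points
as_utf8 = try_encode(cafe, "utf-8")        # 5 bytes: e-acute takes two
as_ascii = try_encode(cafe, "ascii")       # fails: e-acute is not in ascii
bad = try_decode(b"\xff", "utf-8")         # fails: 0xff is never valid utf-8
latin = try_decode(b"caf\xe9", "latin-1")  # same bytes mean different things
                                           # under different encodings
```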

= python 2 =
strings are by default byte strings (type 'str')
unicode strings can be declared with 'u' prefix
allows the use of 'b' prefix to declare byte strings

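python2 will automatically convert byte strings to unicode, using
  the ascii encoding, whenever byte strings and unicode strings are
  mixed (for example, in concatenation or comparison)

this may raise exceptions (UnicodeDecodeError or UnicodeEncodeError)
  that can be hard to understand, because they show up far from where
  the non-ascii data entered the program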

Need to avoid this by doing explicit conversions
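
One way to do explicit conversions that behaves the same under python2 and
python3 is sketched below. The names text_type, to_text, and to_bytes are
hypothetical helpers for this example, and the utf-8 default is an
assumption; grabserial may do this differently.

```python
import sys

# The text type is `unicode` on python2 and `str` on python3. The
# conditional expression only evaluates the branch it selects, so the
# name `unicode` is never looked up on python3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return unicode text, decoding explicitly if given a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return a byte string, encoding explicitly if given unicode text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```

Converting at the edges of the program (decode on input, encode on output)
means python2 never gets the chance to do an implicit ascii conversion.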

= python 3 =
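strings are by default unicode strings (type 'str')
byte strings are a separate type ('bytes'), declared with 'b' prefix

There is no automatic type conversion between byte strings and unicode;
mixing the two raises TypeError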
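
A short illustration of the difference. The function join_mixed is a
made-up name for this example, and decoding as utf-8 is an assumption.

```python
def join_mixed(text, data):
    """Concatenate unicode text with a byte string."""
    try:
        # python2 decodes `data` implicitly as ascii here;
        # python3 refuses and raises TypeError.
        return text + data
    except TypeError:
        # The explicit conversion that python3 requires.
        return text + data.decode("utf-8")


joined = join_mixed(u"ab", b"cd")
```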