### Narrow character support for UTF8 encoding

In the Linux and web worlds, `UTF-8` is the dominant character encoding.

Note that, at least with Microsoft Visual Studio (MSVS), you cannot open a Windows file with a Unicode name using the standard constructor:
```cpp
std::fstream fs(const char* filename)
```
Instead, you need to use the non-standard Microsoft extension:
```cpp
std::fstream fs(const wchar_t* filename)
```
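
Since C++17, the standard file stream constructors also accept a `std::filesystem::path`, and a path can be built from a wide string, so this may serve as a portable alternative to the Microsoft extension. The following is a minimal sketch under that assumption; `open_input` is a hypothetical helper name and is not part of jsoncons.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper (not part of jsoncons): open a file for reading
// given its name as a wide string. std::filesystem::path converts the
// name to the platform's native form (wide on Windows, narrow on POSIX).
inline std::ifstream open_input(const std::wstring& name)
{
    return std::ifstream{std::filesystem::path(name)};
}
```

The returned stream is an ordinary `std::ifstream`, so it can be used wherever a narrow input stream is expected.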

#### Unicode escaping
```cpp
std::string inputStr("[\"\\u0040\\u0040\\u0000\\u0011\"]");
std::cout << "Input:    " << inputStr << '\n';

json arr = json::parse(inputStr);
std::string str = arr[0].as<std::string>();
std::cout << "Hex dump: [";
for (std::size_t i = 0; i < str.size(); ++i)
{
    // Convert through unsigned char so bytes >= 0x80 are not sign-extended
    unsigned int val = static_cast<unsigned char>(str[i]);
    if (i != 0)
    {
        std::cout << " ";
    }
    std::cout << "0x" << std::setfill('0') << std::setw(2) << std::hex << val;
}
std::cout << "]" << '\n';

std::ostringstream os;
os << arr;
std::cout << "Output:   " << os.str() << '\n';
```

Output:

```
Input:    ["\u0040\u0040\u0000\u0011"]
Hex dump: [0x40 0x40 0x00 0x11]
Output:   ["@@\u0000\u0011"]
```
Note that just the two control characters are escaped on output.
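
To illustrate the kind of rule that produces the output above, here is a minimal sketch that escapes only bytes below 0x20 and copies everything else through. It is not the jsoncons implementation; `escape_controls` is a hypothetical name, and the sketch deliberately ignores quotes, backslashes, and short escapes such as `\n`.

```cpp
#include <cstdio>
#include <string>

// Illustrative sketch only (not jsoncons's implementation): write every
// byte below 0x20 as a \u00XX escape and copy all other bytes unchanged.
std::string escape_controls(const std::string& s)
{
    std::string out;
    for (unsigned char c : s)
    {
        if (c < 0x20)
        {
            char buf[7]; // six characters for \uXXXX plus the terminator
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            out += buf;
        }
        else
        {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Applied to the four bytes `0x40 0x40 0x00 0x11`, this sketch yields `@@\u0000\u0011`, the same string content as in the output above.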

#### Reading escaped Unicode into UTF-8 encodings and writing back escaped Unicode
```cpp
string inputStr("[\"\\u007F\\u07FF\\u0800\"]");
std::cout << "Input:    " << inputStr << '\n';

json arr = json::parse(inputStr);
std::string s = arr[0].as<string>();
std::cout << "Hex dump: [";
for (std::size_t i = 0; i < s.size(); ++i)
{
    if (i != 0)
        std::cout << " ";
    unsigned int u(s[i] >= 0 ? s[i] : 256 + s[i] );
    std::cout << "0x"  << std::hex<< std::setfill('0') << std::setw(2) << u;
}
std::cout << "]" << '\n';

std::ostringstream os;
auto options = json_options{}
    .escape_all_non_ascii(true);
os << print(arr,options);
std::string outputStr = os.str();
std::cout << "Output:   " << os.str() << '\n';

json arr2 = json::parse(outputStr);
std::string s2 = arr2[0].as<string>();
std::cout << "Hex dump: [";
for (std::size_t i = 0; i < s2.size(); ++i)
{
    if (i != 0)
        std::cout << " ";
    unsigned int u(s2[i] >= 0 ? s2[i] : 256 + s2[i] );
    std::cout << "0x"  << std::hex<< std::setfill('0') << std::setw(2) << u;
}
std::cout << "]" << '\n';
```

Output:

```
Input:    ["\u007F\u07FF\u0800"]
Hex dump: [0x7f 0xdf 0xbf 0xe0 0xa0 0x80]
Output:   ["\u007F\u07FF\u0800"]
Hex dump: [0x7f 0xdf 0xbf 0xe0 0xa0 0x80]
```
Since the escaped Unicode consists of a control character (0x7F) and non-ASCII characters, we get back the same text that we started with.
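
The byte sequences in the hex dump follow from the standard UTF-8 encoding rules. As a rough illustration, the sketch below encodes a single code point by hand; `encode_utf8` is a hypothetical helper, not a jsoncons function, and it assumes a valid code point (no surrogates, nothing above U+10FFFF).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value; no validation is performed.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F)
    {
        // One byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    }
    else if (cp <= 0x7FF)
    {
        // Two bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp <= 0xFFFF)
    {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For the three code points in the example, U+007F, U+07FF, and U+0800, the sketch produces `7F`, `DF BF`, and `E0 A0 80`, which matches the hex dump.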

#### Reading escaped Unicode into UTF-8 encodings and writing back escaped Unicode (with continuations)
```cpp
string input = "[\"\\u8A73\\u7D30\\u95B2\\u89A7\\uD800\\uDC01\\u4E00\"]";
json value = json::parse(input);
auto options = json_options{}
    .escape_all_non_ascii(true);
string output;
value.dump(output,options);

std::cout << "Input:" << '\n';
std::cout << input << '\n';
std::cout << '\n';
std::cout << "Output:" << '\n';
std::cout << output << '\n';
```
Since all of the escaped Unicode is non-ASCII, we get back the same text that we started with:
```
Input:
["\u8A73\u7D30\u95B2\u89A7\uD800\uDC01\u4E00"]

Output:
["\u8A73\u7D30\u95B2\u89A7\uD800\uDC01\u4E00"]
```
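
The pair `\uD800\uDC01` in this example is a UTF-16 surrogate pair that together denotes the single code point U+10001. The arithmetic can be sketched as follows; both function names are hypothetical and are shown for illustration only, with no range checking.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helpers, for illustration only (no validation).

// Combine a high (0xD800..0xDBFF) and a low (0xDC00..0xDFFF) surrogate
// into the code point they represent.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Split a code point in 0x10000..0x10FFFF into {high, low} surrogates.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000u;
    return { static_cast<std::uint16_t>(0xD800u + (v >> 10)),
             static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)) };
}
```

With these helpers, `from_surrogates(0xD800, 0xDC01)` gives `0x10001`, and `to_surrogates(0x10001)` gives back the pair `0xD800`, `0xDC01`.
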
### Wide character support for UTF16 and UTF32 encodings

jsoncons supports wide character strings with `wjson`. It assumes `UTF16` encoding if `wchar_t` has size 2 (Windows) and `UTF32` encoding if `wchar_t` has size 4.
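
The size rule above can be written down as a one-line check. This is only a sketch of the stated assumption; how jsoncons detects the encoding internally may differ, and `assumed_wide_encoding` is a hypothetical name.

```cpp
#include <string>

// Sketch of the rule described above: a 2-byte wchar_t is treated as
// UTF-16, anything else (typically 4 bytes) as UTF-32.
inline std::string assumed_wide_encoding()
{
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}
```

One practical consequence is that a character outside the Basic Multilingual Plane, such as `L"\U00010001"`, occupies two `wchar_t` units under UTF-16 and one under UTF-32.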

It is necessary to deal with UTF-16 character encoding in the Windows world because of the lack of UTF-8 support in the Windows system API.

Even if you choose to use wide character streams and strings to interact with the Windows API, you can still read and write files in the more widely supported, endianness-independent UTF-8 format. To do that, you need to imbue your streams with the facet `std::codecvt_utf8_utf16`, which encapsulates the conversion between `UTF-8` and `UTF-16`.
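
As a minimal sketch of the imbue step, the helpers below write a wide string to a file as UTF-8 and read the raw bytes back for inspection. The names `write_utf8` and `read_bytes` are hypothetical. Note that `<codecvt>` has been deprecated since C++17, so some compilers may emit warnings.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: write wide text to a file as UTF-8 by imbuing the
// stream with std::codecvt_utf8_utf16 before any output is written.
inline bool write_utf8(const std::string& filename, const std::wstring& text)
{
    std::wofstream os(filename, std::ios::binary);
    // The locale takes ownership of the facet and deletes it when done.
    os.imbue(std::locale(os.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    os << text;
    os.flush();
    return os.good();
}

// Hypothetical helper: read a file's raw bytes without any conversion.
inline std::string read_bytes(const std::string& filename)
{
    std::ifstream is(filename, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(is),
                       std::istreambuf_iterator<char>());
}
```

Writing `L"\u07FF\u0800"` this way should leave the five bytes `DF BF E0 A0 80` in the file, regardless of the byte order of the machine.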

Note that, at least with Microsoft Visual Studio (MSVS), you cannot open a Windows file with a Unicode name using the standard constructor:

    std::wfstream fs(const char* filename)

Instead, you need to use the non-standard Microsoft extension:

    std::wfstream fs(const wchar_t* filename)

#### Constructing a wjson value
```cpp
using jsoncons::wjson;

wjson j;
j[L"field1"] = L"test";
j[L"field2"] = 3.9;
j[L"field3"] = true;
std::wcout << j << L"\n";
```
Output:
```
{"field1":"test","field2":3.9,"field3":true}
```
#### Escaped Unicode
```cpp
wstring input = L"[\"\\u007F\\u07FF\\u0800\"]";
std::wistringstream is(input);

wjson val = wjson::parse(is);

wstring s = val[0].as<wstring>();
std::cout << "length=" << s.length() << '\n';
std::cout << "Hex dump: [";
for (std::size_t i = 0; i < s.size(); ++i)
{
    if (i != 0)
        std::cout << " ";
    // wchar_t holds a whole code unit, so a plain conversion is enough
    uint32_t u = static_cast<uint32_t>(s[i]);
    std::cout << "0x"  << std::hex<< std::setfill('0') << std::setw(2) << u;
}
std::cout << "]" << '\n';

std::wofstream os("output/xxx.txt");
os.imbue(std::locale(os.getloc(), new std::codecvt_utf8_utf16<wchar_t>));

auto options = wjson_options{}
    .escape_all_non_ascii(true);

os << pretty_print(val,options) << L"\n";
```
Output:
```
length=3
Hex dump: [0x7f 0x7ff 0x800]
```
and the file `output/xxx.txt` contains
```
["\u007F\u07FF\u0800"]
```