1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
|
{0 Unicode}
A {{!minimal}minimal introduction} and {{!tips}OCaml tips}.
{1:minimal A minimal introduction}
This introduction aims at presenting the minimum one should know to be
able to work sanely with Unicode. It is not specific to OCaml.
{2:characters Characters, if they exist}
The purpose of Unicode is to have a universal way of representing
characters of writing systems known to the world in computer
systems. Defining the notion of character is a very complicated
question with both philosophical and political implications. To side
step these issues, we only talk about characters from a programmer's
point of view and simply say that the purpose of Unicode is to assign
meaning to the integers of a well-defined integer range.
This range is called the Unicode {e codespace}, it spans from [0x0000]
to [0x10FFFF] and its boundaries are cast in stone. Members of this
range are called Unicode {e code points}. Note that an OCaml [int]
value can represent them on both 32- and 64-bit platforms.
There's a lot of (non-exclusive)
{{:http://www.unicode.org/glossary/}terminology} predicates that
can be applied to code points. I will only mention the most useful
ones here.
First there are the {e reserved} or {e unassigned} code points, those
are the integers to which the standard doesn't assign any meaning {e
yet}. They are reserved for future assignment and may become
meaningful in newer versions of the standard. Be aware that once a
code point has been assigned (aka as {e encoded}) by the standard most
of its properties may never change again, see the
{{:http://www.unicode.org/policies/stability_policy.html}stability
policy} for details.
A very important subset of code points are the Unicode {e scalar
values}, these are the code points that belong to the ranges
[0x0000]…[0xD7FF] and [0xE000]…[0x10FFFF]. This is the complete
Unicode codespace minus the range [0xD800]…[0xDFFF] of so called {e
surrogate} code points, a hack to be able to encode all scalar values
in UTF-16 (more on that below).
Scalar values are what I call, by a {b total abuse of terminology},
the Unicode characters; it is what a proper [uchar] type should
represent. From a programmer's point of view they are the sole
integers you will have to deal with during processing and the only
code points that you are allowed to serialize and deserialize to valid
Unicode byte sequences. Since OCaml 4.03 the standard library defines
the {!Stdlib.Uchar.t} type to represent them.
Unicode uses a standard notation to denote code points in running
text. A code point is expressed as U+n where {e n} is four to six
uppercase hexadecimal digits with leading zeros omitted unless the
code point has fewer than four digits (in [printf] words: ["U+%04X"]).
For example the code point bounds are expressed by U+0000 and U+10FFFF
and the surrogate bounds by U+D800 and U+DFFF.
{2:assignments What is assigned ?}
Lots of the world's scripts are encoded in the standard. The
{{:http://www.unicode.org/charts/}code charts} give a precise idea of
the coverage.
In order to be successful Unicode decided to be inclusive and to
contain pre-existing international and national standards. For example
the scalar values from U+0000 to U+007F correspond exactly to the code
values of characters encoded by the US-ASCII standard, while those
from U+0000 to U+00FF correspond exactly to the code values of
ISO-8859-1 (latin1). Many other standard are injected into the
codespace but their map to Unicode scalar values may not be as
straightforward as the two examples given above.
One thing to be aware of is that because of the inclusive nature of
the standard the same abstract character may be represented in more
than one way by the standard. A simple example is the latin character
"é", which can either be represented by the single scalar value U+00E9
or by the {e sequence} of scalar values <U+0065, U+0301> that is a
latin small letter "e" followed by the combining acute accent
"´". This non uniqueness of representation is problematic, for example
whenever you want to test sequences of scalar values for
equality. Unicode solves this by defining equivalence classes between
sequences of scalar values, this is called Unicode normalization and
we will talk about it later.
Another issue is character spoofing. Many encoded characters resemble
each other when displayed but have different scalar values and
meaning. The {{:http://www.unicode.org/faq/security.html}Unicode
Security FAQ} has more information and pointers about these issues.
{2:serializing Serializing integers — UTF-8, UTF-16, …}
There is more than one way of representing a large integer as a
sequence of bytes. The Unicode standard defines seven {e encoding
schemes}, also known as Unicode transformation formats (UTF), that
precisely define how to encode and decode {e scalar values} — take
note, scalar values, {b not code points} — as byte sequences.
{ul
{- UTF-8, a scalar value is represented by a sequence of one to 4
bytes. One of the valuable property of UTF-8 is that it is
compatible with the encoding of US-ASCII: the one byte sequences
are solely used for encoding the 128 scalar value U+0000 to U+007F
which correspond exactly to the US-ASCII code values. Any scalar
value stricly greater than U+007F will use more than one byte.}
{- UTF-16BE, a scalar value is either represented by one 16 bit
big-endian integer if its scalar value fits or by two surrogate
code points encoded as 16 bit big-endian integers (how exactly is
beyond the scope of this introduction).}
{- UTF-16LE is like UTF-16BE but uses little-endian encoded integers.}
{- UTF-16 is either UTF-16BE or UTF-16LE. The endianness is determined
by looking at the two initial bytes of the data stream:
{ol
{- If they encode a byte order mark character (BOM, U+FEFF) they
will be either [(0xFF,0xFE)], indicating UTF-16LE, or
[(0xFE,0xFF)] indicating UTF-16BE.}
{- Otherwise UTF-16BE is assumed.}}}
{- UTF-32BE, a scalar value is represented by one 32 bit big-endian
integer.}
{- UTF-32LE is like UTF-32BE but uses little-endian encoded integers.}
{- UTF-32 is either UTF-32BE or UTF-32LE, using the same
byte order mark mechanism as UTF-16, looking at the four initial
bytes of the data stream.}}
The cost of using one representation over the other depends on the
character usage. For example UTF-8 is fine for latin scripts but
wasteful for east-asian scripts, while the converse is true for
UTF-16. I never saw any usage of UTF-32 on disk or wires, it is very
wasteful. However, in memory, UTF-32 has the advantage that
characters become directly indexable.
For more information see the
{{:http://www.unicode.org/faq/utf_bom.html}Unicode UTF-8, UTF-16,
UTF-32 and BOM FAQ}.
{2:useful_scalar_values Useful scalar values}
The following scalar values are useful to know:
{ul
{- U+FEFF, the byte order mark (BOM) character used to detect endianness on
byte order sensitive UTFs.}
{- U+FFFD, the replacement character. Can be used to: stand for
unrepresentable characters when transcoding from another
representation, indicate that something was lost in best-effort
UTF decoders, etc.}
{- U+1F42B, the emoji bactrian camel (🐫, since Unicode 6.0.0).}}
{2:equivalence Equivalence and normalization}
We mentioned above that concrete textual data may be represented by
more than one sequence of scalar values. Latin letters with diacritics
are a simple example of that. In order to be able to test two
sequences of scalar values for equality we should be able to ignore
these differences. The easiest way to do so is to convert them to a
normal form where these differences are removed and then use binary
equality to test them.
However first we need to define a notion of equality between
sequences. Unicode defines two of them, which one to use depends on
your processing context.
{ul
{- {e Canonical} equivalence. Equivalent sequences should display
and be interpreted the same way when printed. For example the
sequence "B", "Ä" (<U+0042, U+00C4>) is canonically equivalent to
"B", "A", "¨" (<U+0042, U+0041, U+0308>).}
{- {e Compatibility} equivalence. Equivalent sequences may have format
differences in display and may be interpreted differently in some
contexts. For example the sequence made of the latin small ligature
fi "fi" (<U+FB01>) is compatibility equivalent to the sequence "f",
"i" (<U+0066, U+0069>). These two sequences are however not
canonically equivalent.}}
Canonical equivalence is included in compatibility equivalence: two
canonically equivalent sequences are also compatibility equivalent,
but the converse may not be true. Compatibility equivalence
distinguishes less, it has more equalities.
A normal form is a function mapping a sequence of scalar values to
a sequence of scalar values. The Unicode standard defines four
different normal forms, the one to use depends on the equivalence
you want and your processing context:
{ul
{- Normalization form D (NFD). Removes any canonical difference
and decomposes characters. For example the sequence "é"
(<U+00E9>) will normalize to the sequence "e", "´" (<U+0065, U+0301>.)}
{- Normalization form C (NFC). Removes any canonical difference and
composes characters. For example the sequence "e", "´" (<U+0065,
U+0301>) will normalize to the sequence "é" (<U+00E9>)}
{- Normalization form KD (NFKD). Removes canonical and compatibility
differences and decomposes characters.}
{- Normalization form KC (NFKC). Removes canonical and compatibility
differences and composes characters.}}
Once you have two sequences in a known normal form you can compare
them using binary equality. If the normal form is NFD or NFC, binary
equality will entail canonical equivalence of the sequences. If the
normal form is NFKC or NFKD equality will entail compatibility
equivalence of the sequences. Note that normal forms are {b not}
closed under concatenation: if you concatenate two sequences of scalar
values you have to renormalize the result.
For more information about normalization, see the
{{:http://www.unicode.org/faq/normalization.html}Normalization FAQ}.
{2:collation Collation — sorting in alphabetical order}
Normalisation forms allow to define a total order between sequences of
scalar values using binary comparison. However this order is purely
arbitrary. It has no meaning because the magnitude of a scalar value
has, in general, no meaning.
The process of ordering sequences of scalar values in a standard order
like alphabetical order is called {e collation}. Unicode defines a
customizable algorithm to order two sequences of scalar values in a
meaningful way: the Unicode collation algorithm. For more information
and further pointers see the
{{:http://www.unicode.org/faq/collation.html}Unicode Collation FAQ}.
{1:tips OCaml tips}
{2:uchartype Characters as [Uchar.t] values.}
Since OCaml 4.03 the standard library defines the {!Stdlib.Uchar.t}
type which represents {{!characters}Unicode scalar values}.
{2:utf_8_strings Unicode text as UTF-8 encoded OCaml strings}
For most OCaml programs it is entirely sufficient to deal with Unicode
by just treating the byte sequence of regular OCaml [string] values as
{b valid} UTF-8 encoded data. Many libraries return Unicode text using
this representation.
Besides latin1 identifiers having been deprecated in OCaml 4.01, UTF-8
encoding your sources allows you to write UTF-8 encoded string
literals directly in your programs. Be aware though that as far as
OCaml's compiler is concerned these are just sequences of bytes and
you can't trust these strings to be valid UTF-8 as they depend on how
correctly your editor encodes them. That is you {b will need} to
validate and most likely normalize them unless you:
{ul
{- Escape their valid UTF-8 bytes explicitly. For example
["\xF0\x9F\x90\xAB"] is the correct encoding of U+1F42B.}
{- Or use Unicode escapes (since OCaml 4.06). For example ["\u{1F42B}"]
will UTF-8 encode the character U+1F42B in the string.}}
Checking the validity of UTF-8 strings should only be performed at the
boundaries of your program: on your string literals, on data input or
on the results of untrusted libraries (be careful, some libraries like
Yojson will happily return invalid UTF-8 strings). This allows you to
only deal with valid UTF-8 throughout your program and avoid redundant
validity checks, internally or on output. The following properties of
UTF-8 are useful to remember:
{ul
{- UTF-8 validity is closed under string concatenation:
concatenating two valid UTF-8 strings results in a valid UTF-8
string.}
{- Splitting a valid UTF-8 encoded string at UTF-8
encoded US-ASCII scalar values (i.e. at any byte < 128) will
result in valid UTF-8 encoded substrings.}}
{2:utf_decoding UTF encoding and decoding support}
UTF decoding and validity checking is available in the OCaml standard
library {{!Stdlib.String.utf}[String]} and
{{!Stdlib.Bytes.utf}[Bytes]} modules since OCaml 4.14.
UTF encoding is available both in the {{!Stdlib.Bytes.utf}[Bytes]}
(since 4.14) and {{!Stdlib.Buffer}[Buffer]} module (since 4.06).
If you need this functionality prior to 4.14 the third-party {!Uutf}
module can be used.
{2:utf_8_ascii UTF-8 and ASCII}
As mentioned in {!serializing}, each of the 128 US-ASCII characters is
represented by its own US-ASCII byte representation in UTF-8.
So if you want to look for a US-ASCII character in a UTF-8 encoded
string, you can just scan the bytes.
But beware on the nature of your data and the algorithm you need to
implement. For example to detect spaces in the string, looking for the
US-ASCII space U+0020 may not be sufficient, there are a lot of other
space characters like the no break space U+00A0 that are beyond the
US-ASCII repertoire.
Decoding the characters with {!String.get_utf_8_uchar} and checking
them with {!Uucp.White.is_white_space} is a better idea. Same holds
for line breaks, see for example {!Uutf.nln} and {!Uutf.readlines} for
more information about these issues.
{2:eqcmpnorm Equate, compare and normalize UTF-8 encoded OCaml strings}
If you understood well the above section about
{{!equivalence}equivalence and normalization} you should realise that
blindly comparing UTF-8 encoded OCaml strings using {!Stdlib.compare}
won't bring you anywhere if you don't normalize them before.
The {!Uunf_string} module can be used for that. Remember that concatenating
normalized strings does {b not} result in a normalized string.
Using {!Stdlib.compare} on {e normalized} UTF-8 encoded OCaml strings
defines a total order on them that you can use with the {!Map} or
{!Set} modules as long as you are not interested in the actual {e
meaning} of the order.
For case insensitive equality have a look at the
{{!Uucp.Case.caselesseq}sample code} of the {!Case} module.
{2:alphasort Sort strings alphabetically}
String collation can be performed using the
{{:https://github.com/paurkedal/confero}confero} package.
{2:boundaries Find user-perceived character, word, sentence and line
boundaries in Unicode text.}
The {!Uuseg} module implements the
{{:http://www.unicode.org/reports/tr29/} Unicode text segmentation
algorithms} to find user-perceived character, word and sentence
boundaries in Unicode text. It also provides an implementation of the
{{:http://www.unicode.org/reports/tr14/}Unicode Line Breaking
Algorithm} to find line breaks and line break opportunities.
Among other things the {!Uuseg_string} module uses these algorithms to
provide OCaml standard library {{!Stdlib.Format}formatters} for best-effort
formatting of UTF-8 encoded strings.
{2:readline Unicode readline}
A [readline] function as mandated by the Unicode standard is available
in {{!Uutf.readline}[Uutf]'s sample code}.
{2:noranges Range processing}
Forget about trying to process Unicode characters using hard coded
ranges of scalar values like it was possible to do with US-ASCII. The
Unicode standard is not closed, it is evolving, new characters are
being assigned. This makes it impossible to derive properties based
simply on their integer value or position in ranges of
characters.
This is the reason why we have the Unicode character database and
[Uucp] to access their properties. Using {!Uucp.White.is_white_space}
will be future proof should a new character deemed white be added to
the standard – both [Uucp] and your program will need a recompile
though.
{2:transcode Transcoding}
Transcoding from legacy encodings to Unicode may be quite involved,
use {{:https://github.com/yoriyuki/Camomile}Camomile} if you need to
do that. {{:https://github.com/mirage/rosetta}rosetta} may be another
option.
There is however one translation that is very easy and direct: it is
the one from ISO 8859-1 also known as latin1, the default encoding of
OCaml [char]s. latin1 having been encoded in Unicode in the range of
scalar values U+0000 to U+00FF which corresponds to latin1 code value,
the translation is trivial, it is the identity:
{[
let char_to_scalar_value c = Char.code c
let char_of_scalar_value s =
if s > 255 then invalid_arg "" (* can't represent *) else
Char.chr s
]}
{2:ppcp Pretty-printing code points in ASCII}
["U+%04X"] is an OCaml formatting string for printing a US-ASCII
representation of a Unicode code point according to the standards'
notational conventions. This is what the {!Fmt.Dump.uchar} formatter
does for {!Stdlib.Uchar.t} values.
{2:ocamllibs Writing OCaml libraries}
If you write a library that deals with textual data, you should,
unless technically impossible, always interact with the client of the
library using Unicode. If there are other encodings involved transcode
them to/from Unicode so that the client needs only to deal with
Unicode, the burden of dealing with the encoding mess has to be on the
library, not the client.
In this case there is no absolute need to depend on a Unicode text
data structure, just use {b valid} UTF-8 encoded data as OCaml
[string]s and the standard library {!Stdlib.Uchar.t} type.
Specify clearly in the documentation that all the [string]s returned
by or given to the library must be valid UTF-8 encoded data. This
validity contract is important for performance reasons: it allows both
the client and the library to trust the string and forgo redundant
validity checks. Remember that concatenating valid UTF-8 strings
results in valid UTF-8 string.
|