1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781
|
<pre>Network Working Group P. Hoffman
Request for Comments: 2781 Internet Mail Consortium
Category: Informational F. Yergeau
Alis Technologies
February 2000
<span class="h1">UTF-16, an encoding of ISO 10646</span>
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2000). All Rights Reserved.
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a>. Introduction</span>
This document describes the UTF-16 encoding of Unicode/ISO-10646,
addresses the issues of serializing UTF-16 as an octet stream for
transmission over the Internet, discusses MIME charset naming as
described in [<a href="#ref-CHARSET-REG" title=""IANA Charset Registration Procedures"">CHARSET-REG</a>], and contains the registration for three
MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE
(little-endian), and UTF-16.
<span class="h3"><a class="selflink" id="section-1.1" href="#section-1.1">1.1</a> Background and motivation</span>
The Unicode Standard [<a href="#ref-UNICODE" title=""The Unicode Standard -- Version 3.0"">UNICODE</a>] and ISO/IEC 10646 [<a href="#ref-ISO-10646" title="published as Amendment 1. Many other amendments are currently at various stages of standardization. A second edition is in preparation">ISO-10646</a>] jointly
define a coded character set (CCS), hereafter referred to as Unicode,
which encompasses most of the world's writing systems [<a href="#ref-WORKSHOP" title=""Report of the IAB Character Set Workshop"">WORKSHOP</a>].
UTF-16, the object of this specification, is one of the standard ways
of encoding Unicode character data; it has the characteristics of
encoding all currently defined characters (in plane 0, the BMP) in
exactly two octets and of being able to encode all other characters
likely to be defined (the next 16 planes) in exactly four octets.
The Unicode Standard further defines additional character properties
and other application details of great interest to implementors. Up
to the present time, changes in Unicode and amendments to ISO/IEC
10646 have tracked each other, so that the character repertoires and
code point assignments have remained in sync. The relevant
standardization committees have committed to maintain this very
useful synchronism, as well as not to assign characters outside of
the 17 planes accessible to UTF-16.
<span class="grey">Hoffman & Yergeau Informational [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
The IETF policy on character sets and languages [<a href="#ref-CHARPOLICY" title=""IETF Policy on Character Sets and Languages"">CHARPOLICY</a>] says
that IETF protocols MUST be able to use the UTF-8 character encoding
scheme [<a href="#ref-UTF-8" title=""UTF-8, a transformation format of ISO 10646"">UTF-8</a>]. Some products and network standards already specify
UTF-16, making it an important encoding for the Internet. This
document is not an update to the [<a href="#ref-CHARPOLICY" title=""IETF Policy on Character Sets and Languages"">CHARPOLICY</a>] document, only a
description of the UTF-16 encoding.
<span class="h3"><a class="selflink" id="section-1.2" href="#section-1.2">1.2</a> Terminology</span>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <a href="./rfc2119">RFC 2119</a> [<a href="#ref-MUSTSHOULD" title=""Key words for use in RFCs to Indicate Requirement Levels"">MUSTSHOULD</a>].
Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" is the character whose value is the
character assigned the integer value 316 (decimal) in the CCS.
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a>. UTF-16 definition</span>
UTF-16 is described in the Unicode Standard, version 3.0 [<a href="#ref-UNICODE" title=""The Unicode Standard -- Version 3.0"">UNICODE</a>].
The definitive reference is Annex Q of ISO/IEC 10646-1 [<a href="#ref-ISO-10646" title="published as Amendment 1. Many other amendments are currently at various stages of standardization. A second edition is in preparation">ISO-10646</a>].
The rest of this section summarizes the definition is simple terms.
In ISO 10646, each character is assigned a number, which Unicode
calls the Unicode scalar value. This number is the same as the UCS-4
value of the character, and this document will refer to it as the
"character value" for brevity. In the UTF-16 encoding, characters are
represented using either one or two unsigned 16-bit integers,
depending on the character value. Serialization of these integers for
transmission as a byte stream is discussed in <a href="#section-3">Section 3</a>.
The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a
single 16-bit integer with a value equal to that of the character
number.
- Characters with values between 0x10000 and 0x10FFFF are
represented by a 16-bit integer with a value between 0xD800 and
0xDBFF (within the so-called high-half zone or high surrogate
area) followed by a 16-bit integer with a value between 0xDC00 and
0xDFFF (within the so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16.
Note: Values between 0xD800 and 0xDFFF are specifically reserved for
use with UTF-16, and don't have any characters assigned to them.
<span class="grey">Hoffman & Yergeau Informational [Page 2]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
<span class="h3"><a class="selflink" id="section-2.1" href="#section-2.1">2.1</a> Encoding UTF-16</span>
Encoding of a single character from an ISO 10646 character value to
UTF-16 proceeds as follows. Let U be the character number, no greater
than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and
terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
U' must be less than or equal to 0xFFFFF. That is, U' can be
represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
0xDC00, respectively. These integers each have 10 bits free to
encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
bits of W1 and the 10 low-order bits of U' to the 10 low-order
bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
<span class="h3"><a class="selflink" id="section-2.2" href="#section-2.2">2.2</a> Decoding UTF-16</span>
Decoding of a single character from UTF-16 to an ISO 10646 character
value proceeds as follows. Let W1 be the next 16-bit integer in the
sequence of integers representing the text. Let W2 be the (eventual)
next integer following W1.
1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
of W1. Terminate.
2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
is in error and no valid character can be obtained using W1.
Terminate.
3) If there is no W2 (that is, the sequence ends with W1), or if W2
is not between 0xDC00 and 0xDFFF, the sequence is in error.
Terminate.
4) Construct a 20-bit unsigned integer U', taking the 10 low-order
bits of W1 as its 10 high-order bits and the 10 low-order bits of
W2 as its 10 low-order bits.
<span class="grey">Hoffman & Yergeau Informational [Page 3]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
5) Add 0x10000 to U' to obtain the character value U. Terminate.
Note that steps 2 and 3 indicate errors. Error recovery is not
specified by this document. When terminating with an error in steps 2
and 3, it may be wise to set U to the value of W1 to help the caller
diagnose the error and not lose information. Also note that a string
decoding algorithm, as opposed to the single-character decoding
described above, need not terminate upon detection of an error, if
proper error reporting and/or recovery is provided.
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a>. Labelling UTF-16 text</span>
<a href="#appendix-A">Appendix A</a> of this specification contains registrations for three
MIME charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets
represent the combination of a CCS (a coded character set) and a CES
(a character encoding scheme). Here the CCS is Unicode/ISO 10646 and
the CES is the same in all three cases, except for the serialization
order of the octets in each character, and the external determination
of which serialization is used.
This section describes which of the three labels to apply to a stream
of text. <a href="#section-4">Section 4</a> describes how to interpret the labels on a stream
of text.
<span class="h3"><a class="selflink" id="section-3.1" href="#section-3.1">3.1</a> Definition of big-endian and little-endian</span>
Historically, computer hardware has processed two-octet entities such
as 16-bit integers in one of two ways. So-called "big-endian"
hardware handles two-octet entities with the higher-order octet
first, that is at the lower address in memory; when written out to
disk or to a network interface (serializing), the high-order octet
thus appears first in the data stream. On the other hand, "Little-
endian" hardware handles two-octet entities with the lower-order
octet first. Hardware of both kinds is common today.
For example, the unsigned 16-bit integer that represents the decimal
number 258 is 0x0102. The big-endian serialization of that number is
the octet 0x01 followed by the octet 0x02. The little-endian
serialization of that number is the octet 0x02 followed by the octet
0x01. The following C code fragment demonstrates a way to write 16-
bit quantities to a file in big-endian order, irrespective of the
hardware's native byte order.
void write_be(unsigned short u, FILE f) /* assume short is 16 bits */
{
putc(u >> 8, f); /* output high-order byte */
putc(u & 0xFF, f); /* then low-order */
}
<span class="grey">Hoffman & Yergeau Informational [Page 4]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has yet to be formally
defined in a standards-track document. Although ISO 10646 prefers
big-endian serialization (section 6.3 of [<a href="#ref-ISO-10646" title="published as Amendment 1. Many other amendments are currently at various stages of standardization. A second edition is in preparation">ISO-10646</a>]), little-endian
order is also sometimes used on the Internet.
<span class="h3"><a class="selflink" id="section-3.2" href="#section-3.2">3.2</a> Byte order mark (BOM)</span>
The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE
ORDER MARK" (abbreviated "BOM"). The latter name hints at a second
possible usage of the character, in addition to its normal use as a
genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage,
suggested by Unicode <a href="#section-2.4">section 2.4</a> and ISO 10646 Annex F (informative),
is to prepend a 0xFEFF character to a stream of Unicode characters as
a "signature"; a receiver of such a serialized stream may then use
the initial character both as a hint that the stream consists of
Unicode characters and as a way to recognize the serialization order.
In serialized UTF-16 prepended with such a signature, the order is
big-endian if the first two octets are 0xFE followed by 0xFF; if they
are 0xFF followed by 0xFE, the order is little-endian. Note that
0xFFFE is not a Unicode character, precisely to preserve the
usefulness of 0xFEFF as a byte-order mark.
It is important to understand that the character 0xFEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a byte-order mark. The contrapositive of that
statement is not always true: the character 0xFEFF in the first
position of a stream MAY be interpreted as a zero-width non-breaking
space, and is not always a byte-order mark. For example, if a process
splits a UTF-16 string into many parts, a part might begin with
0xFEFF because there was a zero-width non-breaking space at the
beginning of that substring.
The Unicode standard further suggests than an initial 0xFEFF
character may be stripped before processing the text, the rationale
being that such a character in initial position may be an artifact of
the encoding (an encoding signature), not a genuine intended "ZERO
WIDTH NON-BREAKING SPACE". Note that such stripping might affect an
external process at a different layer (such as a digital signature or
a count of the characters) that is relying on the presence of all
characters in the stream.
In particular, in UTF-16 plain text it is likely, but not certain,
that an initial 0xFEFF is a signature. When concatenating two
strings, it is important to strip out those signatures, because
otherwise the resulting string may contain an unintended "ZERO WIDTH
<span class="grey">Hoffman & Yergeau Informational [Page 5]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
NON-BREAKING SPACE" at the connection point. Also, some
specifications mandate an initial 0xFEFF character in objects
labelled as UTF-16 and specify that this signature is not part of the
object.
<span class="h3"><a class="selflink" id="section-3.3" href="#section-3.3">3.3</a> Choosing a label for UTF-16 text</span>
Any labelling application that uses UTF-16 character encoding, and
explicitly labels the text, and knows the serialization order of the
characters in text, SHOULD label the text as either "UTF-16BE" or
"UTF-16LE", whichever is appropriate based on the endianness of the
text. This allows applications processing the text, but unable to
look inside the text, to know the serialization definitively.
Text in the "UTF-16BE" charset MUST be serialized with the octets
which make up a single 16-bit UTF-16 value in big-endian order.
Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.
Text in the "UTF-16LE" charset MUST be serialized with the octets
which make up a single 16-bit UTF-16 value in little-endian order.
Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.
Any labelling application that uses UTF-16 character encoding, and
puts an explicit charset label on the text, and does not know the
serialization order of the characters in text, MUST label the text as
"UTF-16", and SHOULD make sure the text starts with 0xFEFF.
An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
would occur with document formats that mandate a BOM in UTF-16 text,
thereby requiring the use of the "UTF-16" tag only.
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a>. Interpreting text labels</span>
When a program sees text labelled as "UTF-16BE", "UTF-16LE", or
"UTF-16", it can make some assumptions, based on the labelling rules
given in the previous section. These assumptions allow the program to
then process the text.
<span class="h3"><a class="selflink" id="section-4.1" href="#section-4.1">4.1</a> Interpreting text labelled as UTF-16BE</span>
Text labelled "UTF-16BE" can always be interpreted as being big-
endian. The detection of an initial BOM does not affect de-
serialization of text labelled as UTF-16BE. Finding 0xFF followed by
0xFE is an error since there is no Unicode character 0xFFFE.
<span class="grey">Hoffman & Yergeau Informational [Page 6]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-7" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
<span class="h3"><a class="selflink" id="section-4.2" href="#section-4.2">4.2</a> Interpreting text labelled as UTF-16LE</span>
Text labelled "UTF-16LE" can always be interpreted as being little-
endian. The detection of an initial BOM does not affect de-
serialization of text labelled as UTF-16LE. Finding 0xFE followed by
0xFF is an error since there is no Unicode character 0xFFFE, which
would be the interpretation of those octets under little-endian
order.
<span class="h3"><a class="selflink" id="section-4.3" href="#section-4.3">4.3</a> Interpreting text labelled as UTF-16</span>
Text labelled with the "UTF-16" charset might be serialized in either
big-endian or little-endian order. If the first two octets of the
text is 0xFE followed by 0xFF, then the text can be interpreted as
being big-endian. If the first two octets of the text is 0xFF
followed by 0xFE, then the text can be interpreted as being little-
endian. If the first two octets of the text is not 0xFE followed by
0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
interpreted as being big-endian.
All applications that process text with the "UTF-16" charset label
MUST be able to read at least the first two octets of the text and be
able to process those octets in order to determine the serialization
order of the text. Applications that process text with the "UTF-16"
charset label MUST NOT assume the serialization without first
checking the first two octets to see if they are a big-endian BOM, a
little-endian BOM, or not a BOM. All applications that process text
with the "UTF-16" charset label MUST be able to interpret both big-
endian and little-endian text.
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a>. Examples</span>
For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value
0x12345 (this character does not exist at present in Unicode).
The examples here all evaluate to the phrase:
*=Ra
where the "*" represents the Ra hieroglyph (0x12345).
Text labelled with UTF-16BE, without a BOM:
D8 08 DF 45 00 3D 00 52 00 61
Text labelled with UTF-16LE, without a BOM:
08 D8 45 DF 3D 00 52 00 61 00
<span class="grey">Hoffman & Yergeau Informational [Page 7]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-8" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
Big-endian text labelled with UTF-16, with a BOM:
FE FF D8 08 DF 45 00 3D 00 52 00 61
Little-endian text labelled with UTF-16, with a BOM:
FF FE 08 D8 45 DF 3D 00 52 00 61 00
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a>. Versions of the standards</span>
ISO/IEC 10646 is updated from time to time by published amendments;
similarly, different versions of the Unicode standard exist: 1.0,
1.1, 2.0, 2.1, and 3.0 as of this writing. Each new version replaces
the previous one, but implementations, and more significantly data,
are not updated instantly.
In general, the changes amount to adding new characters, which does
not pose particular problems with old data. Amendment 5 to ISO/IEC
10646, however, has moved and expanded the Korean Hangul block,
thereby making any previous data containing Hangul characters invalid
under the new version. Unicode 2.0 has the same difference from
Unicode 1.1. The official justification for allowing such an
incompatible change was that no significant implementations and data
containing Hangul existed, a statement that is likely to be true but
remains unprovable. The incident has been dubbed the "Korean mess",
and the relevant committees have pledged to never, ever again make
such an incompatible change.
New versions, and in particular any incompatible changes, have
consequences regarding MIME character encoding labels, to be
discussed in <a href="#appendix-A">Appendix A</a>.
<span class="h2"><a class="selflink" id="section-7" href="#section-7">7</a>. IANA Considerations</span>
IANA is to register the character sets found in Appendixes A.1, A.2,
and A.3 according to <a href="./rfc2278">RFC 2278</a>, using registration templates found in
those appendixes.
<span class="h2"><a class="selflink" id="section-8" href="#section-8">8</a>. Security Considerations</span>
UTF-16 is based on the ISO 10646 character set, which is frequently
being added to, as described in <a href="#section-6">Section 6</a> and <a href="#appendix-A">Appendix A</a> of this
document. Processors must be able to handle characters that are not
defined at the time that the processor was created in such a way as
to not allow an attacker to harm a recipient by including unknown
characters.
Processors that handle any type of text, including text encoded as
UTF-16, must be vigilant in checking for control characters that
might reprogram a display terminal or keyboard. Similarly, processors
<span class="grey">Hoffman & Yergeau Informational [Page 8]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-9" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
that interpret text entities (such as looking for embedded
programming code), must be careful not to execute the code without
first alerting the recipient.
Text in UTF-16 may contain special characters, such as the OBJECT
REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
depending on the interpretation of the processing program and the
availability of an external data stream that would be executed. This
external processing may have side-effects that allow the sender of a
message to attack the receiving system.
Implementors of UTF-16 need to consider the security aspects of how
they handle illegal UTF-16 sequences (that is, sequences involving
surrogate pairs that have illegal values or unpaired surrogates). It
is conceivable that in some circumstances an attacker would be able
to exploit an incautious UTF-16 parser by sending it an octet
sequence that is not permitted by the UTF-16 syntax, causing it to
behave in some anomalous fashion.
<span class="h2"><a class="selflink" id="section-9" href="#section-9">9</a>. References</span>
[<a id="ref-CHARPOLICY">CHARPOLICY</a>] Alvestrand, H., "IETF Policy on Character Sets and
Languages", <a href="https://www.rfc-editor.org/bcp/bcp18">BCP 18</a>, <a href="./rfc2277">RFC 2277</a>, January 1998.
[<a id="ref-CHARSET-REG">CHARSET-REG</a>] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", <a href="https://www.rfc-editor.org/bcp/bcp19">BCP 19</a>, <a href="./rfc2278">RFC 2278</a>, January 1998.
[<a id="ref-HTTP-1.1">HTTP-1.1</a>] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", <a href="./rfc2616">RFC 2616</a>, June 1999.
[<a id="ref-ISO-10646">ISO-10646</a>] ISO/IEC 10646-1:1993. International Standard --
Information technology -- Universal Multiple-Octet
Coded Character Set (UCS) -- Part 1: Architecture and
Basic Multilingual Plane. 22 amendments and two
technical corrigenda have been published up to now.
UTF-16 is described in Annex Q, published as Amendment
1. Many other amendments are currently at various
stages of standardization. A second edition is in
preparation, probably to be published in 2000; in this
new edition, UTF-16 will probably be described in Annex
C.
[<a id="ref-MUSTSHOULD">MUSTSHOULD</a>] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", <a href="https://www.rfc-editor.org/bcp/bcp14">BCP 14</a>, <a href="./rfc2119">RFC 2119</a>, March 1997.
[<a id="ref-UNICODE">UNICODE</a>] The Unicode Consortium, "The Unicode Standard --
Version 3.0", ISBN 0-201-61633-5. Described at
<span class="grey">Hoffman & Yergeau Informational [Page 9]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-10" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
<<a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">http://www.unicode.org/unicode/standard/versions/Unicode3.0.html</a>>.
[<a id="ref-UTF-8">UTF-8</a>] Yergeau, F., "UTF-8, a transformation format of ISO
10646", <a href="./rfc2279">RFC 2279</a>, January 1998.
[<a id="ref-WORKSHOP">WORKSHOP</a>] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin., M. and P. Svanberg, "Report of
the IAB Character Set Workshop", <a href="./rfc2130">RFC 2130</a>, April 1997.
<span class="h2"><a class="selflink" id="section-10" href="#section-10">10</a>. Acknowledgments</span>
Deborah Goldsmith wrote a great deal of the initial wording for this
specification. Martin Duerst proposed numerous significant changes.
Other significant contributors include:
Mati Allouche
Walt Daniels
Mark Davis
Ned Freed
Asmus Freytag
Lloyd Honomichl
Dan Kegel
Murata Makoto
Larry Masinter
Markus Scherer
Keld Simonsen
Ken Whistler
Some of the text in this specification was copied from [<a href="#ref-UTF-8" title=""UTF-8, a transformation format of ISO 10646"">UTF-8</a>], and
that document was worked on by many people. Please see the
acknowledgments section in that document for more people who may have
contributed indirectly to this document.
<span class="grey">Hoffman & Yergeau Informational [Page 10]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-11" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
<span class="h2"><a class="selflink" id="appendix-A" href="#appendix-A">A</a>. Charset registrations</span>
This memo is meant to serve as the basis for registration of three
MIME charsets [<a href="#ref-CHARSET-REG" title=""IANA Charset Registration Procedures"">CHARSET-REG</a>]. The proposed charsets are "UTF-16BE",
"UTF-16LE", and "UTF-16". These strings label objects containing text
consisting of characters from the repertoire of ISO/IEC 10646
including all amendments at least up to amendment 5 (Korean block),
encoded to a sequence of octets using the encoding and serialization
schemes outlined above.
Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for
use in media types under the "text" top-level type, because they do
not encode line endings in the way required for MIME "text" media
types. An exception to this is HTTP, which uses a MIME-like
mechanism, but is exempt from the restrictions on the text top-level
type (see <a href="#section-19.4.2">section 19.4.2</a> of HTTP 1.1 [<a href="#ref-HTTP-1.1" title=""Hypertext Transfer Protocol -- HTTP/1.1"">HTTP-1.1</a>]).
It is noteworthy that the labels described here do not contain a
version identification, referring generically to ISO/IEC 10646. This
is intentional, the rationale being as follows:
A MIME charset is designed to give just the information needed to
interpret a sequence of bytes received on the wire into a sequence of
characters, nothing more (see <a href="./rfc2045#section-2.2">RFC 2045, section 2.2</a>, in [MIME]). As
long as a character set standard does not change incompatibly,
version numbers serve no purpose, because one gains nothing by
learning from the tag that newly assigned characters may be received
that one doesn't know about. The tag itself doesn't teach anything
about the new characters, which are going to be received anyway.
Hence, as long as the standards evolve compatibly, the apparent
advantage of having labels that identify the versions is only that,
apparent. But there is a disadvantage to such version-dependent
labels: when an older application receives data accompanied by a
newer, unknown label, it may fail to recognize the label and be
completely unable to deal with the data, whereas a generic, known
label would have triggered mostly correct processing of the data,
which may well not contain any new characters.
The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
change, in principle contradicting the appropriateness of a version
independent MIME charset as described above. But the compatibility
problem can only appear with data containing Korean Hangul characters
encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646
before amendment 5), and there is arguably no such data to worry
about, this being the very reason the incompatible change was deemed
acceptable.
<span class="grey">Hoffman & Yergeau Informational [Page 11]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-12" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
In practice, then, a version-independent label is warranted, provided
the label is understood to refer to all versions after Amendment 5,
and provided no incompatible change actually occurs. Should
incompatible changes occur in a later version of ISO/IEC 10646, the
MIME charsets defined here will stay aligned with the previous
version until and unless the IETF specifically decides otherwise.
<span class="h3"><a class="selflink" id="appendix-A.1" href="#appendix-A.1">A.1</a> Registration for UTF-16BE</span>
To: ietf-charsets@iana.org
Subject: Registration of new charset
Charset name(s): UTF-16BE
Published specification(s): This specification
Suitable for use in MIME content types under the
"text" top-level type: No
Person & email address to contact for further information:
Paul Hoffman <phoffman@imc.org>
Francois Yergeau <fyergeau@alis.com>
<span class="h3"><a class="selflink" id="appendix-A.2" href="#appendix-A.2">A.2</a> Registration for UTF-16LE</span>
To: ietf-charsets@iana.org
Subject: Registration of new charset
Charset name(s): UTF-16LE
Published specification(s): This specification
Suitable for use in MIME content types under the
"text" top-level type: No
Person & email address to contact for further information:
Paul Hoffman <phoffman@imc.org>
Francois Yergeau <fyergeau@alis.com>
<span class="h3"><a class="selflink" id="appendix-A.3" href="#appendix-A.3">A.3</a> Registration for UTF-16</span>
To: ietf-charsets@iana.org
Subject: Registration of new charset
Charset name(s): UTF-16
Published specification(s): This specification
<span class="grey">Hoffman & Yergeau Informational [Page 12]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-13" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
Suitable for use in MIME content types under the
"text" top-level type: No
Person & email address to contact for further information:
Paul Hoffman <phoffman@imc.org>
Francois Yergeau <fyergeau@alis.com>
Authors' Addresses
Paul Hoffman
Internet Mail Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
EMail: phoffman@imc.org
Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2 Canada
EMail: fyergeau@alis.com
<span class="grey">Hoffman & Yergeau Informational [Page 13]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-14" ></span>
<span class="grey"><a href="./rfc2781">RFC 2781</a> UTF-16, an encoding of ISO 10646 February 2000</span>
Full Copyright Statement
Copyright (C) The Internet Society (2000). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.
Hoffman & Yergeau Informational [Page 14]
</pre>
|