<pre>Internet Architecture Board (IAB) D. Thaler
Request for Comments: 6055 Microsoft
Updates: <a href="./rfc2130">2130</a> J. Klensin
Category: Informational
ISSN: 2070-1721 S. Cheshire
Apple
February 2011
<span class="h1">IAB Thoughts on Encodings for Internationalized Domain Names</span>
Abstract
This document explores issues with Internationalized Domain Names
(IDNs) that result from the use of various encoding schemes such as
UTF-8 and the ASCII-Compatible Encoding produced by the Punycode
algorithm. It focuses on the importance of agreeing on a single
encoding and how complicated the state of affairs ends up being as a
result of using different encodings today.
Status of This Memo
This document is not an Internet Standards Track specification; it is
published for informational purposes.
This document is a product of the Internet Architecture Board (IAB)
and represents information that the IAB has deemed valuable to
provide for permanent record. Documents approved for publication by
the IAB are not a candidate for any level of Internet Standard; see
<a href="./rfc5741#section-2">Section 2 of RFC 5741</a>.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
<a href="http://www.rfc-editor.org/info/rfc6055">http://www.rfc-editor.org/info/rfc6055</a>.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to <a href="https://www.rfc-editor.org/bcp/bcp78">BCP 78</a> and the IETF Trust's Legal
Provisions Relating to IETF Documents
(<a href="http://trustee.ietf.org/license-info">http://trustee.ietf.org/license-info</a>) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
<span class="grey">Thaler, et al. Informational [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc6055">RFC 6055</a> IDN Encodings February 2011</span>
Table of Contents
<a href="#section-1">1</a>. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-2">2</a>
<a href="#section-1.1">1.1</a>. APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-8">8</a>
<a href="#section-2">2</a>. Use of Non-DNS Protocols . . . . . . . . . . . . . . . . . . . <a href="#page-9">9</a>
<a href="#section-3">3</a>. Use of Non-ASCII in DNS . . . . . . . . . . . . . . . . . . . <a href="#page-10">10</a>
<a href="#section-3.1">3.1</a>. Examples . . . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-14">14</a>
<a href="#section-4">4</a>. Recommendations . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-16">16</a>
<a href="#section-5">5</a>. Security Considerations . . . . . . . . . . . . . . . . . . . <a href="#page-18">18</a>
<a href="#section-6">6</a>. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-19">19</a>
<a href="#section-7">7</a>. IAB Members at the Time of Approval . . . . . . . . . . . . . <a href="#page-19">19</a>
<a href="#section-8">8</a>. References . . . . . . . . . . . . . . . . . . . . . . . . . . <a href="#page-20">20</a>
<a href="#section-8.1">8.1</a>. Normative References . . . . . . . . . . . . . . . . . . . <a href="#page-20">20</a>
<a href="#section-8.2">8.2</a>. Informative References . . . . . . . . . . . . . . . . . . <a href="#page-20">20</a>
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a>. Introduction</span>
The goal of this document is to explore what can be learned from some
current difficulties in implementing Internationalized Domain Names
(IDNs).
A domain name consists of a sequence of labels, conventionally
written separated by dots. An IDN is a domain name that contains one
or more labels that, in turn, contain one or more non-ASCII
characters. Just as with plain ASCII domain names, each IDN label
must be encoded using some mechanism before it can be transmitted in
network packets, stored in memory, stored on disk, etc. These
encodings need to be reversible, but they need not store domain names
the same way humans conventionally write them on paper. For example,
when transmitted over the network in DNS packets, domain name labels
are *not* separated with dots.
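
As an illustration only, the following Python sketch shows one way a dotted ASCII name could be turned into the length-prefixed label sequence carried in DNS packets. The function name to_wire is hypothetical, and the 63-octet label limit it enforces is the DNS limit rather than something stated in the paragraph above.

```python
def to_wire(name: str) -> bytes:
    """Encode an ASCII domain name as DNS-style length-prefixed labels."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        # Assumption: labels are limited to 1..63 octets, as in the DNS.
        if len(raw) == 0 or len(raw) > 63:
            raise ValueError("each label must be 1 to 63 octets")
        out.append(len(raw))   # a length octet takes the place of the dot
        out += raw
    out.append(0)              # the zero-length root label ends the name
    return bytes(out)


wire = to_wire("www.example.com")
print(wire)   # b'\x03www\x07example\x03com\x00'
```

No dot appears in the encoded form; the dots exist only in the way humans conventionally write the name.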
Internationalized Domain Names for Applications (IDNA), discussed
later in this document, is the standard that defines the use and
coding of internationalized domain names for use on the public
Internet [<a href="./rfc5890" title=""Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework"">RFC5890</a>]. An earlier version of IDNA [<a href="./rfc3490" title=""Internationalizing Domain Names in Applications (IDNA)"">RFC3490</a>] is now
being phased out. Except where noted, the two versions are
approximately the same with regard to the issues discussed in this
document. However, some explanations appeared in the earlier
documents that were no longer considered useful when the later
revision was created; they are quoted here from the documents in
which they appear. In addition, the terminology of the two versions
differs somewhat; this document reflects the terminology of the
current version.
Unicode [<a href="#ref-Unicode" title=""The Unicode Standard, Version 5.0"">Unicode</a>] is a list of characters (including non-spacing
marks that are used to form some other characters), where each
character is assigned an integer value, called a code point. In
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
simple terms, a Unicode string is a string of integer code point
values in the range 0 to 1,114,111 (10FFFF in base 16). These
integer code points must be encoded using some mechanism before they
can be transmitted in network packets, stored in memory, stored on
disk, etc. Some common ways of encoding these integer code point
values in computer systems include UTF-8, UTF-16, and UTF-32. In
addition to the material below, those forms and the tradeoffs among
them are discussed in Chapter 2 of The Unicode Standard [<a href="#ref-Unicode" title=""The Unicode Standard, Version 5.0"">Unicode</a>].
UTF-8 is a mechanism for encoding a Unicode code point in a variable
number of 8-bit octets, where an ASCII code point is preserved as-is.
Those octets encode a string of integer code point values, which
represent a string of Unicode characters. The authoritative
definition of UTF-8 is in Sections 3.9 and 3.10 of The Unicode
Standard [<a href="#ref-Unicode" title=""The Unicode Standard, Version 5.0"">Unicode</a>], but a description of UTF-8 encoding can also be
found in <a href="./rfc3629">RFC 3629</a> [<a href="./rfc3629" title=""UTF-8, a transformation format of ISO 10646"">RFC3629</a>]. Descriptions and formulae can also be
found in Annex D of ISO/IEC 10646-1 [<a href="#ref-10646" title=""Information Technology - Universal Multiple-octet coded Character Set (UCS)"">10646</a>].
UTF-16 is a mechanism for encoding a Unicode code point in one or two
16-bit integers, described in detail in Sections 3.9 and 3.10 of The
Unicode Standard [<a href="#ref-Unicode" title=""The Unicode Standard, Version 5.0"">Unicode</a>]. A UTF-16 string encodes a string of
integer code point values that represent a string of Unicode
characters.
UTF-32 (formerly UCS-4), also described in Sections 3.9 and 3.10 of
The Unicode Standard [<a href="#ref-Unicode" title=""The Unicode Standard, Version 5.0"">Unicode</a>], is a mechanism for encoding a Unicode
code point in a single 32-bit integer. A UTF-32 string is thus a
string of 32-bit integer code point values, which represent a string
of Unicode characters.
Note that UTF-16 results in some all-zero octets when code points
occur early in the Unicode sequence, and UTF-32 always has all-zero
octets.
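
The difference among the three encoding forms can be seen in a short Python sketch. The sample string and variable names are chosen here purely for illustration; big-endian byte order is an arbitrary choice.

```python
# Three code points: U+0061, U+00E9, and U+10348 (outside the 16-bit range).
text = "a\u00e9\U00010348"
code_points = [ord(ch) for ch in text]

utf8 = text.encode("utf-8")       # 1 + 2 + 4 octets
utf16 = text.encode("utf-16-be")  # 2 + 2 + 4 octets (one surrogate pair)
utf32 = text.encode("utf-32-be")  # 4 octets per code point

for form, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    # Report the length and whether any all-zero octet occurs.
    print(form, len(data), data.hex(), "zero octet:", 0 in data)
```

In this sample, the UTF-8 form contains no all-zero octet, while the UTF-16 and UTF-32 forms both do.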
IDNA specifies validity of a label, such as what characters it can
contain, relationships among them, and so on, in Unicode terms.
Valid labels can be in either "U-label" or "A-label" form, with the
appropriate one determined by particular protocols or by context.
U-label form is a direct representation of the Unicode characters
using one of the encoding forms discussed above. This document
discusses UTF-8 strings in many places. While all U-labels can be
represented by UTF-8 strings, not all UTF-8 strings are valid
U-labels (see <a href="./rfc5890#section-2.3.2">Section 2.3.2</a> of the IDNA Definitions document
[<a href="./rfc5890" title=""Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework"">RFC5890</a>] for a discussion of these distinctions). A-label form uses
a compressed, ASCII-compatible encoding (an "ACE" in IDNA and other
terminology) produced by an algorithm called Punycode. U-labels and
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
A-labels are duals of each other: transformations from one to the
other do not lose information. The transformation mechanisms are
specified in the IDNA Protocol document [<a href="./rfc5891" title=""Internationalized Domain Names in Applications (IDNA): Protocol"">RFC5891</a>].
Punycode [<a href="./rfc3492" title=""Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)"">RFC3492</a>] is thus a mechanism for encoding a Unicode string
in an ASCII-compatible encoding, i.e., using only letters, digits,
and hyphens from the ASCII character set. When a Unicode label that
is valid under the IDNA rules (a U-label) is encoded with Punycode
for IDNA purposes, it is prefixed with "xn--"; the result is called
an A-label. The prefix convention assumes that no other DNS labels
(at least no other DNS labels in IDNA-aware applications) are allowed
to start with these four characters. Consequently, when A-label
encoding is assumed, any DNS labels beginning with "xn--" now have a
different meaning (the Punycode encoding of a label containing one or
more non-ASCII characters) or no defined meaning at all (in the case
of labels that are not IDNA-compliant, i.e., are not well-formed
A-labels).
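
A minimal sketch of the prefix convention, using the "punycode" codec in the Python standard library, is shown below. The helper names to_a_label and to_u_label are hypothetical, and the sketch deliberately omits the validity checks that IDNA requires of a U-label; it shows only the Punycode step and the "xn--" prefix.

```python
ACE_PREFIX = "xn--"


def to_a_label(label: str) -> str:
    """Punycode-encode a non-ASCII label and add the ACE prefix."""
    if label.isascii():
        return label   # plain ASCII labels are left unchanged
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_u_label(label: str) -> str:
    """Reverse to_a_label for labels that carry the ACE prefix."""
    if label[:4].lower() != ACE_PREFIX:
        return label
    return label[4:].encode("ascii").decode("punycode")


a_label = to_a_label("b\u00fccher")
print(a_label)                # xn--bcher-kva
print(to_u_label(a_label) == "b\u00fccher")
```

The round trip illustrates that, for a valid label, the two forms carry the same information.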
ISO-2022-JP [<a href="./rfc1468" title=""Japanese Character Encoding for Internet Messages"">RFC1468</a>] is a mechanism for encoding a string of ASCII
and Japanese characters, where an ASCII character is preserved as-is.
ISO-2022-JP is stateful: special sequences are used to switch between
character coding tables. As a result, if there are lost or mangled
characters in a character stream, it is extremely difficult to
recover the original stream after such a lost character encoding
shift.
Comparison of Unicode strings is not as easy as comparing ASCII
strings. First, there are a multitude of ways to represent a string
of Unicode characters. Second, in many languages and scripts, the
actual definition of "same" is very context-dependent. Because of
this, comparison of two Unicode strings must take into account how
the Unicode strings are encoded. Regardless of the encoding,
however, comparison cannot simply be done by comparing the encoded
Unicode strings byte by byte. The only time that is possible is when
the strings are both mapped into some canonical form and encoded the
same way.
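
The following Python sketch illustrates the point with a single character that has two Unicode representations. Normalization Form C is used here only as one example of a canonical form; it is an assumption of this sketch and not a statement of what any particular protocol requires.

```python
import unicodedata

composed = "\u00e9"       # one code point: U+00E9
decomposed = "e\u0301"    # two code points: U+0065 followed by U+0301


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after mapping both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Byte-by-byte comparison of the encoded strings reports a difference...
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
# ...while comparison after mapping to a common form does not.
print(same_after_nfc(composed, decomposed))
```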
In 1996 the IAB sponsored a workshop on character sets and encodings
[<a href="./rfc2130" title=""The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996"">RFC2130</a>]. This document adds to that discussion and focuses on the
importance of agreeing on a single encoding and how complicated the
state of affairs ends up being as a result of using different
encodings today.
Different applications, APIs, and protocols use different encoding
schemes today. Many of them were originally defined to use only
ASCII. Internationalized Domain Names for Applications (IDNA)
[<a href="./rfc5890" title=""Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework"">RFC5890</a>] defines a mechanism that requires changes to applications,
but in an attempt not to change APIs or servers, specifies that the
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
A-label format is to be used in many contexts. In some ways this
could be seen as not changing the existing APIs, in the sense that
the strings being passed to and from the APIs are still apparently
ASCII strings. In other ways it is a very profound change to the
existing APIs, because while those strings are still syntactically
valid ASCII strings, they no longer mean the same thing that they
used to. What looks like a plain ASCII string to one piece of
software or library could be seen by another piece of software or
library (with the application of out-of-band information) to be in
fact an encoding of a Unicode string.
<a href="./rfc3490#section-1.3">Section 1.3</a> of the original IDNA specification [<a href="./rfc3490" title=""Internationalizing Domain Names in Applications (IDNA)"">RFC3490</a>] states:
The IDNA protocol is contained completely within applications. It
is not a client-server or peer-to-peer protocol: everything is
done inside the application itself. When used with a DNS resolver
library, IDNA is inserted as a "shim" between the application and
the resolver library. When used for writing names into a DNS
zone, IDNA is used just before the name is committed to the zone.
Figure 1 depicts a simplistic architecture that a naive reader might
assume from the paragraph quoted above. (A variant of this same
picture appears in <a href="./rfc3490#section-6">Section 6</a> of the original IDNA specification
[<a href="./rfc3490" title=""Internationalizing Domain Names in Applications (IDNA)"">RFC3490</a>], further strengthening this assumption.)
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
+-----------------------------------------+
|Host |
| +-------------+ |
| | Application | |
| +------+------+ |
| | |
| +----+----+ |
| | DNS | |
| | Resolver| |
| | Library | |
| +----+----+ |
| | |
+-----------------------------------------+
|
_________|_________
/ \
/ \
/ \
| Internet |
\ /
\ /
\___________________/
Simplistic Architecture
Figure 1
There are, however, two problems with this simplistic architecture
that cause it to differ from reality.
First, resolver APIs on Operating Systems (OSs) today (Mac OS,
Windows, Linux, etc.) are not DNS-specific. They typically provide a
layer of indirection so that the application can work independent of
the name resolution mechanism, which could be DNS, mDNS
[<a href="#ref-DNS-MULTICAST" title=""Multicast DNS"">DNS-MULTICAST</a>], LLMNR [<a href="./rfc4795" title=""Link-local Multicast Name Resolution (LLMNR)"">RFC4795</a>], NetBIOS-over-TCP
[<a href="./rfc1001" title=""Protocol standard for a NetBIOS service on a TCP/UDP transport: Concepts and methods"">RFC1001</a>][RFC1002], hosts table [<a href="./rfc0952" title=""DoD Internet host table specification"">RFC0952</a>], NIS [<a href="#ref-NIS" title=""System and Network Administration"">NIS</a>], or anything
else. For example, "Basic Socket Interface Extensions for IPv6"
[<a href="./rfc3493" title=""Basic Socket Interface Extensions for IPv6"">RFC3493</a>] specifies the getaddrinfo() API and contains many phrases
like "For example, when using the DNS" and "any type of name
resolution service (for example, the DNS)". Importantly, DNS is
mentioned only as an example, and the application has no knowledge as
to whether DNS or some other protocol will be used.
Second, even with the DNS protocol, private namespaces (sometimes
including private uses of the DNS) do not necessarily use the same
character set encoding scheme as the public Internet namespace.
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-7" ></span>
We will discuss each of the above issues in subsequent sections. For
reference, Figure 2 depicts a more realistic architecture on typical
hosts today (which don't have IDNA inserted as a shim immediately
above the DNS resolver library). More generally, the host may be
attached to one or more local networks, each of which may or may not
be connected to the public Internet and may or may not have a private
namespace.
+-----------------------------------------+
|Host |
| +-------------+ |
| | Application | |
| +------+------+ |
| | |
| +------+------+ |
| | Generic | |
| | Name | |
| | Resolution | |
| | API | |
| +------+------+ |
| | |
| +-----+------+---+--+-------+-----+ |
| | | | | | | |
| +-+-++--+--++--+-++---+---++--+--++-+-+ |
| |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
| +---++-----++----++-------++-----++---+ |
| |
+-----------------------------------------+
|
______|______
/ \
/ \
/ local \
\ network /
\ /
\_____________/
|
_________|_________
/ \
/ \
/ \
| Internet |
\ /
\ /
\___________________/
Realistic Architecture
Figure 2
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-8" ></span>
<span class="h3"><a class="selflink" id="section-1.1" href="#section-1.1">1.1</a>. APIs</span>
<a href="./rfc3490#section-6.2">Section 6.2</a> of the original IDNA specification [<a href="./rfc3490" title=""Internationalizing Domain Names in Applications (IDNA)"">RFC3490</a>] states
(where ToASCII and ToUnicode below refer to conversions using the
Punycode algorithm):
It is expected that new versions of the resolver libraries in the
future will be able to accept domain names in other charsets than
ASCII, and application developers might one day pass not only
domain names in Unicode, but also in local script to a new API for
the resolver libraries in the operating system. Thus the ToASCII
and ToUnicode operations might be performed inside these new
versions of the resolver libraries.
Resolver APIs such as getaddrinfo() and its predecessor
gethostbyname() were defined to accept C-Language "char *" arguments,
meaning they accept a string of bytes, terminated with a NUL (0)
byte. Because of the use of a NUL octet as a string terminator,
this is sufficient for ASCII strings (including A-labels) and even
ISO-2022-JP [<a href="./rfc1468" title=""Japanese Character Encoding for Internet Messages"">RFC1468</a>] and UTF-8 strings (unless an implementation
artificially precludes them), but not UTF-16 or UTF-32 strings
because a NUL octet could appear in the middle of strings using
these encodings. Several operating systems historically used in
Japan will accept (and expect) ISO-2022-JP strings in such APIs.
Some platforms used worldwide also have new versions of the APIs
(e.g., GetAddrInfoW() on Windows) that accept other encoding schemes
such as UTF-16.
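
The effect of the zero-octet terminator can be simulated in a few lines of Python. The helper as_c_string is hypothetical; it merely imitates what a C routine reading a "char *" argument would see, namely the octets before the first zero octet.

```python
def as_c_string(buffer: bytes) -> bytes:
    """Return the octets a C string routine would see in this buffer."""
    return buffer.split(b"\x00", 1)[0]


name = "example.com"

# UTF-8 passes through a "char *" interface intact.
print(as_c_string(name.encode("utf-8") + b"\x00"))
# UTF-16 and UTF-32 are cut short at the first embedded zero octet.
print(as_c_string(name.encode("utf-16-le") + b"\x00\x00"))
print(as_c_string(name.encode("utf-32-le") + b"\x00\x00\x00\x00"))
```

With little-endian byte order, only the first letter survives in the UTF-16 and UTF-32 cases; with big-endian byte order, nothing would survive at all.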
It is worth noting that an API using C-Language "char *" arguments
can distinguish between conventional ASCII "hostname" labels,
A-labels, ISO-2022-JP, and UTF-8 labels in names if the coding is
known to be one of those four, and the label is intact (no lost or
mangled characters). If a stateful encoding like ISO-2022-JP is
used, applications extracting labels from text must take special
precautions to be sure that the appropriate state-setting characters
are included in the string passed to the API.
An example method for distinguishing among such codings is as
follows:
o if the label contains an ESC (0x1B) byte, the label is
ISO-2022-JP; otherwise,
o if any byte in the label has the high bit set, the label is UTF-8;
otherwise,
o if the label starts with "xn--", then it is presumed to be an
A-label; otherwise,
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-9" ></span>
o the label is ASCII (and therefore, by definition, the label is
also UTF-8, since ASCII is a subset of UTF-8).
Again, this assumes that ASCII labels never start with "xn--", and
also that UTF-8 strings never contain an ESC character. Also, the
above is merely an illustration; UTF-8 can be detected and
distinguished from other 8-bit encodings with good accuracy [<a href="#ref-MJD" title=""The Properties and Promizes of UTF-8"">MJD</a>].
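
The example method above might be sketched in Python as follows. The function name classify_label is hypothetical, and treating the "xn--" prefix case-insensitively is an assumption of this sketch. As with the method itself, the result is only meaningful when the coding is already known to be one of the four.

```python
def classify_label(label: bytes) -> str:
    """Guess the coding of a label known to use one of four codings."""
    if b"\x1b" in label:                      # ESC octet present
        return "ISO-2022-JP"
    if any(octet >= 0x80 for octet in label):  # high bit set somewhere
        return "UTF-8"
    if label[:4].lower() == b"xn--":           # ACE prefix
        return "A-label"
    return "ASCII"


print(classify_label("\u65e5\u672c".encode("iso2022_jp")))
print(classify_label("b\u00fccher".encode("utf-8")))
print(classify_label(b"xn--bcher-kva"))
print(classify_label(b"example"))
```

The order of the tests matters: ISO-2022-JP octets all have the high bit clear, so the ESC test has to come first.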
It is more difficult or impossible to distinguish the ISO 8859
character sets [<a href="#ref-ISO8859" title=""Information technology -- 8-bit single-byte coded graphic character sets"">ISO8859</a>] from each other, because they differ in up
to about 90 characters that have exactly the same encodings, and a
short string is very unlikely to contain enough characters to allow a
receiver to deduce the character set. Similarly, it is not possible
in general to distinguish between ISO-2022-JP and any other encoding
based on ISO 2022 code table switching.
Although it is possible (as in the example above) to distinguish some
encodings when not explicitly specified, it is cleaner to have the
encodings specified explicitly, such as specifying UTF-16 for
GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8
strings.
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a>. Use of Non-DNS Protocols</span>
As noted earlier, typical name resolution libraries are not
DNS-specific. Furthermore, some protocols are defined to use
encoding forms other than IDNA A-labels. For example, mDNS
[<a href="#ref-DNS-MULTICAST" title=""Multicast DNS"">DNS-MULTICAST</a>] specifies that UTF-8 be used. Indeed, the IETF
policy on character sets and languages [<a href="./rfc2277" title=""IETF Policy on Character Sets and Languages"">RFC2277</a>] (which followed the
1996 IAB-sponsored workshop [<a href="./rfc2130" title=""The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996"">RFC2130</a>]) states:
Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8
character encoding scheme, as defined in [<a href="#ref-10646" title=""Information Technology - Universal Multiple-octet coded Character Set (UCS)"">10646</a>] Annex R
(published in Amendment 2), for all text.
Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16,
but lack of an ability to use UTF-8 is a violation of this policy;
such a violation would need a variance procedure ([BCP9] <a href="#section-9">section</a>
<a href="#section-9">9</a>) with clear and solid justification in the protocol
specification document before being entered into or advanced upon
the standards track.
For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default
other than UTF-8, may be a requirement. This is acceptable, but
UTF-8 support MUST be possible.
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-10" ></span>
Applications that convert an IDN to A-label form before calling
getaddrinfo() will encounter name resolution failures if the Punycode
name is directly used in such protocols. Having libraries or
protocols to convert from A-labels to the encoding scheme defined by
the protocol (e.g., UTF-8) would require changes to APIs and/or
servers, which IDNA was intended to avoid.
As a result, applications that assume that non-ASCII names are
resolved using the public DNS and blindly convert them to A-labels
without knowledge of what protocol will be selected by the name
resolution library have problems. Furthermore, name resolution
libraries often try multiple protocols until one succeeds, because
they are defined to use a common namespace. For example, the hosts
file [<a href="./rfc0952" title=""DoD Internet host table specification"">RFC0952</a>], NetBIOS-over-TCP [<a href="./rfc1001" title=""Protocol standard for a NetBIOS service on a TCP/UDP transport: Concepts and methods"">RFC1001</a>], and DNS [<a href="./rfc1034" title=""Domain names - concepts and facilities"">RFC1034</a>], are
all defined to be able to share a common syntax. This means that
when an application passes a name to be resolved, resolution may in
fact be attempted using multiple protocols, each with a potentially
different encoding scheme. For this to work successfully, the name
must be converted to the appropriate encoding scheme only after the
choice is made to use that protocol. In general, this cannot be done
by the application since the choice of protocol is not made by the
application.
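
The ordering described above, in which the encoding is chosen only after the protocol is chosen, can be sketched as follows. Everything here is hypothetical: the resolve function, the ENCODERS table, and the lookup callables stand in for a real name resolution library. The DNS encoder uses Python's built-in "idna" codec, which implements the older RFC 3490 rules.

```python
from typing import Callable, Dict, List, Optional, Tuple

Lookup = Callable[[bytes], Optional[str]]

# Per-protocol encoders (illustrative): A-labels for DNS, UTF-8 for mDNS.
ENCODERS: Dict[str, Callable[[str], bytes]] = {
    "dns": lambda name: name.encode("idna"),
    "mdns": lambda name: name.encode("utf-8"),
}


def resolve(name: str, lookups: List[Tuple[str, Lookup]]) -> Optional[str]:
    """Try each protocol in turn, encoding the name for that protocol."""
    for protocol, lookup in lookups:
        encoded = ENCODERS[protocol](name)   # encode after protocol choice
        address = lookup(encoded)
        if address is not None:
            return address
    return None


# Stand-in name tables keyed by each protocol's own encoding.
mdns_table = {"b\u00fccher.local".encode("utf-8"): "192.0.2.1"}
dns_table = {b"xn--bcher-kva.example": "192.0.2.2"}
lookups = [("mdns", mdns_table.get), ("dns", dns_table.get)]

print(resolve("b\u00fccher.local", lookups))
print(resolve("b\u00fccher.example", lookups))
```

Had the caller converted the name to A-label form before calling resolve, the mDNS lookup in this sketch could not have matched its UTF-8 entry.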
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a>. Use of Non-ASCII in DNS</span>
A common misconception is that DNS only supports names that can be
expressed using letters, digits, and hyphens.
This misconception originally stems from the 1985 definition of an
"Internet hostname" (and net, gateway, and domain name) for use in
the "hosts" file [<a href="./rfc0952" title=""DoD Internet host table specification"">RFC0952</a>]. An Internet hostname was defined therein
as including only letters, digits, and hyphens, where uppercase and
lowercase letters were to be treated as identical. The DNS
specification <a href="./rfc1034#section-3.5">[RFC1034], Section 3.5</a> entitled "Preferred name syntax"
then repeated this definition in 1987, saying that this "syntax will
result in fewer problems with many applications that use domain names
(e.g., mail, TELNET)".
This left confusion as to whether the "preferred" name syntax was a
mandatory restriction in DNS, or merely "preferred".
The definition of an Internet hostname was updated in 1989
(<a href="./rfc1123#section-2.1">[RFC1123], Section 2.1</a>) to allow names starting with a digit.
However, it did not address the increasing confusion as to whether
all names in DNS are "hostnames", or whether a "hostname" is merely a
special case of a DNS name.
<span class="grey">Thaler, et al. Informational [Page 10]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-11" ></span>
<span class="grey"><a href="./rfc6055">RFC 6055</a> IDN Encodings February 2011</span>
By 1997, things had progressed to a state where it was necessary to
clarify these areas of confusion. "Clarifications to the DNS
Specification" <a href="./rfc2181#section-11">[RFC2181], Section 11</a> states:
The DNS itself places only one restriction on the particular
labels that can be used to identify resource records. That one
restriction relates to the length of the label and the full name.
The length of any one label is limited to between 1 and 63 octets.
A full domain name is limited to 255 octets (including the
separators). The zero length full name is defined as representing
the root of the DNS tree, and is typically written and displayed
as ".". Those restrictions aside, any binary string whatever can
be used as the label of any resource record. Similarly, any
binary string can serve as the value of any record that includes a
domain name as some or all of its value (SOA, NS, MX, PTR, CNAME,
and any others that may be added). Implementations of the DNS
protocols must not place any restrictions on the labels that can
be used.
Hence, it clarified that the restriction to letters, digits, and
hyphens does not apply to DNS names in general, nor to records that
include "domain names". Hence, the "preferred" name syntax described
in the original DNS specification [<a href="./rfc1034" title=""Domain names - concepts and facilities"">RFC1034</a>] is indeed merely
"preferred", not mandatory.
Since there is no restriction even to ASCII, let alone letter-digit-
hyphen use, DNS does not violate the subsequent IETF requirement to
allow UTF-8 [<a href="./rfc2277" title=""IETF Policy on Character Sets and Languages"">RFC2277</a>].
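The length limits quoted above from the DNS clarifications are thus
the only checks the DNS itself applies to a label. A minimal sketch
of such a check in Python follows. The function name
check_wire_limits is hypothetical, and the sketch assumes that the
255-octet limit is counted on the wire form of the name, with one
length octet per label plus the terminating root label.

```python
def check_wire_limits(labels):
    """Return True if a list of labels (bytes) fits the DNS size limits.

    Assumption: the 255-octet limit is counted on the wire form, where
    each label is preceded by one length octet and the name ends with
    the zero-length root label.
    """
    total = 1  # the terminating root label
    for label in labels:
        if not (63 >= len(label) >= 1):
            return False
        total += 1 + len(label)
    return 255 >= total


# The content of each label is not inspected at all: spaces and
# non-ASCII octets pass, and only the lengths can cause a rejection.
print(check_wire_limits([b"my host", "\u00e9".encode("utf-8"), b"com"]))
print(check_wire_limits([b"a" * 64, b"com"]))
```

Any restriction on which characters may appear would be added by the
application calling such a check, not by the check itself.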
Using UTF-16 or UTF-32 encoding, however, would not be ideal for use
in DNS packets or C-Language "char *" APIs because existing software
already uses ASCII, and UTF-16 and UTF-32 strings can contain
all-zero octets that existing software will interpret as the end of
the string. To use UTF-16 or UTF-32, one would need some way of
knowing whether the string was encoded using ASCII, UTF-16, or
UTF-32, and indeed for UTF-16 or UTF-32 whether it was big-endian or
little-endian encoding. In contrast, UTF-8 works well because any
7-bit ASCII string is also a UTF-8 string representing the same
characters.
If a private namespace is defined to use UTF-8 (and not other
encodings such as UTF-16 or UTF-32), there's no need for a mechanism
to know whether a string was encoded using ASCII or UTF-8, because
(for any string that can be represented using ASCII) the
representations are exactly the same. In other words, for any string
that can be represented using ASCII, it doesn't matter whether it is
interpreted as ASCII or UTF-8 because both encodings are the same,
and for any string that can't be represented using ASCII, it's
obviously UTF-8. In addition, unlike UTF-16 and UTF-32, ASCII and
UTF-8 are both byte-oriented encodings so the question of big-endian
or little-endian encoding doesn't apply.
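The encoding properties described above can be observed directly.
The following Python sketch is only an illustration; the sample
strings are arbitrary choices made for this example.

```python
name = "example"

utf8 = name.encode("utf-8")
utf16_le = name.encode("utf-16-le")
utf16_be = name.encode("utf-16-be")

# For a string representable in ASCII, UTF-8 yields the same octets.
print(utf8 == name.encode("ascii"))

# UTF-16 embeds zero octets, which a C "char *" API would take as the
# end of the string, and its octet order depends on endianness.
print(b"\x00" in utf16_le, utf16_le == utf16_be)

# A non-ASCII character in UTF-8 uses only octets with the high bit
# set, so it can never be mistaken for ASCII or contain a zero octet.
e_acute = "\u00e9".encode("utf-8")
print(all(octet >= 0x80 for octet in e_acute))
```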
While implementations of the DNS protocol must not place any
restrictions on the labels that can be used, applications that use
the DNS are free to impose whatever restrictions they like, and many
have. The above rules permit a domain name label that contains
unusual characters, such as embedded spaces, which many applications
consider a bad idea. For example, the original specification
[<a href="./rfc0821" title=""Simple Mail Transfer Protocol"">RFC0821</a>] of the SMTP protocol [<a href="./rfc5321" title=""Simple Mail Transfer Protocol"">RFC5321</a>] constrains the character set
usable in email addresses. There is now an effort underway to define
an extension to SMTP to support internationalized email addresses and
headers. See the EAI framework [<a href="./rfc4952" title=""Overview and Framework for Internationalized Email"">RFC4952</a>] for more discussion on this
topic.
Shortly after the DNS Clarifications [<a href="./rfc2181" title=""Clarifications to the DNS Specification"">RFC2181</a>] and IETF character
sets and languages policy [<a href="./rfc2277" title=""IETF Policy on Character Sets and Languages"">RFC2277</a>] were published, the need for
internationalized names within private namespaces (i.e., within
enterprises) arose. The current (and past, predating IDNA and the
prefixed ACE conventions) practice within enterprises that support
other languages is to put UTF-8 names in their internal DNS servers
in a private namespace. For example, "Using the UTF-8 Character Set
in the Domain Name System" [<a href="#ref-UTF8-DNS" title=""Using the UTF-8 Character Set in the Domain Name System"">UTF8-DNS</a>] was first written in 1997, and
was then widely deployed in Windows. The use of UTF-8 names in DNS
was similarly implemented and deployed in Mac OS, simply by virtue of
the fact that applications blindly passed UTF-8 strings to the name
resolution APIs, the name resolution APIs blindly passed those UTF-8
strings to the DNS servers, and the DNS servers correctly answered
those queries. From the user's point of view, everything worked
properly without any special new code being written, except that
ASCII is matched case-insensitively whereas UTF-8 is not (although
some enterprise DNS servers reportedly attempt to do case-insensitive
matching on UTF-8 within private namespaces, an action that causes
other problems and violates a subsequent prohibition [<a href="./rfc4343" title=""Domain Name System (DNS) Case Insensitivity Clarification"">RFC4343</a>]).
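The difference in matching behavior can be sketched as follows,
assuming a comparison that folds case for the ASCII letters only and
leaves all other octets untouched. The function names are
hypothetical and do not correspond to any particular DNS server.

```python
def ascii_fold(label):
    """Lowercase only the octets 0x41-0x5A (ASCII 'A'-'Z')."""
    return bytes(b + 0x20 if 0x5A >= b >= 0x41 else b for b in label)


def labels_match(a, b):
    """Compare two labels (bytes), ignoring ASCII case only."""
    return ascii_fold(a) == ascii_fold(b)


upper = "\u00c9COLE".encode("utf-8")   # capital E with acute + "COLE"
lower = "\u00e9cole".encode("utf-8")   # small e with acute + "cole"

print(labels_match(b"Example", b"eXAMPLE"))   # ASCII: case is ignored
print(labels_match(upper, lower))             # UTF-8: case is significant
```

A user who types the uppercase form of a non-ASCII name therefore
does not find a record registered under the lowercase form.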
Within a private namespace, and especially in light of the IETF UTF-8
policy [<a href="./rfc2277" title=""IETF Policy on Character Sets and Languages"">RFC2277</a>], it was reasonable to assume that binary strings
were encoded in UTF-8.
As implied earlier, there are also issues with mapping strings to
some canonical form, independent of the encoding. Such issues are
not discussed in detail in this document. They are discussed to some
extent in, for example, <a href="./rfc5198#section-3">Section 3</a> of "Unicode Format for Network
Interchange" [<a href="./rfc5198" title=""Unicode Format for Network Interchange"">RFC5198</a>], and are left as opportunities for elaboration
in other documents.
A few years after UTF-8 was already in use in private namespaces in
DNS, the strategy of using a reserved prefix and an ASCII-compatible
encoding (ACE) was developed for IDNA. That strategy included the
Punycode algorithm, which began to be developed (during the period
from 2002 [<a href="#ref-IDN-PUNYCODE" title=""Punycode version 0.3.3"">IDN-PUNYCODE</a>] to 2003 [<a href="./rfc3492" title=""Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)"">RFC3492</a>]) for use in the public DNS
namespace. There were a number of reasons for this. One reason the
prefixed ACE strategy was selected for the public DNS namespace was
that other encodings such as ISO 8859-1 were also in use in DNS, and
the various encodings were not necessarily distinguishable from each
other. Another reason had to do with
concerns about whether the details of IDNA, including the use of the
Punycode algorithm, were an adequate solution to the problems that
were posed. If either the Punycode algorithm or fundamental aspects
of character handling were wrong, and had to be changed to something
incompatible, it would be possible to switch to a new prefix or adopt
another model entirely. Only the part of the public DNS namespace
that starts a label with "xn--" would be polluted.
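The role of the reserved prefix can be illustrated with a short
Python sketch. It uses only the raw "punycode" codec from the Python
standard library and omits the mapping and validation steps that
IDNA requires, so it is not an IDNA implementation. The function
names are hypothetical.

```python
ACE_PREFIX = "xn--"


def is_a_label(label):
    """True if an ASCII label carries the reserved ACE prefix."""
    return label.lower().startswith(ACE_PREFIX)


def a_label_to_unicode(label):
    """Decode the Punycode part of an A-label; other labels pass through."""
    if not is_a_label(label):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def unicode_to_a_label(label):
    """Encode a non-ASCII label as prefix + Punycode; ASCII passes through."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


word = "\u043f\u0440\u0438\u043c\u0435\u0440"   # six Cyrillic letters
print(unicode_to_a_label(word))
print(a_label_to_unicode(unicode_to_a_label(word)) == word)
print(a_label_to_unicode("com"))
```

Because only labels that begin with the prefix are ever decoded, a
change of algorithm could be signaled by a change of prefix.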
Today the algorithm is seen as being about as good as it can
realistically be, so moving to a different encoding (UTF-8 as
suggested in this document) that can be viewed as "native" would not
be as risky as it would have been in 2002.
In any case, the publication of Punycode [<a href="./rfc3492" title=""Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)"">RFC3492</a>] and the
dependencies on it in the IDNA Protocol document [<a href="./rfc5891" title=""Internationalized Domain Names in Applications (IDNA): Protocol"">RFC5891</a>] and the
earlier IDNA specification [<a href="./rfc3490" title=""Internationalizing Domain Names in Applications (IDNA)"">RFC3490</a>] thus resulted in having to use
different encodings for different namespaces (where UTF-8 for private
namespaces was already deployed). Hence, referring back to Figure 2,
a different encoding scheme may be in use on the Internet vs. a local
network.
In general, a host may be connected to zero or more networks using
private namespaces, plus potentially the public namespace.
Applications that convert a U-label form IDN to an A-label before
calling getaddrinfo() will incur name resolution failures if the name
is actually registered in a private namespace in some other encoding
(e.g., UTF-8). Having libraries or protocols convert from A-labels
to the encoding used by a private namespace (e.g., UTF-8) would
require changes to APIs and/or servers, which IDNA was intended to
avoid.
Also, a fully-qualified domain name (FQDN) to be resolved may be
obtained directly from an application, or it may be composed by the
DNS resolver itself from a single label obtained from an application
by using a configured suffix search list, and the resulting FQDN may
use multiple encodings in different labels. For more information on
the suffix search list, see <a href="./rfc1536#section-6">Section 6</a> of "Common DNS Implementation
Errors and Suggested Fixes" [<a href="./rfc1536" title=""Common DNS Implementation Errors and Suggested Fixes"">RFC1536</a>], the DHCP Domain Search Option
[<a href="./rfc3397" title=""Dynamic Host Configuration Protocol (DHCP) Domain Search Option"">RFC3397</a>], and <a href="./rfc3646#section-4">Section 4</a> of "DNS Configuration options for DHCPv6"
[<a href="./rfc3646" title=""DNS Configuration options for Dynamic Host Configuration Protocol for IPv6 (DHCPv6)"">RFC3646</a>].
As noted in <a href="./rfc1536#section-6">Section 6</a> of "Common DNS Implementation Errors and
Suggested Fixes" [<a href="./rfc1536" title=""Common DNS Implementation Errors and Suggested Fixes"">RFC1536</a>], the community has had bad experiences
(e.g., security problems [<a href="./rfc1535" title=""A Security Problem and Proposed Correction With Widely Deployed DNS Software"">RFC1535</a>]) with "searching" for domain names
by trying multiple variations or appending different suffixes. Such
searching can yield inconsistent results depending on the order in
which alternatives are tried. Nonetheless, the practice is
widespread and must be considered.
The practice of searching for names, whether by the use of a suffix
search list or by searching in different namespaces, can yield
inconsistent results. For example, even when a suffix search list is
only used when an application provides a name containing no dots, two
clients with different configured suffix search lists can get
different answers, and the same client could get different answers at
different times if it changes its configuration (e.g., when moving to
another network). A deeper discussion of this topic is outside the
scope of this document.
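The dependence on local configuration can be sketched as follows.
This Python fragment assumes, as in the example above, that the
suffix search list is applied only to names containing no dots. The
function name and the suffixes are hypothetical.

```python
def candidate_fqdns(name, search_list):
    """Expand a name the way a stub resolver with a search list might.

    Assumption: the search list is applied only to names with no dots.
    """
    if "." in name:
        return [name]
    return [name + "." + suffix for suffix in search_list]


office = ["corp.example", "example"]
home = ["home.example"]

# The same single-label name expands to different candidates, tried
# in list order, depending on which network configured the client.
print(candidate_fqdns("printer", office))
print(candidate_fqdns("printer", home))
```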
<span class="h3"><a class="selflink" id="section-3.1" href="#section-3.1">3.1</a>. Examples</span>
Some examples of cases that can happen in existing implementations
today (where {non-ASCII} below represents some user-entered non-ASCII
string) are:
o User types in {non-ASCII}.{non-ASCII}.com, and the application
passes it, in the form of a UTF-8 string, to getaddrinfo() or
gethostbyname() or equivalent.
1. The DNS resolver passes the (UTF-8) string unmodified to a DNS
server.
o User types in {non-ASCII}.{non-ASCII}.com, and the application
passes it to a name resolution API that accepts strings in some
other encoding such as UTF-16, e.g., GetAddrInfoW() on Windows.
1. The name resolution API decides to pass the string to DNS (and
possibly other protocols).
2. The DNS resolver converts the name from UTF-16 to UTF-8 and
passes the query to a DNS server.
o User types in {non-ASCII}.{non-ASCII}.com, but the application
first converts it to A-label form such that the name that is
passed to name resolution APIs is (say)
xn--e1afmkfd.xn--80akhbyknj4f.com.
1. The name resolution API decides to pass the string to DNS (and
possibly other protocols).
2. The DNS resolver passes the string unmodified to a DNS server.
3. If the name is not found in DNS, the name resolution API
decides to try another protocol, say mDNS.
4. The query goes out in mDNS, but since mDNS specifies that
names are to be registered in UTF-8, the name isn't found
because it was encoded as an A-label in the query.
o User types in {non-ASCII}, and the application passes it, in the
form of a UTF-8 string, to getaddrinfo() or equivalent.
1. The name resolution API decides to pass the string to DNS (and
possibly other protocols).
2. The DNS resolver will append suffixes in the suffix search
list, which may contain UTF-8 characters if the local network
uses a private namespace.
3. Each FQDN in turn will then be sent in a query to a DNS
server, until one succeeds.
o User types in {non-ASCII}, but the application first converts it
to an A-label, such that the name that is passed to getaddrinfo()
or equivalent is (say) xn--e1afmkfd.
1. The name resolution API decides to pass the string to DNS (and
possibly other protocols).
2. The DNS stub resolver will append suffixes in the suffix
search list, which may contain UTF-8 characters if the local
network uses a private namespace, resulting in (say)
xn--e1afmkfd.{non-ASCII}.com
3. Each FQDN in turn will then be sent in a query to a DNS
server, until one succeeds.
4. Since the private namespace in this case uses UTF-8, the above
queries fail, since the A-label version of the name was not
registered in that namespace.
o User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
{non-ASCII3}.com is a public namespace using IDNA and A-labels,
but {non-ASCII2}.{non-ASCII3}.com is a private namespace using
UTF-8, which is accessible to the user. The application passes
the name, in the form of a UTF-8 string, to getaddrinfo() or
equivalent.
1. The name resolution API decides to pass the string to DNS (and
possibly other protocols).
2. The DNS resolver tries to locate the authoritative server, but
fails the lookup because it cannot find a server for the UTF-8
encoding of {non-ASCII3}.com, even though it would have access
to the private namespace. (To make this work, the private
namespace would need to include the UTF-8 encoding of
{non-ASCII3}.com.)
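The third case in the list above, in which a name already converted
to an A-label is retried over a protocol that registers names in
UTF-8, can be imitated with a small Python sketch. The two
dictionaries are hypothetical stand-ins for namespaces, keyed by the
octets that appear in a query, and the address comes from a
documentation range.

```python
NAME = "\u043f\u0440\u0438\u043c\u0435\u0440"  # same label as xn--e1afmkfd

# Hypothetical namespaces.  The public DNS has no record for the
# name; the mDNS-style namespace registers it in UTF-8.
PUBLIC_DNS = {}
MDNS = {(NAME + ".local").encode("utf-8"): "192.0.2.10"}


def resolve(query_octets):
    """Try each namespace in turn with the octets exactly as given."""
    for namespace in (PUBLIC_DNS, MDNS):
        if query_octets in namespace:
            return namespace[query_octets]
    return None


as_utf8 = (NAME + ".local").encode("utf-8")
as_a_label = b"xn--" + NAME.encode("punycode") + b".local"

print(resolve(as_utf8))      # found by the second protocol tried
print(resolve(as_a_label))   # not found: the A-label was never registered
```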
When users use multiple applications, some of which do A-label
conversion prior to passing a name to name resolution APIs, and some
of which do not, odd behavior can result. At best this violates the
Principle of Least Surprise, and at worst it creates security
vulnerabilities.
First consider two competing applications, such as web browsers, that
are designed to achieve the same task. If the user types the same
name into each browser, one may successfully resolve the name (and
hence access the desired content) because the encoding scheme is
correct, while the other may fail name resolution because the
encoding scheme is incorrect. Hence, the issue can give users an
incentive to switch to another application (which in some cases means
switching to an IDNA application, and in other cases means switching
away from an IDNA application).
Next consider two separate applications where one is designed to be
launched from the other, for example a web browser launching a media
player application when the link to a media file is clicked. If both
types of content (web pages and media files in this example) are
hosted at the same IDN in a private namespace, but one application
converts to A-labels before calling name resolution APIs and the
other does not, the user may be able to access a web page and click
on the media file, causing the media player to launch and attempt to
retrieve the media file, only for the retrieval to fail because the
IDN encoding scheme was incorrect. Or even worse, if an attacker is able
to register the same name in the other encoding scheme, the user may
get the content from the attacker's machine. This is similar to a
normal phishing attack, except that the two names represent exactly
the same Unicode characters.
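The attack described above relies on the fact that the two encodings
of one Unicode name are distinct octet strings, and so can be
registered separately. The following Python sketch illustrates this
with a hypothetical private zone; the addresses are taken from
documentation ranges.

```python
LABEL = "\u043f\u0440\u0438\u043c\u0435\u0440"
utf8_form = LABEL.encode("utf-8")
ace_form = b"xn--" + LABEL.encode("punycode")

# Hypothetical private zone: the legitimate host registered the UTF-8
# form, and someone else later registered the A-label form.
zone = {utf8_form: "192.0.2.10", ace_form: "198.51.100.66"}

browser_answer = zone[utf8_form]   # application that passes UTF-8 through
player_answer = zone[ace_form]     # application that converts to A-labels

# Both keys decode to the same Unicode characters, yet the two
# applications are directed to different machines.
print(ace_form[4:].decode("punycode") == utf8_form.decode("utf-8"))
print(browser_answer, player_answer)
```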
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a>. Recommendations</span>
On many platforms, the name resolution library will automatically use
a variety of protocols to search a variety of namespaces, which might
be using UTF-8 or other encodings. In addition, even when only the
DNS protocol is used, in many operational environments, a private DNS
namespace using UTF-8 is also deployed and is automatically searched
by the name resolution library.
As explained earlier, using multiple canonical formats and multiple
encodings in different protocols, or even in different places in the
same namespace, creates problems. Because of this, and the fact that
both IDNA A-labels and UTF-8 are in use as encoding mechanisms for
domain names today, we make the recommendations described below.
It is inappropriate for an application that calls a general-purpose
name resolution library to convert a name to an A-label unless the
application is absolutely certain that, in all environments where the
application might be used, only the global DNS that uses IDNA
A-labels actually will be used to resolve the name.
Instead, conversion to A-label form, or any other special encoding
required by a particular name-lookup protocol, should be done only by
an entity that knows which protocol will be used (e.g., the DNS
resolver, or getaddrinfo() upon deciding to pass the name to DNS),
rather than by general applications that call protocol-independent
name resolution APIs. (Of course, applications that store strings
internally in a format different from that required by those APIs
need to convert strings from their own internal format to the format
required by the API.) Similarly, even if an application can know
that DNS is to be used, the conversion to A-labels should be done
only by an entity that knows which part of the DNS namespace will be
used.
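One way to picture this division of labor is a resolver-side routine
that receives the name as Unicode and chooses the encoding only after
the protocol is known. The Python sketch below is an illustration
under stated assumptions: the protocol identifiers are hypothetical,
and the A-label conversion uses the raw "punycode" codec without the
mapping and validation steps that IDNA requires.

```python
def to_a_label(label):
    """Return one label as ASCII octets, adding the ACE prefix if needed."""
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")


def encode_for_protocol(name, protocol):
    """Encode a Unicode name only once the protocol is known.

    The protocol identifiers are hypothetical labels for this sketch.
    """
    if protocol == "public-dns":
        return b".".join(to_a_label(l) for l in name.split("."))
    if protocol in ("mdns", "private-dns-utf8"):
        return name.encode("utf-8")
    raise ValueError("unknown protocol: " + protocol)


name = "\u043f\u0440\u0438\u043c\u0435\u0440.com"
print(encode_for_protocol(name, "public-dns"))
print(encode_for_protocol(name, "mdns") == name.encode("utf-8"))
```

The application passes the same Unicode string in every case and
never needs to know which branch was taken.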
That is, a more intelligent DNS resolver would be more liberal in
what it would accept from an application and be able to query for
both a name in A-label form (e.g., over the Internet) and a UTF-8
name (e.g., over a corporate network with a private namespace) in
case the server only recognizes one. However, we might also take
into account that the various resolution behaviors discussed earlier
could also occur with record updates (e.g., with Dynamic Update
[<a href="./rfc2136" title=""Dynamic Updates in the Domain Name System (DNS UPDATE)"">RFC2136</a>]), resulting in some names being registered in a local
network's private namespace by applications doing conversion to
A-labels, and other names being registered using UTF-8. Hence, a
name might have to be queried with both encodings to be sure to
succeed without changes to DNS servers.
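A resolver that queries with both encodings first needs to derive
both forms from whatever the application supplied. A minimal Python
sketch follows; the function name is hypothetical, and the
conversion again uses the raw "punycode" codec rather than a full
IDNA implementation.

```python
def query_forms(name):
    """Return the distinct octet forms worth querying for a name.

    Accepts either U-labels or A-labels in the input.
    """
    unicode_labels = []
    for label in name.split("."):
        if label.lower().startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        unicode_labels.append(label)

    ace_labels = [
        l if l.isascii() else "xn--" + l.encode("punycode").decode("ascii")
        for l in unicode_labels
    ]
    forms = [".".join(ace_labels).encode("ascii"),
             ".".join(unicode_labels).encode("utf-8")]
    # An all-ASCII name yields the same octets twice; send it once.
    return list(dict.fromkeys(forms))


print(query_forms("xn--e1afmkfd.com"))
print(query_forms("example.com"))
```

The order in which the two forms are tried is a policy choice that
this sketch does not address.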
Similarly, a more intelligent stub resolver would also be more
liberal in what it would accept from a response as the value of a
record (e.g., PTR) in that it would accept either UTF-8 (U-labels in
the case of IDNA) or A-labels and convert them to whatever encoding
is used by the application APIs to return strings to applications.
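The liberal handling of responses can be sketched in the same
spirit. The Python function below is hypothetical; it assumes that
non-ASCII octets in a response are UTF-8 and that the application
API's encoding is given as a Python codec name.

```python
def response_name_for_api(raw, api_encoding):
    """Accept UTF-8 or A-labels in a response; return API-encoded bytes.

    Assumptions: non-ASCII octets in the response are UTF-8, and
    api_encoding is a Python codec name such as "utf-8" or "utf-16-le".
    """
    labels = []
    for label in raw.decode("utf-8").split("."):
        if label.lower().startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels).encode(api_encoding)


from_ace = response_name_for_api(b"xn--e1afmkfd.com", "utf-16-le")
from_utf8 = response_name_for_api(
    "\u043f\u0440\u0438\u043c\u0435\u0440.com".encode("utf-8"), "utf-16-le")

# Either form of the record value reaches the application identically.
print(from_ace == from_utf8)
```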
Indeed the choice of conversion within the resolver libraries is
consistent with the quote from <a href="./rfc3490#section-6.2">Section 6.2</a> of the original IDNA
specification [<a href="./rfc3490" title=""Internationalizing Domain Names in Applications (IDNA)"">RFC3490</a>] stating that conversion using the Punycode
algorithm (i.e., to A-labels) "might be performed inside these new
versions of the resolver libraries".
That said, some application-layer protocols (e.g., EPP Domain Name
Mapping [<a href="./rfc5731" title=""Extensible Provisioning Protocol (EPP) Domain Name Mapping"">RFC5731</a>]) are defined to use A-labels rather than simply
using UTF-8 as recommended by the IETF character sets and languages
policy [<a href="./rfc2277" title=""IETF Policy on Character Sets and Languages"">RFC2277</a>]. In this case, an application may receive a string
containing A-labels and want to pass it to name resolution APIs.
Again the recommendation that a resolver library be more liberal in
what it would accept from an application would mean that such a name
would be accepted and re-encoded as needed, rather than requiring the
application to do so.
It is important that any APIs used by applications to pass names
specify what encoding(s) the API uses. For example, GetAddrInfoW()
on Windows specifies that it accepts UTF-16 and only UTF-16. In
contrast, the original specification of getaddrinfo() [<a href="./rfc3493" title=""Basic Socket Interface Extensions for IPv6"">RFC3493</a>] does
not, and hence platforms vary in what they use (e.g., Mac OS uses
UTF-8 whereas Windows uses Windows code pages).
Finally, the question remains about what, if anything, a DNS server
should do to handle cases where some existing applications or hosts
do IDNA queries using A-labels within the local network using a
private namespace, and other existing applications or hosts send
UTF-8 queries. It is undesirable to store different records for
different encodings of the same name, since this introduces the
possibility for inconsistency between them. Instead, a new DNS
server serving a private namespace using UTF-8 could potentially
treat encoding conversion in the same way as case-insensitive
comparison, which a DNS server is already required to do, as long as
the DNS server has some way to know what the encoding is. Two encodings
are, in this sense, two representations of the same name, just as two
case-different strings are. However, whereas case comparison of
non-ASCII characters is complicated by ambiguities (as explained in
the IAB's Review and Recommendations for Internationalized Domain
Names [<a href="./rfc4690" title=""Review and Recommendations for Internationalized Domain Names (IDNs)"">RFC4690</a>]), encoding conversion between A-labels and U-labels
is unambiguous.
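A server following this approach might store each record once, under
a single canonical key, and convert the name in every query to that
key before lookup. The Python sketch below is a hypothetical
in-memory illustration, not a description of any existing server. It
assumes that non-ASCII query octets are UTF-8, and it omits the
ASCII case folding that a real server would also apply.

```python
def canonical(query):
    """Map either encoding of a name to one lookup key (Unicode text)."""
    labels = []
    for label in query.decode("utf-8").split("."):
        if label.lower().startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


class Zone:
    """Hypothetical in-memory zone storing one record per name."""

    def __init__(self):
        self._records = {}

    def add(self, name, address):
        self._records[canonical(name)] = address

    def lookup(self, query):
        return self._records.get(canonical(query))


zone = Zone()
zone.add("\u043f\u0440\u0438\u043c\u0435\u0440.example".encode("utf-8"),
         "192.0.2.10")

# A query in either encoding finds the single stored record.
print(zone.lookup(b"xn--e1afmkfd.example"))
print(zone.lookup("\u043f\u0440\u0438\u043c\u0435\u0440.example".encode("utf-8")))
```

Because only one record exists per name, the two encodings cannot
drift out of step with each other.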
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a>. Security Considerations</span>
Having applications convert names to prefixed ACE format (A-labels)
before calling name resolution can result in security
vulnerabilities. If the name is resolved by protocols or in zones
for which records are registered using other encoding schemes, an
attacker can claim the A-label version of the same name and hence
trick the victim into accessing a different destination. This can be
done for any non-ASCII name, even when there is no possible confusion
due to case, language, or other issues. Other types of confusion
beyond those resulting simply from the choice of encoding scheme are
discussed in "Review and Recommendations for IDNs" [<a href="./rfc4690" title=""Review and Recommendations for Internationalized Domain Names (IDNs)"">RFC4690</a>].
Designers and users of encodings that represent Unicode strings in
terms of ASCII should also consider whether trademark protection or
phishing are issues, e.g., if one name would be encoded in a way that
would be naturally associated with another organization or product.
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a>. Acknowledgements</span>
The authors wish to thank Patrik Faltstrom, Martin Duerst, JFC
Morfin, Ran Atkinson, S. Moonesamy, Paul Hoffman, and Stephane
Bortzmeyer for their careful review and helpful suggestions. It is
also interesting to note that none of the first three individuals'
names above can be spelled out and written correctly in ASCII text.
Furthermore, one of the IAB members' names below (Andrei Robachevsky)
cannot be written in the script as it appears on his birth
certificate.
<span class="h2"><a class="selflink" id="section-7" href="#section-7">7</a>. IAB Members at the Time of Approval</span>
Bernard Aboba
Marcelo Bagnulo
Ross Callon
Spencer Dawkins
Vijay Gill
Russ Housley
John Klensin
Olaf Kolkman
Danny McPherson
Jon Peterson
Andrei Robachevsky
Dave Thaler
Hannes Tschofenig
<span class="h2"><a class="selflink" id="section-8" href="#section-8">8</a>. References</span>
<span class="h3"><a class="selflink" id="section-8.1" href="#section-8.1">8.1</a>. Normative References</span>
[<a id="ref-10646">10646</a>] International Organization for Standardization,
"Information Technology - Universal Multiple-octet
coded Character Set (UCS)".
ISO/IEC Standard 10646, comprised of ISO/IEC 10646-
1:2000, "Information technology -- Universal
Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane", ISO/IEC
10646-2:2001, "Information technology -- Universal
Multiple-Octet Coded Character Set (UCS) -- Part 2:
Supplementary Planes" and ISO/IEC 10646- 1:2000/Amd
1:2002, "Mathematical symbols and other characters".
[<a id="ref-Unicode">Unicode</a>] The Unicode Consortium. The Unicode Standard,
Version 5.1.0, defined by: "The Unicode Standard,
Version 5.0", Boston, MA, Addison-Wesley, 2007, ISBN
0-321-48091-0, as amended by Unicode 5.1.0
(<a href="http://www.unicode.org/versions/Unicode5.1.0/">http://www.unicode.org/versions/Unicode5.1.0/</a>).
<span class="h3"><a class="selflink" id="section-8.2" href="#section-8.2">8.2</a>. Informative References</span>
[<a id="ref-DNS-MULTICAST">DNS-MULTICAST</a>] Cheshire, S. and M. Krochmal, <a style="text-decoration: none" href='https://www.google.com/search?sitesearch=datatracker.ietf.org%2Fdoc%2Fhtml%2F&q=inurl:draft-+%22Multicast+DNS%22'>"Multicast DNS"</a>, Work
in Progress, February 2011.
[<a id="ref-IDN-PUNYCODE">IDN-PUNYCODE</a>] Costello, A., <a style="text-decoration: none" href='https://www.google.com/search?sitesearch=datatracker.ietf.org%2Fdoc%2Fhtml%2F&q=inurl:draft-+%22Punycode+version+0.3.3%22'>"Punycode version 0.3.3"</a>, Work
in Progress, January 2002.
[<a id="ref-ISO8859">ISO8859</a>] International Organization for Standardization,
"Information technology -- 8-bit single-byte coded
graphic character sets".
ISO/IEC Standard 8859, comprised of ISO/IEC 8859-
1:1998, Part 1: Latin alphabet No. 1 - ISO/IEC 8859-
2:1999, Part 2: Latin alphabet No. 2 - ISO/IEC 8859-
3:1999, Part 3: Latin alphabet No. 3 - ISO/IEC 8859-
4:1998, Part 4: Latin alphabet No. 4 - ISO/IEC 8859-
5:1999, Part 5: Latin/Cyrillic alphabet - ISO/IEC
8859-6:1999, Part 6: Latin/Arabic alphabet - ISO/IEC
8859-7:2003, Part 7: Latin/Greek alphabet - ISO/IEC
8859-8:1999, Part 8: Latin/Hebrew alphabet - ISO/IEC
8859-9:1999, Part 9: Latin alphabet No. 5 - ISO/IEC
8859-10:1998, Part 10: Latin alphabet No. 6 - ISO/
IEC 8859-11:2001, Part 11: Latin/Thai alphabet -
ISO/IEC 8859-13:1998, Part 13: Latin alphabet No. 7
- ISO/IEC 8859-14:1998, Part 14: Latin alphabet No.
8 (Celtic) - ISO/IEC 8859-15:1999, Part 15: Latin
alphabet No. 9 - ISO/IEC 8859-16:2001, Part 16:
Latin alphabet No. 10.
[<a id="ref-MJD">MJD</a>] Duerst, M., "The Properties and Promises of UTF-8",
11th International Unicode Conference, San Jose,
September 1997, <<a href="http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf">http://www.ifi.unizh.ch/mml/</a>
<a href="http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf">mduerst/papers/PDF/IUC11-UTF-8.pdf</a>>.
[<a id="ref-NIS">NIS</a>] Sun Microsystems, "System and Network
Administration", March 1990.
[<a id="ref-RFC0821">RFC0821</a>] Postel, J., "Simple Mail Transfer Protocol", STD 10,
<a href="./rfc821">RFC 821</a>, August 1982.
[<a id="ref-RFC0952">RFC0952</a>] Harrenstien, K., Stahl, M., and E. Feinler, "DoD
Internet host table specification", <a href="./rfc952">RFC 952</a>,
October 1985.
[<a id="ref-RFC1001">RFC1001</a>] NetBIOS Working Group, "Protocol standard for a
NetBIOS service on a TCP/UDP transport: Concepts and
methods", STD 19, <a href="./rfc1001">RFC 1001</a>, March 1987.
[<a id="ref-RFC1002">RFC1002</a>] NetBIOS Working Group, "Protocol standard for a
NetBIOS service on a TCP/UDP transport: Detailed
specifications", STD 19, <a href="./rfc1002">RFC 1002</a>, March 1987.
[<a id="ref-RFC1034">RFC1034</a>] Mockapetris, P., "Domain names - concepts and
facilities", STD 13, <a href="./rfc1034">RFC 1034</a>, November 1987.
[<a id="ref-RFC1123">RFC1123</a>] Braden, R., "Requirements for Internet Hosts -
Application and Support", STD 3, <a href="./rfc1123">RFC 1123</a>,
October 1989.
[<a id="ref-RFC1468">RFC1468</a>] Murai, J., Crispin, M., and E. van der Poel,
"Japanese Character Encoding for Internet Messages",
<a href="./rfc1468">RFC 1468</a>, June 1993.
[<a id="ref-RFC1535">RFC1535</a>] Gavron, E., "A Security Problem and Proposed
Correction With Widely Deployed DNS Software",
<a href="./rfc1535">RFC 1535</a>, October 1993.
[<a id="ref-RFC1536">RFC1536</a>] Kumar, A., Postel, J., Neuman, C., Danzig, P., and
S. Miller, "Common DNS Implementation Errors and
Suggested Fixes", <a href="./rfc1536">RFC 1536</a>, October 1993.
[<a id="ref-RFC2130">RFC2130</a>] Weider, C., Preston, C., Simonsen, K., Alvestrand,
H., Atkinson, R., Crispin, M., and P. Svanberg, "The
Report of the IAB Character Set Workshop held 29
February - 1 March, 1996", <a href="./rfc2130">RFC 2130</a>, April 1997.
[<a id="ref-RFC2136">RFC2136</a>] Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
"Dynamic Updates in the Domain Name System (DNS
UPDATE)", <a href="./rfc2136">RFC 2136</a>, April 1997.
[<a id="ref-RFC2181">RFC2181</a>] Elz, R. and R. Bush, "Clarifications to the DNS
Specification", <a href="./rfc2181">RFC 2181</a>, July 1997.
[<a id="ref-RFC2277">RFC2277</a>] Alvestrand, H., "IETF Policy on Character Sets and
Languages", <a href="https://www.rfc-editor.org/bcp/bcp18">BCP 18</a>, <a href="./rfc2277">RFC 2277</a>, January 1998.
[<a id="ref-RFC3397">RFC3397</a>] Aboba, B. and S. Cheshire, "Dynamic Host
Configuration Protocol (DHCP) Domain Search Option",
<a href="./rfc3397">RFC 3397</a>, November 2002.
[<a id="ref-RFC3490">RFC3490</a>] Faltstrom, P., Hoffman, P., and A. Costello,
"Internationalizing Domain Names in Applications
(IDNA)", <a href="./rfc3490">RFC 3490</a>, March 2003.
[<a id="ref-RFC3492">RFC3492</a>] Costello, A., "Punycode: A Bootstring encoding of
Unicode for Internationalized Domain Names in
Applications (IDNA)", <a href="./rfc3492">RFC 3492</a>, March 2003.
[<a id="ref-RFC3493">RFC3493</a>] Gilligan, R., Thomson, S., Bound, J., McCann, J.,
and W. Stevens, "Basic Socket Interface Extensions
for IPv6", <a href="./rfc3493">RFC 3493</a>, February 2003.
[<a id="ref-RFC3629">RFC3629</a>] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, <a href="./rfc3629">RFC 3629</a>, November 2003.
[<a id="ref-RFC3646">RFC3646</a>] Droms, R., "DNS Configuration options for Dynamic
Host Configuration Protocol for IPv6 (DHCPv6)",
<a href="./rfc3646">RFC 3646</a>, December 2003.
[<a id="ref-RFC4343">RFC4343</a>] Eastlake, D., "Domain Name System (DNS) Case
Insensitivity Clarification", <a href="./rfc4343">RFC 4343</a>,
January 2006.
[<a id="ref-RFC4690">RFC4690</a>] Klensin, J., Faltstrom, P., Karp, C., and IAB,
"Review and Recommendations for Internationalized
Domain Names (IDNs)", <a href="./rfc4690">RFC 4690</a>, September 2006.
[<a id="ref-RFC4795">RFC4795</a>] Aboba, B., Thaler, D., and L. Esibov, "Link-local
Multicast Name Resolution (LLMNR)", <a href="./rfc4795">RFC 4795</a>,
January 2007.
[<a id="ref-RFC4952">RFC4952</a>] Klensin, J. and Y. Ko, "Overview and Framework for
Internationalized Email", <a href="./rfc4952">RFC 4952</a>, July 2007.
[<a id="ref-RFC5198">RFC5198</a>] Klensin, J. and M. Padlipsky, "Unicode Format for
Network Interchange", <a href="./rfc5198">RFC 5198</a>, March 2008.
[<a id="ref-RFC5321">RFC5321</a>] Klensin, J., "Simple Mail Transfer Protocol",
<a href="./rfc5321">RFC 5321</a>, October 2008.
[<a id="ref-RFC5731">RFC5731</a>] Hollenbeck, S., "Extensible Provisioning Protocol
(EPP) Domain Name Mapping", STD 69, <a href="./rfc5731">RFC 5731</a>,
August 2009.
[<a id="ref-RFC5890">RFC5890</a>] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document
Framework", <a href="./rfc5890">RFC 5890</a>, August 2010.
[<a id="ref-RFC5891">RFC5891</a>] Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol", <a href="./rfc5891">RFC 5891</a>,
August 2010.
[<a id="ref-UTF8-DNS">UTF8-DNS</a>] Kwan, S. and J. Gilroy, "Using the UTF-8 Character
Set in the Domain Name System", Work in Progress,
November 1997.
Authors' Addresses
Dave Thaler
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
USA
Phone: +1 425 703 8835
EMail: dthaler@microsoft.com
John C Klensin
1770 Massachusetts Ave, Ste 322
Cambridge, MA 02140
Phone: +1 617 245 1457
EMail: john+ietf@jck.com
Stuart Cheshire
Apple Inc.
1 Infinite Loop
Cupertino, CA 95014
Phone: +1 408 974 3207
EMail: cheshire@apple.com
</pre>