<pre>Network Working Group B. Curtin
Request for Comments: 2640 Defense Information Systems Agency
Updates: <a href="./rfc959">959</a> July 1999
Category: Proposed Standard
Internationalization of the File Transfer Protocol
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (1999). All Rights Reserved.
Abstract
The File Transfer Protocol, as defined in <a href="./rfc959">RFC 959</a> [<a href="./rfc959" title=""File Transfer Protocol (FTP)"">RFC959</a>] and <a href="./rfc1123">RFC</a>
<a href="./rfc1123">1123</a> <a href="./rfc1123#section-4">Section 4 [RFC1123]</a>, is one of the oldest and most widely used
protocols on the Internet. The protocol's primary character set,
7-bit ASCII, has served the protocol well through the early growth
years of the Internet. However, as the Internet becomes more global,
there is a need to support character sets beyond 7-bit ASCII.
This document addresses the internationalization (I18n) of FTP, which
includes supporting the multiple character sets and languages found
throughout the Internet community. This is achieved by extending the
FTP specification and giving recommendations for proper
internationalization support.
Table of Contents
ABSTRACT.......................................................<a href="#page-1">1</a>
<a href="#section-1">1</a> INTRODUCTION.................................................<a href="#page-2">2</a>
<a href="#section-1.1">1.1</a> Requirements Terminology..................................<a href="#page-2">2</a>
<a href="#section-2">2</a> INTERNATIONALIZATION.........................................<a href="#page-3">3</a>
<a href="#section-2.1">2.1</a> International Character Set...............................<a href="#page-3">3</a>
<a href="#section-2.2">2.2</a> Transfer Encoding.........................................<a href="#page-4">4</a>
<a href="#section-3">3</a> PATHNAMES....................................................<a href="#page-5">5</a>
<a href="#section-3.1">3.1</a> General compliance........................................<a href="#page-5">5</a>
<a href="#section-3.2">3.2</a> Servers compliance........................................<a href="#page-6">6</a>
<a href="#section-3.3">3.3</a> Clients compliance........................................<a href="#page-7">7</a>
<a href="#section-4">4</a> LANGUAGE SUPPORT.............................................<a href="#page-7">7</a>
<span class="grey">Curtin Proposed Standard [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc2640">RFC 2640</a> FTP Internationalization July 1999</span>
<a href="#section-4.1">4.1</a> The LANG command..........................................<a href="#page-8">8</a>
<a href="#section-4.2">4.2</a> Syntax of the LANG command................................<a href="#page-9">9</a>
<a href="#section-4.3">4.3</a> Feat response for LANG command...........................<a href="#page-11">11</a>
<a href="#section-4.3.1">4.3.1</a> Feat examples.........................................<a href="#page-11">11</a>
<a href="#section-5">5</a> SECURITY CONSIDERATIONS.....................................<a href="#page-12">12</a>
<a href="#section-6">6</a> ACKNOWLEDGMENTS.............................................<a href="#page-12">12</a>
<a href="#section-7">7</a> GLOSSARY....................................................<a href="#page-13">13</a>
<a href="#section-8">8</a> BIBLIOGRAPHY................................................<a href="#page-13">13</a>
<a href="#section-9">9</a> AUTHOR'S ADDRESS............................................<a href="#page-15">15</a>
ANNEX A - IMPLEMENTATION CONSIDERATIONS.......................<a href="#page-16">16</a>
<a href="#appendix-A.1">A.1</a> General Considerations...................................<a href="#page-16">16</a>
<a href="#appendix-A.2">A.2</a> Transition Considerations................................<a href="#page-18">18</a>
ANNEX B - SAMPLE CODE AND EXAMPLES............................<a href="#page-19">19</a>
<a href="#appendix-B.1">B.1</a> Valid UTF-8 check........................................<a href="#page-19">19</a>
<a href="#appendix-B.2">B.2</a> Conversions..............................................<a href="#page-20">20</a>
<a href="#appendix-B.2.1">B.2.1</a> Conversion from Local Character Set to UTF-8..........<a href="#page-20">20</a>
<a href="#appendix-B.2.2">B.2.2</a> Conversion from UTF-8 to Local Character Set..........<a href="#page-23">23</a>
<a href="#appendix-B.2.3">B.2.3</a> ISO/IEC 8859-8 Example................................<a href="#page-25">25</a>
<a href="#appendix-B.2.4">B.2.4</a> Vendor Codepage Example...............................<a href="#page-25">25</a>
<a href="#appendix-B.3">B.3</a> Pseudo Code for Translating Servers......................<a href="#page-26">26</a>
Full Copyright Statement......................................<a href="#page-27">27</a>
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a> Introduction</span>
As the Internet grows throughout the world, the requirement to support
character sets outside of the ASCII [<a href="#ref-ASCII">ASCII</a>] / Latin-1 [<a href="#ref-ISO-8859">ISO-8859</a>]
character set becomes ever more urgent. For FTP, because of the
large installed base, it is paramount that this be done without
breaking existing clients and servers. This document addresses this
need. In doing so, it defines a solution that still allows the
installed base to interoperate with new clients and servers.
This document enhances the capabilities of the File Transfer Protocol
by removing the 7-bit restrictions on pathnames used in client
commands and server responses, RECOMMENDS the use of a Universal
Character Set (UCS) ISO/IEC 10646 [<a href="#ref-ISO-10646">ISO-10646</a>], RECOMMENDS a UCS
transformation format (UTF) UTF-8 [<a href="#ref-UTF-8">UTF-8</a>], and defines a new command
for language negotiation.
The recommendations made in this document are consistent with the
recommendations expressed by the IETF policy related to character
sets and languages as defined in <a href="./rfc2277">RFC 2277</a> [<a href="./rfc2277" title="" IETF Policy on Character Sets and Languages"">RFC2277</a>].
<span class="h3"><a class="selflink" id="section-1.1" href="#section-1.1">1.1</a> Requirements Terminology</span>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <a href="https://www.rfc-editor.org/bcp/bcp14">BCP 14</a> [<a href="#ref-BCP14" title=""Key words for use in RFCs to Indicate Requirement Levels"">BCP14</a>].
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a> Internationalization</span>
The File Transfer Protocol was developed when the predominant
character sets were 7-bit ASCII and 8-bit EBCDIC. Today these
character sets cannot support the wide range of characters needed by
multinational systems. Given that there are a number of character
sets in current use that provide more characters than 7-bit ASCII, it
makes sense to decide on a convenient way to represent the union of
those possibilities. Working globally requires either support for a
number of character sets and the ability to convert between them, or
the use of a single preferred character set. To ensure global
interoperability this document RECOMMENDS the latter approach and
defines a single character set, in addition to NVT ASCII and EBCDIC,
which is understandable by all systems. For FTP this character set
SHALL be ISO/IEC 10646:1993. For support of global compatibility it
is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
when exchanging pathnames. Clients and servers are, however, under
no obligation to perform any conversion on the contents of a file for
operations such as STOR or RETR.
The character set used to store files SHALL remain a local decision
and MAY depend on the capability of local operating systems. Prior to
the exchange of pathnames they SHOULD be converted into an ISO/IEC
10646 format and UTF-8 encoded. This approach, while allowing
international exchange of pathnames, will still allow backward
compatibility with older systems because the code set positions for
ASCII characters are identical to the one byte sequence in UTF-8.
Sections <a href="#section-2.1">2.1</a> and <a href="#section-2.2">2.2</a> give a brief description of the international
character set and transfer encoding RECOMMENDED by this document. A
more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
[<a href="#ref-UNICODE" title=""The Unicode Standard - Version 2.0"">UNICODE</a>], beyond that given in this document, can be found in <a href="./rfc2279">RFC</a>
<a href="./rfc2279">2279</a> [<a href="./rfc2279" title=""UTF-8, a transformation format of ISO 10646"">RFC2279</a>].
<span class="h3"><a class="selflink" id="section-2.1" href="#section-2.1">2.1</a> International Character Set</span>
The character set defined for international support of FTP SHALL be
the Universal Character Set as defined in ISO 10646:1993 as amended.
This standard incorporates the character sets of many existing
international, national, and corporate standards. ISO/IEC 10646
defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
four byte (31 bit) encoding containing 2**31 code positions divided
into 128 groups of 256 planes. Each plane consists of 256 rows of 256
cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
zero or the Basic Multilingual Plane (BMP). Currently, no codesets
have been defined outside of the 2 byte BMP.
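As a non-normative illustration of the structure described above, the
sketch below splits a UCS-4 code position into its group, plane, row,
and cell octets. The names ucs4_coordinates and in_bmp are assumptions
made for this example; they are not defined by ISO/IEC 10646 or by
this document.

```python
def ucs4_coordinates(code_position):
    """Split a 31-bit UCS-4 code position into (group, plane, row, cell)."""
    if not 0 <= code_position < 2**31:
        raise ValueError("UCS-4 code positions occupy 31 bits")
    group = code_position >> 24            # 7 bits: 128 groups
    plane = (code_position >> 16) & 0xFF   # 256 planes per group
    row = (code_position >> 8) & 0xFF      # 256 rows per plane
    cell = code_position & 0xFF            # 256 cells per row
    return group, plane, row, cell


def in_bmp(code_position):
    """True when the position lies in group 0, plane 0 (the BMP)."""
    group, plane, _, _ = ucs4_coordinates(code_position)
    return group == 0 and plane == 0
```

For instance, ucs4_coordinates(0x05D0) places HEBREW LETTER ALEF in
group 0, plane 0, row 05, cell D0, which is why it is representable in
UCS-2.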
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
The Unicode standard version 2.0 [<a href="#ref-UNICODE" title=""The Unicode Standard - Version 2.0"">UNICODE</a>] is consistent with the
UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
includes the repertoire of IS 10646 characters, amendments 1-7 of IS
10646, and editorial and technical corrigenda.
<span class="h3"><a class="selflink" id="section-2.2" href="#section-2.2">2.2</a> Transfer Encoding</span>
UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
or UTF-FSS, SHALL be used as a transfer encoding to transmit the
international character set. UTF-8 is a file safe encoding which
avoids the use of byte values that have special significance during
the parsing of pathname character strings. UTF-8 is an 8-bit encoding
of the characters in the UCS. Some of UTF-8's benefits are that it is
compatible with 7-bit ASCII, so it doesn't affect programs that give
special meanings to various ASCII characters; it is immune to
synchronization errors; its encoding rules allow for easy
identification; and it has enough space to support a large number of
character sets.
UTF-8 encoding represents each UCS character as a sequence of 1 to 6
bytes. In every one-byte sequence the most significant bit is ZERO.
In every sequence of more than one byte, the number of consecutive
ONE bits at the most significant end of the first byte, which are
followed by a ZERO bit, equals the number of bytes in the sequence.
For example, the first byte of a 3-byte UTF-8 sequence has 1110 as
its most significant bits. Each additional byte (continuing byte) in
the UTF-8 sequence has a ONE bit followed by a ZERO bit as its most
significant bits. The remaining free bit positions in the first byte
and the continuing bytes are used to identify characters in the UCS.
The relationship between UCS and UTF-8 is demonstrated in the
following table:
UCS-4 range (hex)      UTF-8 byte sequence (binary)
00000000 - 0000007F    0xxxxxxx
00000080 - 000007FF    110xxxxx 10xxxxxx
00000800 - 0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                       10xxxxxx
04000000 - 7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                       10xxxxxx 10xxxxxx
A beneficial property of UTF-8 is that its single byte sequence is
consistent with the ASCII character set. This feature will allow a
transition where old ASCII-only clients can still interoperate with
new servers that support the UTF-8 encoding.
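The table above can be read as an encoding procedure. The following
sketch is one possible rendering of it, not sample code from this
document; the names ucs_to_utf8 and _RANGES are assumptions made for
illustration. It works on integer code positions because Python
strings cannot hold values above 10FFFF, while the table covers the
full 31-bit range.

```python
# (upper limit of range, sequence length, marker bits of the first byte)
_RANGES = (
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
)


def ucs_to_utf8(code_position):
    """Encode one UCS-4 code position as a 1 to 6 byte UTF-8 sequence."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("code position outside the UCS-4 range")
    # Pick the first row of the table whose range contains the position.
    length, marker = next(
        (n, m) for limit, n, m in _RANGES if code_position <= limit
    )
    out = bytearray(length)
    # Fill continuing bytes from the right, six payload bits at a time.
    for index in range(length - 1, 0, -1):
        out[index] = 0x80 | (code_position & 0x3F)
        code_position >>= 6
    # Whatever bits remain fit in the free positions of the first byte.
    out[0] = marker | code_position
    return bytes(out)
```

For code positions in the ASCII range the result is the single ASCII
byte itself, which is the property that lets ASCII-only clients keep
working.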
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
Another feature is that the encoding rules make it very unlikely that
a character sequence from a different character set will be mistaken
for a UTF-8 encoded character sequence. Clients and servers can use a
simple routine to determine whether the bytes being exchanged are
valid UTF-8. Section B.1 shows a code example of this check.
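For readers who want a quick feel for such a routine, here is a
minimal Python sketch. It is not the Section B.1 code; the name
is_valid_utf8 is an assumption made for this example. It checks only
the structure described in section 2.2 (first-byte pattern followed by
the right number of continuing bytes) and does not reject overlong
encodings.

```python
def is_valid_utf8(data):
    """Structural check: every lead byte has the expected continuing bytes."""
    index = 0
    while index < len(data):
        lead = data[index]
        if lead < 0x80:
            follow = 0          # 0xxxxxxx: single byte
        elif 0xC0 <= lead <= 0xDF:
            follow = 1          # 110xxxxx
        elif 0xE0 <= lead <= 0xEF:
            follow = 2          # 1110xxxx
        elif 0xF0 <= lead <= 0xF7:
            follow = 3          # 11110xxx
        elif 0xF8 <= lead <= 0xFB:
            follow = 4          # 111110xx
        elif 0xFC <= lead <= 0xFD:
            follow = 5          # 1111110x
        else:
            return False        # stray continuing byte, or FE / FF
        tail = data[index + 1:index + 1 + follow]
        if len(tail) != follow:
            return False        # sequence truncated
        if any((byte & 0xC0) != 0x80 for byte in tail):
            return False        # continuing bytes must look like 10xxxxxx
        index += 1 + follow
    return True
```

A Latin-1 name such as the bytes E9 74 E9 fails the check because E9
announces a 3-byte sequence but is followed by an ASCII byte, which
illustrates why other character sets are rarely mistaken for UTF-8.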
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a> Pathnames</span>
<span class="h3"><a class="selflink" id="section-3.1" href="#section-3.1">3.1</a> General compliance</span>
- The 7-bit restriction for pathnames exchanged is dropped.
- Many operating systems allow the use of space <SP>, carriage return
<CR>, and line feed <LF> characters as part of the pathname. The
exchange of pathnames with these special command characters will
cause the pathnames to be parsed improperly. This is because FTP
commands associated with pathnames have the form:
COMMAND <SP> <pathname> <CRLF>.
To allow the exchange of pathnames containing these characters, the
definition of pathname is changed from
<pathname> ::= <string> ; in BNF format
to
pathname = 1*(%x01..%xFF) ; in ABNF format [<a href="#ref-ABNF" title=""Augmented BNF for Syntax Specifications: ABNF"">ABNF</a>].
To avoid mistaking these characters within pathnames for special
command characters, the following rules apply:
There MUST be only one <SP> between an FTP command and the pathname.
Implementations MUST treat <SP> characters following the initial
<SP> as part of the pathname. For example, the pathname in STOR
<SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar.
Current implementations, which may allow multiple <SP> characters as
separators between the command and pathname, MUST ensure that they
comply with this single <SP> convention. Note: Implementations which
treat 3 character commands (e.g. CWD, MKD, etc.) as a fixed 4
character command by padding the command with a trailing <SP> do not
comply with this specification.
When a <CR> character is encountered as part of a pathname it MUST be
padded with a <NUL> character prior to sending the command. On
receipt of a pathname containing a <CR><NUL> sequence the <NUL>
character MUST be stripped away. This approach is described in the
Telnet protocol [<a href="./rfc854" title=""Telnet Protocol Specification"">RFC854</a>] on pages 11 and 12. For example, to store a
pathname foo<CR><LF>boo.bar, the pathname would become
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
foo<CR><NUL><LF>boo.bar prior to sending the command STOR
<SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered
pathname the <NUL> character following the <CR> would be stripped
away to form the original pathname.
- Conforming clients and servers MUST support UTF-8 for the transfer
and receipt of pathnames. Clients and servers MAY in addition give
users a choice of specifying interpretation of pathnames in another
encoding. Note that configuring clients and servers to use
character sets or encodings other than UTF-8 is outside the scope
of this document. While it is recognized that in certain
operational scenarios this may be desirable, this is left as a
quality of implementation and operational issue.
- Pathnames are sequences of bytes. The encoding of names that are
valid UTF-8 sequences is assumed to be UTF-8. The character set of
other names is undefined. Clients and servers, unless otherwise
configured to support a specific native character set, MUST check
for a valid UTF-8 byte sequence to determine if the pathname being
presented is UTF-8.
- To avoid data loss, clients and servers SHOULD use the UTF-8
encoded pathnames when unable to convert them to a usable code set.
- There may be cases when the code set / encoding presented to the
server or client cannot be determined. In such cases the raw bytes
SHOULD be used.
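The single-SP rule and the CR NUL padding rule from this section can
be sketched together as follows. This is an illustration under stated
assumptions, not a reference implementation: the helper names
escape_cr, unescape_cr, build_command, and parse_command are invented
for this example, and pathnames are handled as raw bytes.

```python
def escape_cr(pathname):
    """Pad every CR in a pathname with a NUL before sending."""
    return pathname.replace(b"\r", b"\r\x00")


def unescape_cr(pathname):
    """Strip the NUL that follows a CR in a received pathname."""
    return pathname.replace(b"\r\x00", b"\r")


def build_command(verb, pathname):
    """Frame a command as COMMAND SP pathname CRLF with exactly one SP."""
    return verb.encode("ascii") + b" " + escape_cr(pathname) + b"\r\n"


def parse_command(line):
    """Return (verb, pathname); only the first SP acts as a separator."""
    if not line.endswith(b"\r\n"):
        raise ValueError("command line must end with CRLF")
    verb, _, rest = line[:-2].partition(b" ")
    # Any further SP characters belong to the pathname and are kept.
    return verb.decode("ascii").upper(), unescape_cr(rest)
```

With these helpers, parse_command(b"STOR   foo.bar\r\n") yields the
pathname b"  foo.bar", keeping the two extra spaces, and a pathname
containing CR LF survives a round trip through build_command and
parse_command unchanged.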
<span class="h3"><a class="selflink" id="section-3.2" href="#section-3.2">3.2</a> Servers compliance</span>
- Servers MUST support the UTF-8 feature in response to the FEAT
command [<a href="./rfc2389" title=""Feature Negotiation Mechanism for the File Transfer Protocol"">RFC2389</a>]. The UTF-8 feature is a line containing the exact
string "UTF8". This string is not case sensitive, but SHOULD be
transmitted in upper case. The response to a FEAT command SHOULD
be:
C> feat
S> 211- <any descriptive text>
S>  ...
S>  UTF8
S>  ...
S> 211 end
The ellipses indicate placeholders where other features may be
included, but are NOT REQUIRED. The one space indentation of the
feature lines is mandatory [<a href="./rfc2389" title=""Feature Negotiation Mechanism for the File Transfer Protocol"">RFC2389</a>].
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-7" ></span>
- Mirror servers may want to exactly reflect the site that they are
mirroring. In such cases servers MAY store and present the exact
pathname bytes that they received from the main server.
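The FEAT response format described earlier in this section lends
itself to a simple client-side check. The sketch below assumes the
reply has already been split into text lines; the function name
server_supports_utf8 is hypothetical. It relies only on what this
section states: feature lines start with exactly one space, and the
feature string "UTF8" is not case sensitive.

```python
def server_supports_utf8(feat_reply_lines):
    """Return True when a FEAT reply carries the UTF8 feature line."""
    for line in feat_reply_lines:
        # Feature lines are indented by one space; status lines are not.
        if not line.startswith(" "):
            continue
        feature = line[1:].rstrip("\r\n")
        if feature.upper() == "UTF8":
            return True
    return False
```

A line reading "UTF8" without the leading space is ignored by this
check, since the one-space indentation is what marks a feature line.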
<span class="h3"><a class="selflink" id="section-3.3" href="#section-3.3">3.3</a> Clients compliance</span>
- Clients which do not require display of pathnames are under no
obligation to do so. Non-display clients do not need to conform to
requirements associated with display.
- Clients that are presented UTF-8 pathnames by the server SHOULD
parse UTF-8 correctly and attempt to display the pathname within
the limitation of the resources available.
- Clients MUST support the FEAT command and recognize the "UTF8"
feature (defined in 3.2 above) to determine if a server supports
UTF-8 encoding.
- Character semantics of other names shall remain undefined. If a
client detects that a server is non-UTF-8, it SHOULD change its
display appropriately. How a client implementation handles non-UTF-8
pathnames is a quality of implementation issue. It MAY try to assume
some other encoding, let the user choose an encoding to assume, or
save encoding assumptions for a server from one FTP session to
another.
- Glyph rendering is outside the scope of this document. How a client
presents characters it cannot display is a quality of
implementation issue. This document RECOMMENDS that octets
corresponding to non-displayable characters be presented in
URL %HH format defined in <a href="./rfc1738">RFC 1738</a> [<a href="./rfc1738" title=""Uniform Resource Locators (URL)"">RFC1738</a>]. Clients MAY, however,
display them as question marks, with their UCS hexadecimal value,
or in any other suitable fashion.
- Many existing clients interpret 8-bit pathnames as being in the
local character set. They MAY continue to do so for pathnames that
are not valid UTF-8.
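One way a display client might combine these rules is sketched below.
It is only an illustration: display_pathname is a hypothetical name,
"non-displayable" is approximated with Python's str.isprintable, and
Python's strict UTF-8 decoder is used as the validity test, so 5 and 6
byte sequences fall through to the raw-octet branch.

```python
def _percent(octets):
    """Render octets in URL %HH form."""
    return "".join("%%%02X" % octet for octet in octets)


def display_pathname(raw):
    """Return a printable rendering of a pathname received as bytes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: keep printable ASCII, show other octets as %HH.
        return "".join(
            chr(octet) if 0x20 <= octet < 0x7F else _percent([octet])
            for octet in raw
        )
    # Valid UTF-8: show characters, escaping those that cannot be displayed.
    return "".join(
        ch if ch.isprintable() else _percent(ch.encode("utf-8"))
        for ch in text
    )
```

A client that prefers to treat non-UTF-8 names as its local character
set, as permitted above, would replace the raw-octet branch with a
decode using that character set.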
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a> Language Support</span>
The Character Set Workshop Report [<a href="./rfc2130" title=""Character Set Workshop Report"">RFC2130</a>] suggests that clients and
servers SHOULD negotiate a language for "greetings" and "error
messages". This specification interprets the use of the term "error
message", by <a href="./rfc2130">RFC 2130</a>, to mean any explanatory text string returned
by server-PI in response to a user-PI command.
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-8" ></span>
Implementers SHOULD note that FTP commands and numeric responses are
protocol elements. As such, their use is not affected by any guidance
expressed by this specification.
Greetings and command responses shall be presented either in the
server's default language or in a language that the server supports
and the client has selected.
It may be possible to achieve language support through a virtual host
as described in [<a href="#ref-MLST" title=""Extensions to FTP"">MLST</a>]. However, an FTP server might not support
virtual servers, or virtual servers might be configured to support an
environment without regard for language. To allow language
negotiation this specification defines a new LANG command. Clients
and servers that comply with this specification MUST support the LANG
command.
<span class="h3"><a class="selflink" id="section-4.1" href="#section-4.1">4.1</a> The LANG command</span>
A new command "LANG" is added to the FTP command set to allow the
server-FTP process to determine in which language to present server
greetings and the textual part of command responses. The parameter
associated with the LANG command SHALL be one of the language tags
defined in <a href="./rfc1766">RFC 1766</a> [<a href="./rfc1766" title=""Tags for the Identification of Languages"">RFC1766</a>]. If a LANG command without a parameter
is issued, the server's default language will be used.
Greetings and responses issued prior to language negotiation SHALL be
in the server's default language. Paragraph 4.5 of [<a href="./rfc2277" title="" IETF Policy on Character Sets and Languages"">RFC2277</a>] states
that this "default language MUST be understandable by an English-
speaking person". This specification RECOMMENDS that the server
default language be English encoded using ASCII. This text may be
augmented by text from other languages. Once negotiated, server-PI
MUST return server messages and the textual part of command responses
in the negotiated language, encoded in UTF-8. Server-PI MAY
re-send previously issued server messages in the newly negotiated
language.
The LANG command only affects presentation of greeting messages and
explanatory text associated with command responses. No attempt should
be made by the server to translate protocol elements (FTP commands
and numeric responses) or data transmitted over the data connection.
User-PI MAY issue the LANG command at any time during an FTP session.
In order to gain the full benefit of this command, it SHOULD be
presented prior to authentication. In general, it will be issued
after the HOST command [<a href="#ref-MLST" title=""Extensions to FTP"">MLST</a>]. Note that the issuance of a HOST or
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-9" ></span>
REIN command [<a href="./rfc959" title=""File Transfer Protocol (FTP)"">RFC959</a>] will negate the effect of the LANG command.
User-PI SHOULD be capable of supporting UTF-8 encoding for the
language negotiated. Guidance on interpretation and rendering of
UTF-8, defined in <a href="#section-3">section 3</a>, SHALL apply.
Although NOT REQUIRED by this specification, a user-PI SHOULD issue a
FEAT command [<a href="./rfc2389" title=""Feature Negotiation Mechanism for the File Transfer Protocol"">RFC2389</a>] prior to a LANG command. This will allow the
user-PI to determine whether the server supports the LANG command and
which language options it offers.
In order to aid the server in identifying whether a connection has
been established with a client which conforms to this specification
or an older client, user-PI MUST send a HOST [<a href="#ref-MLST" title=""Extensions to FTP"">MLST</a>] and/or LANG
command prior to issuing any other command (other than FEAT
[<a href="./rfc2389" title=""Feature Negotiation Mechanism for the File Transfer Protocol"">RFC2389</a>]). If user-PI issues a HOST command, and the server's
default language is acceptable, it need not issue a LANG command.
However, if the implementation does not support the HOST command, a
LANG command MUST be issued. Until server-PI is presented with either
a HOST or LANG command it SHOULD assume that the user-PI does not
comply with this specification.
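The server-side assumption in the preceding paragraph amounts to a
small piece of per-connection state. The sketch below is one
hypothetical way to hold it; the class name ClientComplianceTracker
and its method are invented for this example and say nothing about how
a real server structures its session handling.

```python
class ClientComplianceTracker:
    """Tracks whether a client has shown it follows this specification."""

    def __init__(self):
        # Until HOST or LANG is seen, assume an older client.
        self.assume_compliant = False

    def observe(self, verb):
        """Record a received command verb and return the current view."""
        if verb.upper() in ("HOST", "LANG"):
            self.assume_compliant = True
        return self.assume_compliant
```

A FEAT command on its own leaves the assumption unchanged, since FEAT
is the one command a conforming client may send before HOST or LANG.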
<span class="h3"><a class="selflink" id="section-4.2" href="#section-4.2">4.2</a> Syntax of the LANG command</span>
The LANG command is defined as follows:
lang-command          = "Lang" [(SP lang-tag)] CRLF
lang-tag              = Primary-tag *( "-" Sub-tag)
Primary-tag           = 1*8ALPHA
Sub-tag               = 1*8ALPHA
lang-response         = lang-ok / error-response
lang-ok               = "200" [SP *(%x00..%xFF) ] CRLF
error-response        = command-unrecognized / bad-argument /
                        not-implemented / unsupported-parameter
command-unrecognized  = "500" [SP *(%x01..%xFF) ] CRLF
bad-argument          = "501" [SP *(%x01..%xFF) ] CRLF
not-implemented       = "502" [SP *(%x01..%xFF) ] CRLF
unsupported-parameter = "504" [SP *(%x01..%xFF) ] CRLF
The "lang" command word is case independent and may be specified in
any character case desired. Therefore "LANG", "lang", "Lang", and
"lAnG" are equivalent commands.
The OPTIONAL "Lang-tag" given as a parameter specifies the primary
language tags and zero or more sub-tags as defined in [<a href="./rfc1766" title=""Tags for the Identification of Languages"">RFC1766</a>]. As
described in [<a href="./rfc1766" title=""Tags for the Identification of Languages"">RFC1766</a>] language tags are treated as case insensitive.
If omitted, server-PI MUST use the server's default language.
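
The grammar above can be checked mechanically. The following is a
minimal sketch, not part of this specification, of how a server might
classify a received command line. The names lang_tag_is_valid,
parse_lang_command and the lang_parse values are hypothetical, and the
line is assumed to have had its trailing CRLF removed already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum lang_parse { LANG_NOT_LANG, LANG_NO_ARG, LANG_WITH_TAG, LANG_BAD_ARG };

/* ASCII-only upper-casing, so the result does not depend on the locale. */
static int ascii_upper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
}

/* Returns 1 if s[0..n) is 1 to 8 ASCII letters (Primary-tag or Sub-tag). */
static int is_subtag(const char *s, size_t n)
{
    size_t i;
    if (n < 1 || n > 8)
        return 0;
    for (i = 0; i < n; i++) {
        int c = ascii_upper(s[i]);
        if (c < 'A' || c > 'Z')
            return 0;
    }
    return 1;
}

/* Returns 1 if tag matches: Primary-tag *( "-" Sub-tag ). */
int lang_tag_is_valid(const char *tag)
{
    assert(tag != NULL);
    for (;;) {
        const char *dash = strchr(tag, '-');
        if (dash == NULL)
            return is_subtag(tag, strlen(tag));
        if (!is_subtag(tag, (size_t)(dash - tag)))
            return 0;
        tag = dash + 1;          /* continue with the next sub-tag */
    }
}

/* Classifies one command line. On LANG_WITH_TAG, *tag points at the
 * lang-tag inside line; otherwise *tag is set to NULL. */
enum lang_parse parse_lang_command(const char *line, const char **tag)
{
    static const char word[] = "LANG";
    size_t i;

    assert(line != NULL && tag != NULL);
    *tag = NULL;
    for (i = 0; i < 4; i++)      /* command word is case independent */
        if (ascii_upper(line[i]) != word[i])
            return LANG_NOT_LANG;
    if (line[4] == '\0')
        return LANG_NO_ARG;      /* server default language applies */
    if (line[4] != ' ')
        return LANG_NOT_LANG;    /* some longer command word */
    if (!lang_tag_is_valid(line + 5))
        return LANG_BAD_ARG;     /* candidate for a 501 response */
    *tag = line + 5;
    return LANG_WITH_TAG;
}
```

A LANG_BAD_ARG result corresponds to the "bad-argument" (501) case; a
syntactically valid tag still has to be matched against the languages
the server has implemented.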
Server-FTP responds to the "Lang" command with either "lang-ok" or
"error-response". "lang-ok" MUST be sent if Server-FTP supports the
"Lang" command and can support some form of the "lang-tag". Support
SHOULD be as follows:
- If server-FTP receives "Lang" with no parameters it SHOULD return
messages and command responses in the server default language.
- If server-FTP receives "Lang" with only a primary tag argument
(e.g. en, fr, de, ja, zh, etc.), which it can support, it SHOULD
return messages and command responses in the language associated
with that primary tag. It is possible that server-FTP will only
support the primary tag when combined with a sub-tag (e.g. en-US,
en-UK, etc.). In such cases, server-FTP MAY determine the
appropriate variant to use during the session. How server-FTP makes
that determination is outside the scope of this specification. If
server-FTP cannot determine if a sub-tag variant is appropriate it
SHOULD return an "unsupported-parameter" (504) response.
- If server-FTP receives "Lang" with a primary tag and sub-tag(s)
argument, which is implemented, it SHOULD return messages and
command responses in support of the language argument. It is
possible that server-FTP can support the primary tag of the "Lang"
argument but not the sub-tag(s). In such cases server-FTP MAY
return messages and command responses in the most appropriate
variant of the primary tag that has been implemented. How server-
FTP makes that determination is outside the scope of this
specification. If server-FTP cannot determine if a sub-tag variant
is appropriate it SHOULD return an "unsupported-parameter" (504)
response.
For example if client-FTP sends a "LANG en-AU" command and server-FTP
has implemented language tags en-US and en-UK it may decide that the
most appropriate language tag is en-UK and return "200 en-AU not
supported. Language set to en-UK". The numeric response is a protocol
element and cannot be changed. The associated string is for
illustrative purposes only.
Clients and servers that conform to this specification MUST support
the LANG command. Clients SHOULD, however, anticipate receiving a 500
or 502 command response, in cases where older or non-compliant
servers do not recognize or have not implemented the "Lang" command. A 501
response SHOULD be sent if the argument to the "Lang" command is not
syntactically correct. A 504 response SHOULD be sent if the "Lang"
argument, while syntactically correct, is not implemented. As noted
above, an argument may be considered a lexicon match even though it
is not an exact syntax match.
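
The response rules above can be sketched as a single selection
routine. This is an illustration under stated assumptions rather than
a required algorithm: the function name lang_reply_code and its
parameters are hypothetical, and the fallback simply picks the first
implemented tag that shares the primary tag, since the specification
leaves that choice to the implementation.

```c
#include <assert.h>
#include <stddef.h>

static int lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + 32 : c;
}

/* 1 if t matches Primary-tag *( "-" Sub-tag ), each part 1*8ALPHA. */
static int tag_syntax_ok(const char *t)
{
    size_t run = 0;                      /* letters in current sub-tag */
    for (; *t != '\0'; t++) {
        if (*t == '-') {
            if (run == 0)
                return 0;                /* empty primary tag or sub-tag */
            run = 0;
        } else if (lower(*t) >= 'a' && lower(*t) <= 'z') {
            if (++run > 8)
                return 0;
        } else {
            return 0;
        }
    }
    return run > 0;
}

/* Case-insensitive equality of two complete tags. */
static int tags_equal(const char *a, const char *b)
{
    while (*a != '\0' && lower(*a) == lower(*b)) {
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Case-insensitive equality of the primary tags only. */
static int same_primary(const char *a, const char *b)
{
    while (*a != '\0' && *a != '-' && lower(*a) == lower(*b)) {
        a++;
        b++;
    }
    return (*a == '\0' || *a == '-') && (*b == '\0' || *b == '-');
}

/* Returns the reply code for "LANG arg" (arg == NULL: no argument).
 * On 200, *chosen is the language the server will use. */
int lang_reply_code(const char *arg, const char *const supported[],
                    size_t n, const char *deflt, const char **chosen)
{
    size_t i;

    assert(chosen != NULL && deflt != NULL);
    *chosen = NULL;
    if (arg == NULL) {                   /* no parameter: server default */
        *chosen = deflt;
        return 200;
    }
    if (!tag_syntax_ok(arg))
        return 501;                      /* bad-argument */
    for (i = 0; i < n; i++)              /* exact match first */
        if (tags_equal(arg, supported[i])) {
            *chosen = supported[i];
            return 200;
        }
    for (i = 0; i < n; i++)              /* then any variant of the primary */
        if (same_primary(arg, supported[i])) {
            *chosen = supported[i];
            return 200;
        }
    return 504;                          /* unsupported-parameter */
}
```

The 500 and 502 responses are not produced here because they come
from servers that do not recognize or implement the command at all.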
<span class="h3"><a class="selflink" id="section-4.3" href="#section-4.3">4.3</a> Feat response for LANG command</span>
A server-FTP process that supports the LANG command, and language
support for messages and command responses, MUST include in the
response to the FEAT command [<a href="./rfc2389" title=""Feature Negotiation Mechanism for the File Transfer Protocol"">RFC2389</a>], a feature line indicating
that the LANG command is supported and a fact list of the supported
language tags. A response to a FEAT command SHALL be in the following
format:
Lang-feat = SP "LANG" SP lang-fact CRLF
lang-fact = lang-tag ["*"] *(";" lang-tag ["*"])
lang-tag = Primary-tag *( "-" Sub-tag)
Primary-tag= 1*8ALPHA
Sub-tag = 1*8ALPHA
The lang-feat response contains the string "LANG" followed by a
language fact. This string is not case sensitive, but SHOULD be
transmitted in upper case, as recommended in [<a href="./rfc2389" title=""Feature Negotiation Mechanism for the File Transfer Protocol"">RFC2389</a>]. The initial
space shown in the Lang-feat response is REQUIRED by the FEAT
command. It MUST be a single space character. More or fewer space
characters are not permitted. The lang-fact SHALL include the
lang-tags which server-FTP can support. At least one lang-tag MUST be
included with the FEAT response. The lang-tag SHALL be in the form
described earlier in this document. The OPTIONAL asterisk, when
present, SHALL indicate the current lang-tag being used by server-FTP
for messages and responses.
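
As an illustration of the format above, a client might extract the
language currently in use from a feature line as sketched below. The
function name feat_current_lang is hypothetical, and the line is
assumed to be passed without its trailing CRLF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int upper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
}

/* Copies the lang-tag marked with '*' in a feature line such as
 * " LANG EN*;FR" into out. Returns 1 on success, or 0 if the line is
 * not a LANG feature line, no tag is marked, or the tag does not fit. */
int feat_current_lang(const char *line, char *out, size_t outlen)
{
    static const char prefix[] = " LANG ";
    const char *p, *end;
    size_t i, n;

    assert(line != NULL && out != NULL);
    /* Exactly one leading space; the feature name is not case sensitive. */
    for (i = 0; prefix[i] != '\0'; i++)
        if (upper(line[i]) != prefix[i])
            return 0;

    p = line + i;
    while (*p != '\0') {
        end = strchr(p, ';');            /* facts are separated by ";" */
        if (end == NULL)
            end = p + strlen(p);
        n = (size_t)(end - p);
        if (n > 1 && end[-1] == '*') {   /* tag followed by an asterisk */
            n--;                         /* drop the asterisk itself */
            if (n >= outlen)
                return 0;
            memcpy(out, p, n);
            out[n] = '\0';
            return 1;
        }
        p = (*end == ';') ? end + 1 : end;
    }
    return 0;
}
```

Because the asterisk is OPTIONAL, a return value of 0 for a valid LANG
feature line only means that the server did not mark a current
language.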
<span class="h4"><a class="selflink" id="section-4.3.1" href="#section-4.3.1">4.3.1</a> Feat examples</span>
C> feat
S> 211- <any descriptive text>
S> ...
S> LANG EN*
S> ...
S> 211 end
In this example server-FTP can only support English, which is the
current language (as shown by the asterisk) being used by the server
for messages and command responses.
C> feat
S> 211- <any descriptive text>
S> ...
S> LANG EN*;FR
S> ...
S> 211 end
C> LANG fr
S> 200 Le response sera changez au francais
C> feat
S> 211- <quelconque descriptif texte>
S> ...
S> LANG EN;FR*
S> ...
S> 211 end
In this example server-FTP supports both English and French as shown
by the initial response to the FEAT command. The asterisk indicates
that English is the current language in use by server-FTP. After a
LANG command is issued to change the language to French, the FEAT
response shows French as the current language in use.
In the above examples ellipses indicate placeholders where other
features may be included, but are NOT REQUIRED.
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a> Security Considerations</span>
This document addresses the support of character sets beyond 1 byte
and a new language negotiation command. Conformance to this document
should not induce a security risk.
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a> Acknowledgments</span>
The following people have contributed to this document:
D. J. Bernstein
Martin J. Duerst
Mark Harris
Paul Hethmon
Alun Jones
Gregory Lundberg
James Matthews
Keith Moore
Sandra O'Donnell
Benjamin Riefenstahl
Stephen Tihor
(and others from the FTPEXT working group)
<span class="h2"><a class="selflink" id="section-7" href="#section-7">7</a> Glossary</span>
BIDI - abbreviation for Bi-directional, a reference to mixed right-
to-left and left-to-right text.
Character Set - a collection of characters used to represent textual
information in which each character has a numeric value.
Code Set - (see character set).
Glyph - a character image represented on a display device.
I18N - "I eighteen N", the first and last letters of the word
"internationalization" and the eighteen letters in between.
UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form.
UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form.
UTF-8 - the UCS Transformation Format represented in 8 bits.
UTF-16 - A 16-bit format including the BMP (directly encoded) and
surrogate pairs to represent characters in planes 01-16; equivalent
to Unicode.
<span class="h2"><a class="selflink" id="section-8" href="#section-8">8</a> Bibliography</span>
[<a id="ref-ABNF">ABNF</a>] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", <a href="./rfc2234">RFC 2234</a>, November 1997.
[<a id="ref-ASCII">ASCII</a>] ANSI X3.4:1986 Coded Character Sets - 7 Bit American
National Standard Code for Information Interchange (7-
bit ASCII)
[<a id="ref-ISO-8859">ISO-8859</a>] ISO 8859. International standard -- Information
processing -- 8-bit single-byte coded graphic character
sets -- Part 1:Latin alphabet No. 1 (1987) -- Part 2:
Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet
No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) --
Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
Latin/Arabic alphabet (1987) -- Part 7: Latin/Greek
alphabet (1987) -- Part 8: Latin/Hebrew alphabet (1988)
-- Part 9: Latin alphabet No. 5 (1989) -- Part 10: Latin
alphabet No. 6 (1992)
[<a id="ref-BCP14">BCP14</a>] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", <a href="https://www.rfc-editor.org/bcp/bcp14">BCP 14</a>, <a href="./rfc2119">RFC 2119</a>, March 1997.
[<a id="ref-ISO-10646">ISO-10646</a>] ISO/IEC 10646-1:1993. International standard --
Information technology -- Universal multiple-octet coded
character set (UCS) -- Part 1: Architecture and basic
multilingual plane.
[<a id="ref-MLST">MLST</a>] Elz, R. and P. Hethmon, "Extensions to FTP", Work in
Progress.
[<a id="ref-RFC854">RFC854</a>] Postel, J. and J. Reynolds, "Telnet Protocol
Specification", STD 8, <a href="./rfc854">RFC 854</a>, May 1983.
[<a id="ref-RFC959">RFC959</a>] Postel, J. and J. Reynolds, "File Transfer Protocol
(FTP)", STD 9, <a href="./rfc959">RFC 959</a>, October 1985.
[<a id="ref-RFC1123">RFC1123</a>] Braden, R., "Requirements for Internet Hosts --
Application and Support", STD 3, <a href="./rfc1123">RFC 1123</a>, October 1989.
[<a id="ref-RFC1738">RFC1738</a>] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
Resource Locators (URL)", <a href="./rfc1738">RFC 1738</a>, December 1994.
[<a id="ref-RFC1766">RFC1766</a>] Alvestrand, H., "Tags for the Identification of
Languages", <a href="./rfc1766">RFC 1766</a>, March 1995.
[<a id="ref-RFC2130">RFC2130</a>] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "Character
Set Workshop Report", <a href="./rfc2130">RFC 2130</a>, April 1997.
[<a id="ref-RFC2277">RFC2277</a>] Alvestrand, H., "IETF Policy on Character Sets and
Languages", <a href="./rfc2277">RFC 2277</a>, January 1998.
[<a id="ref-RFC2279">RFC2279</a>] Yergeau, F., "UTF-8, a transformation format of ISO
10646", <a href="./rfc2279">RFC 2279</a>, January 1998.
[<a id="ref-RFC2389">RFC2389</a>] Elz, R. and P. Hethmon, "Feature Negotiation Mechanism
for the File Transfer Protocol", <a href="./rfc2389">RFC 2389</a>, August 1998.
[<a id="ref-UNICODE">UNICODE</a>] The Unicode Consortium, "The Unicode Standard - Version
2.0", Addison-Wesley Developers Press, July 1996.
[<a id="ref-UTF-8">UTF-8</a>] ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
Transformation Format 8 (UTF-8).
<span class="h2"><a class="selflink" id="section-9" href="#section-9">9</a> Author's Address</span>
Bill Curtin
JIEO
Attn: JEBBD
Ft. Monmouth, N.J. 07703-5613
EMail: curtinw@ftm.disa.mil
Annex A - Implementation Considerations
<span class="h3"><a class="selflink" id="appendix-A.1" href="#appendix-A.1">A.1</a> General Considerations</span>
- Implementers should ensure that their code accounts for potential
problems, such as using a NULL character to terminate a string or
no longer being able to steal the high order bit for internal use,
when supporting the extended character set.
- Implementers should be aware that there is a chance that pathnames
that are non UTF-8 may be parsed as valid UTF-8. The probabilities
are low for some encodings and statistically zero for others.
A recent non-scientific analysis found that EUC encoded Japanese
words had a 2.7% false reading; SJIS had a 0.0005% false reading;
other encodings such as ASCII or KOI-8 have a 0% false reading. This
probability is highest for short pathnames and decreases as
pathname size increases. Implementers may want to look for signs
that pathnames which parse as UTF-8 are not valid UTF-8, such as
the existence of multiple local character sets in short pathnames.
Hopefully, as more implementations conform to UTF-8 transfer
encoding there will be a smaller need to guess at the encoding.
- Client developers should be aware that it will be possible for
pathnames to contain mixed characters (e.g.
//Latin1DirectoryName/HebrewFileName). They should be prepared to
handle the Bi-directional (BIDI) display of these character sets
(i.e. left to right display for the directory and right to left
display for the filename). While bi-directional display is outside
the scope of this document and more complicated than the above
example, an algorithm for bi-directional display can be found in
the UNICODE 2.0 [<a href="#ref-UNICODE" title=""The Unicode Standard - Version 2.0"">UNICODE</a>] standard. Also note that pathnames can
have different byte ordering yet be logically and display-wise
equivalent due to the insertion of BIDI control characters at
different points during composition. Mixed character sets may also
present problems with font swapping.
- A server that copies pathnames transparently from a local
filesystem may continue to do so. It is then up to the local file
creators to use UTF-8 pathnames.
- Servers can support charset labeling of files and/or directories,
such that different pathnames may have different charsets. The
server should attempt to convert all pathnames to UTF-8, but if it
can't then it should leave that name in its raw form.
- Some servers' operating systems do not mandate a character set,
but allow administrators to configure one in the FTP server. These
servers should be configured to use a particular mapping table (either
external or built-in). This will allow the flexibility of defining
different charsets for different directories.
- If the server's OS does not mandate the character set and the FTP
server cannot be configured, the server should simply use the raw
bytes in the file name. They might be ASCII or UTF-8.
- If the server is a mirror, and wants to look just like the site it
is mirroring, it should store the exact file name bytes that it
received from the main server.
<span class="h3"><a class="selflink" id="appendix-A.2" href="#appendix-A.2">A.2</a> Transition Considerations</span>
- Servers which support this specification, when presented a pathname
from an old client (one which does not support this specification),
can nearly always tell whether the pathname is in UTF-8 (see B.1)
or in some other code set. In order to support these older clients,
servers may wish to default to a non UTF-8 code set. However, how a
server supports non UTF-8 is outside the scope of this
specification.
- Clients which support this specification will be able to determine
if the server can support UTF-8 (i.e. supports this specification)
by the ability of the server to support the FEAT command and the
UTF8 feature (defined in 3.2). If a newer client determines that
the server does not support UTF-8 it may wish to default to a
different code set. Client developers should take into
consideration that pathnames, associated with older servers, might
be stored in UTF-8. However, how a client supports non UTF-8 is
outside the scope of this specification.
- Clients and servers can transition to UTF-8 by either converting
to/from the local encoding, or the users can store UTF-8 filenames.
The former approach is easier on tightly controlled file systems
(e.g. PCs and MACs). The latter approach is easier on more free
form file systems (e.g. Unix).
- For interactive use attention should be focused on user interface
and ease of use. Non-interactive use requires a consistent and
controlled behavior.
- There may be many applications which reference files under their
old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
will cause access to the old URL to fail. A solution may be for the
server to act as if there were two different pathnames associated with
the file. This might be done internal to the server on controlled
file systems or by using symbolic links on free form systems. While
this approach may work for single file transfer non-interactive
use, a non-interactive transfer of all of the files in a directory
will produce duplicates. Interactive users may be presented with
lists of files which are double the actual number of files.
Annex B - Sample Code and Examples
<span class="h3"><a class="selflink" id="appendix-B.1" href="#appendix-B.1">B.1</a> Valid UTF-8 check</span>
The following routine checks if a byte sequence is valid UTF-8. This
is done by checking for the proper tagging of the first and following
bytes to make sure they conform to the UTF-8 format. It then checks
to assure that the data part of the UTF-8 sequence conforms to the
proper range allowed by the encoding. Note: This routine will not
detect characters that have not been assigned and therefore do not
exist.
int utf8_valid(const unsigned char *buf, unsigned int len)
{
  const unsigned char *endbuf = buf + len;
  unsigned char byte2mask=0x00, c;
  int trailing = 0;  // trailing (continuation) bytes to follow

  while (buf != endbuf)
  {
    c = *buf++;
    if (trailing)
    {
      if ((c&0xC0) != 0x80)     // Does trailing byte follow UTF-8 format?
        return 0;
      if (byte2mask)            // Need to check 2nd byte for proper range?
      {
        if (!(c&byte2mask))     // Are appropriate bits set?
          return 0;
        byte2mask=0x00;
      }
      trailing--;
    }
    else if ((c&0x80) == 0x00)  // valid 1 byte UTF-8
      continue;
    else if ((c&0xE0) == 0xC0)  // valid 2 byte UTF-8
    {
      if (!(c&0x1E))            // Is UTF-8 byte in proper range?
        return 0;
      trailing = 1;
    }
    else if ((c&0xF0) == 0xE0)  // valid 3 byte UTF-8
    {
      if (!(c&0x0F))            // Is UTF-8 byte in proper range?
        byte2mask=0x20;         // If not set mask to check next byte
      trailing = 2;
    }
    else if ((c&0xF8) == 0xF0)  // valid 4 byte UTF-8
    {
      if (!(c&0x07))            // Is UTF-8 byte in proper range?
        byte2mask=0x30;         // If not set mask to check next byte
      trailing = 3;
    }
    else if ((c&0xFC) == 0xF8)  // valid 5 byte UTF-8
    {
      if (!(c&0x03))            // Is UTF-8 byte in proper range?
        byte2mask=0x38;         // If not set mask to check next byte
      trailing = 4;
    }
    else if ((c&0xFE) == 0xFC)  // valid 6 byte UTF-8
    {
      if (!(c&0x01))            // Is UTF-8 byte in proper range?
        byte2mask=0x3C;         // If not set mask to check next byte
      trailing = 5;
    }
    else
      return 0;
  }
  return trailing == 0;
}
<span class="h3"><a class="selflink" id="appendix-B.2" href="#appendix-B.2">B.2</a> Conversions</span>
The code examples in this section closely reflect the algorithm in
ISO 10646 and may not present the most efficient solution for
converting to / from UTF-8 encoding. If efficiency is an issue,
implementers should use the appropriate bitwise operators.
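
As one possible reading of "the appropriate bitwise operators", the
sketch below encodes a single UCS-4 value with shifts and masks
instead of division and modulus. It is an illustration only: the name
utf8_encode_one is hypothetical, and like the other routines in this
annex it does not reject values that have not been assigned.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes one UCS-4 value (at most 0x7FFFFFFF) into out, which must
 * have room for 6 bytes. Returns the number of bytes written, or 0 if
 * the value is out of range. */
size_t utf8_encode_one(unsigned long ucs4, unsigned char *out)
{
    size_t len, i;
    unsigned char lead;

    assert(out != NULL);
    if (ucs4 <= 0x7FUL) {                /* ASCII: copied unchanged */
        out[0] = (unsigned char)ucs4;
        return 1;
    }
    else if (ucs4 <= 0x7FFUL)      { len = 2; lead = 0xC0; }
    else if (ucs4 <= 0xFFFFUL)     { len = 3; lead = 0xE0; }
    else if (ucs4 <= 0x1FFFFFUL)   { len = 4; lead = 0xF0; }
    else if (ucs4 <= 0x3FFFFFFUL)  { len = 5; lead = 0xF8; }
    else if (ucs4 <= 0x7FFFFFFFUL) { len = 6; lead = 0xFC; }
    else
        return 0;

    /* Fill the continuation bytes from the end: each carries the next
     * 6 low-order bits, tagged as 10xxxxxx. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (ucs4 & 0x3F));
        ucs4 >>= 6;
    }
    out[0] = (unsigned char)(lead | ucs4);   /* remaining high bits */
    return len;
}
```

Shifting right by 6 is equivalent to dividing by 0x40, and masking
with 0x3F is equivalent to taking the remainder modulo 0x40, so the
results should agree with the arithmetic form used in B.2.1.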
Additional code examples and numerous mapping tables can be found at
the Unicode site, <a href="HTTP://www.unicode.org">HTTP://www.unicode.org</a> or <a href="FTP://unicode.org">FTP://unicode.org</a>.
Note that the conversion examples below assume that the local
character set supported in the operating system is something other
than UCS2/UTF-16. There are some operating systems that already
support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
conversion will be necessary from the local character set to the UCS.
<span class="h4"><a class="selflink" id="appendix-B.2.1" href="#appendix-B.2.1">B.2.1</a> Conversion from Local Character Set to UTF-8</span>
Conversion from the local filesystem character set to UTF-8 will
normally involve a two step process. First convert the local
character set to the UCS; then convert the UCS to UTF-8.
The first step in the process can be performed by maintaining a
mapping table that includes the local character set code and the
corresponding UCS code. For instance the ISO/IEC 8859-8 [<a href="#ref-ISO-8859">ISO-8859</a>]
code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte
ISO/IEC 10646 code is 0x000005D5.
The next step is to convert the UCS character code to the UTF-8
encoding. The following routine can be used to determine and encode
the correct number of bytes based on the UCS-4 character code:
unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
                           ucs4_len, unsigned char *utf8_buf)
{
  const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
  unsigned int utf8_len = 0;            // return value for UTF8 size
  unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer
                                        // to load UTF8 values
  unsigned long c;                      // current UCS-4 value

  while (ucs4_buf != ucs4_endbuf)
  {
    c = *ucs4_buf++;
    if (c <= 0x7F)    // ASCII chars no conversion needed
    {
      *t_utf8_buf++ = (unsigned char) c;
      utf8_len++;
    }
    else if (c <= 0x07FF)    // In the 2 byte utf-8 range
    {
      *t_utf8_buf++ = (unsigned char) (0xC0 + (c/0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
      utf8_len += 2;
    }
    else if (c <= 0xFFFF)    /* In the 3 byte utf-8 range. The
                                values 0x0000FFFE, 0x0000FFFF
                                and 0x0000D800 - 0x0000DFFF do
                                not occur in UCS-4 */
    {
      *t_utf8_buf++ = (unsigned char) (0xE0 + (c/0x1000));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
      utf8_len += 3;
    }
    else if (c <= 0x1FFFFF)    // In the 4 byte utf-8 range
    {
      *t_utf8_buf++ = (unsigned char) (0xF0 + (c/0x040000));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x1000)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
      utf8_len += 4;
    }
    else if (c <= 0x03FFFFFF)    // In the 5 byte utf-8 range
    {
      *t_utf8_buf++ = (unsigned char) (0xF8 + (c/0x01000000));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x040000)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x1000)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
      utf8_len += 5;
    }
    else if (c <= 0x7FFFFFFF)    // In the 6 byte utf-8 range
    {
      *t_utf8_buf++ = (unsigned char) (0xFC + (c/0x40000000));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x01000000)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x040000)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x1000)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
      *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
      utf8_len += 6;
    }
    // Values above 0x7FFFFFFF are not UCS-4 and are skipped
  }
  return (utf8_len);
}
<span class="h4"><a class="selflink" id="appendix-B.2.2" href="#appendix-B.2.2">B.2.2</a> Conversion from UTF-8 to Local Character Set</span>
When moving from UTF-8 encoding to the local character set the
reverse procedure is used. First the UTF-8 encoding is transformed
into the UCS-4 character set. The UCS-4 is then converted to the
local character set from a mapping table (i.e. the opposite of the
table used to form the UCS-4 character code).
To convert from UTF-8 to UCS-4 the free bits (those that do not
define UTF-8 sequence size or signify continuation bytes) in a UTF-8
sequence are concatenated as a bit string. The bits are then
distributed into a four-byte sequence starting from the least
significant bits. Those bits not assigned a bit in the four-byte
sequence are padded with ZERO bits. The following routine converts
the UTF-8 encoding to UCS-4 character codes:
int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
                  unsigned char *utf8_buf)
{
  const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
  unsigned int ucs_len=0;

  // The input is assumed to be well formed (for example, already
  // checked with utf8_valid); truncated sequences are not detected.
  while (utf8_buf != utf8_endbuf)
  {
    if ((*utf8_buf & 0x80) == 0x00)   // ASCII chars no conversion needed
    {
      *ucs4_buf++ = (unsigned long) *utf8_buf;
      utf8_buf++;
      ucs_len++;
    }
    else if ((*utf8_buf & 0xE0) == 0xC0)   // In the 2 byte utf-8 range
    {
      *ucs4_buf++ = ((unsigned long) (utf8_buf[0] - 0xC0) * 0x40)
                  + (unsigned long) (utf8_buf[1] - 0x80);
      utf8_buf += 2;
      ucs_len++;
    }
    else if ((*utf8_buf & 0xF0) == 0xE0)   // In the 3 byte utf-8 range
    {
      *ucs4_buf++ = ((unsigned long) (utf8_buf[0] - 0xE0) * 0x1000)
                  + ((unsigned long) (utf8_buf[1] - 0x80) * 0x40)
                  + (unsigned long) (utf8_buf[2] - 0x80);
      utf8_buf += 3;
      ucs_len++;
    }
    else if ((*utf8_buf & 0xF8) == 0xF0)   // In the 4 byte utf-8 range
    {
      *ucs4_buf++ = ((unsigned long) (utf8_buf[0] - 0xF0) * 0x040000)
                  + ((unsigned long) (utf8_buf[1] - 0x80) * 0x1000)
                  + ((unsigned long) (utf8_buf[2] - 0x80) * 0x40)
                  + (unsigned long) (utf8_buf[3] - 0x80);
      utf8_buf += 4;
      ucs_len++;
    }
    else if ((*utf8_buf & 0xFC) == 0xF8)   // In the 5 byte utf-8 range
    {
      *ucs4_buf++ = ((unsigned long) (utf8_buf[0] - 0xF8) * 0x01000000)
                  + ((unsigned long) (utf8_buf[1] - 0x80) * 0x040000)
                  + ((unsigned long) (utf8_buf[2] - 0x80) * 0x1000)
                  + ((unsigned long) (utf8_buf[3] - 0x80) * 0x40)
                  + (unsigned long) (utf8_buf[4] - 0x80);
      utf8_buf += 5;
      ucs_len++;
    }
    else if ((*utf8_buf & 0xFE) == 0xFC)   // In the 6 byte utf-8 range
    {
      *ucs4_buf++ = ((unsigned long) (utf8_buf[0] - 0xFC) * 0x40000000)
                  + ((unsigned long) (utf8_buf[1] - 0x80) * 0x01000000)
                  + ((unsigned long) (utf8_buf[2] - 0x80) * 0x040000)
                  + ((unsigned long) (utf8_buf[3] - 0x80) * 0x1000)
                  + ((unsigned long) (utf8_buf[4] - 0x80) * 0x40)
                  + (unsigned long) (utf8_buf[5] - 0x80);
      utf8_buf += 6;
      ucs_len++;
    }
    else
      break;   // Not a valid lead byte: stop rather than loop forever
  }
  return (ucs_len);
}
<span class="h4"><a class="selflink" id="appendix-B.2.3" href="#appendix-B.2.3">B.2.3</a> ISO/IEC 8859-8 Example</span>
This example demonstrates mapping the ISO/IEC 8859-8 character set to
UTF-8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew letter
"VAV" is converted from the ISO/IEC 8859-8 character code 0xE5 to the
corresponding 4 byte ISO/IEC 10646 code of 0x000005D5 by a simple
lookup of a conversion/mapping file.
The UCS-4 character code is transformed into UTF-8 using the
ucs4_to_utf8 routine described earlier by:
1. Because the UCS-4 character is between 0x80 and 0x07FF it will map
to a 2 byte UTF-8 sequence.
2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) = 0xD7.
3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) = 0x95.
The UTF-8 encoding is transformed back to UCS-4 by using the
utf8_to_ucs4 routine described earlier by:
1. Because the first byte of the sequence, when the '&' operator with
a value of 0xE0 is applied, will produce 0xC0 (0xD7 & 0xE0 = 0xC0)
the UTF-8 is a 2 byte sequence.
2. The four byte UCS-4 character code is produced by (((0xD7 - 0xC0)
* 0x40) + (0x95 - 0x80)) = 0x000005D5.
Finally, the UCS-4 character code is converted to ISO/IEC 8859-8
character code (using the mapping table which matches ISO/IEC 8859-8
to UCS-4) to produce the original 0xE5 code for the Hebrew letter
"VAV".
<span class="h4"><a class="selflink" id="appendix-B.2.4" href="#appendix-B.2.4">B.2.4</a> Vendor Codepage Example</span>
This example demonstrates the mapping of a codepage to UTF-8 and back
to a vendor codepage. Mapping between vendor codepages can be done in
a very similar manner as described above. For instance both the PC
and Mac codepages reflect the character set from the Thai standard
TIS 620-2533. The character code on both platforms for the Thai
letter "SO SO" is 0xAB. This character can then be mapped into the
UCS-4 by way of a conversion/mapping file to produce the UCS-4 code
of 0x0E0B.
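
For a character set laid out like TIS 620, the conversion/mapping
file can be reduced to a fixed offset. The sketch below assumes the
plain TIS 620 layout (ASCII in the lower half, Thai characters at
0xA1-0xDA and 0xDF-0xFB); the function names and the UCS4_UNMAPPED
marker are hypothetical, and a real vendor codepage may define
additional code points that would need a full table.

```c
#include <assert.h>
#include <stddef.h>

#define UCS4_UNMAPPED 0xFFFFFFFFUL   /* hypothetical "no mapping" marker */

/* Maps one TIS 620 byte to its UCS-4 code, or UCS4_UNMAPPED. */
unsigned long tis620_to_ucs4(unsigned char c)
{
    if (c <= 0x7F)
        return c;                        /* ASCII half is identical */
    if ((c >= 0xA1 && c <= 0xDA) || (c >= 0xDF && c <= 0xFB))
        return 0x0E00UL + (unsigned long)(c - 0xA0);   /* Thai block */
    return UCS4_UNMAPPED;                /* unassigned in TIS 620 */
}

/* Reverse mapping. Returns 1 and stores the byte in *out on success,
 * or 0 if the UCS-4 code has no TIS 620 equivalent. */
int ucs4_to_tis620(unsigned long ucs4, unsigned char *out)
{
    assert(out != NULL);
    if (ucs4 <= 0x7F) {
        *out = (unsigned char)ucs4;
        return 1;
    }
    if ((ucs4 >= 0x0E01 && ucs4 <= 0x0E3A) ||
        (ucs4 >= 0x0E3F && ucs4 <= 0x0E5B)) {
        *out = (unsigned char)(ucs4 - 0x0E00 + 0xA0);
        return 1;
    }
    return 0;
}
```

With this layout the byte 0xAB for "SO SO" maps to 0x0E0B, and
0x0E0B maps back to 0xAB.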
The UCS-4 character code is transformed into UTF-8 using the
ucs4_to_utf8 routine described earlier by:
1. Because the UCS-4 character is between 0x0800 and 0xFFFF it will
map to a 3 byte UTF-8 sequence.
2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000)) = 0xE0.
3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) %
0x40)) = 0xB8.
4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) = 0x8B.
The UTF-8 encoding is transformed back to UCS-4 by using the
utf8_to_ucs4 routine described earlier by:
1. Because the first byte of the sequence, when the '&' operator with
a value of 0xF0 is applied, will produce 0xE0 (0xE0 & 0xF0 = 0xE0)
the UTF-8 is a 3 byte sequence.
2. The four byte UCS-4 character code is produced by (((0xE0 - 0xE0)
* 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B - 0x80)) = 0x00000E0B.
Finally, the UCS-4 character code is converted to either the PC or
MAC codepage character code (using the mapping table which matches
codepage to UCS-4 ) to produce the original 0xAB code for the Thai
letter "SO SO".
<span class="h3"><a class="selflink" id="appendix-B.3" href="#appendix-B.3">B.3</a> Pseudo Code for a High-Quality Translating Server</span>
if utf8_valid(fn)
  {
    attempt to convert fn to the local charset, producing localfn
    if (conversion fails temporarily) return error
    if (conversion succeeds)
    {
      attempt to open localfn
      if (open fails temporarily) return error
      if (open succeeds) return success
    }
  }
attempt to open fn
if (open fails temporarily) return error
if (open succeeds) return success
return permanent error
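
The decision order of the pseudo code can be expressed as a small C
function. This is a sketch under stated assumptions, not a server
implementation: all names (step, reply, translating_open_reply) are
hypothetical, and the outcome of each step is passed in as an
argument so that the ordering can be checked in isolation. A real
server would perform the conversion and the opens lazily, in the
order shown, rather than computing every outcome up front.

```c
/* Outcome of a single step (charset conversion or file open). */
enum step { STEP_OK, STEP_TEMP_FAIL, STEP_PERM_FAIL };

/* Reply the server would give for the request. */
enum reply { REPLY_SUCCESS, REPLY_ERROR, REPLY_PERMANENT_ERROR };

/*
 * fn_is_valid_utf8: nonzero if the received name fn is valid UTF-8
 * convert:          outcome of converting fn to the local charset
 * open_local:       outcome of opening the converted name (localfn)
 * open_raw:         outcome of opening fn exactly as received
 */
enum reply translating_open_reply(int fn_is_valid_utf8,
                                  enum step convert,
                                  enum step open_local,
                                  enum step open_raw)
{
    if (fn_is_valid_utf8) {
        if (convert == STEP_TEMP_FAIL)
            return REPLY_ERROR;
        if (convert == STEP_OK) {
            if (open_local == STEP_TEMP_FAIL)
                return REPLY_ERROR;
            if (open_local == STEP_OK)
                return REPLY_SUCCESS;
        }
    }

    /* Fall back to the name exactly as the client sent it. */
    if (open_raw == STEP_TEMP_FAIL)
        return REPLY_ERROR;
    if (open_raw == STEP_OK)
        return REPLY_SUCCESS;
    return REPLY_PERMANENT_ERROR;
}
```

Note that a permanent failure of the conversion or of the open on
localfn does not end the request. Control falls through to the
attempt on the untranslated name, and only a permanent failure there
produces the permanent error.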
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-27" ></span>
Full Copyright Statement
Copyright (C) The Internet Society (1999). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.
</pre>