1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333
|
<pre>Network Working Group D. Goldsmith
Request for Comments: 1641 M. Davis
Category: Experimental Taligent, Inc.
July 1994
<span class="h1">Using Unicode with MIME</span>
Status of this Memo
This memo defines an Experimental Protocol for the Internet
community. This memo does not specify an Internet standard of any
kind. Distribution of this memo is unlimited.
Abstract
The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E)
jointly define a 16 bit character set (hereafter referred to as
Unicode) which encompasses most of the world's writing systems.
However, Internet mail (STD 11, <a href="./rfc822">RFC 822</a>) currently supports only 7-
bit US ASCII as a character set. MIME (<a href="./rfc1521">RFC 1521</a> and <a href="./rfc1522">RFC 1522</a>) extends
Internet mail to support different media types and character sets,
and thus could support Unicode in mail messages. MIME neither defines
Unicode as a permitted character set nor specifies how it would be
encoded, although it does provide for the registration of additional
character sets over time.
This document specifies the usage of Unicode within MIME.
Motivation
Since Unicode is starting to see widespread commercial adoption,
users will want a way to transmit information in this character set
in mail messages and other Internet media. Since MIME was expressly
designed to allow such extensions and is on the standards track for
the Internet, it is the most appropriate means for encoding Unicode.
<a href="./rfc1521">RFC 1521</a> and <a href="./rfc1522">RFC 1522</a> do not define Unicode as an allowed character
set, but allow registration of additional character sets.
In addition to allowing use of Unicode within MIME bodies, another
goal is to specify a way of using Unicode that allows text which
consists largely, but not entirely, of US-ASCII characters to be
represented in a way that can be read by mail clients who do not
understand Unicode. This is in keeping with the philosophy of MIME.
Such an encoding is described in another document, "UTF-7: A Mail
Safe Transformation Format of Unicode" [<a href="#ref-UTF-7" title=""UTF-7: A Mail Safe Transformation Format of Unicode"">UTF-7</a>].
<span class="grey">Goldsmith & Davis [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc1641">RFC 1641</a> Using Unicode with MIME July 1994</span>
Overview
Several ways of using Unicode are possible. This document specifies
both guidelines for use of Unicode within MIME, and a specific usage.
The usage specified in this document is a straightforward use of
Unicode as specified in "The Unicode Standard, Version 1.1".
This encoding is intended for situations where sender and recipient
do not want to do a lot of processing, when the text does not consist
primarily of characters from the US-ASCII character set, or when
sender and receiver are known in advance to support Unicode.
Another encoding is intended for situations where the text consists
primarily of US-ASCII, with occasional characters from other parts of
Unicode. This encoding allows the US-ASCII portion to be read by all
recipients without having to support Unicode. This encoding is
specified in another document, "UTF-7: A Mail Safe Transformation
Format of Unicode" [<a href="#ref-UTF-7" title=""UTF-7: A Mail Safe Transformation Format of Unicode"">UTF-7</a>].
Finally, in keeping with the principles set forth in <a href="./rfc1521">RFC 1521</a>, text
which can be represented using the US-ASCII or ISO-8859-x character
sets should be so represented where possible, for maximum
interoperability.
Definitions
The definition of character set Unicode:
The 16 bit character set Unicode is defined by "The Unicode
Standard, Version 1.1". This character set is identical with the
character repertoire and coding of the international standard
ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
Subset=300; Implementation Level=3.
Note. Unicode 1.1 further specifies the use and interaction of
these character codes beyond the ISO standard. However, any valid
10646 BMP (Basic Multilingual Plane) sequence is a valid Unicode
sequence, and vice versa; Unicode supplies interpretations of
sequences on which the ISO standard is silent as to
interpretation.
This character set is encoded as sequences of octets, two per 16-
bit character, with the most significant octet first. Text with an
odd number of octets is ill-formed.
Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters
in the UCS-2 form are serialized as octets, that the most
significant octet appear first. This is also in keeping with
<span class="grey">Goldsmith & Davis [Page 2]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
<span class="grey"><a href="./rfc1641">RFC 1641</a> Using Unicode with MIME July 1994</span>
common network practice of choosing a canonical format for
transmission.
General Specification of Unicode Character Sets Within MIME
The Unicode Standard is currently at version 1.1. Although new
versions should be compatible with old implementations if an
implementation is compliant with the standard, some implementations
may choose to check the version of the character set that is being
used. In order to allow some implementations to check the version
number and allow others to ignore it, all registrations of Unicode
variants and versions for MIME usage should have MIME charset names
which conform to one of the two following patterns:
UNICODE-major-minor
UNICODE-major-minor-variant
Where major and minor are strings of decimal digits (0 through 9)
specifying the major and minor version number of the Unicode standard
to which the text in question conforms. In the interests of
interoperability, the lowest version number compatible with the text
should be used. The lowest acceptable version number is UNICODE-1-1,
corresponding to "The Unicode Standard, Version 1.1". The optional
trailing string "variant" describes the particular transformation
format of Unicode that the registration describes; its content is up
to the particular registration. If there is no trailing variant
string, the charset name refers to the basic two octet form of
Unicode as described in "The Unicode Standard".
Example. A hypothetical charset which referred to the UTF-8
transformation format of Unicode/10646 (also known as UTF-2 or UTF-
FSS) might be named UNICODE-1-1-UTF-8.
Encoding Character Set Unicode Within MIME
Character set Unicode uses 16 bit characters, and therefore would
normally be used with the Binary or Base64 content transfer encodings
of MIME. In header fields, it would normally be used with the B
content transfer encoding. The MIME character set identifier is
UNICODE-1-1.
Example. Here is a text portion of a MIME message containing the
Japanese word "nihongo" (hexadecimal 65E5,672C,8A9E) written in Han
characters.
Content-Type: text/plain; charset=UNICODE-1-1
Content-Transfer-Encoding: base64
<span class="grey">Goldsmith & Davis [Page 3]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
<span class="grey"><a href="./rfc1641">RFC 1641</a> Using Unicode with MIME July 1994</span>
ZeVnLIqe
Example. Here is a text portion of a MIME message containing the
Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (hexadecimal
0041,2262,0391,002E)
Content-Type: text/plain; charset=UNICODE-1-1
Content-Transfer-Encoding: base64
AEEiYgORAC4=
Acknowledgements
Many thanks to the following people for their contributions,
comments, and suggestions. If we have omitted anyone it was through
oversight and not intentionally.
Glenn Adams
Harald T. Alvestrand
Nathaniel Borenstein
Lee Collins
Jim Conklin
Dave Crocker
Steve Dorner
Dana S. Emery
Ned Freed
John H. Jenkins
John C. Klensin
Valdis Kletnieks
Keith Moore
Masataka Ohta
Einar Stefferud
Security Considerations
Security issues are not discussed in this memo.
References
[UNICODE 1.1] "The Unicode Standard, Version 1.1": Version 1.0, Volume 1
(ISBN 0-201-56788-1), Version 1.0, Volume 2 (ISBN 0-201-
60845-6), and "Unicode Technical Report #4, The Unicode
Standard, Version 1.1" (available from The Unicode
Consortium, and soon to be published by Addison-Wesley).
[ISO 10646] ISO/IEC 10646-1:1993(E) Information Technology--Universal
Multiple-octet Coded Character Set (UCS).
<span class="grey">Goldsmith & Davis [Page 4]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
<span class="grey"><a href="./rfc1641">RFC 1641</a> Using Unicode with MIME July 1994</span>
[<a id="ref-UTF-7">UTF-7</a>] Goldsmith, D., and M. Davis, "UTF-7: A Mail Safe
Transformation Format of Unicode", <a href="./rfc1642">RFC 1642</a>, Taligent,
Inc., July 1994.
[<a id="ref-US-ASCII">US-ASCII</a>] Coded Character Set--7-bit American Standard Code for
Information Interchange, ANSI X3.4-1986.
[<a id="ref-ISO-8859">ISO-8859</a>] Information Processing -- 8-bit Single-Byte Coded Graphic
Character Sets -- Part 1: Latin Alphabet No. 1, ISO 8859-
1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2, 1987.
Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4:
Latin alphabet No. 4, ISO 8859-4, 1988. Part 5:
Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6:
Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7:
Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin
alphabet No. 5, ISO 8859-9, 1990.
[<a id="ref-RFC822">RFC822</a>] Crocker, D., "Standard for the Format of ARPA Internet
Text Messages", STD 11, <a href="./rfc822">RFC 822</a>, UDEL, August 1982.
[<a id="ref-RFC-1521">RFC-1521</a>] Borenstein N., and N. Freed, "MIME (Multipurpose Internet
Mail Extensions) Part One: Mechanisms for Specifying and
Describing the Format of Internet Message Bodies", <a href="./rfc1521">RFC</a>
<a href="./rfc1521">1521</a>, Bellcore, Innosoft, September 1993.
[<a id="ref-RFC-1522">RFC-1522</a>] Moore, K., "Representation of Non-Ascii Text in Internet
Message Headers" <a href="./rfc1522">RFC 1522</a>, University of Tennessee,
September 1993.
[<a id="ref-UTF-8">UTF-8</a>] X/Open Company Ltd., "File System Safe UCS Transformation
Format (FSS_UTF)", X/Open Preliminary Specification,
Document Number: P316. This information also appears in
Unicode Technical Report #4, and in a forthcoming annex to
ISO/IEC 10646.
<span class="grey">Goldsmith & Davis [Page 5]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
<span class="grey"><a href="./rfc1641">RFC 1641</a> Using Unicode with MIME July 1994</span>
Authors' Addresses
David Goldsmith
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA 95014-2233
Phone: 408-777-5225
Fax: 408-777-5081
EMail: david_goldsmith@taligent.com
Mark Davis
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA 95014-2233
Phone: 408-777-5116
Fax: 408-777-5081
EMail: mark_davis@taligent.com
Goldsmith & Davis [Page 6]
</pre>
|