<pre>Network Working Group                                            M. Ohta
Request For Comments: 1815                 Tokyo Institute of Technology
Category: Informational                                        July 1995
<span class="h1">Character Sets ISO-10646 and ISO-10646-J-1</span>
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Abstract
Though the ISO character set standard ISO 10646 is specified
reasonably well for European characters, it is not so useful in a
fully internationalized environment.
For the practical use of ISO 10646, a lot of external profiling such
as restriction of characters, restriction of combination of
characters and addition of language information is necessary.
This memo provides information on such profiling, along with charset
names for each profiled instance.
Though every effort was made to make the resulting charsets as useful
as 10646 based charsets can be, the result is not so good. So, the
charsets defined in this memo are for reference purposes only, and
their use for practical purposes is strongly discouraged.
Introduction
This memo describes two text encoding schemes based on ISO 10646
[<a href="#ref-10646" title=""Universal Multiple-Octet Coded Character Set (UCS)"">10646</a>].
As ISO 10646 specifies too little about how text is visualized,
practical use of ISO 10646 requires restricting the standard
minimally and then adding some amount of profiling information.
For ISO 2022 [<a href="#ref-ISO2022" title=""Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques"">ISO2022</a>] based national standards, sufficient profiling
information is provided by national standardization bodies, but, for
ISO 10646, such profiling is not yet provided.
As the profiling of ISO 10646 largely affects which character or
combination of characters could be properly displayed, changes of
profiling of ISO 10646 are as significant as additions of new
character sets of ISO 2022.
<span class="grey">M. Ohta Informational [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc1815">RFC 1815</a> Character Sets ISO-10646 and ISO-10646-J-1 July 1995</span>
That is, it's impractical to support the entirety of ISO 10646 (new
restriction or profiling can always be added), so a client needs to
know whether some restriction or profiling is being used before it
can decide whether to display the body part. Thus, it is necessary to
provide a distinct charset name for each variation of ISO 10646.
For example, in Japan with Japanese Windows NT, only those Han
characters already supported by MS Kanji code (mostly equivalent to
JIS X 0208 [<a href="#ref-JISX0208" title=""Code of the Japanese graphic character set for information interchange"">JISX0208</a>]) can be displayed, because no other font
pattern is commonly provided.
The other problem of ISO 10646 for Han characters is that, to display
them in the quality required for daily plain text processing in
China/Japan/Korea, it is necessary to add profiling information on
which of Chinese/Japanese/Korean the text is using. It should be
noted that this requirement makes multilingual text mixing
Chinese/Japanese/Korean impractical with ISO 10646.
Also, just as [<a href="./rfc1521" title=""MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies"">RFC1521</a>] was unclear about how bi-directionality
should be supported with "ISO-8859-6" and "ISO-8859-8" which was
corrected by [<a href="./rfc1556" title=""Handling of Bi-directional Texts in MIME"">RFC1556</a>], it is also unclear how bi-directionality
could be supported with ISO 10646. There are too many ways to
support bi-directionality. So, until some bi-directionality
mechanism(s) become widely supported, it is necessary to exclude
characters for languages which require bi-directionality support
from the minimal variation. It should be noted that, though ISO
10646 is intended to be free from long term states, save for some
profiling information, introduction of bi-directionality with ISO
10646 does require long term states.
Combining characters also cause problems. In many countries where
combining characters based on [<a href="#ref-ISO2022" title=""Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques"">ISO2022</a>] are used, there are
restrictions on how combining characters are ordered [<a href="#ref-TIS">TIS</a>]. Without
such restrictions, the result of combination is completely
meaningless, which is the current state of ISO 10646. That is, if
some combination is allowed in one implementation while another does
not support it, communication between them is difficult unless ISO
10646 is profiled to be the least common set of widely supported
combinations. So, again, until combination restrictions are
developed for each language, it is necessary to exclude characters
for such languages from the minimal variation.
Conjoining characters, too, may or may not be supported, which
requires further profiling.
Based on these considerations, this memo defines two variations
of ISO 10646. They are "ISO-10646" as the minimal basic variation and
"ISO-10646-J-1" as the variation which could be useful in Japan.
Finally, this memo by no means promotes the use of ISO 10646 on the
Internet. Its use is strongly discouraged when there are other
charsets which can encode the same information. Families of ISO 10646
based charsets, like ISO 2022 based charsets, only form a set of
mutually incompatible encoding systems and, unlike ISO 2022 based
charsets [<a href="#ref-2022INT" title=""draft-ohta-text-encoding-*.txt"">2022INT</a>], they can not be merged together into a single
world wide charset.
Description of "ISO-10646"
ISO-10646 is profiled to be the most basic part of the family of
encodings based on ISO 10646 and contains the following minimal
graphic characters:
collection number and name        positions  further restriction
------------------------------------------------------------------
1  BASIC LATIN                    0020-007E
2  LATIN-1 SUPPLEMENT             00A0-00FF
C0 and C1 control characters may also be used as specified in
<a href="#section-16">section 16</a> of ISO 10646.
Text labeled "ISO-10646" is encoded in 16 bit big endian form.
As no combining characters are included, "ISO-10646" can be used with
applications at implementation level 1.
Left-to-right directionality should be used.
The encoding is implemented by Windows/NT.
For practical communication, use of "ISO-10646" is discouraged.
"ISO-8859-1" [<a href="./rfc1345" title=""Character Mnemonics & Character Sets"">RFC1345</a>] should be used instead.
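The restrictions above can be sketched in code. The following is a
minimal illustration only, not part of this memo: the names
ALLOWED_RANGES and encode_iso_10646 are hypothetical, and it assumes
that every C0 (0000-001F) and C1 (0080-009F) control character is
acceptable, which section 16 of ISO 10646 may restrict further.

```python
# Sketch of an encoder for the "ISO-10646" profile described above.
# ALLOWED_RANGES and encode_iso_10646 are illustrative names only.

ALLOWED_RANGES = (
    range(0x0000, 0x0020),  # C0 control characters (assumed all allowed)
    range(0x0020, 0x007F),  # collection 1, BASIC LATIN (0020-007E)
    range(0x0080, 0x00A0),  # C1 control characters (assumed all allowed)
    range(0x00A0, 0x0100),  # collection 2, LATIN-1 SUPPLEMENT (00A0-00FF)
)


def encode_iso_10646(text):
    """Return text as 16 bit big endian octets, or raise ValueError."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if not any(code in r for r in ALLOWED_RANGES):
            raise ValueError("U+%04X is outside the ISO-10646 profile" % code)
        out += code.to_bytes(2, "big")  # high octet first
    return bytes(out)
```

Each character becomes exactly two octets, so the output is twice as
long as an "ISO-8859-1" encoding of the same text.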
Description of "ISO-10646-J-1"
ISO-10646-J-1 is profiled to be useful for Japanese PC users who use
the Japanese version of Windows/NT and contains the following graphic
characters:
collection number and name        positions  further restrictions
------------------------------------------------------------------
1  BASIC LATIN                    0020-007E
2  LATIN-1 SUPPLEMENT             00A0-00FF
8  BASIC GREEK                    0370-03CF
10 CYRILLIC                       0400-04FF
32 GENERAL PUNCTUATION            2000-206F  See note 1, below.
39 MATHEMATICAL OPERATORS         2200-22FF  See note 1, below.
44 BOX DRAWING                    2500-257F
49 CJK SYMBOLS AND PUNCTUATION    3000-303F  See note 1, below.
50 HIRAGANA                       3040-309F
51 KATAKANA                       30A0-30FF
60 CJK UNIFIED IDEOGRAPHS         4E00-9FFF  See note 1, below.
62 CJK COMPATIBILITY IDEOGRAPHS   F900-FAFF  See note 1, below.
66 CJK COMPATIBILITY FORMS        FE30-FE4F
69 HALFWIDTH AND FULLWIDTH FORMS  FF00-FFEF
Note 1: Most of the characters are excluded. That is, only those
characters of JIS X 0208 [<a href="#ref-JISX0208" title=""Code of the Japanese graphic character set for information interchange"">JISX0208</a>] are included. The reason is that
the Japanese version of Windows/NT has fonts for them only and most
of the users can not read messages which contain other characters.
C0 and C1 control characters may also be used as specified in
<a href="#section-16">section 16</a> of ISO 10646.
Text labeled "ISO-10646-J-1" is encoded in 16 bit big endian form.
Shapes of Han characters should be of Japanese Han, that is, those of
column "J" in <a href="#section-26">section 26</a> of ISO 10646.
As no combining characters are included, "ISO-10646-J-1" can be used
with applications at implementation level 1.
Characters in "HALFWIDTH AND FULLWIDTH FORMS" compare as different
characters from their normal width counterparts.
When text is displayed horizontally, left-to-right directionality
should be used.
For practical communication, use of "ISO-10646-J-1" is discouraged.
"ISO-2022-JP" [<a href="#ref-2022JP" title=""Japanese Character Encoding for Internet Messages"">2022JP</a>] should be used instead.
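As a rough illustration of the repertoire above, the sketch below
tests whether a single graphic character falls inside
"ISO-10646-J-1". It is an assumption-laden outline, not part of this
memo: J1_COLLECTIONS and in_j1_repertoire are hypothetical names,
control characters are not handled, and the JIS X 0208 membership
test required by note 1 is left to a caller-supplied predicate
(in_jis_x_0208) because no mapping table is given here.

```python
# (first, last, needs_note_1) for each collection of "ISO-10646-J-1".
# needs_note_1 marks the rows carrying "See note 1" in the table above.
J1_COLLECTIONS = (
    (0x0020, 0x007E, False),  # 1  BASIC LATIN
    (0x00A0, 0x00FF, False),  # 2  LATIN-1 SUPPLEMENT
    (0x0370, 0x03CF, False),  # 8  BASIC GREEK
    (0x0400, 0x04FF, False),  # 10 CYRILLIC
    (0x2000, 0x206F, True),   # 32 GENERAL PUNCTUATION
    (0x2200, 0x22FF, True),   # 39 MATHEMATICAL OPERATORS
    (0x2500, 0x257F, False),  # 44 BOX DRAWING
    (0x3000, 0x303F, True),   # 49 CJK SYMBOLS AND PUNCTUATION
    (0x3040, 0x309F, False),  # 50 HIRAGANA
    (0x30A0, 0x30FF, False),  # 51 KATAKANA
    (0x4E00, 0x9FFF, True),   # 60 CJK UNIFIED IDEOGRAPHS
    (0xF900, 0xFAFF, True),   # 62 CJK COMPATIBILITY IDEOGRAPHS
    (0xFE30, 0xFE4F, False),  # 66 CJK COMPATIBILITY FORMS
    (0xFF00, 0xFFEF, False),  # 69 HALFWIDTH AND FULLWIDTH FORMS
)


def in_j1_repertoire(ch, in_jis_x_0208):
    """Return True if the graphic character ch is in the profile.

    in_jis_x_0208 is a hypothetical caller-supplied function that
    returns True when a character belongs to JIS X 0208.
    """
    code = ord(ch)
    for first, last, needs_note_1 in J1_COLLECTIONS:
        if code in range(first, last + 1):
            return in_jis_x_0208(ch) if needs_note_1 else True
    return False
```

Keeping the note 1 test as a separate predicate means the range table
stays a direct transcription of the table in this section.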
MIME Considerations
The names given to the character encoding methods described in this
memo are, respectively, "ISO-10646" and "ISO-10646-J-1". These names
are intended to be used in MIME messages as follows:
Content-Type: text/plain; charset=iso-10646
The ISO-10646 and ISO-10646-J-1 encodings are in 16 bit form, so it
is often necessary to use a Content-Transfer-Encoding header. Base64
is a suitable choice.
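A message body using these headers might be assembled as in the
sketch below. This is only an illustration under stated assumptions:
build_message is a hypothetical name, the text is assumed to have
already been checked against the chosen profile, and Python's
"utf-16-be" codec is used because, for the characters of these
profiles, it produces the same 16 bit big endian octets with no byte
order mark.

```python
import base64


def build_message(text, charset="iso-10646"):
    """Return headers and a base64 body for 16 bit big endian text."""
    octets = text.encode("utf-16-be")  # two octets per character here
    # encodebytes wraps its output into lines of at most 76 characters.
    body = base64.encodebytes(octets).decode("ascii")
    headers = (
        "MIME-Version: 1.0\r\n"
        "Content-Type: text/plain; charset=%s\r\n"
        "Content-Transfer-Encoding: base64\r\n" % charset
    )
    return headers + "\r\n" + body.replace("\n", "\r\n")
```

Base64 is used here because 16 bit text contains zero octets, which
can not be carried in a 7 bit message body as is.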
ISO-10646 and ISO-10646-J-1 may also be used in MIME Part 2
headers [<a href="./rfc1522" title=""MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text"">RFC1522</a>]. The "B" encoding should be used with them.
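For header use, the "B" encoding wraps the base64 form of the octets
in an encoded-word. The sketch below is illustrative only:
encoded_word is a hypothetical name, the text is assumed to be within
the chosen profile, and no attempt is made to split the result to
respect the 75 character limit that RFC 1522 places on a single
encoded-word.

```python
import base64


def encoded_word(text, charset="iso-10646"):
    """Return text as one RFC 1522 "B" encoded-word (not length checked)."""
    octets = text.encode("utf-16-be")  # 16 bit big endian, no byte order mark
    payload = base64.b64encode(octets).decode("ascii")
    return "=?%s?B?%s?=" % (charset, payload)
```

A caller would place the result in a header field such as Subject,
splitting long text across several encoded-words where needed.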
References
[<a id="ref-10646">10646</a>] International Organization for Standardization (ISO),
"Universal Multiple-Octet Coded Character Set (UCS)",
International Standard, Ref. No. ISO/IEC 10646-1:1993
(E).
[<a id="ref-2022INT">2022INT</a>] (An Internet Draft "<a href="./draft-ohta-text-encoding">draft-ohta-text-encoding</a>-*.txt" may
be available).
[<a id="ref-2022JP">2022JP</a>] Murai, J., Crispin, M., and E. van der Poel, "Japanese
Character Encoding for Internet Messages", <a href="./rfc1468">RFC 1468</a>, June
1993.
[<a id="ref-ISO2022">ISO2022</a>] International Organization for Standardization (ISO),
"Information processing -- ISO 7-bit and 8-bit coded
character sets -- Code extension techniques",
International Standard, Ref. No. ISO 2022-1986 (E).
[<a id="ref-JISX0208">JISX0208</a>] Japanese Standards Association, "Code of the Japanese
graphic character set for information interchange", JIS X
0208-1990.
[<a id="ref-RFC1345">RFC1345</a>] Simonsen, K., "Character Mnemonics & Character Sets",
<a href="./rfc1345">RFC 1345</a>, Rationel Almen Planlaegning, June 1992.
[<a id="ref-RFC1521">RFC1521</a>] Borenstein, N., and Freed, N., "MIME (Multipurpose
Internet Mail Extensions) Part One: Mechanisms for
Specifying and Describing the Format of Internet Message
Bodies", <a href="./rfc1521">RFC 1521</a>, September 1993.
[<a id="ref-RFC1522">RFC1522</a>] Moore, K., "MIME (Multipurpose Internet Mail Extensions)
Part Two: Message Header Extensions for Non-ASCII Text",
<a href="./rfc1522">RFC 1522</a>, September 1993.
[<a id="ref-RFC1556">RFC1556</a>] Nussbacher, H., "Handling of Bi-directional Texts in
MIME", <a href="./rfc1556">RFC 1556</a>, Israeli Inter-University Computer Center,
December 1993.
[<a id="ref-TIS">TIS</a>] Thai Industrial Standard for Thai Character Code for
Computer, TIS 620-2533:1990.
Security Considerations
Security issues are not discussed in this memo.
Author's Address
Masataka Ohta
Tokyo Institute of Technology
2-12-1, O-okayama, Meguro-ku,
Tokyo 152, JAPAN
Phone: +81-3-5499-7084
Fax: +81-3-3729-1940
EMail: mohta@cc.titech.ac.jp
</pre>