## Character Sets {#characterset}

@brief Decoding and Encoding of Character Sets


## Overview

Even though it would be convenient for all DICOM files to store text
in UTF-8, or even ASCII, the reality is that legacy character encodings
are very common.  Also, it is not uncommon to find DICOM files where
the encoding does not conform perfectly with the standard.  As a general
rule, we want to be forgiving when decoding text in DICOM files, but
we want to conform strictly to the standard when encoding text.


## ASCII and ISO-8859

Many of the ISO-8859 character sets are supported (and extended) by
Microsoft code pages, and have become closely associated with those
code pages.  The most important of these is CP1252 (Windows-1252)
which extends ISO-8859-1 (ISO IR 100).  This code page replaces
C1 control codes, which are unused by the ISO-8859 character sets,
with commonly-used characters such as the Euro sign, the OE and oe
ligatures, and left and right quotes (single and double).

### Decoding

Our decoder supports the decoding of these extra code points when the
character set is ISO IR 100.  The rationale is that CP1252 is fully
backwards-compatible with ISO IR 100, and the C1 control codes are not
used by DICOM.  This same rule is applied for ISO IR 148 (via CP1254)
and ISO IR 166 (via CP874). ISO IR 6, however, is decoded strictly
as ASCII.
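
As a sketch of this rule (using Python's standard codecs for
illustration, not the library's own API), a C1 byte that is meaningless
under strict ISO-8859-1 becomes a printable character under the CP1252
superset:

```python
# Bytes labeled ISO IR 100 may contain CP1252-only characters in the
# C1 range (0x80-0x9f), which strict ISO-8859-1 maps to control codes.
raw = b"Caf\xe9 \x80"            # 0xe9 = e-acute, 0x80 = EURO SIGN in CP1252

strict = raw.decode("latin-1")   # strict ISO-8859-1: 0x80 -> C1 control code
lenient = raw.decode("cp1252")   # lenient CP1252: 0x80 -> EURO SIGN

assert strict[5] == "\x80"
assert lenient[5] == "\u20ac"
```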

### Encoding

The extra characters that are supported by the decoder cannot also be
supported by the encoder, since that would create files that do not
conform with the DICOM standard.  Encoding is done strictly according
to the named character set.


## Japanese via ISO 2022

The use of ISO 2022 escape sequences for the encoding of Japanese text,
referred to as "JIS encoding", began in 1978 and is still common (with
restrictions) in the form of iso-2022-jp.  DICOM's encoding of Japanese
is also based on JIS encoding, but with different restrictions.
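
Python's `iso2022_jp` codec (which designates ASCII, not romaji, to G0)
illustrates the escape-sequence mechanism; this is only a sketch of the
wire format, not the library's API:

```python
# ESC $B designates JIS X 0208 to G0; ESC (B switches G0 back to ASCII.
jp = "\u65e5\u672c\u8a9e test".encode("iso2022_jp")   # "日本語 test"

assert jp.startswith(b"\x1b$B")        # kanji section opens with ESC $B
assert jp.endswith(b"\x1b(B test")     # ASCII tail is preceded by ESC (B
```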

DICOM iso-2022 Japanese differs from iso-2022-jp most significantly
in the optional inclusion, in DICOM, of ISO IR 13 in G1.  This is done
to support the use of half-width katakana in the traditional manner as
described in JIS X 0201 and as commonly done in Japanese desktop
computing and industrial computing.

The Specific Character Set defined term "ISO 2022 IR 13" indicates
that the initial state of the decoder will be ISO IR 14 in G0 and
ISO IR 13 in G1.  Furthermore, the decoder returns to the initial
state for every new line of text (that is, after every CR NL).

In this usage, `ESC )I` will re-designate ISO IR 13 to G1 (though it is
not permitted to designate any other character set to G1), and `ESC (J`
will re-designate ISO IR 14 to G0.

### Decoding

Our decoder expands upon the JIS X 0208:1990 repertoire to include extra
characters defined in Microsoft CP932 (including the NEC extension).  If
the defined term ISO 2022 IR 13 is present, then JIS X 0201 is used as
the initial character set.
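
For example (Python's codecs, for illustration only), the NEC extension
row of CP932 contains circled digits that strict JIS X 0208 decoding
rejects:

```python
nec = b"\x87\x40"                       # NEC extension row 13 in CP932

assert nec.decode("cp932") == "\u2460"  # CIRCLED DIGIT ONE

try:
    nec.decode("shift_jis")             # strict JIS X 0208 has no row 13
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
assert not strict_ok
```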

### Encoding

Encoding of ISO-2022 Japanese in DICOM is usually done with the following
defined terms for Specific Character Set:

1. ISO 2022 IR 13\ISO 2022 IR 87
2. \ISO 2022 IR 87
3. \ISO 2022 IR 87\ISO 2022 IR 159

Notes:

1. This generally implies iso-2022-jp, but requires the use of romaji instead
   of ASCII, and also requires the use of JIS X 0208:1990 rather than
   any other version (such as JIS X 0208:1978 or JIS X 0208:1983).  This
   provides exactly the same characters as classic shift-jis, with exactly
   the same method for encoding romaji and half-width katakana, but with a
   different way of encoding the JIS X 0208 characters.
2. This specifies the most widely used subset of iso-2022-jp, using only ASCII
   and JIS X 0208:1990.  This is the best choice for broad compatibility.
3. This specifies the most widely used subset of iso-2022-jp-2, like the above
   but adding JIS X 0212:1990 for additional characters.

It is preferred to avoid the use of ISO 2022 IR 13 (and ISO IR 13), both
because it requires romaji (not ASCII) in G0, and because half-width katakana
are not supported by either iso-2022-jp or iso-2022-jp-2.  The romaji
character set does not contain tilde or backslash, so if these characters
are encoded while romaji has replaced ASCII, they become MACRON and YEN
SIGN respectively.

For (2) and (3) above, when our encoder encounters half-width katakana,
it will convert them to full-width katakana for ISO 2022 IR 87. For (1),
or whenever ISO IR 13 is present, the half-width katakana will be
encoded as-is.
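
The half-width to full-width conversion corresponds to Unicode's
compatibility mappings; NFKC normalization performs the same katakana
widening, shown here for illustration (the library's conversion need not
be implemented via NFKC):

```python
import unicodedata

half = "\uff83\uff7d\uff84"             # half-width katakana "ﾃｽﾄ"
full = unicodedata.normalize("NFKC", half)

assert full == "\u30c6\u30b9\u30c8"     # full-width katakana "テスト"
```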

Regarding the structure of the ISO 2022 encoding, our encoder will always
switch G0 to ASCII (or romaji if ISO IR 13) before any ASCII character,
including any control character except ESC.  This includes SPACE and TAB
as well as CR, NL, and FF.

Of course some tricky situations can arise when converting from UTF-8
to ISO 2022 Japanese, or even when converting from Windows CP932, which
contains characters not present in JIS X 0208 or JIS X 0212.  Firstly,
ISO 2022 IR 13 should only be used if half-width katakana are present
in the original text and cannot be avoided.  Secondly, it is important
to note that some characters might be converted to equivalents at
different code points in order to make them compatible with either
JIS X 0208 or JIS X 0212.

The following characters are considered to be compatible by our encoder,
and will be converted if necessary to fit them into an encoding:

1. MACRON and FULLWIDTH MACRON
2. OVERLINE and FULLWIDTH MACRON
3. YEN SIGN and FULLWIDTH YEN SIGN
4. CENT SIGN and FULLWIDTH CENT SIGN
5. POUND SIGN and FULLWIDTH POUND SIGN
6. MINUS SIGN and FULLWIDTH HYPHEN-MINUS
7. NOT SIGN and FULLWIDTH NOT SIGN
8. EM DASH and HORIZONTAL BAR
9. FULLWIDTH TILDE and WAVE DASH
10. PARALLEL TO and DOUBLE VERTICAL LINE
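
Item 9 is the classic mismatch between the JIS and Microsoft mapping
tables: both place the same character at shift-jis code 0x8160, but map
it to different Unicode code points, so the encoder must treat the pair
as equivalent (Python's codecs shown for illustration, not the library's
API):

```python
# The same 0x8160 cell is WAVE DASH in the JIS tables but
# FULLWIDTH TILDE in Microsoft's CP932 tables.
assert "\u301c".encode("shift_jis") == b"\x81\x60"   # WAVE DASH
assert "\uff5e".encode("cp932") == b"\x81\x60"       # FULLWIDTH TILDE
```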


## Korean via ISO 2022

Outside of DICOM, ISO 2022 escape sequences are generally not used in
encodings of Korean.  In fact, the euc-kr encoding is identical to DICOM's
Korean encoding except that it lacks the escape sequences.  Conversely,
DICOM requires the escape sequence `ESC $)C` at the beginning of every
line of text that contains Korean characters.

### Decoding

Though required by the DICOM standard, the escape sequences are not present
in all Korean DICOM files.  Therefore, our decoder will simply remove the
escape sequence wherever it is found, and decode the text using Microsoft
CP949 (which is backwards compatible with euc-kr, which in turn is identical
to DICOM ISO 2022 IR 149 except for the escape sequences).  As compared to
euc-kr, CP949 contains thousands of additional characters.
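
A minimal sketch of this forgiving rule (a hypothetical `decode_ir149`
helper using Python's codecs, not the library's API): strip any
`ESC $)C` designations and decode the remainder as CP949:

```python
def decode_ir149(raw: bytes) -> str:
    # Drop the G1 designation wherever it appears, then decode as
    # CP949, a superset of euc-kr (and hence of ISO 2022 IR 149).
    return raw.replace(b"\x1b$)C", b"").decode("cp949")

# "안녕" in euc-kr, with the DICOM-required escape sequence
assert decode_ir149(b"\x1b$)C\xbe\xc8\xb3\xe7") == "\uc548\ub155"
# ...and without it, as found in some non-conforming files
assert decode_ir149(b"\xbe\xc8\xb3\xe7") == "\uc548\ub155"
```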

As well, KS X 1001 (and ISO IR 149) permits Hangul to be stored using
8-byte codes rather than 2-byte codes.  Each 8-byte code stores the
character as four 2-byte component codes; this was necessary because
ISO IR 149 does not contain enough code points for all characters used
in Korean.  Our decoder will convert each of these 8-byte codes into a
single Unicode character.

### Encoding

Our ISO 2022 IR 149 encoder places the escape sequence `ESC $)C` at the
front of every line that uses Korean characters (or any other characters
from KS X 1001), as required by DICOM.  Hangul that do not exist in
KS X 1001 are decomposed and stored as 8-byte codes.
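
The per-line escape rule can be sketched as follows (assumptions:
`encode_ir149_line` is a hypothetical helper, Python's `euc_kr` codec
stands in for the KS X 1001 tables, and the 8-byte extension codes are
not handled here):

```python
def encode_ir149_line(text: str) -> bytes:
    # Prefix the ESC $)C designation only when the line actually
    # uses characters from KS X 1001; pure-ASCII lines stay bare.
    body = text.encode("euc_kr")
    if any(b >= 0x80 for b in body):
        return b"\x1b$)C" + body
    return body

assert encode_ir149_line("Hong^Gildong") == b"Hong^Gildong"
assert encode_ir149_line("\uc548") == b"\x1b$)C\xbe\xc8"   # "안"
```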

The following characters are considered to be compatible by our encoder,
due to the various tables that vendors use to map KS X 1001 to Unicode:

1. MIDDLE DOT and KATAKANA MIDDLE DOT
2. SOFT HYPHEN and EN DASH
3. HORIZONTAL BAR and EM DASH
4. PARALLEL TO and DOUBLE VERTICAL LINE
5. TILDE OPERATOR and WAVE DASH
6. FULLWIDTH CENT SIGN and CENT SIGN
7. FULLWIDTH POUND SIGN and POUND SIGN
8. FULLWIDTH YEN SIGN and YEN SIGN
9. FULLWIDTH NOT SIGN and NOT SIGN
10. FULLWIDTH TILDE and SMALL TILDE
11. CIRCLED DOT OPERATOR and FISHEYE


## Chinese via GB18030

The GB18030 encoding was designed to encompass all Unicode code points.
Every GB18030 code point maps to a unique Unicode code point and
vice-versa, so round-trip encoding of all Unicode is possible.  Since
the ASCII backslash code (0x5c) can appear in GB18030-encoded strings
as the second byte of a two-byte character, special consideration is
applied when decoding multi-value data elements to ensure that 0x5c
is only parsed as a separator when it appears as a single-byte character.
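
The backslash rule can be sketched as a scanner that steps over 2-byte
and 4-byte sequences (a hypothetical `split_values` helper, not the
library's implementation):

```python
def split_values(raw: bytes) -> list:
    # Split on 0x5c only when it appears as a single-byte character;
    # 4-byte GB18030 sequences are recognized by a digit second byte.
    parts, cur, i = [], bytearray(), 0
    while i < len(raw):
        b = raw[i]
        if b == 0x5C:                        # backslash lead byte: separator
            parts.append(bytes(cur))
            cur = bytearray()
            i += 1
        elif b < 0x80:                       # other single-byte character
            cur.append(b)
            i += 1
        elif i + 1 < len(raw) and 0x30 <= raw[i + 1] <= 0x39:
            cur.extend(raw[i:i + 4])         # 4-byte sequence
            i += 4
        else:
            cur.extend(raw[i:i + 2])         # 2-byte sequence
            i += 2
    parts.append(bytes(cur))
    return parts

ch = b"\x81\x5c"                 # a 2-byte character whose trail byte is 0x5c
raw = ch + b"\\A"
assert split_values(raw) == [ch, b"A"]       # correct: one separator
assert raw.split(b"\\") != [ch, b"A"]        # naive split breaks the character
```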

### Decoding

Decoding is done strictly according to the GB18030:2022 mapping tables.
Compared to GB18030:2005, 18 character code points that previously mapped
to the PUA (Unicode Private Use Area) are now mapped to valid Unicode
characters, and 6 other character code points that previously mapped to
the PUA have been retired in favor of new (previously unused) code points
for those same six characters.

### Encoding

As with decoding, encoding is done strictly according to the GB18030:2022
tables.  Unlike our encoders for other character sets, our GB18030
encoder does not perform any compatibility conversions.  This is because
every Unicode code point maps to a unique GB18030 code point.


## Chinese via GBK

The GBK character table is a strict subset of GB18030, and does not
encompass all of Unicode.  As with GB18030, special consideration is
needed when scanning GBK strings for backslash, since the second byte
of some two-byte characters can have the value 0x5c.

### Decoding

Decoding is done per the GBK subset of GB18030, with one addition for
compatibility with text that was encoded with Microsoft Code Page 936:
the code 0x80 is mapped to the Euro symbol.

### Encoding

Unlike GB18030, our encoder for GBK includes compatibility mappings
for PUA codes that defined characters in older versions of the GB
conversion tables.  Our encoder does not encode the Euro symbol.
See the section on the GB2312 encoder for compatibility mappings
(all of these are used for GBK except for EM DASH and HORIZONTAL BAR,
since GBK encodes both separately).


## Chinese via ISO 2022 (GB2312)

DICOM's ISO 2022 IR 58 is identical to the popular euc-cn encoding of
GB2312, except that DICOM requires the use of `ESC $)A` at the beginning
of every line containing characters from GB2312.  Unlike GBK and GB18030,
the second byte of a two-byte GB2312 character code can never be
0x5c (ASCII backslash), because the high bit of both bytes is always set.

### Decoding

Decoding is done by removing the `ESC $)A` escape sequence and decoding
as GB2312 in its euc-cn form.  The table used for decoding is a strict
subset of the table defined by GB18030:2022.
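
A sketch of the decode step (Python's `gb2312` codec for illustration,
not the library's API):

```python
raw = b"\x1b$)A\xd6\xd0\xce\xc4"          # ESC $)A + "中文" in euc-cn form
text = raw.replace(b"\x1b$)A", b"").decode("gb2312")

assert text == "\u4e2d\u6587"             # "中文"
```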

### Encoding

Encoding involves adding the `ESC $)A` escape sequence at the beginning
of every line containing Chinese characters, and then encoding as GB2312.

The following encoding equivalencies are used for compatibility with
historical GB2312 to Unicode mapping tables:

1. MIDDLE DOT and KATAKANA MIDDLE DOT
2. EM DASH and HORIZONTAL BAR
3. FULLWIDTH TILDE and WAVE DASH
4. DOUBLE VERTICAL LINE and PARALLEL TO
5. HORIZONTAL ELLIPSIS and MIDLINE HORIZONTAL ELLIPSIS
6. FULLWIDTH CENT SIGN and CENT SIGN
7. FULLWIDTH POUND SIGN and POUND SIGN


## Considerations for UTF-8

When decoding UTF-8 data, there are a small number of important concerns.
Foremost, the UTF-8 sequences must be checked for validity, and each
sequence must decode to a code point between 0x0000 and 0x10FFFF.
UTF-16 surrogate code points must be detected (0xD800 to 0xDFFF),
since these are also invalid in UTF-8.  Finally, any codes in the
Private Use Area (PUA, 0xE000–0xF8FF) might need special consideration.

Our UTF-8 decoder simply re-encodes as UTF-8 after examining the string.
That is, the decoder and the encoder are one and the same, and its
only purpose is to check for invalid UTF-8 byte sequences and invalid
Unicode code points.  The PUA code points are simply passed through,
as are any unassigned code points.
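
A minimal sketch of such a check (not the library's implementation):
Python's strict UTF-8 decoder already rejects malformed sequences,
overlong forms, surrogate code points, and values above 0x10FFFF, so
validation reduces to a strict decode:

```python
def is_valid_utf8(raw: bytes) -> bool:
    # A strict decode enforces well-formedness, the 0x10FFFF limit,
    # and the exclusion of UTF-16 surrogates (0xD800-0xDFFF).
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert is_valid_utf8("caf\u00e9".encode("utf-8"))
assert not is_valid_utf8(b"\xed\xa0\x80")   # encoded surrogate U+D800
assert not is_valid_utf8(b"\xc0\xaf")       # overlong encoding of "/"
```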