1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243
|
'\"
'\" Copyright (c) 1998 Scriptics Corporation.
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
.TH encoding n "8.1" Tcl "Tcl Built-In Commands"
.so man.macros
.BS
.SH NAME
encoding \- Manipulate encodings
.SH SYNOPSIS
\fBencoding \fIoption\fR ?\fIarg arg ...\fR?
.BE
.SH INTRODUCTION
.PP
Strings in Tcl are logically a sequence of Unicode characters.
These strings are represented in memory as a sequence of bytes that
may be in one of several encodings: modified UTF\-8 (which uses 1 to 4
bytes per character), or a custom encoding start as 8 bit binary data.
.PP
Different operating system interfaces or applications may generate
strings in other encodings such as Shift\-JIS. The \fBencoding\fR
command helps to bridge the gap between Unicode and these other
formats.
.SH DESCRIPTION
.PP
Performs one of several encoding related operations, depending on
\fIoption\fR. The legal \fIoption\fRs are:
.\" METHOD: convertfrom
.TP
\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
.VS "TCL8.7 TIP607, TIP656"
.TP
\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
.VE "TCL8.7 TIP607, TIP656"
.
Converts \fIdata\fR, which should be in binary string encoded as per
\fIencoding\fR, to a Tcl string. If \fIencoding\fR is not specified, the current
system encoding is used.
.PP
.VS "TCL8.7 TIP607, TIP656"
The \fB-profile\fR option determines the command behavior in the presence
of conversion errors. See the \fBPROFILES\fR section below for details. Any premature
termination of processing due to errors is reported through an exception if
the \fB-failindex\fR option is not specified.
.PP
If the \fB-failindex\fR is specified, instead of an exception being raised
on premature termination, the result of the conversion up to the point of the
error is returned as the result of the command. In addition, the index
of the source byte triggering the error is stored in \fBvar\fR. If no
errors are encountered, the entire result of the conversion is returned and
the value \fB-1\fR is stored in \fBvar\fR.
.VE "TCL8.7 TIP607, TIP656"
.\" METHOD: convertto
.TP
\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR
.TP
\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
.
Convert \fIstring\fR to the specified \fIencoding\fR. The result is a Tcl binary
string that contains the sequence of bytes representing the converted string in
the specified encoding. If \fIencoding\fR is not specified, the current system
encoding is used.
.PP
.VS "TCL8.7 TIP607, TIP656"
The \fB-profile\fR and \fB-failindex\fR options have the same effect as
described for the \fBencoding convertfrom\fR command.
.VE "TCL8.7 TIP607, TIP656"
.\" METHOD: dirs
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
.
Tcl can load encoding data files from the file system that describe
additional encodings for it to work with. This command sets the search
path for \fB*.enc\fR encoding data files to the list of directories
\fIdirectoryList\fR. If \fIdirectoryList\fR is omitted then the
command returns the current list of directories that make up the
search path. It is an error for \fIdirectoryList\fR to not be a valid
list. If, when a search for an encoding data file is happening, an
element in \fIdirectoryList\fR does not refer to a readable,
searchable directory, that element is ignored.
.\" METHOD: names
.TP
\fBencoding names\fR
.
Returns a list containing the names of all of the encodings that are
currently available.
The encodings
.QW utf-8
and
.QW iso8859-1
are guaranteed to be present in the list.
.\" METHOD: profiles
.TP
\fBencoding profiles\fR
.VS "TCL8.7 TIP656"
Returns a list of the names of encoding profiles. See \fBPROFILES\fR below.
.VE "TCL8.7 TIP656"
.\" METHOD: system
.TP
\fBencoding system\fR ?\fIencoding\fR?
.
Set the system encoding to \fIencoding\fR. If \fIencoding\fR is
omitted then the command returns the current system encoding. The
system encoding is used whenever Tcl passes strings to system calls.
.\" Do not put .VS on whole section as that messes up the bullet list alignment
.SH PROFILES
.PP
.VS "TCL8.7 TIP656"
Operations involving encoding transforms may encounter several types of
errors such as invalid sequences in the source data, characters that
cannot be encoded in the target encoding and so on.
A \fIprofile\fR prescribes the strategy for dealing with such errors
in one of two ways:
.VE "TCL8.7 TIP656"
.
.IP \(bu
.VS "TCL8.7 TIP656"
Terminating further processing of the source data. The profile does not
determine how this premature termination is conveyed to the caller. By default,
this is signalled by raising an exception. If the \fB-failindex\fR option
is specified, errors are reported through that mechanism.
.VE "TCL8.7 TIP656"
.IP \(bu
.VS "TCL8.7 TIP656"
Continue further processing of the source data using a fallback strategy such
as replacing or discarding the offending bytes in a profile-defined manner.
.VE "TCL8.7 TIP656"
.PP
The following profiles are currently implemented with \fBstrict\fR being
the default if the \fB-profile\fR is not specified.
.VS "TCL8.7 TIP656"
.TP
\fBstrict\fR
.
The \fBstrict\fR profile always stops processing when an conversion error is
encountered. The error is signalled via an exception or the \fB-failindex\fR
option mechanism. The \fBstrict\fR profile implements a Unicode standard
conformant behavior.
.TP
\fBtcl8\fR
.
The \fBtcl8\fR profile always follows the first strategy above and corresponds
to the behavior of encoding transforms in Tcl 8.6. When converting from an
external encoding \fBother than utf-8\fR to Tcl strings with the \fBencoding
convertfrom\fR command, invalid bytes are mapped to their numerically equivalent
code points. For example, the byte 0x80 which is invalid in ASCII would be
mapped to code point U+0080. When converting from \fButf-8\fR, invalid bytes
that are defined in CP1252 are mapped to their Unicode equivalents while those
that are not fall back to the numerical equivalents. For example, byte 0x80 is
defined by CP1252 and is therefore mapped to its Unicode equivalent U+20AC while
byte 0x81 which is not defined by CP1252 is mapped to U+0081. As an additional
special case, the sequence 0xC0 0x80 is mapped to U+0000.
When converting from Tcl strings to an external encoding format using
\fBencoding convertto\fR, characters that cannot be represented in the
target encoding are replaced by an encoding-dependent character, usually
the question mark \fB?\fR.
.TP
\fBreplace\fR
.
Like the \fBtcl8\fR profile, the \fBreplace\fR profile always continues
processing on conversion errors but follows a Unicode standard conformant
method for substitution of invalid source data.
When converting an encoded byte sequence to a Tcl string using
\fBencoding convertfrom\fR, invalid bytes
are replaced by the U+FFFD REPLACEMENT CHARACTER code point.
When encoding a Tcl string with \fBencoding convertto\fR,
code points that cannot be represented in the
target encoding are transformed to an encoding-specific fallback character,
U+FFFD REPLACEMENT CHARACTER for UTF targets and generally `?` for other
encodings.
.VE "TCL8.7 TIP656"
.SH EXAMPLES
.PP
These examples use the utility proc below that prints the Unicode code points
comprising a Tcl string.
.PP
.CS
proc codepoints s {join [lmap c [split $s {}] {
string cat U+ [format %.6X [scan $c %c]]}]
}
.CE
.PP
Example 1: convert a byte sequence in Japanese euc-jp encoding to a TCL string:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"]
U+00306F
.CE
.PP
The result is the unicode codepoint
.QW "\eu306F" ,
which is the Hiragana letter HA.
.VS "TCL8.7 TIP607, TIP656"
.PP
Example 2: Error handling based on profiles:
.PP
The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid
in ASCII encoding.
.PP
.CS
% codepoints [\fBencoding convertfrom\fR -profile tcl8 ascii A\ex80]
U+000041 U+000080
% codepoints [\fBencoding convertfrom\fR -profile replace ascii A\ex80]
U+000041 U+00FFFD
% codepoints [\fBencoding convertfrom\fR -profile strict ascii A\ex80]
unexpected byte sequence starting at index 1: '\ex80'
.CE
.PP
Example 3: Get partial data and the error location:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR -profile strict -failindex idx ascii AB\ex80]
U+000041 U+000042
% set idx
2
.CE
.PP
Example 4: Encode a character that is not representable in ISO8859-1:
.PP
.CS
% \fBencoding convertto\fR iso8859-1 A\eu0141
A?
% \fBencoding convertto\fR -profile strict iso8859-1 A\eu0141
unexpected character at index 1: 'U+000141'
% \fBencoding convertto\fR -profile strict -failindex idx iso8859-1 A\eu0141
A
% set idx
1
.CE
.VE "TCL8.7 TIP607, TIP656"
.PP
.SH "SEE ALSO"
Tcl_GetEncoding(3), fconfigure(n)
.SH KEYWORDS
encoding, unicode
.\" Local Variables:
.\" mode: nroff
.\" End:
|