.\" Copyright (C) 2001 Information-technology Promotion Agency (IPA)
.\" Copyright (C) 2001-2011
.\" National Institute of Advanced Industrial Science and Technology (AIST)
.\" This file is part of the m17n library documentation.
.\" Permission is granted to copy, distribute and/or modify this document
.\" under the terms of the GNU Free Documentation License, Version 1.2 or
.\" any later version published by the Free Software Foundation; with no
.\" Invariant Section, no Front-Cover Texts,
.\" and no Back-Cover Texts. A copy of the license is included in the
.\" appendix entitled "GNU Free Documentation License".
.TH "Charset" 3m17n "12 Jan 2011" "Version 1.6.2" "The m17n Library" \" -*- nroff -*-
.ad l
.nh
.SH NAME
Charset \- Charset objects and API for them.
.SS "Defines"
.in +1c
.ti -1c
.RI "#define \fBMCHAR_INVALID_CODE\fP"
.br
.RI "\fIInvalid code-point. \fP"
.in -1c
.SS "Functions"
.in +1c
.ti -1c
.RI "\fBMSymbol\fP \fBmchar_define_charset\fP (const char *name, \fBMPlist\fP *plist)"
.br
.RI "\fIDefine a charset. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBmchar_resolve_charset\fP (\fBMSymbol\fP symbol)"
.br
.RI "\fIResolve charset name. \fP"
.ti -1c
.RI "int \fBmchar_list_charset\fP (\fBMSymbol\fP **symbols)"
.br
.RI "\fIList symbols representing charsets. \fP"
.ti -1c
.RI "int \fBmchar_decode\fP (\fBMSymbol\fP charset_name, unsigned code)"
.br
.RI "\fIDecode a code-point. \fP"
.ti -1c
.RI "unsigned \fBmchar_encode\fP (\fBMSymbol\fP charset_name, int c)"
.br
.RI "\fIEncode a character code. \fP"
.ti -1c
.RI "int \fBmchar_map_charset\fP (\fBMSymbol\fP charset_name, void(*func)(int from, int to, void *arg), void *func_arg)"
.br
.RI "\fICall a function for all the characters in a specified charset. \fP"
.in -1c
.SS "Variables"
.in +1c
.ti -1c
.RI "\fBMSymbol\fP \fBMcharset\fP"
.br
.RI "\fIThe symbol \fCMcharset\fP. \fP"
.in -1c
.SS "Variables: Symbols representing a charset."
Each of the following symbols represents a predefined charset.
.in +1c
.ti -1c
.RI "\fBMSymbol\fP \fBMcharset_ascii\fP"
.br
.RI "\fISymbol representing the charset ASCII. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMcharset_iso_8859_1\fP"
.br
.RI "\fISymbol representing the charset ISO/IEC 8859/1. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMcharset_unicode\fP"
.br
.RI "\fISymbol representing the charset Unicode. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMcharset_m17n\fP"
.br
.RI "\fISymbol representing the largest charset. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMcharset_binary\fP"
.br
.RI "\fISymbol representing the charset for ill-decoded characters. \fP"
.in -1c
.SS "Variables: Parameter keys for mchar_define_charset()."
These are the predefined symbols to use as parameter keys for the function \fBmchar_define_charset()\fP (which see).
.in +1c
.ti -1c
.RI "\fBMSymbol\fP \fBMmethod\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMdimension\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMmin_range\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMmax_range\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMmin_code\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMmax_code\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMascii_compatible\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMfinal_byte\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMrevision\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMmin_char\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMmapfile\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMparents\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMsubset_offset\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMdefine_coding\fP"
.br
.ti -1c
.RI "\fBMSymbol\fP \fBMaliases\fP"
.br
.in -1c
.SS "Variables: Symbols representing charset methods."
These are the predefined symbols that can be a value of the \fBMmethod\fP parameter of a charset used in an argument to the \fBmchar_define_charset()\fP function.
.PP
A method specifies how code\-points and character codes are converted. See the documentation of the \fBmchar_define_charset()\fP function for the details.
.in +1c
.ti -1c
.RI "\fBMSymbol\fP \fBMoffset\fP"
.br
.RI "\fISymbol for the offset type method of charset. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMmap\fP"
.br
.RI "\fISymbol for the map type method of charset. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMunify\fP"
.br
.RI "\fISymbol for the unify type method of charset. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMsubset\fP"
.br
.RI "\fISymbol for the subset type method of charset. \fP"
.ti -1c
.RI "\fBMSymbol\fP \fBMsuperset\fP"
.br
.RI "\fISymbol for the superset type method of charset. \fP"
.in -1c
.SH "Detailed Description"
.PP
Charset objects and API for them.
The m17n library uses \fIcharset\fP objects to represent coded character sets (CCS). The library supports many predefined coded character sets, and application programs can define additional charsets. A character can belong to multiple charsets.
.PP
The m17n library distinguishes the following three concepts:
.PP
.PD 0
.IP "\(bu" 2
A \fIcode\-point\fP is a number assigned by the CCS to each character. Code\-points may or may not be continuous. The type \fCunsigned\fP is used to represent a code\-point. An invalid code\-point is represented by the macro \fCMCHAR_INVALID_CODE\fP.
.PP
.PD 0
.IP "\(bu" 2
A \fIcharacter\fP \fIindex\fP is the canonical index of a character in a CCS. The character that has the character index N occupies the Nth position when all the characters in the current CCS are sorted by their code\-points. Character indices in a CCS are continuous and start with 0.
.PP
.PD 0
.IP "\(bu" 2
A \fIcharacter\fP \fIcode\fP is the internal representation in the m17n library of a character. A character code is a signed integer of 21 bits or longer.
.PP
Each charset object defines how characters are converted between code\-points and character codes. To \fIdecode\fP means converting a code\-point to a character code, and to \fIencode\fP means converting a character code to a code\-point, matching the arguments of \fBmchar_decode()\fP and \fBmchar_encode()\fP.
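.PP
The relation between code\-points and character indices can be sketched in plain C. This is a minimal illustration that does not call the m17n library: the table \fCtoy_code_points\fP and the function \fCtoy_char_index()\fP are hypothetical names, and returning \-1 for a missing code\-point is a convention of this sketch only.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy CCS: four code-points, not contiguous, sorted ascending. */
static const unsigned toy_code_points[] = { 0x21, 0x22, 0x30, 0x7E };
static const size_t toy_count =
    sizeof toy_code_points / sizeof toy_code_points[0];

/* Return the character index of CODE, that is, its position among the
   sorted code-points of the CCS, or -1 if CODE is not in the table. */
static int toy_char_index(const unsigned *table, size_t count, unsigned code)
{
    size_t lo = 0, hi = count;

    assert(table != NULL);
    while (lo < hi) {                 /* binary search over the sorted table */
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid] == code)
            return (int) mid;
        if (table[mid] < code)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```
.fi
.PP
With this table, code\-point 0x30 has character index 2 even though the code\-points themselves have gaps, which is why indices are continuous while code\-points need not be.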
.SH "Define Documentation"
.PP
.SS "#define MCHAR_INVALID_CODE"
.PP
Invalid code\-point. The macro \fBMCHAR_INVALID_CODE\fP gives the invalid code\-point.
.SH "Variable Documentation"
.PP
.SS "\fBMSymbol\fP \fBMcharset_ascii\fP"
.PP
Symbol representing the charset ASCII. The symbol \fBMcharset_ascii\fP has name \fC'ascii'\fP and represents the charset ISO 646, USA Version X3.4\-1968 (ISO\-IR\-6).
.SS "\fBMSymbol\fP \fBMcharset_iso_8859_1\fP"
.PP
Symbol representing the charset ISO/IEC 8859/1. The symbol \fBMcharset_iso_8859_1\fP has name \fC'iso\-8859\-1'\fP and represents the charset ISO/IEC 8859\-1:1998.
.SS "\fBMSymbol\fP \fBMcharset_unicode\fP"
.PP
Symbol representing the charset Unicode. The symbol \fBMcharset_unicode\fP has name \fC'unicode'\fP and represents the charset Unicode.
.SS "\fBMSymbol\fP \fBMcharset_m17n\fP"
.PP
Symbol representing the largest charset. The symbol \fBMcharset_m17n\fP has name \fC'm17n'\fP and represents the charset that contains all characters supported by the m17n library.
.SS "\fBMSymbol\fP \fBMcharset_binary\fP"
.PP
Symbol representing the charset for ill\-decoded characters. The symbol \fBMcharset_binary\fP has name \fC'binary'\fP and represents the fake charset that the decoding functions attach to an M\-text as a text property when they encounter an invalid byte or byte sequence.
.PP
See \fBCode Conversion\fP for more details.
.SS "\fBMSymbol\fP \fBMmethod\fP"
.SS "\fBMSymbol\fP \fBMdimension\fP"
.SS "\fBMSymbol\fP \fBMmin_range\fP"
.SS "\fBMSymbol\fP \fBMmax_range\fP"
.SS "\fBMSymbol\fP \fBMmin_code\fP"
.SS "\fBMSymbol\fP \fBMmax_code\fP"
.SS "\fBMSymbol\fP \fBMascii_compatible\fP"
.SS "\fBMSymbol\fP \fBMfinal_byte\fP"
.SS "\fBMSymbol\fP \fBMrevision\fP"
.SS "\fBMSymbol\fP \fBMmin_char\fP"
.SS "\fBMSymbol\fP \fBMmapfile\fP"
.SS "\fBMSymbol\fP \fBMparents\fP"
.SS "\fBMSymbol\fP \fBMsubset_offset\fP"
.SS "\fBMSymbol\fP \fBMdefine_coding\fP"
.SS "\fBMSymbol\fP \fBMaliases\fP"
.SS "\fBMSymbol\fP \fBMoffset\fP"
.PP
Symbol for the offset type method of charset. The symbol \fBMoffset\fP has the name \fC'offset'\fP and, when used as the value of the \fBMmethod\fP parameter of a charset, it means that the conversion between code\-points and character codes of the charset is done by this calculation:
.PP
.PP
.nf
CHARACTER\-CODE = CODE\-POINT \- MIN\-CODE + MIN\-CHAR
.fi
.PP
.PP
where MIN\-CODE is the value of the \fBMmin_code\fP parameter of the charset, and MIN\-CHAR is the value of the \fBMmin_char\fP parameter.
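.PP
The offset calculation above, together with its inverse, can be sketched as follows. This is an illustration only and not the library's implementation: the structure \fCtoy_offset_charset\fP, both functions, and the macro \fCTOY_INVALID_CODE\fP are hypothetical, the value chosen for the macro is not claimed to equal \fBMCHAR_INVALID_CODE\fP, and the \-1 failure value is a convention of this sketch.
.PP
.nf
```c
#include <assert.h>

/* Hypothetical stand-in for an invalid code-point (value is an assumption). */
#define TOY_INVALID_CODE 0xFFFFFFFFu

/* Hypothetical model of an offset-type charset. */
struct toy_offset_charset {
    unsigned min_code;   /* MIN-CODE: lowest code-point */
    unsigned max_code;   /* highest code-point */
    int min_char;        /* MIN-CHAR: character code of MIN-CODE */
};

/* CHARACTER-CODE = CODE-POINT - MIN-CODE + MIN-CHAR, or -1 if out of range. */
static int toy_offset_decode(const struct toy_offset_charset *cs, unsigned code)
{
    assert(cs->min_code <= cs->max_code);
    if (code < cs->min_code || code > cs->max_code)
        return -1;
    return (int) (code - cs->min_code) + cs->min_char;
}

/* Inverse: CODE-POINT = CHARACTER-CODE - MIN-CHAR + MIN-CODE. */
static unsigned toy_offset_encode(const struct toy_offset_charset *cs, int c)
{
    int max_char = cs->min_char + (int) (cs->max_code - cs->min_code);

    if (c < cs->min_char || c > max_char)
        return TOY_INVALID_CODE;
    return (unsigned) (c - cs->min_char) + cs->min_code;
}
```
.fi
.PP
For a toy charset with MIN\-CODE 0xA0, highest code\-point 0xFF and MIN\-CHAR 0x0E00, code\-point 0xA1 decodes to character code 0x0E01, and encoding 0x0E01 gives 0xA1 back.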
.SS "\fBMSymbol\fP \fBMmap\fP"
.PP
Symbol for the map type method of charset. The symbol \fBMmap\fP has the name \fC'map'\fP and, when used as the value of the \fBMmethod\fP parameter of a charset, it means that the conversion between code\-points and character codes of the charset is done by looking up a map. The map must be given by the \fBMmapfile\fP parameter.
.SS "\fBMSymbol\fP \fBMunify\fP"
.PP
Symbol for the unify type method of charset. The symbol \fBMunify\fP has the name \fC'unify'\fP and, when used as the value of the \fBMmethod\fP parameter of a charset, it means that the conversion between code\-points and character codes of the charset is done by a map lookup combined with offsetting. The map must be given by the \fBMmapfile\fP parameter. For this kind of charset, a unique continuous character code space is assigned for all of its characters.
.PP
If the map has an entry for a code\-point, the conversion is done by looking up the map. Otherwise, the conversion is done by this calculation:
.PP
.PP
.nf
CHARACTER\-CODE = CODE\-POINT \- MIN\-CODE + LOWEST\-CHAR\-CODE
.fi
.PP
.PP
where MIN\-CODE is the value of the \fBMmin_code\fP parameter of the charset, and LOWEST\-CHAR\-CODE is the lowest character code of the assigned code space.
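.PP
The two\-step rule for a unify charset (map lookup first, offset calculation as the fallback) can be sketched like this. The types and the function \fCtoy_unify_decode()\fP are hypothetical, the in\-memory array merely stands in for whatever the \fBMmapfile\fP parameter supplies, and the linear search is chosen for brevity rather than taken from the library.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>

/* One explicit mapping from a code-point to a character code. */
struct toy_map_entry {
    unsigned code;
    int ch;
};

/* Hypothetical model of a unify-type charset. */
struct toy_unify_charset {
    const struct toy_map_entry *map;  /* explicit entries (stand-in for the map) */
    size_t map_len;
    unsigned min_code;                /* MIN-CODE */
    int lowest_char_code;             /* LOWEST-CHAR-CODE of the assigned space */
};

/* Use the map entry if one exists; otherwise fall back to
   CHARACTER-CODE = CODE-POINT - MIN-CODE + LOWEST-CHAR-CODE. */
static int toy_unify_decode(const struct toy_unify_charset *cs, unsigned code)
{
    size_t i;

    assert(code >= cs->min_code);     /* precondition of this sketch */
    for (i = 0; i < cs->map_len; i++)
        if (cs->map[i].code == code)
            return cs->map[i].ch;
    return (int) (code - cs->min_code) + cs->lowest_char_code;
}
```
.fi
.PP
In this model a code\-point listed in the map takes the mapped character code, while every other code\-point lands in the private continuous range that starts at LOWEST\-CHAR\-CODE.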
.SS "\fBMSymbol\fP \fBMsubset\fP"
.PP
Symbol for the subset type method of charset. The symbol \fBMsubset\fP has the name \fC'subset'\fP and, when used as the value of the \fBMmethod\fP parameter of a charset, it means that the charset is a subset of a parent charset. The parent charset must be given by the \fBMparents\fP parameter. The conversion between code\-points and character codes of the charset is done conceptually by this calculation:
.PP
.PP
.nf
CHARACTER\-CODE = PARENT\-CODE (CODE\-POINT) + SUBSET\-OFFSET
.fi
.PP
.PP
where PARENT\-CODE is a pseudo function that returns the character code of CODE\-POINT in the parent charset, and SUBSET\-OFFSET is the value given by the \fBMsubset_offset\fP parameter.
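.PP
The conceptual subset calculation can be modelled by passing the parent's decoder as a function pointer. Everything here is hypothetical: \fCtoy_parent_decode()\fP is an invented parent covering code\-points 0x00 to 0xFF, \fCtoy_subset_decode()\fP is not a library function, and any range restriction that a real subset charset applies is left out of this sketch.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>

/* Signature of a parent decoder: code-point in, character code out (-1 on failure). */
typedef int (*toy_parent_decode_fn)(unsigned code);

/* Hypothetical parent charset: identity mapping for code-points 0x00..0xFF. */
static int toy_parent_decode(unsigned code)
{
    return code <= 0xFF ? (int) code : -1;
}

/* CHARACTER-CODE = PARENT-CODE (CODE-POINT) + SUBSET-OFFSET,
   or -1 if the parent cannot decode the code-point. */
static int toy_subset_decode(toy_parent_decode_fn parent, unsigned code,
                             int subset_offset)
{
    int parent_char;

    assert(parent != NULL);
    parent_char = parent(code);
    if (parent_char < 0)
        return -1;
    return parent_char + subset_offset;
}
```
.fi
.PP
With a subset offset of 0x80, code\-point 0x41 decodes to 0xC1 in this toy model, and a code\-point the parent rejects is rejected by the subset as well.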
.SS "\fBMSymbol\fP \fBMsuperset\fP"
.PP
Symbol for the superset type method of charset. The symbol \fBMsuperset\fP has the name \fC'superset'\fP and, when used as the value of the \fBMmethod\fP parameter of a charset, it means that the charset is a superset of its parent charsets. The parent charsets must be given by the \fBMparents\fP parameter.
.SS "\fBMSymbol\fP \fBMcharset\fP"
.PP
The symbol \fCMcharset\fP. Any decoded M\-text has a text property whose key is the predefined symbol \fCMcharset\fP. The name of \fCMcharset\fP is \fC'charset'\fP.
.SH "Author"
.PP
Generated automatically by Doxygen for The m17n Library from the source code.
.SH COPYRIGHT
Copyright (C) 2001 Information\-technology Promotion Agency (IPA)
.br
Copyright (C) 2001\-2011 National Institute of Advanced Industrial Science and Technology (AIST)
.br
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License
<http://www.gnu.org/licenses/fdl.html>.