@node uninorm.h
@chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}
@cindex normal forms
@cindex normalizing
This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, NFKD. These
transformations involve decomposition and, for NFC and NFKC, composition
of Unicode characters.
@menu
* Decomposition of characters::
* Composition of characters::
* Normalization of strings::
* Normalizing comparisons::
* Normalization of streams::
@end menu
@node Decomposition of characters
@section Decomposition of Unicode characters
@cindex decomposing
The following enumerated values are the possible types of decomposition of a
Unicode character.
@deftypevr Constant int UC_DECOMP_CANONICAL
Denotes canonical decomposition.
@end deftypevr
@deftypevr Constant int UC_DECOMP_FONT
UCD marker: @code{<font>}.
Denotes a font variant (e.g.@: a blackletter form).
@end deftypevr
@deftypevr Constant int UC_DECOMP_NOBREAK
UCD marker: @code{<noBreak>}.
Denotes a no-break version of a space or hyphen.
@end deftypevr
@deftypevr Constant int UC_DECOMP_INITIAL
UCD marker: @code{<initial>}.
Denotes an initial presentation form (Arabic).
@end deftypevr
@deftypevr Constant int UC_DECOMP_MEDIAL
UCD marker: @code{<medial>}.
Denotes a medial presentation form (Arabic).
@end deftypevr
@deftypevr Constant int UC_DECOMP_FINAL
UCD marker: @code{<final>}.
Denotes a final presentation form (Arabic).
@end deftypevr
@deftypevr Constant int UC_DECOMP_ISOLATED
UCD marker: @code{<isolated>}.
Denotes an isolated presentation form (Arabic).
@end deftypevr
@deftypevr Constant int UC_DECOMP_CIRCLE
UCD marker: @code{<circle>}.
Denotes an encircled form.
@end deftypevr
@deftypevr Constant int UC_DECOMP_SUPER
UCD marker: @code{<super>}.
Denotes a superscript form.
@end deftypevr
@deftypevr Constant int UC_DECOMP_SUB
UCD marker: @code{<sub>}.
Denotes a subscript form.
@end deftypevr
@deftypevr Constant int UC_DECOMP_VERTICAL
UCD marker: @code{<vertical>}.
Denotes a vertical layout presentation form.
@end deftypevr
@deftypevr Constant int UC_DECOMP_WIDE
UCD marker: @code{<wide>}.
Denotes a wide (or zenkaku) compatibility character.
@end deftypevr
@deftypevr Constant int UC_DECOMP_NARROW
UCD marker: @code{<narrow>}.
Denotes a narrow (or hankaku) compatibility character.
@end deftypevr
@deftypevr Constant int UC_DECOMP_SMALL
UCD marker: @code{<small>}.
Denotes a small variant form (CNS compatibility).
@end deftypevr
@deftypevr Constant int UC_DECOMP_SQUARE
UCD marker: @code{<square>}.
Denotes a CJK squared font variant.
@end deftypevr
@deftypevr Constant int UC_DECOMP_FRACTION
UCD marker: @code{<fraction>}.
Denotes a vulgar fraction form.
@end deftypevr
@deftypevr Constant int UC_DECOMP_COMPAT
UCD marker: @code{<compat>}.
Denotes an otherwise unspecified compatibility character.
@end deftypevr
The following constant denotes the maximum size of decomposition of a single
Unicode character.
@deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
This macro expands to a constant that is the required size of buffer passed to
the @code{uc_decomposition} and @code{uc_canonical_decomposition} functions.
@end deftypevr
The following functions decompose a Unicode character.
@deftypefun int uc_decomposition (ucs4_t@tie{}@var{uc}, int@tie{}*@var{decomp_tag}, ucs4_t@tie{}*@var{decomposition})
Returns the character decomposition mapping of the Unicode character @var{uc}.
@var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.
When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
@code{*@var{decomp_tag}} are filled and @var{n} is returned. Otherwise -1 is
returned.
@end deftypefun
@deftypefun int uc_canonical_decomposition (ucs4_t@tie{}@var{uc}, ucs4_t@tie{}*@var{decomposition})
Returns the canonical character decomposition mapping of the Unicode character
@var{uc}. @var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.
When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
and @var{n} is returned. Otherwise -1 is returned.
Note: This function returns the (simple) ``canonical decomposition'' of
@var{uc}. If you want the ``full canonical decomposition'' of @var{uc},
that is, the recursive application of ``canonical decomposition'', use the
function @code{u*_normalize} with argument @code{UNINORM_NFD} instead.
@end deftypefun
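
As a minimal sketch of how the buffer size, the tag and the return value fit
together, the following helper wraps @code{uc_decomposition}. The name
@code{decompose_once} is hypothetical and not part of the library.

@example
#include <uninorm.h>

/* Hypothetical helper, not part of the library.
   Stores the one-step decomposition of UC in BUF, which must have room
   for UC_DECOMPOSITION_MAX_LENGTH elements, and returns its length.
   Returns -1 if UC has no decomposition.  *IS_CANONICAL is set only
   when a decomposition exists.  */
static int
decompose_once (ucs4_t uc, ucs4_t *buf, int *is_canonical)
@{
  int tag;
  int n = uc_decomposition (uc, &tag, buf);

  if (n >= 0)
    *is_canonical = (tag == UC_DECOMP_CANONICAL);
  return n;
@}
@end example

Under these assumptions, calling @code{decompose_once} on U+00C5 (LATIN
CAPITAL LETTER A WITH RING ABOVE) should yield the two characters U+0041
U+030A with a canonical tag, while U+0041 has no decomposition and yields -1.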
@node Composition of characters
@section Composition of Unicode characters
@cindex composing, Unicode characters
@cindex combining, Unicode characters
The following function composes a Unicode character from two Unicode
characters.
@deftypefun ucs4_t uc_composition (ucs4_t@tie{}@var{uc1}, ucs4_t@tie{}@var{uc2})
Attempts to combine the Unicode characters @var{uc1} and @var{uc2}.
@var{uc1} is known to have canonical combining class 0.
Returns the combination of @var{uc1} and @var{uc2}, if it exists.
Returns 0 otherwise.
Not all decompositions can be recombined using this function. See the Unicode
file @file{CompositionExclusions.txt} for details.
@end deftypefun
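
Because a return value of 0 signals ``no composition'', callers usually need
to test for it before using the result. The following sketch shows one way to
do that. The helper name @code{try_compose} is an assumption made for this
illustration.

@example
#include <uninorm.h>

/* Hypothetical helper, not part of the library.
   Returns 1 and stores the composed character in *RESULT if UC1
   followed by UC2 can be combined, and returns 0 otherwise.
   UC1 is assumed to have canonical combining class 0.  */
static int
try_compose (ucs4_t uc1, ucs4_t uc2, ucs4_t *result)
@{
  ucs4_t composed = uc_composition (uc1, uc2);

  if (composed == 0)
    return 0;
  *result = composed;
  return 1;
@}
@end example

For instance, U+0041 followed by U+030A composes to U+00C5, whereas U+0041
followed by U+0042 has no composition.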
@node Normalization of strings
@section Normalization of strings
The Unicode standard defines four normalization forms for Unicode strings.
The following type is used to denote a normalization form.
@deftp Type uninorm_t
An object of type @code{uninorm_t} denotes a Unicode normalization form.
This is a scalar type; its values can be compared with @code{==}.
@end deftp
The following constants denote the four normalization forms.
@deftypevr Macro uninorm_t UNINORM_NFD
Normalization form D: canonical decomposition.
@end deftypevr
@deftypevr Macro uninorm_t UNINORM_NFC
Normalization form C: canonical decomposition, then canonical composition.
@end deftypevr
@deftypevr Macro uninorm_t UNINORM_NFKD
Normalization form KD: compatibility decomposition.
@end deftypevr
@deftypevr Macro uninorm_t UNINORM_NFKC
Normalization form KC: compatibility decomposition, then canonical composition.
@end deftypevr
The following functions operate on @code{uninorm_t} objects.
@deftypefun bool uninorm_is_compat_decomposing (uninorm_t@tie{}@var{nf})
Tests whether the normalization form @var{nf} does compatibility decomposition.
@end deftypefun
@deftypefun bool uninorm_is_composing (uninorm_t@tie{}@var{nf})
Tests whether the normalization form @var{nf} includes canonical composition.
@end deftypefun
@deftypefun uninorm_t uninorm_decomposing_form (uninorm_t@tie{}@var{nf})
Returns the decomposing variant of the normalization form @var{nf}.
This maps NFC,NFD @arrow{} NFD and NFKC,NFKD @arrow{} NFKD.
@end deftypefun
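
The two predicates are enough to tell the four forms apart. As an
illustration only, the hypothetical helper @code{form_name} below maps a
@code{uninorm_t} value to a printable name.

@example
#include <uninorm.h>

/* Hypothetical helper, not part of the library.
   Returns a printable name for the normalization form NF.  */
static const char *
form_name (uninorm_t nf)
@{
  if (uninorm_is_composing (nf))
    return uninorm_is_compat_decomposing (nf) ? "NFKC" : "NFC";
  else
    return uninorm_is_compat_decomposing (nf) ? "NFKD" : "NFD";
@}
@end example

Since values of type @code{uninorm_t} can be compared with @code{==}, a test
such as @code{uninorm_decomposing_form (UNINORM_NFKC) == UNINORM_NFKD} is
also possible.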
The following functions apply a Unicode normalization form to a Unicode string.
@deftypefun {uint8_t *} u8_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint8_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint8_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {uint16_t *} u16_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint16_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint16_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {uint32_t *} u32_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint32_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint32_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
Returns the specified normalization form of a string.
The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun
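
A minimal sketch of a call to @code{u8_normalize} follows. It assumes the
convention that a @code{NULL} @var{resultbuf} makes the function return a
freshly allocated result, which the caller then frees. The helper name
@code{to_nfc} is hypothetical.

@example
#include <stdint.h>
#include <stdlib.h>
#include <uninorm.h>

/* Hypothetical helper, not part of the library.
   Returns a freshly allocated NFC form of the UTF-8 string S of N
   bytes and stores its length in *LENGTHP.  Returns NULL with errno
   set upon failure.  The caller frees the result with free().  */
static uint8_t *
to_nfc (const uint8_t *s, size_t n, size_t *lengthp)
@{
  /* Assumption: NULL as resultbuf requests a freshly allocated result.  */
  return u8_normalize (UNINORM_NFC, s, n, NULL, lengthp);
@}
@end example

With this helper, the three bytes 0x41 0xCC 0x8A (U+0041 U+030A) should
normalize to the two bytes 0xC3 0x85 (U+00C5).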
@node Normalizing comparisons
@section Normalizing comparisons
@cindex comparing, ignoring normalization
The following functions compare Unicode strings, ignoring differences in
normalization.
@deftypefun int u8_normcmp (const@tie{}uint8_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint8_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u16_normcmp (const@tie{}uint16_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint16_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u32_normcmp (const@tie{}uint32_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint32_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization.
@var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.
If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun
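
The comparison result and the error status are reported through separate
channels, so both need checking. The sketch below, built around the
hypothetical helper @code{canonically_equal}, tests two NUL-terminated UTF-8
strings for canonical equivalence.

@example
#include <string.h>
#include <uninorm.h>

/* Hypothetical helper, not part of the library.
   Returns 1 if the NUL-terminated UTF-8 strings A and B are equal up
   to canonical normalization, 0 if they differ, and -1 with errno set
   upon failure.  */
static int
canonically_equal (const char *a, const char *b)
@{
  int result;

  if (u8_normcmp ((const uint8_t *) a, strlen (a),
                  (const uint8_t *) b, strlen (b),
                  UNINORM_NFD, &result) < 0)
    return -1;
  return result == 0;
@}
@end example

Here the precomposed U+00C5 and the sequence U+0041 U+030A should compare as
equal, although their byte representations differ.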
@cindex comparing, ignoring normalization, with collation rules
@cindex comparing, with collation rules, ignoring normalization
@deftypefun {char *} u8_normxfrm (const@tie{}uint8_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {char *} u16_normxfrm (const@tie{}uint16_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {char *} u32_normxfrm (const@tie{}uint32_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
Converts the string @var{s} of length @var{n} to a NUL-terminated byte
sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
@code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.
@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun
@deftypefun int u8_normcoll (const@tie{}uint8_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint8_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u16_normcoll (const@tie{}uint16_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint16_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u32_normcoll (const@tie{}uint32_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint32_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
the collation rules of the current locale.
@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun
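
As a rough sketch, a locale-aware comparison of two NUL-terminated UTF-8
strings could be wrapped as follows. The helper name
@code{collate_normalized} is hypothetical, and the sketch assumes that the
caller has already selected the desired locale, for example with
@code{setlocale}.

@example
#include <string.h>
#include <uninorm.h>

/* Hypothetical helper, not part of the library.
   Compares the NUL-terminated UTF-8 strings A and B with the
   collation rules of the current locale, ignoring differences in
   normalization.  Stores the comparison result in *RESULTP and
   returns 0, or returns -1 with errno set upon failure.  */
static int
collate_normalized (const char *a, const char *b, int *resultp)
@{
  return u8_normcoll ((const uint8_t *) a, strlen (a),
                      (const uint8_t *) b, strlen (b),
                      UNINORM_NFC, resultp);
@}
@end example

The ordering of non-ASCII strings depends on the locale in effect, so no
particular result is claimed here beyond the sign convention described above.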
@node Normalization of streams
@section Normalization of streams of Unicode characters
@cindex stream, normalizing a
A ``stream of Unicode characters'' is essentially a function that accepts a
@code{ucs4_t} argument repeatedly, optionally combined with a function that
``flushes'' the stream.
@deftp Type {struct uninorm_filter}
This is the data type of a stream of Unicode characters that normalizes its
input according to a given normalization form and passes the normalized
character sequence to the encapsulated stream of Unicode characters.
@end deftp
@deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t@tie{}@var{nf}, int@tie{}(*@var{stream_func})@tie{}(void@tie{}*@var{stream_data}, ucs4_t@tie{}@var{uc}), void@tie{}*@var{stream_data})
Creates and returns a normalization filter for Unicode characters.
The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
@code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
character @var{uc} and returns 0 if successful, or -1 with @code{errno} set
upon failure.
Returns the new filter, or NULL with @code{errno} set upon failure.
@end deftypefun
@deftypefun int uninorm_filter_write (struct@tie{}uninorm_filter@tie{}*@var{filter}, ucs4_t@tie{}@var{uc})
Stuffs a Unicode character into a normalizing filter.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun
@deftypefun int uninorm_filter_flush (struct@tie{}uninorm_filter@tie{}*@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
Note: If additional characters are written into the filter after this
function has been called, the resulting character sequence in the
encapsulated stream will not necessarily be normalized.
@end deftypefun
@deftypefun int uninorm_filter_free (struct@tie{}uninorm_filter@tie{}*@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream,
then closes and frees the filter.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun
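
The following sketch puts the filter functions together. It pushes an array
of characters through an NFC filter into a small in-memory sink. The type
@code{struct sink} and the functions @code{sink_put} and
@code{normalize_stream} are hypothetical names chosen for this illustration.

@example
#include <errno.h>
#include <stddef.h>
#include <uninorm.h>

/* Hypothetical sink: collects the characters it receives.  */
struct sink
@{
  ucs4_t chars[16];
  size_t count;
@};

/* Stream function for the sink.  Returns 0, or -1 with errno set
   when the fixed-size array is full.  */
static int
sink_put (void *stream_data, ucs4_t uc)
@{
  struct sink *sink = stream_data;

  if (sink->count == sizeof sink->chars / sizeof sink->chars[0])
    @{
      errno = ENOSPC;
      return -1;
    @}
  sink->chars[sink->count++] = uc;
  return 0;
@}

/* Feeds the N characters of INPUT through an NFC filter into SINK.
   Returns 0 if successful, or -1 upon failure.  */
static int
normalize_stream (const ucs4_t *input, size_t n, struct sink *sink)
@{
  struct uninorm_filter *filter =
    uninorm_filter_create (UNINORM_NFC, sink_put, sink);
  int ret = 0;

  if (filter == NULL)
    return -1;
  for (size_t i = 0; i < n; i++)
    if (uninorm_filter_write (filter, input[i]) < 0)
      @{
        ret = -1;
        break;
      @}
  /* Freeing the filter also flushes any buffered characters.  */
  if (uninorm_filter_free (filter) < 0)
    ret = -1;
  return ret;
@}
@end example

With the input U+0041 U+030A U+0042, the sink should end up holding the two
characters U+00C5 U+0042. The composed character can only be emitted once the
filter has seen what follows the combining mark, which is why the final flush
matters.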