HOW TO ADD A NEW CHARACTER SET MAPPING.
* Create a struct unicode_info structure. This structure defines the
official character set name, as well as pointers to conversion functions.
* Add the name of the character set, and the name of your structure to
unicode/charsetlist.txt. Multiple entries in unicode/charsetlist.txt can
be used to define aliases for the same character set. For example,
"IBM869" and "CP869" both specify the same character set: they both point
to the unicode_IBM_869 object, which is defined in ibm869.c.
A script automatically generates the source file charsetlist.c from
charsetlist.txt. That's how character sets end up being linked into the
code, and how individual character sets can be selectively included or
excluded.
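
To illustrate how aliases can share one object, here is a minimal sketch.
It is not the generated charsetlist.c: the demo_* names, the one-field
structure, and the table layout are all hypothetical stand-ins, and the
real struct unicode_info has additional members (the conversion function
pointers described below).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct unicode_info. The real
   structure's members and their names may differ. */
struct demo_charset_info {
    const char *chset;   /* official character set name */
    /* pointers to the conversion functions would follow here */
};

/* One object per character set, normally defined in its own source file. */
static const struct demo_charset_info demo_IBM_869 = { "IBM869" };
static const struct demo_charset_info demo_ISO8859_1 = { "ISO-8859-1" };

/* Stand-in for the generated list: each entry pairs a name with an object. */
struct demo_charset_entry {
    const char *name;
    const struct demo_charset_info *info;
};

static const struct demo_charset_entry demo_charsets[] = {
    { "IBM869",     &demo_IBM_869 },
    { "CP869",      &demo_IBM_869 },    /* alias: same object */
    { "ISO-8859-1", &demo_ISO8859_1 },
    { NULL, NULL }                      /* end of list */
};

/* Look up a character set by name or alias (exact match, for brevity).
   Returns NULL if the name is not in the list. */
const struct demo_charset_info *demo_find_charset(const char *name)
{
    size_t i;

    for (i = 0; demo_charsets[i].name; i++)
        if (strcmp(demo_charsets[i].name, name) == 0)
            return demo_charsets[i].info;
    return NULL;
}
```

In this sketch, looking up "CP869" and "IBM869" returns the same pointer,
and leaving an entry out of the table is what excludes a character set.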
The struct unicode_info structure contains pointers to the following
functions:
+ Convert text in this character set to unicode.
+ Convert unicode to text in this character set.
+ Convert text in this character set to uppercase.
+ Convert text in this character set to lowercase.
+ Convert text in this character set to titlecase.
If the character set allows for convenient conversion to
upper/lower/titlecase, the conversion code should be coded directly.
Otherwise, the library has a set of convenience functions that work
against the unicode master table. Text in any character set can be
upper/lower/titlecased by converting it to unicode, running it through
unicode_uc/unicode_lc/unicode_tc, then converting unicode back to the
original character set. See utf8_chset.c for an example.
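
The round trip described above can be sketched as follows. This is a
self-contained illustration, not the code in utf8_chset.c: it uses
ISO-8859-1 because each of its bytes equals its unicode code point, and
every demo_* name is hypothetical. In particular, demo_uc() is a toy
stand-in for unicode_uc() that knows only ASCII and Latin-1 letters, and
the real conversion functions may have different signatures.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned long demo_unichar;   /* hypothetical code point type */

/* Toy stand-in for unicode_uc(): covers only ASCII and Latin-1 letters,
   where the real function consults the full unicode master table. */
static demo_unichar demo_uc(demo_unichar c)
{
    if (c >= 0x61 && c <= 0x7A)               /* a-z */
        return c - 0x20;
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)  /* skip the division sign */
        return c - 0x20;
    if (c == 0xFF)                            /* y with diaeresis */
        return 0x178;                         /* uppercase is outside Latin-1 */
    return c;
}

/* ISO-8859-1 to unicode: each byte is its own code point. */
static demo_unichar demo_latin1_c2u(unsigned char ch)
{
    return ch;
}

/* unicode to ISO-8859-1: returns 0 on success, -1 if unrepresentable. */
static int demo_latin1_u2c(demo_unichar c, unsigned char *out)
{
    if (c > 0xFF)
        return -1;
    *out = (unsigned char)c;
    return 0;
}

/* Uppercase ISO-8859-1 text via unicode. Returns a malloc'ed string that
   the caller must free, or NULL if memory allocation fails. */
char *demo_latin1_toupper(const char *text)
{
    size_t n = strlen(text), i;
    char *result = malloc(n + 1);

    if (!result)
        return NULL;
    for (i = 0; i < n; i++) {
        unsigned char orig = (unsigned char)text[i];
        unsigned char conv;

        /* to unicode, uppercase, back to the original character set */
        if (demo_latin1_u2c(demo_uc(demo_latin1_c2u(orig)), &conv) < 0)
            conv = orig;  /* uppercase form not in this set: keep as-is */
        result[i] = (char)conv;
    }
    result[n] = '\0';
    return result;
}
```

Note the fallback in the last step: a character whose uppercase form does
not exist in the original character set has to be left alone (or handled
some other way), because the conversion back from unicode fails for it.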
Note that unicode_uc/unicode_lc/unicode_tc carry a heavy size penalty and
should be avoided where possible: unicode_[ult]c() adds about 26 KB of
data tables.
Finally, all this code has to be added to libunicode.a. It can simply be
added to libunicode_a_SOURCES.
After doing all that, run make to build libunicode.a and the
unicode-info program. Run unicode-info. If the character set is listed by
unicode-info, you should be all set, provided that the conversion functions
actually work as advertised.