About .kmap files
Mind Tuning: Unicode theory and Unicode-savvy text editors make
a clear (and useful) distinction among
1. the keyboard key(s) typed
2. the Unicode character(s) entered and stored in a file or
buffer, and
3. the glyph(s) (i.e. the graphic shapes) that are displayed.
Rendering: The mapping between the Unicode characters stored in a
file or buffer and the display of glyphs is handled by a Rendering
Engine, which can vary considerably in complexity, being fairly
straightforward for Latin scripts and very complex for Arabic or
Devanagari. Most users should be able to assume that the Rendering
Engine is taken care of.
Input Method: The mapping between the keyboard key(s) typed and the
Unicode character(s) entered in a file or buffer is defined by an
Input Method.
Defining Input Methods: In Yudit, .kmap ("keymap") files with names
like Arabic.kmap, Dutch.kmap and Hungarian.kmap provide the mapping
between the key or keys physically typed by the user and the
desired Unicode character(s) to be inserted in the edit buffer.
Each .kmap file defines an Input Method, and almost 100 .kmap files
are supplied with the Yudit download.
Yudit users can switch among the input methods at will to
facilitate entering text in multiple languages/scripts, even within
the same document. Users can also define their own .kmap files to
handle new exotic scripts or simply to define new entry methods
that satisfy their own needs and tastes. For a single script,
e.g. Arabic, Chinese, Esperanto or Hangul, there may be different
kmap-defined input methods that appeal to different users. For
example, Yudit comes supplied with three different .kmap files for
Arabic input:
Arabic.kmap
ArabicTranslit.kmap
ArabicKeyboard.kmap
and you might not like any of them. That's OK, because it is
fairly simple to write your own .kmap file.
Writing a New .kmap File
To write a new .kmap file, the user should first study the Unicode
character implementation of the target script to determine the full
set of Unicode values that need to be entered. For a Roman-based
script like Esperanto, most of the characters are standard lower
ASCII Roman and need no special handling; you type 'b' in Yudit and
a 'b' gets inserted into the buffer. But Esperanto orthography has
six oddly accented letters that need to be entered via a
.kmap-defined input method. The six letters can be uppercase or
lowercase, resulting in 12 separate characters, and these
characters are defined in Unicode. For example, Esperanto
orthography includes a 'c' letter with a circumflex accent. The
Unicode codepoint for the lowercase version is 0x0109, and the
codepoint for the uppercase version is 0x0108. In the Unicode
specification, these are listed as
Codepoint   Official Unicode Name
0x0108      LATIN CAPITAL LETTER C WITH CIRCUMFLEX
0x0109      LATIN SMALL LETTER C WITH CIRCUMFLEX
As there is no key on the ASCII keyboard for entering either of
these characters directly, the most obvious solution is to enter
them using multiple keystrokes. The basic syntax of each entry in
the .kmap file is
" InputKeystroke(s) = OutputUnicodeCharacter(s)" ,
surrounded by double quotes and terminated with a comma. Spaces
around the equal sign and between the symbols are optional and are
ignored if present. (Such spaces can improve the human readability
of the file.) The format of these entries will be discussed in
detail below.
The Esperanto.kmap file supplied with Yudit includes the following
straightforward entries. [The terminating commas do not appear to
be absolutely necessary to the current parser, but technically they
should be present.]
"cx = 0x0109" , // LATIN SMALL LETTER C WITH CIRCUMFLEX
"Cx = 0x0108" , // LATIN CAPITAL LETTER C WITH CIRCUMFLEX
"CX = 0x0108" , // LATIN CAPITAL LETTER C WITH CIRCUMFLEX
The InputKeystrokes are 8-bit values, entered either as printable
letters like c and x (as in this example) or by their numerical
ASCII values (see formats and special cases below). The output
characters are 16-bit Unicode values, notated 0xHHHH, where H is a
hex digit 0 to F. These entries have optional comments starting
with // and continuing to the end of line.
When the Esperanto entry method compiled from this source file is
selected and the user types c followed by x, the sequence of these
two keystrokes will be detected and intercepted by the input
method, and a single Unicode character 0x0109 will be inserted into
the edit buffer. (And if a suitable font is installed, Yudit will
render the appropriate glyph on the screen.) Similarly, if the
user types big C followed by either little x or big X, then the
sequence will be intercepted and mapped into a single 0x0108
Unicode character. Thus multiple input keystrokes can be mapped to
a single Unicode character, and there can be multiple sequences
that map to the same Unicode character.
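The many-to-one behaviour described above can be sketched in Python.
This is only an illustration, not Yudit's own parser: the names
KMAP_LINES, ENTRY and parse_simple_entry are hypothetical, and the
helper handles only plain printable input with 0xHHHH output (no
escapes, no numeric input symbols).

```python
import re

# Three entries copied from Esperanto.kmap, as quoted above.
KMAP_LINES = [
    '"cx = 0x0109" , // LATIN SMALL LETTER C WITH CIRCUMFLEX',
    '"Cx = 0x0108" , // LATIN CAPITAL LETTER C WITH CIRCUMFLEX',
    '"CX = 0x0108" , // LATIN CAPITAL LETTER C WITH CIRCUMFLEX',
]

# Quoted entry: input symbols, an equal sign, output symbols.
# Anything after the closing quote (comma, comment) is ignored.
ENTRY = re.compile(r'^\s*"([^"=]*)=([^"]*)"')


def parse_simple_entry(line):
    """Return (keystrokes, output_string) for one plain kmap entry.

    Hypothetical helper: escapes and numeric input are not handled.
    """
    match = ENTRY.match(line)
    if match is None:
        raise ValueError("not a simple kmap entry: " + line)
    keys = match.group(1).replace(" ", "")  # spaces are ignored
    output = "".join(chr(int(tok, 16)) for tok in match.group(2).split())
    return keys, output


# Several input sequences may share one output character.
table = dict(parse_simple_entry(line) for line in KMAP_LINES)
```

Here table["Cx"] and table["CX"] both hold the single character
0x0108, while table["cx"] holds 0x0109.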
Input keystroke sequences should be chosen with care to avoid
ambiguity and clashes; for human convenience, the input sequences
should also be maximally mnemonic.
The choice of "cx" is a very suitable sequence for Esperanto entry
of c-with-circumflex because
1. The letter x is not normally used in Esperanto orthography, and
2. There is already a well-known convention for typing Esperanto
that uses "cx" for c-with-circumflex when the real letter is
not available.
However, there are other conventions for typing Esperanto, and some
other user might prefer to define a slightly different Esperanto
kmap file to satisfy their own taste and habits. The following
entries, for example, could be put into a different EsperantoB.kmap
file, or even added to the original Esperanto.kmap file.
"c^ = 0x0109" , // LATIN SMALL LETTER C WITH CIRCUMFLEX
"C^ = 0x0108" , // LATIN CAPITAL LETTER C WITH CIRCUMFLEX
If such entries were defined, then typing c followed by ^ would
result in Unicode character 0x0109 being inserted in the buffer,
and typing C followed by ^ would insert 0x0108.
The mappings can also be one-InputKey-to-one-UnicodeCharacter, and
this is very common with keymaps for non-Roman scripts. Arabic has
a letter named siin that sounds like 's' and one named raa' that
sounds like 'r'. In ArabicTranslit.kmap, which as its name implies
is based on a Roman transliteration, one finds the following
straightforward one-to-one mappings.
"s = 0x0633" , // type 's', get Unicode Arabic siin character
"r = 0x0631" , // type 'r', get Unicode Arabic raa' character
(In this case and elsewhere, one must study the Unicode
documentation to see what the Unicode codepoint values are; in this
case, you simply have to look up the information that Arabic siin
has value 0x0633 and Arabic raa' has the value 0x0631. The Unicode
charts for Arabic and other languages are easily examined starting
from http://www.unicode.org/charts/)
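As a convenience alongside the charts, codepoints can also be checked
from a script. The sketch below uses Python's standard unicodedata
module, which is not part of Yudit; official_name and CODEPOINTS are
names chosen for this illustration.

```python
import unicodedata

# Codepoints quoted in this document.
CODEPOINTS = [0x0108, 0x0109, 0x0633, 0x0631]


def official_name(codepoint):
    """Return the official Unicode name of a codepoint."""
    return unicodedata.name(chr(codepoint))


# Print each codepoint in the 0xHHHH notation used by .kmap files.
for cp in CODEPOINTS:
    print("0x%04X %s" % (cp, official_name(cp)))

# The reverse lookup, from official name to codepoint, also works.
c_circumflex = ord(unicodedata.lookup("LATIN SMALL LETTER C WITH CIRCUMFLEX"))
```

Note that the official names use Unicode's own spellings (for
example ARABIC LETTER SEEN for siin), so the transliteration you are
used to may not match the name in the charts.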
An Arabic keymap like this one, covering a whole Roman
transliteration system, would appeal only to those people who know
and like that particular transliteration. Those who prefer a
different transliteration should define their own keymap. The only
formal restriction for a new transliteration is that it be
unambiguously mappable into proper Arabic Unicode characters.
Other people entering Arabic might already be used to the keyboard
layout on the PC or the Mac, and they might prefer to define
keymaps that correspond to their favorite keyboard layouts (rather
than to some kind of Roman transliteration).
It is also conceivable that one (or multiple) input keystrokes
could be mapped to multiple Arabic Unicode characters. For
example, the Arabic sequence of characters laam followed by a bare
'alif character is properly and preferably encoded in a file or
buffer just like that, as 0x0644 (laam) followed by 0x0627 (bare
'alif). One can therefore simply type laam, followed by 'alif, but
suppose that some user preferred to think of them as a unit and
preferred to enter them together with a single keystroke, say X.
In such a case, the following one-to-many entry in a .kmap file
would be appropriate:
"X = 0x0644 0x0627" , // type just X to enter laam followed by 'alif
Remember that Unicode makes a distinction between the key(s) typed,
the Unicode character(s) encoded, and the glyph(s) rendered.
Whether the user chooses to type laam followed by 'alif, or both at
once using the shortcut defined just above, the Unicode characters
inserted into the file or buffer should be the same. Unicode
doesn't care how you _enter_ the characters--it simply defines what
the encoded file should look like.
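The point that the encoded result is independent of the entry route
can be illustrated with a toy model in Python. The single-key
mappings for 'l' and 'a' are assumptions made for this sketch; only
the X shortcut comes from the entry shown above.

```python
LAAM = "\u0644"
ALIF = "\u0627"

# Hypothetical keymap: 'l' and 'a' enter the letters one at a time,
# while 'X' is the one-to-many shortcut.
KEYMAP = {"l": LAAM, "a": ALIF, "X": LAAM + ALIF}


def type_keys(keys):
    """Map each keystroke through KEYMAP; unmapped keys pass through."""
    return "".join(KEYMAP.get(key, key) for key in keys)


two_keystrokes = type_keys("la")
one_keystroke = type_keys("X")
```

Both variables end up holding the same two-character string, 0x0644
followed by 0x0627, which is all that the encoded file records.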
Arabic orthography requires that laam followed by 'alif be
_rendered_ together, and all Arabic fonts supply special glyphs for
rendering the laam-'alif sequences. But the font-and-rendering
considerations are altogether separate from the entry and encoding,
and in a properly set up Arabic-capable system, the user should not
have to worry about the rendering.
Converting .kmap Files to .my Files
.kmap files are the human-edited and human-readable _source_ files
for keyboard mappings. Yudit, however, really works with .my
files, which must be generated from the .kmap files.
Study some of the .kmap files provided with the yudit release. By
default, the yudit installation "prefix" is /usr, and the yudit
executable is installed in <prefix>/bin/yudit, which by default is
/usr/bin/yudit. The .kmap source files are stored in
<prefix>/share/yudit/src/ and the .my files for the whole site are
stored in <prefix>/share/yudit/data/. (During installation, the
'prefix' variable may be modified manually, in which case the
relevant bin and share directories may be somewhere other than
/usr, which is the default prefix.)
Files with names like Esperanto.kmap and Georgian.kmap are source
files, written using any convenient text editor. To be used by
Yudit, a .my file must be generated from the .kmap source file
according to the following example, using the 'mytool' program
supplied with Yudit. Let's assume that we have just written
Georgian2.kmap to define a new input method for Georgian.
$ mytool -type kmap -kmap Georgian2.kmap -rkmap Georgian2.kmap -write Georgian2.my
If this new input method is intended only for private use by one
user, then the user should copy the Georgian2.my file to the
directory ~/.yudit/data/, i.e. to the directory named 'data', under
the directory named ".yudit", which is in the user's own home
directory.
If the new entry method is intended for your whole site, then copy
it to <prefix>/share/yudit/data/, which by default is
/usr/share/yudit/data/.
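The two install locations can be summarized in a small Python
sketch. The function name and its parameters are hypothetical; the
directory layout is the one described above, with /usr as the
default prefix.

```python
from pathlib import Path


def my_file_destination(my_name, site_wide=False, prefix="/usr", home=None):
    """Return the path where a compiled .my file should be copied.

    Hypothetical helper: 'prefix' is the Yudit installation prefix
    and 'home' defaults to the current user's home directory.
    """
    if site_wide:
        return Path(prefix) / "share" / "yudit" / "data" / my_name
    base = Path(home) if home is not None else Path.home()
    return base / ".yudit" / "data" / my_name


private_copy = my_file_destination("Georgian2.my")
site_copy = my_file_destination("Georgian2.my", site_wide=True)
```

Copying to the site-wide directory normally requires administrator
rights, so the private location is the easier one for testing a new
keymap.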
More Detail on the Syntax of .kmap Files:
Comments: Comments in .kmap files start with // and continue to the
end of line. The sequence // is therefore "special" to the parser.
It's a good idea to comment each entry.
Mapping Entries: There can be only one entry (mapping) to a line.
Each entry must be surrounded by double quotes. Technically, each
entry should also be terminated with a comma.
Each entry is of the form:
" InputSymbol(s) = OutputSymbol(s) " ,
where the Input is one or more 8-bit ASCII keystrokes, represented
as printable ASCII characters, or as octal, decimal or hexadecimal
numbers. Eight-bit hexadecimal numbers are represented as 0xHH,
where H is a hex digit from 0 to F. Octal-format numbers start
with \0, e.g. \037. The Input and Output symbols are separated by
an equal sign. The Output consists of one or more 16-bit _Unicode_
characters, each represented as 0xHHHH, where H is a hexadecimal
digit 0 to F. Spaces around the equal sign and between symbols are
ignored.
From the point of view of the Input Method, the Input Symbols are
keystrokes coming from the user; the Output Symbols are the Unicode
characters to be inserted into the Yudit edit buffer.
The following symbols are "special" inside .kmap files and require
special attention:
" double quotes (used to surround an entry)
= equal sign (used to separate Input Symbols and Output Symbols)
space (ignored around the equal sign and between symbols)
+ plus sign (interpreted by default as a positive prefix to a number)
- minus sign (interpreted by default as a negative prefix to a number)
\ backslash (used to literalize a special symbol)
The special symbols are literalized, where necessary, by preceding
them with the backslash "literalizer" symbol; e.g. \= denotes a
literal equal sign. The literal backslash itself is notated \\.
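The literalizing rule can be sketched as a small Python helper that
turns the keystrokes a user will type into the Input side of an
entry. The names kmap_input and kmap_entry are hypothetical, and two
choices are assumptions of this sketch rather than requirements of
the format: digits are written as hexadecimal ASCII values, and a
space is placed between all symbols (the parser ignores such spaces).

```python
# The six symbols listed above as special inside .kmap files.
SPECIAL = set('"= +-\\')
DIGITS = "0123456789"


def kmap_input(keys):
    """Return the Input notation for a string of literal keystrokes."""
    tokens = []
    for key in keys:
        if key in SPECIAL:
            tokens.append("\\" + key)           # literalize with backslash
        elif key in DIGITS:
            tokens.append("0x%02X" % ord(key))  # digit as ASCII value
        else:
            tokens.append(key)
    return " ".join(tokens)


def kmap_entry(keys, codepoints):
    """Return one complete entry line, quoted and comma-terminated."""
    output = " ".join("0x%04X" % cp for cp in codepoints)
    return '" %s = %s " ,' % (kmap_input(keys), output)


# User types literal \ then ~ then n to get n-with-tilde.
print(kmap_entry("\\~n", [0x00F1]))
```

Separating every symbol with a space also keeps two adjacent slashes
apart, so they cannot be mistaken for the start of a comment.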
Examples:
" s = 0x0633 " , // one to one
" cx = 0x0109 " , // many to one
" \= = 0x0109 " , // type a literal equal sign to input 0x0109
" \\ ~ n = 0x00f1 " , // allow LaTeX-like input of n-with-tilde;
// the user types literal \ then ~ then n
" \\ \" a = 0x00E4 " , // allow LaTeX-like input of a-with-dieresis;
// the user types literal \ then literal "
// then a to input Unicode a-with-dieresis
// (a with two dots above)
" \+u = 0x00FC " , // user types literal + followed by u
// to input 0x00FC, which is u with dieresis
// (two dots above)
" X = 0x0644 0x0627 ", // one to many
" 0x32 = 0x0662 " , // type ASCII '2' get ARABIC-INDIC DIGIT 2
Literal double quote: Each whole entry is surrounded by double
quotes, so the double quote is "special" in .kmap files. A literal
double quote can be included inside an entry by typing \", where
the backslash is the escape or "literalizing" character. Another
way to indicate a literal double quote is to give its 8-bit lower
ASCII value, e.g.
" \\ 0x22 a = 0x00E4 " , // 0x22 is the literal double quote symbol
Literal backslash: To indicate a literal backslash as input, use \\
or indicate the 8-bit ASCII numeric value.
" \\ \" a = 0x00E4 " , // \\ denotes a literal backslash symbol
" 0x5c 0x22 a = 0x00E4 " , // equivalent to the entry just above
Similarly, the equal sign separates the Input and Output; where
necessary, a literal input equal sign is indicated as \= or 0x3D.
" \= = 0x2345 " ,
" 0x3D = 0x2345 " , // 0x3d is the literal equal sign
The + and - are interpreted by default by the parser as part of
a numerical value, so if they are intended as literal input, then
they should be literalized in the usual ways, as in this example
from Georgian.kmap
"\+z = 0x10df " ,
or using the ASCII value of the plus sign (0x2B)
" 0x2b z = 0x10df " ,
Literal white spaces: Whitespace between characters in a kmap
entry is normally ignored; this allows you to space out the
input sequences for better human readability. To notate a literal
space as part of the Input, literalize it with a preceding
backslash or indicate the 8-bit ASCII value.
" \ = 0x0020 " , // space literalized with preceding backslash
" 0x20 = 0x0020 " , // 0x20 is the space ASCII value
Digits: By default a string segment that starts with a digit 0-9 is
considered a numeric value, not the literal symbols 0-9. To
indicate the input of a literal digit, indicate its 8-bit ASCII
value.
" 0x32 = 0x0662 " , // 0x32 is the ASCII value of '2'
Number formats: Input keystrokes can be notated as octal, decimal
or hexadecimal numbers, e.g. a period (full stop) could be notated
equivalently as 46 (decimal), \056 (octal) or 0x2E (hexadecimal).
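The equivalence of the three notations is easy to verify in Python.
The helper name notations is hypothetical; it simply prints one
ASCII value in the three styles described above.

```python
# One value, three spellings: decimal, octal and hexadecimal.
assert ord(".") == 46 == 0o56 == 0x2E


def notations(key):
    """Return the three numeric .kmap notations for one ASCII key."""
    code = ord(key)
    return {
        "decimal": "%d" % code,
        "octal": "\\0%o" % code,  # octal numbers start with \0
        "hex": "0x%02X" % code,
    }


period = notations(".")
escape = notations("\x1b")  # the ESC key
```

For the period this yields 46, \056 and 0x2E; for ESC the octal form
is \033.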
More Examples, taken from Kana.kmap:
"\033 KE=0x30F5", // SMALL KE
"\"R=0x201D", // RIGHT DOUBLE QUOTATION MARK
"kke=0x3063 0x3051",
"0x20 = 0x201D", // INPUT LITERAL BLANK
"0x3D=0x003D", // INPUT '=' SIGN THIS WAY
The // sequence is special in kmap files, being the introducer of a
comment. If you want the sequence // as part of the Input, then
the two slashes need to be separated by a space, e.g.
"&/ /=0x005C", // YUDIT NEEDS SPACE BETWEEN / /
// to avoid treating // as the comment introducer
Yudit Strategy for Input Matching
Clashes: It is illegal to have the same input sequence map into two
different outputs.
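A check for such clashes could be sketched as follows. This is a
hypothetical helper for keymap authors, not something mytool is
known to provide; it assumes the entries have already been read
into (input, output) pairs.

```python
def find_clashes(entries):
    """Return input sequences that map to more than one output.

    'entries' is a list of (input_sequence, output) pairs. Repeating
    an entry with the same output is not reported as a clash.
    """
    seen = {}
    clashes = []
    for keys, output in entries:
        if keys in seen and seen[keys] != output and keys not in clashes:
            clashes.append(keys)
        seen.setdefault(keys, output)  # remember the first output seen
    return clashes


bad = find_clashes([("cx", "\u0109"), ("cx", "\u0108")])
good = find_clashes([("Cx", "\u0108"), ("CX", "\u0108")])
```

Here bad contains "cx", while good is empty: two different inputs
may share an output, but one input may not have two outputs.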
// this is an illegal clash
" cx = 0x0109 " ,
" cx = 0x0108 " ,
Where multiple input sequences start with the same substring but
have different lengths, the "greedy" matching algorithm prefers
the longest match. The Esperanto mapping for c-with-circumflex
could conceivably be done as follows:
" cx = 0x0109 " , // lowercase c with circumflex
" cxx = 0x0108 " , // uppercase C with circumflex
Then if the user typed c, x, and anything but another x, the value
0x0109 would be inserted in the buffer. If the user typed c, x,
and x, then the greedy algorithm would match the longer input
sequence and insert 0x0108 into the buffer. (Such tricks are
possible, but are hardly recommended.)
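The longest-match preference can be illustrated in Python. This is a
simplified sketch that translates a complete string in one pass; it
does not model how Yudit handles keystrokes as they arrive, and the
names TABLE and translate are hypothetical.

```python
# The two entries from the example above.
TABLE = {"cx": "\u0109", "cxx": "\u0108"}


def translate(keys, table):
    """Replace input sequences by their outputs, longest match first."""
    longest = max(map(len, table), default=0)
    out = []
    i = 0
    while i < len(keys):
        # Try the longest candidate at this position, then shorter ones.
        for length in range(min(longest, len(keys) - i), 0, -1):
            chunk = keys[i:i + length]
            if chunk in table:
                out.append(table[chunk])
                i += length
                break
        else:
            out.append(keys[i])  # no entry matches: pass the key through
            i += 1
    return "".join(out)


lower = translate("cxa", TABLE)
upper = translate("cxx", TABLE)
```

With this table, "cxa" becomes 0x0109 followed by a, and "cxx"
becomes the single character 0x0108.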
Sub-translations:
"Sub-translations" are currently used only for the Yudit-supplied
Hangul-entry keymap (for Korean orthography) and for the
Yudit-supplied Unicode-entry keymap; most users don't need to know
about sub-translations. [This part of the documentation should be
reviewed and expanded. I may have completely misunderstood it.]
For entering whole Hangul characters, one can use a notation
involving "sub-translations". If the translation can be broken
down to (maximum 5) sub-translations then the first line in the
.kmap file may contain:
"string1+string2+string3...",
The subsequent lines between "begin string1", and "end string1"
contain the sub-translation lines. The resulting Unicode number is
the value of the sub-translations added up.
A sub-translation may contain an empty input string, as in
"=0x0021",
but consecutive ones cannot.
Matching strategy for sub-translations: When the first
sub-translation would match the null string, one character may be
borrowed from the previous translation.
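To see how added-up values can yield a whole Hangul character, here
is a Python sketch based on the standard Unicode arithmetic for
Hangul syllables. It is an assumption that the Yudit-supplied Hangul
keymap distributes the values this way; the table names and the
romanized keys (h, g, a, n) are hypothetical.

```python
# Standard Unicode Hangul composition constants.
S_BASE = 0xAC00   # first precomposed syllable
V_COUNT = 21      # number of medial vowels
T_COUNT = 28      # number of final consonants, including "none"

# Hypothetical sub-translation tables; each holds a partial value.
INITIAL = {"g": S_BASE + 0 * V_COUNT * T_COUNT,
           "h": S_BASE + 18 * V_COUNT * T_COUNT}
MEDIAL = {"a": 0 * T_COUNT}
FINAL = {"": 0, "n": 4}  # the empty string means no final consonant


def compose(initial, medial, final=""):
    """Add up the three partial values to get one syllable."""
    return chr(INITIAL[initial] + MEDIAL[medial] + FINAL[final])


han = compose("h", "a", "n")
ga = compose("g", "a")
```

Under these assumptions han is the syllable 0xD55C and ga is 0xAC00.
The empty key in FINAL plays the role of an empty-string
sub-translation.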
Autoshaping information
[This section needs to be reviewed and expanded.]
The Yudit-supplied Arabic.kmap and ArabicTranslit.kmap files
contain "shaping" information, presumably for proper rendering of
Arabic. Many special ligatures are possible in Arabic rendering,
but only the laam+'alif ones are absolutely required. The following
four entries appear to indicate the entry sequences and Unicode
value sequences that require special rendering.
// Shaping part using shape.mys. Autogenerated.
//
"l aM=0x0644 0x0622", // ?? LAM ALEF WITH MADDA ABOVE
"l aH=0x0644 0x0623", // ?? LAM ALEF WITH HAMZA ABOVE
"l aB=0x0644 0x0625", // ?? LAM ALEF WITH HAMZA BELOW
"l a=0x0644 0x0627", // ?? LAM ALEF
//
// End of shaping part. Autogenerated.
//
Gaspar [2003-01-21] clarifies that this section was auto-generated
by bin/arabickmap.pl, and
"bin/shape.pl generates a shape.mys shaper map that is responsible
for Arabic/Syriac shaping. As Yudit is a plain text editor, it
needs to know the obligatory ligatures, the ones that definitely
should go into one box. Those four ligatures in the example are
the obligatory ligatures.
"Please note: in fact in yudit 2.7.+ you don't need to have it
because additional characters and diacritics will automatically be
added to form a ligature, so in fact the aforementioned ligature
can be entered by simply inputting the components."
[end of quotation]
As I interpret this, Arabic kmap files can now be written without
these four mysterious lines; when the user types laam followed by
some variety of 'alif letter, the appropriate ligature glyph should
be displayed automatically. [This section subject to correction.]
Generic .mys Source Text-Conversion Files [for the future]
[This section needs to be reviewed and expanded.]
This document describes the format for .kmap files, which are
converted into .my files.
The generic .mys source files provide a few more features than the
standard .kmap files, and the .mys format will someday be
recommended as a source format for new kmap files.
The only documentation currently available for .mys files is in the
Yudit download:
http://www.yudit.org/download/yudit-2.7.2/mytool/mys/example.mys
Better documentation will be required before the average user can
understand and use .mys files.
If you need a keymap for characters in the SMP (Supplementary
Multilingual Plane), the .mys format is the only option.
Take a look at the source .mys file in the Yudit source tree:
yudit-2.7.8/mytool/mys/OldItalic.mys
You can compile OldItalic.mys with
mytool -convert my -write OldItalic.my -mys OldItalic.mys
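A quick Python check shows why SMP characters fall outside the
16-bit 0xHHHH output notation described earlier for .kmap files. The
function name fits_in_16_bits is hypothetical; the codepoint is the
first one in the Old Italic block.

```python
import unicodedata

OLD_ITALIC_A = 0x10300  # first codepoint of the Old Italic block


def fits_in_16_bits(codepoint):
    """True if the codepoint can be written as a 16-bit 0xHHHH value."""
    return 0 <= codepoint <= 0xFFFF


name = unicodedata.name(chr(OLD_ITALIC_A))
bmp_ok = fits_in_16_bits(0x0109)        # c with circumflex
smp_ok = fits_in_16_bits(OLD_ITALIC_A)  # needs more than 16 bits

# In UTF-16 an SMP character occupies two 16-bit units (4 bytes).
utf16_bytes = len(chr(OLD_ITALIC_A).encode("utf-16-be"))
```

Here bmp_ok is true and smp_ok is false, which is consistent with
the advice to use the .mys format for such scripts.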
Gaspar Sinai <gaspar@yudit.org> Tokyo 2006-05-21
Gaspar Sinai <gsinai@yudit.org> Tokyo 2001-01-11
Edited and augmented: ken.beesley@xrce.xerox.com 2003-02-06