=== Programs ===
doit.sh -- Regenerates all */*.base and */*.c files from the source one
(given as the first parameter in */doit.sh); also used by */doit.sh to
regenerate the files in individual directories. Uses many of the
following scripts.
*/doit.sh -- Customized scripts for individual directories. If a directory
contains a doit.sh, the main one runs it.
clean.sh -- Removes most auxiliary files from language subdirs.
basetoc.c -- [filter] Converts one .base file to .c file, used by doit.sh
$ ./basetoc <CHARSET.base >CHARSET.c
totals.pl -- Reads generated .c files, computes significance data, weight
sums and other summary data, and writes the file `totals.c':
$ ./totals.pl CHARSET1.c ... CHARSETn.c
normalize.pl -- [filter] Does some kind of funny weight normalization, useful
for producing CHARSET.base files, since the weights must fit
into unsigned short int:
$ ./normalize.pl <COUNTS >NORMALIZED_COUNTS
Given a file on the command line, it normalizes the input to have
exactly(!) the same weight sum as that file:
$ ./normalize.pl REFERENCE_COUNTS <COUNTS >RENORMALIZED_COUNTS
This is not run by doit.sh.
extreme.pl -- Given two count files, finds the characters most suitable for
a hook deciding between the two, i.e. the characters with the
biggest difference in occurrences:
$ ./extreme.pl COUNT1 COUNT2
xlt.c -- [filter] Extremely simple charset converter, written to be
independent of other, broken converters:
$ ./xlt SOURCE.map TARGET.map <TEXT >CONVERTED_TEXT
mystrings.c -- [filter] Extracts text chunks from input (strings(1) doesn't
seem to do a good job on 8bit files):
$ ./mystrings <FILE | ...
countall.c -- [filter] Counts character frequencies:
$ ./countall <TEXT >rawcounts.CHARSET
countpair.c -- [filter] Counts 8bit letter pair frequencies and prints a table
containing as many pairs as needed to cover 95% of all occurrences:
$ ./countpair CHARSET.letters <TEXT >paircounts.CHARSET
findletters.c -- [filter] Finds which 8bit characters in a charset map are
letters:
$ ./findletters CHARSET.map <Letters >CHARSET.letters
map2letters.sh -- Runs findletters for all charsets in maps/.
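
As an illustration of the kind of conversion xlt performs (source charset
to UCS2 through SOURCE.map, then UCS2 to target charset through TARGET.map),
here is a minimal sketch in C. It is not the actual xlt.c: the in-memory map
layout (256 UCS2 values per charset), the MAP_UNDEFINED marker, the fallback
byte and the function names build_table() and convert() are all assumptions
made for the example, and parsing of the .map files is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed in-memory form of a .map file: entry i holds the UCS2 value of
 * byte i, or MAP_UNDEFINED when the byte is unassigned in that charset. */
#define MAP_SIZE 256
#define MAP_UNDEFINED 0xFFFFu

/* For every byte of the source charset, look up the target byte that has
 * the same UCS2 value.  Bytes without a counterpart get `fallback`. */
static void build_table(const unsigned short src[MAP_SIZE],
                        const unsigned short tgt[MAP_SIZE],
                        unsigned char fallback,
                        unsigned char table[MAP_SIZE])
{
    assert(src != NULL && tgt != NULL && table != NULL);
    for (int i = 0; i < MAP_SIZE; i++) {
        table[i] = fallback;
        if (src[i] == MAP_UNDEFINED)
            continue;
        for (int j = 0; j < MAP_SIZE; j++) {
            if (tgt[j] == src[i]) {
                table[i] = (unsigned char)j;
                break;
            }
        }
    }
}

/* Convert a buffer in place using a table made by build_table(). */
static void convert(unsigned char *buf, size_t len,
                    const unsigned char table[MAP_SIZE])
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

Building the byte-to-byte table once and then applying it keeps the per-byte
work to a single array lookup. With a map like ibm866-bad.map as the target,
the first matching entry wins, which is where the approximation described
under Data would come from in a scheme like this.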
=== Data ===
Letters -- Unicode characters assumed to be letters, excluding 7bit ones.
Non-European scripts are excluded too, to keep it small.
maps/ -- 8bit charset -> UCS2 maps, notable ones:
ibm866-bad.map -- Translates Latin `i' and `I' to Cyrillic 0x0456 and 0x0406,
so when used as TARGET it approximates them the opposite way
(Cyrillic to Latin).
maccyr.map -- Macintosh Cyrillic after Apple's unification of the Russian
and Ukrainian variants and the addition of the Euro symbol, in
Mac OS 9.0 or so (recode uses the old Russian maccyr -- FIXME:
with iconv it doesn't?).
macce.map -- Macintosh Central European encoding, the real one, not the
crappy one used by recode.
koi8u.map -- KOI8-U (Ukrainian) (recode uses some strange mapping?).
koi8uni.map -- KOI8-Unified.
koi8ub.map -- KOI8-UB (Ukrainian/Belarusian).
cork.map -- T1 Cork encoding (recode uses some strange mapping?).
iso885913.map -- ISO-8859-13 map (recode uses some strange mapping?).
letters/ -- Generated lists of the 8bit characters that are letters, one per
charset; run map2letters.sh to create them.
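
How maps/, Letters and letters/ relate can be sketched in C as below. This
is only an illustration of the idea, not the code of findletters.c: the
array representation of a map and of the Letters list, the restriction to
bytes 0x80..0xff (inferred from "8bit"), and the names is_letter() and
find_letters() are assumptions, and file parsing is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_SIZE 256

/* Return nonzero when `ucs2` occurs in the `letters` list. */
static int is_letter(unsigned short ucs2,
                     const unsigned short *letters, size_t nletters)
{
    for (size_t i = 0; i < nletters; i++)
        if (letters[i] == ucs2)
            return 1;
    return 0;
}

/* Collect the bytes 0x80..0xff whose UCS2 value (taken from `map`) is in
 * the `letters` list.  Stores them in `out`, which must have room for
 * 128 bytes, and returns how many were stored. */
static size_t find_letters(const unsigned short map[MAP_SIZE],
                           const unsigned short *letters, size_t nletters,
                           unsigned char out[128])
{
    size_t n = 0;

    assert(map != NULL && out != NULL);
    assert(letters != NULL || nletters == 0);
    for (int byte = 0x80; byte < MAP_SIZE; byte++)
        if (is_letter(map[byte], letters, nletters))
            out[n++] = (unsigned char)byte;
    return n;
}
```

A linear search through the list is enough here, since each charset has at
most 128 candidate bytes to check.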