Test data for line breaking
===========================
Files named ``*.in'' are input texts; files named ``*.out'' are the
expected outputs. The default configuration is as follows. Values
overridden by a test are listed under its output file.
charmax: 998
colmin: 0
colmax: 76
context: NONEASTASIAN
format: SIMPLE
hangul as AL: no
legacy CM: yes
virama sign: behave as consonant joiner
newline: "\n"
sizing: UAX11
tailoring EAW: none
tailoring LBC: none
urgent breaking: none
preprocessing: none
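
A test harness could hold these defaults in one table and apply the
per-test overrides on top of it. The sketch below is only an illustration:
the key names copy the labels above and are not the option names of any
particular library, and config_for() is a hypothetical helper.

```python
# Hypothetical harness helper. Keys mirror the labels in the default
# configuration list; they are NOT the option names of a real library.
DEFAULTS = {
    "charmax": 998,
    "colmin": 0,
    "colmax": 76,
    "context": "NONEASTASIAN",
    "format": "SIMPLE",
    "hangul as AL": False,
    "legacy CM": True,
    "virama sign": "behave as consonant joiner",
    "newline": "\n",
    "sizing": "UAX11",
    "tailoring EAW": None,
    "tailoring LBC": None,
    "urgent breaking": None,
    "preprocessing": None,
}


def config_for(overrides=None):
    """Return the default configuration with per-test overrides applied."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    # Later keys win, so overrides replace the defaults they name.
    return {**DEFAULTS, **overrides}


# Example: the settings listed for ecclesiazusae.ColumnsMin.out.
columns_min = config_for(
    {"colmin": 7, "colmax": 66, "urgent breaking": "FORCE"}
)
```

Rejecting unknown keys keeps a misspelled override from silently falling
back to the default value.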
01 Generic
----------
ar.in
ar.out
Arabic
el.in
el.out
Greek
fr.in
fr.out
French
he.in
he.out
Hebrew
ja.in
ja.out
Japanese
ja-a.in
ja-a.out
Japanese (annotated readings)
ja-k.in
Japanese (kana transcribed)
ko.in
ko.out
Korean
ru.in
ru.out
Russian
sa.in
sa.out
Sanskrit
th.in
th.out
Thai
vi.in
vi.out
Vietnamese
vi-decomp.in
vi-decomp.out
Vietnamese (decomposed)
zh.in
zh.out
Chinese Mandarin
02 Hangul text
--------------
amitagyong.in
amitagyong.out
complex hangul syllables and conjoining jamo.
tailoring EAW: U+302E and U+302F are nonspacing.
ko.al.out
treat hangul syllables and conjoining jamo as alphabetic (AL).
03 Tailoring Line Breaking Classes
----------------------------------
ja-k.in
ja-k.out
colmax: 72
ja-k.ns.out
colmax: 72
kana nonstarters (small kana) are assigned Line Breaking Class ``ID''.
04 Folding/unfolding
--------------------
fr.fixed.out
ja.fixed.out
same as default but an empty line is inserted after each paragraph.
fr.flowed.out
ja.flowed.out
RFC 3676 ``Format="FLOWED"; DelSp="YES"'' format.
fr.plain.out
ja.plain.out
same as default.
quotes.in
unfolded e-mail text with one problematic line.
quotes.norm.in
unfolded e-mail text without problematic lines.
quotes.fixed.out
quotes.flowed.out
quotes.plain.out
e-mail text folded by the three methods above.
05 Long lines
-------------
ecclesiazusae.in
ecclesiazusae.out
ecclesiazusae.CharactersMax.out
charmax: 79
ecclesiazusae.ColumnsMax.out
urgent breaking: FORCE
ecclesiazusae.ColumnsMin.out
colmin: 7
colmax: 66
urgent breaking: FORCE
06 East Asian context
---------------------
fr.ea.out
context: EASTASIAN
07 n/a
------
08 n/a
------
09 URI
------
uri.in
uri.break.out
colmax: 1
preprocessing: break URIs according to some CMoS rules.
uri.break.http.out
colmax: 1
preprocessing: break HTTP URLs according to some CMoS rules; never
break FTP URLs.
uri.nonbreak.out
colmax: 1
preprocessing: never break URIs.
10 n/a
------
11 Formatting context
---------------------
fr.format.out
ja.format.out
insert context names.
fr.newline.out
ko.newline.out
trim spaces at the end of each line.
12 Indentation
--------------
fr.wrap.out
ja.wrap.out
The first line of each paragraph is indented by one horizontal tab
("\t") and the other lines are indented by four spaces ("    ").
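
The layout of the expected output can be pictured with Python's standard
textwrap module. This is only an illustration of the indentation scheme:
textwrap counts characters rather than display columns and does not follow
the Unicode line breaking rules, so it will not reproduce the *.wrap.out
files. wrap_paragraph() is a hypothetical name.

```python
import textwrap


def wrap_paragraph(text, width=76):
    """Wrap one paragraph: tab on the first line, four spaces on the rest."""
    return textwrap.fill(
        text,
        width=width,
        initial_indent="\t",       # first line of the paragraph
        subsequent_indent="    ",  # every continuation line
    )


if __name__ == "__main__":
    print(wrap_paragraph("Les sanglots longs des violons de l'automne " * 4))
```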
13 RFC 3676
-----------
fr.flowed.out
ja.flowed.out
folding by RFC 3676 ``Format="FLOWED"; DelSp="YES"'' method.
flowedsp.in
flowedsp.out
unfolding of the RFC 3676 ``Format="FLOWED"; DelSp="NO"'' format
(RFC 3676 obsoletes RFC 2646).
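
As a rough picture of what unfolding involves, the sketch below joins
flowed lines following the rules of RFC 3676: a line ending in a space is
continued by the next line of the same quote depth, a leading stuffed
space is dropped, and "-- " (the signature separator) is never joined.
It is a simplified reading of the RFC, not the code that produced
flowedsp.out; unfold_flowed() and its return shape are assumptions made
for the example.

```python
def unfold_flowed(text, delsp=False):
    """Unfold format=flowed text into a list of (quote_depth, paragraph)."""
    paragraphs = []
    pending = None  # (depth, text) of a paragraph still waiting for more lines

    for line in text.splitlines():
        depth = len(line) - len(line.lstrip(">"))  # count leading '>' marks
        body = line[depth:]
        if body.startswith(" "):
            body = body[1:]  # undo space-stuffing
        is_signature = body == "-- "
        flowed = body.endswith(" ") and not is_signature

        # A change of quote depth or a signature separator ends any
        # paragraph that is still open.
        if pending is not None and (pending[0] != depth or is_signature):
            paragraphs.append(pending)
            pending = None

        if flowed and delsp:
            body = body[:-1]  # DelSp=yes: the trailing space is only a marker
        if pending is not None:
            body = pending[1] + body

        if flowed:
            pending = (depth, body)
        else:
            paragraphs.append((depth, body))
            pending = None

    if pending is not None:
        paragraphs.append(pending)
    return paragraphs
```

With DelSp="NO" (the default of the sketch) the trailing space is kept and
becomes the word separator in the joined paragraph; with DelSp="YES" it is
removed, which is what makes that variant usable for text written without
spaces between words.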
14 Non-South East Asian context
-------------------------------
th.al.out
treat South East Asian complex context (SA) as alphabetic (AL).
15 n/a
------
16 n/a
------
Other useful test data
----------------------
* The LineBreakTest.txt file in the Unicode Character Database (UCD) may
be useful for regression testing. The current version of this file can
be found at:
http://www.unicode.org/Public/UNIDATA/auxiliary/LineBreakTest.txt
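
A minimal sketch of reading that file follows. It assumes the usual layout
of the data lines: hexadecimal code points separated by the marks "÷"
(break opportunity) and "×" (no break), with "#" starting a comment.
parse_test_line() is a hypothetical helper, and comparing its result with
a line breaker's output is left to the caller.

```python
BREAK = "\u00f7"     # ÷ : a break is allowed at this position
NO_BREAK = "\u00d7"  # × : a break is not allowed at this position


def parse_test_line(line):
    """Parse one data line; return (text, break_offsets) or None.

    break_offsets are code point offsets into text at which a break is
    allowed; the offset len(text) stands for the end of the text.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None  # blank or comment-only line
    tokens = data.split()
    marks, codes = tokens[0::2], tokens[1::2]
    if len(marks) != len(codes) + 1:
        raise ValueError(f"malformed test line: {line!r}")
    if any(m not in (BREAK, NO_BREAK) for m in marks):
        raise ValueError(f"unexpected mark in: {line!r}")
    text = "".join(chr(int(code, 16)) for code in codes)
    breaks = [i for i, mark in enumerate(marks) if mark == BREAK]
    return text, breaks


if __name__ == "__main__":
    # The file name is an assumption: a local copy of the UCD test file.
    with open("LineBreakTest.txt", encoding="utf-8") as handle:
        cases = [c for c in map(parse_test_line, handle) if c is not None]
    print(len(cases), "test cases loaded")
```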