= wordaxe Hyphenation =
Henning von Bargen, March 2009, Release 0.3.2
== Overview ==
The wordaxe library was formerly called "deco-cow", which is
an abbreviation for "decomposition of compound words".
The library consists of three program parts:
1) an (easily extendable) class library adding hyphenation support
to Python programs.
2) a special hyphenation algorithm, based on the decomposition of
compound words (the implementation is only for the German language).
3) a hyphenation extension for the <a href="http://www.reportlab.org">ReportLab</a> PDF library.
== <a name="bezug" />Where to get it ==
The wordaxe library is hosted at SourceForge (<a href="http://deco-cow.sourceforge.net">http://deco-cow.sourceforge.net</a>).
The current stable release of the software can be downloaded from the
SourceForge <a href="http://sourceforge.net/project/showfiles.php?group_id=105867">download page</a>.
There is also a Subversion repository where you can get the current development code.
== Licence ==
The wordaxe hyphenation library is dual-licensed.
You are permitted to use and distribute wordaxe under either
of these open source licenses:
"Apache License 2.0" or "2-Clause BSD License".
For details, see the file license.txt contained in the library.
Regarding the licenses for the
<a href="http://hkn.eecs.berkeley.edu/~dyoo/pyHnj">pyHnj library</a> from Danny Yoo
(HNJ hyphenation, contained in the wordaxe installation), for the
<a href="http://www.reportlab.org">ReportLab PDF library</a>
and the <a href="http://code.google.com/p/pyhyphen">pyhyphen library</a>
please visit the corresponding web sites.
The dictionary files with the suffix <tt>.dic</tt> have been taken from
the OpenOffice distribution; they are licensed under the GNU LGPL.
== Installation ==
=== ReportLab 2.3 ===
wordaxe Release 0.3.2 has been tested with Python 2.5 and ReportLab 2.3,
but it should work equally well with Python 2.4, since, as far as I know,
none of the new features of 2.5 are used.
ReportLab 2.3 can be obtained from <a href="http://www.reportlab.org">www.reportlab.org</a>.
The installation is easy (though not with easy-install ;-) and is not described here.
The older wordaxe 0.3.0 works with ReportLab 2.2 or 2.1, too.
Notes on even older ReportLab versions:
Though untested, wordaxe Release 0.3.0 may work with ReportLab 2.0.
If it does not, you can use Release 0.2.2 instead,
but its installation is a little harder (because more files
in the ReportLab installation have to be overwritten, and the installation guide
was not correct).
For ReportLab 1.19 please use Release 0.1.1 (not recommended).
When upgrading from ReportLab 1.x to 2.x you will probably have to change
existing code (independently of wordaxe), since in 2.x all paragraph
input text has to be Unicode or UTF-8 encoded.
=== Installing wordaxe step-by-step ===
1. Download the ZIP archive wordaxe-0.3.2.zip from the SourceForge site.
2. Unpack the ZIP-archive wordaxe-0.3.2.zip in the root directory C:\
(inside the archive, all files are in a directory "wordaxe-0.3.2").
The following directory structure will be created:
{{{
C:
wordaxe-0.3.2
doc
htdocs
css
examples
icons
images
wordaxe
dict
plugins
rl
}}}
3a. On the command line, execute:
{{{
cd /d c:\wordaxe-0.3.2
setup.py install
}}}
3b. Alternative:
For ReportLab versions before 2.3: make a backup copy of the file reportlab\pdfbase\rl_codecs.py from the
ReportLab installation; then replace the file with the
modified version from c:\wordaxe-0.3.2\wordaxe\rl\rl_codecs.py.
<em>Note:</em> This only changes two lines of code in the file,
the ones responsible for the soft hyphen ("shy hyphen") character SHY.
Add the new library to the Python path, for example by creating a
file wordaxe.pth in c:\python25\lib\site-packages,
containing only one line of text:
{{{
c:\wordaxe-0.3.2
}}}
4. Make sure that "import wordaxe" does not produce an error message.
5. ReportLab will work as before; differences can only occur
if <em>your own</em> programs or texts use the SHY character.
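The check in step 4 can also be scripted. The following is a minimal sketch; the helper name can_import is made up for this example and is not part of wordaxe.

```python
def can_import(module_name):
    """Return True if the named module can be imported, else False."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# Reports whether the installation steps above made wordaxe importable.
print("wordaxe importable: %s" % can_import("wordaxe"))
```

If this prints False, check the Python path (for example the wordaxe.pth file from step 3b).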
== Usage ==
To see hyphenation with the DCW algorithm (decomposition of compound words)
in action for German-language text,
run the example script "test_hyphenation.py" in the test subdirectory.
It will produce two PDF files, test_hyphenation-plain.pdf and test_hyphenation_styled.pdf.
The document you are reading now has also been produced with automatic hyphenation
(see the buildDoku.py script).
To add hyphenation support to your own programs (here using the DCW algorithm
as an example), only very few modifications in your code are necessary:
Add the following lines:
{{{
from wordaxe import hyphRegistry
from wordaxe.DCWHyphenator import DCWHyphenator
hyphRegistry['DE'] = DCWHyphenator('de',5)
}}}
Search and replace the following strings:
{{{
Search                            Replace with
reportlab.platypus.paragraph      wordaxe.rl.paragraph
reportlab.platypus.xpreformatted  wordaxe.rl.xpreformatted
reportlab.lib.styles              wordaxe.rl.styles
}}}
Enable hyphenation. To do this, set two attributes in your ParagraphStyle:
{{{
stylesheet = getSampleStyleSheet()
myStyle = stylesheet["BodyText"]
myStyle.language = 'DE'
myStyle.hyphenation = True
}}}
=== Using a Hyphenator ===
Of course, the wordaxe hyphenation can also be used independently of ReportLab.
In the constructor, at least a language code and a minimum word length
have to be supplied. Shorter words will not be considered for hyphenation.
{{{
from wordaxe.DCWHyphenator import DCWHyphenator
hyphenator = DCWHyphenator('de',5)
}}}
Now you can hyphenate (unicode) words.
The return value will either be None (unknown word)
or a HyphenatedWord, that is, a word with hyphenation points and their quality.
{{{
hword = hyphenator.hyphenate(u"Donaudampfschiffahrt")
print "Possible hyphenations", hword.hyphenations
# Split the word at the second possible hyphenation point:
left,right = hword.split(hword.hyphenations[1])
# returns: (u'Donau\xad', HyphenatedWord(u'dampfschiffahrt'))
# The left part is a unicode object (here: Donau-),
# the right part is the rest of the word (a HyphenatedWord instance again),
# that should go into the next line.
print left
print right
print right.hyphenations
}}}
== Hyphenation Classes ==
The source code for the classes contains test code that you can
examine to find out how to use the class.
The test code can be called to see how the corresponding class
handles the words given on the command line.
You will probably want to supply the -v argument for verbose output, too.
Example:
{{{
c:\python25\python wordaxe\DCWHyphenator.py -v Silbentrennung
}}}
=== DCWHyphenator ===
This class works by splitting compound words into subwords,
inspired by the publications of the TU Vienna, see
<a href="http://www.ads.tuwien.ac.at/research/SiSiSi/">http://www.ads.tuwien.ac.at/research/SiSiSi/</a>.
However, the implementation here isn't in any way related to the closed source product "SiSiSi".
The algorithm works as follows:
A given compound word will be decomposed into subwords first,
using the file DE_hyph.ini, which contains stems,
some of them annotated with properties like
NEED_SUFFIX, NO_SUFFIX etc.
Furthermore, possible prefixes and suffixes are defined there.
Due to its complexity, the algorithm is quite slow:
The word will be scanned from left to right.
It will be split into a pair (L,R), where different partitions
are possible, of course; for example, "Trennung" can be split into ("T", "rennung"),
("Tr", "ennung"), ("Tre", "nnung"), and so on.
For each pair, the algorithm checks whether the left part L matches a known prefix, stem
or suffix from the file DE_hyph.ini. If it does, the algorithm continues with
the remainder R analogously.
Otherwise, or if the combination does not make sense
(for example prefix + suffix without a stem), the algorithm abandons this partition.
In principle, this is a recursive algorithm, although
the implementation uses a to-do list instead.
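The decomposition described above can be sketched in a few lines. This is a simplified illustration only, not the DCWHyphenator code: the word lists PREFIXES, STEMS and SUFFIXES are tiny made-up stand-ins for the contents of DE_hyph.ini, the function name decompose is hypothetical, and the plausibility check is reduced to two assumed rules (a suffix must follow a stem or another suffix; a word must not end with a prefix).

```python
# Toy word lists (assumption); the real data lives in DE_hyph.ini.
PREFIXES = set(["ver", "be"])
STEMS = set(["silbe", "trenn", "dampf", "schiff"])
SUFFIXES = set(["ung", "en", "n"])

KINDS = (("prefix", PREFIXES), ("stem", STEMS), ("suffix", SUFFIXES))


def decompose(word):
    """Return all decompositions of word as lists of (kind, piece) tuples."""
    word = word.lower()
    results = []
    # Each to-do entry: (start index of the remainder R, pieces found so far).
    todo = [(0, [])]
    while todo:
        start, pieces = todo.pop()
        if start == len(word):
            # Whole word consumed; reject empty input and a dangling prefix.
            if pieces and pieces[-1][0] != "prefix":
                results.append(pieces)
            continue
        last_kind = pieces[-1][0] if pieces else None
        # Try every partition (L, R) of the remainder.
        for end in range(start + 1, len(word) + 1):
            left = word[start:end]
            for kind, table in KINDS:
                if left not in table:
                    continue
                # Assumed rule: a suffix needs a stem or suffix before it.
                if kind == "suffix" and last_kind not in ("stem", "suffix"):
                    continue
                todo.append((end, pieces + [(kind, left)]))
    return results


print(decompose("Silbentrennung"))
print(decompose("verung"))  # prefix + suffix without a stem: no result
```

The list todo plays the role of the recursion stack: every partial decomposition that still looks plausible is pushed onto it together with the position where the remainder R starts.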
The properties of the algorithm are:
Only the stems defined in DE_hyph.ini will be detected.
As a consequence, some valid hyphenation points may be missed, because
unknown words will not be hyphenated at all.
On the other hand, this prevents wrong hyphenations.
If a compound word cannot be decomposed uniquely,
only those hyphenation points that exist in all the decompositions are used.
This prevents hyphenations that may be valid but are hard to read.
Hyphenation points have a priority, which can be used by the calling program
to prefer good hyphenation points (at subword boundaries).
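The rule for ambiguous compounds can be illustrated with a small sketch. It is an assumption-laden simplification: only subword boundaries are considered as hyphenation points, priorities are ignored, and the function names boundaries and common_points are invented for this example.

```python
def boundaries(pieces):
    """Return the offsets at which one piece ends and the next one begins."""
    points = set()
    offset = 0
    for piece in pieces[:-1]:
        offset += len(piece)
        points.add(offset)
    return points


def common_points(decompositions):
    """Keep only the boundaries that occur in every decomposition."""
    if not decompositions:
        return []
    common = boundaries(decompositions[0])
    for pieces in decompositions[1:]:
        common &= boundaries(pieces)
    return sorted(common)


# "Wachstube" can be read as Wach+stube or Wachs+tube: no common point.
print(common_points([["wach", "stube"], ["wachs", "tube"]]))
# With a third subword, the boundary in front of it is shared by both readings.
print(common_points([["wach", "stuben", "fenster"],
                     ["wachs", "tuben", "fenster"]]))
```

In the second call only the boundary before "fenster" survives, so the ambiguous part of the word stays unhyphenated.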
This class supports all features of ExplicitHyphenator as well.
=== ExplicitHyphenator ===
This class supports all features of BaseHyphenator. In addition,
you have to explicitly define the hyphenation for every word that should be hyphenated.
Thus this class is not of much use on its own, except when the dictionary
is quite small (for example, if more or less only fixed text templates
are used).
The hyphenations can be defined using the methods add_entry, add_entries
and add_entries_from_file (see test script special_words.py).
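The idea of an explicit hyphenation table can be sketched without wordaxe. The class below is a hypothetical stand-in, not the ExplicitHyphenator API: the class name, the method add, and the "~" notation for marking hyphenation points are all assumptions made for this example.

```python
class ToyExplicitHyphenator(object):
    """Looks up hyphenation points in a table filled by hand (illustration only)."""

    def __init__(self):
        self._points = {}

    def add(self, marked_word, separator="~"):
        """Register a word written with separators, e.g. 'Sil~ben~tren~nung'."""
        parts = marked_word.split(separator)
        points = []
        offset = 0
        for part in parts[:-1]:
            offset += len(part)
            points.append(offset)
        self._points["".join(parts).lower()] = points

    def hyphenate(self, word):
        """Return the list of hyphenation offsets, or None for unknown words."""
        return self._points.get(word.lower())


hyphenator = ToyExplicitHyphenator()
hyphenator.add("Sil~ben~tren~nung")
print(hyphenator.hyphenate("Silbentrennung"))
print(hyphenator.hyphenate("Unbekannt"))
```

As in the real classes, an unknown word yields None rather than a guessed hyphenation.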
=== PyHnjHyphenator ===
This works like the pattern-based hyphenation in TeX (see also libhnj, pyhnj).
The implementation can use the pyhnj C library or, if the
argument purePython=True is supplied to the constructor, pure Python.
=== wordaxe.plugins.PyHyphenHyphenator ===
This class also works like the pattern-based hyphenation in TeX,
but it uses a different implementation and <em>works much better</em>!
To use it, you need to have the pyhyphen library installed
(see <a href="http://pypi.python.org/pypi/PyHyphen/">http://pypi.python.org/pypi/PyHyphen/</a>).
This class supports all features of ExplicitHyphenator as well.
=== BaseHyphenator ===
This class should work for all languages.
It hyphenates only after the following characters:
{{{
'-' minus (45, '\x2D')
'.' dot (46, '\x2E') (but not if the dot is between digits)
'_' underscore (95, '\x5F')
'' SHY hyphenation character (173, '\xAD')
}}}
When the argument CamelCase=True is given to the constructor,
CamelCase words will be hyphenated, too.
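The rules listed above can be sketched as a standalone function. This is not the BaseHyphenator implementation: the function name base_points is invented, and the CamelCase rule (break between a lowercase and a following uppercase letter) is an assumption about what "CamelCase words will be hyphenated" means.

```python
SHY = u"\xad"  # soft hyphen, code point 173
BREAK_AFTER = (u"-", u".", u"_", SHY)


def base_points(word, camel_case=False):
    """Return the offsets at which the word may be broken."""
    points = []
    for i in range(1, len(word)):
        prev, cur = word[i - 1], word[i]
        if prev in BREAK_AFTER:
            # No break after a dot that stands between two digits (e.g. 3.14).
            if prev == u"." and i >= 2 and word[i - 2].isdigit() and cur.isdigit():
                continue
            points.append(i)
        elif camel_case and prev.islower() and cur.isupper():
            # Assumed CamelCase rule: break before an inner uppercase letter.
            points.append(i)
    return points


print(base_points(u"www.example.org"))
print(base_points(u"3.14"))
print(base_points(u"CamelCase", camel_case=True))
```

Each offset is the index at which the right-hand part would start on the next line.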
== Remarks ==
=== Performance ===
The DCWHyphenator is quite slow, due to the recursive nature of the
algorithm.
Since the word length is bounded, the run time when used with ReportLab
is proportional to the number of lines, because only the last word
of each line will be handled by the hyphenator.
For the DCWHyphenator you can cache the results instead of using it
directly, using the following code:
{{{
import wordaxe
from wordaxe.DCWHyphenator import DCWHyphenator
hyph = DCWHyphenator("DE")
wordaxe.hyphRegistry["DE"] = wordaxe.Cached(hyph, 1000)
}}}
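To illustrate why caching helps, here is a minimal sketch of a size-bounded cache around an arbitrary hyphenation function. It is not the code behind wordaxe.Cached: the class name CachedHyphenate, the least-recently-used eviction strategy and the misses counter are assumptions chosen for this example, and max_size is assumed to be at least 1.

```python
from collections import OrderedDict


class CachedHyphenate(object):
    """Wraps a hyphenation callable with a bounded least-recently-used cache."""

    def __init__(self, hyphenate, max_size):
        self._hyphenate = hyphenate
        self._max_size = max_size  # assumed to be >= 1
        self._cache = OrderedDict()
        self.misses = 0

    def __call__(self, word):
        if word in self._cache:
            # Remove the entry so it is re-inserted as most recently used.
            result = self._cache.pop(word)
        else:
            self.misses += 1
            result = self._hyphenate(word)
            if len(self._cache) >= self._max_size:
                # Evict the least recently used entry.
                self._cache.popitem(last=False)
        self._cache[word] = result
        return result


def slow_hyphenate(word):
    """Stand-in for an expensive hyphenator (returns a dummy break point)."""
    return [len(word) // 2]


cached = CachedHyphenate(slow_hyphenate, 1000)
cached(u"Silbentrennung")
cached(u"Silbentrennung")  # answered from the cache
print(cached.misses)
```

Because the same words tend to recur within a document, repeated lookups skip the slow decomposition entirely.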
=== Extensions ===
Other hyphenation libraries can easily be integrated with the
help of ctypes or SWIG.
To do this, you have to override the member function "hyphenate",
which receives a Unicode word as input and has to return
a HyphenatedWord object or None.
=== Notes on ReportLab ===
The code in platypus/paragraph.py
is hard to read, and it is therefore quite hard to extend its
functionality.
Since I was not able to add working hyphenation support to that code
for the case of paragraph splitting, I decided to start a new Paragraph
implementation with a lot of code rewritten from scratch
(NewParagraph.py and para_fragments.py) and to use that instead.