= wordaxe Hyphenation =

Henning von Bargen, March 2009, Release 0.3.2

== Overview ==

The wordaxe library was formerly called "deco-cow", which is
an abbreviation for "decomposition of compound words".

The library consists of three program parts:

1) an (easily extendable) class library adding hyphenation support
   to Python programs.

2) a special hyphenation algorithm, based on the decomposition of
   compound words (the implementation is only for the German language).

3) a hyphenation extension for the <a href="http://www.reportlab.org">ReportLab</a> PDF library.

== <a name="bezug" />Where to get it ==

The wordaxe library is hosted at SourceForge (<a href="http://deco-cow.sourceforge.net">http://deco-cow.sourceforge.net</a>).

The current stable release of the software can be downloaded from the
SourceForge <a href="http://sourceforge.net/project/showfiles.php?group_id=105867">download page</a>.
There is also a Subversion repository from which you can get the current development code.

== Licence ==

The wordaxe hyphenation library is dual-licensed.
You are permitted to use and distribute wordaxe under either
of these open source licenses:

"Apache 2.0 License" or "2-Clause BSD License".

For details, see the file license.txt contained in the library.

Regarding the licenses for the 
<a href="http://hkn.eecs.berkeley.edu/~dyoo/pyHnj">pyHnj library</a> by Danny Yoo
(HNJ hyphenation, contained in the wordaxe installation), for the 
<a href="http://www.reportlab.org">ReportLab PDF library</a>,
and for the <a href="http://code.google.com/p/pyhyphen">pyhyphen library</a>,
please visit the corresponding web sites.

The dictionary files with the suffix <tt>.dic</tt> have been taken from
the OpenOffice distribution; they are licensed under the GNU LGPL.

== Installation ==

=== ReportLab 2.3 ===

wordaxe Release 0.3.2 has been tested with Python 2.5 and ReportLab 2.3,
but it should work equally well with Python 2.4, since as far as I know
none of the new features of Python 2.5 are used.

ReportLab 2.3 can be obtained from <a href="http://www.reportlab.org">www.reportlab.org</a>.
The installation is easy (though not with easy-install ;-) and is not described here.

The older wordaxe 0.3.0 works with ReportLab 2.2 or 2.1, too.

Notes on even older ReportLab versions:

Though untested, wordaxe Release 0.3.0 may work with ReportLab 2.0.
If it does not, you can use Release 0.2.2 instead,
but the installation is a little harder (because more files
in the ReportLab installation have to be overwritten, and the installation guide
for that release was not correct).

For ReportLab 1.19 please use Release 0.1.1 (not recommended).
When upgrading from ReportLab 1.x to 2.x you probably have to change
existing code (independently of wordaxe), since for 2.x all paragraph
input text has to be Unicode or UTF-8 encoded.

=== Installing wordaxe step-by-step ===
1. Download the ZIP archive wordaxe-0.3.2.zip from the SourceForge site.

2. Unpack the ZIP-archive wordaxe-0.3.2.zip in the root directory C:\
   (inside the archive, all files are in a directory "wordaxe-0.3.2").
   
   The following directory structure will be created:

{{{
C:
    wordaxe-0.3.2
        doc
        htdocs
            css
            examples
            icons
            images
        wordaxe
            dict
            plugins
            rl
}}}

   
3a. On the command line, execute:

{{{
cd /d c:\wordaxe-0.3.2
setup.py install
}}}

3b. Alternative:

   For ReportLab versions before 2.3: Make a backup copy of the file reportlab\pdfbase\rl_codecs.py from the
   ReportLab installation; then replace the file with the
   modified version from c:\wordaxe-0.3.2\wordaxe\rl\rl_codecs.py.
   
   <em>Note:</em> This only changes two lines of code in the file, 
   which are responsible for the "soft hyphen" character SHY. 
   
   Add the new library to the Python path, for example by creating a
   file wordaxe.pth in c:\python25\lib\site-packages
   containing only one line of text:

{{{
c:\wordaxe-0.3.2
}}}

4. Make sure that "import wordaxe" doesn't produce an error message.
   
5. ReportLab will work as before; differences can only occur
   if <em>your own</em> programs or texts use the SHY character.
   
== Usage ==

To see the hyphenation with the DCW algorithm (decomposition of compound words)
in action for German text,
run the example script "test_hyphenation.py" in the test subdirectory.
It will produce two PDF files, test_hyphenation-plain.pdf and test_hyphenation_styled.pdf.

The document you are reading now has also been produced with automatic hyphenation
(see the buildDoku.py script).

To add hyphenation support to your own programs (here using the DCW algorithm
as an example), only very few modifications in your code are necessary:
  
Add the following lines:

{{{
from wordaxe import hyphRegistry
from wordaxe.DCWHyphenator import DCWHyphenator
hyphRegistry['DE'] = DCWHyphenator('de',5)
}}}

Search and replace the following strings:
{{{
Search                            Replace with
reportlab.platypus.paragraph      wordaxe.rl.paragraph
reportlab.platypus.xpreformatted  wordaxe.rl.xpreformatted
reportlab.lib.styles              wordaxe.rl.styles
}}}

Enable hyphenation. To do this, set two attributes in your ParagraphStyle:

{{{
stylesheet = getSampleStyleSheet()
myStyle = stylesheet["BodyText"]
myStyle.language = 'DE'
myStyle.hyphenation = True
}}}

=== Using a Hyphenator ===

Of course, the wordaxe hyphenation can also be used independently of ReportLab.

In the constructor, at least a language code and a minimum word length
have to be supplied. Shorter words will not be considered for hyphenation.

{{{
from wordaxe.DCWHyphenator import DCWHyphenator
hyphenator = DCWHyphenator('de',5)
}}}

Now you can hyphenate (unicode) words.
The return value will either be None (unknown word)
or a HyphenatedWord, that is, a word with hyphenation points and their quality.

{{{
hword = hyphenator.hyphenate(u"Donaudampfschiffahrt")
print "Possible hyphenations", hword.hyphenations
# Split the word at the second possible hyphenation point:
left,right = hword.split(hword.hyphenations[1])
# returns: (u'Donau\xad', HyphenatedWord(u'dampfschiffahrt'))
# The left part is a unicode object (here: Donau-),
# the right part is the rest of the word (a HyphenatedWord instance again),
# that should go into the next line.
print left
print right
print right.hyphenations
}}}

== Hyphenation Classes ==

The source code for the classes contains test code that you can
examine to find out how to use the class.
The test code can be run from the command line to see how the corresponding class
handles the words given as arguments.
You will probably want to supply the -v argument for verbose output, too.

Example
{{{
c:\python25\python wordaxe\DCWHyphenator.py -v Silbentrennung
}}}

=== DCWHyphenator ===

This class works by splitting compound words into subwords,
inspired by the publications of the TU Vienna, see 
<a href="http://www.ads.tuwien.ac.at/research/SiSiSi/">http://www.ads.tuwien.ac.at/research/SiSiSi/</a>.

However, the implementation here isn't in any way related to the closed source product "SiSiSi".

The algorithm works as follows:
A given compound word will be decomposed into subwords first,
using the file DE_hyph.ini, which contains stems,
some of them annotated with properties like
NEED_SUFFIX, NO_SUFFIX etc.
Furthermore, possible prefixes and suffixes are defined there.

Due to its complexity, the algorithm is quite slow:

The word is scanned from left to right.
It is split into a pair (L,R); of course, many different partitions
are possible, for example for "Trennung": ("T", "rennung"),
("Tr", "ennung"), ("Tre", "nnung"), and so on.
For each pair, the algorithm checks whether the left part L matches a known prefix, stem
or suffix from the file DE_hyph.ini. If it does, the algorithm continues with
the remainder R in the same way.
Otherwise, or if the combination does not make sense
(for example prefix + suffix without a stem), the algorithm abandons that partition.

In principle, this is a recursive algorithm, although
the implementation uses a to-do list instead.
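
The scanning procedure described above can be sketched in a few lines.
This is only a minimal illustration, not the code from DCWHyphenator.py:
it is written for a current Python 3 interpreter, the three small sets are
hypothetical stand-ins for the contents of DE_hyph.ini, and properties such as
NEED_SUFFIX are ignored. The names classify and decompose are made up for this example.

```python
# Illustrative sketch only; not the wordaxe implementation.
# Hypothetical stand-ins for the prefixes, stems and suffixes in DE_hyph.ini.
PREFIXES = {"ver", "be"}
STEMS = {"trenn", "silben", "wach", "wachs", "stube", "tube"}
SUFFIXES = {"ung"}


def classify(piece):
    """Return the kinds ('prefix', 'stem', 'suffix') that `piece` can be."""
    kinds = []
    if piece in PREFIXES:
        kinds.append("prefix")
    if piece in STEMS:
        kinds.append("stem")
    if piece in SUFFIXES:
        kinds.append("suffix")
    return kinds


def decompose(word):
    """Return all decompositions of `word` into known parts, as sorted lists."""
    word = word.lower()
    results = []
    # To-do list instead of recursion: (start of remainder R, parts found so far).
    todo = [(0, ())]
    while todo:
        start, parts = todo.pop()
        last_kind = parts[-1][0] if parts else None
        if start == len(word):
            # A complete word must end in a stem or a suffix, not in a prefix.
            if last_kind in ("stem", "suffix"):
                results.append([text for _, text in parts])
            continue
        # Try every split (L, R) of the remainder.
        for end in range(start + 1, len(word) + 1):
            left = word[start:end]
            for kind in classify(left):
                # A suffix is only accepted after a stem or another suffix,
                # which also rules out prefix + suffix without a stem.
                if kind == "suffix" and last_kind not in ("stem", "suffix"):
                    continue
                todo.append((end, parts + ((kind, left),)))
    return sorted(results)


print(decompose("Silbentrennung"))
print(decompose("Wachstube"))
```

With these toy sets, "Wachstube" has two decompositions (Wach+stube and Wachs+tube),
which is exactly the ambiguous case discussed in the properties below.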

The properties of the algorithm are:

Only the stems defined in DE_hyph.ini will be detected. 

As a consequence, some valid hyphenation points may be missed, because
unknown words will not be hyphenated at all.

On the other hand, this prevents wrong hyphenations.

If a compound word cannot be decomposed uniquely,
only those hyphenation points that exist in all the decompositions are used.
This prevents hyphenations that may be valid but hard to read.

Hyphenation points have a priority, which can be used by the calling program
to prefer good hyphenation points (at subword boundaries).

This class supports all features of ExplicitHyphenator as well.
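
The rule for ambiguous compound words (keep only the hyphenation points shared by
all decompositions) can be illustrated as follows. This is a sketch under simplifying
assumptions, written for Python 3: it only looks at subword boundaries, expressed as
character offsets, and the function names boundaries and common_boundaries are
hypothetical. The real class also knows hyphenation points inside subwords and
assigns priorities to them.

```python
# Illustrative sketch only; not the wordaxe implementation.

def boundaries(parts):
    """Character offsets at which one subword ends and the next one begins."""
    offsets = set()
    position = 0
    for part in parts[:-1]:
        position += len(part)
        offsets.add(position)
    return offsets


def common_boundaries(decompositions):
    """Offsets that are subword boundaries in every decomposition."""
    if not decompositions:
        return []
    common = set.intersection(*(boundaries(parts) for parts in decompositions))
    return sorted(common)


# "Wachstube" can be read as Wach+stube or Wachs+tube: no shared boundary.
print(common_boundaries([["wach", "stube"], ["wachs", "tube"]]))
# A longer (made-up) compound keeps only the boundary both readings agree on.
print(common_boundaries([["wach", "stuben", "tür"], ["wachs", "tuben", "tür"]]))
```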

=== ExplicitHyphenator ===

This class supports all features of BaseHyphenator.

In addition, it lets you explicitly define the hyphenation for individual words.
Since every word has to be defined this way, this class isn't of much use
on its own, except when the vocabulary is quite small (for example if more
or less only fixed text templates are used).

The hyphenations can be defined using the methods add_entry, add_entries 
and add_entries_from_file (see test script special_words.py).

=== PyHnjHyphenator ===

This works like the pattern-based hyphenation in TeX (see also libhnj, pyhnj).
The implementation can use the pyhnj C-library or use pure Python, if the
argument purePython=True is supplied to the constructor.

=== wordaxe.plugins.PyHyphenHyphenator ===

This class also works like the pattern-based hyphenation in TeX,
but it uses a different implementation and <em>works much better</em>!
To use it, you need to have the pyhyphen library installed
(see <a href="http://pypi.python.org/pypi/PyHyphen/">http://pypi.python.org/pypi/PyHyphen/</a>).

This class supports all features of ExplicitHyphenator as well.

=== BaseHyphenator ===

This class should work for all languages.
It hyphenates only after the following characters:

{{{
    '-'   minus (45, '\x2D')
    '.'   dot (46, '\x2E') (but not if the dot is between digits)
    '_'   underscore (95, '\x5F')
    ''   SHY hyphenation character (173, '\xAD')
}}}

When the argument CamelCase=True is given to the constructor,
CamelCase words will be hyphenated, too.
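
The rules above can be sketched as a small stand-alone function. This is a
hedged illustration for Python 3, not the BaseHyphenator code: the function
name break_positions and its camel_case parameter are hypothetical, and the
CamelCase rule used here (break between a lowercase and a following uppercase
letter) is an assumption about what "CamelCase words" means.

```python
# Illustrative sketch only; not the wordaxe implementation.
SHY = "\u00ad"  # soft hyphen, character 173
BREAK_AFTER = {"-", ".", "_", SHY}


def break_positions(word, camel_case=False):
    """Return indices i where the word may be broken between word[i-1] and word[i]."""
    positions = []
    for i in range(1, len(word)):
        prev, cur = word[i - 1], word[i]
        if prev in BREAK_AFTER:
            # No break at a dot between digits, e.g. in "3.14".
            if prev == "." and i >= 2 and word[i - 2].isdigit() and cur.isdigit():
                continue
            positions.append(i)
        elif camel_case and prev.islower() and cur.isupper():
            # Assumed CamelCase rule: break before an inner capital letter.
            positions.append(i)
    return positions


print(break_positions("www.example.org"))
print(break_positions("3.14"))
print(break_positions("CamelCase", camel_case=True))
```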

== Remarks ==

=== Performance ===

The DCWHyphenator is quite slow, due to the recursive nature of the
algorithm.

Since the word length is bounded, the run time when used with ReportLab
is proportional to the number of lines, because only the last word
of each line will be handled by the hyphenator.

For the DCWHyphenator you can cache the results instead of using it
directly, using the following code:
{{{
import wordaxe
from wordaxe.DCWHyphenator import DCWHyphenator
# Language code and minimum word length, as in the examples above
hyph = DCWHyphenator("DE", 5)
wordaxe.hyphRegistry["DE"] = wordaxe.Cached(hyph, 1000)
}}}
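
To show why caching helps, here is a minimal sketch of a caching wrapper,
written for Python 3. It is not the code behind wordaxe.Cached: the class names
CachedHyphenator and CountingHyphenator are hypothetical, and the eviction policy
(discard the least recently used word once the limit is reached) is an assumption.

```python
# Illustrative sketch only; not the wordaxe implementation.
from collections import OrderedDict


class CachedHyphenator:
    """Wrap any object with a hyphenate(word) method and remember its answers."""

    def __init__(self, hyphenator, max_size=1000):
        self.hyphenator = hyphenator
        self.max_size = max_size
        self._cache = OrderedDict()

    def hyphenate(self, word):
        if word in self._cache:
            # Mark the word as recently used and reuse the stored result.
            self._cache.move_to_end(word)
            return self._cache[word]
        result = self.hyphenator.hyphenate(word)
        self._cache[word] = result
        if len(self._cache) > self.max_size:
            # Assumed policy: drop the least recently used entry.
            self._cache.popitem(last=False)
        return result


class CountingHyphenator:
    """Hypothetical stand-in for a slow hyphenator; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def hyphenate(self, word):
        self.calls += 1
        return [len(word) // 2]  # dummy result, just something to cache


slow = CountingHyphenator()
cached = CachedHyphenator(slow, max_size=2)
cached.hyphenate("Haus")
cached.hyphenate("Haus")
print(slow.calls)  # the second lookup did not reach the slow hyphenator
```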

=== Extensions ===

Other hyphenation libraries can easily be integrated with the
help of ctypes or SWIG.

To do this, you have to override the member function "hyphenate",
which receives a Unicode word as input and has to return 
a HyphenatedWord object or None.
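
The shape of such an extension can be sketched as follows. So that the example
runs on its own (Python 3, without wordaxe installed), it uses hypothetical
stand-ins: HyphenatedWordStandIn replaces the real HyphenatedWord class, whose
constructor is not shown in this document, and toy_backend replaces the call
into an external library. Returning None for words below the minimum length is
also an assumption of this sketch.

```python
# Illustrative sketch only; the classes below are stand-ins, not wordaxe classes.

class HyphenatedWordStandIn(str):
    """Hypothetical stand-in: a string that also carries its hyphenation points."""

    def __new__(cls, word, hyphenations):
        self = super().__new__(cls, word)
        self.hyphenations = list(hyphenations)
        return self


class ExternalHyphenator:
    """Adapter around an external routine; only hyphenate() matters here."""

    def __init__(self, min_word_length, find_points):
        self.min_word_length = min_word_length
        # Callable taking a word and returning a list of offsets, or None.
        self.find_points = find_points

    def hyphenate(self, word):
        if len(word) < self.min_word_length:
            return None  # assumption: short words are not considered
        points = self.find_points(word)
        if points is None:
            return None  # unknown word
        return HyphenatedWordStandIn(word, points)


def toy_backend(word):
    """Hypothetical replacement for a ctypes or SWIG call; knows exactly one word."""
    known = {"Silbentrennung": [3, 6, 10]}  # Sil-ben-tren-nung
    return known.get(word)


hyphenator = ExternalHyphenator(5, toy_backend)
hword = hyphenator.hyphenate("Silbentrennung")
print(hword, hword.hyphenations)
print(hyphenator.hyphenate("Unbekannt"))
```

In a real extension, the class would derive from one of the wordaxe hyphenator
classes and return a genuine HyphenatedWord instead of the stand-in.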

=== Notes on ReportLab ===

The code in platypus/paragraph.py
is hard to read and therefore it is quite hard to extend the
functionality.

Since I was not able to add working hyphenation support to that code
for the case of paragraph splitting, I decided to start a new Paragraph
implementation with a lot of code rewritten from scratch
(NewParagraph.py and para_fragments.py) and use that instead.