= wordaxe Hyphenation =

Henning von Bargen, March 2009, Release 0.3.2

== Overview ==

The wordaxe library was formerly called "deco-cow", an abbreviation for "decomposition of compound words".

The library consists of three parts:

 1) an (easily extendable) class library adding hyphenation support to Python programs.
 2) a special hyphenation algorithm, based on the decomposition of compound words (the implementation covers only the German language).
 3) a hyphenation extension for the ReportLab PDF library.

== Where to get it ==

The wordaxe library is hosted at SourceForge (http://deco-cow.sourceforge.net). The current stable release can be downloaded from the SourceForge download page. There is also a Subversion repository where you can get the current development code.

== Licence ==

The wordaxe hyphenation library is dual-licensed. You are permitted to use and distribute wordaxe under either of these open source licenses: "Apache 2.0 License" or "2-Clause BSD License". For details, see the file license.txt contained in the library.

For the licenses of the pyHnj library by Danny Yoo (HNJ hyphenation, contained in the wordaxe installation), the ReportLab PDF library and the pyhyphen library, please visit the corresponding web sites.

The dictionary files with the suffix .dic have been taken from the OpenOffice distribution; they are licensed under the GNU LGPL.

== Installation ==

=== ReportLab 2.3 ===

wordaxe Release 0.3.2 has been tested with Python 2.5 and ReportLab 2.3, but it should work equally well with Python 2.4, since as far as I know none of the new features of 2.5 are used.

ReportLab 2.3 can be obtained from www.reportlab.org. The installation is easy (though not with easy-install ;-) and is not described here.

The older wordaxe 0.3.0 works with ReportLab 2.2 or 2.1, too.

Notes on even older ReportLab versions: Though untested, wordaxe Release 0.3.0 may work with ReportLab 2.0.
Otherwise you can use 0.2.2 in this case, but the installation is a little harder (because more files in the ReportLab installation had to be overwritten and the installation guide was not correct). For ReportLab 1.19 please use Release 0.1.1 (not recommended).

When upgrading from ReportLab 1.x to 2.x you will probably have to change existing code (independently of wordaxe), since in 2.x all paragraph input text has to be unicode or UTF-8 encoded.

=== Installing wordaxe step-by-step ===

1. Download the ZIP archive wordaxe-0.3.2.zip from the SourceForge site.

2. Unpack the ZIP archive wordaxe-0.3.2.zip in the root directory C:\ (inside the archive, all files are in a directory "wordaxe-0.3.2"). The following directory structure will be created:

{{{
C:
    wordaxe-0.3.2
        doc
        htdocs
            css
            examples
            icons
            images
        wordaxe
            dict
            plugins
            rl
}}}

3a. On the command line, execute:

{{{
cd /d c:\wordaxe-0.3.2
setup.py install
}}}

3b. Alternative: For ReportLab versions before 2.3: Make a backup copy of the file reportlab\pdfbase\rl_codecs.py from the ReportLab installation; then replace the file with the modified version from c:\wordaxe-0.3.2\wordaxe\rl\rl_codecs.py. Note: This changes only two lines of code in the file, the ones responsible for the "shy hyphen" character SHY. Add the new library to the Python path, for example by creating a file wordaxe.pth in c:\python25\lib\site-packages, containing only one text line:

{{{
c:\wordaxe-0.3.2
}}}

4. Make sure that "import wordaxe" doesn't produce an error message.

5. ReportLab will work as before; differences may only occur if your own programs or texts use the SHY character.

== Usage ==

To see the hyphenation with the DCW algorithm (decomposition of compound words) in action for German text, run the example script "test_hyphenation.py" in the test subdirectory. It will produce two PDF files, test_hyphenation-plain.pdf and test_hyphenation_styled.pdf.
The document you are reading now has also been produced with automatic hyphenation (see the buildDoku.py script).

To add hyphenation support to your own programs (here using the DCW algorithm as an example), only a few modifications to your code are necessary:

Add the following lines:

{{{
from wordaxe import hyphRegistry
from wordaxe.DCWHyphenator import DCWHyphenator
hyphRegistry['DE'] = DCWHyphenator('de',5)
}}}

Search and replace the following strings:

{{{
Search                             Replace with
reportlab.platypus.paragraph       wordaxe.rl.paragraph
reportlab.platypus.xpreformatted   wordaxe.rl.xpreformatted
reportlab.lib.styles               wordaxe.rl.styles
}}}

Enable hyphenation. To do this, set two attributes in your ParagraphStyle:

{{{
# import from wordaxe.rl.styles, as per the replacements above
from wordaxe.rl.styles import getSampleStyleSheet
stylesheet = getSampleStyleSheet()
myStyle = stylesheet["BodyText"]
myStyle.language = 'DE'
myStyle.hyphenation = True
}}}

=== Using a Hyphenator ===

Of course the wordaxe hyphenation can be used independently of ReportLab.

In the constructor, at least a language code and a minimal word length have to be supplied. Shorter words will not be considered for hyphenation.

{{{
from wordaxe.DCWHyphenator import DCWHyphenator
hyphenator = DCWHyphenator('de',5)
}}}

Now you can hyphenate (unicode) words. The return value will either be None (unknown word) or a HyphenatedWord, that is, a word with hyphenation points and their quality.

{{{
# hword is None if the word is unknown to the hyphenator
hword = hyphenator.hyphenate(u"Donaudampfschiffahrt")
print "Possible hyphenations", hword.hyphenations
# Split the word at the second possible hyphenation point:
left,right = hword.split(hword.hyphenations[1])
# returns: (u'Donau\xad', HyphenatedWord(u'dampfschiffahrt'))
# The left part is a unicode object (here: Donau-),
# the right part is the rest of the word (a HyphenatedWord instance again),
# that should go into the next line.
print left
print right
print right.hyphenations
}}}

== Hyphenation Classes ==

The source code for the classes contains test code that you can examine to find out how to use the class.
The test code can be run to see how the corresponding class handles the words given on the command line. You will probably want to supply the -v argument for verbose output, too.

Example:

{{{
c:\python25\python wordaxe\DCWHyphenator.py -v Silbentrennung
}}}

=== DCWHyphenator ===

This class works by splitting compound words into subwords, inspired by the publications of the TU Vienna, see http://www.ads.tuwien.ac.at/research/SiSiSi/. However, the implementation here isn't in any way related to the closed source product "SiSiSi".

The algorithm works as follows:

A given compound word is first decomposed into subwords, using the file DE_hyph.ini, which contains stems, some of them annotated with properties like NEED_SUFFIX, NO_SUFFIX etc. Possible prefixes and suffixes are defined there as well.

Due to its complexity, the algorithm is quite slow: The word is scanned from left to right. It is split into a pair (L,R), where different partitions are of course possible, for example for "Trennung": ("T", "rennung"), ("Tr", "ennung"), ("Tre", "nnung"), and so on. For each pair, the algorithm checks whether the left part L matches a known prefix, stem or suffix from the file DE_hyph.ini. If so, the algorithm continues analogously with the remainder R. Otherwise, or if the combination does not make sense (e.g. prefix + suffix without a stem), the algorithm abandons this partition.

In principle, this is a recursive algorithm, although the implementation uses a to-do list instead.

The properties of the algorithm are:

Only the stems defined in DE_hyph.ini are detected. As a consequence, some valid hyphenation points may be missed, because unknown words are not hyphenated at all. On the other hand, this prevents wrong hyphenations.

If a compound word cannot be decomposed uniquely, only those hyphenation points are used that exist in all the decompositions. This prevents hyphenations that may be valid but hard to read.
Hyphenation points have a priority, which the calling program can use to prefer good hyphenation points (at subword boundaries).

This class supports all features of ExplicitHyphenator as well.

=== ExplicitHyphenator ===

This class supports all features of BaseHyphenator, plus:

Here you have to explicitly define the hyphenation for every word. Thus this class isn't of much use on its own, except when the dictionary is quite small (for example if more or less only fixed text templates are used).

The hyphenations can be defined using the methods add_entry, add_entries and add_entries_from_file (see test script special_words.py).

=== PyHnjHyphenator ===

This works like the pattern-based hyphenation in TeX (see also libhnj, pyhnj). The implementation can use the pyhnj C library, or pure Python if the argument purePython=True is supplied to the constructor.

=== wordaxe.plugins.PyHyphenHyphenator ===

This class also works like the pattern-based hyphenation in TeX, but it uses a different implementation and works much better! To use it, you need to have the pyhyphen library installed (see http://pypi.python.org/pypi/PyHyphen/).

This class supports all features of ExplicitHyphenator as well.

=== BaseHyphenator ===

This class should work for all languages. It hyphenates only after the following characters:

{{{
'-'  minus (45, '\x2D')
'.'  dot (46, '\x2E') (but not if the dot is between digits)
'_'  underscore (95, '\x5F')
'­'  SHY hyphenation character (173, '\xAD')
}}}

When the argument CamelCase=True is given to the constructor, CamelCase words will be hyphenated, too.

== Remarks ==

=== Performance ===

The DCWHyphenator is quite slow, due to the recursive nature of the algorithm. Since the word length is bounded, the run time when used with ReportLab is proportional to the number of lines, because only the last word of each line is handled by the hyphenator.
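To illustrate where the cost of the left-to-right splitting comes from, here is a toy sketch, written for a current Python interpreter and independent of wordaxe. It tries every split point and keeps a to-do list instead of recursing. The word lists, the simplified rules and the names decompose and common_boundaries are made up for this illustration; the real DCWHyphenator reads DE_hyph.ini and handles properties such as NEED_SUFFIX, which are ignored here.

```python
# Toy word lists, invented for this sketch (NOT the contents of DE_hyph.ini).
PREFIXES = {"ver"}
STEMS = {"silbe", "silben", "trenn"}
SUFFIXES = {"n", "ung"}


def decompose(word):
    """Return all ways to split `word` into known prefixes, stems and suffixes.

    Each result is a list of (piece, kind) tuples. Simplified rules: a suffix
    must follow a stem or another suffix, and a split may not end on a prefix.
    """
    word = word.lower()
    results = []
    todo = [(0, [])]  # to-do list of (position in word, pieces found so far)
    while todo:
        pos, pieces = todo.pop()
        if pos == len(word):
            # The whole word is consumed: accept unless it ends on a prefix.
            if pieces and pieces[-1][1] != "prefix":
                results.append(pieces)
            continue
        last_kind = pieces[-1][1] if pieces else None
        # Try every pair (L, R) with L = word[pos:end] and R = word[end:].
        for end in range(pos + 1, len(word) + 1):
            left = word[pos:end]
            kinds = []
            if left in PREFIXES:
                kinds.append("prefix")
            if left in STEMS:
                kinds.append("stem")
            if left in SUFFIXES and last_kind in ("stem", "suffix"):
                kinds.append("suffix")
            for kind in kinds:
                todo.append((end, pieces + [(left, kind)]))
    return results


def common_boundaries(word):
    """Return the split positions that occur in every decomposition of `word`."""
    common = None
    for pieces in decompose(word):
        positions, pos = set(), 0
        for piece, _kind in pieces[:-1]:
            pos += len(piece)
            positions.add(pos)
        common = positions if common is None else common & positions
    return sorted(common or [])
```

With these toy word lists, "Silbentrennung" has two decompositions (silbe+n+trenn+ung and silben+trenn+ung), and common_boundaries("Silbentrennung") returns [6, 11], the boundaries shared by both. An unknown word such as "Xylophon" yields no decomposition at all, and "verung" (prefix + suffix without a stem) is rejected.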
For the DCWHyphenator you can cache the results instead of using it directly, with the following code:

{{{
import wordaxe
from wordaxe.DCWHyphenator import DCWHyphenator
# language code and minimal word length, as in the examples above
hyph = DCWHyphenator("DE", 5)
wordaxe.hyphRegistry["DE"] = wordaxe.Cached(hyph, 1000)
}}}

=== Extensions ===

Other hyphenation libraries can easily be integrated with the help of ctypes or SWIG. To do this, you have to override the member function "hyphenate", which receives a Unicode word as input and has to return a HyphenatedWord object or None.

=== Notes on ReportLab ===

The code in platypus/paragraph.py is hard to read, and therefore it is quite hard to extend its functionality. Since I was not able to add working hyphenation support to that code for the case of paragraph splitting, I decided to start a new Paragraph implementation with a lot of code rewritten from scratch (NewParagraph.py and para_fragments.py) and to use that instead.
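Returning to the caching hint in the Performance section: the idea of wrapping a slow hyphenator can be sketched in a few lines of plain Python. This is not the implementation of wordaxe.Cached. The classes TinyCache and CountingHyphenator are hypothetical, and treating the second argument as a maximum number of cached words is an assumption made for this sketch. The only thing taken from the text above is the contract that hyphenate receives a word and returns a result object or None.

```python
from collections import OrderedDict


class TinyCache(object):
    """Hypothetical wrapper that remembers results of a slow hyphenator."""

    def __init__(self, hyphenator, max_entries):
        self.hyphenator = hyphenator
        self.max_entries = max_entries  # assumed meaning: cache size limit
        self._results = OrderedDict()

    def hyphenate(self, word):
        # Known words (including those that mapped to None) skip the slow call.
        if word in self._results:
            return self._results[word]
        result = self.hyphenator.hyphenate(word)
        self._results[word] = result
        if len(self._results) > self.max_entries:
            self._results.popitem(last=False)  # drop the oldest entry
        return result


class CountingHyphenator(object):
    """Dummy hyphenator for the demo; it only counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def hyphenate(self, word):
        self.calls += 1
        if len(word) < 5:
            return None  # "unknown word", as described in the text
        # A plain string stands in for a HyphenatedWord here.
        return word[:3] + "-" + word[3:]
```

Asking a TinyCache twice for the same word reaches the wrapped hyphenator only once, which is the effect the Performance section is after for words that recur in a document.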