voikko-fi - Finnish dictionary for Voikko
=========================================

General information
-------------------

Voikko-fi (previously known as Suomi-malaga) is a description of
Finnish morphology written for libvoikko. The implementation uses
the unweighted VFST format and provides a format 5 Finnish dictionary
for libvoikko 4.0 or later.

For Voikko, the morphology supports spell checking, hyphenation
and grammar checking. Special support is also included for the text
indexer Sukija: it covers common spelling mistakes, old spellings,
old inflection types and old or rare words.

Requirements
------------

Building voikko-fi from this source code requires foma, libvoikko,
Python (version 3 or later) and GNU make. Building a dictionary that
can be used in web browsers with js-libvoikko additionally requires
the Emscripten SDK.
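
As a rough pre-build check, the sketch below reports which of the
command-line tools are missing from PATH. The function name
`missing_tools` is made up for this example, and the executable names
you pass to it (for instance `foma`, `python3`, `make`) are assumptions
that may differ on your system. libvoikko is a library, so this check
does not cover it.

```shell
# Print each named command that cannot be found on PATH.
# (Hypothetical helper, not part of voikko-fi.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

It could be called as `missing_tools foma python3 make`; no output
means all of the listed commands were found.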

Build and installation
----------------------

No configuration is required: to build the code for Voikko,
you only need to run

    make vvfst

Installation can be done by running

    make vvfst-install DESTDIR=/usr/lib/voikko

(Replace /usr/lib/voikko with the directory you want to install the
files to. Installing to ~/.voikko will cause libvoikko to use this
version of voikko-fi only for the user who does the installation.)

Building the code for Sukija can be done by running

    make vvfst-sukija

Installation can be done by running

    make vvfst-install-sukija DESTDIR=/usr/lib/voikko

You should install the Sukija binaries to the same directory where
you install the Voikko files.
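
The steps above can be combined into one sequence. The following is
only a sketch: the function name `install_voikko_and_sukija` is
hypothetical, and installing right after each build is a cautious
choice made here, not a documented requirement. The make targets and
the DESTDIR variable are the ones described in this README.

```shell
# Build the Voikko and Sukija dictionaries and install both into the
# same directory. Stops at the first failing step.
# (Hypothetical helper; run it from the voikko-fi source root.)
install_voikko_and_sukija() {
    dest="${1:-$HOME/.voikko}"   # per-user location unless overridden
    make vvfst &&
    make vvfst-install DESTDIR="$dest" &&
    make vvfst-sukija &&
    make vvfst-install-sukija DESTDIR="$dest"
}
```

Calling it without arguments installs to ~/.voikko for the current
user. Passing a system directory such as /usr/lib/voikko will usually
require sufficient permissions to write there.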

Supported Make targets
----------------------

- vvfst  
  Builds the binary files for the Finnish dictionary (libvoikko format version 5).
- vvfst-sukija  
  Builds the binary files for the Finnish dictionary (libvoikko format version 5)
  used in the Sukija indexer.
- vvfst-install DESTDIR=/usr/lib/voikko  
  Installs the version 5 binary files needed by libvoikko to the directory
  specified by DESTDIR. DESTDIR is optional and defaults to
  /usr/lib/voikko
- vvfst-install-sukija DESTDIR=/usr/lib/voikko  
  Like vvfst-install but installs the binary files built by the vvfst-sukija target.
- vvfst-install-js-preload-file DESTDIR=/usr/lib/voikko  
  Like vvfst-install but additionally converts the dictionary to an
  Emscripten-compatible preload file that can be used with js-libvoikko.
- dist-gzip  
  Builds the full source package.
- clean  
  Removes all files generated by other targets.
- update-vocabulary  
  Updates the XML vocabulary from the nightly snapshot at
  joukahainen.puimula.org. This target requires wget to
  be available.


Variables for tuning the build process
--------------------------------------

- make vvfst:
  * VVFST_BUILDDIR=path/to/directory  
    Specifies the directory where build files are written to while building
    for Voikko.  
    Default: vvfst (build within source directory)
  * VVFST_BASEFORMS=yes|no  
    Include the information needed for generating the BASEFORM attribute. Setting
    this to "no" results in a smaller dictionary file. Note that the BASEFORM
    attribute will still be produced, but its values will likely be incorrect.
    Disable this option only for application-specific (embedded) dictionaries
    that are known to be used only for spell checking, grammar checking or
    hyphenation.  
    Default: yes
  * GENLEX_OPTS="--option1=xxx --option2=yyy ..."  
    Sets the option string for the lexicon generator script.
    The available options are:
    + --min-frequency=n  
      Limits the words to be included in the .lex files to the
      specified or higher frequency class. Default is 9.
    + --extra-usage=usage1,usage2,...  
      If a word has usage flags (it belongs to a special vocabulary), it is
      included in the vocabulary only if at least one of the usage flags is
      listed here. Available usage flags are listed in file
      vocabulary/flags.txt.
      Listing "sukija" here causes application-specific exclusions to be ignored
      (words marked with not_voikko will also be included).
      By default, no special vocabularies are included.
    + --style=style1,style2,...  
      If a word has style flags (such as old, foreign or dialect), it is
      included in the vocabulary only if all of the style flags are listed
      here. Available style flags are listed in file vocabulary/flags.txt.  
      Default: old,international,inappropriate
    + --sourceid  
      Insert word identifiers from Joukahainen into the lexicon and return them
      during morphological analysis. This option has no effect unless
      VOIKKO_DEBUG=yes is set. By default, source ids are not preserved.
  * VANHAT_MUODOT=yes|no  
    Accept word forms that were present in old Finnish but are no longer
    considered valid in standard Finnish. Default: no
  * VOIKKO_VARIANT=variant  
    Set the short name for the language variant of this vocabulary. The
    name should match the regular expression [a-z][a-z0-9_]*  
    Default: standard
  * VOIKKO_DESCRIPTION="Description of the vocabulary"  
    Set the long description for the language variant of this vocabulary.
  * SM_PATCHINFO="Information about applied patches"  
    If you have modified the source code or are distributing prerelease
    versions, describe any modifications made to the released version here.
    It may be best to change this directly in the Makefile.
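
To illustrate how these variables fit together, here is a minimal
sketch that builds a custom variant. The helper names
(`is_valid_variant`, `build_variant`), the `build-<variant>` directory
naming and the description text are all invented for this example. The
`dialect` style flag is taken from the examples mentioned above; check
vocabulary/flags.txt for the flags that actually exist.

```shell
# Check a name against the pattern [a-z][a-z0-9_]* that VOIKKO_VARIANT
# must match (the whole name, not just a prefix).
is_valid_variant() {
    case "$1" in
        ''|[!a-z]*|*[!a-z0-9_]*) return 1 ;;   # empty, bad first char, or bad char
        *) return 0 ;;
    esac
}

# Build a variant that also accepts words carrying the "dialect" style
# flag. The default style flags are repeated because a word is included
# only if all of its style flags are listed.
build_variant() {
    variant="$1"
    if ! is_valid_variant "$variant"; then
        echo "invalid variant name: $variant" >&2
        return 1
    fi
    make vvfst \
        VVFST_BUILDDIR="build-$variant" \
        VOIKKO_VARIANT="$variant" \
        VOIKKO_DESCRIPTION="Finnish with dialect words (example)" \
        GENLEX_OPTS="--style=old,international,inappropriate,dialect"
}
```

For example, `build_variant dialects` run in the source root would
start such a build. The install step presumably needs the same
VVFST_BUILDDIR value so that it finds the built files; that is an
assumption worth verifying against the Makefile.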

Copyright and license information
---------------------------------

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version. See file COPYING for details.

Copyright (©) 2006 - 2020 Hannu Väisänen (Email: Hannu.Vaisanen@uef.fi)
and 2006 - 2020 Harri Pitkänen (hatapitk@iki.fi). Contributors listed
in file CONTRIBUTORS hold copyrights to the vocabulary data.