Morfessor file types
====================

.. _binary-model-def:

Binary model
------------

.. warning::

    Pickled models are sensitive to bitrot. Incompatibilities between Python
    versions can prevent loading a model that was stored by a different
    version. Also, future versions of Morfessor are not guaranteed to be able
    to load models saved by older versions.

The standard format for Morfessor 2.0 is a binary model, generated by pickling
the :ref:`BaselineModel <baseline-model-label>` object. This ensures that all
training data, annotation data and weights are exactly the same as when the
model was saved.
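
As a minimal sketch of why pickling preserves state exactly, the round trip
below uses Python's standard ``pickle`` module on a plain dictionary. The
dictionary and its keys are hypothetical stand-ins for illustration only; a real
``BaselineModel`` object holds far more state.

```python
import io
import pickle

# Hypothetical stand-in for model state (not the real BaselineModel layout).
state = {"corpus_weight": 1.0, "counts": {"kahvi": 39, "kakku": 10}}

# Serialize to an in-memory buffer, then load it back.
buffer = io.BytesIO()
pickle.dump(state, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)
```

The restored object is a separate copy that compares equal to the original,
which is the property the binary model format relies on.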

.. _binary-reduced-model-def:

Reduced Binary model
--------------------
A reduced Morfessor model contains only the information that is necessary for
segmenting new words using (n-best) Viterbi segmentation. Reduced binary models
are much smaller than the full models, but no actions that modify the model
can be performed on them.

.. _morfessor1-model-def:

Morfessor 1.0 style text model
------------------------------
Morfessor 2.0 also supports the text model files that are used in Morfessor
1.0. These files consist of one segmentation per line, preceded by a count,
where the constructions are separated by ' + '.

Specification: ::

    <int><space><CONSTRUCTION>[<space>+<space><CONSTRUCTION>]*

Example: ::

    10 kahvi + kakku
    5 kahvi + kilo + n
    24 kahvi + kone + emme
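
The specification above can be read with a few lines of Python. This is only
an illustrative sketch: ``parse_segmentation_line`` is a hypothetical helper,
not part of the Morfessor API, and it assumes well-formed input lines.

```python
def parse_segmentation_line(line, construction_separator=" + "):
    """Parse '<count> <c1> + <c2> ...' into (count, [constructions])."""
    # The count is everything up to the first space; the rest is the analysis.
    count_text, _, analysis = line.strip().partition(" ")
    return int(count_text), analysis.split(construction_separator)


lines = ["10 kahvi + kakku", "5 kahvi + kilo + n", "24 kahvi + kone + emme"]
parsed = [parse_segmentation_line(line) for line in lines]
```

Each entry of ``parsed`` is a pair of the count and the list of constructions
for that line.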

Text corpus file
----------------
A text corpus file is a free-format text file. All lines are split into
compounds using the compound-separator (default <space>). The compounds are
then split into atoms using the atom-separator. Compounds can occur multiple
times and will be counted as such.

Example: ::

    kahvikakku kahvikilon kahvikilon
    kahvikoneemme kahvikakku
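
A rough sketch of the counting described above, using only the standard
library. The function names are hypothetical, and the sketch assumes that when
no atom-separator is given each character of a compound is treated as one atom.

```python
from collections import Counter


def count_compounds(lines, compound_separator=None):
    """Count how often each compound occurs across all lines."""
    counts = Counter()
    for line in lines:
        # split(None) splits on runs of whitespace and drops empty strings.
        counts.update(line.split(compound_separator))
    return counts


def split_atoms(compound, atom_separator=None):
    """Split a compound into atoms (assumption: characters by default)."""
    if atom_separator is None:
        return tuple(compound)
    return tuple(compound.split(atom_separator))


corpus_lines = ["kahvikakku kahvikilon kahvikilon", "kahvikoneemme kahvikakku"]
counts = count_compounds(corpus_lines)
```

Here ``counts`` maps every compound to the number of times it appeared in the
corpus, so repeated compounds contribute once per occurrence.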

Word list file
--------------
A word list corpus file contains one compound per line, optionally preceded by
a count. If the same word occurs in multiple entries, their counts are summed.
If no count is given, a count of one is assumed (per entry).

Specification: ::

    [<int><space>]<COMPOUND>

Example 1: ::

    10 kahvikakku
    5 kahvikilon
    24 kahvikoneemme

Example 2: ::

    kahvikakku
    kahvikilon
    kahvikoneemme
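
The counting rules can be sketched as follows. The helper names are
hypothetical, and the sketch assumes that a leading field made up only of
digits is a count, which means a compound that itself consists of digits would
need an explicit count in front of it.

```python
from collections import Counter


def parse_word_list_line(line):
    """Return (compound, count) for one '[<int> ]<COMPOUND>' line."""
    text = line.strip()
    head, separator, rest = text.partition(" ")
    if separator and head.isdigit():
        return rest.strip(), int(head)
    # No leading count: the whole line is the compound, counted once.
    return text, 1


def read_word_list(lines):
    """Sum the counts of all entries, merging repeated compounds."""
    counts = Counter()
    for line in lines:
        if line.strip():
            compound, count = parse_word_list_line(line)
            counts[compound] += count
    return counts


word_counts = read_word_list(["10 kahvikakku", "kahvikakku", "5 kahvikilon"])
```

In this usage example ``kahvikakku`` appears once with a count of 10 and once
without a count, so the two entries are merged into a single total.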

Annotation file
---------------
Each line of an annotation file contains one compound followed by one or more
annotations (alternative analyses) of that compound. The separators between
the annotations (default ', ') and between the constructions (default ' ') are
configurable.

Specification: ::

    <compound> <analysis1construction1>[ <analysis1constructionN>]*[, <analysis2construction1>[ <analysis2constructionN>]*]*

Example: ::

    kahvikakku kahvi kakku, kahvi kak ku
    kahvikilon kahvi kilon
    kahvikoneemme kahvi konee mme, kah vi ko nee mme
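
A small sketch of how a line in this format could be parsed with the default
separators. ``parse_annotation_line`` is a hypothetical helper written for
illustration, not the Morfessor implementation, and it assumes the compound is
separated from its analyses by a single space.

```python
def parse_annotation_line(line, analysis_separator=", ",
                          construction_separator=" "):
    """Return (compound, analyses); each analysis is a list of constructions."""
    # Everything before the first space is the compound itself.
    compound, _, analyses_text = line.strip().partition(" ")
    analyses = [
        analysis.split(construction_separator)
        for analysis in analyses_text.split(analysis_separator)
    ]
    return compound, analyses


compound, analyses = parse_annotation_line(
    "kahvikakku kahvi kakku, kahvi kak ku"
)
```

Joining the constructions of any one analysis should give back the compound,
which makes for a simple sanity check on hand-written annotation files.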