.TH ucto 1 "2024 apr 11"

.SH NAME
ucto \- Unicode Tokenizer
.SH SYNOPSIS
ucto [options] [input\(hyfile] [output\(hyfile]

.SH DESCRIPTION
.B ucto
tokenizes text files: it separates words from punctuation, splits
sentences (and optionally paragraphs), and finds paired quotes.
Ucto is preconfigured with tokenization rules for several languages.

Those rules are provided by
.BR uctodata .

.SH OPTIONS

.BR \-c " configfile"
.RS
read settings from a 'configfile'
.RE

.BR \-B
.RS
run in batch mode. Process all input files and write the results to the output
directory specified with \-O.
.RE

.BR \-d " value"
.RS
set debug mode to 'value'
.RE

.BR \-e " value"
.RS
set input encoding. (default UTF8)
.RE

.BR \-I " value"
.RS
set the input directory to 'value'. (batch mode only)
.RE

.BR \-O " value"
.RS
set the output directory to 'value'. (Required for batch mode)
.RE

.BR \-N " value"
.RS
set UTF8 output normalization. (default NFC)
.RE

.BR \-\-filter =[YES|NO]
.RS
enable or disable filtering of special characters. (default YES)
These special characters can be specified in the [FILTER] block of the
configuration file.
.RE

.BR \-L " language"
.RS
Automatically selects a configuration file by language code.
The language code is generally a three-letter ISO 639-3 code.
For example, 'fra' will select the file tokconfig\(hyfra from the installation
directory.
.RE

.BR \-\-detectlanguages =<lang1,lang2,..langn>
.RS
try to detect all the specified languages. The default language will be 'lang1'.
(only useful for FoLiA output).

All values must be ISO 639-3 codes.

You can also use the special language code 'und'. This ensures there is NO
default language, and any language that is NOT in the list will remain
unanalyzed.

.B Warning:
To be able to handle utterances of mixed language, Ucto uses a simple
sentence splitter based on the markers '.' '?' and '!'.
This may occasionally lead to surprising results.
.RE

.BR \-l
.RS
Convert output text to all lowercase
.RE

.BR \-u
.RS
Convert all input text to uppercase
.RE

.BR \-n
.RS
Emit one sentence per line on output
.RE

.BR \-m
.RS
Assume one sentence per line on input
.RE

.BR \-\-normalize =class1,class2,..,classn
.RS
map all occurrences of tokens with class1,...,classn to their generic names.
For example, \-\-normalize=DATE will map all dates to the word {{DATE}}. Very
useful to normalize tokens like URLs, dates, e\-mail addresses and so on.
.RE

.BR \-T\  value
or
.BR \-\-textredundancy =value
.RS
set text redundancy level for text nodes in FoLiA output:
 'full'    - add text to all levels: <p> <s> <w> etc.
 'minimal' - don't introduce text on higher levels, but retain what is already
 there.
 'none'    - only introduce text on <w>, AND remove all text from higher levels
.RE

.BR \-\-allow\-word\-correction
.RS
Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections.
.RE

.BR \-\-ignore\-tag\-hints
.RS
Skip all
.B tag=token
hints from the FoLiA input. These hints can be used to signal text markup like
subscript and superscript.
.RE

.BR \-\-add\-tokens ="file"
.RS
Add additional tokens to the [TOKENS] block of the default language.
The file should contain one TOKEN per line.
.RE

.BR \-\-passthru
.RS
Don't tokenize, but perform input decoding and simple token role detection
.RE

.BR \-\-filterpunct
.RS
remove most of the punctuation from the output. (not from abbreviations and
embedded punctuation like John's)
.RE

.B \-P
.RS
Disable Paragraph Detection
.RE

.B \-Q
.RS
Enable Quote Detection. (this is experimental and may lead to unexpected
results)
.RE

.B \-s
<string>
.RS
Set End\(hyof\(hysentence marker. (Default <utt>)
.RE

.B \-V
or
.B \-\-version
.RS
Show version information
.RE

.B \-v
.RS
set Verbose mode
.RE

.B \-F
.RS
The input file(s) are assumed to be FoLiA XML. Text in the correct 'inputclass'
will be tokenized.
For files with an '.xml' extension, \-F is the default.

In batch mode, this option restricts processing to files with the '.xml'
extension in the input directory.
.RE

.BR \-\-inputclass ="cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'.
The default is "current".
.RE

.BR \-\-outputclass ="cls"
.RS
When tokenizing a FoLiA XML document, output the tokenized text in text nodes
with 'cls'. The default is "current".
It is recommended to have different classes for input and output.
.RE

.BR \-\-textclass ="cls" (obsolete)
.RS
use 'cls' for input and output of text from FoLiA. (Equivalent to both
\-\-inputclass='cls' and \-\-outputclass='cls')

This option is obsolete and NOT recommended. Please use the separate
\-\-inputclass= and \-\-outputclass= options.
.RE

.BR \-\-copyclass
.RS
when ucto is used on FoLiA with fully tokenized text in inputclass='inputclass',
no text in textclass 'outputclass' is produced. (A warning will be given).
To circumvent this, add the
.B \-\-copyclass
option, which ensures that text will be emitted in that class.
.RE

.B \-X
.RS
All output will be FoLiA XML. Document IDs are autogenerated.

Works in batch mode too.
.RE

.B \-\-id
<DocId>
.RS
Use the specified Document ID for the FoLiA XML. (not allowed in batch mode)
When not provided, a document ID is generated based on the name of the input
file.
.RE
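.SH EXAMPLES
The invocations below are illustrative sketches that combine only options
described above. The file names input.txt, output.txt and output.xml are
hypothetical, and the exact output depends on the installed configuration
files.

Tokenize a French text file and emit one sentence per line:
.RS
.nf
ucto \-L fra \-n input.txt output.txt
.fi
.RE

As above, but also map every token recognized as a date to {{DATE}}:
.RS
.nf
ucto \-L fra \-n \-\-normalize=DATE input.txt output.txt
.fi
.RE

Tokenize a plain text file and write the result as FoLiA XML with an
autogenerated document ID:
.RS
.nf
ucto \-L fra \-X input.txt output.xml
.fi
.RE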

.SH BUGS
likely

.SH AUTHORS
Maarten van Gompel

Ko van der Sloot

e-mail: lamasoftware@science.ru.nl