File: frog.1

package info (click to toggle)
frog 0.34-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 14,884 kB
  • sloc: cpp: 12,644; sh: 117; makefile: 58; python: 43; ansic: 39
file content (328 lines) | stat: -rw-r--r-- 8,031 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
.TH frog 1 "2023 feb 22"

.SH NAME
frog \- Dutch Natural Language Toolkit
.SH SYNOPSIS
frog [\-t] test\-file

frog [options]

.SH DESCRIPTION
Frog is an integration of memory\(hy-based natural language processing (NLP)
modules developed for Dutch.
Frog's current version will (optionally) tokenize, tag, lemmatize, and
morphologically segment word tokens in Dutch text files, add IOB chunks,
add Named Entities and will assign a dependency graph to each sentence.

.SH OPTIONS

.BR \-c " <file>  or " \-\-config =<file>
.RS
set the configuration using 'file'.

you can use
.B -c lang/config-file
to select the 'config-file' for an installed language 'lang'
.RE

.BR \-\-debug =<module><level>,...
.RS
set debug level per module, indicated by a single letter:
Tagger (T), Tokenizer (t), Lemmatizer (l), Morphological Analyzer (a),
Chunker (c), Multi\(hyWord Units (m), Named Entity Recognition (n),
or Parser (p). Different modules must be separated by commas.

(e.g. \-\-debug=l5,n3 sets the level for the Lemmatizer to 5 and for the NER
to 3 )

Debugging lines are written to a file
.BR frog.<number>.debug
.RE
The name of that file is given at the end of the run.

.BR \-d " <level>"
.RS
set a global debug level for all modules at once.
.RE

.BR \-\-deep\(hymorph
.RS
Add a deep morphological analysis to the 'Tabbed' and JSON output.
This analysis is added to XML unconditionally.

.RE

.BR \-\-compounds
.RS
include compound information as a column to 'Tabbed' output.
This information is added to XML and JSON unconditionally.

.RE

.BR \-e " <encoding>"
.RS
set input encoding. (default UTF8)
.RE

.BR \-h " or " \-\-help
.RS
give some help
.RE

.BR \-\-language= <comma\ separated\ list\ of\ languages>
.RS
Set the languages to work on. This parameter is passed to the tokenizer.
The strings are assumed to be ISO 639\-2 codes.

The first language in the list will be the default, unspecified languages are
asumed to be of that default.

e.g. \-\-language=nld,eng,por
means: detect Dutch, English and Portuguese, with Dutch being the default,
using TextCat. Mainly useful for XML processing.

Specifying a unsupported language is a fatal error. However, you can add the
special language 'und' which assures that sentences in an unknown languages
will be labeled as such, and processed no further.

.B IMPORTANT
Frog can at the moment handle only one language at a time, as determined by the
config file. So other languages mentioned here will be tokenized correctly, but
further they will be handled as that language.
.RE

.BR \-n
.RS
assume inputfile to have one sentence per line. (newline separators)

Very useful when running interactive, otherwise an empty line is needed to
signal end of input.
.RE

.BR \-\-nostdout
.RS
suppress the 'Tabbed' or JSON output to stdout. (when no outputfile was
specified with \-o or \-\-outputdir)

Especially useful when XML output is specified with \-X or \-\-xmldir.
.RE


.BR \-o " <file>"
.RS
send 'Tabbed' output to 'file' instead of stdout. Defaults to the name of the
inputfile with '.out' appended.
.RE

.BR \-\-outputdir " <dir>"
.RS
send all 'Tabbed' or JSON output to 'dir' instead of stdout. Creates filenames
from the inputfilename(s) with '.out' appended.
.RE

.BR \-\-retry
.RS
assume a re-run on the same input file(s). Frog wil only process those files
that haven't been processed yet.
.RE


.BR \-\-skip =[tlacnmp]
.RS
skip parts of the process: Tokenizer (t), Lemmatizer (l), Morphological
Analyzer (a), Chunker (c), Named Entity Recognition (n), Multi-Word Units (m)
or Parser (p).

The Tagger cannot be skipped.

Skipping the Multiword Unit implies disabling the Parser too.
.RE

.BR \-\-alpino
.RS
Use a locally installed Alpino parser. Disables our build-in Dependency parser
.RE

.BR \-\-alpino =server
.RS
use a remote installed Alpino server, as specified in the frog configuration
file.
.RE

\" .BR \-Q
\" .RS
\" Enable quotedetection in the tokenizer. NOT USED.
\" .RE

.BR \-S " <port>"
.RS
Run Frog as a server on 'port'
.RE

.BR \-t " <file>"
.RS
process 'file'.

This option can be omitted. Frog will run on any <file> found on the
qcommand-line.
Wildcards are allowed too. When NO files are specified, Frog will start in
interactive mode.

Files with the extension '.gz' or '.bz2' are handled too. The corresponding
output-files will be compressed using the same compression again. Except
when an explicit output filename is specified.
.RE

.BR \-x " <xmlfile>"
.RS
process 'xmlfile', which is supposed to be in FoLiA format! If 'xmlfile' is
empty, and
.BR \-\-testdir =<dir>
is provided, all '.xml' files in 'dir' will be processed as FoLia XML.

This option can be omitted. Frog will process files with the 'xml' extension
as FoLiA files.

Files with the extension '.xml.gz' or '.xml.bz2' are handled too. The
corresponding output-files will be compressed using the same compression again.
Except when an explicit output filename is specified.
.RE

.BR \-X " <xmlfile>"
.RS
When 'xmlfile' is specified, create a FoLiA XML output file with that name.

When 'xmlfile' is empty, generate FoLiA XML output for every inputfile.
.RE

.BR \-\-textclass "=<cls>"
.RS
When
.BR \-x
is given, use 'cls' to find AND store text in the FoLiA document(s).
Using \-\-inputclass and \-\-\outputclass is in general a better choice.
.RE

.BR \-\-inputclass "=<cls>"
.RS
use 'cls' to find text in the FoLiA input document(s).
.RE

.BR \-\-outputclass "=<cls>"
.RS
use 'cls' to output text in the FoLiA input document(s).
Preferably this is another class then the inputclass.
.RE

.BR \-\-testdir =<dir>
.RS
process all files in 'dir'. When the input mode is XML, only '.xml' files,
'.xml.gz' or '.xml.bz2' files are taken from 'dir'. see also
.B \-\-outputdir
.RE

.BR \-\-uttmarker =<mark>
.RS
assume all utterances are separated by 'mark'. (the default is none).
.RE

.BR \-\-threads =<n>
.RS
use a maximum of 'n' threads. The default is to take whatever is needed.
In servermode we always run on 1 thread per session.
.RE

.BR \-V " or " \-\-version
.RS
show version info
.RE

.BR \-\-xmldir =<dir>
.RS
generate FoLiA XML output and send it to 'dir'. Creates filenames from the
inputfilename with '.xml' appended. (Except when it already ends with '.xml')
.RE

.BR \-X " <file>"
.RS
generate FoLiA XML output and send it to 'file'. Defaults to the name of the
inputfile(s) with '.xml' appended. (Except when it already ends with '.xml')
.RE

.BR \-\-id "=<id>"
.RS
When
.BR \-X
for FoLia is given, use 'id' to give the doc an ID. The default is an xml:id
based on the filename.
.RE

.BR \-\-allow\-word\-corrections
.RS
Allow the
.BR ucto
tokenizer to apply simple corrections on words while processing FoLiA output.
For instance splitting punctuation.
.RE

.BR \-\-max\-parser\-tokens "=<num>"
.RS
Limit the size of sentences to be handled by the Parser. (Default 500 words).

The Parser is very memory consuming. 500 Words will already need 16Gb of RAM.
.RE

.BR \-\-JSONin
.RS
The input is in JSON format. Mainly for Server mode, but works on files too.

This implies \-\-JSONout too!
.RE

.BR \-\-JSONout
.RS
Output will be in JSON instead of 'Tabbed'.
.RE

.BR \-\-JSONout "=<indent>"
.RS
Output will be in JSON instead of 'Tabbed'. The JSON will be idented by value
 'indent'. (Default is indent=0. Meaning al the JSON will be on 1 line)
.RE

.BR \-T
or
.BR \-\-textredundancy "=[full|medium|none]"
.RS
Set the text redundancy level in the tokenizer for text nodes in FoLiA output:
.B full
add text to all levels: <p> <s> <w> etc.
.B minimal
don't introduce text on higher levels, but retain what is already there.
.B none
only introduce text on <w>, AND remove all text from higher levels
.RE

.BR \-\-override "=<section>.<parameter>=<value>"
.RS
Override a parameter from the configuration file with a different value.

This option may be repeated several times.
.RE

.SH BUGS
likely

.SH AUTHORS
Maarten van Gompel

Ko van der Sloot

Antal van den Bosch

e\-mail: lamasoftware@science.ru.nl
.SH SEE ALSO
.BR ucto (1)
.BR mblem (1)
.BR mbma (1)
.BR ner (1)