File: textchk.html

textchk 2002.01.10
<html lang="en"><head>
<title>Textchk</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name=description content="Textchk">
<meta name=generator content="makeinfo 4.0">
<link href="http://texinfo.org/" rel=generator-home>
</head><body>


<h1>Table of Contents</h1>
<ul>
<li><a href="#Introduction">Introduction</a>
<ul>
<li><a href="#Introduction">License</a>
<li><a href="#Introduction">Obtain Textchk</a>
<li><a href="#Introduction">How to contact the author</a>
</ul>
<li><a href="#The%20problem%20to%20solve">The problem to solve</a>
<li><a href="#Configuration">Configuration</a>
<ul>
<li><a href="#Configuration">Configuration hierarchy</a>
<li><a href="#Configuration">Special cases</a>
</ul>
<li><a href="#Input">Input for the analysis</a>
<li><a href="#How%20to%20use">How to use Textchk</a>
<ul>
<li><a href="#How%20to%20use">How errors are shown</a>
</ul>
<li><a href="#How%20to%20install">How to install Textchk</a>
<ul>
<li><a href="#How%20to%20install">Gettext</a>
<li><a href="#How%20to%20install">Dependencies</a>
</ul>
<li><a href="#Index">Index</a>
</ul>

<p><hr>
Node:<a name="Introduction">Introduction</a>
<br>

<h1>Introduction</h1>

<p>This is the documentation for Textchk. I wrote this simple
program to help me find my usual mistakes while I was writing an
Italian book about GNU/Linux and free software:
<a href="http://www.pluto.linux.it/ildp/appuntilinux/">Appunti Linux</a>.

<p>I was persuaded to translate this program into English and to make it as
general as possible; before that, it worked only with my own
formatting system (ALtools).

<p>I am sorry, but my English is very poor. Comments and
language corrections to this manual are appreciated.

<h2>License</h2>

<p>Textchk is released under the GNU General Public License.

<p>This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

<p>This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

<p>You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

<h2>Obtain Textchk</h2>

<p>At the moment, the main distribution source for Textchk is
the following URI:

<a href="http://master.swlibero.org/~daniele/software/textchk/">http://master.swlibero.org/~daniele/software/textchk/</a>

<h2>How to contact the author</h2>

<pre>Daniele Giacomini
Via Turati, 15
I-31100 Treviso
Italy

daniele @ swlibero.org
</pre>

<p><hr>
Node:<a name="The%20problem%20to%20solve">The problem to solve</a>
<br>

<h1>The problem to solve</h1>

<p>Human writers make mistakes. A spell checker can find misspelled
words, but nothing more. Everyone has their own typical mistakes, and
many of them can be found with simple regular expressions.

<p>Mistakes are not absolute: languages are dynamic, and every author may
choose a style. Textchk helps by letting you define rules that describe
a kind of mistake. For example, <code>\b[Tt]his *this\b</code> is a regular
expression that catches the word "this" written twice in a row (the
first occurrence may be capitalized), which is presumably an error.
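
<p>As a rough illustration, the rule above can be tried outside Textchk.
The sketch below uses Python's <code>re</code> module rather than Textchk's own
Perl, and the sample sentences are invented for the example.

```python
import re

# The rule from the text: "this" twice in a row, first one optionally capitalized.
double_this = re.compile(r"\b[Tt]his *this\b")

samples = [
    "This this is a typical slip.",  # repeated word
    "Read this thistle guide.",      # "this" followed by a longer word
    "This is fine.",                 # single occurrence
]

# True where the rule finds a match in the sentence.
matches = [bool(double_this.search(s)) for s in samples]
for sentence, hit in zip(samples, matches):
    print("MATCH" if hit else "ok   ", sentence)
```

<p>Only the first sentence is reported: the trailing <code>\b</code> keeps the rule
from firing on a longer word that merely begins with "this".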

<p>Errors like these may be typical for one person and very unusual for
another. Textchk lets you create personalized rules that follow your
needs. These rules are mainly meant to belong to a particular
documentation project, but you can also define personal rules (valid for
all of your documentation projects) and general rules that apply
system-wide.

<p><hr>
Node:<a name="Configuration">Configuration</a>
<br>

<h1>Configuration</h1>

<p>The configuration of Textchk consists of a file that defines error rules
(with exceptions) and a file of special cases that, for some reason, are
not to be considered mistakes. The file that contains error and exception
rules is organized in records like these:

<p><code>DBL____<var>error-rule</var>[____<var>explanation-text</var>]</code>

<p><code>ERR____<var>error-rule</var>[____<var>explanation-text</var>]</code>

<p><code>EXC____<var>exception-rule</var></code>

<p>Empty lines and lines that start with a <code>#</code> are ignored.

<p>Four <code>_</code> characters separate the fields. The first field defines the
type of record: <code>DBL</code> means that the record describes a word
repeated for no reason; <code>ERR</code> means that the record describes an
error; <code>EXC</code> means that the record describes an exception to the
preceding error. The second field is a regular expression that describes
an error or an exception, depending on the first field. The optional
third field explains the error. An example may help:

<pre>ERR____\bI'm\b____I'm --&gt; I am
EXC____\bI'm going\b
EXC____\bI'm very proud\b
</pre>

<p>In this case, using <code>I'm</code> is considered an error, because the
author prefers to expand it to <code>I am</code>. The description of the
error is very simple, <code>I'm --&gt; I am</code>, but it can be more explicit
(something like <code>I do not want things like "I'm"</code>). This error
has two exceptions: <code>I'm going</code> and <code>I'm very proud</code> are
allowed.
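
<p>The record layout can be sketched as a small parser. This is only an
illustration in Python, not Textchk's Perl code; the function name
<code>parse_rules</code> and the dictionary keys are hypothetical.

```python
def parse_rules(text):
    """Parse rule records into a list of error rules with their exceptions."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        fields = line.split("____", 2)  # type, regular expression, explanation
        kind = fields[0]
        if kind in ("ERR", "DBL") and len(fields) > 1:
            explanation = fields[2] if len(fields) > 2 else ""
            rules.append({"kind": kind, "regex": fields[1],
                          "explanation": explanation, "exceptions": []})
        elif kind == "EXC" and len(fields) > 1 and rules:
            # An exception is attached to the most recent error rule.
            rules[-1]["exceptions"].append(fields[1])
    return rules


sample = r"""# Contractions
ERR____\bI'm\b____I'm --> I am
EXC____\bI'm going\b
EXC____\bI'm very proud\b
"""

rules = parse_rules(sample)
print(rules)
```

<p>Each <code>EXC</code> record ends up in the list of the error rule that
precedes it, which mirrors the ordering requirement described above.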

<p>When Textchk finds a match for an error rule, it isolates the
text around the error: exactly three words before and three words after.
Of course, fewer than three words may be available. The
comparison with the exceptions is then made against this extracted text. This means
that the following exception can never match, because it needs
four words after the text that is identified as an error.

<pre>ERR____\bI'm\b____I'm --&gt; I am
# The following exception cannot be verified.
EXC____\bI'm very very very proud\b
</pre>
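
<p>The three-word window can be sketched as follows. This is a Python
approximation, not Textchk's own code: it assumes that words are
separated by whitespace, and the names <code>context</code> and
<code>is_error</code> are hypothetical.

```python
import re

# Up to three whitespace-separated words on each side of the match.
BEFORE = re.compile(r"(?:\S+\s+){0,3}\S*$")
AFTER = re.compile(r"\S*(?:\s+\S+){0,3}")

def context(line, match):
    """Return the matched text with at most three words before and after it."""
    left = BEFORE.search(line[:match.start()]).group()
    right = AFTER.match(line[match.end():]).group()
    return left + match.group() + right

def is_error(line, error_rule, exceptions):
    """True when the first hit of error_rule is not excused by any exception."""
    match = re.search(error_rule, line)
    if match is None:
        return False
    snippet = context(line, match)
    return not any(re.search(exc, snippet) for exc in exceptions)

line = "I know that I'm very very very proud of it"
print(context(line, re.search(r"\bI'm\b", line)))
print(is_error(line, r"\bI'm\b", [r"\bI'm very very very proud\b"]))
```

<p>Because the extracted text stops three words after <code>I'm</code>, the
four-word exception has nothing to match and the error is still reported.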

<p>Regular expressions that describe errors and exceptions should not
refer to the beginning or the end of a text line. That is,
regular expressions like <code>^...$</code> are not allowed.

<p>The <code>DBL</code> record describes a word that appears twice in a row,
which is treated as an error. For example:

<pre>DBL____\w\w+____Doubles
EXC____\b[bB]ye\s+bye\b
</pre>

<p>In this case, any word made of two or more alphanumeric characters
is reported if it is written twice in a row, as in "I need need money".
The word "need" is written twice, and that is a mistake. As the example
shows, the exception means that the sequence
"bye bye" or "Bye bye" is allowed.

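<p>One way to picture a <code>DBL</code> rule is as a regular expression with a
back-reference. The Python sketch below is an assumption about the
behaviour, not Textchk's actual implementation: it compares the two words
without regard to case (suggested by the <code>[bB]ye</code> exception) and
checks exceptions against the doubled words only, not against the
three-word context. The name <code>find_doubles</code> is hypothetical.

```python
import re

def find_doubles(line, word_rule, exceptions=()):
    """Return doubled words matched by word_rule, skipping allowed sequences."""
    # Assumption: the second word is compared with the first ignoring case.
    doubled = re.compile(r"\b(" + word_rule + r")\s+\1\b", re.IGNORECASE)
    hits = []
    for match in doubled.finditer(line):
        if any(re.search(exc, match.group()) for exc in exceptions):
            continue  # an exception turns the hit into valid text
        hits.append(match.group())
    return hits

print(find_doubles("I need need money", r"\w\w+"))
print(find_doubles("Bye bye, see you soon", r"\w\w+", [r"\b[bB]ye\s+bye\b"]))
```
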
<h2>Configuration hierarchy</h2>

<p>Textchk is meant to be used with a configuration specific to each
documentation project that an author handles. It is also
possible to define a personal configuration and a system-wide
configuration. Here are the configuration files for errors and
exceptions; at least one of these files is required:

<ol type=1 start=1>
</p><li><code>./.textchk.rules</code> is the current configuration, which
is read before the others;

<li><code>~/.textchk.rules</code> is the personal configuration, which is
read after the current one and before the system-wide configuration;

<li><code>/etc/textchk.rules</code> is the system-wide configuration, which is
read after the others.
</ol>
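
<p>The read order can be sketched in a few lines of Python. This is an
illustration only, not Textchk's code; the function name
<code>rule_files</code> and its <code>exists</code> parameter (used here to
avoid touching the real file system) are hypothetical.

```python
import os

# Read order described above: current, then personal, then system-wide.
CANDIDATES = ["./.textchk.rules", "~/.textchk.rules", "/etc/textchk.rules"]

def rule_files(exists=os.path.exists):
    """Return the configuration files that are present, in read order."""
    found = [path for path in CANDIDATES if exists(os.path.expanduser(path))]
    if not found:
        raise FileNotFoundError("at least one textchk.rules file is required")
    return found

# Pretend that only the personal and system-wide files exist.
present = {os.path.expanduser("~/.textchk.rules"), "/etc/textchk.rules"}
print(rule_files(exists=lambda path: path in present))
```
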

<p>Generally it is better to avoid a system-wide configuration.
However, if you need to override a system-wide rule, insert the same
rule in the personal or current configuration file,
followed by an exception with the same regular expression.
For example, suppose that a system-wide rule is as follows:

<pre>ERR____\bI'm\b____I'm --&gt; I am
</pre>

<p>If you don't want to be bothered by that rule, you can add this to your
personal or current configuration:

<pre># Override system-wide rule.
ERR____\bI'm\b
EXC____\bI'm\b
</pre>

<h2>Special cases</h2>

<p>Sometimes it is not convenient to define an exception rule for a
particular error. Textchk generates a file containing the pieces of text
in which errors were found. If some of these pieces of text are not
mistakes, but you don't want to write an exception rule to suppress the
warning, you can copy them into <code>./.textchk.special</code> (there is no
personal or system-wide equivalent).

<p>Suppose that you run Textchk and you obtain a report made of the
following lines, because you decided that "I'm" is a mistake:

<pre>this is because I'm over the big
I'm out of control
I'm not going anywhere
</pre>

<p>Suppose that you don't want to be warned when the piece of text is
<code>I'm not going anywhere</code>. Just put that line into the file
<code>./.textchk.special</code>, and you will not see this warning anymore.

<pre>I'm not going anywhere
</pre>

<p>It should now be clear that the file <code>./.textchk.special</code> is only for
special exceptions: no regular expressions, only plain text.
Empty lines are ignored, but comments are not allowed.
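
<p>The effect of the special file can be sketched like this. The Python
code below assumes that a report line is dropped only when it matches a
line of the special file exactly; the name <code>drop_special</code> is
hypothetical and this is not Textchk's own code.

```python
def drop_special(report_lines, special_text):
    """Remove report lines that appear verbatim in the special-cases text."""
    # Empty lines in the special file are ignored; there are no comments.
    allowed = {line for line in special_text.splitlines() if line}
    return [line for line in report_lines if line not in allowed]

report = [
    "this is because I'm over the big",
    "I'm out of control",
    "I'm not going anywhere",
]
special = "I'm not going anywhere\n\n"

remaining = drop_special(report, special)
print(remaining)
```
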

<p><hr>
Node:<a name="Input">Input</a>
<br>

<h1>Input for the analysis</h1>

<p>Textchk reads the input file line by line, and the comparison with error
rules is made within a single line. For this reason, the text file
used as input should be transformed so that the lines of each paragraph are
joined together; that is, every paragraph should sit on a single line.

<p>This job is done by a front-end for man pages, HTML pages and Texinfo
sources. For other sources, you must normalize the text into a plain text
file with very long lines.
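
<p>For other sources, the normalization might look like the following
Python sketch. It assumes that paragraphs are separated by blank lines,
which may not hold for every input; the function name
<code>normalize</code> is hypothetical.

```python
def normalize(text):
    """Join the lines of each paragraph so that every paragraph is one line."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "".join(paragraph + "\n" for paragraph in paragraphs)

raw = "I saw this\nthis morning.\n\nA second paragraph.\n"
flat = normalize(raw)
print(flat, end="")
```

<p>After joining, the repeated "this this" sits on one line, where a rule
can see it; in the original text it was split across two lines.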

<p><hr>
Node:<a name="How%20to%20use">How to use</a>
<br>

<h1>How to use Textchk</h1>

<p>Textchk is made of one single executable: <code>textchk</code>.

<p>
<table width="100%">
<tr>
<td align="left"><b>textchk</b><i> <var>option</var> <var>file-to-be-analyzed</var> [<var>report-file</var> [<var>diag-file</var>]]
</i></td>
<td align="right">Command</td>
</tr>
</table>
<table width="95%" align="center">
<tr><td>
</TD></TR>
</TABLE>

<p>The option, <code>--input-type=<var>type</var></code>, defines the type of
the file so that it can be transformed before
the real scan. The following keywords are available:

<ul>
<li><code>man</code> means that this is a man page;

<li><code>html</code> means that this is an HTML page;

<li><code>texinfo</code> or <code>texi</code> means that this is a Texinfo
source;

<li><code>standard</code> means that this is a normalized text file. 
</ul>

<p>The second argument is the name of the file. The third argument can be
the name of the report file (the one that stores the pieces of text
considered mistakes); if it is not given, it defaults to
<code><var>file-to-be-analyzed</var>.err</code>. The fourth argument is the name
of a diagnostic file, which contains all the information about the scan
and is useful for understanding where rules do not do what is expected. If
this name is not given, it defaults to <code><var>report-file</var>.diag</code> or
<code><var>file-to-be-analyzed</var>.diag</code>.

<p>For example,

<p><code>textchk --input-type=man bash.1</code>

<p>gives two files: <code>bash.1.err</code> and <code>bash.1.diag</code>.
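
<p>The default names can be summarized in a short Python sketch. It assumes
that the diagnostic name is derived from the report name when a report
file is given, and from the input name otherwise; the manual does not
spell this out, and the function name <code>output_names</code> is
hypothetical.

```python
def output_names(input_file, report_file=None, diag_file=None):
    """Return (report_file, diag_file) with the default names filled in."""
    if diag_file is None:
        # Assumption: follow the report name when one was given explicitly.
        diag_file = (report_file or input_file) + ".diag"
    if report_file is None:
        report_file = input_file + ".err"
    return report_file, diag_file

print(output_names("bash.1"))
```
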

<h2>How errors are shown</h2>

<p>During its work, Textchk shows on screen what it finds, delimiting
errors with <code>&gt;&gt;</code> and <code>&lt;&lt;</code>. For example, if we have the same
old error rule:

<pre>ERR____\bI'm\b____I'm --&gt; I am
EXC____\bI'm going\b
</pre>

<p>we can obtain warnings like these:

<pre>I'm --&gt; I am
to be here. &gt;&gt;I'm&lt;&lt; here today and
I'm --&gt; I am
&gt;&gt;I'm&lt;&lt; not mad.
</pre>

<p>Inside the diagnostic report, the whole process is shown:

<pre>??? to be here. &gt;&gt;I'm&lt;&lt; here today and
ERR \bI'm\b
!!! to be here. &gt;&gt;I'm&lt;&lt; here today and

??? I know, &gt;&gt;I'm&lt;&lt; going to be
ERR \bI'm\b
EXC \bI'm going\b

??? &gt;&gt;I'm&lt;&lt; not mad.
ERR \bI'm\b
!!! &gt;&gt;I'm&lt;&lt; not mad.

??? Now &gt;&gt;I'm&lt;&lt; here to stay
ERR \bI'm\b
SPC Now I'm here to stay
</pre>

<p>Records starting with <code>???</code> show the problem; records starting with
<code>ERR</code> show the error rule that is responsible; records starting with
<code>EXC</code> show an exception rule that turns the error back into a valid
string; records starting with <code>SPC</code> show a special string that is to
be considered valid; records starting with <code>!!!</code> show an error that
persists.
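
<p>A diagnostic file in this layout can be summarized with a few lines of
Python. The sketch assumes that entries are separated by blank lines and
that the last record of each entry tells how it was resolved, as in the
sample above; the names <code>summarize</code> and <code>OUTCOMES</code> are
hypothetical.

```python
from collections import Counter

# Meaning of the last record of an entry, following the description above.
OUTCOMES = {"!!!": "error", "EXC": "exception", "SPC": "special"}

def summarize(diag_text):
    """Count how each '???' entry of a diagnostic report was resolved."""
    counts = Counter()
    for block in diag_text.strip().split("\n\n"):
        lines = block.splitlines()
        if not lines or not lines[0].startswith("???"):
            continue
        marker = lines[-1][:3]
        counts[OUTCOMES.get(marker, "unknown")] += 1
    return counts

sample = r"""??? to be here. >>I'm<< here today and
ERR \bI'm\b
!!! to be here. >>I'm<< here today and

??? I know, >>I'm<< going to be
ERR \bI'm\b
EXC \bI'm going\b

??? Now >>I'm<< here to stay
ERR \bI'm\b
SPC Now I'm here to stay
"""

summary = summarize(sample)
print(dict(summary))
```
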

<p><hr>
Node:<a name="How%20to%20install">How to install</a>
<br>

<h1>How to install Textchk</h1>

<p>Textchk consists essentially of one executable: <code>textchk</code>. This
file can be placed anywhere you can run it from without giving the path;
that is, inside a directory listed in the environment variable
<code>PATH</code>.

<p>Perl must be available as <code>/usr/bin/perl</code>. If your system is organized
differently, you should modify the first line of this executable:

<pre>#!/usr/bin/perl
#...
</pre>

<p>After that, you need only a suitable <code>./.textchk.rules</code> and maybe
also <code>./.textchk.special</code>.

<h2>Gettext</h2>

<p>The messages that Textchk shows can be translated. To install the PO files
that are already translated, compile them like this:

<pre>msgfmt -o textchk.mo it.po
</pre>

<p>In this example the file <code>it.po</code> is compiled and
the file <code>textchk.mo</code> is generated. The generated file must be copied into
the right directory; in this case, that may be
<code>/usr/share/locale/it/LC_MESSAGES/</code>.

<p>If you don't have the Perl-gettext module installed and you don't want
to worry about it, you can comment out the following instructions:

<pre># We *don't* want to use gettext.
#use POSIX;
#use Locale::gettext;
#setlocale (LC_MESSAGES, "");
#textdomain ("textchk");
</pre>

<p>Then you have to introduce a dummy <code>gettext()</code> function:

<pre>sub gettext
{
    return $_[0];
}
</pre>

<h2>Dependencies</h2>

<p>Textchk depends on other software to transform manual pages, HTML pages
and Texinfo sources into normalized text: Groff, Lynx and
Texinfo. Because Gettext is used, the Perl-gettext module
must also be installed.

<p><hr>
Node:<a name="Index">Index</a>
<br>

<h1>Index</h1>

<ul compact>
<li><code>./.textchk.rules</code>: <a href="#Configuration">Configuration</a>
<li><code>./.textchk.special</code>: <a href="#Configuration">Configuration</a>
<li><code>/etc/textchk.rules</code>: <a href="#Configuration">Configuration</a>
<li>configuration: <a href="#Configuration">Configuration</a>
<li>dependencies: <a href="#How%20to%20install">How to install</a>
<li>Gettext: <a href="#How%20to%20install">How to install</a>
<li>input text: <a href="#Input">Input</a>
<li>installation: <a href="#How%20to%20install">How to install</a>
<li>normalized text: <a href="#Input">Input</a>
<li><code>PATH</code>: <a href="#How%20to%20install">How to install</a>
<li><code>textchk</code>: <a href="#How%20to%20use">How to use</a>
<li><code>~/.textchk.rules</code>: <a href="#Configuration">Configuration</a>
</ul>


</body></html>