<!--
SPDX-FileCopyrightText: 2005 The unpaper authors

SPDX-License-Identifier: GPL-2.0-only
-->

Basic Concepts
==============

The terminology to describe `unpaper` makes heavy use of the paper
metaphor, because the software is mainly intended for post-processing
scanned images from printed paper documents.

## Sheets and Pages

The very basic object `unpaper` operates on is a **sheet**. A *sheet*
is an initially blank image in the computer's memory. Think of a
*sheet* as an initially empty piece of paper on which something will
be printed later.

To do something useful with a *sheet*, you will at least want to place
one **page** onto a *sheet*. A *page* is a logical unit of a document
which takes up a rectangular area on a *sheet*. In the simplest
case, one *sheet* carries exactly one *page*; in other cases
(e.g. when using a double-page *layout*) there can be multiple *pages*
placed on one *sheet*.

![Sheets and Pages](img/sheetspages.png)

## Input and Output Image Files

`unpaper` can process either double-page *layout* scans or
individually scanned *pages*. It is up to the user's choice whether an
**image-file** carries a single *page* or a whole *sheet* with two
*pages*.  The program can be configured to either join individual
*image-files* as multiple *pages* onto one *sheet*, or split *sheets*
containing multiple *pages* into several output *image-files* when
saving the output.

By default, `unpaper` places one input *image-file* onto a *sheet*,
and saves one output *image-file* per *sheet*.  Alternatively, the
number of input or output *image-files* per *sheet* can be set to two
using the `--input-pages 2` or `--output-pages 2` options.

If two *image-files* are specified as input, they will successively be
placed on the left-hand half and the right-hand half of the
*sheet*.

![Input Files](img/input-pages.png)

In the same way, if two *image-files* are specified as output, the
*sheet* will be split into two halves which get saved as individual
files.

![Output Pages](img/output-pages.png)

The default value both for `--input-pages` and `--output-pages` is 1.
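
The joining and splitting described above can be pictured with a tiny model in which an image is a list of pixel rows. This is only an illustrative sketch, not `unpaper`'s code: `join_pages` and `split_sheet` are hypothetical names, and the sketch assumes that both pages have the same height and that the sheet is cut exactly in the middle.

```python
def join_pages(left_page, right_page):
    """Place two equally tall pages side by side on one sheet."""
    if len(left_page) != len(right_page):
        raise ValueError("this sketch assumes pages of equal height")
    return [l_row + r_row for l_row, r_row in zip(left_page, right_page)]


def split_sheet(sheet):
    """Cut a sheet into its left-hand and right-hand halves."""
    middle = len(sheet[0]) // 2
    left = [row[:middle] for row in sheet]
    right = [row[middle:] for row in sheet]
    return left, right


page_a = [[1, 1], [1, 0]]
page_b = [[0, 0], [0, 1]]
sheet = join_pages(page_a, page_b)
print(sheet)
print(split_sheet(sheet) == (page_a, page_b))
```

Joining corresponds to the `--input-pages 2` case, splitting to the `--output-pages 2` case.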

### File Formats

The *image-file* formats accepted by `unpaper` are those that
[libav](http://libav.org) can handle. In particular it supports the
whole PNM-family: **PBM**, **PGM** and **PPM**. This ensures
interoperability with the [SANE](http://www.sane-project.org/) tools
under Linux. Support for TIFF and other complex file formats is not
guaranteed.

The output format is restricted to the PNM family of formats, and
conversions to other formats need to happen with tools such as
`pnmtopng`, `pnmtotiff` or `pnmtojpeg`. Alternatively you can use the
`convert` tool from [ImageMagick](http://www.imagemagick.org/).

## Layouts and Templates

### Built-In Layout-Templates

**Layouts** are the linking concept between physical *sheets* and
logical *pages*. A *layout* determines a set of rectangular areas at
which *pages* (or other parts of content) appear on a *sheet*. The
most common and simple *layouts* generally used are the single-page
*layout* (one *page* covers the whole *sheet*), and the double-page
*layout* (two *pages* are placed on the left-hand-side and the
right-hand-side of the *sheet*).

`unpaper` provides basic **layout templates** for the above types.
There are 2 *layout templates* built in, a third one deactivates any
template:

 * `single`
 * `double`
 * `none`

![Available Layout Templates](img/layout-templates.png)

A *layout template* is chosen by using the option `--layout`, e.g.

    unpaper --layout double input%03d.pbm output%03d.pbm

Choosing a *template* with the `--layout` option is equivalent to
specifying a set of other options, e.g. setting `--mask-scan-point`.
In order to combine a *template* with other options, make sure that
the more specific options appear behind the `--layout` option, in
order to overwrite the *template* settings.

The default template is `single`; use `none` to deactivate it.

**Note**: A *layout* is completely independent from the number of
 *image-files* used as input or output. That means, you can either
specify `--layout double` together with a single input *image-file*
(in cases where the input *image-file* already contains two scanned
*pages* in a double-page *layout*), or use it together with an
`--input-pages 2` setting, in order to join two individually scanned
*pages* on one *sheet*.

### Complex Layouts

Besides the built-in fixed *templates*, any kind of complex *layout*
can be handled by manually specifying either *mask-scan-points* using
the `--mask-scan-point` option, or setting *masks* at fixed
coordinates using the `--mask` option. Both the `--mask-scan-point`
and the `--mask` option may occur any number of times, in order to
declare as many *masks* in the *layout* as desired. See below for a
further explanation on *masks*.

## Processing Multiple Files

In many cases, especially when post-processing scanned books, there
will be several input *image-files* to process in sequence within a
single run of `unpaper`, and several output *image-files* to be
generated.  Processing of multiple files in a batch job is supported
through the use of wildcards in filenames, e.g.:

    unpaper (...options...) input%03d.pbm output%03d.pbm

This will successively read images from files `input001.pbm`,
`input002.pbm`, `input003.pbm` etc., and write output to the files
`output001.pbm`, `output002.pbm`, `output003.pbm` etc., until no more
input *image-files* with the current index number are available.

Using a wildcard of the form `%0nd` will replace each occurrence of
the wildcard with an increasing index number, by default starting with
1 and counting up by 1 each time another file gets loaded. *n*
denotes the number of digits that the replaced number string is
supposed to have, and the *0* requests leading zeros. Thus "%03d" will
get replaced with strings in the sequence `001`, `002`, `003`
etc. This way, a sequence of images named e.g. `input001.pgm`,
`input002.pgm`, `input003.pgm`... can be specified.  There are two
separate index counters for input and output files which get increased
independently from each other.
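
The wildcard follows the familiar `printf` conventions, so its effect can be reproduced with Python's `%` formatting. The helper below is a hypothetical illustration (the name `expand` is not part of `unpaper`); it substitutes every `%0nd` or `%d` occurrence in a pattern with the same index.

```python
import re

# Matches printf-style integer wildcards such as %d, %3d or %03d.
WILDCARD = re.compile(r"%0?\d*d")


def expand(pattern, index):
    """Replace each wildcard in `pattern` with the formatted index."""
    return WILDCARD.sub(lambda match: match.group(0) % index, pattern)


for index in range(1, 4):
    print(expand("input%03d.pgm", index), "->", expand("output%03d.pgm", index))
```

In this sketch the input and output names share one loop variable only for brevity; as noted above, `unpaper` keeps separate counters for input and output files.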

Wildcards in filenames are also useful when combining a sequence of
individual *pages* onto double-page layouted *sheets*, or when
splitting double-page layouted *sheets* into individual output
files. When using two input or output *image-files* (by specifying
`--input-pages 2` or `--output-pages 2`) the index number replaced for
the wildcard will generally not be the same as the *sheet* number in
the processing sequence, but will grow twice as fast.

The following example will combine single-page *image-files* onto a
double-page *layout* *sheet*:

    unpaper -n --input-pages 2 singlepage%03d.pgm output%03d.pgm

This joins the input images `singlepage001.pgm` and
`singlepage002.pgm` on `output001.pgm`, `singlepage003.pgm` and
`singlepage004.pgm` on `output002.pgm`, and so on. Note that due to
the use of option `-n` (short for `--no-processing`), the images are
simply copied onto the left-hand half and the right-hand half of the
*sheet* without any processing regarding *layout*, *mask-detection*
etc.

Using multiple input *image-files* by setting `--input-pages 2` is
independent from any *layout* possibly specified with the `--layout`
option. However, in order to use `unpaper`'s post-processing features
for more than simply joining two *image-files* to one, you will most
likely want to combine the use of `--input-pages 2` with the `--layout
double` option, as in:

    unpaper --layout double --input-pages 2 (...other options...) singlepage%03d.pgm output%03d.pgm

![Sequence of multiple input images](img/multiple-input-files.png)

Similarly, it is also possible to split up a *sheet* into several
*image-files* when saving. The following line would be used to split
up a sequence of double-page layouted *sheets* into a sequence of
single-page output images, including full image processing (applying
*masking*, *deskewing*, *border-aligning* etc., see below) in order to
make sure that the *pages* in the double-page *layout* are really
placed fully on the left-hand half and the right-hand half of the
*sheet* before the *sheet* gets split up:

    unpaper --layout double (...options...) --output-pages 2 doublepage%03d.pgm singlepage%03d.pgm

![Sequence of multiple output images](img/multiple-output-files.png)

By default, processing of multiple *sheets* starts with *sheet* number
1, and also with input and output *image-files* number 1. `unpaper`
will run as long as input *image-files* with the current index number
can be found. If no more input files are available, processing stops.

### Adjusting Indices

In order to start with a different *sheet* index, the `--start-sheet`
option can be set. Likewise, setting `--end-sheet` specifies a fixed
*sheet* number that will be the last one processed, even if more
input files are available.

Using `--sheet`, a single *sheet* or a set of specific *sheet* numbers
to be processed can be specified. For example:

    unpaper --sheet 7,12-15,31 --input-pages 2 (...options...) input%03d.pgm output%03d.pgm

This would generate the output-files `output007.pgm`, `output012.pgm`,
`output013.pgm`, `output014.pgm`, `output015.pgm` and `output031.pgm`,
reading input from the same files as if a whole sequence of *sheets*
and *pages* starting with index 1 had been processed, i.e. reading the
files `input013.pgm` and `input014.pgm` for *sheet* 7, `input023.pgm`
and `input024.pgm` for *sheet* 12, and so on.
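
The relation between *sheet* numbers and file indices in this example is simple arithmetic, sketched below. The function names are hypothetical, and `start` is an assumed stand-in for the index at which file numbering begins (1 by default); the code mirrors the description in this section rather than `unpaper`'s implementation.

```python
def parse_sheets(spec):
    """Expand a list such as '7,12-15,31' into individual sheet numbers."""
    sheets = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            sheets.extend(range(int(first), int(last) + 1))
        else:
            sheets.append(int(part))
    return sheets


def input_indices(sheet, pages_per_sheet=1, start=1):
    """File indices consumed by `sheet` when numbering begins at `start`."""
    first = start + (sheet - 1) * pages_per_sheet
    return list(range(first, first + pages_per_sheet))


for sheet in parse_sheets("7,12-15,31"):
    files = ["input%03d.pgm" % i for i in input_indices(sheet, pages_per_sheet=2)]
    print("sheet %2d: %s -> output%03d.pgm" % (sheet, " + ".join(files), sheet))
```

With two input files per *sheet*, *sheet* 7 maps to indices 13 and 14, and *sheet* 12 to indices 23 and 24, matching the filenames listed above.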

To prevent some *sheets* from being processed (i.e., remove them from
the sequence), the option `--exclude` can be used. Note that this is
different from option `--no-processing` or `-n`, which still would
generate the output files but without applying any image processing to
them.

The input and output index numbers to start with can be adjusted using
the options `--start-input` and `--start-output`.  These values apply
to the wildcard replacement in filenames only and are independent from
the *sheet* numbering. In other words, setting these options specifies
an offset at which the file numbering starts relative to
*sheet* 1. For example:

    unpaper --input-pages 2 (...options...) --start-input 7 input%03d.pgm output%03d.pgm

These settings would cause the input-files `input007.pgm` and
`input008.pgm` to be used for *sheet* 1, `input009.pgm` and
`input010.pgm` for *sheet* 2, and so on. The default value for both
options is 1.

### File-Sequence Patterns

More sophisticated **file-sequence patterns** can be specified using
the `--input-file-sequence` or `--output-file-sequence` options. In
cases where the input files are named after a pattern like
e.g. `left01.pbm`, `right01.pbm`, `left02.pbm`, `right02.pbm` etc.,
the use of `--input-pages 2` together with `--input-file-sequence
left%02d.pbm right%02d.pbm` will load the desired images. The index
counter with which the wildcards in the filenames get replaced is
increased every time the *file-sequence pattern* is iterated through;
it is not increased after each single replacement of a wildcard.

Note that it would also be possible to use *file-sequence patterns* of
different lengths than the number of *pages* per *sheet*.  In case an
input *file-sequence* like e.g. `a%d.pbm b%d.pbm c%d.pbm` is specified
together with `--input-pages 2`, the input *image-files* used for the
first *sheet* would be `a1.pbm` and `b1.pbm`, the input *image-files*
used for the second *sheet* would be `c1.pbm` and `a2.pbm` (!), for
the third *sheet* they would be `b2.pbm` and `c2.pbm`, and so on. It's
up to the user whether it makes sense to use *file-sequence patterns*
of different length than the corresponding number of input
*image-files* or output *image-files* per *sheet*.
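
The interplay between a *file-sequence pattern* and the number of *pages* per *sheet* can be simulated in a few lines. The generator below is a hypothetical model of the behaviour described above: the index is increased only after the whole pattern has been iterated through once.

```python
from itertools import count, islice


def file_sequence(patterns):
    """Yield filenames, bumping the index after each full pass of the pattern."""
    for index in count(1):
        for pattern in patterns:
            yield pattern % index


pages_per_sheet = 2
names = list(islice(file_sequence(["a%d.pbm", "b%d.pbm", "c%d.pbm"]), 6))
sheets = [names[i:i + pages_per_sheet] for i in range(0, len(names), pages_per_sheet)]
for number, files in enumerate(sheets, start=1):
    print("sheet", number, files)
```

Grouping the generated names in pairs reproduces the `a1`/`b1`, `c1`/`a2`, `b2`/`c2` assignment from the example.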

Specifying a filename as the very last argument on the command-line is
equivalent to using `--output-file-sequence <file>` (a sequence of
length 1), specifying a filename as the last-but-one argument on the
command line is equivalent to using `--input-file-sequence <file>`.

### Inserting Blank Content

Input *file-sequences* may be forced to use completely blank images at
some index positions. The `--insert-blank` option allows specifying
one or more input indices at which no file is read, but instead a
blank image is inserted into the sequence of input images. The input
image that would have been loaded at this index position in the
sequence will be used at the following non-blank index position
instead, thus the following indices get shifted to make room for the
blank image inserted.

The `--replace-blank` option also allows inserting blank images into
the sequence, but it suppresses the images that would have been loaded
at the specified index positions and ignores them. No index positions
get shifted to make room for the blank image.
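
The difference between the two options is easiest to see side by side. The following sketch models the input sequence as a list; `build_sequence` and the `BLANK` marker are hypothetical, and the behaviour is an interpretation of the description above, not taken from `unpaper`'s source.

```python
BLANK = "<blank>"


def build_sequence(files, length, insert_blank=(), replace_blank=()):
    """Return what ends up at input positions 1..length."""
    remaining = iter(files)
    sequence = []
    for position in range(1, length + 1):
        if position in insert_blank:
            # No file is consumed, so all following files shift back by one.
            sequence.append(BLANK)
        elif position in replace_blank:
            # The file belonging to this position is consumed and dropped.
            next(remaining, None)
            sequence.append(BLANK)
        else:
            sequence.append(next(remaining, None))
    return sequence


files = ["p1.pgm", "p2.pgm", "p3.pgm", "p4.pgm"]
print(build_sequence(files, 4, insert_blank={2}))
print(build_sequence(files, 4, replace_blank={2}))
```

With a blank inserted at position 2, `p2.pgm` moves to position 3; with position 2 replaced, `p2.pgm` is skipped and `p3.pgm` stays at position 3.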

## Masks

**Masks** are rectangular areas on a *sheet* that are affected by
several of the processing steps `unpaper` performs.  Although there
may be as many *masks* on a *sheet* as desired, in most cases it will
be useful to operate with either one or two *masks* per *sheet*
only. A single-page *layout* would operate on only one *mask* covering
the whole *page*, a double-page *layout* would make use of two
*masks*, one placed somewhere in the left-hand half of a *sheet*, the
other somewhere in the right-hand half.

### Automatic Mask-Detection

*Masks* can be set directly by specifying pixel coordinates using the
`--mask` option, but in most cases it is desirable to detect *masks*
automatically. Automatic **mask-detection** allows input images to
contain content which is not perfectly placed at fixed areas, but
probably differs slightly in position from *sheet* to *sheet* (which
is usually the case when books are scanned or photocopied manually).

Automatic *mask-detection* uses a starting point somewhere on the
*sheet* called **mask-scan-point**, which marks a position estimated
to be somewhere inside the *mask* to be detected. (When detecting
*masks* that cover a whole *page*, it is useful to place the
*mask-scan-point* right in the center of the *sheet*'s half on which
the *page* appears.) Beginning from the *mask-scan-point*, the image
content is virtually scanned in either the two horizontal directions
(left and right), or the two vertical directions (up and down), or all
four directions, until no more dark pixels are found which means an
edge of the *mask* is considered to have been found.

![Mask-Detection](img/mask-scan.png)

Several parameters control the process of *mask-detection*. First,
the *mask-scan-points* at which detection starts are specified either
using the `--layout` option (which automatically sets one
*mask-scan-point* for single-page *layouts*, and two
*mask-scan-points* for double-page *layouts*) or manually with the
option `--mask-scan-point`.

*Mask-detection* is performed by the use of a 'virtual bar' which
covers an area of the *sheet* under which the number of dark pixels is
counted.  The 'virtual bar' is moved towards the directions specified
by `--mask-scan-direction`.  (Those directions not given via
`--mask-scan-direction` will use up the whole *sheet*'s size in these
directions for the detected result.)

While moving the 'virtual bar' the number of dark pixels below it is
continually compared to the number that has been counted at the very
first position of the 'virtual bar' above the *mask-scan-point* when
detection started. Once the number of dark pixels drops below the
relative value given by `--mask-scan-threshold`, *mask-detection*
stops and an edge of the *mask* is considered to have been found.
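
The scanning idea can be sketched for a single direction as follows. This is a simplified model under stated assumptions, not `unpaper`'s actual algorithm: the image is a list of rows with `1` for dark pixels, the bar is centred on the *mask-scan-point*, and the parameters `bar_width`, `bar_depth`, `step` and `threshold` are hypothetical stand-ins for the corresponding `--mask-scan-*` settings.

```python
def count_dark(image, left, top, width, height):
    """Count dark pixels (value 1) in a rectangle, clipped to the image."""
    rows = image[max(top, 0):max(top + height, 0)]
    return sum(sum(row[max(left, 0):max(left + width, 0)]) for row in rows)


def scan_right(image, x0, y0, bar_width=2, bar_depth=4, step=1, threshold=0.1):
    """Move a bar rightwards from (x0, y0); return the x where content ends."""
    left = x0 - bar_width // 2
    top = y0 - bar_depth // 2
    initial = count_dark(image, left, top, bar_width, bar_depth)
    x = left
    while x + bar_width <= len(image[0]):
        dark = count_dark(image, x, top, bar_width, bar_depth)
        if dark < threshold * initial:
            return x  # edge found: darkness dropped below the relative threshold
        x += step
    return len(image[0])  # no edge found: fall back to the sheet's edge


# 6 rows x 12 columns, with dark content in columns 2..7 of every row.
image = [[1 if 2 <= x <= 7 else 0 for x in range(12)] for _ in range(6)]
print(scan_right(image, x0=4, y0=3))
```

In this toy image the scan stops at column 8, the first bar position that no longer covers any dark pixels. A full detection would repeat the same procedure towards the left, top and bottom.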

!['Virtual-Bar' for Mask-Detection](img/mask-scan-detail1.png)

The width of the 'virtual bar' can be configured using the
`--mask-scan-size` option, the length of it by setting
`--mask-scan-depth`.  Adjusting the 'virtual bar's' width can help to
fine-tune the process of mask detection according to the content that
is being scanned. The wider the 'virtual bar' is, the more tolerant
the detection process becomes with respect to small gaps in the
content (which is e.g. needed if a *page* is made up of multiple
columns). However, if the 'virtual bar' is too wide, detection might
not stop properly when a mask's edge should have been found.

![Mask-Scan-Threshold](img/mask-scan-detail2.png)

*Mask-detection* can be disabled using the `--no-mask-scan` option,
optionally followed by the *sheet* numbers to disable the filter for.

### Mask-Centering

*Masks* that have been automatically detected or manually set will be
used for several further processing steps. First, they provide the
basis for properly centering the content on the corresponding *page*
area on the *sheet*.

This allows `unpaper` to automatically correct imprecise positions of
*page* content in scanned *sheets* and shift the content to a
normalized position. Especially when processing multiple *pages*, this
leads to more regular positions of *pages* in the sequence of
resulting *sheets*.
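
Centering amounts to computing the offset between the center of the detected *mask* and the center of the *page* area. The sketch below shows that arithmetic only; the function name and the `(left, top, right, bottom)` rectangle convention are assumptions made for illustration.

```python
def centering_shift(mask, page_area):
    """Return the (dx, dy) shift that moves the mask's center onto the
    center of the page area. Rectangles are (left, top, right, bottom)."""
    mask_cx = (mask[0] + mask[2]) / 2
    mask_cy = (mask[1] + mask[3]) / 2
    area_cx = (page_area[0] + page_area[2]) / 2
    area_cy = (page_area[1] + page_area[3]) / 2
    return round(area_cx - mask_cx), round(area_cy - mask_cy)


# A mask detected too far towards the upper left of a 1240 x 1754 page area.
print(centering_shift((100, 200, 900, 1200), (0, 0, 1240, 1754)))
```

A positive shift moves the content right or down, a negative one left or up.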

![Mask-Centering](img/mask-center.png)

*Mask-centering* can be suppressed using `--no-mask-center`,
optionally followed by the *sheet* numbers to disable the filter for.

## Borders

Unlike *masks*, **borders** are detected by starting at the outer
edges of the *sheet* (or left/right halves of the *sheet*, in a
double-page *layout*), and then scanning towards the middle until some
content-pixels are reached. Again, a 'virtual bar' is used for
detection, the width of which can be set using the option
`--border-scan-size`, and the step-distance with which to move it by
setting the option `--border-scan-step`. The option
`--border-scan-threshold` determines the maximum absolute number of
pixels which are tolerated to be found below the 'virtual bar' until
*border-detection* stops and one edge of the *border* area is
considered to have been found.
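
In contrast to the relative threshold used for *masks*, the border scan compares against an absolute pixel count. The following sketch shows the scan from the left edge only. It is a simplified model with hypothetical parameter names mirroring `--border-scan-size`, `--border-scan-step` and `--border-scan-threshold`, and it assumes the bar spans the full height of the image.

```python
def border_from_left(image, bar_width=2, step=1, threshold=5):
    """Scan inwards from the left edge; return the x at which the bar
    first covers more than `threshold` dark pixels."""
    x = 0
    while x + bar_width <= len(image[0]):
        dark = sum(sum(row[x:x + bar_width]) for row in image)
        if dark > threshold:
            return x
        x += step
    return len(image[0])  # nothing found: the whole width counts as empty


# Content in columns 4..9, plus one stray dark pixel near the left edge.
image = [[1 if 4 <= x <= 9 else 0 for x in range(12)] for _ in range(6)]
image[0][1] = 1
print(border_from_left(image))
```

The stray pixel stays below the threshold and is ignored, so the scan only stops once the bar reaches the real content.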

![Border-Detection](img/border-scan.png)

### Border-Aligning

*Borders* serve two different purposes: First, the area outside the
detected *border* on the *sheet* will be wiped out, which is another
mechanism to clean the outer *sheet* boundary from unwanted pixels.

Second, a detected *border* can optionally be aligned towards one edge
of the *sheet*. **Border-aligning** means shifting the area inside the
*border* towards one edge of the *sheet*.  The edge towards which to
shift the border is specified with the option `--border-align`.
Additionally, a fixed distance from the edge is kept, which can be set
via `--border-margin`.

This way, it can be assured that e.g. all *pages* of a scanned book
regularly start 2 cm below the upper *sheet* edge.

![Border-Aligning](img/border-align.png)

Note that *border-aligning* is not performed by default; it needs to
be explicitly activated by setting the option `--border-align` to one
of the edge names `top`, `bottom`, `left` or `right`, and by setting
`--border-margin` to the desired distance which is to be kept to this
edge.

Use `--no-border-scan` to disable *border-detection*, or
`--no-border-align` to prevent *border-aligning* on specific *sheets*,
both optionally followed by the *sheet* numbers to disable the filters
for.

## Size Values

Whenever an option expects a **size** value, there are three possible
ways to specify that:

 * as absolute pixel values, e.g. `--sheet-size 4000,3000`
 * as length measurements on one of the scales `cm`, `mm`, `in`,
    e.g. `--size 30cm,20cm` or also `--size 10in,250mm`
 * using one of the following *size* names:
   - `a5`
   - `a4`
   - `a3`
   - `letter`
   - `legal`
   - `a5-landscape` (horizontally oriented A5)
   - `a4-landscape` (horizontally oriented A4)
   - `a3-landscape` (horizontally oriented A3)
   - `letter-landscape` (horizontally oriented letter)
   - `legal-landscape` (horizontally oriented legal)
   Examples: `--sheet-size a4`, `--post-zoom letter-landscape`

Using one of the last two ways, length measurements get internally
converted to absolute pixel values based on the resolution set via the
option `--dpi`. If the default of 300 DPI should be changed, this
option must appear on the command line before using a length
measurement value. `--dpi` may also appear multiple times, e.g. if the
*size* values of the output image(s) should be based on a different
resolution than those of the input file(s).

Note that using the `--dpi` option will have no effect on the
resolution of the *image-files* that get written as output. (The PNM
format is not capable of storing information about the image
resolution.) The value set via `--dpi` will only have effect on
`unpaper`'s internal conversion of length measurements to absolute
pixel values when *size* values are specified using length
measurements or *size* names.
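
The conversion from length measurements to pixels can be sketched as follows. The code is an illustration under assumptions: `parse_size` and `to_pixels` are hypothetical names, the paper dimensions are the standard ISO and US values, and results are rounded to the nearest pixel, which may differ from the rounding `unpaper` applies internally.

```python
UNITS_PER_INCH = {"in": 1.0, "cm": 2.54, "mm": 25.4}

# Portrait dimensions of the named sizes (standard paper formats).
NAMED_SIZES = {
    "a5": ("148mm", "210mm"),
    "a4": ("210mm", "297mm"),
    "a3": ("297mm", "420mm"),
    "letter": ("8.5in", "11in"),
    "legal": ("8.5in", "14in"),
}


def to_pixels(value, dpi=300):
    """Convert one measurement such as '30cm', '10in' or '4000' to pixels."""
    for unit, per_inch in UNITS_PER_INCH.items():
        if value.endswith(unit):
            return round(float(value[:-len(unit)]) / per_inch * dpi)
    return int(value)  # no unit given: already an absolute pixel count


def parse_size(text, dpi=300):
    """Return (width, height) in pixels for a size value."""
    name, _, orientation = text.partition("-")
    if name in NAMED_SIZES:
        width, height = (to_pixels(side, dpi) for side in NAMED_SIZES[name])
        return (height, width) if orientation == "landscape" else (width, height)
    width, height = text.split(",")
    return to_pixels(width, dpi), to_pixels(height, dpi)


for example in ("4000,3000", "30cm,20cm", "10in,250mm", "a4", "letter-landscape"):
    print(example, "->", parse_size(example))
```

Passing a different `dpi` value changes only the pixel result of the conversion, which corresponds to the role of `--dpi` described above.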