File: manual.txt

package info (click to toggle)
pdfsandwich 0.1.7-2
  • links: PTS, VCS
  • area: main
  • in suites: bullseye, sid
  • size: 188 kB
  • sloc: ml: 946; sh: 318; makefile: 142; perl: 110
file content (83 lines) | stat: -rw-r--r-- 5,856 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
NAME
  pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files
SYNOPSIS
  pdfsandwich [options] inputfile.pdf
DESCRIPTION
  pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images.
  Note that pdfsandwich needs the following programs: unpaper, convert, gs, hocr2pdf (for tesseract < 3.03), and tesseract.
  As tesseract >= 3.03 can write pdf files, hocr2pdf is only needed for older versions of tesseract.
  Please visit http://www.tobias-elze.de/pdfsandwich.
OPTIONS
  -convert 	 -convert filename : name of convert binary (default: convert)
  -coo 		 -coo options : additional convert options; make sure to quote;
		  e.g. -coo "-normalize -black-threshold 75%"
		  call convert --help or man convert for all convert options
  -debug 	 keep all temporary files in /tmp (for debugging)
  -enforcehocr2pdf 	 use hocr2pdf even if tesseract >= 3.03
  -first_page 	 -first_page number : number of page to start OCR from (default: 1)
  -gray           use grayscale for images (default: black and white)
  -grayfilter 	 enable unpaper's gray filter; further options can be set by -unpo
  -gs 		 -gs filename : name of gs binary (default: gs); optional, only required for resizing
  -hocr2pdf 	 -hocr2pdf filename : name of hocr2pdf binary (default: hocr2pdf);
		  ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
  -hoo 		 -hoo options : additional hocr2pdf options; make sure to quote
  -identify 	 -identify filename : name of identify binary (default: identify)
  -last_page 	 -last_page number : number of page up to which to process OCR (default: number of pages in inputfile)
  -lang 	 -lang language : language of the text; option to tesseract (default: eng)
		  e.g: eng, deu, deu-frak, fra, rus, swe, spa, ita, ...
		  see option -list_langs;
		  Multiple languages may be specified, separated by plus characters.
  -layout 	 -layout { single | double | none } : layout of the scanned pages; requires unpaper
		  single: one page per sheet
		  double: two pages per sheet
		  none: no auto-layout (default)
  -list_langs 	 list currently available languages and exit;
		  in case of custom binaries of tesseract, place this after the -tesseract option
  -maxpixels 	 -maxpixels NUM : maximal number of pixels allowed for input file
		  if (resolution/72)^2 *width*height > maxpixels then scale page of input file down
		  prior to OCR so that page size in pixels corresponds to maxpixels;
		  default: 17415167 (A3 @ 300 dpi)
  -noimage 	 do not place the image over the text (requires hocr2pdf; ignored without -enforcehocr2pdf option)
  -nopreproc 	 do not preprocess with unpaper
  -nthreads 	 -nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
  -o 		 -o filename : output file; default: inputfile_ocr.pdf (if extension is different
		  from .pdf, original extension is kept)
  -omp_thread_limit      -omp_thread_limit number : number of threads tesseract may use for each page (default: 1)
  -pagesize 	 -pagesize { original | NUMxNUM } : set page size of output pdf (requires ghostscript)
		  original: same as input file (default)
		  NUMxNUM: width x height in pixel (e.g. for A4: -pagesize 595x842)
  -pdfinfo 	 -pdfinfo filename : name of pdfinfo binary (default: pdfinfo)
  -pdfunite 	 -pdfunite filename : name of pdfunite binary (default: pdfunite)
  -resolution 	 -resolution NUM : resolution (dpi) used for OCR (default: 300)
  -rgb 		 use RGB color space for images (default: black and white);
		  use with care: causes problems with some color spaces
  -sloppy_text 	 sloppily place text, group words, do not draw single glyphs;
		  ignored for tesseract >= 3.03 unless option -enforcehocr2pdf is set
  -tesseract 	 -tesseract filename : name of tesseract binary (default: tesseract)
  -tesso 	 -tesso options : additional tesseract options; make sure to quote
  -unpaper 	 -unpaper filename : name of unpaper binary (default: unpaper)
  -unpo 	 -unpo options : additional unpaper options; make sure to quote
  -quiet 	 suppress output
  -verbose 	 produce more output
  -version 	 print version and quit
  -help  Display this list of options
  --help  Display this list of options

LANGUAGES
  Via Tesseract, numerous language packagess available - follow this link http://code.google.com/p/tesseract-ocr/downloads/list for a complete list. Here is an incomplete selection of supported languages and their abbreviations:

  ara (Arabic), aze (Azerbauijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim (Simplified Chinese),
  chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu (German), ell
  (Greek), eng (English), enm (Old English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old
  French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croation), hun (Hungarian), ind (Indonesian), ita
  (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor (Norwegian), pol
  (Polish), por (Portuguese), ron (Romanian), rus (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa
  (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr
  (Ukrainian), vie (Vietnamese)
  
  Multiple languages may be specified, separated by plus characters. Note that the respective tesseract language package needs to be installed on your system to be usable by pdfsandwich. Option -list_langs lists the languages which are available on your system.
AVAILABILITY
  Sources and packages as well as comprehensive help can be found at http://www.tobias-elze.de/pdfsandwich.
AUTHOR
  Tobias Elze <sourceforge@tobias-elze.de>