File: DataConversion.ref.txt

= Do not use Edit(GUI) button. =

[[TableOfContents(4)]]

Copyright 2007, 2008  Osamu Aoki GPL, (Please agree to GPL, GPL2, and any version of GPL which is compatible with DFSG if you update any part of this wiki page)

Generated HTML is at "[http://people.debian.org/~osamu/pub/getwiki/html/ch12.en.html Debian Reference: Chapter 12. Data conversion]".

I welcome your contributions to update this wiki page. You must follow these rules:
 * Do not use Edit(GUI) button of MoinMoin.
 * You can update anytime for:
  * grammar errors
  * spelling errors
  * moved URL location
  * package name transition adjustment (emacs23 etc.)
  * clearly broken script.
 * Before updating this wiki content:
  * Read "[http://wiki.debian.org/DebianReference/Test Guide for contributing to Debian Reference]".

= Data conversion =

Standards-based tools are in very good shape, but support for proprietary data formats is limited.

== Text data conversion tools ==

The following packages for text data conversion caught my eye:

|| List of text data conversion tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{libc6}}} || 37751 || - || charset || The text encoding conversion between locales with {{{iconv}}} command. (fundamental) ||
|| {{{recode}}} || 1039 || - || charset+eol || The text encoding conversion between locales. (versatile, more aliases and features) ||
|| {{{konwert}}}|| 250 || - || charset || The text encoding conversion between locales. (fancy) ||
|| {{{nkf}}} || 235 || - || charset || The character set translator for Japanese. ||
|| {{{tcs}}} || 27 || - || charset || The character set translator. ||
|| {{{unaccent}}} || 20 || - || charset || Replace accented letters by their unaccented equivalent. ||
|| {{{tofrodos}}} || 851 || - || eol || The text format converter between DOS and Unix: {{{fromdos}}} and {{{todos}}} ||
|| {{{macutils}}} || 136 || - || eol || The text format converter between Macintosh and Unix: {{{frommac}}} and {{{tomac}}} ||

=== To convert a text file with iconv ===

The {{{iconv}}} command is provided as a part of the {{{libc6}}} package, so it is always available on all systems to convert the encoding of characters:

{{{
$ iconv -f encoding1 -t encoding2 input.txt >output.txt
}}}

Encoding values are case insensitive and ignore "{{{-}}}" and "{{{_}}}" for matching.  The supported encodings can be checked by the "{{{iconv -l}}}" command.
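
As a minimal sketch of the command above, the following wraps {{{iconv}}} in a small shell function and feeds it one ISO-8859-1 encoded word.  The function name {{{to_utf8}}} and the sample word are made up for illustration; only the "{{{-f}}}" and "{{{-t}}}" options come from the text above.

```shell
#!/bin/sh
# to_utf8 ENCODING: convert standard input from ENCODING to UTF-8.
# (to_utf8 is a hypothetical helper name, not part of any package.)
to_utf8() {
    iconv -f "$1" -t UTF-8
}

# "cafe" with an e-acute, stored in ISO-8859-1: the accented letter is the
# single byte 0xE9 (octal 351).  In UTF-8 it becomes the two bytes 0xC3 0xA9.
printf 'caf\351\n' | to_utf8 ISO-8859-1
```

Piping the result through "{{{od -c}}}" is a quick way to confirm which bytes were actually written.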

|| List of encoding values and their usage. || ||
|| '''encoding value''' || '''usage''' ||
|| [http://en.wikipedia.org/wiki/ASCII ASCII] || [http://en.wikipedia.org/wiki/ASCII American Standard Code for Information Interchange]. 7-bit code without accented characters. ||
|| [http://en.wikipedia.org/wiki/UTF-8 UTF-8] || The current multilingual standard for all modern OSs. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_8859-1 ISO-8859-1] || Old standard for western European languages, ASCII + accented characters. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_8859-2 ISO-8859-2] || Old standard for eastern European languages, ASCII + accented characters. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_8859-15 ISO-8859-15] || Old standard for western European languages, [http://en.wikipedia.org/wiki/ISO/IEC_8859-1 ISO-8859-1] with euro sign. ||
|| [http://en.wikipedia.org/wiki/Code_page_850 CP850] || Code page 850, Microsoft DOS characters with graphics for western European languages. [http://en.wikipedia.org/wiki/ISO/IEC_8859-1 ISO-8859-1] variant. ||
|| [http://en.wikipedia.org/wiki/Code_page_932 CP932] || Code page 932, Microsoft Windows style  [http://en.wikipedia.org/wiki/Shift-jis Shift-JIS] variant, for Japanese. ||
|| [http://en.wikipedia.org/wiki/Code_page_936 CP936] || Code page 936, Microsoft Windows style [http://en.wikipedia.org/wiki/GB2312 GB2312], [http://en.wikipedia.org/wiki/GBK GBK], or [http://en.wikipedia.org/wiki/GB18030 GB18030] variant, for Simplified Chinese. ||
|| [http://en.wikipedia.org/wiki/Code_page_949 CP949] || Code page 949, Microsoft Windows style [http://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-KR EUC-KR] or Unified Hangul Code variant, for Korean. ||
|| [http://en.wikipedia.org/wiki/Code_page_950 CP950] || Code page 950, Microsoft Windows style [http://en.wikipedia.org/wiki/Big5 Big5] variant, for Traditional Chinese. ||
|| [http://en.wikipedia.org/wiki/Windows-1251 CP1251] || Code page 1251, Microsoft Windows style encoding for the Cyrillic alphabet. ||
|| [http://en.wikipedia.org/wiki/Windows-1252 CP1252] || Code page 1252, Microsoft Windows style [http://en.wikipedia.org/wiki/ISO/IEC_8859-1 ISO-8859-1] variant with the euro sign, for western European languages. ||
|| [http://en.wikipedia.org/wiki/KOI8-R KOI8-R] || Old Russian UNIX standard for the Cyrillic alphabet. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_2022 ISO-2022-JP] || Standard encoding for Japanese e-mail which uses only 7 bit codes. ||
|| [http://en.wikipedia.org/wiki/EUC eucJP] || Old Japanese UNIX standard 8 bit code and completely different from [http://en.wikipedia.org/wiki/Shift-jis Shift-JIS]. ||
|| [http://en.wikipedia.org/wiki/Shift-jis Shift-JIS] || JIS X 0208 Appendix 1 standard, for Japanese. See [http://en.wikipedia.org/wiki/Code_page_932 CP932] above. ||

(!) Some encodings are only supported for the data conversion and are not used as locale values (@{@basicsofencoding@}@).

For character sets which fit in a single byte, such as [http://en.wikipedia.org/wiki/ASCII ASCII] and [http://en.wikipedia.org/wiki/ISO/IEC_8859 ISO-8859] character sets, the [http://en.wikipedia.org/wiki/Character_encoding character encoding] means almost the same thing as the character set.

For character sets with many characters, such as [http://en.wikipedia.org/wiki/JIS_X_0213 JIS X 0213] for Japanese or [http://en.wikipedia.org/wiki/Universal_Character_Set Universal Character Set (UCS, Unicode, ISO-10646-1)] for practically all languages, there are many encoding schemes to fit them into a sequence of bytes:
 * [http://en.wikipedia.org/wiki/EUC EUC] and [http://en.wikipedia.org/wiki/ISO/IEC_2022 ISO/IEC 2022 (also known as JIS X 0202)] for Japanese, or
 * [http://en.wikipedia.org/wiki/UTF-8 UTF-8] and [http://en.wikipedia.org/wiki/UTF-32/UCS-4 UTF-32/UCS-4] for Unicode.
For these, there is a clear distinction between the character set and the character encoding.

For some vendor-specific encodings, the term [http://en.wikipedia.org/wiki/Code_page code page] is used as a synonym for the character encoding table.

(!) Please note that most encoding systems share the same codes with ASCII for the 7-bit characters, but there are some exceptions.  If you are converting old Japanese C programs and URL data from the casually named shift-JIS encoding to UTF-8, use "{{{CP932}}}" as the encoding name instead of "{{{shift-JIS}}}" to get the expected results: 0x5C -> "{{{\}}}" and 0x7E -> "{{{~}}}".  Otherwise, these are converted to the wrong characters.

{i} The {{{recode}}} command may be used too and offers more than the combined functionality of the {{{iconv}}}, {{{fromdos}}}, {{{todos}}}, {{{frommac}}}, and {{{tomac}}} commands.  For more, see the pertinent description in "{{{info recode}}}".

=== To convert file names with iconv ===

Here is an example script for the simple case: it converts the encoding of file names created under an older OS to modern UTF-8.
{{{
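#!/bin/sh
# Encoding used by the old file names; the target is UTF-8.
ENCDN=iso-8859-1
for x in *; do
  y=$(printf '%s' "$x" | iconv -f "$ENCDN" -t utf-8)
  # Rename only when the converted name differs from the old one.
  [ "$x" = "$y" ] || mv -- "$x" "$y"
done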
}}}

The "{{{$ENCDN}}}" variable should be set to one of the encoding values from @{@listofencodingvauesandtheirusage@}@ .

For more complicated cases, mount the disk drive containing such file names with the proper encoding given as a {{{mount}}}(8) option (see @{@filenameencoding@}@) and copy the entire disk to another disk drive mounted as UTF-8 with the "{{{cp -a}}}" command.

=== EOL conversion ===

The text file format, specifically the end-of-line (EOL) code, is dependent on the platform:

|| List of EOL conversion tools. || || || ||
|| '''platform''' || '''EOL code''' || '''EOL control sequence''' || '''EOL ASCII value''' ||
|| Debian (unix) || LF || {{{^J}}} || 10 ||
|| MSDOS and Windows || CR-LF || {{{^M^J}}} || 13, 10 ||
|| Apple's Macintosh || CR || {{{^M}}} || 13 ||

The EOL format conversion programs, {{{fromdos}}}(1), {{{todos}}}(1), {{{frommac}}}(1), and {{{tomac}}}(1), are quite handy.  The {{{recode}}} command is also useful.

{i} The use of "{{{sed -e '/\r$/!s/$/\r/'}}}" instead of "{{{todos}}}" is better when you want to unify the EOL style to the MSDOS style from the mixed MSDOS and Unix style.  (e.g., after merging 2 MSDOS style files with {{{diff3}}}.)  This is because "{{{todos}}}" adds CR to all lines.
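
The {{{sed}}} expression above can be tried on a two-line sample.  This is only a sketch: it assumes GNU {{{sed}}} (which understands "{{{\r}}}"), and the function name {{{unify_to_dos}}} is made up for illustration.

```shell
#!/bin/sh
# unify_to_dos: read text with mixed LF and CR-LF line endings on standard
# input and write it with CR-LF on every line.  Assumes GNU sed, which
# interprets \r as a carriage return.  (The function name is hypothetical.)
unify_to_dos() {
    # Lines already ending in CR are left alone; all others get a CR added.
    sed -e '/\r$/!s/$/\r/'
}

# One MSDOS style line followed by one Unix style line; od shows the bytes.
printf 'dos\r\nunix\n' | unify_to_dos | od -c
```

Running the output through {{{unify_to_dos}}} a second time should change nothing, which is the point of skipping lines that already end in CR.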

(!) Some data on the Debian system, such as the wiki page data for the {{{python-moinmoin}}} package, use MSDOS style CR-LF as the EOL code.  So the above rule is just a general rule.

(!) Most editors (e.g., {{{vim}}}, {{{emacs}}}, {{{gedit}}}, ...) can handle files in MSDOS style EOL transparently.

=== TAB conversion ===

You can expand the tab codes in a text file to multiple spaces in {{{vim}}} using the "{{{:retab}}}" command.

There are a few popular specialized programs to convert the tab codes:

|| List of TAB conversion commands from {{{bsdmainutils}}} and {{{coreutils}}} packages. || || ||
|| '''function''' || '''{{{bsdmainutils}}}''' || '''{{{coreutils}}}''' ||
|| expand tab to spaces || "{{{col -x}}}" || {{{expand}}} ||
|| unexpand tab from spaces || "{{{col -h}}}" || {{{unexpand}}} ||

The {{{indent}}}(1) command from the {{{indent}}} package completely reformats whitespace in a C program.
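
The {{{coreutils}}} commands from the table above can be tried as follows.  This is a small sketch with made-up function names; note that with GNU {{{unexpand}}} the "{{{-t}}}" option also converts blanks that are not at the start of a line.

```shell
#!/bin/sh
# tabs_to_spaces WIDTH: expand each tab on standard input to the next
# multiple of WIDTH columns.
# spaces_to_tabs WIDTH: turn runs of blanks that reach a tab stop back
# into tabs.  (Both function names are hypothetical.)
tabs_to_spaces() {
    expand -t "$1"
}
spaces_to_tabs() {
    unexpand -t "$1"
}

# "a", a tab, "b" with tab stops every 4 columns: the tab becomes 3 spaces.
printf 'a\tb\n' | tabs_to_spaces 4
# 8 leading spaces with tab stops every 8 columns: they become one tab.
printf '        x\n' | spaces_to_tabs 8 | od -c
```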

=== Editors with auto-conversion ===

Intelligent modern editors such as the {{{vim}}} program are quite smart and cope well with any encoding system and any file format.  You should use these editors under a UTF-8 locale in a UTF-8 capable console for the best compatibility.

An old western European Unix text file, "{{{u-file.txt}}}", stored in the latin1 encoding can be edited simply with {{{vim}}} as:
{{{
$ vim u-file.txt
}}}
This is possible since the auto detection mechanism of the file encoding in {{{vim}}} assumes the UTF-8 encoding first and, if it fails, assumes it to be latin1.

An old Polish Unix text file, "{{{pu-file.txt}}}", stored in the latin2 encoding can be edited with {{{vim}}} as:
{{{
$ vim '+e ++enc=latin2 pu-file.txt'
}}}

An old Japanese Unix text file, "{{{ju-file.txt}}}", stored in the eucJP encoding can be edited with {{{vim}}} as:
{{{
$ vim '+e ++enc=eucJP ju-file.txt'
}}}

An old Japanese MS-Windows text file, "{{{jw-file.txt}}}", stored in the so-called shift-JIS encoding (more precisely: CP932) can be edited with {{{vim}}} as:
{{{
$ vim '+e ++enc=CP932 ++ff=dos jw-file.txt'
}}}

When a file is opened with the "{{{++enc}}}" and "{{{++ff}}}" options, "{{{:w}}}" in the Vim command line stores it in the original format and overwrites the original file.  You can also specify the saving format and the file name in the Vim command line, e.g., "{{{:w ++enc=utf8 new.txt}}}".

Please refer to mbyte.txt, "multi-byte text support", in the {{{vim}}} on-line help.

The {{{emacs}}} family of programs can perform the equivalent functions.

### I do not know easy description for EMACS.  Please update this for EMACS.

=== Plain text extraction ===

The following reads a web page into a text file.  This is very useful when copying configurations off the Web or applying basic Unix text tools such as {{{grep}}} to the web page.

{{{
$ lynx -dump http://www.remote-site.com/help-info.html >textfile
}}}

Similarly, you can extract plain text data from other formats using the following:

|| List of tools to extract plain text data. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{html2text}}} || 4926 || - || html->text || An advanced HTML to text converter. (Better than "{{{lynx -dump}}}") ||
|| {{{w3m}}} || 6313 || - || html->text || An HTML to text converter with the "{{{w3m -dump}}}" command. ||
|| {{{lynx}}} || 4662 || - || html->text || An HTML to text converter with the "{{{lynx -dump}}}" command. ||
|| {{{elinks}}} || 1343 || - || html->text || An HTML to text converter with the "{{{elinks -dump}}}" command. ||
|| {{{links}}} || 1148 || - || html->text || An HTML to text converter with the "{{{links -dump}}}" command. ||
|| {{{links2}}} || 598 || - || html->text || An HTML to text converter with the "{{{links2 -dump}}}" command. ||
|| {{{antiword}}} || 477 || - || MSWord->text,ps || This converts !MSWord files to plain text or ps. ||
|| {{{catdoc}}} || 333 || - || MSWord->text,TeX || This converts !MSWord files to plain text or TeX. ||
|| {{{pstotext}}} || 199 || - || ps/pdf->text || Extract text from !PostScript and PDF files. ||
|| {{{unhtml}}} || 39 || - || html->text || Remove the markup tags from an HTML file. ||
|| {{{odt2txt}}} || 33 || - || odt->text || The converter from !OpenDocument Text to text. ||
|| {{{wpd2sxw}}} || 33 || - || WordPerfect->sxw || WordPerfect to OpenOffice.org/!StarOffice writer document converter. ||

=== Highlighting and formatting plain text data ===

|| List of tools to highlight plain text data. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{vim-runtime}}} || 1849 || - || highlight || Vim can convert source code to HTML with {{{:source $VIMRUNTIME/syntax/html.vim}}} (vim MACRO) ||
|| {{{cxref}}} || 163 || - || c->html || The converter for the C program to LaTeX and HTML. (C) ||
|| {{{src2tex}}} || 66 || - || highlight || This converts many source codes to TeX. (C) ||
|| {{{source-highlight}}} || 56 || - || highlight || This converts many source codes to HTML, XHTML, LaTeX, Texinfo, ANSI color escape sequences and !DocBook files with highlight. (C++) ||
|| {{{highlight}}} || 47 || - || highlight || This converts many source codes to HTML, XHTML, RTF, LaTeX, TeX or XSL-FO files with highlight. (C++) ||
|| {{{grc}}} || 30 || - || text->color || The generic colouriser for everything. (Python) ||
|| {{{txt2html}}} || 88 || - || text->html || Text to HTML converter. (Perl) ||
|| {{{markdown}}} || 74 || - || text->html || The converter from text to (X)HTML. (Perl) ||
|| {{{asciidoc}}} || 67 || - || text->any || A text document formatter to XML. (Python) ||
|| {{{txt2tags}}} || 60 || - || text->any || The document conversion from text to HTML, SGML, LaTeX, man page, !MoinMoin, Magic Point and PageMaker. (Python) ||
|| {{{udo}}} || 17 || - || text->any || universal document - text processing utility. (C) ||
|| {{{stx2any}}} || 16 || - || text->any || The document converter from structured plain text to other formats. (m4) ||
|| {{{rest2web}}} || 16 || - || text->html || The document converter from !ReStructured Text to html. (Python) ||
|| {{{aft}}} || 16 || - || text->any || The "free form" document preparation system. (Perl) ||
|| {{{yodl}}} || 16 || - || text->any || A pre-document language and tools to process it. (C) ||
|| {{{sdf}}} || 12 || - || text->any || The simple document parser. (Perl) ||
|| {{{sisu}}} || 11 || - || text->any || The document structuring, publishing and search framework. (Ruby) ||

== XML data ==

[http://en.wikipedia.org/wiki/XML The Extensible Markup Language (XML)] is a markup language for documents containing structured information.  

[http://xml.com/ XML.COM] has good introductory information:
 * [http://www.xml.com/pub/a/98/10/guide0.html "What is XML?"]
 * [http://xml.com/pub/a/2000/08/holman/index.html "What Is XSLT?"]
 * [http://xml.com/pub/a/2002/03/20/xsl-fo.html "What Is XSL-FO?"]
 * [http://xml.com/pub/a/2000/09/xlink/index.html "What Is XLink?"]

=== Basic hints for XML ===

XML text looks somewhat like HTML.  It enables us to manage multiple formats of output for a document.  One easy XML system is {{{docbook-xsl}}}, which is used here.

Each XML file starts with the standard XML declaration:
{{{
<?xml version="1.0" encoding="UTF-8"?>
}}}
The basic syntax for one XML element is marked up as:
{{{
<name attribute="value">content</name>
}}}
An XML element with empty content is marked up in the short form as:
{{{
<name attribute="value"/>
}}}
The "{{{attribute="value"}}}" in the above examples is optional.

The comment section in XML is marked up as:
{{{
<!-- comment -->
}}}

Other than adding markup, XML requires minor conversion of the content using predefined entities for the following characters:

|| List of predefined entities for XML. || ||
|| '''predefined entity''' || '''character to be converted from''' ||
|| {{{&quot;}}} || {{{"}}} : quote ||
|| {{{&apos;}}} || {{{'}}} : apostrophe ||
|| {{{&lt;}}}   || {{{<}}} : less-than ||
|| {{{&gt;}}}   || {{{>}}} : greater-than ||
|| {{{&amp;}}}  || {{{&}}} : ampersand ||

<!> "{{{<}}}" and "{{{&}}}" cannot be used literally in attribute values or element content.

(!) When SGML style user defined entities, e.g. "{{{&some-tag;}}}", are used, the first definition wins over others.  The entity definition is expressed as "{{{<!ENTITY some-tag "entity value">}}}".

(!) As long as the XML markup is done consistently with a certain set of tag names (with data either as content or as attribute values), conversion to another XML format is a trivial task using XSLT.
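
The conversion to the predefined entities listed in the table above can be sketched with {{{sed}}}.  The function name {{{xml_escape}}} is made up for illustration, and this is only one simple way to do it.

```shell
#!/bin/sh
# xml_escape: replace the five characters from the table of predefined
# entities with those entities.  "&" must be handled first; otherwise the
# "&" of the entities produced by the other rules would be escaped again.
# (xml_escape is a hypothetical helper name.)
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&apos;/g"
}

# Sample text containing "<", "&", and double quotes.
printf '%s\n' 'a < b && "c"' | xml_escape
```

The escaped text can then be placed safely in element content or inside a quoted attribute value.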

=== XML processing ===

There are many tools available to process XML files such as [http://en.wikipedia.org/wiki/Extensible_Stylesheet_Language the Extensible Stylesheet Language (XSL)].

Basically, once you create a well-formed XML file, you can convert it to any format using Extensible Stylesheet Language Transformations (XSLT).
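
As a rough sketch of such a conversion, the following shell function writes a tiny XML file and an XSLT stylesheet into a temporary directory and turns the XML into plain text with the {{{xsltproc}}} command.  It assumes the {{{xsltproc}}} package is installed; the function name and all file, element, and attribute names are made up for illustration.

```shell
#!/bin/sh
# list_packages: create a sample XML document and an XSLT stylesheet, then
# convert the XML to plain text with xsltproc.  Everything named here
# (list.xml, names.xsl, <packages>, <package>, name=) is hypothetical.
list_packages() {
    workdir=$(mktemp -d) || return 1

    cat > "$workdir/list.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<packages>
  <package name="recode">charset+eol</package>
  <package name="tofrodos">eol</package>
</packages>
EOF

    cat > "$workdir/names.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/packages">
    <xsl:for-each select="package">
      <xsl:value-of select="@name"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

    # The stylesheet comes first, the XML document second.
    xsltproc "$workdir/names.xsl" "$workdir/list.xml"
    status=$?
    rm -r "$workdir"
    return "$status"
}
```

Calling {{{list_packages}}} should print one "name: content" line for each {{{package}}} element.  Changing the "{{{method}}}" attribute of "{{{xsl:output}}}" to "{{{xml}}}" or "{{{html}}}" is how the same approach produces markup instead of plain text.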

Although the Extensible Stylesheet Language for Formatting Objects (XSL-FO) is supposed to be the solution for formatting, the FOP program is not in Debian main (yet?).  So LaTeX code is usually generated from XML using XSLT, and the LaTeX system is used to create printable files such as DVI, PostScript, and PDF.

|| List of XML tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{docbook-xml}}} || 17472 || - || xml || This package contains the XML document type definition (DTD) for !DocBook. ||
|| {{{xsltproc}}} || 3804 || - || xslt || XSLT command line processor. (XML-> XML, HTML, plain text, etc.) ||
|| {{{docbook-xsl}}} || 422 || - || xml/xslt || This contains XSL stylesheets for processing !DocBook XML to various output formats with XSLT. ||
|| {{{xmlto}}} || 245 || - || xml/xslt || XML-to-any converter with XSLT. ||
|| {{{dblatex}}} || 29 || - || xml/xslt || This converts Docbook files to DVI, !PostScript, PDF documents with XSLT. ||

Since XML is a subset of [http://en.wikipedia.org/wiki/SGML Standard Generalized Markup Language (SGML)], it can be processed by the extensive tools available for SGML, such as [http://en.wikipedia.org/wiki/Document_Style_Semantics_and_Specification_Language Document Style Semantics and Specification Language (DSSSL)].

|| List of DSSSL tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{openjade}}} || 585 || - || dsssl || Implementation of the DSSSL language based on James Clark's Jade software. ||
|| {{{jade}}} || 531 || - || dsssl || James Clark's DSSSL engine. ||
|| {{{docbook-dsssl}}} || 821 || - || xml/dsssl || This contains DSSSL stylesheets for processing !DocBook XML to various output formats with DSSSL. ||
|| {{{docbook-utils}}} || 275 || - || xml/dsssl || The utilities for Docbook files including conversion to other formats (HTML, RTF, PS, man, PDF) with {{{docbook2*}}} commands with DSSSL. ||
|| {{{sgml2x}}} || 23 || - || SGML/dsssl || The converter from SGML and XML using DSSSL stylesheets. ||

=== The XML data extraction ===

You can extract HTML or XML data from other formats using the following:

|| List of XML data extraction tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{wv}}} || 589 || - || MSWord->any || The document converter from Microsoft Word to HTML, LaTeX, etc. ||
|| {{{texi2html}}} || 555 || - || texi->html || The converter from Texinfo to HTML. ||
|| {{{man2html}}} || 375 || - || manpage->html || The converter from manpage to HTML. (CGI support) ||
|| {{{tex4ht}}} || 217 || - || tex<->html || The converter between (La)TeX and HTML. ||
|| {{{xlhtml}}} || 202 || - || MSExcel->html || The converter from !MSExcel .xls to HTML. ||
|| {{{ppthtml}}} || 182 || - || MSPowerPoint->html || The converter from !MSPowerPoint to HTML. ||
|| {{{unrtf}}} || 167 || - || rtf->html || The document converter from RTF to HTML, etc. ||
|| {{{info2www}}} || 127 || - || info->html || The converter from GNU info to HTML. (CGI support) ||
|| {{{ooo2dbk}}} || 35 || - || sxw->xml ||  The converter from OpenOffice.org SXW documents to !DocBook XML. ||
|| {{{wp2x}}} || 19 || - || WordPerfect->any || !WordPerfect 5.0 and 5.1 files to TeX, LaTeX, troff, GML and HTML. ||
|| {{{doclifter}}} || 13 || - || troff->xml || The converter from troff to !DocBook XML. ||

For non-XML HTML files, you can convert them to XHTML, which is an instance of well-formed XML and can be processed by XML tools.

|| List of XML pretty print tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{libxml2-utils}}} || 3673 || - || xml<->html<->xhtml || The command line XML tool with "{{{xmllint}}}" command. (syntax check, reformat, lint, ...) ||
|| {{{tidy}}} || 1962 || - || xml<->html<->xhtml ||  HTML syntax checker and reformatter. ||

Once proper XML is generated, you can use XSLT technology to extract data based on the mark-up context etc.

== Printable data ==

Printable data is expressed in the [http://en.wikipedia.org/wiki/PostScript PostScript] format on the Debian system.  The [http://en.wikipedia.org/wiki/Common_Unix_Printing_System Common Unix Printing System (CUPS)] uses Ghostscript as its rasterizer backend for non-PostScript printers.

=== The Ghostscript ===

The core of printable data manipulation is the Ghostscript PostScript interpreter, which generates raster images.

The latest upstream Ghostscript from Artifex was re-licensed from AFPL to GPL, and it merged all the latest ESP version changes, such as the CUPS related ones, in the 8.60 release as a unified release.

|| List of Ghostscript PostScript interpreters. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''description''' ||
|| {{{ghostscript}}} || || || [http://en.wikipedia.org/wiki/Ghostscript The GPL Ghostscript PostScript/PDF interpreter] ||
|| {{{ghostscript-x}}} || || || The GPL Ghostscript PostScript/PDF interpreter - X Display support ||
|| {{{gs-cjk-resource}}} || || || Resource files for gs-cjk, ghostscript CJK-TrueType extension ||
|| {{{cmap-adobe-cns1}}} || || || CMaps for Adobe-CNS1 (for traditional Chinese support) ||
|| {{{cmap-adobe-gb1}}} || || || CMaps for Adobe-GB1 (for simplified Chinese support) ||
|| {{{cmap-adobe-japan1}}} || || || CMaps for Adobe-Japan1 (for Japanese standard support) ||
|| {{{cmap-adobe-japan2}}} || || || CMaps for Adobe-Japan2 (for Japanese extra support) ||
|| {{{cmap-adobe-korea1}}} || || || CMaps for Adobe-Korea1 (for Korean support) ||
|| {{{libpoppler3}}} || || || PDF rendering library based on xpdf PDF viewer ||
|| {{{libpoppler-glib3}}} || || || PDF rendering library (GLib-based shared library) ||
|| {{{poppler-data}}} || || || Encoding data for the poppler PDF rendering library (for CJK support) ||

{i} Use "{{{gs -h}}}" to display the Ghostscript configuration.

=== Merge two PS or PDF files ===

You can merge two PS or PDF files using the {{{gs}}}(1) command of Ghostscript.
{{{
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pswrite -sOutputFile=bla.ps -f foo1.ps foo2.ps
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=bla.pdf -f foo1.pdf foo2.pdf
}}}

(!) The [http://en.wikipedia.org/wiki/Portable_Document_Format Portable Document Format (PDF)], which is a widely used cross-platform printable data format, is essentially the compressed PS format with a few additional features and extensions.

{i} From command line, {{{psmerge}}}(1) and other commands from the {{{psutils}}} package are useful for manipulating PostScript documents.  Commands in the {{{pdfjam}}} package work similarly for manipulating PDF documents. {{{pdftk}}}(1) from the {{{pdftk}}} package is useful for manipulating PDF documents, too.

=== Printable data utilities ===

The following packages for the printable data utilities caught my eye:

|| List of printable data utilities. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{poppler-utils}}} || 3324 || - || pdf->ps,text,...  || PDF utilities. (pdftops, pdfinfo, pdfimages, pdftotext, and pdffonts) ||
|| {{{psutils}}} || 2950 || - || ps->ps || PostScript document conversion tools ||
|| {{{poster}}} || 2656 || - || ps->ps || Create large posters out of PostScript pages. ||
|| {{{xpdf-utils}}} || 2210 || - || pdf->ps,text,...  || PDF utilities. (pdftops, pdfinfo, pdfimages, pdftotext, and pdffonts) ||
|| {{{enscript}}} || 2732(2007/12) || - || text->ps, html, rtf || Converts ASCII text to Postscript, HTML, RTF or Pretty-Print. ||
|| {{{a2ps}}} || 905 || - || text->ps || 'Anything to PostScript' converter and pretty-printer. ||
|| {{{pdftk}}} || 449 || - || pdf->pdf || PDF document conversion tool: ({{{pdftk}}}) ||
|| {{{mpage}}} || 350 || - || text,ps->ps || Print multiple pages per sheet. ||
|| {{{html2ps}}} || 317 || - || html->ps || The converter from HTML to PostScript. ||
|| {{{pdfjam}}} || 260 || - || pdf->pdf || PDF document conversion tools: {{{pdf90}}}, {{{pdfjoin}}}, and {{{pdfnup}}} ||
|| {{{gnuhtml2latex}}} || 191 || - || html->latex || The converter from html to latex. ||
|| {{{latex2rtf}}} || 131 || - || latex->rtf || This converts documents from LaTeX to RTF which can be read by MS Word. ||
|| {{{ps2eps}}} || 92 || - || ps->eps || The converter from PostScript to EPS (Encapsulated PostScript). ||
|| {{{e2ps}}} || 42 || - || text->ps || Text to PostScript converter with Japanese encoding support. ||
|| {{{impose+}}} || 35 || - || ps->ps || PostScript utilities. ||
|| {{{trueprint}}} || - || - || text->ps || This pretty-prints many source codes (C, C++, Java, Pascal, Perl, Pike, Sh, and Verilog) to PostScript. (C) ||
|| {{{pdf2svg}}} || - || - || pdf->svg || Converter from PDF to [http://en.wikipedia.org/wiki/Scalable_Vector_Graphics Scalable vector graphics] format. ||
|| {{{pdftoipe}}} || - || - || pdf->ipe || Converter from PDF to IPE's XML format. ||

## removed from archive
## || {{{rtf2latex}}} || 44 || - || rtf->latex || This converts documents from RTF which can be created by MS Word to LaTeX. ||

=== Printing with CUPS ===

Both the {{{lp}}} and {{{lpr}}} commands offered by [http://en.wikipedia.org/wiki/Common_Unix_Printing_System Common Unix Printing System (CUPS)] provide options for customized printing of the printable data.

For printing 3 copies of a file collated:
{{{
$ lp -n 3 -o Collate=True filename
}}}
or
{{{
$ lpr -#3 -o Collate=True filename
}}}

You can further customize printer operation by using printer options such as "{{{-o number-up=2}}}", "{{{-o page-set=even}}}", "{{{-o page-set=odd}}}", "{{{-o scaling=200}}}", "{{{-o natural-scaling=200}}}", etc., documented at [http://localhost:631/help/options.html Command-Line Printing and Options].

== Type setting ==

The Unix [http://en.wikipedia.org/wiki/Troff troff] originally developed by AT&T can be used for simple type setting.  It is usually used to create manpages.

[http://en.wikipedia.org/wiki/TeX TeX] created by Donald Knuth is a very powerful type setting tool and is the de facto standard.  [http://en.wikipedia.org/wiki/LaTeX LaTeX] originally written by Leslie Lamport enables high-level access to the power of TeX.

|| List of type setting tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{texlive-base}}} || 1074 || - || (La)TeX || TeX system for typesetting, previewing and printing. ||
|| {{{groff}}} || 840 || - || troff || GNU troff text-formatting system. ||

=== roff typesetting ===

Traditionally, {{{roff}}} is the main Unix text processing system.

See {{{roff}}}(7), {{{groff}}}(7), {{{groff}}}(1), {{{grotty}}}(1), {{{troff}}}(1), {{{groff_mdoc}}}(7), {{{groff_man}}}(7), {{{groff_ms}}}(7), {{{groff_me}}}(7), {{{groff_mm}}}(7), and {{{info groff}}}.

A good tutorial on {{{-me}}} macros exists.  If you have {{{groff}}} (1.18 or newer), find {{{/usr/share/doc/groff/meintro.me.gz}}} and do the following:

{{{
$ zcat /usr/share/doc/groff/meintro.me.gz | \
     groff -Tascii -me - | less -R
}}}

The following will make a completely plain text file:

{{{
$ zcat /usr/share/doc/groff/meintro.me.gz | \
    GROFF_NO_SGR=1 groff -Tascii -me - | col -b -x > meintro.txt
}}}

For printing, use PostScript output.

{{{
$ zcat /usr/share/doc/groff/meintro.me.gz | groff -Tps -me - | lpr
$ zcat /usr/share/doc/groff/meintro.me.gz | groff -Tps -me - | mpage -2 | lpr
}}}

=== TeX/LaTeX ===

Preparation:
{{{
# aptitude install texlive
}}}

References for LaTeX:
 * The teTeX HOWTO: The Linux-teTeX Local Guide (http://www.tldp.org/HOWTO/TeTeX-HOWTO.html)
 * {{{tex}}}(1)
 * {{{latex}}}(1)
 * "The TeXbook", by Donald E. Knuth, (Addison-Wesley)
 * ''LaTeX - A Document Preparation System'', by Leslie Lamport, (Addison-Wesley)
 * ''The LaTeX Companion'', by Goossens, Mittelbach, Samarin, (Addison-Wesley)

This is the most powerful typesetting environment.  Many SGML processors use this as their back end text processor.  LyX provided by the {{{lyx}}}, {{{lyx-xforms}}}, or {{{lyx-qt}}} package and GNU TeXmacs provided by the {{{texmacs}}} package offer a nice WYSIWYG editing environment for LaTeX, while many use Emacs and Vim as the source editor of choice.

There are many online resources available:
 * The TeX Live Guide - TeX Live 2007 (/usr/share/doc/texlive-doc-base/english/texlive-en/live.html) ({{{texlive-doc-base}}} package)
 * A Simple Guide to Latex/Lyx (http://www.stat.rice.edu/~helpdesk/howto/lyxguide.html)
 * Word Processing Using LaTeX (http://www-h.eng.cam.ac.uk/help/tpl/textprocessing/latex_basic/latex_basic.html)
 * Local User Guide to teTeX/LaTeX (http://supportweb.cs.bham.ac.uk/documentation/LaTeX/lguide/local-guide/local-guide.html)

## * A Quick Introduction to LaTeX (http://www.msu.edu/user/pfaffben/writings/)

## The following needs to be checked.

When documents become bigger, TeX may sometimes fail with errors.  To fix this, increase the pool size in {{{/etc/texmf/texmf.cnf}}} (or, more appropriately, edit {{{/etc/texmf/texmf.d/95NonPath}}} and run {{{update-texmf}}}).

(!) The TeX source of "The TeXbook" is available at http://tug.ctan.org/tex-archive/systems/knuth/dist/tex/texbook.tex .

This file contains most of the required macros.  I heard that you can process this document with {{{tex}}} after commenting out lines 7 to 10 and adding "{{{\input manmac \proofmodefalse}}}".  It is strongly recommended to buy this book (and all other books from Donald E. Knuth) instead of using the online version, but the source is a great example of TeX input!

=== Pretty print a manual page ===

The following will print a manual page into a PostScript file/printer.

{{{
$ man -Tps some_manpage | lpr
$ man -Tps some_manpage | mpage -2 | lpr
}}}

=== Creating a manual page ===

Although writing a manpage in plain troff is possible, there are a few helper packages for creating one.

|| List of packages to help create manpages. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{docbook-to-man}}} || 436 || - || SGML->manpage || The converter from DocBook SGML into roff man macros. ||
|| {{{help2man}}} || 104 || - || text->manpage || Automatic manpage generator from --help. ||
|| {{{info2man}}} || 41 || - || info->manpage || The converter from GNU info to POD or man pages. ||
|| {{{txt2man}}} || 35 || - || text->manpage || Converts flat ASCII text to man page format. ||

== The mail data conversion ==

The following packages for mail data conversion caught my eye:

|| List of packages to help mail data conversion. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{sharutils}}} || 5059 || - || mail || {{{shar}}}, {{{unshar}}}, {{{uuencode}}}, {{{uudecode}}} ||
|| {{{mpack}}} || 4177 || - || mail || The encoder and decoder of MIME messages: {{{mpack}}} and {{{munpack}}}. ||
|| {{{tnef}}} || 277 || - || mail || Unpacks MIME attachments of type "application/ms-tnef", which is a Microsoft-only format. ||
|| {{{uudeview}}} || 246 || - || mail || The encoder and decoder for the following formats: uuencode, xxencode, BASE64, quoted printable, and BinHex ||
|| {{{mimedecode}}} || 146 || - || mail || This decodes transfer encoded text type mime messages. ||
|| {{{readpst}}} || 33 || - || windows/mail || This converts Outlook PST files to mbox format. ||

{i} An [http://en.wikipedia.org/wiki/Internet_Message_Access_Protocol Internet Message Access Protocol] version 4 (IMAP4) server (see: @{@popdimapeserver@}@) may be used to move mail out of a proprietary mail system if its mail client software can also be configured to use an IMAP4 server.

=== Mail data basics ===

Mail (SMTP) data should be limited to 7 bit.  So binary data and 8 bit text data are encoded into 7 bit format with the [http://en.wikipedia.org/wiki/MIME Multipurpose Internet Mail Extensions (MIME)] and the selection of the charset (see: @{@basicsofencoding@}@).

The standard mail storage format is mbox formatted according to [http://tools.ietf.org/html/rfc2822 RFC2822 (updated RFC822)].  See {{{man 5 mbox}}} (provided by the {{{mutt}}} package).

For European languages, "{{{Content-Transfer-Encoding: quoted-printable}}}" with the ISO-8859-1 charset is usually used since there are not many 8 bit characters.  If the text is in UTF-8, "{{{Content-Transfer-Encoding: quoted-printable}}}" is also used since it is mostly 7 bit data.

For Japanese, traditionally "{{{Content-Type: text/plain; charset=ISO-2022-JP}}}" should be used to keep text in 7 bits.  But mail from older Microsoft systems may use Shift-JIS without proper declaration.  If Japanese text is in UTF-8, it contains many 8 bit bytes and is encoded into 7 bit data by [http://en.wikipedia.org/wiki/Base64 Base64].  The situation for other Asian languages is similar.
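
As a minimal sketch of the Base64 case, the following uses the {{{base64}}} command (assumed here to be the one from {{{coreutils}}}, which is not among the packages listed above) on a short Japanese word in UTF-8.  The helper name {{{to_base64}}} is made up for this illustration, and the script itself must be saved in UTF-8.  Mail clients normally do this encoding for you; this only shows what happens to the bytes.

```shell
# Hypothetical helper: encode standard input as Base64 (7 bit safe text).
to_base64() {
    base64
}

# The word below is 9 bytes in UTF-8, and every byte has the 8th bit set.
encoded=$(printf '%s' '日本語' | to_base64)
echo "$encoded"                       # prints 5pel5pys6Kqe

# Decoding restores the original 8 bit byte sequence.
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```

The encoded form uses only ASCII letters, digits, "+", "/", and "=", so it survives a 7 bit mail transport.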

(!) If your non-Unix mail data is accessible by non-Debian client software which can talk to an IMAP4 server, you may be able to move it out by running your own IMAP4 server (see: @{@popdimapeserver@}@).

(!) If you use other mail storage formats, moving them to mbox format is a good first step.  A versatile client program such as {{{mutt}}} may be handy for this.

You can split mailbox contents into individual messages using {{{procmail}}}(1) and {{{formail}}}(1).

Each mail message can be unpacked using the {{{munpack}}}(1) command from the {{{mpack}}} package (or other specialized tools) to obtain the MIME-encoded contents.
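
The two steps above can be sketched roughly as follows.  This is only an illustration: the function names {{{split_mbox}}} and {{{unpack_messages}}}, the {{{msg.}}} file name prefix, and the paths in the usage comment are all made up here.  It relies on {{{formail -s}}} running a command once per message and on {{{formail}}} exporting the message number in the {{{FILENO}}} variable, as described in {{{formail}}}(1).

```shell
# Hypothetical helper: split the mbox file $1 into one file per message
# under the directory $2 (expected names: msg.0001, msg.0002, ...).
split_mbox() {
    mkdir -p "$2" || return 1
    # formail -s runs the given command once per message.  formail keeps
    # the message number in FILENO; presetting it sets the zero padding.
    ( cd "$2" && FILENO=0001 formail -s sh -c 'cat > "msg.$FILENO"' ) < "$1"
}

# Hypothetical helper: unpack the MIME parts of every message found in
# the directory $1; munpack writes the parts into the current directory.
unpack_messages() {
    ( cd "$1" && for m in msg.*; do munpack "$m"; done )
}

# Usage (the mbox path and directory name are placeholders):
#   split_mbox ~/mbox ./messages
#   unpack_messages ./messages
```

Try this on a copy of the mailbox first, since the unpacked attachments land next to the split messages.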

== Graphic data tools ==

The following packages for graphic data conversion, editing, and organization caught my eye:

|| List of graphic data tools. || 1 || 2 || 3 || ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{gimp}}} || 8507 || - || image(bitmap) || The GNU Image Manipulation Program. ||
|| {{{imagemagick}}} || 5479 || - || image(bitmap) || Image manipulation programs. ||
|| {{{graphicsmagick}}} || 244 || - || image(bitmap) || Image manipulation programs. (fork of {{{imagemagick}}}) ||
|| {{{xsane}}} || 4757 || - || image(bitmap) || GTK+-based X11 frontend for SANE (Scanner Access Now Easy). ||
|| {{{netpbm}}} || 2446 || - || image(bitmap) || Graphics conversion tools. ||
|| {{{icoutils}}} || - || - || png<->ico(bitmap) || Converts [http://en.wikipedia.org/wiki/ICO_(icon_image_file_format) MS Windows icons and cursors to and from PNG formats] ([http://en.wikipedia.org/wiki/Favicon favicon.ico]) ||
|| {{{xpm2wico}}} || - || - || xpm->ico(bitmap) || Converts XPM to [http://en.wikipedia.org/wiki/ICO_(icon_image_file_format) MS Windows icon formats] ||
|| {{{scribus}}} || - || - || ps/pdf/SVG/... || The [http://en.wikipedia.org/wiki/Scribus Scribus] DTP editor. ||
|| {{{openoffice.org-draw}}} || - || - || image(vector) || OpenOffice.org office suite - drawing ||
|| {{{inkscape}}} || 1747 || - || image(vector) || The [http://en.wikipedia.org/wiki/Scalable_Vector_Graphics  SVG (Scalable Vector Graphics)] editor. ||
|| {{{dia-gnome}}} || 890 || - || image(vector) || Diagram editor (Gnome) ||
|| {{{dia}}} || 732 || - || image(vector) || Diagram editor (Gtk) ||
|| {{{xfig}}} || - || - || image(vector) || Facility for Interactive Generation of figures under X11 ||
|| {{{pstoedit}}} || 652 || - || ps/pdf->image(vector) || !PostScript and PDF files to editable vector graphics converter. (SVG) ||
|| {{{libwmf-bin}}} || 570 || - || Windows/image(vector) || Windows metafile (vector graphic data) conversion tools. ||
|| {{{fig2sxd}}} || - || - || fig->sxd(vector) || Convert XFig files to OpenOffice.org Draw format ||
|| {{{unpaper}}} || 88 || - || image->image || Post-processing tool for scanned pages for [http://en.wikipedia.org/wiki/Optical_character_recognition OCR]. ||
|| {{{tesseract-ocr}}} || 73 || - || image->text || Free [http://en.wikipedia.org/wiki/Optical_character_recognition OCR] software based on HP's commercial OCR engine. ||
|| {{{tesseract-ocr-eng}}} || - || - || image->text || OCR engine data: tesseract-ocr language files for English text. ||
|| {{{clara}}} || 83 || - || image->text || Free OCR software. ||
|| {{{gocr}}} || 871 || - || image->text || Free OCR software. ||
|| {{{ocrad}}} || 501 || - || image->text || Free OCR software. ||
|| {{{gtkam}}} || - || - || image(Exif) || Manipulates digital camera photo files (GNOME) - GUI ||
|| {{{gphoto2}}} || - || - || image(Exif) || Manipulates digital camera photo files (GNOME) - command line ||
|| {{{kamera}}} || - || - || image(Exif) || Manipulates digital camera photo files (KDE) ||
|| {{{jhead}}} || - || - || image(Exif) || Manipulates the non-image part of Exif compliant JPEG (digital camera photo) files ||
|| {{{exif}}} || - || - || image(Exif) || Command-line utility to show EXIF information in JPEG files ||
|| {{{exiftags}}} || - || - || image(Exif) || Utility to read Exif tags from a digital camera JPEG file ||
|| {{{exiftran}}} || - || - || image(Exif) || Transforms digital camera jpeg images ||
|| {{{exifprobe}}} || - || - || image(Exif) || Reads metadata from digital pictures ||
|| {{{dcraw}}} || - || - || image(Raw)->ppm || Decodes raw digital camera images ||
|| {{{findimagedupes}}} || - || - || image->fingerprint ||  Finds visually similar or duplicate images ||
|| {{{ale}}} || - || - || image->image || Merges images to increase fidelity or create mosaics ||
|| {{{imageindex}}} || - || - || image(Exif)->html || Generates static HTML galleries from images ||
|| {{{f-spot}}} || - || - || image(Exif) || Personal photo management application (GNOME) ||
|| {{{bins}}} || - || - || image(Exif)->html || Generates static HTML photo albums using XML and EXIF tags ||
|| {{{galrey}}} || - || - || image(Exif)->html || Generates browsable HTML photo albums with thumbnails ||
|| {{{outguess}}} || - || - || jpeg,png || Universal Steganographic tool ||
|| {{{qcad}}} || - || - || DXF || CAD data editor (KDE) ||
|| {{{blender}}} || - || - || blend, TIFF, VRML, ... || 3D content editor for animation etc. ||
|| {{{open-font-design-toolkit}}} || - || - || ttf, ps, ... || Metapackage for open font design ||
|| {{{fontforge}}} || - || - || ttf, ps, ... || Font editor for PS, TrueType and OpenType fonts ||
|| {{{xgridfit}}} || - || - || ttf || a program for gridfitting, or "hinting," TrueType fonts ||
|| {{{gbdfed}}} || - || - || bdf || Editor for BDF fonts ||

## || {{{gocr-gtk}}} || 41 || - || image->text || Free OCR software. GTK-GUI. ||
## || {{{stegdetect}}} || - || - || jpeg || Detects and extracts [http://en.wikipedia.org/wiki/Steganography steganography] messages inside JPEG ||

{i} Search more image tools using regex "{{{~Gworks-with::image}}}" in {{{aptitude}}} (see @{@searchmethodoptionswithaptitude@}@).

Although GUI programs such as {{{gimp}}} are very powerful, command line tools such as {{{imagemagick}}} are quite useful for automating image manipulation with scripts.
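
As a small sketch of such automation, the following shell function makes thumbnails with the {{{convert}}} command from the {{{imagemagick}}} package.  The function name {{{make_thumbnails}}}, the 160x120 bounding box, and the directory names in the usage comment are arbitrary choices for this illustration.

```shell
# Hypothetical helper: write a thumbnail of every JPEG file in directory $1
# into directory $2, scaling each image to fit within 160x120 pixels.
make_thumbnails() {
    mkdir -p "$2" || return 1
    for f in "$1"/*.jpg; do
        [ -e "$f" ] || continue       # no match: the pattern stays literal
        convert "$f" -resize 160x120 "$2/$(basename "$f")"
    done
}

# Usage (directory names are placeholders):
#   make_thumbnails ./photos ./photos/thumbs
```

The {{{-resize}}} option keeps the aspect ratio by default, so images are scaled to fit inside the given box rather than stretched to it.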

The de facto image file format of the digital camera is the [http://en.wikipedia.org/wiki/Exchangeable_image_file_format Exchangeable Image File Format] (EXIF) which is the [http://en.wikipedia.org/wiki/JPEG JPEG] image file format with additional metadata tags.  It can hold information such as date, time, and camera settings.

[http://en.wikipedia.org/wiki/Lempel-Ziv-Welch The Lempel-Ziv-Welch (LZW) lossless data compression] patent has expired.  [http://en.wikipedia.org/wiki/Graphics_Interchange_Format The Graphics Interchange Format (GIF)] utilities, which use the LZW compression method, are now freely available on the Debian system.

{i} Any digital camera or scanner with removable recording media will work with Linux through [http://en.wikipedia.org/wiki/USB_mass_storage_device_class USB Mass Storage] readers.

== Miscellaneous data conversion ==

There are many other programs for converting data.  The following packages caught my eye when searching with the regex "{{{~Guse::converting}}}" in {{{aptitude}}} (see @{@searchmethodoptionswithaptitude@}@):

|| List of miscellaneous data conversion tools. || 1 || 2 || 3 ||
|| '''package''' || '''popcon''' || '''size''' || '''keyword''' || '''function''' ||
|| {{{alien}}} || 1775 || - || rpm/tgz->deb || The converter for the foreign package into the Debian package. ||
|| {{{freepwing}}} || 6 || - || EB->EPWING || The converter from "Electric Book" (popular in Japan) to a single JIS X 4081 format (a subset of the EPWING V1). ||

You can also extract data from RPM format with:
{{{
$ rpm2cpio file.src.rpm | cpio --extract
}}}