.de d \" begin display
.sp
.in +4
.nf
..
.de e \" end display
.in -4
.fi
.sp
..
.TH "HXUNENT" "1" "10 Jul 2011" "6.x" "HTML-XML-utils"
.SH NAME
hxunent \- replace HTML predefined character entities by UTF-8
.SH SYNOPSIS
.B hxunent
.RB "[\| " \-b " \|]"
.RB "[\| " \-f " \|]"
.RI "[\| " file " \|]"
.SH DESCRIPTION
.LP
The
.B hxunent
command reads the
.I file
(or standard input) and copies it to standard output with &-entities
replaced by their equivalent character (encoded as UTF-8). E.g., &quot; is
replaced by " and &lt; is replaced by <.
.SH OPTIONS
The following options are supported:
.TP 10
.B \-b
The five built-in entities of XML (&lt; &gt; &quot; &apos; &amp;) are not
replaced but copied unchanged. This is necessary if the output has to
be valid XML or SGML.
.TP
.B \-f
This option changes how unknown entities and lone ampersands are handled.
Normally they are copied unchanged, but with this option the program
tries to "fix" them by replacing each such ampersand by &amp;. Stray
ampersands are often the result of pasting URLs into a document; in that
case this option fixes them and makes the document valid.
.SH "DIAGNOSTICS"
The program's exit value is 0 if all went well, otherwise:
.TP 10
.B 1
The input could not be read (file not found, file not readable, etc.).
.TP
.B 2
Wrong command line arguments.
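.SH EXAMPLES
.LP
The invocations below are an illustrative sketch derived from the
descriptions above, not output captured from a particular version. The
input strings and the file names page.html, page-utf8.html and
missing.html are hypothetical.
.d
# decode all known entities; per DESCRIPTION, &lt; becomes <
echo '1 &lt; 2' | hxunent

# keep the five XML built-ins escaped, decode the rest (e.g. &eacute;)
echo 'caf&eacute; &amp; bar &lt;1&gt;' | hxunent \-b

# escape a stray ampersand that came in with a pasted URL
echo 'see http://example.org/?a=1&b=2' | hxunent \-f

# read from a file instead of standard input
hxunent \-b page.html > page-utf8.html

# per DIAGNOSTICS, an unreadable input should give exit value 1
hxunent missing.html; echo $?
.e
.LP
Using
.B \-b
is the safer choice when the result is to be parsed again as XML,
since the markup-significant characters stay escaped.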
.SH "SEE ALSO"
.BR asc2xml (1),
.BR xml2asc (1),
.BR UTF-8 " (RFC 2279)"
.SH BUGS
.LP
The program assumes entities are as defined by HTML. It doesn't read a
document's DTD to find the actual definitions in use in a document. 
With
.BR \-f ,
it will even remove all entities that are not HTML entities.