.TH "HXUNENT" "1" "10 Jul 2011" "7.x" "HTML-XML-utils"

.de d \" begin display
.sp
.in +4
.nf
.ft CR
.CDS
..
.de e \" end display
.CDE
.in -4
.fi
.ft R
.sp
..

.SH NAME
hxunent \- replace HTML predefined character entities by UTF-8
.SH SYNOPSIS
.B hxunent
.RB "[\| " \-b " \|]"
.RB "[\| " \-f " \|]"
.RI "[\| " file " \|]"
.SH DESCRIPTION
.LP
The
.B hxunent
command reads the
.I file
(or standard input) and copies it to standard output with &-entities
replaced by their equivalent character (encoded as UTF-8). E.g., &quot; is
replaced by " and &lt; is replaced by <.
.SH OPTIONS
The following options are supported:
.TP 10
.B \-b
The five built-in entities of XML (&lt; &gt; &quot; &apos; &amp;) are not
replaced but copied unchanged. This is necessary if the output has to
be valid XML or SGML.
.TP
.B \-f
This option changes how unknown entities and lone ampersands are handled.
Normally they are copied unchanged, but with this option the program
tries to "fix" them by replacing the ampersand by &amp;.
Such stray ampersands are often the result of copying and pasting URLs
into a document, in which case this option fixes them and makes
the document valid.
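.SH EXAMPLES
.LP
The invocations below are an illustrative sketch derived from the
option descriptions above, not output captured from the program. The
file names
.I in.html
and
.I out.html
are hypothetical.
.LP
Replace all HTML entities in a file by their UTF-8 characters:
.d
hxunent in.html > out.html
.e
Do the same, but keep the five built-in XML entities so that the
result can still be parsed as XML:
.d
hxunent \-b in.html > out.html
.e
Read from standard input and, in addition, escape the stray ampersand
of a pasted URL as &amp;:
.d
echo '<a href="?a=1&b=2">link</a>' | hxunent \-b \-f
.e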
.SH "DIAGNOSTICS"
The program's exit value is 0 if all went well, otherwise:
.TP 10
.B 1
The input couldn't be read (file not found, file not readable, etc.).
.TP
.B 2
Wrong command line arguments.
.SH "SEE ALSO"
.BR asc2xml (1),
.BR xml2asc (1),
.BR UTF-8 " (RFC 3629)"
.SH BUGS
.LP
The program assumes entities are as defined by HTML. It doesn't read a
document's DTD to find the actual definitions in use in a document. 
With
.BR \-f ,
it will even remove all entities that are not HTML entities.