TAGSOUP(1)                       User Commands                      TAGSOUP(1)

This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.

TagSoup is licensed under the Apache License, Version 2.0.  You may
obtain a copy of this license at
http://www.apache.org/licenses/LICENSE-2.0 .  You may also have
additional legal rights not granted by this license.

TagSoup is distributed in the hope that it will be useful, but unless
required by applicable law or agreed to in writing, TagSoup is
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
ANY KIND, either express or implied; not even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

NAME
       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS
       java -jar tagsoup-1.2.jar [ options ] [ files ]

DESCRIPTION
       Rectify  arbitrary  HTML into clean XHTML, using a tailored description
       of HTML.  The output will be well-formed XML, but not necessarily valid
       XHTML.


       --files
              multiple input files should be processed into corresponding out‐
              put files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of  the  output  (if  the  encoding  name
              begins  with "utf", the output will not contain character enti‐
              ties; otherwise, all non-ASCII  characters  are  represented  as
              entities)

       --html output rectified HTML rather than XML, omitting the XML declara‐
              tion and any namespace declarations

       --method=html
              output rectified HTML rather than XML (end-tags are omitted  for
              empty  elements, and no character escaping is done in script and
              style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and  any  DOCTYPE
              declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change  explicit colons in element and attribute names to under‐
              scores

       --norestart
              don’t restart any restartable elements

       --ignorable
              pass through ignorable whitespace  (whitespace  in  element-only
              content) via the SAX ContentHandler method ignorableWhitespace

       --any  treat   unknown   non-HTML  elements  as  allowing  any  content
              (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don’t allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be  output  with  specified  system
              identifier

       --doctype-public=public-id
              force  DOCTYPE  declaration  to  be output with specified public
              identifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify version pseudo-attribute in output XML declaration (does
              not affect actual version of XML output)

       --nocdata
              treat  the  CDATA-content  elements script and style as ordinary
              elements (mostly for testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number
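
       The --output-encoding rule listed above can be illustrated with a short
       Java sketch.  It is not TagSoup's own code: the class and method  names
       are  hypothetical,  the  case-insensitive match on "utf" and the use of
       decimal numeric character references are assumptions, and  escaping  of
       markup characters such as & and < is left out.

```java
/** Sketch of the --output-encoding escaping rule; all names here are hypothetical. */
final class OutputEscaping {

    /** True when the encoding name begins with "utf" (ignoring case is an assumption). */
    static boolean isUnicodeEncoding(String encodingName) {
        // regionMatches returns false if the name is shorter than three characters.
        return encodingName.regionMatches(true, 0, "utf", 0, 3);
    }

    /**
     * Leaves text untouched for "utf" encodings; otherwise replaces every
     * non-ASCII code point with a decimal character reference (an assumption
     * about the entity form). Markup escaping is deliberately not handled.
     */
    static String escapeForEncoding(String text, String encodingName) {
        if (isUnicodeEncoding(encodingName)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeForEncoding("caf\u00e9", "ISO-8859-1"));
        System.out.println(escapeForEncoding("caf\u00e9", "UTF-8"));
    }
}
```

       Walking the text by code point rather than by char keeps characters
       outside the Basic Multilingual Plane as a single reference.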

       TagSoup is a parser and reformatter for nasty, ugly HTML.   Its  normal
       processing  mode  is  to accept HTML files on the command line, or from
       the standard input if none are given, and output them as clean  XML  to
       the  standard output.  The encoding is assumed to be the platform-local
       encoding on input, and is always UTF-8 on output.

       When the --files option is given, each input file is processed into  an
       output  file  of  the corresponding name, with the extension changed to
       xhtml.  If the extension is already xhtml, it is changed to xhtml_.
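
       The naming rule for --files can be sketched in Java as follows.  This
       is an illustration under stated assumptions, not TagSoup's own code:
       the class and method names are hypothetical, and the handling of names
       with no extension (appending .xhtml) is a guess that the text above
       does not confirm.

```java
/** Sketch of the --files output naming rule; class and method names are hypothetical. */
final class OutputNames {

    /**
     * Returns the output name for an input name: the extension becomes "xhtml",
     * or "xhtml_" when the input extension is already "xhtml".
     */
    static String outputName(String inputName) {
        // Only a dot in the last path component counts as an extension separator.
        int slash = Math.max(inputName.lastIndexOf('/'), inputName.lastIndexOf('\\'));
        int dot = inputName.lastIndexOf('.');
        if (dot <= slash) {
            // Assumption: a name without an extension simply gains ".xhtml".
            return inputName + ".xhtml";
        }
        String base = inputName.substring(0, dot);
        String ext = inputName.substring(dot + 1);
        return ext.equals("xhtml") ? base + ".xhtml_" : base + ".xhtml";
    }

    public static void main(String[] args) {
        String[] names = {"index.html", "notes.xhtml", "docs/page.htm"};
        for (String name : names) {
            System.out.println(name + " -> " + outputName(name));
        }
    }
}
```

       The xhtml_ case keeps an input file that already ends in xhtml from
       being overwritten by its own output.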

       TagSoup will repair, by whatever means  necessary,  violations  of  XML
       well-formedness.   In  particular,  it  will fix up malformed attribute
       names and supply missing attribute-value quotation marks.  More signif‐
       icantly, it supplies end-tags where HTML allows them to be omitted, and
       sometimes where it doesn’t.  It will even supply start-tags where  nec‐
       essary; for example, if a document begins with a <li> tag, TagSoup will
       automatically prefix it with <html><body><ul>.


BUGS
       TagSoup can be fooled by missing close quotes after  attribute  values,
       and  by  incorrect character encodings (it does not contain an encoding
       guesser).

       TagSoup doesn’t understand namespace declarations, which are not  prop‐
       erly  part  of  HTML.  Instead, any element or attribute name beginning
       foo: will be put into the artificial namespace urn:x-prefix:foo.

       For the same reasons,  namespace-qualified  attributes  like  xml:space
       can’t  be  returned  as default values, though an explicit attribute in
       the xml namespace will be returned with the proper namespace URI.
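
       The prefix handling described in the two paragraphs above can be
       sketched in Java.  The class and method names are hypothetical, and
       returning an empty string for unprefixed names is a simplification of
       this sketch rather than documented TagSoup behaviour.

```java
/** Sketch of the artificial-namespace rule; class and method names are hypothetical. */
final class PrefixNamespaces {

    /** The namespace URI that the XML Namespaces recommendation binds to the xml prefix. */
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    /**
     * Maps a qualified name such as "foo:bar" to a namespace URI:
     * the xml prefix gets its proper URI, any other prefix gets
     * urn:x-prefix: followed by the prefix.
     */
    static String namespaceFor(String qName) {
        int colon = qName.indexOf(':');
        if (colon < 0) {
            // Simplification: unprefixed names are outside the scope of this sketch.
            return "";
        }
        String prefix = qName.substring(0, colon);
        if (prefix.equals("xml")) {
            return XML_NS;
        }
        return "urn:x-prefix:" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(namespaceFor("foo:bar"));
        System.out.println(namespaceFor("xml:space"));
    }
}
```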

AUTHOR
       John Cowan <cowan@ccil.org>

COPYRIGHT
       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying conditions.  There
       is  NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICU‐
       LAR PURPOSE.



TagSoup 1.2                      January 2008                       TAGSOUP(1)