Syntax error handling
[This file explains how netrik handles syntax errors in HTML pages, why
it does so, and what to do about that. See index.txt or index.html for an
overview of available netrik documentation.]
Why does netrik always complain about HTML syntax errors??
No matter what site I load, (almost) always I get error messages. That can't
Well, it *is* OK... The problem is: Actually almost all sites *are* more or
Other browsers simply ignore the errors, hoping that the effect will resemble
more or less what the author intended -- and nobody ever knows, but for
some little blemish maybe, where no one cares about. (If Netscape wasn't
so tolerant about syntax errors in the first place, we wouldn't have that
problem today :-( ) However, the browser can only "guess" what the author
intended, and that doesn't always work out; such errors can cause the page
to be layouted completely wrong, including missing text parts etc. That's
why netrik warns about them, so the user at least knows what the matter is,
and also gets the chance to tell the page author about the problem. (s.b.)
Note that XHTML even *requires* a browser to abort on syntax errors -- sadly,
XHTML is not very popular. (Yet?...)
Of course, it might also happen that netrik sees an error where there is none
-- but this is rather rare, and would indicate a bug in netrik. Of course,
if you encounter such a situation, don't hesitate to send a bug report. (
Can't I turn that off?
Well, actually, you can: There are the two options "--broken-html" and
"--ignore-broken". "--ignore-broken" will prevent netrik complaining about
*any* syntax errors, while "--broken-html" only turns off warnings about common
errors which in most cases can be guessed correctly and nicely worked around.
(But not always!)
Presently, "--broken-html" is the default, so less critical erros won't cause
netrik to stop and wait for keypress. (The error messages are still displayed,
but they just scroll through in this mode.) This is because the current
interface makes the warnings quite obtrusive; we will change the default again
when a better UI will allow displaying the messages in a less disturbing manner.
However, please avoid usage of these options if possible. While "--broken-html"
probably can't be helped for now, it seems generally a bad idea to use
Still, please take the (little) trouble to tell the page author about the
Reporting a problem
Normally you will find an e-mail address of the right contact at the bottom
of the page, or on some "contact" page. Most authors will be greatful for
the information, and will gladly fix the problem -- once and for all.
You can improve your chances by telling exactly what the problem is. To find
out, you can load the page again, but using the "--debug" option. While
parsing the page, netrik will dump the source in this mode. When an error
is encountered, the last charactacter printed before the error message is
the offending one. Note however that in some cases the real reason for the
problem may be somewhere before.
If the output is too big, you may either use the --dump option and pipe the
output through some pager (but note that the debug output is written to stderr,
so you'll need to redirect it: "netrik --debug --dump <url> 2>&1 |less -R"
or something the like), or you can use the "--fussy-html" option, causing
netrik to abort after the first error is detected.
Alternatively, you can use the W3Cs (WWW Consortium) HTML validating service.
( http://validator.w3.org ) This will create a report with all problems
nicely listed -- very helpful for the page author.
A note on comments
Netrik implements full SGML comment parsing. The problem is that the comment
syntax in SGML is fairly complicated; most authors do not know it, and thus do
not know that putting a "--" string inside a comment normally generates errors.
In most cases these are quite obvious; netrik will print an error message,
and use a workaround which will work in most cases, but not always -- just
like for other typical errors.
However, there are some constructs which *are* valid SGML -- but do not make
what the author probably intended. "<!------>" is a typical example: It is
valid, but the comment doesn't terminate; everything behind it will also be
treated as part of the comment! As it is not an error, netrik can't print an
error message or apply a workaround; instead, just a warning is printed.
Often such an unterminated comment will generate other errors later, because
it interferes with other comments or because it stretches to the end of the
file; in such a case, the warning may help finding the problem. In other
cases however, no error is generated at all -- the warning is the only cue
that something went wrong.
Warnings in general
Other similar situations are thinkable. That's why netrik also prints warnings
on things that are strictly speaking valid hatml, but explicitly discouraged
in the HTML standard -- while some browsers will handle them correctly, others
might do otherwise; or sometimes a really correct behaviour even produces
results different from what the author intended and sees in popular browsers.
Warnings will normally (in --valid-html or --broken-html mode) only scroll
through, but without pausing with an additional message after parsing is
finished. To see all warnings, use --clean-html .