Syntax error handling

[This file explains how netrik handles syntax errors in HTML pages, why it does so, and what to do about that. See index.txt or index.html for an overview of available netrik documentation.]


Why does netrik always complain about HTML syntax errors?


No matter what site I load, I (almost) always get error messages. That can't be OK?!

Well, it *is* OK... The problem is that almost all sites actually *are* more or less broken.

Other browsers simply ignore the errors, hoping that the result will more or less resemble what the author intended -- and usually nobody ever notices, except maybe for some small blemish that no one cares about. (If Netscape hadn't been so tolerant of syntax errors in the first place, we wouldn't have that problem today :-( ) However, the browser can only "guess" what the author intended, and that doesn't always work out; such errors can cause the page to be laid out completely wrong, including missing parts of the text etc. That's why netrik warns about them, so the user at least knows what the matter is, and also gets the chance to tell the page author about the problem. (See "Reporting a problem" below.)

Note that XHTML even *requires* a browser to abort on syntax errors -- sadly, XHTML is not very popular. (Yet?...)

Of course, it might also happen that netrik sees an error where there is none -- but this is rather rare, and would indicate a bug in netrik. If you encounter such a situation, don't hesitate to send a bug report.

Can't I turn that off?

Well, actually, you can: There are the two options "--broken-html" and "--ignore-broken". "--ignore-broken" will prevent netrik from complaining about *any* syntax errors, while "--broken-html" only turns off warnings about common errors which in most cases can be guessed correctly and nicely worked around. (But not always!)

Presently, "--broken-html" is the default, so less critical errors won't cause netrik to stop and wait for a keypress. (The error messages are still displayed, but they just scroll through in this mode.) This is because the current interface makes the warnings quite obtrusive; we will change the default again once a better UI allows displaying the messages in a less disturbing manner.

However, please avoid using these options if possible. While "--broken-html" probably can't be helped for now, it is generally a bad idea to use "--ignore-broken".

Still, please take the (little) trouble to tell the page author about the problem.

Reporting a problem

Normally you will find an e-mail address of the right contact at the bottom of the page, or on some "contact" page. Most authors will be grateful for the information, and will gladly fix the problem -- once and for all.

You can improve your chances by telling the author exactly what the problem is. To find out, you can load the page again, but using the "--debug" option. While parsing the page, netrik will dump the source in this mode. When an error is encountered, the last character printed before the error message is the offending one. Note however that in some cases the real cause of the problem may be somewhere earlier in the page.

If the output is too big, you may either use the "--dump" option and pipe the output through some pager (but note that the debug output is written to stderr, so you'll need to redirect it: "netrik --debug --dump <url> 2>&1 |less -R" or something similar), or you can use the "--fussy-html" option, causing netrik to abort after the first error is detected.

Alternatively, you can use the W3C's (World Wide Web Consortium) HTML validating service. This will create a report with all problems nicely listed -- very helpful for the page author.

A note on comments

Netrik implements full SGML comment parsing. The problem is that the comment syntax in SGML is fairly complicated; most authors do not know it, and thus do not know that putting a "--" string inside a comment normally generates errors. In most cases these are quite obvious; netrik will print an error message, and use a workaround which will work in most cases, but not always -- just like for other typical errors.

However, there are some constructs which *are* valid SGML -- but do not do what the author probably intended. "<!------>" is a typical example: It is valid, but the comment doesn't terminate; everything after it will also be treated as part of the comment! As it is not an error, netrik can't print an error message or apply a workaround; instead, just a warning is printed.
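
To make the rule more tangible, here is a small Python sketch of the SGML comment syntax as described above: a declaration starts with "<!", may contain any number of comments that each open and close with "--", allows only whitespace between them, and ends with ">". This is only an illustration and not netrik's actual parser; the function name "classify_comment" and its return values are made up for this example, and it handles comment declarations only (not things like DOCTYPE).

```python
def classify_comment(text):
    """Classify a comment declaration that starts with '<!'.

    Returns a (status, position) pair:
      ("valid", end)          declaration closed; end is the index after '>'
      ("error", offset)       stray character between comments at offset
      ("unterminated", None)  input ended before the closing '>'

    Hypothetical helper for illustration; not taken from netrik.
    """
    if not text.startswith("<!"):
        raise ValueError("expected a declaration starting with '<!'")
    pos = 2
    while True:
        # Between comments, only whitespace is allowed.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return ("unterminated", None)
        if text[pos] == ">":
            return ("valid", pos + 1)
        if text.startswith("--", pos):
            # "--" opens a comment, which runs until the next "--".
            close = text.find("--", pos + 2)
            if close == -1:
                return ("unterminated", None)
            pos = close + 2
            continue
        # Anything else outside a comment is a syntax error.
        return ("error", pos)


print(classify_comment("<!-- hello -->"))   # ('valid', 14)
print(classify_comment("<!-- a -- b -->"))  # ('error', 10)
print(classify_comment("<!------>rest"))    # ('unterminated', None)
```

In the second call, the inner "--" closes the comment, so the "b" ends up outside any comment and is flagged. In the third call, the first four dashes form a complete empty comment, the last two open a new one, and the ">" is swallowed as comment text; no rule is violated, the comment simply never ends.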

Often such an unterminated comment will generate other errors later, because it interferes with other comments or because it stretches to the end of the file; in such a case, the warning may help in finding the problem. In other cases however, no error is generated at all -- the warning is the only cue that something went wrong.

Warnings in general

Other similar situations are conceivable. That's why netrik also prints warnings about things that are, strictly speaking, valid HTML, but explicitly discouraged in the HTML standard -- while some browsers will handle them correctly, others might not; and sometimes even strictly correct behaviour produces results different from what the author intended and sees in popular browsers.

Normally (in --valid-html or --broken-html mode), warnings only scroll through; netrik does not pause with an additional message after parsing is finished. To see all warnings, use --clean-html.