HTML Parser and Generator Implementations

This is a sort of "Family Tree" of HTML parser implementations, annotated with notes on features and bugs.

I'm working on updating the HTML parser in our reference code. See: A Lexical Analyzer for HTML and Basic SGML.

SGML.c in LibWWW
The first HTML parser ever released shipped in the library/linemode distribution back in '92 or so. It tolerated broken markup such as:
<xmp>... </foo> ... </xmp>
<a href=http://foo.bar/>...</a>
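
The two kinds of broken markup shown above (a stray end tag inside XMP, and an unquoted attribute value) can be tolerated with very little code. The following is a minimal sketch in Python, not the SGML.c algorithm; the names ATTR_RE, parse_attrs and split_xmp are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical helpers for illustration; this is not how SGML.c is written.

# An attribute value may be double-quoted, single-quoted, or a bare run of
# characters that ends at whitespace or '>'.
ATTR_RE = re.compile(
    r"""([A-Za-z][-.\w]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))"""
)


def parse_attrs(tag_body):
    """Parse the inside of a start tag (without '<' and '>') into a dict."""
    attrs = {}
    for m in ATTR_RE.finditer(tag_body):
        # Exactly one of the three value groups matched; pick that one.
        value = next(g for g in m.groups()[1:] if g is not None)
        attrs[m.group(1).lower()] = value
    return attrs


def split_xmp(text):
    """Split the text that follows an <xmp> start tag.

    Everything up to the first </xmp> is literal content, so other end
    tags such as </foo> are passed through as plain characters.
    Returns (literal_content, remaining_text).
    """
    m = re.search(r"</xmp\s*>", text, re.IGNORECASE)
    if m is None:
        return text, ""
    return text[:m.start()], text[m.end():]
```

With these definitions, parse_attrs('a href=http://foo.bar/') yields a dict mapping href to http://foo.bar/, and split_xmp leaves </foo> untouched inside the literal content.
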
htmllib.py used in grail
Based on regular expressions. (Guido wrote the first web spider, I believe.) This parser treats P, LI, DT, and DD as empty elements, so they act as separators rather than containers. Nifty formatter code.
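
To make the "empty elements" point concrete, here is a rough sketch of a regexp-based scanner that never opens a container for P, LI, DT, or DD. It is an assumption-laden illustration, not the htmllib.py code: TAG_RE, EMPTY_LIKE and events are hypothetical names, and comments, declarations and attributes are ignored.

```python
import re

# Hypothetical names for illustration; not taken from htmllib.py.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
EMPTY_LIKE = {"p", "li", "dt", "dd"}


def events(html):
    """Yield ('start' | 'end' | 'empty' | 'text', value) tuples."""
    pos = 0
    stack = []  # names of the container elements currently open
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()])
        pos = m.end()
        name = m.group(2).lower()
        closing = bool(m.group(1))
        if name in EMPTY_LIKE:
            # Treated as a separator: nothing is pushed on the stack,
            # and a stray </p> or </li> is silently dropped.
            if not closing:
                yield ("empty", name)
        elif closing:
            # Close up to and including the matching open element;
            # an end tag with no matching start tag is ignored.
            if name in stack:
                while stack:
                    top = stack.pop()
                    yield ("end", top)
                    if top == name:
                        break
        else:
            stack.append(name)
            yield ("start", name)
    if pos < len(html):
        yield ("text", html[pos:])
```

One consequence of this design is that the scanner never has to infer where an LI or P ends, which keeps it simple, at the cost of losing the nesting information for those elements.
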
SGML Lexical Analyzer

Tools that Write HTML

Creates documents whose attribute values are missing their quotes.
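
A generator can avoid this problem by quoting every attribute value unconditionally. The sketch below uses Python's standard html.escape; the function name start_tag is hypothetical and chosen only for this example.

```python
from html import escape


def start_tag(name, attrs):
    """Build a start tag in which every attribute value is double-quoted.

    escape(..., quote=True) replaces &, <, > and both quote characters,
    so the value cannot terminate the attribute early.
    """
    parts = [name]
    for key, value in attrs.items():
        parts.append('%s="%s"' % (key, escape(str(value), quote=True)))
    return "<" + " ".join(parts) + ">"
```

For example, start_tag("a", {"href": "http://foo.bar/"}) produces <a href="http://foo.bar/">, which a strict parser can read without guessing where the value ends.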

