HTML Parser and Generator Implementations

This is a sort of "Family Tree" of HTML parser implementations, annotated with notes on features and bugs.

I'm working on updating the HTML parser in our reference code. See: A Lexical Analyzer for HTML and Basic SGML.

SGML.c in LibWWW

The first HTML parser ever released was in the library/linemode distribution back in '92 or so. It supported broken markup such as:

<xmp>... </foo> ... </xmp>
<a href=http://foo.bar/>...</a>
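
As I read the two examples, the first has an unrelated end tag inside XMP, which strict SGML would take as the end of the literal text, and the second has an unquoted attribute value containing ":" and "/". Here is a minimal Python sketch of that kind of leniency. The names ATTR, parse_attrs and xmp_content are made up for illustration and are not taken from SGML.c.

```python
import re

# Hypothetical pattern: an attribute name, optionally followed by "=" and a
# double-quoted, single-quoted, or bare (unquoted) value.
ATTR = re.compile(
    r'''([A-Za-z][-.\w]*)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?''')

def parse_attrs(text):
    """Parse the attribute part of a start tag, accepting unquoted values."""
    attrs = {}
    for m in ATTR.finditer(text):
        name = m.group(1).lower()
        # Take whichever of the three value forms matched; None if no value.
        value = next((g for g in m.group(2, 3, 4) if g is not None), None)
        attrs[name] = value
    return attrs

def xmp_content(source):
    """Return the raw text of the first XMP element.

    Only a literal </xmp> ends the element; other end tags are kept as text.
    """
    m = re.search(r'<xmp>(.*?)</xmp>', source, re.I | re.S)
    return m.group(1) if m else None
```

For instance, parse_attrs('href=http://foo.bar/') keeps the whole URL as the value of href, and xmp_content applied to the first example leaves </foo> in the returned text.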

- NCSA Mosaic 2.4 -- didn't use CERN code, but was inspired by it.
  - Spyglass Mosaic -- rewrite of NCSA code
  - Netscape -- re-implementation of NCSA code
    - MS IE -- inspired by Netscape

Based on regexps. This parser treats P, LI, DT, and DD as empty elements. Nifty formatter code. Guido wrote the first web spider, I believe.
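
To show what treating P, LI, DT, DD as empty elements means for a regexp-based parser, here is a minimal Python sketch. It is not the parser described above. The names TAG, EMPTY and events, and the event tuples, are assumptions for illustration only.

```python
import re

# Hypothetical tag pattern: "<", optional "/", a name, anything up to ">".
TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')

# Elements reported as empty: they never open a container.
EMPTY = {'p', 'li', 'dt', 'dd'}

def events(source):
    """Return a list of (kind, value) events for the markup in source."""
    stack = []   # names of the container elements currently open
    out = []
    pos = 0
    for m in TAG.finditer(source):
        if m.start() > pos:
            out.append(('data', source[pos:m.start()]))
        closing, name = m.group(1), m.group(2).lower()
        if name in EMPTY:
            # A start tag is a point event; a stray </p> etc. is dropped.
            if not closing:
                out.append(('empty', name))
        elif closing:
            # Close every open element up to and including the named one.
            if name in stack:
                while stack:
                    top = stack.pop()
                    out.append(('end', top))
                    if top == name:
                        break
        else:
            stack.append(name)
            out.append(('start', name))
        pos = m.end()
    if pos < len(source):
        out.append(('data', source[pos:]))
    return out
```

With this approach the parser never has to infer where a P or LI ends, which is most of what makes omitted end tags awkward. The cost is that the text of a list item is reported as a sibling of the LI event rather than as its content.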

Tools that Write HTML

LaTeX2HTML: Writes documents that omit the quotes around attribute values.
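
A generator can avoid this class of bug by writing every attribute value through one routine that always quotes and escapes it. Here is a minimal Python sketch. The function name start_tag is hypothetical, and this is not how LaTeX2HTML itself is written.

```python
from html import escape

def start_tag(name, attrs):
    """Build a start tag with every attribute value double-quoted.

    attrs is a dict of attribute names to values (an assumed interface).
    """
    parts = [name]
    for key, value in attrs.items():
        # escape() with quote=True also replaces '"' so the value cannot
        # end the quoted string early.
        parts.append('%s="%s"' % (key, escape(str(value), quote=True)))
    return '<' + ' '.join(parts) + '>'
```

For instance, start_tag('a', {'href': 'http://foo.bar/'}) gives a tag with the URL in double quotes.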

Connolly
$Id: implementations.html,v 1.1 2000/06/19 17:13:03 janet Exp $