HTMLAsSheAreSpoke

There's a gap between W3C Specifications for HTML and HTML as she are spoke (aka tag soup). The MarkupValidator is intended to close that gap by providing automated feedback to authors.

"Tag Soup: Crazy parsing adventures", a January 2006 entry in Hixie's Natural Log, describes his attempts to write a spec for HTML lexical details that is not constrained by XML/SGML. In the browser panel at the W3C Workshop on Usability and Transparency of Web Authentication, some banks asked what they could do to help the browser guys, and chaals said "give me a spec for the HTML you publish."

The number of projects that have solved this problem independently suggests that a standard is worthwhile. In March 2007, W3C chartered an HTML Working Group to work on, among other things, a "non-XML syntax compatible with the 'classic HTML' parsers of existing Web browsers."

This isn't the only task useful or necessary to close the gap. Perhaps an HtmlTaskBrainstorm is in order.

Implementations

html5 parsers

Previous generations of HTML parsers

C, C++

Java

Lisp

lhtml (Lisp)

Perl

HTML Parser (Perl)

PHP

Python

HTMLParser (Python doc)
htmllib (Python doc)
sgmllib (Python doc)
BeautifulSoup
html2text
lxml's HTML parser
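
To make the gap concrete, here is a minimal sketch of feeding a scrap of tag soup to html.parser, the Python 3 standard-library successor to the HTMLParser module listed above. The TagLogger class, the tag_events helper, and the sample markup are made up for illustration; they do not come from any of the projects on this page.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records the events the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the log stays readable.
        if data.strip():
            self.events.append(("data", data.strip()))


def tag_events(markup):
    """Return the list of (kind, value) events seen in the markup."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Tag soup: two unclosed <p>, an unclosed <b>, and a stray </i>.
for event in tag_events("<p>one<p><b>two</i>"):
    print(event)
```

The parser accepts this input without complaint and simply reports the tags as written: nothing closes the paragraphs or the b element, and the unmatched end tag for i is passed through. Deciding what tree that input should produce is left to the caller, which is exactly the part that a shared spec would pin down.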

Ruby

Hpricot (Ruby)

Others

Since this sort of spec is messy by nature, a huge pile of HtmlTestMaterials will be really important. Is there a regression test suite for tidy? Do any of these other projects maintain test suites? I wonder if Yahoo/Google/etc. have HTML test suites they'd be willing to share.

Meanwhile, there are risks involved in straying from XML. See also "Draft description of new TAG issue TagSoupIntegration-54", Henry Thompson to www-tag, 24 Oct 2006.

Markup Errors:

XHTML is not for beginners, but we could say the same of HTML. Markup languages are difficult in general, and the strictness of XML adds to the difficulties.

QA