Tag Soup: Crazy parsing adventures, a January 2006 entry in Hixie's Natural Log, describes his attempts to write a spec for the lexical details of HTML that is not constrained by XML/SGML. In the browser panel at the W3C Workshop on Usability and Transparency of Web Authentication, some banks asked what they could do to help the browser guys, and chaals said, "give me a spec for the HTML you publish."
The number of projects that have solved this problem independently suggests that a standard is worthwhile. In March 2007, W3C chartered an HTML Working Group to work on, among other things, a "non-XML syntax compatible with the 'classic HTML' parsers of existing Web browsers."
This isn't the only task that is useful or necessary to close the gap. Perhaps an HtmlTaskBrainstorm is in order.
Previous generations of HTML parsers
- lhtml (Lisp)
- HTML Parser (Perl)
- HTMLParser (Python doc)
- htmllib (Python doc)
- sgmllib (Python doc)
- lxml's HTML parser
- Hpricot (Ruby)
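
As a rough illustration of what these parsers have in common, here is a minimal sketch using the Python 3 standard-library descendant of the HTMLParser module listed above (`html.parser.HTMLParser`). The `EventCollector` class, the `tokenize` helper and the sample markup are hypothetical names made up for this example. The point it tries to show is that this kind of parser reports tag-soup input as a flat stream of events and does not reject or repair unbalanced tags; deciding what tree those events imply is exactly the part a spec would have to pin down.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records start/end/data events in order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs to keep the event list short.
        text = data.strip()
        if text:
            self.events.append(("data", text))


def tokenize(markup):
    """Feed markup to the collector and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Tag soup: the first <p> and the <b> are never closed.
soup = "<p>one<p>two <b>bold</p>"
events = tokenize(soup)
for event in events:
    print(event)
```

The collector sees two `p` start tags, one `b` start tag and a single `p` end tag, and reports them as-is without raising an error. Any implied end tags have to be worked out by the caller.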
Since this sort of spec is messy by nature, a huge pile of HtmlTestMaterials will be really important. Is there a Tidy regression test suite? Do any of these other projects maintain test suites? I wonder if Yahoo/Google/etc. have HTML test suites they'd be willing to share.
Meanwhile, there are risks involved in straying from XML. See also "Draft description of new TAG issue TagSoupIntegration-54", Henry Thompson to www-tag, 24 Oct 2006.
- XHTML is not for beginners, but we could say the same of HTML. Markup languages are difficult in general. The strictness of XML adds to the difficulties.