The How-To for html 5 parsing
You have read a lot about the html 5 specification. You have heard that there are hidden dragons and acid rain. But why not see for yourself how html 5 parsing works in practice? There are already some tools for experimenting with html 5.
DOM in actual browsers
The DOM (Document Object Model) is the in-memory representation that browsers use to manipulate Web content. Browsers have bugs, and much of the content on the Web is non-conforming. As a result, the same document can produce very different DOM trees in different browsers. If you are interested in seeing what a document looks like in different browsers, you can use the Live DOM Viewer. Open it in each browser you have and paste your markup into the window.
This shows you how different tools understand Web content today.
DOM after html 5 parsing
Now you might be interested in seeing how a document is represented by a tool that implements the html 5 parsing rules. An important note: html 5 is a specification still in development, and things might change. The following tools might be incomplete and contain bugs as well, but they will give you an idea of the resulting DOM. This is very practical when you are developing another language which is not html 5 but might be sent as text/html (by mistake or by practical choice).
There are at least two online services:
- Live html 5 parser by Philip Taylor
- html5lib Based HTML5 Parser
Henri Sivonen developed a standalone application that you can run on your desktop. Here are the instructions to get it running. It worked fine on my Mac.
- Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
- Download and untar GWT 1.5 RC1: http://code.google.com/webtoolkit/versions.html
- On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).
- Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
- Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux)
Henri has published a list of limitations and bugs.
Using html 5 parsing in your own code
For now, there are three implementations of the html 5 parsing algorithm:
- html5lib python 0.11.1
- html5lib ruby 0.10.0
- html 5 parser java
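To give an idea of what using one of these libraries looks like, here is a minimal sketch with the Python version of html5lib. It assumes a recent release (installed with pip install html5lib) that provides html5lib.parse; the API of the 0.11.1 release listed above may differ. The outline helper and both sample documents, including the note and to element names, are made up for illustration.

```python
import html5lib


def outline(element, depth=0):
    """Return the element names of a parsed tree, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# Sloppy HTML: no doctype, no <html>/<head>/<body>, unclosed <p> elements.
sloppy_html = "<title>Test</title><p>One<p>Two"

# A made-up XML-style vocabulary (hypothetical element names) sent as text/html.
xml_like = "<note><to/>Alice</note>"

# namespaceHTMLElements=False keeps tag names short ("p" instead of
# "{http://www.w3.org/1999/xhtml}p"). The default tree builder produces
# xml.etree.ElementTree elements, and parse() returns the root html element.
html_tree = html5lib.parse(sloppy_html, namespaceHTMLElements=False)
xml_like_tree = html5lib.parse(xml_like, namespaceHTMLElements=False)

print("\n".join(outline(html_tree)))
print("\n".join(outline(xml_like_tree)))
```

With a recent html5lib, I would expect the first document to come out with html, head and body added and the two p elements as siblings. In the second one, the text "Alice" should end up inside the to element, because the text/html parsing rules ignore the "/>" on unknown elements. That is exactly the kind of surprise worth checking before sending another language as text/html.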
There is an attempt at implementing it in C# for .NET 2.0, but no code has been released yet.
If you know of other tools implementing it, leave a comment.
Can this be updated to reflect the current state of HTML 5 parsers, please?
There is a lot of other information in the posting that's now out of date, so I don't think it makes much sense to update the posting itself. But the answer about the current state of HTML5 parsers is that all major browsers now have a parser that conforms to the parsing algorithm in the HTML5 spec. Outside the browser, though, there are still not really any parsers in wide use other than html5lib and Henri's validator.nu parser.