The How-To for html 5 parsing
You have read a lot about the html 5 specification. You have heard that there are hidden dragons and acid rain. But what about seeing for yourself, in practice, how html 5 parsing works? There are already some tools to play with.
DOM in current browsers
The DOM (Document Object Model) is the in-memory representation that browsers use to manipulate Web content. Browsers have bugs, and much of the content on the Web is non-conforming. As a result, the same document can end up with quite different DOM representations in different browsers. If you are interested in seeing what a document looks like in different browsers, you can use the Live DOM Viewer. Open it in each browser you have and paste code into the window.
This helps you see how Web content is understood today by different tools.
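To make the idea of a DOM tree concrete, here is a minimal sketch that prints a document as an indented tree, roughly the kind of view a DOM viewer gives you. It assumes Python and uses the standard library's xml.dom.minidom, which is an XML parser and not a browser, so it only accepts well-formed input. The `dump` helper is a name made up for this illustration.

```python
from xml.dom import minidom


def dump(node, depth=0, out=None):
    """Return the DOM subtree under `node` as a list of indented lines."""
    if out is None:
        out = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        out.append(f"{indent}{node.tagName}")
    elif node.nodeType == node.TEXT_NODE:
        # repr() makes leading and trailing whitespace visible
        out.append(f"{indent}#text: {node.data!r}")
    for child in node.childNodes:
        dump(child, depth + 1, out)
    return out


# Well-formed input only: minidom raises an error on tag soup.
doc = minidom.parseString("<html><body><p>Hello <em>DOM</em></p></body></html>")
lines = dump(doc.documentElement)
print("\n".join(lines))
```

Browsers differ precisely on the input this sketch cannot handle: markup that is not well-formed. That is where comparing their DOMs becomes interesting.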
DOM after html 5 parsing
Now you might be interested in seeing how a document will be represented by a tool implementing the html 5 parsing rules. An important note: html 5 is a specification still in development, so things might change. The following tools might also be incomplete and contain bugs, but they will give you an idea of the resulting DOM. This is very practical when you are developing another language which is not html 5 but might be sent as text/html (by mistake or by practical choice).
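To give a feel for what "parsing rules" means here, the following toy sketch applies one simplified rule: a p start tag closes a p element that is still open. It is not an html 5 parser. It assumes Python, uses the standard library's html.parser as a tokenizer (which is not the html 5 tokenizer), ignores the scoping conditions the specification attaches to this rule, and does not create the implied html, head and body elements. `ToyTreeBuilder` and `outline` are hypothetical names for this illustration.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Toy tree builder: implements ONE simplified rule, nothing more."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])  # element node = (name, children)
        self.stack = [self.root]       # stack of open elements

    def handle_starttag(self, tag, attrs):
        # Simplified rule: a <p> start tag closes a <p> that is still open.
        if tag == "p":
            self._close("p")
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)  # void elements such as <br> are not handled

    def handle_endtag(self, tag):
        self._close(tag)

    def handle_data(self, data):
        # text node = ("#text", string)
        self.stack[-1][1].append(("#text", data))

    def _close(self, tag):
        # Pop up to and including the nearest open element named `tag`.
        names = [n[0] for n in self.stack]
        if tag in names[1:]:
            del self.stack[len(names) - 1 - names[::-1].index(tag):]


def outline(node, depth=0):
    """Return the tree under `node` as a list of indented lines."""
    name, payload = node
    if name == "#text":
        return ["  " * depth + f"#text: {payload!r}"]
    lines = ["  " * depth + name]
    for child in payload:
        lines.extend(outline(child, depth + 1))
    return lines


builder = ToyTreeBuilder()
builder.feed("<p>one<p>two")  # neither paragraph is explicitly closed
builder.close()
tree = outline(builder.root)
print("\n".join(tree))
```

With the rule in place, the two paragraphs come out as siblings. Without it, the second p would be nested inside the first. The real algorithm has many more rules of this kind, which is why the tools below are worth trying.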
There are at least two online services. You can also build and run the htmlparser code locally:
- Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
- Download and untar GWT 1.5 RC1: http://code.google.com/webtoolkit/versions.html
- On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).
- Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
- Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux)
Henri has given a list of limitations and bugs.
Using html 5 parsing in your own code
For now, there are three implementations of the html 5 parsing algorithm.
There is also an attempt at implementing it in C# for .NET 2.0, but no code has been released yet.
If you know of other tools implementing it, leave a comment.