The How-To for HTML 5 parsing

You have read a lot about the HTML 5 specification. You have heard that it hides dragons and acid rain. But what about seeing for yourself, in practice, how HTML 5 parsing works? There are already some tools for experimenting with it.

DOM in current browsers

The DOM (Document Object Model) is the in-memory representation that browsers use to manipulate Web content. Browsers have bugs, and much of the content on the Web is non-conforming. As a result, the same document can produce very different DOM trees in different browsers. If you are interested in seeing what a document looks like in different browsers, you can use the Live DOM Viewer. Open it in each browser you have and paste your markup into the window.

This helps you see how different tools understand Web content today.
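
To get a feel for why the tree-building rules matter, here is a minimal Python sketch. It is only an illustration and assumes nothing beyond the standard library. The names TagOutline and naive_outline are made up for this example, and html.parser is a plain tokenizer, not a browser and not an HTML 5 tree builder. It simply nests every start tag under the previous unclosed one, exactly as written in the source.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Records start tags as an indented outline, with no error recovery."""

    # A few void elements that never take an end tag (illustrative subset).
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level
        self.lines = []  # one indented entry per start tag

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        # Naive: any end tag closes one level, whatever its name.
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1


def naive_outline(markup):
    """Return the indented list of tags found in markup (hypothetical helper)."""
    parser = TagOutline()
    parser.feed(markup)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    # The second <p> is nested under the first, because nothing closed it.
    print("\n".join(naive_outline("<p>One<p>Two")))
```

For the markup <p>One<p>Two, this naive outline puts the second p inside the first, whereas browsers close the first p before opening the second. Differences of this kind are what you can observe by pasting the same markup into the Live DOM Viewer.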

DOM after HTML 5 parsing

Now you might want to see how a document is represented by a tool that implements the HTML 5 parsing rules. An important note: HTML 5 is a specification still in development, so things might change. The following tools might also be incomplete and contain bugs, but they will give you an idea of the resulting DOM. This is very practical when you are developing another language that is not HTML 5 but might be sent as text/html (by mistake or by practical choice).

There are at least two online services:

Henri Sivonen has developed a standalone application that you can run on your desktop. Here are the instructions to get it running. It worked fine on my Macintosh.

  1. Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
  2. Download and untar GWT 1.5 RC1: http://code.google.com/webtoolkit/versions.html
  3. On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).
  4. Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
  5. Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux).

Henri has published a list of limitations and bugs.

Using HTML 5 parsing in your own code

For now, there are three implementations of the HTML 5 parsing algorithm.

There is also an attempt at implementing it in C# for .NET 2.0, but no code has been released yet.
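
As an illustration of what calling such a parser from your own code can look like, here is a hedged Python sketch. It assumes the third-party html5lib package is installed. The choice of library is an assumption made for this example, and the helper name html5_outline is hypothetical. If the package is missing, the helper returns None instead of failing.

```python
try:
    import html5lib  # third-party package (assumption: installed separately)
except ImportError:
    html5lib = None


def html5_outline(markup):
    """Return an indented outline of the element tree built from markup.

    Returns None when html5lib is not available.
    """
    if html5lib is None:
        return None

    # namespaceHTMLElements=False keeps tag names plain ("p", not "{ns}p").
    # With the default tree builder the result is the root <html> element
    # as an xml.etree.ElementTree element.
    root = html5lib.parse(markup, namespaceHTMLElements=False)

    lines = []

    def walk(element, depth):
        # Comment nodes have a non-string tag in ElementTree; skip them.
        if isinstance(element.tag, str):
            lines.append("  " * depth + element.tag)
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


if __name__ == "__main__":
    outline = html5_outline("<p>One<p>Two")
    if outline is None:
        print("html5lib is not installed")
    else:
        print("\n".join(outline))
```

Even for a fragment as small as <p>One<p>Two, the parser supplies the html, head and body elements that were never written, and the two p elements end up as siblings inside body.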

If you know of other tools that implement it, leave a comment.
