Clean the Web with libxml2

Introduction

The Web (of HTML/XHTML documents) is largely defined by tag soup: invalid and non-well-formed syntax.

In 1996 at WWW5, the paper "An Investigation of Documents from the World Wide Web" reported on data from 2.6 million HTML documents collected by the Inktomi Web Crawler. The authors found that over 40% of the documents in their study contained at least one error. Since then there have been a number of surveys; the Web Authoring Statistics by Ian Hickson at Google is one of the most recent. According to these surveys, 90% to 95% of the Web is invalid and/or non-well-formed.

HTML 5 goals

In March 2007, the W3C restarted work on HTML, building on the work done by the WHATWG and its editor, Ian Hickson, to define HTML 5. HTML 5 is far more than an evolution of HTML 4.01: it includes the DOM, some APIs, and a custom parsing algorithm. For the first time, HTML is defined in terms of a DOM, which is how browsers interpret the Web. Once this DOM tree has been created, there is a choice between two serializations, XML and HTML. The XML serialization has to be served as application/xhtml+xml; the HTML serialization has to be served as text/html.
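To make the distinction concrete, here is a minimal, hypothetical document rendered in both serializations from the same DOM tree; note the self-closing void element and the mandatory XHTML namespace on the XML side:

  HTML serialization (Content-Type: text/html):

    <!DOCTYPE html>
    <html>
    <head><title>Demo</title></head>
    <body><p>Hello<br>world</p></body>
    </html>

  XML serialization (Content-Type: application/xhtml+xml):

    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>Demo</title></head>
    <body><p>Hello<br/>world</p></body>
    </html>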

[Figure: the two HTML 5 serializations. Source: "HTML 5, one vocabulary, two serializations", W3C Q&A blog, January 15, 2008]

When reading a document on the Web (likely to be invalid) and creating the DOM tree, clients have to recover from syntax errors. The HTML 5 parsing algorithm describes precisely how to recover from erroneous syntax.
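libxml2 already ships an HTML parser with its own recovery heuristics, which predate, and differ from, the ones HTML 5 specifies. A minimal sketch of feeding it tag soup and dumping the repaired tree:

    #include <stdio.h>
    #include <string.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/HTMLtree.h>

    int main(void) {
        /* Tag soup: unclosed <b>, stray </i>, unclosed <p> elements. */
        const char *soup = "<p>Hello <b>world</i><p>Second paragraph";

        htmlDocPtr doc = htmlReadMemory(soup, (int) strlen(soup),
                                        "soup.html", NULL,
                                        HTML_PARSE_RECOVER |
                                        HTML_PARSE_NOERROR |
                                        HTML_PARSE_NOWARNING);
        if (doc == NULL)
            return 1;

        /* Serialize the recovered DOM tree back out as HTML. */
        htmlDocDump(stdout, doc);

        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }

Compile with gcc soup.c $(xml2-config --cflags --libs). Implementing the HTML 5 algorithm in libxml2 would change the tree such broken input produces, making the recovery match what browsers are specified to do.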

Cleaning the Web - Implementing HTML 5 parsing in libxml2

The HTML 5 parsing algorithm is starting to be implemented in some clients, and several libraries have been developed. The "How-To for html 5 parsing" lists ongoing implementations (Python, Java, Ruby). Some of them are quite slow.

The original idea was to have an Apache module that could clean up the content before pushing the page to clients. Clients that no longer have to recover from broken documents could be more efficient. At the same time, it would be a lot easier to create quality reporting tools for webmasters and/or CMSes, with the error analysis being done on the server. Step by step, this raises the quality of the content.
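As a sketch of that idea, here is the skeleton of a hypothetical Apache 2.x output filter (the module and filter names are invented for illustration); the actual buffering and cleaning step, where the HTML 5 parsing and re-serialization would happen, is elided:

    #include "httpd.h"
    #include "http_config.h"
    #include "util_filter.h"
    #include "apr_buckets.h"

    /* Hypothetical filter name; everything here is illustrative. */
    static const char htmlclean_name[] = "HTMLCLEAN";

    static apr_status_t htmlclean_out_filter(ap_filter_t *f,
                                             apr_bucket_brigade *bb)
    {
        apr_bucket *b;

        for (b = APR_BRIGADE_FIRST(bb);
             b != APR_BRIGADE_SENTINEL(bb);
             b = APR_BUCKET_NEXT(b)) {
            if (APR_BUCKET_IS_EOS(b)) {
                /* End of the response body: this is where the
                   buffered document would be parsed (ideally with
                   an HTML 5-conforming parser) and re-serialized
                   as clean markup before going downstream. */
            }
            /* A real filter would accumulate bucket data here. */
        }
        return ap_pass_brigade(f->next, bb);
    }

    static void htmlclean_register_hooks(apr_pool_t *p)
    {
        ap_register_output_filter(htmlclean_name, htmlclean_out_filter,
                                  NULL, AP_FTYPE_RESOURCE);
    }

    module AP_MODULE_DECLARE_DATA htmlclean_module = {
        STANDARD20_MODULE_STUFF,
        NULL,                  /* per-directory config creator */
        NULL,                  /* per-directory config merger */
        NULL,                  /* per-server config creator */
        NULL,                  /* per-server config merger */
        NULL,                  /* command table */
        htmlclean_register_hooks
    };

Such a filter would then be enabled per content type with something like AddOutputFilterByType HTMLCLEAN text/html.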

Nick Kew weighed in and proposed targeting libxml2, which includes an HTML parser and is already used by the Apache server and many other tools.

From here, it would be interesting to implement the HTML 5 parsing algorithm in libxml2. It would benefit the community at large.
