HTML/XML Task Force Report

W3C Editor's Draft 22 December 2011

This Version:
Latest Version:
Previous version:
Norman Walsh, MarkLogic Corporation

This document is also available in these non-normative formats: XML


This document is a very rough working draft of the HTML/XML Task Force report.

Status of this Document

This document is an editor's draft that has no official standing.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is an Editor's Working Draft and does not necessarily reflect the consensus of the task force.

Please report errors in this document to the HTML/XML Task Force mailing list public-html-xml@w3.org (public archives are available).

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


1 Introduction

HTML and XML share a common ancestor in SGML. The precise details of that ancestry are not strictly important; its significant consequence is that HTML and XML have quite similar surface syntax. Both use angle brackets and ampersands to distinguish “markup” characters from “content” characters. Both have elements which contain other content and elements which are empty.

This high level of surface similarity suggests, at least to some and at least at first, that there should be a high level of interoperability between HTML and XML systems. This notion is amplified by the fact that when XML arrived on the scene, well after HTML was widely deployed, efforts were made to recast HTML as an XML application rather than an SGML application. HTML was never broadly implemented as an “SGML application”, but it was defined as one in the early HTML specifications.

However, if you look beyond those high-level generalities, the languages are quite different and serve quite different purposes. Where HTML is a single language, XML is a framework for defining languages. Where HTML defines how a tree is constructed from any input, XML only defines tree construction for a small subset of all possible inputs. Where HTML defines explicit extension points within a single vocabulary, XML encourages the use of multiple vocabularies defined in a distributed fashion. Where HTML is in a small, explicit set of namespaces, XML provides for an unbounded number of namespaces.

Against the backdrop of this tension, the TAG formed this Task Force to explore how interoperability between HTML and XML could be improved. The Task Force began by collecting use cases to focus its efforts. The original expectation was that a set of use cases would highlight those areas where additional work could aid interoperability between XML and HTML. However, all of the use cases appear to have plausible solutions today, and those solutions do not appear amenable to significant improvement, so it appears that there is little that can be done beyond documenting these circumstances.

In the following section, we'll describe a set of use cases that the Task Force considered, and how the needs of those use cases can be met today. Readers are particularly encouraged to report additional use cases that they feel are not represented or specific examples where the solutions outlined are not appropriate.

A note about terminology: there are a great many ways to represent the “object model” of an HTML or XML document. There are specifications for both abstract and concrete representations. As a simplification, we use the term “DOM” (Document Object Model) throughout as a general term for any of these possible representations.

2 Use Cases

The task force set out to examine a number of use cases. These are outlined in the following sections.

2.1 How can an XML toolchain be used to consume HTML?

Problem statement A great many systems exist which process XML. These include, but are not limited to, validation tools, a broad spectrum of editors, browsers, query and transformation languages, and countless ad hoc tools. Many of these tools could be applied equally to HTML content, if such content were accessible to them.

Resolution The principal impediment to using XML tools with HTML is that HTML is not guaranteed (or even likely, in the context of the internet at large) to be well-formed. XML parsers reject documents which are not well-formed, so the overwhelming majority of HTML documents cannot be used by systems which only process XML.
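As a minimal illustration of that rejection, the following Python sketch feeds an ordinary, non-well-formed HTML fragment to an XML parser. The fragments and the helper name is_well_formed are invented for this example; it assumes only the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Common HTML: an unquoted attribute value and a void element with no end tag.
html_fragment = "<p class=intro>Hello<br>world</p>"

# The same content written so that an XML parser accepts it.
xml_fragment = '<p class="intro">Hello<br/>world</p>'

print(is_well_formed(html_fragment))
print(is_well_formed(xml_fragment))
```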

The Task Force recognizes two approaches to address this problem: use polyglot markup or introduce an HTML parser into your processing toolchain.

Polyglot markup refers to documents which have been carefully crafted such that they are simultaneously XML and HTML compatible. It seems that the world at large is unlikely to adopt polyglot markup as the standard way to encode all HTML documents, so this solution has limited applicability.

However, the vast majority of HTML documents could be written using polyglot markup and doing so would make them immediately available for processing by tools that anticipate either XML or HTML markup. If you have control over the authoring environments that are used to create content for your system, then it may be entirely feasible to address the “consume HTML with XML tools” problem simply by being more careful about the HTML that you produce.

Where it is applicable, polyglot markup constrains the text/html content in such a way that when parsed as XML, it produces the same parse tree (except for certain minor, specified differences) as it would produce if parsed using an implementation of the HTML parsing algorithm.
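The idea can be sketched in Python under some loud assumptions. The sample document below is written in the polyglot style but has not been checked against the full Polyglot Markup specification, and Python's html.parser is a lenient tokenizer, not an implementation of the HTML parsing algorithm. The sketch only shows that both kinds of tool see the same sequence of elements for such a document.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample written in the polyglot style.
POLYGLOT_STYLE_DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><title>Example</title></head>
<body><p>Line one<br/>line two</p></body>
</html>"""


class StartTagCollector(HTMLParser):
    """Record the name of every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def names_via_xml(text):
    """Element local names, in document order, as seen by an XML parser."""
    root = ET.fromstring(text)
    return [element.tag.split("}", 1)[-1] for element in root.iter()]


def names_via_html(text):
    """Element names, in document order, as seen by the HTML tokenizer."""
    collector = StartTagCollector()
    collector.feed(text)
    collector.close()
    return collector.names


print(names_via_xml(POLYGLOT_STYLE_DOC))
print(names_via_html(POLYGLOT_STYLE_DOC))
```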

Alternatively, rather than attempting to constrain the HTML input so that it conforms to the polyglot constraints, an HTML parser can be introduced to the front of the XML toolchain. Such a parser reads the HTML markup “as she is writ” in the world at large and produces a representation of that tree that an XML processor can use. It is still possible to encounter HTML documents whose document tree needs to be modified slightly for the document tree to be representable as XML. For conforming input, the modifications are on the level of replacing form feeds with spaces.
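The form feed case can be illustrated with a short Python sketch. A form feed (U+000C) is not a legal character in XML 1.0, so text containing one has to be adjusted before an XML tool will accept it. The helper name make_xml_safe is hypothetical, and for brevity the replacement is applied to a serialized string rather than to the text nodes of a parsed tree.

```python
import xml.etree.ElementTree as ET

FORM_FEED = "\x0c"


def make_xml_safe(text: str) -> str:
    """Replace form feeds, which XML 1.0 cannot represent, with spaces."""
    return text.replace(FORM_FEED, " ")


raw = "<p>page one" + FORM_FEED + "page two</p>"

# The XML parser refuses the text as it stands.
try:
    ET.fromstring(raw)
    accepted_raw = True
except ET.ParseError:
    accepted_raw = False

# After the substitution the same text parses normally.
cleaned = ET.fromstring(make_xml_safe(raw))

print(accepted_raw)
print(cleaned.text)
```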

Like XML parsing, HTML parsing produces a tree. Exposing that tree to the XML toolchain (as a sequence of events, such as SAX events, or an in-memory tree model, or through any other appropriate implementation mechanism) makes all of the XML power available to any HTML document.
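The shape of such a front end can be sketched in Python. This is an illustration under stated assumptions, not a conforming implementation: the standard library's html.parser is a lenient tokenizer that does not apply the error-recovery rules of the HTML parsing algorithm (it will not, for example, close an open p element implicitly), and the class name HtmlTreeBuilder and the sample markup are invented here. A production toolchain would substitute a parser that implements the HTML parsing algorithm.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have an end tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Turn lenient HTML tokens into an ElementTree (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (such as "disabled") arrive with a None value.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_ELEMENTS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray end tag: ignore it
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self._open:  # ignore text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()  # flushes any buffered text first
        while self._open:  # close anything the author left open
            self._builder.end(self._open.pop())
        return self._builder.close()


def html_to_tree(markup):
    parser = HtmlTreeBuilder()
    parser.feed(markup)
    return parser.close()


# Unquoted attributes, void elements, and a missing end tag.
root = html_to_tree('<div class=note><img src=a.png alt="A & B"><br>Fish &amp; chips')

# From here on, ordinary XML tooling applies.
print(root.find("img").get("alt"))
print(ET.tostring(root, encoding="unicode"))
```

Once the tree exists, the XML side of the toolchain (path queries, serialization, transformation) no longer needs to know that the input was HTML.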

2.2 How can an HTML toolchain be used to consume XML?

Problem statement HTML toolchains are widespread and popular. Users may encounter XML documents and want to process them using familiar tools. This use case is the logical reciprocal of the previous one; it is about allowing developers of HTML-only tools to provide useful functionality to users who have non-XHTML, non-SVG, non-MathML content, even though the tool developer doesn't have a business need to address it explicitly.

Resolution HTML5 doesn't have an extensibility story that admits the possibility of content in arbitrary namespaces. XML markup from vocabularies totally unlike HTML, SVG, or MathML will be parsed and interpreted according to the HTML5 rules. These rules are very unlikely to produce the same DOM that an XML parser would have produced.

For XML content that needs to be textually embedded in HTML5, the most successful approach may be to simply translate the XML to HTML5 before passing it to the HTML5 tool. Of course, translating an arbitrary XML vocabulary into HTML5 may be very difficult. If a faithful translation isn't possible, even the simple transformation that strips out processing instructions and non-HTML namespaces may help.
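One possible reading of that simple transformation is sketched below in Python. It is an assumption of this example, not something the report specifies, that "stripping" a foreign namespace means discarding elements outside the XHTML namespace together with their content while keeping the text that follows them. The function names and the urn:example:notes vocabulary are invented, and namespaced attributes are not handled.

```python
import xml.etree.ElementTree as ET

XHTML_PREFIX = "{http://www.w3.org/1999/xhtml}"


def strip_foreign(element):
    """Drop descendants outside the XHTML namespace, keeping surrounding text."""
    previous = None
    for child in list(element):
        if isinstance(child.tag, str) and child.tag.startswith(XHTML_PREFIX):
            strip_foreign(child)
            previous = child
            continue
        # Re-attach the text that followed the removed element.
        tail = child.tail or ""
        if previous is None:
            element.text = (element.text or "") + tail
        else:
            previous.tail = (previous.tail or "") + tail
        element.remove(child)
    # Use the plain HTML element name from here on.
    element.tag = element.tag[len(XHTML_PREFIX):]


def xml_to_html(xml_text):
    """Assumes the root element is in the XHTML namespace."""
    root = ET.fromstring(xml_text)  # the default parser discards PIs and comments
    strip_foreign(root)
    return ET.tostring(root, encoding="unicode", method="html")


source = (
    '<div xmlns="http://www.w3.org/1999/xhtml" xmlns:x="urn:example:notes">'
    '<?render fast?>'
    '<p>Hi <x:note>internal</x:note>there</p>'
    '</div>'
)

print(xml_to_html(source))
```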

Processing a real XML document with an HTML5 parser is probably never going to be possible with complete fidelity.

In an environment where the HTML toolchain includes access to an XML parser and the HTML and XML resources can be managed separately, the most successful approach is likely to involve parsing the XML with an XML parser and the HTML with an HTML parser.

2.3 How can islands of HTML be embedded in XML?

Problem statement In XML vocabularies that are not intrinsically about representing prose, it's often useful to provide elements into which documentation or “prose annotations” can be provided. One common design pattern in these cases is to establish HTML as one of the common vocabularies marking up documentation in those elements.

This pattern establishes the practice of embedding islands of HTML in XML documents that are not otherwise anything like HTML or intended to be processed directly by HTML tools.

The question naturally arises, how can HTML5 be embedded in an XML document?

Resolution Broadly speaking, there are two techniques for addressing the question of how HTML is to be embedded in XML.

  1. Make sure that the HTML markup is well-formed XML. This is typically done by explicitly or implicitly asserting that the content is XHTML. This makes the HTML a natural part of the XML document at the expense of imposing XML markup requirements on the author.

  2. Within the container element, escape all characters that might be interpreted as markup. This absolves the author of the responsibility to construct well-formed XML, at the expense of requiring tools to escape and unescape the markup and support non-well-formed markup “downstream”.

Both of these techniques can be applied to HTML5 markup. In the former case, use the XML serialization of HTML5. In the latter case, escape the HTML5 markup.
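The two techniques can be contrasted with a short Python sketch. The documentation container element and the sample content are hypothetical; only the standard library's xml.etree.ElementTree is assumed.

```python
import xml.etree.ElementTree as ET

# Technique 1: the HTML is well-formed XHTML and is part of the XML tree.
as_xhtml = (
    '<documentation>'
    '<p xmlns="http://www.w3.org/1999/xhtml">Line one<br/>line two</p>'
    '</documentation>'
)
xhtml_child = ET.fromstring(as_xhtml)[0]
print(xhtml_child.tag)

# Technique 2: the HTML is opaque text; the serializer escapes the markup.
html_source = "<p>Line one<br>line two"  # need not be well-formed
container = ET.Element("documentation")
container.text = html_source
escaped_xml = ET.tostring(container, encoding="unicode")
print(escaped_xml)

# A downstream tool unescapes simply by reading the text content back,
# and must then hand it to something that understands HTML.
recovered = ET.fromstring(escaped_xml).text
print(recovered == html_source)
```

In the first case XML tools can address the p and br elements directly; in the second they see only a string, and the burden of parsing the HTML moves downstream.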

If the HTML subsystem has an interface that allows document trees to be passed to it, the XHTML subtree should be extracted from the larger XML tree and passed to the HTML subsystem. If the HTML subsystem only accepts HTML source text as its input, the XHTML subtree needs to be serialized as HTML and passed to the HTML subsystem for parsing using an HTML parser. In the latter case, some non-conforming constructs may not round-trip to the same tree shape when serialized as HTML and reparsed as HTML. Also, conforming trees that have tr elements as children of table elements will be replaced with semantically equivalent but tree-wise different constructs where the tr elements gain a tbody parent which is a child of the table.

2.4 How can islands of XML be embedded in HTML?

Problem statement In principle, the same powerful scripting and styling facilities that allow users to create rich internet applications with HTML5 can operate on XML documents. Users may attempt to engage in a “progressive enhancement” strategy for building such applications by adding islands of more richly structured XML markup to existing HTML5 documents.

The user's expectation is that these XML islands will appear in the DOM where they can be addressed with JavaScript and formatted with CSS.

Resolution In fact, this is not the case. When the HTML5 parser encounters unfamiliar markup, it assumes that such markup is an erroneous attempt to generate well-defined HTML5. Consequently, it applies error correction strategies which result in a DOM representation that can differ radically from the DOM that an XML parser would have produced. In particular, open elements may end prematurely and additional elements may be opened.

The practical result is that a “naked” XML island in an HTML5 document will not reliably produce anything that resembles the DOM one would expect from casual inspection of the XML island.

In order to conceal the XML markup from the HTML5 parser's attempts to correct errors, the XML must be stored within a script element. The script element can identify its content as XML by specifying the media type “application/xml”, or any other applicable media type, in its type attribute.

What the HTML5 parser produces when it processes this script element is a script element node in the DOM which contains the literal character representation of the XML. That representation can be extracted by JavaScript when the page is loaded, parsed into an actual XML DOM, and processed by the application.
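The extract-then-parse step is described above in terms of JavaScript running in the page. The following Python sketch shows the equivalent step for a non-browser consumer, using Python only for consistency with the other examples. The page content, the urn:example:orders vocabulary, and the class name XmlIslands are invented. The sketch relies on html.parser treating script content as raw text, and it assumes the island does not itself contain the character sequence that ends a script element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

PAGE = """<!DOCTYPE html>
<html><body>
<script type="application/xml" id="orders">
  <orders xmlns="urn:example:orders">
    <order id="1"><item>Widget</item></order>
  </orders>
</script>
</body></html>"""


class XmlIslands(HTMLParser):
    """Collect the content of XML-typed script elements as parsed XML trees."""

    def __init__(self):
        super().__init__()
        self.islands = []
        self._buffer = None  # a list while inside an XML script element

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/xml":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)  # literal characters, not markup

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            self.islands.append(ET.fromstring(text))  # now a real XML tree
            self._buffer = None


extractor = XmlIslands()
extractor.feed(PAGE)
extractor.close()

island = extractor.islands[0]
print(island.tag)
print(island.find(".//{urn:example:orders}item").text)
```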

This technique allows arbitrary XML islands to be embedded in HTML5, but such islands are only accessible to processors that are able and willing to execute the necessary JavaScript shim.

2.5 How can XML be made more forgiving of errors?

Problem statement Some significant portion of HTML5 is generated by server-side tools that do little more than string concatenation. Markup generated by naive string concatenation often contains minor markup errors. The HTML5 parser consistently (and often correctly) corrects for these mistakes and constructs a useful DOM from not-quite-perfectly constructed input.

The XML parser is utterly unforgiving in the face of even small markup errors. As a result, XML constructed using otherwise straightforward techniques in many programming languages is sometimes not well-formed unless great care is taken.
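A typical instance of the problem, sketched in Python with invented function names: a value containing an ampersand is concatenated into markup without escaping, and the XML parser rejects the result outright. Escaping the value first (here with the standard library's xml.sax.saxutils.escape) is the kind of care the paragraph above refers to.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def naive_item(name: str) -> str:
    """Build markup by plain string concatenation."""
    return "<item>" + name + "</item>"


def careful_item(name: str) -> str:
    """Same concatenation, but with markup characters in the value escaped."""
    return "<item>" + escape(name) + "</item>"


def parses(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


name = "Fish & Chips"

print(parses(naive_item(name)))    # the bare ampersand is a fatal error
print(parses(careful_item(name)))
```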

Resolution This aspect of XML parsing could be addressed by a more lenient parser (such as XML5). Working out all of the details to assure that the necessary error correction produces expected results in all cases might be tedious, but some efforts have already been undertaken to examine the issue.

3 Conclusions

The Task Force considered several areas of interoperability that arose in these use cases: consuming HTML with XML tools, consuming XML with HTML tools, and embedding islands of one within the other. As described above, each use case has well-understood boundaries within which any solution must operate, and within those boundaries a solution exists today, even if it is perhaps not wholly satisfactory. No wholly satisfying solution appears possible within the accepted constraints; it would appear that the practical solutions have already been achieved.

With respect to the question of making XML more forgiving of errors, it's clear that some work has been done in this area and that it is possible to articulate coherent proposals for such change. However, it's entirely unclear that the XML community would be motivated to adopt such changes and, in any event, making such proposals is outside the scope of this Task Force. We recommend further study within the XML community before determining how best to explore these changes.

On the question of Polyglot markup, there seems to be little consensus. One line of argument suggests that, to the extent that it is practical to obey the Robustness principle, it makes sense to do so. That is, if you're generating HTML markup for the web, and you can generate Polyglot markup that is also directly consumable as XML, you should do so. Another line of argument suggests that even under the most optimistic of projections, so tiny a fraction of the web will ever be written in Polyglot that there's no practical benefit to pursuing it. If you want to consume HTML content, use an HTML parser that produces an XML-compatible DOM or event stream.

A References

[XML5] XML5. van Kesteren, Anne. Weblog posting. 23 October 2007.

[Polyglot] Polyglot Markup: HTML-Compatible XHTML Documents. Graff, Eliot, editor. W3C Working Draft. 25 May 2011.

[HTML5] HTML5: A vocabulary and associated APIs for HTML and XHTML. Hickson, Ian, editor. W3C Working Draft. 25 May 2011.