HTML/XML Task Force Report

Introduction

HTML and XML share a common ancestor in SGML. The precise details of that ancestry are not strictly important; its significant consequence is that HTML and XML have a quite similar surface syntax. Both use angle brackets and ampersands to distinguish “markup” characters from “content” characters. Both have elements which contain other content and elements which are empty.

This high level of surface similarity suggests, at least to some and at least at first, that there should be a high level of interoperability between HTML and XML systems. This notion is amplified by the fact that when XML arrived on the scene, well after HTML was widely deployed, efforts were made to recast HTML as an XML application rather than an SGML application. HTML was never broadly implemented as an “SGML application”, but it was defined as one in the early HTML specifications.

However, if you look beyond those high-level generalities, the languages are quite different and serve quite different purposes. Where HTML is a single language, XML is a framework for defining languages. Where HTML defines how a tree is constructed from any input, XML only defines tree construction for a small subset of all possible inputs. Where HTML defines explicit extension points within a single vocabulary, XML encourages the use of multiple vocabularies defined in a distributed fashion. Where HTML is in a small, explicit set of namespaces, XML provides for an unbounded number of namespaces.

Against the backdrop of this tension, the TAG formed this Task Force in order to explore how interoperability between HTML and XML could be improved. The Task Force worked in public; an archive of its deliberations is preserved.

The Task Force began by collecting use cases to focus its efforts. The original expectation was that a set of use cases would highlight those areas where additional work could aid in the interoperability between XML and HTML. However, all of the use cases appear to have plausible solutions today, and those solutions do not appear amenable to significant improvement, so it appears that there is little that can be done beyond documenting these circumstances.

In the following section, we describe a set of use cases that the Task Force considered, and how the needs of those use cases can be met today. Additional notes and other background material for many of these use cases are available in the wiki that the Task Force used to organize its early notes.

Readers are particularly encouraged to report additional use cases that they feel are not represented or specific examples where the solutions outlined are not appropriate.

Terminology

A few notes about terminology:

In general, we refer to the family of documents that are colloquially understood to be HTML (HTML, XHTML, HTML5) using the term “HTML”. In those cases where we want to draw attention to XHTML or HTML5 specifically, we use the more specific terms.

There are a great many ways to represent the “object model” of an HTML or XML document. There are specifications for both abstract and concrete representations. As a simplification, we use the term “DOM” (Document Object Model) throughout as a general term for any of these possible representations.

An “HTML parser” is one that consumes HTML markup and produces a DOM. We use the term “HTML5 parser” in those cases where we wish to draw attention explicitly to the parsing behaviors described by the HTML5 specification.

An “XML parser” is one that consumes well-formed XML and produces a DOM.

Use Cases

The Task Force set out to examine a number of use cases for a world in which XML and HTML are both important. These are outlined in the following sections.

How can an XML toolchain be used to consume HTML?

Problem statement

A great many systems exist which process XML. These include, but are not limited to, validation tools, a broad spectrum of editors, browsers, query and transformation languages, and countless ad hoc tools. Many of these tools could be applied equally to HTML content, if such content were accessible to them.

Resolution

The principal impediment to using XML tools with HTML is that HTML is not guaranteed (or even likely, in the context of the internet at large) to be well-formed. XML parsers reject documents which are not well-formed, so the overwhelming majority of HTML documents cannot be used by systems which only process XML.

The Task Force found two approaches to address this problem: use polyglot markup or introduce an HTML parser into your processing toolchain.

Polyglot markup refers to documents which have been carefully crafted such that they are simultaneously XML and HTML compatible. It seems that the world at large is unlikely to adopt polyglot markup as the standard way to encode all HTML documents, so this solution has limited applicability. However, the vast majority of HTML documents could be written using polyglot markup and doing so would make them immediately available for processing by tools that anticipate either XML or HTML markup. If you have control over the authoring environments that are used to create content for your system, then it may be entirely feasible to address the “consume HTML with XML tools” problem simply by being more careful about the HTML that you produce. Where it is applicable, polyglot markup constrains the text/html content in such a way that when parsed as XML, it produces the same parse tree (except for certain minor, specified differences) as it would produce if parsed using an implementation of the HTML parsing algorithm.

Alternatively, rather than attempting to constrain the HTML input so that it conforms to the polyglot constraints, an HTML parser can be introduced to the front of the XML toolchain. Such a parser reads the HTML markup “as she is writ” in the world at large and produces a representation of that tree that an XML processor can use. It is still possible to encounter HTML documents whose document tree needs to be modified slightly for the document tree to be representable as XML. For conforming input, the modifications are on the level of replacing form feeds with spaces.

Like XML parsing, HTML parsing produces a tree. Exposing that tree to the XML toolchain (as a sequence of events, such as SAX events, or an in-memory tree model, or through any other appropriate implementation mechanism) makes all of the XML power available to any HTML document.
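
As an illustration of placing an HTML parser in front of an XML toolchain, the following Python sketch feeds the events from the standard library's html.parser tokenizer into an xml.etree.ElementTree tree. It rests on several assumptions that are not part of this report: html.parser is not an HTML5 parser and performs none of the HTML5 error correction, the class name HtmlToElementTree and the function html_to_tree are hypothetical, and the input is assumed to have a single root element. A production toolchain would more plausibly use a parser that implements the HTML parsing algorithm.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have content or an end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}


class HtmlToElementTree(HTMLParser):
    """Turn HTML tokenizer events into an ElementTree (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # names of the currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; XML attributes need a value.
        attributes = {name: (value if value is not None else "")
                      for name, value in attrs}
        self._builder.start(tag, attributes)
        if tag in VOID_ELEMENTS:
            self._builder.end(tag)  # void elements close immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray end tag: ignore it
        # Close everything down to and including the matching element.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self._open:  # text outside the root element is dropped
            self._builder.data(data)

    def tree(self):
        # Close any elements the author left open, then return the root.
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()


def html_to_tree(markup):
    parser = HtmlToElementTree()
    parser.feed(markup)
    parser.close()
    return parser.tree()


# The markup below is not well-formed XML: unclosed br, img and html
# elements, and unquoted attribute values.
root = html_to_tree(
    "<html><body><p>Fish &amp; chips<br>line two</p>"
    "<img src=a.png alt=x></body>"
)
# The resulting tree can now be queried or serialized by XML tools.
print([img.get("src") for img in root.iter("img")])
print(ET.tostring(root, encoding="unicode"))
```

The tree is exposed here as an in-memory model; the same events could instead be forwarded to a SAX-style handler.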

How can an HTML toolchain be used to consume XML?

Problem statement

HTML toolchains will become widespread and popular. Users may encounter XML documents and want to process them using familiar tools.

This use case is the logical reciprocal of the former use case; it's about allowing developers of HTML-only tools to provide useful functionality to users who have non-XHTML, non-SVG, non-MathML content, even though the tool developer doesn't have a business need to address it explicitly. (If the XML in question is entirely XHTML or XHTML with only SVG and MathML embedded, then the differences are likely to be small and the HTML toolchain is likely to do the right thing; the focus of this use case is on XML vocabularies that are not in the HTML family.)

Resolution

HTML5 doesn't have an extensibility story that admits the possibility of content in arbitrary namespaces. XML markup from vocabularies totally unlike HTML, SVG, or MathML will be parsed and interpreted according to the HTML5 rules. These rules are very unlikely to produce the same DOM that an XML parser would have produced.

For XML content that needs to be textually embedded in HTML5, the most successful approach may be to simply translate the XML to HTML5 before passing it to the HTML5 tool. A wide variety of XML tools exist to simplify the technical challenge of transforming XML; of course, the semantic challenge of translating an arbitrary XML vocabulary into HTML5 may be very difficult. If a faithful translation isn't possible, even the simple transformation that strips out processing instructions and non-HTML namespaces may help.

Processing a real XML document with an HTML5 parser is probably never going to be possible with complete fidelity. In an environment where the HTML toolchain includes access to an XML parser and the HTML and XML resources can be managed separately, the most successful approach is likely to involve parsing the XML with an XML parser and the HTML with an HTML parser.
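
The simple transformation mentioned above, stripping processing instructions and non-HTML namespaces, might be sketched in Python as follows. The function name xml_to_html_source and the urn:example:inventory vocabulary are hypothetical. The sketch assumes the root element is in the XHTML namespace, drops foreign elements together with their content, and leaves attributes untouched; it relies on ElementTree's default parser discarding processing instructions and comments.

```python
import xml.etree.ElementTree as ET

XHTML_PREFIX = "{http://www.w3.org/1999/xhtml}"


def _prune(parent):
    """Remove children outside the XHTML namespace, keeping trailing text."""
    previous = None
    for child in list(parent):
        if child.tag.startswith(XHTML_PREFIX):
            _prune(child)
            previous = child
            continue
        # Foreign element: drop it and its subtree, but keep the text
        # that followed it by attaching that text to its neighbour.
        tail = child.tail or ""
        if previous is None:
            parent.text = (parent.text or "") + tail
        else:
            previous.tail = (previous.tail or "") + tail
        parent.remove(child)


def xml_to_html_source(xml_text):
    """Return HTML source containing only the XHTML-namespace elements."""
    # The default parser already discards processing instructions and comments.
    root = ET.fromstring(xml_text)
    if not root.tag.startswith(XHTML_PREFIX):
        raise ValueError("root element is not in the XHTML namespace")
    _prune(root)
    # Drop the namespace so the serializer emits plain HTML element names.
    for elem in root.iter():
        elem.tag = elem.tag[len(XHTML_PREFIX):]
    return ET.tostring(root, encoding="unicode", method="html")


source = (
    '<?xml-stylesheet href="s.css"?>'
    '<p xmlns="http://www.w3.org/1999/xhtml" xmlns:inv="urn:example:inventory">'
    'Stock: <inv:item sku="42">widget</inv:item> is <em>available</em>.</p>'
)
print(xml_to_html_source(source))
```

Whether discarding the foreign content is acceptable depends entirely on the vocabulary; this is the lossy fallback, not a faithful translation.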

How can islands of HTML be embedded in XML?

Problem statement

In XML vocabularies that are not intrinsically about representing prose, it's often useful to provide elements into which documentation or “prose annotations” can be provided. One common design pattern in these cases is to establish HTML as one of the common vocabularies marking up documentation in those elements.

This pattern establishes the practice of embedding islands of HTML in XML documents that are not otherwise anything like HTML or intended to be processed directly by HTML tools. The question naturally arises, how can HTML5 be embedded in an XML document?

Resolution

Broadly speaking, there are two techniques for addressing the question of how HTML is to be embedded in XML.

The first is to make sure that the HTML markup is well-formed XML. This is typically done by explicitly or implicitly asserting that the content is XHTML. This makes the HTML a natural part of the XML document at the expense of imposing XML markup requirements on the author.

The second is to escape, within the container element, all characters that might be interpreted as markup. This absolves the author of the responsibility to construct well-formed XML, at the expense of requiring tools to escape and unescape the markup and support non-well-formed markup “downstream”.

Both of these techniques can be applied to HTML5 markup. In the former case, use the XML serialization of HTML5. In the latter case, escape the HTML5 markup.

If the HTML subsystem has an interface that allows document trees to be passed to it, the XHTML subtree should be extracted from the larger XML tree and passed to the HTML subsystem. If the HTML subsystem only accepts HTML source text as its input, the XHTML subtree needs to be serialized as HTML and passed to the HTML subsystem for parsing using an HTML parser. When the subtree is serialized and reparsed in this way, some non-conforming constructs may not round-trip to the same tree shape. Also, conforming trees that have tr elements as children of table elements will be replaced with semantically equivalent but tree-wise different constructs where the tr elements gain a tbody parent which is a child of the table.
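
The two embedding techniques can be contrasted with a small Python sketch. The container vocabulary (field and doc elements) and the function names are hypothetical and chosen only for illustration. In the first technique the documentation is parsed as XML and becomes part of the tree; in the second it is stored as text, and the XML serializer escapes the markup characters.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def embed_as_xhtml(container, xhtml_fragment):
    """Technique 1: well-formed XHTML becomes a subtree of the container."""
    # ET.fromstring raises ParseError if the fragment is not well-formed,
    # which is the requirement this technique imposes on the author.
    container.append(ET.fromstring(xhtml_fragment))


def embed_escaped(container, html_source):
    """Technique 2: HTML source is stored as text and escaped on output."""
    container.text = html_source


# Technique 1: the prose is addressable as elements in the XML tree.
field_a = ET.Element("field", {"name": "price"})
doc_a = ET.SubElement(field_a, "doc")
embed_as_xhtml(doc_a, '<p xmlns="%s">Price in <code>EUR</code>.</p>' % XHTML_NS)
print(ET.tostring(field_a, encoding="unicode"))

# Technique 2: the prose need not be well-formed, but it is opaque text
# that a downstream HTML parser has to unescape and parse.
field_b = ET.Element("field", {"name": "price"})
doc_b = ET.SubElement(field_b, "doc")
embed_escaped(doc_b, "<p>Price in <b>EUR<br>")
serialized = ET.tostring(field_b, encoding="unicode")
print(serialized)
print(ET.fromstring(serialized).find("doc").text)
```

With the first technique the XHTML subtree can be handed to an HTML subsystem as a tree; with the second, only the recovered source text is available.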

How can islands of XML be embedded in HTML?

Problem statement

In principle, the same powerful scripting and styling facilities that allow users to create rich internet applications with HTML5 can operate on XML documents. Users may attempt to engage in a “progressive enhancement” strategy for building such applications by adding islands of more richly structured XML markup to existing HTML5 documents. The user's expectation is that these XML islands will appear in the DOM where they can be addressed with JavaScript and formatted with CSS.

Resolution

In fact, this expectation is not met for content served as text/html. When an HTML5 parser encounters unfamiliar markup, it assumes that such markup is an erroneous attempt to generate well-defined HTML5. Consequently, it applies error correction strategies which result in a DOM representation that can differ radically from the DOM that an XML parser would have produced. In particular, open elements may end prematurely and additional elements may be opened. The practical result is that a “naked” XML island in an HTML5 document will not reliably produce anything that resembles the DOM one would expect from casual inspection of the XML island.

In order to conceal the XML markup from an HTML5 parser's attempts to correct errors, the XML must be stored within a script element. The script can identify the content as XML by specifying the content type “application/xml” or any other applicable media type. What an HTML5 parser produces when it processes this script element is a script element node in the DOM which contains the literal character representation of the XML. That representation can be extracted by JavaScript when the page is loaded, parsed into an actual XML DOM, and processed by the application.

This technique allows arbitrary XML islands to be embedded in HTML5, but such islands are only accessible to processors that are able and willing to execute the necessary JavaScript shim.

Note: XHTML content served as application/xhtml+xml is, in fact, XML and so embedded islands of richly structured XML markup are preserved. Serving XML content to user agents carries its own set of problems, however.

Note also that polyglot markup is not an aid here as it forbids arbitrary XML content in the document.
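
The report describes performing the extraction with JavaScript in the browser. The same idea can be sketched for a non-browser processor in Python, which is an assumption of this example rather than something the Task Force proposed. The class name XmlIslandExtractor and the urn:example:inventory vocabulary are hypothetical. The standard library's html.parser is not an HTML5 parser, but like one it treats the content of a script element as raw text, which is the property the technique depends on.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class XmlIslandExtractor(HTMLParser):
    """Collect XML stored in <script type="application/xml"> elements."""

    def __init__(self, media_type="application/xml"):
        super().__init__()
        self.media_type = media_type
        self.islands = []      # parsed XML root elements, in document order
        self._buffer = None    # text chunks of the script being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == self.media_type:
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            xml_text = "".join(self._buffer).strip()
            # The island is parsed by a real XML parser, not by HTML rules.
            self.islands.append(ET.fromstring(xml_text))
            self._buffer = None


page = """<!DOCTYPE html>
<html><body>
<h1>Inventory</h1>
<script type="application/xml" id="inventory">
<inventory xmlns="urn:example:inventory"><item sku="42">widget</item></inventory>
</script>
</body></html>"""

extractor = XmlIslandExtractor()
extractor.feed(page)
extractor.close()
for island in extractor.islands:
    print(island.tag, [item.get("sku") for item in island])
```

As with the JavaScript shim, the island is invisible to any consumer that does not perform this extra step, and the XML itself must not contain the character sequence that ends a script element.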

How can XML be made more forgiving of errors?

Problem statement

Some significant portion of HTML5 is generated by server-side tools that do little more than string concatenation. Markup generated by naive string concatenation often contains minor markup errors. An HTML5 parser consistently (and often correctly) corrects for these mistakes and constructs a useful DOM from not-quite-perfectly constructed input.

An XML parser is utterly unforgiving in the face of even small markup errors. As a result, XML constructed using otherwise straightforward techniques in many programming languages is sometimes not well-formed unless great care is taken.
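
A minimal Python sketch of the problem, with hypothetical function names: a single unescaped ampersand in concatenated markup is enough for an XML parser to reject the whole document, whereas escaping the inserted text (or building the tree through an API) avoids the error.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def concat_naive(title):
    # Inserts the text verbatim: any "&" or "<" in it becomes markup.
    return "<title>" + title + "</title>"


def concat_escaped(title):
    # Escapes "&", "<" and ">" before inserting the text.
    return "<title>" + escape(title) + "</title>"


def is_well_formed(xml_text):
    """Report whether an XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(concat_naive("Fish & Chips")))
print(is_well_formed(concat_escaped("Fish & Chips")))
```

The "great care" referred to above amounts to applying the second form everywhere, without exception.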

Resolution

This aspect of XML parsing could be addressed by a more lenient parser (such as XML5). Working out all of the details to assure that the necessary error correction produces expected results in all cases might be tedious, but some efforts have already been undertaken to examine the issue. However, it is unclear whether the XML community would be motivated to adopt such changes and, in any event, making such proposals is outside the scope of this Task Force.