<?xml version="1.0" encoding="UTF-8"?>
<!--
<?publication-root http://www.w3.org/2010/html-xml/snapshot/?>
<?latest-version http://www.w3.org/2010/html-xml/snapshot/report.html?>
-->
<specification xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xi="http://www.w3.org/2001/XInclude" class="note" version="5.0-extension w3c-xproc">
<info>
<title>HTML/XML Task Force Report</title>
<w3c-shortname>html-xml-tf-report</w3c-shortname>
<pubdate>2012-02-09</pubdate>
<bibliorelation type="isformatof" xlink:href="report.xml">XML</bibliorelation>
<!--
<bibliorelation type="isformatof" xlink:href="diff-2011-06-28.html">Diffs from 22 Mar draft</bibliorelation>
<bibliorelation type="replaces" xlink:href="http://www.w3.org/2010/html-xml/snapshot/report-2012-01-12.html"/>
-->

<authorgroup>
  <author>
    <personname>Norman Walsh</personname>
    <affiliation>
      <orgname>MarkLogic Corporation</orgname>
    </affiliation>
    <email>norman.walsh@marklogic.com</email>
  </author>
</authorgroup>

<abstract>
<para>This document is the report of the TAG Task Force
established to
explore how interoperability between HTML and XML could be
improved. It describes several use cases that the Task Force considered relevant
and proposed resolutions to those cases.</para>
</abstract>

<legalnotice role="status">

<para><emphasis>This section describes the status of this document at
the time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest revision
of this technical report can be found in the <link xlink:href="http://www.w3.org/TR/">W3C technical reports index</link>
at http://www.w3.org/TR/.</emphasis></para>

<para>This Note is a report from an
<link xlink:href="http://lists.w3.org/Archives/Public/public-html-xml/">XML/HTML
task force</link> formed
<link xlink:href="https://www.w3.org/2001/tag/group/track/actions/437">at the
request</link> of the W3C
<link xlink:href="http://www.w3.org/2001/tag/">Technical Architecture Group</link>.
Comments are welcome; send comments about the contents of this report
to the task force mailing list
<link xlink:href="mailto:public-html-xml@w3.org">public-html-xml@w3.org</link> (public
<link xlink:href="http://lists.w3.org/Archives/Public/public-html-xml/">archives</link> are available)
and discussion
of next steps to the TAG
<link xlink:href="mailto:www-tag@w3.org">www-tag@w3.org</link>
(<link xlink:href="http://lists.w3.org/Archives/Public/public-www-tag/">archives</link>).
</para>

<para>Publication as a Working Group Note does not imply endorsement
by the W3C Membership. This is a draft document and may be updated,
replaced or obsoleted by other documents at any time. It is
inappropriate to cite this document as other than work in
progress.</para>

<para>This document was produced by a group operating under the <link xlink:href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5
February 2004 W3C Patent Policy</link>. W3C maintains a <link xlink:href="http://www.w3.org/2001/tag/disclosures">public
list of any patent disclosures</link> made in connection with the
deliverables of the group; that page also includes instructions for
disclosing a patent. An individual who has actual knowledge of a
patent which the individual believes contains <link xlink:href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential
Claim(s)</link> must disclose the information in accordance with <link xlink:href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section
6 of the W3C Patent Policy</link>.</para>
</legalnotice>
</info>

<section xml:id="introduction">
<title>Introduction</title>

<para>HTML and XML share a common ancestor in SGML. The precise
details of that ancestry are not strictly important; the significant
consequence is that HTML and XML have a quite similar surface syntax.
Both use angle brackets and ampersands to distinguish “markup”
characters from “content” characters. Both have elements which contain
other content and elements which are empty.</para>

<para>This high level of surface similarity suggests, at least to some
and at least at first, that there should be a high level of
interoperability between HTML and XML systems. This notion is
amplified by the fact that when XML arrived on the scene, well after
HTML was widely deployed, efforts were made to recast HTML as an XML
application rather than an SGML application. HTML was never broadly
implemented as an “SGML application”, but it was defined as one in
the early HTML specifications.</para>

<para>However, if you look beyond those high-level generalities, the
languages are quite different and serve quite different purposes.
Where HTML is a single language, XML is a framework for defining
languages.
Where HTML defines how a tree is
constructed from any input, XML only defines tree construction for a
small subset of all possible inputs.
Where HTML defines explicit extension points within a
single vocabulary, XML encourages the use of multiple vocabularies
defined in a distributed fashion. Where HTML is in a small, explicit set
of namespaces, XML provides for an unbounded number of namespaces.</para>

<para>Against the backdrop of this tension, the TAG formed this Task
Force in order to explore how interoperability between HTML and XML
could be improved. The Task Force worked in
public; an archive of its deliberations
<link xlink:href="http://lists.w3.org/Archives/Public/public-html-xml/">is preserved</link>.
The Task Force began
by collecting use cases to focus its efforts. The original expectation
was that a set of use cases would highlight those areas where
additional work could aid in the interoperability between XML and
HTML. However, all of the use cases appear to have plausible
solutions today, and those solutions do not seem amenable to significant
improvement, so it seems that there is little that can be done beyond
documenting these circumstances.
</para>

<para>In the following section, we'll describe a set of use cases that
the Task Force considered, and how the needs of those use cases can be
met today.
Additional notes and other background material
for many of these use cases are available in
<link xlink:href="http://www.w3.org/wiki/HTML_XML_Use_Cases">the wiki</link>
that the Task Force used to organize its early notes.
Readers are particularly encouraged to report additional use
cases that they feel are not represented or specific examples where the
solutions outlined are not appropriate.</para>

<section xml:id="terminology">
<title>Terminology</title>

<para>A few notes about terminology:</para>

<para>In general, we refer to the family of documents that are
colloquially understood to be HTML (HTML, XHTML, HTML5) using the term
“HTML”. In those cases where we want to draw attention to XHTML or HTML5
specifically, we use the more specific terms.</para>

<para>There are a great many ways to
represent the “object model” of an HTML or XML document. There are
specifications for both abstract and concrete representations. As a
simplification, we use the term “DOM” (Document Object Model)
throughout as a general term for any of these possible
representations.</para>

<para>An “HTML parser” is one that consumes HTML markup and produces a DOM.
We use the term “HTML5 parser” in those cases where we wish to draw attention
explicitly to the parsing behaviors described by <biblioref linkend="HTML5"/>.
An “XML parser” is one that consumes well-formed XML and produces a DOM.</para>

</section>

</section>
<section xml:id="usecases">
<title>Use Cases</title>

<para>The task force set out to examine a number of
<link xlink:href="http://www.w3.org/wiki/HTML_XML_Use_Cases">use cases</link>
for a world in which XML and HTML are both important.
These are outlined in the following sections.</para>

<section xml:id="uc01">
<title>How can an XML toolchain be used to consume HTML?</title>

<section role="statement" xml:id="uc01p">
<info>
<title>Problem statement</title>
</info>

<para>A great many systems exist which process XML. These include, but
are not limited to, validation tools, a broad spectrum of editors, browsers,
query and transformation languages, and countless ad hoc tools. Many of these tools
could be applied equally to HTML content, if such content were accessible to them.
</para>

</section>

<section role="resolution" xml:id="uc01r">
<info>
<title>Resolution</title>
</info>

<para>The principal
impediment to using XML tools with HTML is that HTML is not guaranteed
(or even likely, in the context of the internet at large) to be well-formed. XML parsers
reject documents which are not well-formed, so the overwhelming majority of HTML documents
cannot be used by systems which only process XML.</para>

<para>The Task Force
found
two approaches to address this problem: use polyglot
markup or introduce an HTML parser into your processing toolchain.</para>

<para><link xlink:href="http://www.w3.org/TR/html-polyglot/">Polyglot
markup</link> refers to documents which have been carefully crafted
such that they are simultaneously XML and HTML compatible. It seems
that the world at large is unlikely to adopt polyglot markup as the
standard way to encode all HTML documents, so this solution has
limited applicability.</para>

<para>However, the vast majority of HTML documents
<emphasis>could</emphasis> be written using polyglot markup and doing
so would make them immediately available for processing by tools that
anticipate either XML or HTML markup. If you have control over the authoring
environments that are used to create content for your system, then it may be
entirely feasible to address the “consume HTML with XML tools” problem simply
by being more careful about the HTML that you produce.</para>

<para>Where it is applicable, polyglot markup constrains the <literal role="media-type">text/html</literal> content in such a way that when
parsed as XML, it produces the same parse tree (except for
certain minor,
specified differences) as it would produce if parsed using an implementation of
the HTML parsing algorithm.</para>
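<para>As an illustration, the following is a minimal sketch of a
document written in the polyglot style. The content is invented for
this example, and the sketch is not a complete statement of the
polyglot rules. The features to note are the XHTML namespace
declaration, the matching <literal>lang</literal> and
<literal>xml:lang</literal> attributes, explicit end tags, the
self-closing syntax used only for void elements, and an explicit
<tag>tbody</tag>.</para>

<programlisting><![CDATA[<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta charset="utf-8"/>
    <title>Polyglot example</title>
  </head>
  <body>
    <p>First line<br/>second line.</p>
    <table>
      <tbody>
        <tr><td>Fish &amp; chips</td></tr>
      </tbody>
    </table>
  </body>
</html>]]></programlisting>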

<para>Alternatively, rather than attempting to constrain the HTML
input so that it conforms to the polyglot constraints, an HTML parser
can be introduced to the front of the XML toolchain. Such a parser
reads the HTML markup “as she is writ” in the world at large and
produces a representation of that tree that an XML processor can use.
It
<link xlink:href="http://www.w3.org/TR/html5/the-end.html#coercing-an-html-dom-into-an-infoset">is
still possible</link> to encounter HTML documents whose document tree
needs to be modified slightly for the document tree to be representable
as XML. For conforming input, the modifications are on the level of
replacing form feeds with spaces.
</para>

<para>Like XML parsing, HTML parsing produces a tree. Exposing that tree
to the XML toolchain (as a sequence of events, such as SAX events, or an
in-memory tree model, or through any other appropriate implementation mechanism)
makes all of the XML power available to any HTML document.</para>
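<para>To make this concrete, consider the following small HTML
document, invented for this example. It omits several start and end
tags and so is not well-formed XML.</para>

<programlisting><![CDATA[<!DOCTYPE html>
<title>Notes</title>
<p>First paragraph
<p>Second paragraph<br>]]></programlisting>

<para>An HTML5 parser nevertheless builds a complete tree from it.
Written out in XML syntax (insignificant whitespace aside), the tree
that would be handed to the XML toolchain looks roughly like this:</para>

<programlisting><![CDATA[<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Notes</title></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph<br/></p>
  </body>
</html>]]></programlisting>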

</section>

</section>

<section xml:id="uc02">
<title>How can an HTML toolchain be used to consume XML?</title>

<section role="statement" xml:id="uc02p">
<info>
<title>Problem statement</title>
</info>

<para>HTML toolchains
will become
widespread and popular.
Users may encounter XML
documents and want to process them using familiar tools. This use case
is the logical reciprocal of the former use case; it's about allowing
developers of HTML-only tools to provide useful functionality to users
who have non-XHTML, non-SVG, non-MathML content, even though the tool
developer doesn't have a business need to address it
explicitly.</para>
<para>(If the XML in question is entirely XHTML or
XHTML with only SVG and MathML embedded, then the differences are likely
to be small and the HTML toolchain is likely to do the right thing; the focus of
this use case is on XML vocabularies that are not in
the HTML family.)</para>
</section>

<section role="resolution" xml:id="uc02r">
<info>
<title>Resolution</title>
</info>

<para>HTML5 doesn't have an extensibility story that admits the possibility
of content in arbitrary namespaces. XML markup from vocabularies totally
unlike HTML, SVG, or MathML will be parsed and interpreted according to the
HTML5 rules. These rules
are
very unlikely to produce the same DOM that an XML
parser would have produced.</para>

<para>For XML content that needs to be textually embedded in HTML5, the
most successful approach may be to simply translate the XML to HTML5
before passing it to the HTML5 tool.
A wide variety of XML tools exist to simplify
the technical challenge of transforming XML;
of
course, the semantic challenge of
translating an arbitrary XML vocabulary into HTML5 may be very
difficult. If a faithful translation isn't possible, even the
simple transformation that strips out processing instructions and
non-HTML namespaces may help.</para>
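<para>The simple stripping transformation might be sketched in XSLT
1.0 as follows. This is an illustration under stated assumptions, not
a recommended stylesheet: it assumes that elements outside the XHTML
namespace should be unwrapped (their children are kept) rather than
deleted with their content, it drops every namespaced attribute
including <literal>xml:lang</literal>, and it makes no provision for
SVG or MathML. If the root element of the input is not an XHTML
element, the result is a fragment rather than a complete document.</para>

<programlisting><![CDATA[<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Rebuild each XHTML element by local name so that no foreign
       namespace declarations are carried into the result. -->
  <xsl:template match="h:*">
    <xsl:element name="{local-name()}"
                 namespace="http://www.w3.org/1999/xhtml">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Assumption: elements in any other namespace are unwrapped,
       so their XHTML and text descendants survive. -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Keep attributes that are in no namespace; drop the rest. -->
  <xsl:template match="@*"/>
  <xsl:template match="@*[namespace-uri() = '']">
    <xsl:copy/>
  </xsl:template>

  <!-- Drop processing instructions; keep comments. -->
  <xsl:template match="processing-instruction()"/>
  <xsl:template match="comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>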

<para>Processing a real XML document with an HTML5 parser is probably never
going to be possible with complete fidelity.</para>

<para>In an environment where the HTML toolchain
includes access to an XML parser and the HTML and XML resources can be
managed separately, the most successful approach is likely to involve parsing
the XML with an XML parser and the HTML with an HTML parser.</para>

</section>
</section>

<section xml:id="uc03">
<title>How can islands of HTML be embedded in XML?</title>

<section role="statement" xml:id="uc03p">
<info>
<title>Problem statement</title>
</info>

<para>In XML vocabularies that are <emphasis>not</emphasis>
intrinsically about representing prose, it's often useful to provide
elements into which documentation or “prose annotations” can be
provided. One common design pattern in these cases is to establish
HTML as one of the common vocabularies marking up documentation in
those elements.</para>

<para>This pattern establishes the practice of embedding islands of
HTML in XML documents that are not otherwise anything like HTML or
intended to be processed directly by HTML tools.</para>

<para>The question naturally arises, how can HTML5 be embedded in an
XML document?</para>
</section>

<section role="resolution" xml:id="uc03r">
<info>
<title>Resolution</title>
</info>

<para>Broadly speaking, there are two techniques for addressing the question
of how HTML is to be embedded in XML.</para>

<orderedlist>
<listitem>
<para>Make sure that the HTML markup is well-formed XML. This is typically
done by explicitly or implicitly asserting that the content is XHTML. This
makes the HTML a natural part of the XML document at the expense of imposing
XML markup requirements on the author.
</para>
</listitem>
<listitem>
<para>Within the container element, escape all characters that might
be interpreted as markup. This absolves the author of the
responsibility to construct well-formed XML, at the expense of
requiring tools to escape and unescape the markup and support
non-well-formed markup “downstream”.
</para>
</listitem>
</orderedlist>

<para>Both of these techniques can be applied to HTML5 markup. In the
former case, use the XML serialization of HTML5. In the latter case,
escape the HTML5 markup.</para>
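<para>The two techniques might look as follows. The
<tag>option</tag> and <tag>documentation</tag> elements are
hypothetical, standing in for any vocabulary that is not itself about
prose.</para>

<programlisting><![CDATA[<!-- Technique 1: the island is well-formed XHTML in its own namespace -->
<option name="timeout">
  <documentation>
    <p xmlns="http://www.w3.org/1999/xhtml">Seconds to wait.<br/>Defaults
    to <code>30</code>.</p>
  </documentation>
</option>

<!-- Technique 2: the island is escaped HTML source text -->
<option name="timeout">
  <documentation>&lt;p&gt;Seconds to wait.&lt;br&gt;Defaults
    to &lt;code&gt;30&lt;/code&gt;.</documentation>
</option>]]></programlisting>

<para>In the first form an XML processor sees <tag>p</tag>,
<tag>br</tag>, and <tag>code</tag> elements. In the second it sees
only a text node, which a downstream tool must unescape and hand to an
HTML parser.</para>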

<para>If the HTML subsystem has an interface that
allows document trees to be passed to it, the XHTML subtree should be
extracted from the larger XML tree and passed to the HTML subsystem.
If the HTML subsystem only accepts HTML source text as its input, the
XHTML subtree needs to be serialized as HTML and passed to the HTML
subsystem for parsing using an HTML parser. In the latter case, some
non-conforming constructs may not round-trip to the same tree shape
when serialized as HTML and reparsed as HTML. Also, conforming trees
that have <tag>tr</tag> elements as children of <tag>table</tag> elements will be replaced
with semantically equivalent but tree-wise different constructs where
the <tag>tr</tag> elements gain a <tag>tbody</tag> parent which is a child of the
<tag>table</tag>.</para>
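<para>The <tag>tbody</tag> case can be sketched as follows, using an
invented one-cell table. Both trees are shown in XML syntax, and
whitespace differences are ignored.</para>

<programlisting><![CDATA[<!-- XHTML subtree as held in the XML tree -->
<table xmlns="http://www.w3.org/1999/xhtml">
  <tr><td>A</td></tr>
</table>

<!-- Tree after serializing as HTML and reparsing with an HTML parser -->
<table xmlns="http://www.w3.org/1999/xhtml">
  <tbody>
    <tr><td>A</td></tr>
  </tbody>
</table>]]></programlisting>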

</section>

</section>

<section xml:id="uc04">
<title>How can islands of XML be embedded in HTML?</title>

<section role="statement" xml:id="uc04p">
<info>
<title>Problem statement</title>
</info>
<para>In principle, the same powerful scripting and styling facilities
that allow users to create rich internet applications with HTML5 can
operate on XML documents. Users may attempt to engage in a “progressive
enhancement” strategy for building such applications by adding islands
of more richly structured XML markup to existing HTML5 documents.</para>

<para>The user's expectation is that these XML islands will appear in
the DOM where they can be addressed with JavaScript and formatted with
CSS.</para>
</section>

<section role="resolution" xml:id="uc04r">
<info>
<title>Resolution</title>
</info>

<para>In fact, this is not the case
for content served as <code>text/html</code>. When
an
HTML5 parser encounters
unfamiliar markup, it assumes that such markup is an erroneous
attempt to generate
well-defined
HTML5. Consequently, it applies error
correction strategies which result in a DOM representation that can differ
radically from the DOM that an XML parser would have produced.
In particular, open elements may end prematurely and additional elements
may be opened.</para>

<para>The practical result is that a “naked” XML island in an HTML5 document
will not reliably produce anything that resembles the DOM one would expect
from casual inspection of the XML island.</para>
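<para>A small invented example shows the effect. In XML, the
following island is an <tag>inventory</tag> element with two empty
<tag>item</tag> children.</para>

<programlisting><![CDATA[<inventory>
  <item sku="a1"/>
  <item sku="b2"/>
</inventory>]]></programlisting>

<para>Placed directly in the body of a <code>text/html</code>
document, the same characters are read differently. The HTML5 parsing
algorithm ignores the self-closing syntax on elements it does not
treat as void, so the first <tag>item</tag> stays open and the second
becomes its child. The parser also places all three elements in the
HTML namespace. Written in XML syntax, the resulting tree is roughly:</para>

<programlisting><![CDATA[<inventory xmlns="http://www.w3.org/1999/xhtml">
  <item sku="a1">
    <item sku="b2"/>
  </item>
</inventory>]]></programlisting>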

<para>In order to conceal the XML markup from
an
HTML5 parser's attempts
to correct errors, the XML must be stored within a <tag>script</tag> element.
The script can identify the content as XML by specifying
the content type “<literal>application/xml</literal>”
or any other applicable media type.
</para>

<para>What
an
HTML5 parser produces when it processes this script element is
a <literal>script</literal> element node in the DOM which contains the
literal
character representation of the XML. That representation
can be extracted by JavaScript when the page is loaded, parsed into an actual
XML DOM, and processed by the application.</para>
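<para>A minimal sketch of the technique follows. The
<literal>inventory-data</literal> identifier and the island's
vocabulary are invented for this example; <code>DOMParser</code> is
the browser interface assumed here for parsing the extracted text.
Because the first <tag>script</tag> element carries a media type that
is not a scripting language, the browser treats its content as data
and does not execute it. The island itself must not contain the
character sequence that closes a <tag>script</tag> element.</para>

<programlisting><![CDATA[<!DOCTYPE html>
<html>
  <head>
    <title>XML island</title>
  </head>
  <body>
    <script type="application/xml" id="inventory-data">
      <inventory>
        <item sku="a1"/>
        <item sku="b2"/>
      </inventory>
    </script>
    <script>
      // Read the island's source text and parse it as XML.
      var source = document.getElementById("inventory-data").textContent;
      var island = new DOMParser().parseFromString(source, "application/xml");
      // island.documentElement is now the XML inventory element,
      // with two empty item children.
    </script>
  </body>
</html>]]></programlisting>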

<para>This technique allows arbitrary XML islands to be embedded in HTML5,
but such islands are only accessible to processors that are able and willing
to execute the necessary JavaScript shim.</para>

<para>Note: XHTML
content served as <code>application/xhtml+xml</code> is, in fact, XML and so
embedded islands of richly structured XML markup are preserved.
Serving XML content to user agents carries its own set
of problems, however. Note also that polyglot markup is not an aid here
as it forbids arbitrary XML content in the document.</para>
</section>

</section>

<section xml:id="uc05">
<title>How can XML be made more forgiving of errors?</title>

<section role="statement" xml:id="uc05p">
<info>
<title>Problem statement</title>
</info>

<para>Some significant portion of HTML5 is generated by server-side tools
that do little more than string-concatenation. Markup generated by naive
string concatenation often results in minor markup errors.
An
HTML5 parser
consistently (and often correctly) corrects for these mistakes and constructs
a useful DOM from not-quite-perfectly constructed input.</para>

<para>An
XML parser is utterly unforgiving in the face of even small
markup errors. As a result, XML constructed using otherwise
straightforward techniques in many programming languages is sometimes
not well-formed unless great care is taken.</para>
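<para>For instance, the following fragment, invented for this
example, is typical of concatenated output. It has an unquoted
attribute value, an unescaped ampersand, and missing end tags for
<tag>b</tag> and <tag>p</tag>. An HTML5 parser still builds a tree
from it, whereas each of these three features on its own is a fatal
well-formedness error for an XML parser.</para>

<programlisting><![CDATA[<div class=note>
  <p>Shipping & handling: <b>free
</div>]]></programlisting>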
</section>

<section role="resolution" xml:id="uc05r">
<info>
<title>Resolution</title>
</info>

<para>This aspect of XML parsing could be addressed by a more lenient parser
(such as XML5).
Working out all of the details to assure that the necessary error correction
produces expected results in all cases might be tedious, but some efforts have
already been undertaken to examine the issue.
However, it's entirely unclear that the XML community
would be motivated to adopt such changes and, in any event, making
such proposals is outside the scope of this Task Force.
</para>

</section>
</section>
</section>
<section xml:id="conclusions">
<title>Conclusions</title>

<para>The Task Force considered several areas of
interoperability that arose in these use cases: consuming HTML with XML
tools, consuming XML with HTML tools, and embedding islands of one
within the other. As described above, there are well-understood
boundaries within which any solution to each use case can operate, and
within those boundaries a solution exists today for each use case, though
perhaps not a wholly satisfactory one. No wholly satisfying solution appears
possible within the accepted constraints; it would appear that the practical
solutions have already been achieved.</para>

<para>With respect to the question of making XML
more forgiving of errors, it's clear that some work has been done in
this area and that it is possible to articulate coherent proposals for
such change. We recommend
further study within the XML community before determining how best to explore
these changes.</para>

<para>On the question of Polyglot markup, there seems to be little consensus.
One line of argument suggests that, to the extent that it is practical to obey
the <link xlink:href="https://secure.wikimedia.org/wikipedia/en/wiki/Robustness_Principle">Robustness
principle</link>, it makes sense to do so. That is, if you're generating HTML markup
for the web, and you can generate Polyglot markup that is also directly consumable as
XML, you should do so. Another line of argument suggests that even under the most
optimistic of projections, so tiny a fraction of the web will ever be written in Polyglot
that there's no practical benefit to pursuing it
as a general strategy for consuming documents from the
web. If you want to consume HTML content, use
an HTML parser that produces an XML-compatible DOM or event stream.</para>
</section>

<appendix xml:id="references">
<title>References</title>

<bibliolist>
<bibliomixed xml:id="XML5"><abbrev>XML5</abbrev>
<citetitle xlink:href="http://annevankesteren.nl/2007/10/xml5">XML5</citetitle>.
van Kesteren, Anne. Weblog posting. 23 October 2007.</bibliomixed>

<bibliomixed xml:id="Polyglot"><abbrev>Polyglot</abbrev>
<citetitle xlink:href="http://www.w3.org/TR/html-polyglot/">Polyglot Markup: HTML-Compatible
XHTML Documents</citetitle>.
Graff, Eliot, editor. W3C Working Draft. 25 May 2011.</bibliomixed>

<bibliomixed xml:id="HTML5"><abbrev>HTML5</abbrev>
<citetitle xlink:href="http://www.w3.org/TR/html5/">HTML5: A vocabulary and associated APIs for
HTML and XHTML</citetitle>.
Hickson, Ian, editor. W3C Working Draft. 25 May 2011.</bibliomixed>

<bibliomixed xml:id="XHTML"><abbrev>XHTML</abbrev>
<citetitle xlink:href="http://www.w3.org/TR/xhtml1/">XHTML™ 1.0 The Extensible
HyperText Markup Language (Second Edition)</citetitle>.
W3C Recommendation. 26 January 2000.</bibliomixed>
</bibliolist>
</appendix>

<appendix xml:id="contributors">
<title>Contributors</title>

<para>The Task Force is indebted to the contributors on the public
mailing list, the wiki, and those individuals who participated in
teleconferences and meetings. In particular, the following individuals
devoted time and energy to the construction of this document:</para>

<itemizedlist>
<listitem>
<para>Robin Berjon</para>
</listitem>
<listitem>
<para>David Carlisle</para>
</listitem>
<listitem>
<para>Michael Champion</para>
</listitem>
<listitem>
<para>John Cowan</para>
</listitem>
<listitem>
<para>Anne van Kesteren</para>
</listitem>
<listitem>
<para>Noah Mendelsohn</para>
</listitem>
<listitem>
<para>Henri Sivonen</para>
</listitem>
<listitem>
<para>Norman Walsh</para>
</listitem>
</itemizedlist>
</appendix>

</specification>
