HTML XML Use Case 03

From W3C Wiki

Islands of HTML in XML

Problem statement

The user wishes to use HTML as part of some XML vocabulary. Quite likely, the HTML is being used as a means of conveying rich text and/or a renderable compound document fragment.

There are at least two variants of this use case that are potentially of interest:

  • The user is prepared to provide a well-formed XHTML-style fragment for each such bit of HTML. This could be because the HTML is already well-formed, or because the user is prepared to make it well-formed (perhaps by running it through an HTML parser and reserializing the resulting DOM).
  • The user needs, for whatever reason, to embed HTML that may not be well-formed. This might be because, for example, the HTML is generated by a tool over which the user has no control.

Either way, we need to consider several aspects of the problem, including: is there an agreed serialization for such embedding, and what processing model is the user likely to use, e.g., to render the HTML? We also need to consider what degree of scripting support is required (e.g., what does it mean to do "onload" script processing in a context like this?).

Solutions for serialization

Use XHTML subtree serializations

Users who are willing to provide well-formed XML can include it directly as a subtree of the container document. There is still a question as to what the processing model is: do we presume the HTML will be clipped out and handed to an HTML processor, or do we presume that the entire XML document will be parsed into an XML DOM? In the latter case, namespace prefix bindings declared in the container would apply to the HTML.
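The point about prefix bindings can be illustrated with a minimal Python sketch using the standard library's xml.etree.ElementTree. The memo vocabulary, its namespace URI, and the h: prefix are hypothetical; only the XHTML namespace URI is real. The prefix is declared on the container's root element, yet it is what places the island's elements in the XHTML namespace once the whole document is parsed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical container vocabulary; the h: prefix is bound on the root,
# not on the island itself.
DOC = """\
<memo xmlns="http://example.org/memo" xmlns:h="http://www.w3.org/1999/xhtml">
  <subject>Status</subject>
  <body>
    <h:div><h:p>All <h:em>good</h:em>.</h:p></h:div>
  </body>
</memo>
"""

root = ET.fromstring(DOC)

# After parsing, the island's elements carry the XHTML namespace because
# the container's binding for h: was in scope.
island = root.find(f".//{{{XHTML_NS}}}div")
em = island.find(f".//{{{XHTML_NS}}}em")
print(island.tag)
print(em.text)
```

A consequence worth noting: if the island were cut out as raw text, without the container's declarations, the h: prefix would be unbound and the fragment would no longer be namespace-well-formed on its own.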

Use escaped text/html-style serializations

HTML5 that is not well-formed can in principle be conveyed as escaped text or in a CDATA section.
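As a rough illustration, the following Python sketch carries a deliberately malformed fragment as escaped text and, alternatively, in a CDATA section, then recovers it unchanged. The item and content element names and the type attribute are hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Tag soup: unquoted attribute, bare ampersand, unclosed <p> and <b>.
html_fragment = "<p class=note>Fish & chips<br>served <b>hot"

# Hypothetical container elements.
container = ET.Element("item")
content = ET.SubElement(container, "content", {"type": "html"})
content.text = html_fragment  # ElementTree escapes <, > and & when serializing

serialized = ET.tostring(container, encoding="unicode")
print(serialized)

# An XML parser hands the original string back as character data.
recovered = ET.fromstring(serialized).find("content").text

# The CDATA form avoids escaping, but the fragment must not contain "]]>".
cdata_doc = (
    "<item><content type='html'><![CDATA["
    + html_fragment
    + "]]></content></item>"
)
recovered_cdata = ET.fromstring(cdata_doc).find("content").text
```

In both forms the XML processor sees the HTML only as an opaque string, so the receiver still has to hand it to an HTML parser before it can be rendered.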

Use MTOM

In environments like SOAP, MTOM or XOP could be used to carry the HTML as a separate part of what would be, under the covers, a multipart/related MIME package.
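The general shape of such a package can be sketched with Python's email library. This is a loose illustration, not a conformant MTOM message: real MTOM/XOP packaging places further requirements on the root part's media type and parameters, which this sketch does not attempt to meet. The doc vocabulary and the Content-ID values are hypothetical; the xop:Include element and its namespace come from the XOP specification.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

html_fragment = "<p class=note>Fish & chips<br>served <b>hot"

# Root XML part: refers to the HTML part by Content-ID instead of
# embedding it. The <doc> vocabulary is hypothetical.
root_xml = (
    '<doc xmlns="http://example.org/doc"><content>'
    '<xop:Include xmlns:xop="http://www.w3.org/2004/08/xop/include" '
    'href="cid:island1@example.org"/>'
    "</content></doc>"
)

package = MIMEMultipart("related")

# Simplification: sent as text/xml rather than the media type MTOM requires.
root_part = MIMEText(root_xml, "xml", "utf-8")
root_part["Content-ID"] = "<root@example.org>"
package.attach(root_part)

# The HTML travels verbatim as its own text/html part; no XML escaping needed.
html_part = MIMEText(html_fragment, "html", "utf-8")
html_part["Content-ID"] = "<island1@example.org>"
package.attach(html_part)

wire = package.as_string()

# Receiver side: index the parts by Content-ID and resolve the cid: reference.
parsed = message_from_string(wire)
parts = {part["Content-ID"]: part for part in parsed.get_payload()}
island = parts["<island1@example.org>"]
recovered = island.get_payload(decode=True).decode("utf-8")
```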

Solutions for processing

Process the XML container and HTML together

When XHTML-style markup is contained directly, there is the option to parse the entire document into an XML DOM. We would then need tools that could render from such a DOM. There are also open questions, such as where scripts can be conveyed and, if there are multiple HTML islands, whether they are processed as separate HTML documents.

Clip out the HTML and process separately

Regardless of how the HTML is serialized, one can envision "clipping out" a particular island and handing that to a conventional HTML parser, typed either as text/html or as application/xhtml+xml.
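A minimal Python sketch of the clipping approach, for the case where the island is an XHTML subtree, might look like the following. The memo vocabulary is hypothetical, and the standard library's html.parser.HTMLParser stands in for a conventional HTML parser; it is a simple tokenizer and does not implement the HTML5 parsing algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical container with one XHTML island.
DOC = """\
<memo xmlns="http://example.org/memo">
  <body>
    <div xmlns="http://www.w3.org/1999/xhtml"><p>All <em>good</em>.</p></div>
  </body>
</memo>
"""

# Serialize XHTML elements without a prefix so the clipped text also reads
# as ordinary HTML. Note that this registration is process-wide.
ET.register_namespace("", XHTML_NS)

root = ET.fromstring(DOC)
island = root.find(f".//{{{XHTML_NS}}}div")

# tostring() also emits the element's tail whitespace, hence the strip().
clipped = ET.tostring(island, encoding="unicode").strip()


class TagCollector(HTMLParser):
    """Records start tags in document order, as a stand-in for real processing."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Treat the clipped island as text/html.
collector = TagCollector()
collector.feed(clipped)
collector.close()
print(collector.tags)
```

To process the same island as application/xhtml+xml instead, the clipped string would be handed to an XML parser. For an island conveyed as escaped text or CDATA, the clipping step reduces to reading the element's text content.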