HTML XML Use Case 01

From W3C Wiki

Using an XML toolchain to consume HTML

Problem statement

The user has some software that operates on XML input and would operate sensibly when given XHTML5 input, or the user has libraries that make it convenient to write such software. The user would also like to feed text/html content to that software.

(Note that it is assumed that the tooling supports XHTML5 sensibly. I.e. an XML parser is available and whatever the program does after parsing solves some problem for the user when the user supplies XHTML5 input to the program.)

Solutions

Using an HTML parser as the first processing step (generally applicable)

The first (and the more versatile and generally useful) solution is to add an HTML5 parser to the pipeline. In the simplest scenario, one would use an HTML parser to parse the text/html input into a tree, apply the rules of the infoset coercion section of the HTML5 spec (to make sure the tree represents a well-formed XML Infoset), serialize the tree into XML, and feed the resulting XML to the original tool.
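The simplest pipeline described above can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's standard-library html.parser is used as a stand-in and does not implement the HTML5 parsing algorithm (no error recovery, no implied elements), the infoset coercion step and the XHTML namespace are omitted, and the names TreeBuildingParser and html_to_xml are hypothetical. A real toolchain would plug in a conforming HTML5 parser at this point.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuildingParser(HTMLParser):
    """Stand-in for a real HTML5 parser (assumption): builds an ElementTree.

    Only well-nested input is handled; there is no HTML5 error recovery.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive with value None.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, clean)
        if tag in VOID:
            # Void elements are closed immediately so the tree stays balanced.
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def html_to_xml(html_text):
    """Parse text/html into a tree, then serialize that tree as XML."""
    parser = TreeBuildingParser()
    parser.feed(html_text)
    parser.close()
    root = parser.builder.close()
    return ET.tostring(root, encoding="unicode")


# Usage: the original XML tool (here just ET.fromstring) consumes the output.
xml_text = html_to_xml("<html><body><p class=x>Hello<br>world</p></body></html>")
tree_seen_by_xml_tool = ET.fromstring(xml_text)
```

The point of the sketch is the shape of the pipeline: text/html goes in, a tree is built, and what reaches the unchanged XML tool is well-formed XML.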

If the source code of the XML tool is available for modification, the serialization to XML and the second parsing step can be optimized away. The HTML parser can expose the same API as the XML parser that was originally part of the toolchain. This way, when consuming text/html content, the HTML parser feeds data directly into the rest of the program, and the rest of the program can be written as if it were interfacing with an XML parser.
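As an illustration of an HTML parser exposing the same API as the XML parser, the sketch below drives a SAX ContentHandler from HTML input. It assumes the existing tool is written against the SAX interface, and it again uses the standard-library html.parser as a stand-in for a conforming HTML5 parser. HtmlSaxDriver, LinkCollector, links_from_xml and links_from_html are hypothetical names introduced for this example only.

```python
import xml.sax
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlSaxDriver(HTMLParser):
    """Feeds SAX events to a ContentHandler from text/html input.

    Stand-in only (assumption): html.parser is not an HTML5 parser.
    """

    def __init__(self, handler):
        super().__init__(convert_charrefs=True)
        self.handler = handler

    def handle_starttag(self, tag, attrs):
        sax_attrs = AttributesImpl({k: (v if v is not None else "") for k, v in attrs})
        self.handler.startElement(tag, sax_attrs)
        if tag in VOID:
            self.handler.endElement(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.handler.endElement(tag)

    def handle_data(self, data):
        self.handler.characters(data)

    def parse_string(self, text):
        self.handler.startDocument()
        self.feed(text)
        self.close()
        self.handler.endDocument()


class LinkCollector(ContentHandler):
    """The 'rest of the program': written purely against the SAX API."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        if name == "a" and "href" in attrs:
            self.links.append(attrs["href"])


def links_from_xml(text):
    collector = LinkCollector()
    xml.sax.parseString(text.encode("utf-8"), collector)  # original XML path
    return collector.links


def links_from_html(text):
    collector = LinkCollector()
    HtmlSaxDriver(collector).parse_string(text)  # no intermediate XML
    return collector.links
```

LinkCollector is identical on both paths; only the component that produces the events differs, which is why the serialization and reparsing steps become unnecessary.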

When the intermediate XML serialization has been optimized away like this, the infoset coercion step can also be optimized away if the rest of the application doesn't rely on the well-formedness of the infoset. That is, if the rest of the application doesn't throw an exception if a non-NCName local name is given to it, etc., the infoset coercion step mentioned above can be left out.
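To make the coercion of non-NCName local names concrete, here is a minimal sketch of the name-mapping rule as the HTML spec describes it: each unsupported character is replaced by the letter U followed by the six uppercase hexadecimal digits of its code point. The function name coerce_local_name is hypothetical, and the character test is a deliberate simplification that accepts only the ASCII subset of NCName characters, so some legal non-ASCII names would be escaped unnecessarily.

```python
import re

# ASCII-only approximation of the XML NCName rules (assumption: the real
# NameStartChar / NameChar productions also allow many non-ASCII characters).
_START_OK = re.compile(r"[A-Za-z_]")
_CHAR_OK = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_local_name(name):
    """Map a local name to one that an XML API will accept."""
    out = []
    for i, ch in enumerate(name):
        pattern = _START_OK if i == 0 else _CHAR_OK
        if pattern.fullmatch(ch):
            out.append(ch)
        else:
            # "U" plus the six-digit uppercase hex code point.
            out.append("U%06X" % ord(ch))
    return "".join(out)
```

With this sketch, coerce_local_name("foo<bar") gives "fooU00003Cbar", while an ordinary name such as "div" passes through unchanged. If the rest of the application accepts arbitrary strings as names, this step is exactly what can be skipped.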

Using polyglot markup (not generally applicable)

The second (and less versatile and in the general case inapplicable) solution that has been mentioned is using polyglot markup. This solution requires narrowing the problem statement by stipulating additional restrictions on the properties of the text/html content to be consumed.

The text/html content is constrained in such a way that when parsed as XML, it produces the same parse tree (except for how the xmlns attribute on the root element is represented) as it would produce if parsed using an implementation of the HTML parsing algorithm.
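A rough way to see what this constraint rules out is to run candidate content through an XML parser. The sketch below checks only XML well-formedness, which is a necessary condition for polyglot markup and not a sufficient one: it does not compare the XML parse tree against the tree an HTML parser would produce. The function name is_well_formed_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Ordinary (monoglot) HTML: an unclosed <br> is fine for an HTML parser
# but is rejected by an XML parser.
monoglot = "<p>a<br>b</p>"

# The same fragment written so that an XML parser also accepts it.
candidate = "<p>a<br/>b</p>"

results = (is_well_formed_xml(monoglot), is_well_formed_xml(candidate))
```

Content that fails this check cannot be consumed by the XML toolchain at all, which is why every document must satisfy the constraint for the polyglot approach to work.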

For the user to be able to consume text/html content using an XML parser, every document (s)he wants to consume has to be polyglot. If the content to be consumed is Web content in general, there's no way to force all of it to be polyglot, and the polyglot solution isn't applicable.

Thus, the user either has to be the producer of the text/html content to be consumed (in order to have full control over it) or has to have bilateral agreements with each producer of text/html content to be consumed, to make sure each producer supplies only polyglot text/html content.

If the user applied the first solution (HTML parser) instead, there'd be no need to enforce such bilateral agreements. In fact, once the user starts applying the HTML parser solution to content received from one source, it no longer makes sense not to apply it to other text/html sources as well. Thus, from the point of view of a would-be polyglot content supplier, making a document polyglot won't be of value if someone else whose document needs to be consumed by the same consumer makes a monoglot document. The would-be polyglot author might as well be the first one to make a monoglot document that forces the consumer to deal with it. Thus, having content suppliers give the user polyglot markup is not a stable equilibrium: every one of the content suppliers has the incentive (because producing polyglot is harder than producing monoglot) to push the user to use the HTML parser solution instead.