Talk:HTML XML Use Case 01

Using an HTML parser as the first processing step (Adaptation by the consumer)

The first solution is to add an HTML5 parser to the consumer's pipeline. In the simplest scenario, one would use an HTML parser to parse the text/html input into a tree, apply the rules of the infoset coercion section of the HTML5 spec (to make sure the tree represents a well-formed XML Infoset), serialize the tree into XML, and feed the resulting XML to the original tool.
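As a rough illustration, here is a minimal sketch of that pipeline in Python, assuming the html5lib and lxml packages (neither is named above): html5lib implements the HTML5 parsing algorithm, and its lxml tree builder coerces names and characters that would not be acceptable in well-formed XML before the tree is serialized.

    # Sketch only: parse text/html with an HTML5 parser, then serialize the
    # resulting (coerced) tree as XML for an existing XML-consuming tool.
    import html5lib
    from lxml import etree

    def html_to_xml(html_bytes):
        # The "lxml" tree builder produces a tree whose names are acceptable to XML.
        tree = html5lib.parse(html_bytes, treebuilder="lxml")
        return etree.tostring(tree, xml_declaration=True, encoding="utf-8")

    # An unclosed <b> is fine: the HTML parser repairs it before serialization.
    xml_bytes = html_to_xml(b"<title>Example</title><p>Hello <b>world")
    # xml_bytes can now be fed to the original XML tool unchanged.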

If the source code of the XML tool is available for modification, the serialization to XML step and the second parsing step can be optimized away. The HTML parser can expose the same API as the XML parser that was originally part of the toolchain. This way, when consuming text/html content, an HTML parser can feed data into the rest of the program, and the rest of the program can be written as if it were interfacing with an XML parser.
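For example, a minimal sketch (again assuming html5lib, and assuming the standard-library ElementTree API is the interface the rest of the program was written against): swapping in the HTML parser behind the same tree API removes the need to serialize to XML and parse again.

    # Sketch only: both branches hand the rest of the program the same
    # ElementTree-style objects, so no intermediate XML serialization is needed.
    import html5lib
    import xml.etree.ElementTree as ET

    XHTML_NS = "{http://www.w3.org/1999/xhtml}"

    def parse_any(data, content_type):
        if content_type == "text/html":
            # html5lib's "etree" tree builder returns ElementTree elements.
            return html5lib.parse(data, treebuilder="etree")
        return ET.fromstring(data)

    # Downstream code is unchanged; it just walks ElementTree elements.
    root = parse_any(b"<title>Example</title><p>Hello", "text/html")
    for p in root.iter(XHTML_NS + "p"):
        print(p.text)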

When the intermediate XML serialization has been optimized away like this, the infoset coercion step can also be optimized away if the rest of the application doesn't rely on the well-formedness of the infoset. That is, if the rest of the application doesn't throw an exception when a non-NCName local name is given to it, and so on, the infoset coercion step mentioned above can be left out.


Using polyglot markup (Adaptation by the producer)

The second solution that has been mentioned is using polyglot markup. This solution involves an adaptation on the part of the producer. The producer constrains the text/html content produced to be both HTML and XML at the same time.

This can be done by using XML tools, by making it a site policy throughout the workflow, or by taking whatever is available at the end of the workflow and adding a final step that uses existing tools such as tidy -asxml.
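As a sketch of the last option, and assuming HTML Tidy is installed (only the tidy -asxml invocation is named above), the final workflow step could shell out to it:

    # Sketch only: run "tidy -asxml" as the final step of the workflow so that
    # whatever HTML the earlier steps produced is emitted as well-formed XHTML.
    import subprocess

    def tidy_to_xhtml(html_bytes):
        # -q keeps Tidy quiet; a non-zero exit just means Tidy had to repair the input.
        result = subprocess.run(
            ["tidy", "-asxml", "-q"],
            input=html_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        return result.stdout

    xhtml = tidy_to_xhtml(b"<p>Hello <b>world")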

This is not about forcing everyone to produce polyglot, or about consumers expecting polyglot everywhere.

If you have a lot of HTML, you can add a tidy-to-XML pass as the last processing step. If your workflow already uses XML tools, this happens naturally.

The text/html content is constrained so that, when parsed as XML, it produces the same parse tree (except for how the xmlns attribute on the root element is represented) as it would if parsed with an implementation of the HTML parsing algorithm.
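To illustrate, here is a minimal check in Python (assuming the html5lib package; the sample document is made up for the example): the same polyglot bytes are parsed once with the HTML parsing algorithm and once with an XML parser, and the element structure is compared.

    # Sketch only: a polyglot document yields the same element structure whether
    # it is parsed by the HTML parsing algorithm or by an XML parser.
    import html5lib
    import xml.etree.ElementTree as ET

    XHTML_NS = "{http://www.w3.org/1999/xhtml}"

    polyglot = (
        b'<html xmlns="http://www.w3.org/1999/xhtml">'
        b"<head><title>Example</title></head>"
        b"<body><p>Hello</p></body></html>"
    )

    def element_names(root):
        # Compare only XHTML element names; the xmlns attribute on the root is
        # represented differently by the two parsers, as noted above.
        return [
            el.tag
            for el in root.iter()
            if isinstance(el.tag, str) and el.tag.startswith(XHTML_NS)
        ]

    html_root = html5lib.parse(polyglot, treebuilder="etree")  # HTML parsing algorithm
    xml_root = ET.fromstring(polyglot)                         # XML parser

    assert element_names(html_root) == element_names(xml_root)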

If you generate polyglot, you have two audiences for your data: those who handle HTML and those who handle XML. You provide a better service, addressing a larger audience.

No one can assume that all content on the web is polyglot, but there is little cost, and in some cases a gain, in producing it for both internal and external consumption. This may become a norm within certain communities. These communities may grow through a "race to the top", with producers competing to be of service to the wider audience.

As consumers apply the first solution above (an HTML parser), the motivation for the race to the top decreases. A "race to the bottom" occurs instead, in which consumers compete to be able to consume a wider range of information.

In fact, once the user starts applying the HTML parser solution to content received from one source, it no longer makes sense not to apply it to other text/html sources as well. Thus, from the point of view of a would-be polyglot content supplier, making a document polyglot won't be of value if someone else whose document needs to be consumed by the same consumer produces a monoglot document. The would-be polyglot author might as well be the first one to produce a monoglot document that forces the consumer to deal with it. Thus, having content suppliers give the user polyglot markup is not a stable equilibrium: every content supplier has an incentive (because producing polyglot is harder than producing monoglot) to push the user toward the HTML parser solution instead.