Re: Fwd: HTML5 and XHTML2 combined (a new approach) from Philip Taylor on 2009-01-27 (www-html@w3.org from January 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Tue, 27 Jan 2009 10:51:58 +0000
To: www-html@w3.org, Giovanni Campagna <scampa.giovanni@gmail.com>
Message-ID: <497EE74E.8000306@cam.ac.uk>

Giovanni Campagna wrote:
>>> I asked a different question: why should an author who doesn't rely on
>>> scripts (or an implementation that cannot, for various reasons,
>>> implement scripts) have to learn plenty of DOM interfaces and APIs?
>>>
>>
>> DOM is the abstract model that serializations express. So if an
>> implementation is parsing a serialization, it's producing a DOM, regardless
>> of whether it supports scripting.
> 
> What about SAX parsers? They don't build any DOM. An implementation is
> required to build an Infoset (an abstract concept), not a DOM (a set of
> objects implementing certain interfaces).

HTML5 only requires that implementations act the same as if they were 
producing a DOM - it doesn't require that they actually do produce a DOM 
internally. (It specifically says "Note: Implementations that do not 
support scripting do not have to actually create a DOM Document object, 
but the DOM tree in such cases is still used as the model for the rest 
of the specification.")

You can write a streaming SAX parser for HTML5 without buffering
anything into a tree, as long as you treat some errors as fatal (e.g.
"<table>foo" is non-streamable because the parser moves the text "foo"
to before the <table> in the parsed document, by which point a streaming
parser has already emitted the start of the <table>). If you don't hit a
non-streamable error, the output from the SAX parser has to be
equivalent to what you'd get by parsing into a DOM and then emitting it
as SAX, but there's no need to actually create a DOM.
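
As a rough sketch of the "treat it as fatal" idea (this is not taken
from any real parser; the token format, the event names and
NonStreamableError are all made up for illustration, and the toy ignores
implied elements like <html>, <body> and <tbody>):

```python
class NonStreamableError(Exception):
    """Hypothetical error: the input would need already-emitted output reordered."""


def stream_events(tokens):
    """Yield SAX-like events from (kind, value) tokens without building a tree.

    Assumes a pre-tokenized, well-nested input; a real HTML5 parser does
    far more (insertion modes, implied tags, error recovery).
    """
    open_elements = []  # only a stack of names is kept, never a tree
    for kind, value in tokens:
        if kind == "start":
            open_elements.append(value)
            yield ("startElement", value)
        elif kind == "end":
            if not open_elements or open_elements[-1] != value:
                raise ValueError("toy sketch only handles well-nested input")
            open_elements.pop()
            yield ("endElement", value)
        elif kind == "text":
            in_table = bool(open_elements) and open_elements[-1] == "table"
            # Non-whitespace text directly inside <table> belongs *before*
            # the table in the parsed document, but startElement("table")
            # has already been emitted, so give up instead of buffering.
            if in_table and value.strip():
                raise NonStreamableError("text %r directly inside <table>" % value)
            yield ("characters", value)
```

Feeding it start "p", text "foo", end "p" yields three events in order;
feeding it start "table", text "foo" yields the startElement event and
then raises NonStreamableError.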

The parser algorithm uses phrases like "Append a Comment node to the 
Document object with the data attribute set to the data given in the 
comment token.", which are fairly high-level (it's not saying e.g. 
"document.appendChild(document.createComment(token.data))") and easy to 
understand in terms of any tree model, and don't require a detailed 
knowledge of DOM.
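
To illustrate (a minimal sketch with made-up class and method names,
not the API of html5lib or any other library), the same parser step can
be written once and pointed at very different output models:

```python
class DictTreeBuilder:
    """Hypothetical tree model made of plain dicts and lists."""

    def __init__(self):
        self.document = {"type": "document", "children": []}

    def append_comment_to_document(self, data):
        self.document["children"].append({"type": "comment", "data": data})


class EventListBuilder:
    """Hypothetical non-tree output: just records SAX-like events."""

    def __init__(self):
        self.events = []

    def append_comment_to_document(self, data):
        self.events.append(("comment", data))


def handle_comment_token(builder, token):
    # "Append a Comment node to the Document object with the data
    # attribute set to the data given in the comment token."
    builder.append_comment_to_document(token["data"])


token = {"type": "comment", "data": " hello "}

tree = DictTreeBuilder()
handle_comment_token(tree, token)

sax = EventListBuilder()
handle_comment_token(sax, token)
```

Neither builder knows anything about DOM interfaces; the spec's wording
only needs "the document" and "a comment with some data" to exist in
whatever model is being produced.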

So the DOM is being used largely as an abstract concept and not as a set 
of objects. Since scripting relies on the DOM, the spec has to define 
how to get a DOM from a serialised document (and how to handle e.g. 
scripts mutating the document while it's being parsed), and that's much 
easier if the parser's abstract model is the DOM instead of using some 
other model that has to be explicitly mapped onto the DOM 
implementation. As far as I'm aware, implementers of non-scripted
parsers have not had any problems mapping these concepts onto different
output formats (html5lib has several tree formats, Validator.nu has XOM
and SAX, etc.), so it seems to work fine in practice.

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Tuesday, 27 January 2009 10:52:34 UTC