This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 13410 - XML serialisation incompletely defined.
Summary: XML serialisation incompletely defined.
Status: RESOLVED WORKSFORME
Alias: None
Product: WebAppsWG
Classification: Unclassified
Component: DOM Parsing and Serialization
Version: unspecified
Hardware: PC Windows NT
Importance: P2 critical
Target Milestone: ---
Assignee: Travis Leithead [MSFT]
QA Contact: public-webapps-bugzilla
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-28 15:52 UTC by David Carlisle
Modified: 2014-10-20 16:35 UTC


Description David Carlisle 2011-07-28 15:52:09 UTC
http://www.whatwg.org/specs/web-apps/current-work/#xml-fragment-serialization-algorithm
http://www.whatwg.org/specs/web-apps/current-work/#html-fragment-serialization-algorithm
http://html5.org/specs/dom-parsing.html#serializing

The XML fragment serialisation is not defined in the same detail as
the HTML fragment serialisation algorithm. The latter specifies in some
detail how the HTML syntax corresponding to a node is constructed, and
how strings are quoted. The XML version does neither.

In particular the algorithm described here is not clearly consistent
with that defined in the draft DOM Parsing and Serialization spec.
For example in the html spec it is implied (although not particularly
explicitly) that a CDATA node may need to be serialised as multiple
CDATA sections if certain otherwise illegal character data is contained
in the node, whereas dom-parsing says:

If data doesn't match the CData production, throw an INVALID_STATE_ERR
exception and terminate the entire algorithm. 

Note that the XQuery/XSLT serialisation spec specifies in detail rules
for serialising CDATA and other problematic cases
http://www.w3.org/TR/xslt-xquery-serialization/#XML_CDATA-SECTION-ELEMENTS
It would be good if the xml and html fragment serialisation algorithms
were at least compatible with the xml, xhtml and html serialisations
defined there.
Comment 1 Ms2ger 2011-07-28 17:55:31 UTC
As you might have noticed, I haven't put a lot of time into that algorithm. Improvements welcome; otherwise I'll get to it at some point.
Comment 2 Aryeh Gregor 2011-07-29 18:34:17 UTC
FWIW, currently DOM Core mandates that CDATA doesn't exist, and at least Gecko is interested in trying to get rid of it:

https://bugzilla.mozilla.org/show_bug.cgi?id=660660
Comment 3 Michael[tm] Smith 2011-08-04 05:34:27 UTC
mass-move component to LC1
Comment 4 Ian 'Hixie' Hickson 2011-08-17 21:51:51 UTC
Isn't this just XML's problem? Why would we need to redefine the XML spec here? I don't understand the problem. What is the interoperability risk here?
Comment 5 Anne 2011-08-17 22:07:09 UTC
I have heard of cases where sites serialize some XML and then break because we included an XML declaration. And when specifications fail to deliver, we address the gaps. At least, we have been doing that to some extent.
Comment 6 David Carlisle 2011-08-18 08:44:47 UTC
(In reply to comment #4)
> Isn't this just XML's problem? Why would we need to redefine the XML spec here?
> I don't understand the problem. What is the interoperability risk here?


XML doesn't define an algorithm for serialising a DOM tree.
You could just say that the "xml fragment serialisation" as referred to by the html spec meant "any string which would parse to the same tree given an XML parser", but then subsequent processing needs to be defined to be tolerant of the implementation-specific differences that this generates. As Anne just mentioned, there are choices about whether to use xml declarations, whether to use <foo/> or <foo></foo>, whether to use &gt; or >, etc. If it would be sufficient to define the xml serialisation in this way, one would expect that it would be sufficient to define the html serialisation the same way: any string which parses to the same DOM using the html parse algorithm. However, that is not the way the html serialisation is defined; one particular serialisation is defined (which seems like a good thing to me from an interop point of view).
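
To make the point about implementation-specific choices concrete, here is a minimal sketch in which one tree yields two different strings that an XML parser would read back as the same tree. The tree shape, function names and option names are all hypothetical and exist only for this illustration; this is not the DOM API or any spec's algorithm.

```javascript
// Hypothetical tree shape for illustration only (not the DOM API):
// an element is { name, children }, a text node is a plain string.
function escapeText(text, escapeGt) {
  // "&" and "<" must always be escaped in XML character data.
  const out = text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
  // ">" only has to be escaped as part of "]]>", so escaping it elsewhere
  // is a style choice. This sketch ignores the "]]>" corner case.
  return escapeGt ? out.replace(/>/g, "&gt;") : out;
}

function serializeNode(node, opts) {
  if (typeof node === "string") {
    return escapeText(node, opts.escapeGt);
  }
  const children = node.children || [];
  if (children.length === 0 && opts.selfClose) {
    return `<${node.name}/>`;
  }
  const inner = children.map((child) => serializeNode(child, opts)).join("");
  return `<${node.name}>${inner}</${node.name}>`;
}

function serializeDocument(root, opts) {
  const decl = opts.xmlDeclaration ? '<?xml version="1.0"?>' : "";
  return decl + serializeNode(root, opts);
}

const tree = { name: "p", children: ["a > b", { name: "foo", children: [] }] };

const styleA = serializeDocument(tree, { xmlDeclaration: true, selfClose: true, escapeGt: true });
const styleB = serializeDocument(tree, { xmlDeclaration: false, selfClose: false, escapeGt: false });

console.log(styleA);
console.log(styleB);
```

Both outputs are well-formed XML for the same tree, yet code that compares or post-processes the strings (for example, a site that does not expect an XML declaration) can behave differently depending on which one it receives.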
Comment 7 Ian 'Hixie' Hickson 2011-09-23 22:59:22 UTC
I agree that it needs to be defined, but how to serialise XML interoperably seems like an issue for the XML spec, not the HTML spec.
Comment 8 David Carlisle 2011-09-23 23:23:35 UTC
(In reply to comment #7)
> I agree that it needs to be defined, but how to serialise XML interoperably
> seems like an issue for the XML spec, not the HTML spec.

Not really. The XML spec doesn't define any object model that the xml syntax parses to, so there is nothing for it to define the serialisation of. The place in the "XML family of specs" where an object model for which serialisation could be defined was/is the DOM specs, but as you know, most of the parts of that relevant to the web browser context are either incorporated into the html spec or being refreshed as part of the same suite. In this case since one of that family of specs is explicitly about serialising the DOM, that seems like the natural place for it.
Comment 9 Ian 'Hixie' Hickson 2011-09-26 22:28:04 UTC
If you think how to serialise XML should be defined by the DOM specs and not the XML specs then I guess we can see what the DOM spec people think. But personally I think that's absurd.

In either case, having the _HTML_ spec define XML's serialisation is simply inappropriate. The HTML spec defines HTML's syntax, parsing rules (including error handling), and serialisation. The CSS specs define CSS's syntax, parsing rules (including error handling), and serialisation. The XML spec should define XML's syntax, parsing rules, and serialisation.
Comment 10 Anne 2011-09-27 08:12:40 UTC
DOM Parsing and Serialization has a start. Someone should finish it.
Comment 11 David Carlisle 2011-09-27 08:25:46 UTC
(In reply to comment #9)
> If you think how to serialise XML should be defined by the DOM specs and not
> the XML specs then I guess we can see what the DOM spec people think. But
> personally I think that's absurd.

As I say, you can only specify the serialisation of an in-memory structure; you can't specify the serialisation of a syntax. So in XML's case it is the serialisation of the DOM that needs to be specified.

> 
> In either case, having the _HTML_ spec define XML's serialisation is simply
> inappropriate. 

I agree. As Anne commented in comment 10, the relevant part has already been split off into DOM parsing and serialisation, so this entire bug is a comment on that (and I see someone has changed the component of this report appropriately)
Comment 12 John Thomas 2012-02-11 22:12:39 UTC
It made sense for XML serialization to be in the DOM specs when DOM applied to both XML and HTML. If the current DOM group is only concerned with HTML, then it may make sense to have a new XML DOM group, separate from HTML DOM, which will concern itself with matters like this.

If there is no interest within the W3C in further building standard DOM APIs for XML, then perhaps it's time for the W3C to seek other stewards for those specs. (Of course an argument could be made that DOM Level 2 is already enough for the needs of non-browsers. I'm not going to make that argument, but I'd be interested in hearing from someone who wants to.)


(In reply to comment #9)
> If you think how to serialise XML should be defined by the DOM specs and not
> the XML specs then I guess we can see what the DOM spec people think. But
> personally I think that's absurd.
> 
> In either case, having the _HTML_ spec define XML's serialisation is simply
> inappropriate. The HTML spec defines HTML's syntax, parsing rules (including
> error handling), and serialisation. The CSS specs define CSS's syntax, parsing
> rules (including error handling), and serialisation. The XML spec should define
> XML's syntax, parsing rules, and serialisation.
Comment 13 C. Scott Ananian 2014-04-02 02:41:53 UTC
As a concrete use case for a well-defined XML serialization, I'll note that there is no way currently specified to generate an HTML serialization of a document once it is parsed as XML.  Because the XML serialization is ill-defined, browsers can (and do) emit constructs like "<span/>" which can't be parsed by the HTML parser.

I'd suggest that the XML serialization be defined to conform to the recommendations in http://dev.w3.org/html5/html-polyglot/html-polyglot.html such that (in the absence of unsupported XML features such as namespaces) it is also a valid HTML document.

That is, as far as possible, doc1 and doc2 should be identical in:

var parser = new DOMParser();
var doc1 = parser.parseFromString(stringContainingXMLSource, "application/xml");
var out = (new XMLSerializer()).serializeToString(doc1);
var doc2 = parser.parseFromString(out, "text/html");
Comment 14 Simon Pieters 2014-04-02 09:45:20 UTC
(In reply to C. Scott Ananian from comment #13)
> As a concrete use case for a well-defined XML serialization, I'll note that
> there is no way currently specified to generate an HTML serialization of a
> document once it is parsed as XML.  Because the XML serialization is
> ill-defined, browsers can (and do) emit constructs like "<span/>" which
> can't be parsed by the HTML parser.

Don't parse XML as HTML.

> I'd suggest that the XML serialization be defined to conform to the
> recommendations in http://dev.w3.org/html5/html-polyglot/html-polyglot.html
> such that (in the absence of unsupported XML features such as namespaces) it
> is also a valid HTML document.

Please no, that introduces unnecessary complexity to the XML serializer. The polyglot spec is supposed to only affect authors that want to waste their time following it, not UAs.

> That is, as far as possible, doc1 and doc2 should be identical in:
> 
> var parser = new DOMParser();
> var doc1 = parser.parseFromString(stringContainingXMLSource,
> "application/xml");
> var out = (new XMLSerializer()).serializeToString(doc);
> var doc2 = parser.parseFromString(out, "text/html")

Why would you want to do this? It doesn't make sense. Either you serialize and parse as HTML, or you serialize and parse as XML. You wouldn't serialize an image as GIF and then parse it as JPEG, either.
Comment 15 C. Scott Ananian 2014-04-02 15:56:49 UTC
The user who was doing this explained they were passing/processing data as XML "so that IE won't screw it up".  But then they wanted to pass it to my service, which only accepts HTML (and uses domino, a spec-compliant HTML parser).  Running document.outerHTML on their XHTML document gave them XML (which is spec-compliant, although surprising), and they complained that I "didn't parse <span/> correctly".

I think it's best not to try to cast aspersions on the user here, but instead to try to figure out the best way to handle XML serialization and conversion.  I see no good reason for the XML serializer to emit "<span/>" instead of "<span></span>", for instance.
Comment 16 Simon Pieters 2014-04-02 17:47:38 UTC
OK.

I still think the best solution to your problem is to use the same format in the serializer and the parser, whether that means you gain an XML parser or they use an HTML serializer. It's possible that the Web platform doesn't currently provide a way to serialize an XML document as HTML; that seems like a bug.

The formats are not compatible. <span/> is just the tip of the iceberg. Even if the XML serializer somehow implemented things from the Polyglot spec, it still wouldn't work for arbitrary DOMs. The DOM also needs to follow the Polyglot spec, and the UA can't enforce that. So it's a futile exercise. It would also lead people into thinking that it's a good idea to serialize in one format and parse as another, which it isn't.
Comment 17 C. Scott Ananian 2014-04-02 18:01:06 UTC
In this bug I'm just politely asking for the XML serialization spec, if/when it is defined, to at least try to avoid gratuitous HTML incompatibilities like shorttags.  I understand that there are XML/HTML incompatibilities, but if the document author has made the effort to create a polyglot document it is unfortunate if the XML serialization doesn't preserve it.

(Bug 25225 asks for an HTML serialization mechanism for an XML document.)
Comment 18 Simon Pieters 2014-04-03 11:54:11 UTC
I think you need to be more specific about what you want exactly. How do you want a "br" element to be serialized, for instance?
Comment 19 C. Scott Ananian 2014-04-03 13:00:39 UTC
@Simon: consistent with http://dev.w3.org/html5/html-polyglot/html-polyglot.html#empty-elements
That is,
"All elements listed as void in the HTML specification or in an extension spec, MUST in polyglot markup have the syntactic form of an XML empty-element tag (<foo/>)."
Comment 20 Simon Pieters 2014-04-04 06:48:51 UTC
OK, so the XML serializer would keep a list of the HTML void elements. If that's all there is, I guess that's acceptable, but it seems like a slippery slope to me. In any case, if you want this, please file a new bug.
Comment 21 Travis Leithead [MSFT] 2014-10-13 23:06:24 UTC
I'm not sure if David is monitoring this bug anymore. In any case, it looks like C. Scott Ananian is still interested in seeing this bug through.

C. Scott, please take a look at:
http://www.w3.org/TR/DOM-Parsing/#dfn-concept-xml-serialization-algorithm, step 15. I think it keeps things as close to HTML serializing interop as practically possible. For other HTML-namespaced elements it preserves begin/end tag serialization.

Let me know if there's more that you think can be done here. Thanks.
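
A simplified sketch of the rule described in this comment, for readers who want to see its effect. It is not the spec's algorithm text: the node shape and function name are hypothetical, the void-element list is an illustrative subset, namespace declarations, attributes and text nodes are left out, and the exact spelling of the empty-element tag (" />" for HTML void elements, "/>" for childless elements in other namespaces) is an assumption to check against the spec itself.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";

// Illustrative subset of the HTML void elements; the normative list lives
// in the HTML and DOM Parsing specs.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical node shape: { ns, name, children }, elements only.
function serializeElement(el) {
  const children = el.children || [];
  if (children.length === 0) {
    if (el.ns === HTML_NS && VOID_ELEMENTS.has(el.name)) {
      // HTML void element: empty-element tag that an HTML parser also accepts.
      return `<${el.name} />`;
    }
    if (el.ns !== HTML_NS) {
      // Childless element outside the HTML namespace: XML empty-element tag.
      return `<${el.name}/>`;
    }
  }
  // Everything else, including empty non-void HTML elements, keeps
  // explicit start and end tags.
  const inner = children.map(serializeElement).join("");
  return `<${el.name}>${inner}</${el.name}>`;
}

console.log(serializeElement({ ns: HTML_NS, name: "span" }));
console.log(serializeElement({ ns: HTML_NS, name: "br" }));
```

Under these assumptions an empty span comes out as "<span></span>" rather than "<span/>", which is the case that prompted comment 13.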
Comment 22 David Carlisle 2014-10-15 08:33:43 UTC
monitoring is probably putting it too strongly, but I'm here.
I'm happy with the way the document's progressed, thanks.

David
Comment 23 C. Scott Ananian 2014-10-20 16:35:50 UTC
It looks to me like steps 15 and 16 solve the problem which originally led me here -- that is, they ensure that we only use self-closing tags for HTML void elements.  I'll check with the others here at Wikimedia (in particular we have HTML/XML interop issues with the Parsoid and Visual Editor components) and see if this is sufficient.  (I'll probably have to implement the new spec by hand and transition Visual Editor to use it, and/or audit Visual Editor's existing serializer against the new spec.)