This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 5808 - Define a way to coerce HTML5 parser output to an XML 1.0 4th ed. + Namespaces 1.0 infoset
Status: CLOSED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-06-26 12:52 UTC by Henri Sivonen
Modified: 2010-10-04 14:47 UTC
CC List: 5 users

Description Henri Sivonen 2008-06-26 12:52:55 UTC
There's now a canned answer for anyone who argues that XHTML works better with the 'XML toolchain' than HTML5: "Just put an HTML5 parser at the start of your XML pipeline."

There's a slight problem though: The HTML5 parser algorithm can output a document tree that is not an XML 1.0 4th ed. + Namespaces 1.0 infoset. This poses a problem if a processing pipeline serializes to XML and expects a later stage to reparse using a conforming XML 1.0 4th ed. + Namespaces 1.0 parser or if a component in the pipeline (e.g. the XOM library) performs early checks.

Therefore, every HTML5 parser writer who wishes to provide a full-featured general-purpose HTML5 parser needs to come up with a coercion from an HTML5 DOM onto an XML 1.0 4th ed. + Namespaces 1.0 Infoset.

I suggest documenting a mapping.

Here's a list of problems with proposed solutions:

 * The document mode isn't part of the infoset: Optionally communicate as out-of-infoset-band data. Instruct apps to use the standards mode when not communicated.
 * The form pointer isn't part of the infoset: Make communicating the form pointer optional. Allow communicating it as out-of-infoset-band data. When the form element is not an ancestor of the form control, allow a UUID id attribute to be generated on the form element and allow a form attribute to be generated on the form control.
 * Some XML APIs treat the doctype as syntactic sugar: Make representing the document type information item optional.
 * Attributes with the local name "xmlns" or a local name starting with "xmlns:" are not permitted attribute information items: Drop on the floor.
 * Namespace declarations are not attribute information items: Drop on the floor. (Optionally synthesize namespace information items for XLink and SVG or MathML on <svg> and <math> nodes, respectively, and XHTML namespace information items on HTML elements (including root) that do not have an HTML element as the parent.)
 * Form feed is not an XML character (either literally or as a character reference expansion): turn into a space.
 * The input stream contains a literal non-XML character other than form feed: turn into a REPLACEMENT CHARACTER.
 * A comment contains "--": Replace with "- -".
 * A name is not an NCName: Use the original name on the tree builder stack for matching, but use an escaped name in the output. The escaping function must escape each non-NCName to a unique NCName, and the result must have at least one upper case ASCII character but must not match any known SVG camelCase name.
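
The character and comment items in the list above can be sketched as follows. This is only an illustration in Python, not code from any parser; the function names is_xml_char, coerce_text and coerce_comment are made up for this example. The character ranges are those of the XML 1.0 Char production.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def coerce_text(text: str) -> str:
    """Map form feed to a space and any other non-XML character to U+FFFD."""
    out = []
    for ch in text:
        if ch == "\f":
            out.append(" ")
        elif is_xml_char(ord(ch)):
            out.append(ch)
        else:
            out.append("\ufffd")
    return "".join(out)


def coerce_comment(data: str) -> str:
    """Apply the text coercion, then break up every "--" as "- -"."""
    data = coerce_text(data)
    # Repeat until stable: one pass over "---" still leaves a "--" behind.
    while "--" in data:
        data = data.replace("--", "- -")
    return data
```

For example, coerce_comment("a--b") gives "a- -b" and coerce_text("a\fb") gives "a b".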
Comment 1 Ian 'Hixie' Hickson 2008-06-26 22:57:56 UTC
Yeah, I guess we'll add a section about this in the parser section somewhere. It'll free us up a bit and allow us to diverge more from XML, instead of vainly trying to keep the two in sync all the time.
Comment 2 Henri Sivonen 2008-06-27 05:55:29 UTC
(In reply to comment #1)
> It'll free us up a bit and allow us to diverge more from XML, instead of vainly
> trying to keep the two in sync all the time.

I think we should not add any new cases where the HTML5 parsing algorithm can produce XML-incompatible parse trees. Maintaining alternative code paths for all the things I mentioned is annoying enough as is.
Comment 3 Ian 'Hixie' Hickson 2008-07-23 02:04:16 UTC
Done, but I didn't always follow your suggestions. In particular, I made bad names and attributes just get mutated so that bad characters turn into "_" characters, with clashes being dealt with by dropping attributes, instead of suggesting using a mapping function.
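
A rough sketch of the "_" mutation described in this comment, in Python. The names mutate_name and mutate_attributes are hypothetical, the NCName check is an ASCII-only approximation, and the choice to keep the first attribute and drop later ones on a clash is an assumption; the comment only says that clashes are handled by dropping attributes.

```python
def mutate_name(name: str) -> str:
    """Replace every character that is not allowed at its position with "_".

    ASCII-only approximation of NCName: letters and "_" anywhere,
    digits, "." and "-" anywhere except the first position.
    """
    chars = []
    for i, ch in enumerate(name):
        ok = ch.isascii() and (
            ch.isalpha() or ch == "_" or (i > 0 and (ch.isdigit() or ch in ".-"))
        )
        chars.append(ch if ok else "_")
    return "".join(chars)


def mutate_attributes(attrs):
    """attrs is a list of (name, value) pairs in source order.

    Returns a dict keyed by mutated name. If two names mutate to the
    same string, the later attribute is dropped (assumed policy).
    """
    result = {}
    for name, value in attrs:
        safe = mutate_name(name)
        if safe in result:
            continue  # clash: drop this attribute
        result[safe] = value
    return result
```

Here "a<b" and "a>b" both mutate to "a_b", which is exactly the kind of clash that forces an attribute to be dropped.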
Comment 4 Henri Sivonen 2008-07-23 06:39:07 UTC
(In reply to comment #3)
> Done, 

Thanks.

> but I didn't always follow your suggestions. 

"Construct the DOM as if appropriate namespace declarations were in scope.", "Construct the DOM as if these were default namespace declarations." and "Construct the DOM as if these were namespace prefix declarations." are vague compared to saying that it is permissible to a) drop NS declarations and b) synthesize NS declarations.

> In particular, I made bad
> names and attributes just get mutated so that bad characters turn into "_"
> characters, with clashes being dealt with by dropping attributes, instead of
> suggesting using a mapping function.

I'd much prefer having a mapping function that can't cause clashes. That way, I don't need to deal with attribute name clashes. Also, when the mapping function cannot cause element name clashes, implementations that don't maintain a separate stack comparison name and an app-exposed name for elements would automatically be protected. (The Validator.nu parser already maintains a stack comparison name and an exposed name separately, to allow pointer compares instead of case-insensitive string compares when an SVG camelCase name is the exposed name on the stack.)
Comment 5 Thomas Broyer 2008-07-23 08:33:48 UTC
(In reply to comment #4)
> 
> > In particular, I made bad
> > names and attributes just get mutated so that bad characters turn into "_"
> > characters, with clashes being dealt with by dropping attributes, instead of
> > suggesting using a mapping function.
> 
> I'd much prefer having a mapping function that can't cause clashes. That way, I
> don't need to deal with attribute name clashes.

Why not use ISO9075-like encoding? (i.e. replace \uXXXX with _xXXXX_)
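
A minimal Python sketch of what such an encoding could look like. This is an illustration only and not the normative ISO 9075 algorithm: the function names are invented, the NCName check is an ASCII-only approximation, and escaping a literal "_" that is followed by "x" is an assumption made here so that already-escaped-looking input cannot collide with real escapes.

```python
def is_allowed(ch: str, first: bool) -> bool:
    """ASCII-only approximation of the NCName character rules."""
    if not ch.isascii():
        return False
    if ch.isalpha() or ch == "_":
        return True
    return not first and (ch.isdigit() or ch in ".-")


def iso9075_like_escape(name: str) -> str:
    """Replace each disallowed character with _xXXXX_ (uppercase hex)."""
    out = []
    for i, ch in enumerate(name):
        # A literal "_x" would look like the start of an escape, so the
        # underscore itself is escaped in that case.
        looks_like_escape = ch == "_" and name[i + 1:i + 2] == "x"
        if is_allowed(ch, i == 0) and not looks_like_escape:
            out.append(ch)
        else:
            out.append("_x%04X_" % ord(ch))
    return "".join(out)
```

With this sketch "foo<bar" becomes "foo_x003C_bar", while an input that already contains "_x003C_" is escaped to something different, so the two cannot clash.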
Comment 6 Ian 'Hixie' Hickson 2008-07-23 08:42:11 UTC
Fixed. I ended up going with a U12345 scheme. (For various reasons we need a capital letter, and U seems to be the most obvious letter.)
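
A sketch of how a U-prefixed scheme might work, in Python. The comment does not say how many hex digits follow the "U"; six uppercase hex digits is an assumption here, as are the function names and the ASCII-only approximation of NCName.

```python
def is_allowed(ch: str, first: bool) -> bool:
    """ASCII-only approximation of the NCName character rules."""
    if not ch.isascii():
        return False
    if ch.isalpha() or ch == "_":
        return True
    return not first and (ch.isdigit() or ch in ".-")


def u_escape(name: str) -> str:
    """Replace each disallowed character with "U" plus its code point.

    The code point is written as six uppercase hex digits (assumed width).
    """
    return "".join(
        ch if is_allowed(ch, i == 0) else "U%06X" % ord(ch)
        for i, ch in enumerate(name)
    )
```

Under these assumptions "foo<bar" comes out as "fooU00003Cbar". Every escaped name contains a capital "U", which matches the requirement from the original description that the result include at least one upper case ASCII character.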