This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5809 - Mitigate data loss when conforming documents are coerced to XML 1.0
Summary: Mitigate data loss when conforming documents are coerced to XML 1.0
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
Keywords: NoReply
Depends on:
Reported: 2008-06-26 13:12 UTC by Henri Sivonen
Modified: 2010-10-04 14:46 UTC
CC List: 4 users

See Also:


Description Henri Sivonen 2008-06-26 13:12:51 UTC
Over in bug 5808 I suggested a way to coerce the output of the HTML5 parsing algorithm into XML.

It's theoretically impure for conforming documents to trigger coercions that aren't mostly harmless. I therefore suggest narrowing the conformance definition accordingly.

 * The document mode isn't part of the infoset: Optionally communicate as
out-of-infoset-band data. Instruct apps to use the standards mode when not

Mostly harmless.

 * The form pointer isn't part of the infoset: Make communicating the form
pointer optional. Allow communicating it as out-of-infoset-band data. When the
form element is not an ancestor of the form control, allow a UUID id attribute
to be generated on the form element and allow a form attribute to be generated
on the form control.

Mostly harmless.
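
A minimal sketch of the id/form attribute generation described above, assuming elements are represented as plain attribute dicts. The function name, the reuse of an existing id, and the "form-" prefix (added so the generated value starts with a letter) are illustrative assumptions, not part of the proposal.

```python
import uuid

def link_form_control(form_attrs, control_attrs):
    """Record the form pointer as attributes when the form is not an ancestor.

    Both arguments are hypothetical attribute dicts for the form element
    and the form control.
    """
    form_id = form_attrs.get("id")
    if form_id is None:
        # Assumption: prefix the UUID so the id starts with a letter.
        form_id = "form-" + uuid.uuid4().hex
        form_attrs["id"] = form_id
    # Point the control at its form owner through the form attribute.
    control_attrs["form"] = form_id
    return form_id
```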

 * Some XML APIs treat the doctype as syntactic sugar: Make representing the
document type information item optional.

Mostly harmless.

 * Attributes with the local name "xmlns" or a local name starting with
"xmlns:" are not permitted attribute information items: Drop on the floor.

Mostly harmless. However, in the case of <embed>, this theoretically loses conforming data. These attributes could be excluded from what is permitted on <embed> as plug-in parameters.
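
As a rough sketch of the "drop on the floor" rule for xmlns attributes, assuming attributes arrive as a dict keyed by local name (the function name is hypothetical):

```python
def drop_xmlns_attributes(attrs):
    """Return a copy of attrs without 'xmlns' or 'xmlns:*' local names."""
    return {
        name: value
        for name, value in attrs.items()
        if name != "xmlns" and not name.startswith("xmlns:")
    }
```

On <embed>, this is exactly where a plug-in parameter named, say, xmlns:foo would be lost.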

 * Namespace declarations are not attribute information items: Drop on the
floor. (Optionally synthesize namespace information items for XLink and SVG or
MathML on <svg> and <math> nodes, respectively, and XHTML namespace information
items on HTML elements (including root) that do not have an HTML element as the

Mostly harmless.

 * Form feed is not an XML character (either literally or as a character
reference expansion): turn into a space.

Mostly harmless.
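
The form feed coercion is a one-line substitution. A sketch, assuming it is applied to character data after character references have been expanded (the function name is hypothetical):

```python
def coerce_form_feed(text):
    """Replace each U+000C FORM FEED with U+0020 SPACE."""
    return text.replace("\u000c", " ")
```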

 * The input stream contains a literal non-XML character other than form feed:

Mostly harmless, but these might as well be defined as non-conforming.

 * A comment contains "--": Replace with "- -".

Mostly harmless.
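
A sketch of the comment coercion. The rule above does not say what to do with runs of three or more hyphens, where a naive left-to-right replacement of "--" would leave "- --" behind. This version assumes the intent is to separate every adjacent pair of hyphens. The function name is hypothetical.

```python
import re

def coerce_comment_text(text):
    """Insert a space after every hyphen that is directly followed by a hyphen."""
    return re.sub(r"-(?=-)", "- ", text)
```

A trailing single hyphen is left alone here, since the proposal does not mention that case.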

 * A name is not an NCName: Use the original name on tree builder stack for
matching, but use an escaped name in the output. The escaping function must
escape each non-NCName to a unique NCName, and the result must have at least
one upper case ASCII character but must not match any known SVG camelCase name.

This is data loss in theory even if not in probable practice. Attributes that are actually used on <embed> are NCNames anyway, so forbidding non-NCNames wouldn't break anything. Forbidding data-* from forming a non-NCName would still leave a countably infinite space of names, and authors are likely to use printable ASCII anyway.
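
One possible shape for such an escaping function, as a sketch only. It makes several assumptions that are not in the proposal: the NCName check is a simplified ASCII-only approximation (the real XML 1.0 4th ed. productions allow many non-ASCII characters, which this sketch escapes unnecessarily), the input is non-empty and has already been lowercased by the parser so it contains no "U", and the escape syntax "U" plus six uppercase hex digits is invented for illustration.

```python
import re

# Simplified, ASCII-only stand-ins for the NCName start and name characters.
_START = re.compile(r"[A-Za-z_]")
_CHAR = re.compile(r"[A-Za-z0-9_.\-]")

def escape_name(name):
    """Escape characters not allowed at their position as 'U' + 6 hex digits.

    Names that already fit the simplified pattern are returned unchanged.
    """
    out = []
    for i, ch in enumerate(name):
        allowed = _START.fullmatch(ch) if i == 0 else _CHAR.fullmatch(ch)
        out.append(ch if allowed else "U%06X" % ord(ch))
    return "".join(out)
```

Because the fixed-width escape always introduces an upper case "U" followed by six hex digits, an escaped name cannot collide with a lowercased unescaped name, and it is assumed not to match any SVG camelCase name.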
Comment 1 Ian 'Hixie' Hickson 2008-06-26 22:58:45 UTC
So what exactly are the changes you're proposing? (No need to tell me what you _don't_ want me to change!)
Comment 2 Henri Sivonen 2008-06-27 06:07:32 UTC
I'm suggesting that attributes on <embed> and data-* attributes be restricted to XML 1.0 4th ed. + Namespaces NCNames for the purpose of conformance (with the consequence that xmlns:foo on <embed> ends up as non-conforming, too).

If this is against data-* principles, we could at least have this restriction on <embed> to get some theoretical purity without actually restricting any practical activities.
Comment 3 Ian 'Hixie' Hickson 2008-06-27 06:52:57 UTC
Bah. You ruin all the fun. :-P
Comment 4 Ian 'Hixie' Hickson 2008-07-01 00:10:03 UTC
Comment 5 Maciej Stachowiak 2010-03-14 13:14:40 UTC
This bug predates the HTML Working Group Decision Policy.

If you are satisfied with the resolution of this bug, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:

This bug is now being moved to VERIFIED. Please respond within two weeks. If this bug is not closed, reopened or escalated within two weeks, it may be marked as NoReply and will no longer be considered a pending comment.