11910 – @id values in polyglot markup should be XML-valid (or not?)

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Eliot Graff
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/html-xhtml-au...
Whiteboard:
Keywords:

Depends on:	11909
Blocks:

Reported:	2011-01-28 13:28 UTC by Leif Halvard Silli
Modified:	2011-08-04 05:07 UTC
CC List:	7 users

Description Leif Halvard Silli 2011-01-28 13:28:13 UTC

In a hallelujah moment, people have realized that HTML5 permits almost any character in an @id value:

http://twitter.com/#!/codepo8/status/30212174852923392
http://www.456bereastreet.com/archive/201011/html5_allows_almost_any_value_for_the_id_attribute_use_wisely/
http://mathiasbynens.be/notes/html5-id-class

However, in XML, the value of an @id has the validity constraint that it must meet XML's Name production, which amongst other things means that the first character cannot be a number. It is not a well-formedness issue; it doesn't cause a yellow screen of death.

See http://www.w3.org/TR/REC-xml/#id 
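To make the Name constraint concrete, here is a minimal Python sketch (not part of the original report) that tests a candidate @id value against the Name production. The function name `is_xml_name` is hypothetical, and the character ranges are transcribed from the NameStartChar and NameChar productions of XML 1.0 (Fifth Edition); earlier editions define Name differently.

```python
import re

# Character ranges transcribed from the NameStartChar and NameChar
# productions of XML 1.0 (Fifth Edition).
_NAME_START = (
    r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_NAME_REST = _NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NAME = re.compile("[" + _NAME_START + "][" + _NAME_REST + "]*")


def is_xml_name(value: str) -> bool:
    """Return True if value matches the XML Name production (hypothetical helper)."""
    return _NAME.fullmatch(value) is not None


print(is_xml_name("1abc"))  # leading digit
print(is_xml_name("abc1"))  # digit after the first character
```

A value that fails this check is still well-formed as attribute content; it only matters to a validating parser that knows the attribute is of type ID.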

Even HTML5 has validity constraints: space characters are not permitted (line feeds, tabs, spaces, carriage returns). And even in HTML5, this is primarily a validity issue - at least CSS selectors work flawlessly even if the @id contains a space.

But space is not a problem per XML's well-formedness rules. In XML, line feeds, tabs, spaces and CRs are no more invalid inside @id than in any other attribute: you can use all of them, as long as you escape line feeds, tabs and CRs and as long as XML validity is not an issue.
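The point about escaping can be illustrated with a small Python sketch, assuming the standard library's `xml.etree.ElementTree` (a non-validating parser built on expat). The helper name `parsed_id` is hypothetical. A literal tab or line feed inside an attribute value is normalized to a space by the XML parser, while a character reference such as `&#9;` survives as the real character.

```python
import xml.etree.ElementTree as ET


def parsed_id(markup: str) -> str:
    """Parse a one-element document and return its id value (hypothetical helper)."""
    return ET.fromstring(markup).get("id")


# A literal tab in the source is subject to attribute-value normalization.
literal_tab = parsed_id('<p id="a\tb"/>')
# Character references are inserted as-is, so the tab and line feed survive.
escaped_tab = parsed_id('<p id="a&#9;b"/>')
escaped_lf = parsed_id('<p id="a&#10;b"/>')
# An ordinary space needs no escaping.
plain_space = parsed_id('<p id="a b"/>')

print(repr(literal_tab), repr(escaped_tab), repr(escaped_lf), repr(plain_space))
```

None of these inputs is rejected, because a non-validating parser only checks well-formedness.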

OPTIONS:
 1) Disallow space in @id because it is HTML5's validity rules that matter.
      But ignore the XML validity rules.
 2) Same as 1) but say that authors SHOULD also be XML-valid
 3) Same as 2) but say that authors MUST be XML-valid

Being silent is not an option. This bug depends on bug 11909 - which is about the principles of Polyglot Markup.

Comment 1 David Carlisle 2011-01-28 13:44:15 UTC

(In reply to comment #0)

> However, in XML, the value of an @id has the validity constraint that it
> must meet XML's name production, which amongst other things means that the
> first character cannot be a number. 

This is only the case if the id attribute is of type ID.  Since (most) of the polyglot valid doctype usage would not point to a resolvable DTD at all I think that it would make sense to assume all attributes are CDATA for the purposes of this document, and also, if ever a dtd for html5 were to be produced that tried to approximate the polyglot  rules it should define id to be CDATA. However I don't think that the polyglot spec should depend on dtd at all since there is unlikely to be a normative html dtd.
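As a rough illustration of this point, the following Python sketch parses a polyglot-style document whose DOCTYPE names no DTD, using the standard library's non-validating `xml.etree.ElementTree`. The document string and variable names are made up for the example. With no attribute declarations available, the parser treats id like any other attribute, so a value starting with a digit is accepted.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical polyglot-style document: the DOCTYPE points at no DTD,
# so nothing declares the id attribute to be of type ID.
DOC = (
    "<!DOCTYPE html>"
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><p id="1st">x</p></body></html>'
)

root = ET.fromstring(DOC)
paragraph = root.find(f".//{XHTML}p")
print(paragraph.get("id"))
```

A validating parser given a DTD that declares id as ID would be a different situation; this sketch does not cover that case.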

So I don't think that this spec need say anything about id attributes other than the general comments about attributes (white space and other character normalization/quoting issues)


David

Comment 2 Leif Halvard Silli 2011-01-28 16:28:33 UTC

(In reply to comment #1)
> (In reply to comment #0)
> 
> > However, in XML, the value of an @id has the validity constraint that it
> > must meet XML's name production, which amongst other things means that the
> > first character cannot be a number. 
> 
> This is only the case if the id attribute is of type ID.  Since (most) of the
> polyglot valid doctype usage would not point to a resolvable DTD at all I think
> that it would make sense to assume all attributes are CDATA for the purposes of
> this document, and also, if ever a dtd for html5 were to be produced that tried
> to approximate the polyglot  rules it should define id to be CDATA. However I
> don't think that the polyglot spec should depend on dtd at all since there is
> unlikely to be a normative html dtd.
> 
> So I don't think that this spec need say anything about id attributes other
> than the general comments about attributes (white space and other character
> normalization/quoting issues)

Right. Thus I think that Polyglot Markup should *point out* that there are, in fact, no restrictions from the XML side on the @id in a (typical) polyglot. Maybe it could simply point out that the @id in a (typical) polyglot is CDATA. Because, otherwise, many are used to the fact that there have been restrictions on the content of @id.

That said, Polyglot Markup does not (at this point) forbid the legacy DOCTYPEs that are permitted in HTML5. 
See http://www.w3.org/TR/html-polyglot/#doctype
And http://www.w3.org/TR/html5/syntax#the-doctype

Quoting HTML5: "A DOCTYPE containing an obsolete permitted DOCTYPE string is an obsolete permitted DOCTYPE. Authors should not use obsolete permitted DOCTYPEs, as they are unnecessarily long."

So, for example, if an author was using the XHTML 1.0 doctype, what then? Perhaps Polyglot Markup should say that, *if* authors use a DTD, then they SHOULD (MUST?) *also* comply (A) with the general validity constraints of XML (such as the Name production for @id) as well as (B) with all the other syntax rules (for attributes and elements) of that particular DTD? (Even if one uses a DTD, it is not usually a problem - in a normal browser - to break e.g. the validity rules of the Name production, as long as one doesn't use a validating parser etc. Hence I said "should" instead of "must".)

Polyglot Markup has been compared to Appendix C of XHTML 1.0 - which has been said to be too inaccurate. Perhaps we don't need to make up for that inaccuracy. But Polyglot Markup actually becomes something completely different from Appendix C if it simply ignores the issues of having a DTD in the DOCTYPE.

There are, as I see it, very few problems with Appendix C - except that it does not talk about the DOM.

Comment 3 David Carlisle 2011-01-28 16:40:06 UTC

(In reply to comment #2)


> So, for example, if an author was using XHTML 1.0 doctype, what then?

I don't think that it should mention validity at all. If authors want to make sure their documents are valid there are plenty of tools to do that without consulting this polyglot spec.

It would be fruitless to try to enumerate the extra constraints one must satisfy to ensure that a conforming html document that uses a legacy but conforming doctype is valid xml.

@id being Name is a tiny fraction of it, also you'd have to not use any new elements such as canvas, and the usage of all other elements would have to match the dtd in addition to what the html(5) spec says. So long as the document sticks to the single aim of specifying how to get conforming html documents to have equivalent parse trees if parsed as XML, then xml validity is irrelevant.

David

Comment 4 Leif Halvard Silli 2011-01-28 17:48:04 UTC

(In reply to comment #3)
> (In reply to comment #2)

> So long as the
> document sticks to the single aim of specifying how to get conforming html
> documents to have equivalent parse trees if parsed as XML, then xml validity is
> irrelevant.

I agree that it would be very good if it could stick to that single goal. Because then this document would fit on *all* flavours of XML/XHTML. And would thus become a true replacement for  Appendix C of XHTML 1.

But if so, then Henri's idea that it should be a subset of XML 1.0 and HTML5 does not fit 100%. It only fits because it is HTML5 that defines what the HTML DOM looks like.

Even HTML5 defines things as forbidden that do not constitute a problem with regard to DOM equality. One thing I mentioned in this very bug is the fact that HTML5 forbids space characters inside its @id attribute. This is not a problem from a DOM-equality point of view.

Perhaps one way to get this document turned in that direction could be to split it sharply in two parts: one part which describes the general rules, and another part which uses HTML5 as an example. It would then be up to those that use other flavours of XHTML to find out what it would mean for them to be conforming.

In that regard, the use of <?xml version="1.0" encoding="UTF-8" ?> is not a DOM issue - except in one particular version of Internet Explorer - IE6 (and earlier). It is far worse that if you place a <!--comment--> in front of the <!DOCTYPE html>, then this triggers quirks-mode in *all* versions of Internet Explorer. And this even happens if the comment comes between the XML declaration and the DOCTYPE. Thus, this triggers quirks-mode in IE6 to IE9, whether you remove the XML declaration or not:

<?xml version="1.0" encoding="UTF-8" ?>
<!--Hello, IE -->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

(Test it yourself or read http://en.wikipedia.org/wiki/Quirks_mode#Triggering_different_rendering_modes)
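For authors who want to check their own files for this pattern, here is a minimal Python sketch. The function name `before_doctype` and the regular expression are hypothetical helpers written for this illustration; they only report what precedes the DOCTYPE and make no claim about how any particular browser reacts.

```python
import re

# Matches, after optional whitespace, an XML declaration, a comment,
# or the start of the DOCTYPE (one alternative per capture group).
_TOKEN = re.compile(
    r"\s*(?:(<\?xml\b.*?\?>)|(<!--.*?-->)|(<!DOCTYPE\b))",
    re.IGNORECASE | re.DOTALL,
)


def before_doctype(source: str) -> list:
    """List the kinds of constructs found ahead of the DOCTYPE (hypothetical helper)."""
    found = []
    pos = 0
    while True:
        match = _TOKEN.match(source, pos)
        if match is None or match.group(3):
            return found
        found.append("xml-declaration" if match.group(1) else "comment")
        pos = match.end()


SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    "<!--Hello, IE -->\n"
    "<!DOCTYPE html>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
)
print(before_doctype(SAMPLE))
```

An empty list means the DOCTYPE is the first construct in the file.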

In other words, the forbiddance of the xml declaration is only a conformance thing. Anyway, what do you think about splitting the document as I suggested?

Comment 5 David Carlisle 2011-01-30 13:09:05 UTC

(In reply to comment #4)
> what do you think about splitting the document as I suggested?

I think the document should only concern itself with conforming html(5) documents. Otherwise it will get totally out of hand, for example, the following fragment

<p>aaa<ul><li></li></ul>kkk</p>

is well formed xml with wildly incompatible xml and html dom. This (along with most of the "special parsing rules" of the html parser) is not covered by the rules in the polyglot spec. This is OK so long as the polyglot spec only concerns itself with conforming documents. The above fragment is not conforming html because of the </p>.
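The divergence can be sketched in Python. The XML side uses the standard library's `xml.etree.ElementTree`. The HTML side is only a toy: `ToyHtmlBuilder` is a hypothetical class that imitates two HTML parser behaviours (certain start tags close an open <p>, and a stray </p> creates an empty <p>), and its `CLOSES_P` set is a partial, illustrative list. It is not an HTML5 tree builder.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

FRAGMENT = "<p>aaa<ul><li></li></ul>kkk</p>"


def xml_tree(elem):
    """Render an ElementTree element as a nested list: [tag, child, ...]."""
    node = [elem.tag]
    if elem.text:
        node.append(elem.text)
    for child in elem:
        node.append(xml_tree(child))
        if child.tail:
            node.append(child.tail)
    return node


class ToyHtmlBuilder(HTMLParser):
    # Partial, illustrative list of start tags that close an open <p>.
    CLOSES_P = {"p", "ul", "ol", "div", "table"}

    def __init__(self):
        super().__init__()
        self.root = ["#fragment"]
        self.stack = [self.root]

    def _open(self, tag):
        node = [tag]
        self.stack[-1].append(node)
        self.stack.append(node)

    def _close(self, tag):
        # Pop back to the nearest open element with this name, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in self.CLOSES_P:
            self._close("p")
        self._open(tag)

    def handle_endtag(self, tag):
        # A </p> with no open <p> yields an empty <p> element.
        if not self._close(tag) and tag == "p":
            self._open("p")
            self._close("p")

    def handle_data(self, data):
        self.stack[-1].append(data)


def html_tree(source):
    builder = ToyHtmlBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


xml_result = xml_tree(ET.fromstring(FRAGMENT))
html_result = html_tree(FRAGMENT)
print(xml_result)
print(html_result)
```

In the XML tree the <ul> is a child of the <p>; in the toy HTML tree the <ul> is a sibling of the first <p>, "kkk" sits outside any paragraph, and a second, empty <p> appears at the end.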

Unfortunately the document is still not sufficiently explicit that the rules that it gives are only sufficient to produce compatible DOM if the document is well formed xml and conforming html.

Comment 6 Leif Halvard Silli 2011-01-30 14:24:58 UTC

(In reply to comment #5)
> (In reply to comment #4)
>  what do you think about splitting the document as I suggested?
> 
> I think the document should only concern itself with conforming html(5)
> documents. Otherwise it will get totally out of hand,

As I show in my comments in Bug 11909, I think it gets out of hand if it is *not* split. The document needs to discern between principles and praxis.

> for example, the  following fragment
> 
> <p>aaa<ul><li></li></ul>kkk</p>
> 
> is well formed xml with wildly incompatible xml and html dom. This (along with
> most of the "special parsing rules" of the html parser) is not covered by the
> rules in the polyglot spec.

The <p> is already mentioned in the polyglot spec, no? HTML5 also mentions it in the "restrictions on the content model" section. It would be simple to provide a list of those elements that have special parsing attached to them.

> This is OK so long as the polyglot spec only
> concerns itself with conforming documents. The above fragment is not conforming
> html because of the </p>.
> 
> Unfortunately the document is still not sufficiently explicit that the rules
> that it gives are only sufficient to produce compatible DOM if the document is
> well formed xml and conforming html.

Here I think you are mixing things: XML is not alone in discerning between "working" (aka "well-formed") and valid (aka "conformance"). HTML has the same concept.

 E.g. HTML5 says that it is forbidden to set the value of the img@border to a non-zero value. Thus, this is forbidden <img border="9" src="i" alt="i">. We can both agree that it is not an issue, with regard to getting the exact same DOM, whether @border is set to "9" or "0". 

If this document is supposed to replace Appendix C, then it must, in my view, also describe principles.

Comment 7 David Carlisle 2011-01-30 16:47:45 UTC

(In reply to comment #6)

> The <p> is already mentioned in the polyglot spec, no? 


<p> is mentioned, but not in any way related to the issue in that example.

> HTML5 also mention it in
> the "restrictions on the content model" section. It would be simple to provide
> a list of those element that have special parsing attached to them.

You would also have to say exactly what the special parsing rules are, and what subset of well formed documents produce compatible xml parse trees despite those rules. This would make the polyglot spec many many times larger than it currently is, for very little benefit, and the chance of getting it right would be close to nil. For conforming documents the rules are irksome but not too difficult to state, but for non-conforming documents the rules are massively more complicated.
 

> 
> Here I think you are mixing things: XML is not alone in discerning between
> "working" (aka "well-formed") and valid (aka "conformance"). HTML has the same
> concept.

Not really, html(5) produces a parse tree for (almost) any input, just some inputs are declared non conforming.
> 
>  E.g. HTML5 says that it is forbidden to set the value of the img@border to a
> non-zero value. Thus, this is forbidden <img border="9" src="i" alt="i">. We
> can both agree that it is not an issue, with regard to getting the exact same
> DOM, whether @border is set to "9" or "0". 

The fact that there are a few non conforming documents for which it is possible to say something about xml parsing isn't enough to justify saying anything about them here.
> 
> If this document is supposed to replace Appendix C, then it must, in my view,
> also describe principles.

I can't imagine any reason to make the document many many times longer just to tell people how they can use xml tools to make non conforming documents rather than just saying how to produce conforming documents.

Comment 8 Leif Halvard Silli 2011-01-30 17:42:30 UTC

(In reply to comment #7)
> (In reply to comment #6)

> > HTML5 also mention it in
> > the "restrictions on the content model" section. It would be simple to provide
> > a list of those element that have special parsing attached to them.
> 
> You would also have to say exactly what the special parsing rules are,

Isn't it a good start to just list them? Just create a header saying "the following features are autogenerated if you don't insert them, and must therefore be explicitly added for DOM compatibility". And then list the features/elements. 

> and what
> subset of well formed documents produce compatible xml parse trees despite
> those rules. This would make the polyglot many many times larger than it
> currently is, for very little benefit, and the chance of getting it right would
> be close to nil. For conforming documents the rules are irksome but not too
> difficult to state, but for non-conforming documents the rules are massively
> more complicated.

How about discussing this in bug 11909? I must admit that I have not had in mind creating such a large document as you think that what I say would have led to.

> > Here I think you are mixing things: XML is not alone in discerning between
> > "working" (aka "well-formed") and valid (aka "conformance"). HTML has the same
> > concept.
> 
> Not really, html(5) produces a parse tree for (almost) any input, just some
> inputs are declared non conforming.

Formally you are correct, I guess. 

But we could also say that HTML(5) has another consequence of being unwell-formed: soft-punishment instead of yellow-screen-of-death.

I think those errors that are related to wrong nesting can be placed in the un-well-formed category.

> >  E.g. HTML5 says that it is forbidden to set the value of the img@border to a
> > non-zero value. Thus, this is forbidden <img border="9" src="i" alt="i">. We
> > can both agree that it is not an issue, with regard to getting the exact same
> > DOM, whether @border is set to "9" or "0". 
> 
> The fact that there are a few non conforming documents for which it is possible
> to say something about xml parsing isn't enough to justify saying anything
> about them here.

I argue for operating with clear concepts. I don't argue for talking about @border.

> > If this document is supposed to replace Appendix C, then it must, in my view,
> > also describe principles.
> 
> I can't imagine any reason to make the document many many times longer just to
> tell people how they can use xml tools to make non conforming documents rather
> than just saying how to produce conforming documents.

Me neither. Nor do I see where you get the idea that that is what I argue for.

Comment 9 Leif Halvard Silli 2011-01-30 17:44:34 UTC

(In reply to comment #5)

> Unfortunately the document is still not sufficiently explicit that the rules
> that it gives are only sufficient to produce compatible DOM if the document is
> well formed xml and conforming html.

It is also not enough that the document is conforming HTML(5): <noscript> does not work as intended in XML.  The goal of polyglot markup is "*work the same way*".

So conformance is not the ground principle. The ground principle is 'compatible DOM'.

So, I think we should define polyglotness separately from conformance: just as it is possible to write HTML that works fine even if it doesn't conform, it is also easy to write polyglot documents that don't conform.

It is only in *some* fields that polyglot documents have a higher requirement for conformance than "normal" HTML/XHTML have.

You asked, ironically, in another bug, if we perhaps should say that a document is polyglot if it is parsed by a non-validating XML parser. In a similar way, we can also say that it really doesn't matter for polyglotness whether the document has an xml declaration or not - instead, just make sure that you don't use a stupid parser that sets itself in quirks-mode because of that declaration - I really mean that.

TBL wrote that we should consider polyglot documents as its own breed of documents: http://lists.w3.org/Archives/Public/public-html/2010Jun/0225.html

I think, to live up to that expectation, the polyglot spec should set up polyglotness principles. Those principles should not be taken out of the air. 

Since the purpose of polyglot documents is to create documents that work the same way in HTML and XML, it also isn't enough to only focus on (regular) HTML-conformance.

I prefer to have further debates about the scope of the document in bug 11909.

Comment 10 Eliot Graff 2011-02-12 00:51:36 UTC

In the Editor's Draft of 11 February 2011, I created section 7.3 "Attributes with Special Considerations" and within it, section 7.3.1 "The id Attribute," which states, 

]]
Polyglot markup does not contain any space characters within the value of an id attribute. This is because values for the id attribute may not contain space characters in HTML5. [HTML5]
[[
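The rule quoted above can be expressed as a short check. This is a minimal Python sketch, not text from the Editor's Draft: the function name `is_html5_id` is hypothetical, and the set of space characters (space, tab, line feed, form feed, carriage return) and the requirement of at least one character are taken from HTML5's definition of the id attribute.

```python
# HTML5 "space characters": space, tab, line feed, form feed, carriage return.
HTML5_SPACE_CHARACTERS = frozenset(" \t\n\f\r")


def is_html5_id(value: str) -> bool:
    """Return True if value is non-empty and has no HTML5 space characters."""
    return len(value) > 0 and not any(
        ch in HTML5_SPACE_CHARACTERS for ch in value
    )


print(is_html5_id("1st"))    # a leading digit is fine in HTML5
print(is_html5_id("my id"))  # contains a space
```

Note that a leading digit passes this check, which is where the HTML5 rule and the XML Name production for attributes of type ID differ.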

Good catch. Thanks for the feedback. 

Eliot

Comment 11 Leif Halvard Silli 2011-02-13 19:51:30 UTC

(In reply to comment #10)

> This is because values for the id attribute may not contain space
> characters in HTML5. [HTML5]
> [[
> 
> Good catch. Thanks for the feedback. 

I think this is a good catch/summary of yours as well: it is in fact HTML5 and not XML which defines the requirement!

Well done.  I'm closing the bug.

Comment 12 Michael[tm] Smith 2011-08-04 05:07:16 UTC

mass-move component to LC1

Comment 13 Michael[tm] Smith 2011-08-04 05:07:37 UTC

mass-move component to LC1