This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 22008 - Make it clear that there are syntax differences, add one about id attribute
Summary: Make it clear that there are syntax differences, add one about id attribute
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 differences from HTML4
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Simon Pieters
QA Contact: HTML WG Bugzilla archive list
URL: http://www.w3.org/TR/html5-diff/
Reported: 2013-05-11 09:25 UTC by Jukka K. Korpela
Modified: 2013-05-13 20:40 UTC

Description Jukka K. Korpela 2013-05-11 09:25:17 UTC
At the start of clause “2 Syntax”, the statement “HTML defines an HTML syntax that is compatible with HTML4 and XHTML1 documents published on the Web” should be less absolute, e.g. the word “mostly” could be added before “compatible”. In subclause 2.4, e.g. “Attributes have to be separated by at least one whitespace character” means that some valid and existing HTML 4.01 documents are not valid HTML5.

The following information should be added to subclause 2.4:

The id attribute syntax now allows any nonempty string that does not contain space characters. This is much more liberal than the HTML 4 syntax, but on the other hand it disallows spaces at the start and at the end (id=" foo " is valid HTML 4 but not valid HTML5).
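
The two id rules described above can be compared with a short Python sketch. The function names and the whitespace sets are assumptions made for illustration, the HTML 4 check models SGML normalization simply as stripping surrounding whitespace, and the uniqueness requirement is not checked.

```python
import re

# HTML 4.01 ID token: a letter followed by letters, digits, "-", "_", ":" or "."
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*\Z")

# Assumed whitespace set stripped by SGML normalization (simplified)
SGML_WHITESPACE = " \t\r\n"

# HTML5 "space characters": space, tab, line feed, form feed, carriage return
HTML5_SPACE = " \t\n\f\r"


def valid_html4_id(raw: str) -> bool:
    """Check an id value after surrounding whitespace has been ignored."""
    return HTML4_ID.match(raw.strip(SGML_WHITESPACE)) is not None


def valid_html5_id(raw: str) -> bool:
    """Check an id value exactly as written: non-empty, no space characters."""
    return raw != "" and not any(ch in HTML5_SPACE for ch in raw)


if __name__ == "__main__":
    for value in [" foo ", "foo", "#1!", ""]:
        print(repr(value), valid_html4_id(value), valid_html5_id(value))
```

Under these assumptions, " foo " passes only the HTML 4 check, while a value such as "#1!" passes only the HTML5 check.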
Comment 1 Glenn Adams 2013-05-11 17:40:34 UTC
(In reply to comment #0)
> At the start of clause “2 Syntax”, the statement “HTML defines an HTML
> syntax that is compatible with HTML4 and XHTML1 documents published on the
> Web” should be less absolute, e.g. the word “mostly” could be added before
> “compatible”. In subclause 2.4, e.g. “Attributes have to be separated by at
> least one whitespace character” means that some valid and existing HTML 4.01
> documents are not valid HTML5.

Could you give an example of valid attributes in HTML 4.01 not separated by at least one whitespace character?

Clearly this wouldn't work in the case of unquoted or valueless attributes, but even in the case of quoted value attributes, whitespace separation is required by [1]:

"Any number of (legal) attribute value pairs, separated by spaces, may appear in an element's start tag."

[1] http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2

> 
> The following information should be added to subclause 2.4:
> 
> The id attribute syntax now allows any nonempty string that does not contain
> space characters. This is much more liberal than the HTML 4 syntax, but on
> the other hand it disallows spaces at the start and at the end (id=" foo "
> is valid HTML 4 but not valid HTML5).

I also see that Bert Bos raised this point in [2]. AFAICT, reviewing the current HTML 5.0 CR [3], I don't see that this has changed; i.e., HTML5 does not prescribe attribute value normalization while HTML4 does (via SGML). This seems to present a significant change in backward compatibility.

[2] http://w3-org.9356.n7.nabble.com/html5-Attribute-value-normalization-is-not-backwards-compatible-td193859.html
[3] http://www.w3.org/TR/2012/CR-html5-20121217/
Comment 2 Jukka K. Korpela 2013-05-11 21:06:35 UTC
(In reply to comment #1)
 
> Could you give an example of valid attributes in HTML 4.01 not separated by
> at least one whitespace character?

<p class="foo"id="bar">

> Clearly this wouldn't work in the case of unquoted or valueless attributes,
> but even in the case of quoted value attributes, whitespace separation is
> required by [1]:
> 
> "Any number of (legal) attribute value pairs, separated by spaces, may
> appear in an element's start tag."

This is one of the cases where HTML 4.01 does not paint an accurate picture of SGML. That text is not normative; the normative reference is to SGML. And while the SGML standard is exceptionally vague on this issue, it seems to mean that a space is not needed after a delimiter. And that's how the W3C validator works, and so do browsers. But HTML5 sets a different rule – an improvement, I would say, since <p class="foo"id="bar"> is mildly confusing.

The most important thing is that there is a change from HTML 4.01, and existing documents may become invalid.
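
A rough way to find start tags affected by the HTML5 rule is sketched below in Python. The regular expression and the function name unseparated_attributes are hypothetical; this is a heuristic over a single start tag, not a conforming HTML parser.

```python
import re

# One attribute, with the whitespace that precedes it captured separately.
ATTRIBUTE = re.compile(
    r"""(?P<ws>[ \t\n\f\r]*)                    # whitespace before the attribute
        (?P<name>[^\s"'<>/=]+)                  # attribute name
        (?:[ \t\n\f\r]*=[ \t\n\f\r]*
           (?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?   # optional value
    """,
    re.VERBOSE,
)

TAG_NAME = re.compile(r"<[A-Za-z][A-Za-z0-9]*")


def unseparated_attributes(start_tag: str) -> list[str]:
    """Return the names of attributes not preceded by any whitespace."""
    tag = TAG_NAME.match(start_tag)
    if tag is None:
        raise ValueError("not a start tag")
    pos = tag.end()
    names = []
    while True:
        attr = ATTRIBUTE.match(start_tag, pos)
        if attr is None:
            break  # reached ">" or "/>" or something unexpected
        if attr.group("ws") == "":
            names.append(attr.group("name"))
        pos = attr.end()
    return names


if __name__ == "__main__":
    print(unseparated_attributes('<p class="foo"id="bar">'))
    print(unseparated_attributes('<p class="foo" id="bar">'))
```

For the example above, only the id attribute is reported, since it directly follows the closing quotation mark of the class value.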
Comment 3 Glenn Adams 2013-05-11 22:09:33 UTC
(In reply to comment #2)
> (In reply to comment #1)
>  
> > Could you give an example of valid attributes in HTML 4.01 not separated by
> > at least one whitespace character?
> 
> <p class="foo"id="bar">
> 
> > Clearly this wouldn't work in the case of unquoted or valueless attributes,
> > but even in the case of quoted value attributes, whitespace separation is
> > required by [1]:
> > 
> > "Any number of (legal) attribute value pairs, separated by spaces, may
> > appear in an element's start tag."
> 
> This is one of the cases where HTML 4.01 does not paint an accurate picture
> of SGML. That text is not normative; the normative reference is to SGML. And
> while the SGML standard is exceptionally vague on this issue, it seems to
> mean that a space is not needed after a delimiter.

Hmm, I didn't know that. Or if I did, I had forgotten it. Now I'm going to have to chase down my copy of ISO-8879 for the first time in about 20 years. Charles Goldfarb, what were you thinking???!


> And that's how the W3C
> validator works, and so do browsers. But HTML5 sets a different rule – an
> improvement, I would say, since <p class="foo"id="bar"> is mildly confusing.
> 
> The most important thing is that there is a change from HTML 4.01, and
> existing documents may become invalid.
Comment 4 Jukka K. Korpela 2013-05-11 22:38:55 UTC
(In reply to comment #3)

> > while the SGML standard is exceptionally vague on this issue, it seems to
> > mean that a space is not needed after a delimiter.
> 
> Hmm, I didn't know that. Or if I did, I had forgotten it. Now I'm going to
> have to chase down my copy of ISO-8879 for the first time in about 20 years.
> Charles Goldfarb, what were you thinking???!

This is probably not relevant here (few people are interested in SGML, especially in the HTML context), but for the record: ISO 8879 clause defines “attribute specification list = attribute specification*” (so no separator is required or allowed) and “attribute specification = s*, ( name, s*, vi, s* )?, attribute value specification”, so whitespace is allowed before an attribute name. But it adds later there: “The leading s can only be omitted from an attribute specification that follows a delimiter.” My interpretation is that although s* denotes optional whitespace, at least one whitespace character is really required there, except when the preceding character is a delimiter. And the reference delimiter set can be found on p. 360 of the SGML Handbook; it includes quotation mark (") and apostrophe ('). Unless I’m missing something, other delimiters cannot appear in this context (since an attribute value containing them needs to be quoted, by HTML 4.01 rules).
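
The interpretation given in this comment can be written down as a small Python sketch. It encodes only the reading stated here (whitespace before an attribute may be omitted only after a delimiter) and has not been checked against ISO 8879 itself; the names DELIMITERS, SGML_WHITESPACE and leading_s_ok are assumptions.

```python
# Delimiters that, per the comment above, can precede an attribute in HTML 4.01:
# quotation mark and apostrophe (closing a quoted attribute value).
DELIMITERS = {'"', "'"}

# Assumed whitespace characters standing in for the SGML "s" separator.
SGML_WHITESPACE = {" ", "\t", "\r", "\n"}


def leading_s_ok(tag_text: str, attr_start: int) -> bool:
    """Return True if the attribute starting at attr_start is acceptably
    separated: preceded by whitespace, or directly by a delimiter."""
    previous = tag_text[attr_start - 1]
    return previous in SGML_WHITESPACE or previous in DELIMITERS


if __name__ == "__main__":
    tag = '<p class="foo"id="bar">'
    # "id" directly follows the closing quotation mark, a delimiter.
    print(leading_s_ok(tag, tag.index("id")))
```

Under this reading, the attribute id in <p class="foo"id="bar"> is acceptable in HTML 4.01 because the character before it is a quotation mark.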
Comment 5 Glenn Adams 2013-05-11 23:15:35 UTC
(In reply to comment #4)
> (In reply to comment #3)
> 
> > > while the SGML standard is exceptionally vague on this issue, it seems to
> > > mean that a space is not needed after a delimiter.
> > 
> > Hmm, I didn't know that. Or if I did, I had forgotten it. Now I'm going to
> > have to chase down my copy of ISO-8879 for the first time in about 20 years.
> > Charles Goldfarb, what were you thinking???!
> 
> This is probably not relevant here (few people are interested in SGML,
> especially in the HTML context), but for the record: ISO 8879 clause defines
> “attribute specification list = attribute specification*” (so no separator
> is required or allowed) and “attribute specification = s*, ( name, s*, vi,
> s* )?, attribute value specification”, so whitespace is allowed before an
> attribute name. But it adds later there: “The leading s can only be omitted
> from an attribute specification that follows a delimiter.” My interpretation
> is that although s* denotes optional whitespace, at least one whitespace
> character is really required there, except when the preceding character is a
> delimiter. And the reference delimiter set can be found on p. 360 of the
> SGML Handbook; it includes quotation mark (") and apostrophe ('). Unless I’m
> missing something, other delimiters cannot appear in this context (since an
> attribute value containing them needs to be quoted, by HTML 4.01 rules).

Thanks for going to the trouble of digging that out. I intended to investigate only to satisfy my curiosity, not because I doubted your original statement.
Comment 6 Simon Pieters 2013-05-13 15:29:24 UTC
(In reply to comment #0)
> At the start of clause “2 Syntax”, the statement “HTML defines an HTML
> syntax that is compatible with HTML4 and XHTML1 documents published on the
> Web” should be less absolute, e.g. the word “mostly” could be added before
> “compatible”. In subclause 2.4, e.g. “Attributes have to be separated by at
> least one whitespace character” means that some valid and existing HTML 4.01
> documents are not valid HTML5.

https://github.com/whatwg/html-differences/commit/bdff30f0952172f091c0a8ff0b6c1c0aee41ece6

> The following information should be added to subclause 2.4:
> 
> The id attribute syntax now allows any nonempty string that does not contain
> space characters. This is much more liberal than the HTML 4 syntax, but on
> the other hand it disallows spaces at the start and at the end (id=" foo "
> is valid HTML 4 but not valid HTML5).

This is mentioned elsewhere:

"The id global attribute is now allowed to have any value, as long as it is unique, is not the empty string, and does not contain space characters."
http://html-differences.whatwg.org/#changed-attributes
Comment 7 Jukka K. Korpela 2013-05-13 16:18:03 UTC
(In reply to comment #6)

> > The following information should be added to subclause 2.4:
> > 
> > The id attribute syntax now allows any nonempty string that does not contain
> > space characters. This is much more liberal than the HTML 4 syntax, but on
> > the other hand it disallows spaces at the start and at the end (id=" foo "
> > is valid HTML 4 but not valid HTML5).
> 
> This is mentioned elsewhere:
> 
> "The id global attribute is now allowed to have any value, as long as it is
> unique, is not the empty string, and does not contain space characters."
> http://html-differences.whatwg.org/#changed-attributes

The current formulation makes the change sound like a pure extension, when it in fact invalidates some HTML 4 constructs.

But maybe it would be best to add the following general point to 2.4:

• Leading and trailing white space in attribute values is not ignored. This makes some valid HTML4 constructs like id=" ab " invalid.
Comment 8 Simon Pieters 2013-05-13 20:40:42 UTC
Good point, I forgot about normalization.

https://github.com/whatwg/html-differences/commit/20f8d772b5335b627e5ef45da413a220b5a4b85e