This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 19102 - A semicolon-less named character reference in an attribute should also be treated as a parse error.
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-28 03:27 UTC by contributor
Modified: 2013-01-31 00:37 UTC
CC List: 7 users

Description contributor 2012-09-28 03:27:11 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
Multipage: http://www.whatwg.org/C#tokenizing-character-references
Complete: http://www.whatwg.org/c#tokenizing-character-references

Comment:
A semicolon-less named character reference in an attribute should also be
treated as a parse error.

Posted from: 119.161.158.96 by kennyluck@csail.mit.edu
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
Comment 1 Kang-Hao (Kenny) Lu 2012-09-28 03:34:31 UTC
I think the spec intentionally treats this case as a non-error, but I propose we make it a parse error (or other author conformance violation) *temporarily*, until the market share of IE9 and below drops below a certain level.

The reason is simple: this, being surprising, still traps people.

Also, validator.nu (and hence Firefox source view) treats that as an error, which I believe to be a good thing for authors. See, for example, http://validator.nu/?doc=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html%3E%3Ctitle%3E%3C%2Ftitle%3E%3Ca+href%3D%22a%3D1%26reg%3D2%22%3E%3C%2Fa%3E
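
For reference, the test document behind that validator.nu link can be recovered by decoding the `doc` query parameter. A minimal Python sketch using only the standard library (the variable names are illustrative, not part of any tool mentioned here):

```python
from urllib.parse import urlsplit, parse_qs

# The validator.nu link from the comment, split across two string literals.
url = ("http://validator.nu/?doc=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html%3E"
       "%3Ctitle%3E%3C%2Ftitle%3E%3Ca+href%3D%22a%3D1%26reg%3D2%22%3E%3C%2Fa%3E")

# parse_qs percent-decodes the value and turns '+' into a space.
doc = parse_qs(urlsplit(url).query)["doc"][0]
print(doc)
```

The decoded value is a data: URL whose body is `<!DOCTYPE html><title></title><a href="a=1&reg=2"></a>`, i.e. the semicolon-less "&reg" inside an attribute value.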
Comment 2 Samuel Bronson 2012-10-15 19:47:58 UTC
If this case is intentionally not a parse error, this should be made explicit, like in the case marked: "Not a character reference. No characters are consumed, and nothing is returned. (This is not an error, either.)"

But I don't see why this shouldn't be an error: that doesn't appear to mean anything more than "documents may not do this, validators have to warn if they do, and it's okay for implementations to give up on seeing it."  Only the license to give up could actually lead to breakage, and the big browsers won't actually do this until it's about time to excise this case, will they?
Comment 3 Ian 'Hixie' Hickson 2012-10-19 22:58:48 UTC
I'm confused. Can you give a precise example of what you think should be invalid but isn't?

If you mean this:

   <a href="a=1&reg=2"></a>

...then that is intentionally conforming, because it's very common and harmless.

(Looks like validator.nu isn't up to date on that.)
Comment 4 Kang-Hao (Kenny) Lu 2012-10-20 00:50:51 UTC
(In reply to comment #3)
> I'm confused. Can you give a precise example of what you think should be
> invalid but isn't?

This example is one such case.

> 
> If you mean this:
> 
>    <a href="a=1&reg=2"></a>
> 
> ...then that is intentionally conforming, because it's very common and
> harmless.

The harm is that it won't work in IE9 and below. Anyway, I am not at all clear about the value of author conformance, but it doesn't seem to be a bad thing to warn the user or raise an error whenever the validator sees this.

> (Looks like validator.nu isn't up to date on that.)

Right.
Comment 5 Ian 'Hixie' Hickson 2012-10-20 23:33:12 UTC
That IE works differently on this than other browsers is news to me. I thought we checked this really carefully before changing the parser on this.
Comment 6 Kang-Hao (Kenny) Lu 2012-10-21 00:11:27 UTC
(In reply to comment #5)
> That IE works differently on this than other browsers is news to me. I
> thought we checked this really carefully before changing the parser on this.

I verified again that IE9, in all its modes, works differently. But this parsing has use cases anyway, and I was told that this is fixed in IE10.

Test case: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1857
Comment 7 Michael[tm] Smith 2012-11-04 10:44:01 UTC
FWIW in regard to the validator behavior for the, e.g., "&reg" case: I've written a patch to have the validator.nu HTML parser behave per the spec for the general case -- that is, to no longer report any error for semicolon-less strings after ampersands.

However even with that patch applied the validator currently will still report an error for the case of "&reg" and some others that the named-character-references table lists both with and without the semicolon. The actual error it reports is "Named character reference was not terminated by a semicolon". It only reports that for the special case of things such as "&reg" and not for the case of some arbitrary string like "&foo".

If IE behavior for "&reg" is as Kenny describes, it seems useful for the validator to emit an error for it -- and so for the spec to define semicolon-less named character references such as "&reg" as a parse error.
Comment 8 Henri Sivonen 2012-11-05 10:26:13 UTC
See bug 19718. I think it was a mistake to make unescaped ampersands be non-errors. If there is an ampersand and it is not followed by a semicolon-terminated entity name, it may well be a typo. It seems better aligned with the purpose of validation to catch those typos than to make it so that www.google.com as it exists has no parse errors (IIRC, the case Sam presented for the trickery under discussion).
Comment 9 Kang-Hao (Kenny) Lu 2012-11-05 11:36:48 UTC
(In reply to comment #8)
> See bug 19718. I think it was a mistake to make unescaped ampersands be
> non-errors. 

To be precise, it's currently an error when an unescaped ampersand is parsed into the character it represents:

  # Otherwise, a character reference is parsed. If the last character matched 
  # is not a U+003B SEMICOLON character (;), there is a parse error.

> If there is an ampersand and it is not followed by a
> semicolon-terminated entity name, it may well be a typo.

Right. But in a case like <a href="a=1&reg=2">, it's not likely that it's a typo and the spec makes sense here. This bug is raised because IE9 and below haven't supported this and the validator should still raise an error for this case *for now*, not after IE10 takes over the world.
Comment 10 Henri Sivonen 2012-11-05 12:12:48 UTC
(In reply to comment #9)
> But in a case like <a href="a=1&reg=2">, it's not likely that it's a
> typo and the spec makes sense here. This bug is raised because IE9 and below
> haven't supported this and the validator should still raise an error for
> this case *for now*, not after IE10 takes over the world.

OK, in that case it’s not a typo, but then compatibility should weigh more than the ability to declare www.google.com free of Parse Errors.

(In reply to comment #7)
> If IE behavior for "&reg" is as Kenny describes

It indeed is. How did we end up with legacy-IE-incompatible tokenization for this case in the first place?
Comment 11 Henri Sivonen 2012-11-05 12:18:40 UTC
(In reply to comment #10)
> (In reply to comment #7)
> > If IE behavior for "&reg" is as Kenny describes
> 
> It indeed is. How did we end up with legacy-IE-incompatible tokenization for
> this case in the first place?

FWIW, IE8 also turns <a href="&reg=2"> into a ®=2.
Comment 12 Simon Pieters 2012-11-05 13:15:05 UTC
(In reply to comment #10)
> (In reply to comment #7)
> > If IE behavior for "&reg" is as Kenny describes
> 
> It indeed is. How did we end up with legacy-IE-incompatible tokenization for
> this case in the first place?

We made <a href="&reg=2"> parse different from legacy IE on the basis that legacy IE was not what people expect and there were few enough pages relying on this that we could change it. I had to argue the case to convince Hixie to change it.

http://lists.w3.org/Archives/Public/public-html/2009Jul/0417.html

(I can't find the email where it was accepted or the actual spec change right now.)
Comment 13 Kang-Hao (Kenny) Lu 2012-11-05 14:23:02 UTC
(In reply to comment #12)
> We made <a href="&reg=2"> parse different from legacy IE on the basis that
> legacy IE was not what people expect and there were few enough pages relying
> on this that we could change it. I had to argue the case to convince Hixie
> to change it.
> 
> http://lists.w3.org/Archives/Public/public-html/2009Jul/0417.html
> 
> (I can't find the email where it was accepted or the actual spec change
> right now.)

Only after reading this thread do I realize IE9 and below treat <a href="&copypasta"> according to the spec. So, as the reporter of this bug, my request is changed to the following, to be precise:

add

  | However, if the next character is a a U+003D EQUALS SIGN character (=), 
  | this is a parse error.

after

  # If the character reference is being consumed as part of an attribute, and the
  # last character matched is not a U+003B SEMICOLON character (;), and the next
  # character is either a U+003D EQUALS SIGN character (=) or an alphanumeric
  # ASCII character, then, for historical reasons, all the characters that were
  # matched after the U+0026 AMPERSAND character (&) must be unconsumed, and
  # nothing is returned.

(For what it's worth, "reg" was a real case.) I also encourage validator.nu to do this.

Thanks for the link!
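
The quoted attribute rule, together with the proposed "=" parse error, can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the spec's algorithm: `LEGACY_NAMES` is a hand-picked subset of the names the table lists both with and without a semicolon, and `consume_attr_reference` and `decode_attr` are hypothetical helper names.

```python
# Assumption: a tiny subset of the names that may appear without ';'.
LEGACY_NAMES = {"amp": "&", "copy": "\u00a9", "reg": "\u00ae", "lt": "<", "gt": ">"}


def consume_attr_reference(value, pos):
    """Handle the '&' at value[pos] inside an attribute value.

    Returns (text_to_emit, next_position, is_parse_error).
    """
    # Try longer names first so that the longest match wins.
    for name in sorted(LEGACY_NAMES, key=len, reverse=True):
        if value.startswith(name, pos + 1):
            end = pos + 1 + len(name)
            nxt = value[end:end + 1]  # '' at end of input
            if nxt == ";":
                return LEGACY_NAMES[name], end + 1, False
            if nxt == "=" or (nxt.isascii() and nxt.isalnum()):
                # Historical rule: unconsume the name and emit '&' literally.
                # Proposed addition: flag a parse error only for '='.
                return "&", pos + 1, nxt == "="
            # Semicolon-less reference that is decoded: already a parse error.
            return LEGACY_NAMES[name], end, True
    return "&", pos + 1, False


def decode_attr(value):
    """Decode an attribute value; return (text, positions of parse errors)."""
    out, errors, i = [], [], 0
    while i < len(value):
        if value[i] == "&":
            text, nxt_pos, err = consume_attr_reference(value, i)
            if err:
                errors.append(i)
            out.append(text)
            i = nxt_pos
        else:
            out.append(value[i])
            i += 1
    return "".join(out), errors


print(decode_attr("a=1&reg=2"))   # value unchanged, error flagged at index 3
print(decode_attr("&copypasta"))  # value unchanged, no error
```

Under this sketch "&reg=" keeps its literal text but is reported, while "&copypasta" stays a silent non-reference, which is the distinction the request above is asking for.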
Comment 14 Ian 'Hixie' Hickson 2012-11-05 20:01:34 UTC
> is a a 

Was that intended to be "is a" or "is not a"? (Just making sure we're on the same page... I think either way makes sense. In particular, having this temporarily be a parse error for the near term while IE9 and below are still in wide use seems rather reasonable.)
Comment 15 Kang-Hao (Kenny) Lu 2012-11-05 20:08:39 UTC
(In reply to comment #14)
> > is a a 
> 
> Was that intended to be "is a" or "is not a"? 

"is a"
Comment 16 Ian 'Hixie' Hickson 2013-01-31 00:35:18 UTC
Ok, I've made this a conformance error for the time being.
Comment 17 contributor 2013-01-31 00:37:03 UTC
Checked in as WHATWG revision r7679.
Check-in comment: Make <a href='?guitar=2&amp=1&pedal=6'> a parse error since IE9 misparses it '?guitar=2&=1&pedal=6' apparently.
http://html5.org/tools/web-apps-tracker?from=7678&to=7679
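
The divergence named in the check-in comment can be illustrated with a small Python sketch. It models only what this thread reports: legacy IE decodes a semicolon-less name followed by "=", while the spec leaves it alone, and both leave it alone before an ASCII alphanumeric. The `decode` function, its `legacy_ie` flag, and the three-name table are assumptions made for illustration, not a description of any browser's actual code.

```python
import re

# Assumption: a tiny subset of names that may appear without ';'.
LEGACY = {"amp": "&", "reg": "\u00ae", "copy": "\u00a9"}
_REF = re.compile(r"&(amp|reg|copy)(;?)")


def decode(value, legacy_ie):
    """Decode named references in an attribute value.

    legacy_ie=True models the IE9-and-below behaviour described in the thread;
    legacy_ie=False models the spec's attribute rule.
    """
    def repl(match):
        name, semicolon = match.group(1), match.group(2)
        nxt = value[match.end():match.end() + 1]  # '' at end of input
        if not semicolon:
            # Both models leave '&name' alone before an ASCII alphanumeric.
            blocked = nxt.isascii() and nxt.isalnum()
            if not legacy_ie:
                # The spec also leaves it alone before '='.
                blocked = blocked or nxt == "="
            if blocked:
                return match.group(0)
        return LEGACY[name]

    return _REF.sub(repl, value)


href = "?guitar=2&amp=1&pedal=6"
print(decode(href, legacy_ie=True))   # ?guitar=2&=1&pedal=6
print(decode(href, legacy_ie=False))  # ?guitar=2&amp=1&pedal=6
```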