19102 2012-09-28 03:27:11 +0000
A semicolon-less named character reference in an attribute should also be treated as a parse error.
2013-01-31 00:37:03 +0000
1 1 1 Unclassified WHATWG HTML unspecified Other other RESOLVED FIXED
http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references
P3 normal Unsorted 1
contributor ian hsivonen ian kennyluck mike naesten qjy1111 zcorpan contributor oldest_to_newest

74708 0 contributor 2012-09-28 03:27:11 +0000
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
Multipage: http://www.whatwg.org/C#tokenizing-character-references
Complete: http://www.whatwg.org/c#tokenizing-character-references

Comment:
A semicolon-less named character reference in an attribute should also be treated as a parse error.

Posted from: 119.161.158.96 by kennyluck@csail.mit.edu
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

74710 1 kennyluck 2012-09-28 03:34:31 +0000
I think the spec intentionally treats this case as no error, but I propose we make it a parse error (or other author conformance violation) *temporarily*, until the market share of IE9 and below drops to a certain level. The reason is simple: this, being surprising, still traps people. Also, validator.nu (and hence Firefox source view) treats that as an error, which I believe to be a good thing for authors. See, for example,
http://validator.nu/?doc=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html%3E%3Ctitle%3E%3C%2Ftitle%3E%3Ca+href%3D%22a%3D1%26reg%3D2%22%3E%3C%2Fa%3E

76335 2 naesten 2012-10-15 19:47:58 +0000
If this case is intentionally not a parse error, this should be made explicit, like in the case marked:

"Not a character reference. No characters are consumed, and nothing is returned.
(This is not an error, either.)"

But I don't see why this shouldn't be an error: that doesn't appear to mean anything more than "documents may not do this, validators have to warn if they do, and it's okay for implementations to give up on seeing it." Only the license to give up could actually lead to breakage, and the big browsers won't actually do this until it's about time to excise this case, will they?

76787 3 ian 2012-10-19 22:58:48 +0000
I'm confused. Can you give a precise example of what you think should be invalid but isn't?

If you mean this:

<a href="a=1&reg=2"></a>

...then that is intentionally conforming, because it's very common and harmless.

(Looks like validator.nu isn't up to date on that.)

76802 4 kennyluck 2012-10-20 00:50:51 +0000
(In reply to comment #3)
> I'm confused. Can you give a precise example of what you think should be
> invalid but isn't?

This is one such example.

>
> If you mean this:
>
> <a href="a=1&reg=2"></a>
>
> ...then that is intentionally conforming, because it's very common and
> harmless.

The harm is that it won't work in IE9 and below. Anyway, I am not at all clear about the value of author conformance, but it doesn't seem to be a bad thing to warn the user or raise an error whenever the validator sees this.

> (Looks like validator.nu isn't up to date on that.)

Right.

76823 5 ian 2012-10-20 23:33:12 +0000
That IE works differently on this than other browsers is news to me. I thought we checked this really carefully before changing the parser on this.

76824 6 kennyluck 2012-10-21 00:11:27 +0000
(In reply to comment #5)
> That IE works differently on this than other browsers is news to me. I
> thought we checked this really carefully before changing the parser on this.

I verified again that IE9, in all its modes, works differently. But this parsing has use cases anyway and I was told that this is fixed in IE10.
Test case: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1857

77828 7 mike 2012-11-04 10:44:01 +0000
FWIW in regard to the validator behavior for the, e.g., "&reg" case: I've written a patch to have the validator.nu HTML parser behave per the spec for the general case -- that is, to no longer report any error for semicolon-less strings after ampersands.

However, even with that patch applied, the validator currently will still report an error for the case of "&reg" and some others that the named-character-references table lists both with and without the semicolon. The actual error it reports is "Named character reference was not terminated by a semicolon". It only reports that for the special case of things such as "&reg" and not for the case of some arbitrary string like "&foo".

If IE behavior for "&reg" is as Kenny describes, it seems useful for the validator to emit an error for it -- and so for the spec to define semicolon-less named character references such as "&reg" as a parse error.

77859 8 hsivonen 2012-11-05 10:26:13 +0000
See bug 19718. I think it was a mistake to make unescaped ampersands be non-errors. If there is an ampersand and it is not followed by a semicolon-terminated entity name, it may well be a typo. It seems better aligned with the purpose of validation to catch those typos than to make it so that www.google.com as it exists has no parse errors (IIRC, the case Sam presented for the trickery under discussion).

77865 9 kennyluck 2012-11-05 11:36:48 +0000
(In reply to comment #8)
> See bug 19718. I think it was a mistake to make unescaped ampersands be
> non-errors.

To be precise, it's currently an error when an unescaped ampersand is parsed into the character it represents:

# Otherwise, a character reference is parsed. If the last character matched
# is not a U+003B SEMICOLON character (;), there is a parse error.

> If there is an ampersand and it is not followed by a
> semicolon-terminated entity name, it may well be a typo.

Right.
But in a case like <a href="a=1&reg=2">, it's not likely that it's a typo and the spec makes sense here. This bug is raised because IE9 and below haven't supported this and the validator should still raise an error for this case *for now*, not after IE10 takes over the world.

77866 10 hsivonen 2012-11-05 12:12:48 +0000
(In reply to comment #9)
> But in a case like <a href="a=1&reg=2">, it's not likely that it's a
> typo and the spec makes sense here. This bug is raised because IE9 and below
> haven't supported this and the validator should still raise an error for
> this case *for now*, not after IE10 takes over the world.

OK, in that case, it’s not a typo, but in that case, compatibility should weigh more than the ability to declare www.google.com free of Parse Errors.

(In reply to comment #7)
> If IE behavior for "&reg" is as Kenny describes

It indeed is. How did we end up with legacy-IE-incompatible tokenization for this case in the first place?

77867 11 hsivonen 2012-11-05 12:18:40 +0000
(In reply to comment #10)
> (In reply to comment #7)
> > If IE behavior for "&reg" is as Kenny describes
>
> It indeed is. How did we end up with legacy-IE-incompatible tokenization for
> this case in the first place?

FWIW, IE8 also turns <a href="&reg=2"> into a ®=2.

77871 12 zcorpan 2012-11-05 13:15:05 +0000
(In reply to comment #10)
> (In reply to comment #7)
> > If IE behavior for "&reg" is as Kenny describes
>
> It indeed is. How did we end up with legacy-IE-incompatible tokenization for
> this case in the first place?

We made <a href="&reg=2"> parse different from legacy IE on the basis that legacy IE was not what people expect and there were few enough pages relying on this that we could change it. I had to argue the case to convince Hixie to change it.

http://lists.w3.org/Archives/Public/public-html/2009Jul/0417.html

(I can't find the email where it was accepted or the actual spec change right now.)

77885 13 kennyluck 2012-11-05 14:23:02 +0000
(In reply to comment #12)
> We made <a href="&reg=2"> parse different from legacy IE on the basis that
> legacy IE was not what people expect and there were few enough pages relying
> on this that we could change it. I had to argue the case to convince Hixie
> to change it.
>
> http://lists.w3.org/Archives/Public/public-html/2009Jul/0417.html
>
> (I can't find the email where it was accepted or the actual spec change
> right now.)

Only after reading this thread do I realize IE9 and below treat <a href="&copypasta"> according to the spec. So, as the reporter of this bug, my request is changed to the following, to be precise:

add

| However, if the next character is a a U+003D EQUALS SIGN character (=),
| this is a parse error.

after

# If the character reference is being consumed as part of an attribute, and the
# last character matched is not a U+003B SEMICOLON character (;), and the next
# character is either a U+003D EQUALS SIGN character (=) or an alphanumeric
# ASCII character, then, for historical reasons, all the characters that were
# matched after the U+0026 AMPERSAND character (&) must be unconsumed, and
# nothing is returned.

(For what it's worth, "reg" was a real case.)

and I encourage validator.nu to do this. Thanks for the link!

77903 14 ian 2012-11-05 20:01:34 +0000
> is a a

Was that intended to be "is a" or "is not a"?

(Just making sure we're on the same page... I think either way makes sense. In particular, having this temporarily be a parse error for the near term while IE9 and below are still in wide use seems rather reasonable.)

77906 15 kennyluck 2012-11-05 20:08:39 +0000
(In reply to comment #14)
> > is a a
>
> Was that intended to be "is a" or "is not a"?

"is a"

82384 16 ian 2013-01-31 00:35:18 +0000
Ok, I've made this a conformance error for the time being.

82385 17 contributor 2013-01-31 00:37:03 +0000
Checked in as WHATWG revision r7679.
Check-in comment: Make <a href='?guitar=2&amp=1&pedal=6'> a parse error since IE9 misparses it '?guitar=2&=1&pedal=6' apparently.
http://html5.org/tools/web-apps-tracker?from=7678&to=7679
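
For readers following the thread, the attribute rule quoted in comment 13, together with the proposed "=" parse error, can be sketched in Python roughly as below. This is only an illustration under stated assumptions, not the spec algorithm or any validator's code: `NAMED_REFS` is a hypothetical three-entry stand-in for the real named character reference table, `decode_attribute_value` is a made-up helper name, and numeric references and the other parse-error cases are ignored.

```python
# Hypothetical, tiny stand-in for the named character reference table.
# As in the real table, some names are listed with and without ";".
NAMED_REFS = {
    "reg;": "\u00ae", "reg": "\u00ae",
    "copy;": "\u00a9", "copy": "\u00a9",
    "amp;": "&", "amp": "&",
}


def decode_attribute_value(raw):
    """Return (decoded value, list of parse-error messages) for an attribute value."""
    out, errors = [], []
    i = 0
    while i < len(raw):
        if raw[i] != "&":
            out.append(raw[i])
            i += 1
            continue
        # Longest match against the table, starting right after the "&".
        match = None
        for name in sorted(NAMED_REFS, key=len, reverse=True):
            if raw.startswith(name, i + 1):
                match = name
                break
        if match is None:
            # Not a known reference: keep the ampersand literally (simplified).
            out.append("&")
            i += 1
            continue
        end = i + 1 + len(match)
        nxt = raw[end:end + 1]  # "" at end of input
        if not match.endswith(";") and (
            nxt == "=" or (nxt.isascii() and nxt.isalnum())
        ):
            # Historical attribute rule: unconsume, nothing is returned.
            if nxt == "=":
                # The addition proposed in comment 13.
                errors.append("semicolon-less reference followed by '='")
            out.append("&")
            i += 1
            continue
        if not match.endswith(";"):
            errors.append("reference not terminated by a semicolon")
        out.append(NAMED_REFS[match])
        i = end
    return "".join(out), errors


if __name__ == "__main__":
    for value in ["a=1&reg=2", "&copypasta", "a&reg b", "a&reg;=2"]:
        print(repr(value), "->", decode_attribute_value(value))
```

Under this sketch, "a=1&reg=2" and the check-in example "?guitar=2&amp=1&pedal=6" keep their literal text but each reports one error, while "&copypasta" stays literal with no error.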