This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9351 - Do not interpret & followed by an entity name followed by = as an entity reference in attribute values (maybe in text content too)
Status: CLOSED DUPLICATE of bug 9207
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-27 22:41 UTC by Maciej Stachowiak
Modified: 2010-10-04 13:55 UTC
CC List: 7 users

Description Maciej Stachowiak 2010-03-27 22:41:00 UTC
It's been suggested that an unterminated entity (one not followed by a semicolon) that is followed by an equal sign in an attribute value should not be treated as an entity reference.

It seems that rather few pages overall would be affected by changing this. One study found 50 occurrences out of approximately 425k pages:
http://lists.w3.org/Archives/Public/public-html/2009Jun/0463.html

It was also reported that most of these occurrences appeared to be cases where the author did not expect their text to be interpreted as an entity reference, and review of these 50 instances seems to confirm that impression.

It seems like there is at least some content that would be broken by changing the interpretation:
http://lists.w3.org/Archives/Public/public-html/2009Jul/0421.html

Specifically, it seems that &amp= may occasionally be intended as "&" rather than as "&amp=" or "&=".

On the whole, it still seems like the proposed change would fix more content than it breaks.

One possible variation is to exclude &amp= from this change. However, if authors write that when they mean "&", their content will not work as intended under either the existing parsing rule or the proposed new one, whereas content that writes "&amp=" when it means "&amp=" would be fixed. There were instances of both in the data set.

In addition to the direct benefits of fixing content, this change would also make it safer to change parsing rules to allow unescaped & in attributes.
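
As a rough illustration of the pattern at issue, Python's standard-library html.unescape decodes character references using the HTML5 rules for text. It does not implement the attribute-specific exception, but for an unterminated name followed by = it gives the same result as the existing attribute rule. The URL below is a made-up example, not one taken from the study.

```python
import html

# Hypothetical URL whose query parameter name happens to match a legacy
# entity name ("copy") that is recognised without a trailing semicolon.
url = "page?a=1&copy=2"

# Under the existing rule, "&copy" is decoded even though "=" follows it.
decoded = html.unescape(url)
print(decoded)  # page?a=1©=2

# The same happens with "&amp=", which becomes "&=".
print(html.unescape("x&amp=y"))  # x&=y
```

Under the proposed change, both strings would be left as written when they appear in an attribute value.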
Comment 1 Henri Sivonen 2010-03-29 07:55:26 UTC
What's the rationale of restricting this change request to attribute values? URLs also get copied and pasted to element content occasionally and making them tokenize differently there would violate the principle of least surprise.
Comment 2 Maciej Stachowiak 2010-03-29 08:17:30 UTC
(In reply to comment #1)
> What's the rationale of restricting this change request to attribute values?
> URLs also get copied and pasted to element content occasionally and making them
> tokenize differently there would violate the principle of least surprise.
> 

After some consideration, I think this change should probably be applied to text content as well, though I would also be satisfied with a version limited to attribute values. Updating title accordingly.

The reasons I originally asked only for attribute values (which I no longer think are valid):

1) I thought the study evaluating occurrence of this pattern was limited to attribute values, so there was no information about the risk of the greater change. But that's not right - it included cases in text content and in script as well.

2) I did not think there was any benefit in text content, but I failed to think about the case of a URL in text content, or copying between attributes and text content. It's likely that the benefit is lower in text content, but the pattern also seems much less likely to occur there in the first place.

Title amended.
Comment 3 Maciej Stachowiak 2010-03-29 08:35:42 UTC
One interesting side note: processing of entities (aka "named character references" apparently) is just about the only thing that is already different between quoted attribute values and text content, as far as the tokenizer is concerned:

http://dev.w3.org/html5/spec/Overview.html#attribute-value-double-quoted-state
http://dev.w3.org/html5/spec/Overview.html#data-state

Making = get treated as an extra "additional allowed character" would in fact have the desired effect suggested by this bug.

That being said, I think it might still be a good idea to do the change everywhere.
Comment 4 Maciej Stachowiak 2010-03-29 08:47:35 UTC
Further correction: the "additional allowed character" is a pretty minor difference, because it applies only right after the ampersand. The actual key difference is:

"If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned."

Adding = to that list would effect the change proposed here, but only for attribute values, not for text. However, since processing of entities is already quite a bit different between attribute values and text content, I am no longer so sure that making this change for text content too is warranted. Updating title to bias it a bit the other way.
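
A minimal sketch of the rule quoted above, with an optional switch that adds = to the list of characters that cause the match to be unconsumed. This is not the spec's tokenizer: the function name decode_attribute, the block_equals flag, and the five-entry entity table are assumptions made for illustration (the real table is much larger), and numeric references are ignored.

```python
import string

# Small illustrative subset of entity names that are recognised without a
# trailing semicolon. Assumption: the real table in the spec is far longer.
LEGACY_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "copy": "\u00a9", "reg": "\u00ae"}

ALNUM = set(string.ascii_letters + string.digits)


def decode_attribute(value, block_equals=False):
    """Decode named character references in an attribute value (simplified)."""
    # Characters that, when they follow an unterminated reference, cause the
    # whole match to be unconsumed. The proposal adds "=" to this set.
    blockers = ALNUM | ({"="} if block_equals else set())
    out = []
    i = 0
    while i < len(value):
        if value[i] != "&":
            out.append(value[i])
            i += 1
            continue
        # Look for the longest known entity name directly after the ampersand.
        match = None
        for name in sorted(LEGACY_ENTITIES, key=len, reverse=True):
            if value.startswith(name, i + 1):
                match = name
                break
        if match is None:
            out.append("&")
            i += 1
            continue
        end = i + 1 + len(match)
        if value.startswith(";", end):
            # Properly terminated reference: always decoded.
            out.append(LEGACY_ENTITIES[match])
            i = end + 1
        elif end < len(value) and value[end] in blockers:
            # Unterminated and followed by a blocking character: keep the
            # ampersand literally and continue with the text after it.
            out.append("&")
            i += 1
        else:
            # Unterminated but not blocked: decoded for historical reasons.
            out.append(LEGACY_ENTITIES[match])
            i = end
    return "".join(out)
```

With the existing list, decode_attribute("?a=1&copy=2") returns "?a=1©=2". With block_equals=True the same input comes back unchanged as "?a=1&copy=2", while a terminated reference such as "x&copy;y" is still decoded.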
Comment 5 Ian 'Hixie' Hickson 2010-04-02 21:44:14 UTC
see also bug 9207
Comment 6 Ian 'Hixie' Hickson 2010-04-02 22:39:54 UTC

*** This bug has been marked as a duplicate of bug 9207 ***
Comment 7 Maciej Stachowiak 2010-04-02 22:40:27 UTC
Agreed that this is a duplicate, given the resolution of bug 9207.