Re: ISSUE-126: charset-vs-backslashes - Straw Poll for Objections

On 05.03.2011 22:41, Philip Jägenstedt wrote:
> ...
>>> Furthermore, earlier steps of the algorithm are nowhere near close to
>>> the HTTP spec, simply finding the first occurrence of "charset",
>>> allowing e.g. content='garbagecharset=UTF-8'.
>>
>> I believe this is ISSUE-148.
>>
>>> Only if the algorithm as a whole matches exactly the media-type
>>> production will the spec not require "recipients to parse
>>> Content-Type headers in <meta> elements in a way breaking HTTP's
>>> parsing rules." Since the change proposal does not achieve that, I
>>> object to its adoption.
>>
>> Again, it's a process problem that we're looking at three issues at
>> the same time.
>
> OK, I wasn't aware that there was a third issue as well. Would it be
> fair to simply treat the sum of your proposals as a single proposal that
> causes the content="" attribute value to be parsed as per the media-type
> production?
> ...

My goals would be:

- either align parsing with HTTP; *or* be clear that this is specific to 
<meta>, and that consumers will need different parsing rules for the two 
protocol elements.

- in the latter case, rephrase and possibly move the text we're 
discussing so it becomes crystal clear that this is error handling, and 
*only* applies to <meta>.

- make sure that field values that are syntactically valid in HTTP and 
conforming in HTML have the same interpretation.

- clarify how the two sets described above differ (for instance, if 
backslash doesn't do the same thing as in quoted-string, it should be 
profiled out in HTML; this may already be the case; see the sketch below).

- get rid of claims that things are done for backwards compatibility 
when we have proof this is not the case.

I don't care a lot how we get there.
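
To make the backslash point concrete, here is a minimal sketch (my own 
illustration, not text from either spec; the function names are made up) 
of how the same quoted parameter value comes out differently depending on 
whether a backslash is treated as an RFC 2616 quoted-pair escape or as an 
ordinary character:

def unquote_http(value):
    """Undo RFC 2616 quoted-string quoting: strip the outer quotes and
    resolve backslash escapes (quoted-pair)."""
    if value.startswith('"') and value.endswith('"'):
        body = value[1:-1]
        out, i = [], 0
        while i < len(body):
            if body[i] == '\\' and i + 1 < len(body):
                out.append(body[i + 1])  # quoted-pair: next char taken literally
                i += 2
            else:
                out.append(body[i])
                i += 1
        return ''.join(out)
    return value

def unquote_literal(value):
    """Strip the outer quotes but leave backslashes alone."""
    if value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    return value

raw = '"UTF\\-8"'            # the eight octets  "UTF\-8"
print(unquote_http(raw))     # UTF-8   (backslash consumed as an escape)
print(unquote_literal(raw))  # UTF\-8  (backslash kept; unknown charset label)

Whether real content depends on either behaviour is exactly what the 
numbers asked for below would have to show.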

>> The bug was originally raised because the spec claims that the
>> described behavior was needed for compatibility with "existing
>> content". This has been proven to be nonsense, or minimally an
>> exaggeration.
>
> It seems to me that parsing as per the media-type production is
> actually extremely likely to break existing content. The impact of
> backslash escaping or quotes is likely rather small (not zero), but the
> way the charset parameter is extracted (ISSUE-148) is much more serious.
> The following kinds of typos are very likely to exist in the wild in
> fairly large numbers, and would break:
>
> content='text/html charset=UTF-8' (missing semicolon)
> content='text/html: charset=UTF-8' (colon instead of semicolon)
> content='text/html; charset = UTF-8' (whitespace between attribute and
> value)
> content='text/html; charset=UTF-8;' (trailing semicolon)
> content='text/html;; charset=UTF-8' (double semicolon)

Again, numbers and test cases would be interesting.

BTW:

content='text/html; charset = UTF-8' (whitespace between attribute and 
value)

is syntactically legal per RFC 2616 (although we may have broken it in 
HTTPbis; I just opened a ticket).
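
Picking up on "test cases": here is a rough sketch (my own 
simplification; neither the HTML algorithm nor a complete RFC 2616 
parser) that runs the examples above through a strict-ish media-type 
check, with whitespace around "=" accepted because of the implied-LWS 
rule just mentioned, and through a naive "find charset=" extraction:

import re

# Rough approximation of the RFC 2616 media-type production:
#   type "/" subtype *( ";" parameter ), parameter = attribute "=" value.
# Whitespace around ";" and "=" is accepted, since RFC 2616 allows implied
# LWS between tokens (the 'charset = UTF-8' case).
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
MEDIA_TYPE = re.compile(
    r"^\s*{t}/{t}(\s*;\s*{t}\s*=\s*({t}|\"[^\"]*\"))*\s*$".format(t=TOKEN))

def strict_charset(value):
    """Return the charset only if the whole value matches the production."""
    if not MEDIA_TYPE.match(value):
        return None
    m = re.search(r";\s*charset\s*=\s*\"?([^\";]+?)\"?\s*(;|$)", value, re.I)
    return m.group(1).strip() if m else None

def naive_charset(value):
    """Grab whatever follows the first 'charset=', lenient-extraction style."""
    m = re.search(r"charset\s*=\s*\"?([^\";\s]+)", value, re.I)
    return m.group(1) if m else None

tests = [
    "text/html charset=UTF-8",     # missing semicolon
    "text/html: charset=UTF-8",    # colon instead of semicolon
    "text/html; charset = UTF-8",  # whitespace around '='
    "text/html; charset=UTF-8;",   # trailing semicolon
    "text/html;; charset=UTF-8",   # double semicolon
]
for t in tests:
    print("%-30s strict=%-6s naive=%s" % (t, strict_charset(t), naive_charset(t)))

With this reading, only the whitespace case survives strict parsing; the 
other four yield a charset only under lenient extraction. What the actual 
HTML algorithm does with each of them, and how often each pattern occurs 
in the wild, is what would need measuring.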

>> If we follow Anne's proposal for ISSUE-125 we'll at least have spec
>> text that simply states that parsing of meta tag values is different
>> from HTTP header field values, which is an improvement. We can then
>> focus on deciding *which* of all of these differences make sense/are
>> "required".
>
> One could instrument an existing HTML5 parser to strictly use the
> media-type production, then run both it and a standard one over a few
> million web pages. My guess is that we'd find that the detected
> encoding is different on a non-negligible percentage of pages.

Sure.

But we should keep in mind that error handling and changing the 
interpretation of legal values are not the same thing.
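
If someone wanted to run that kind of comparison, a skeleton might look 
like this (hypothetical throughout: the extractors and the validity check 
are plug-in points, and the sample list stands in for content="" values 
harvested from a crawl); it keeps valid and invalid values apart for 
exactly that reason:

from collections import Counter

def compare_extractions(content_values, strict_extract, lenient_extract, is_valid):
    """Tally how often two charset extractions disagree, keeping
    syntactically valid and invalid values in separate buckets: only
    disagreement on valid values means legal input changes meaning;
    disagreement on invalid values is a difference in error handling."""
    tally = Counter()
    for value in content_values:
        bucket = ("valid" if is_valid(value) else "invalid",
                  "same" if strict_extract(value) == lenient_extract(value)
                  else "differ")
        tally[bucket] += 1
    return tally

# Skeleton usage: plug in the instrumented parser, the standard parser and
# a real validity check, then feed in the harvested content="" values.
sample = ["text/html; charset=UTF-8", "text/html charset=UTF-8"]
print(compare_extractions(
    sample,
    strict_extract=lambda v: None,   # placeholder for the media-type-only parser
    lenient_extract=lambda v: None,  # placeholder for the standard extraction
    is_valid=lambda v: True,         # placeholder for an RFC 2616 validity check
))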

Best regards, Julian
