Bug 24845 - Merge <form> and URL error modes?
Status: RESOLVED FIXED
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assigned To: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2014-02-27 22:53 UTC by Ian 'Hixie' Hickson
Modified: 2014-04-11 10:54 UTC (History)
Description Ian 'Hixie' Hickson 2014-02-27 22:53:00 UTC
http://encoding.spec.whatwg.org/#encodings

# Otherwise, if encoder's error handling mode is URL, emit byte 0x3F.

As far as I can tell, Chrome and Safari actually do the same for the URL mode as they do in the <form> mode. (I didn't check IE. Firefox goes back and reserialises the whole segment using UTF-8 instead.)

e.g.: http://damowmow.com/playground/tests/urls/001.html
Comment 1 Simon Pieters 2014-02-28 14:32:37 UTC
IE will send a literal ? whereas per spec the ? should be percent-encoded IIRC.
Comment 2 Anne 2014-03-20 13:43:52 UTC
Henri, what do you think? How do you want to align Gecko? Making the error handling the same is certainly attractive.
Comment 3 Henri Sivonen 2014-04-02 07:47:53 UTC
(In reply to Anne from comment #2)
> Henri, what do you think? How do you want to align Gecko? Making the error
> handling the same is certainly attractive.

I don't think I'm familiar enough with this issue to respond in an informed way. Moreover, I can't find the sentence that Hixie quoted in the spec, so I can't work out the context of the question.

I might be able to develop an opinion if I knew what case exactly we are talking about here.
Comment 4 Simon Pieters 2014-04-02 10:01:18 UTC
There are two cases which currently are handled differently in the spec:

Consider a document (with a http: base URL) with encoding windows-1251 which includes a link <a href="?&aring;"> and a form <form><input name=x value="&aring;"></form>. å is not representable in windows-1251. The former is turned into ?%3F and the latter is turned into ?%26%23229%3B. The proposal is to make both ?%26%23229%3B.

The proposal matches WebKit/Blink.

IE almost matches the current spec, it just doesn't percent-escape the "?".

Gecko switches to utf-8 for the whole URL and gets ?%C3%A5.

I think the <form> handling is interoperable already.

I think the relevant part of the spec is http://encoding.spec.whatwg.org/#concept-encoding-process
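
For illustration, the three outcomes listed above can be reproduced with a short Python sketch. The function name encode_query and the mode labels "url", "form" and "utf-8-fallback" are made up for this example; they are not spec terms or browser APIs, and the sketch only approximates the behaviours described in this comment ("cp1251" is Python's codec name for windows-1251):

```python
from urllib.parse import quote


def encode_query(text: str, encoding: str, mode: str) -> str:
    """Percent-encode `text` for a query string (hypothetical helper).

    `mode` only matters when `text` contains a character that
    `encoding` cannot represent.
    """
    try:
        raw = text.encode(encoding)
    except UnicodeEncodeError:
        if mode == "url":
            # Current spec text for URLs: emit byte 0x3F ("?").
            raw = text.encode(encoding, errors="replace")
        elif mode == "form":
            # <form> handling (and the proposal): emit "&#NNN;".
            raw = text.encode(encoding, errors="xmlcharrefreplace")
        elif mode == "utf-8-fallback":
            # Gecko as described above: re-encode everything as UTF-8.
            raw = text.encode("utf-8")
        else:
            raise ValueError(f"unknown mode: {mode}")
    # safe="" so that "?", "&", "#" and ";" are all percent-escaped.
    return quote(raw, safe="")


for mode in ("url", "form", "utf-8-fallback"):
    print(mode, "?" + encode_query("\u00e5", "cp1251", mode))
```

IE's behaviour would differ from the "url" mode only in leaving the replacement "?" unescaped.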
Comment 5 Henri Sivonen 2014-04-03 13:02:25 UTC
(In reply to Simon Pieters from comment #4)
> Consider a document (with a http: base URL) with encoding windows-1251 which
> includes a link <a href="?&aring;"> and a form <form><input name=x
> value="&aring;"></form>. å is not representable in windows-1251. The former
> is turned into ?%3F and the latter is turned into ?%26%23229%3B. The
> proposal is to make both ?%26%23229%3B.
> 
> The proposal matches WebKit/Blink.
> 
> IE almost matches the current spec, it just doesn't percent-escape the "?".

I could live with matching WebKit/Blink. The probability of IE's approach ever leading to a useful URL seems low.

> Gecko switches to utf-8 for the whole URL and gets ?%C3%A5.

Hmm. Switching the encoding when one non-representable character is added doesn't seem like a good idea to me, especially if other browsers don't do the same. CCing bz in the hope of getting background info.
Comment 6 Boris Zbarsky 2014-04-03 14:46:28 UTC
I don't recall exactly what necko does in this situation and why...  Worth checking with one of the current necko peers.
Comment 7 Leif Halvard Silli 2014-04-04 05:16:33 UTC
(In reply to Henri Sivonen from comment #5)
> (In reply to Simon Pieters from comment #4)

> > Gecko switches to utf-8 for the whole URL and gets ?%C3%A5.
> 
> Hmm. Switching the encoding when one non-representable character is added
> doesn't seem like a good idea to me, especially if other browsers don't do
> the same. CCing bz in the hope of getting background info.

Well, compared with what they do for erroneous URLs with *representable* characters, both Firefox and WebKit/Blink switch the encoding when one non-representable character is added. They just do it in different ways.

Firefox just follows the normal procedure of representing code points higher than U+009F as UTF-8 percent-encoded characters. WebKit/Blink do the same - but only for the *representable* characters.

The issue here is, I guess, 'storing': the percent-encoding is decoded, e.g. when storing a form. And so, in WebKit/Blink, ?%26%23229%3B becomes ?&#229;, which is compatible even with Cyrillic encodings.

For IE, the character is stored as a representable, and therefore wrong, character.

For Firefox, unless it performs some extra encoding step after the decoding, the percent-encoded character is probably stored as UTF-8 encoded bytes read through a non-UTF-8 parser.

For storing a form, the WebKit/Blink behavior seems more fruitful. Inside a Web page, the Firefox method might be better?
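
A rough Python sketch of the 'storing' step described above, assuming a server that percent-decodes the value and interprets the bytes in the page's encoding (windows-1251, "cp1251" in Python). The helper name store is hypothetical, and real servers may well behave differently:

```python
import html
from urllib.parse import unquote_to_bytes


def store(query_value: str, page_encoding: str = "cp1251") -> str:
    """Percent-decode a query value and read the bytes as `page_encoding`.

    Hypothetical helper: models a server that assumes the submitted
    bytes are in the same encoding as the page.
    """
    return unquote_to_bytes(query_value).decode(page_encoding)


webkit_blink = store("%26%23229%3B")  # character reference survives as ASCII
ie = store("%3F")                     # only a "?" is left
firefox = store("%C3%A5")             # UTF-8 bytes misread as windows-1251

print(webkit_blink, ie, firefox)
# The character reference can still be turned back into the original
# character later, e.g. when the stored value is echoed into HTML.
print(html.unescape(webkit_blink))
```

Under these assumptions only the WebKit/Blink form keeps enough information to recover the original character; the Firefox form decodes to two unrelated Cyrillic letters unless the server knows to treat the bytes as UTF-8.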
Comment 8 Anne 2014-04-10 11:36:12 UTC
I asked the Necko guys before (just checked again on #necko, no response so far) and I believe nobody in charge at the moment really knows much about, or cares for, the URL parsing code.