24845 – Merge <form> and URL error modes?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: Merge <form> and URL error modes?

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2014-02-27 22:53 UTC by Ian 'Hixie' Hickson
Modified:	2014-04-11 10:54 UTC
CC List:	6 users

Description Ian 'Hixie' Hickson 2014-02-27 22:53:00 UTC

http://encoding.spec.whatwg.org/#encodings

# Otherwise, if encoder's error handling mode is URL, emit byte 0x3F.

As far as I can tell, Chrome and Safari actually do the same for the URL mode as they do in the <form> mode. (I didn't check IE. Firefox goes back and reserialises the whole segment using UTF-8 instead.)

e.g.: http://damowmow.com/playground/tests/urls/001.html

Comment 1 Simon Pieters 2014-02-28 14:32:37 UTC

IE will send a literal ? whereas per spec the ? should be percent-encoded IIRC.

Comment 2 Anne 2014-03-20 13:43:52 UTC

Henri, what do you think? How do you want to align Gecko? Making the error handling the same is certainly attractive.

Comment 3 Henri Sivonen 2014-04-02 07:47:53 UTC

(In reply to Anne from comment #2)
> Henri, what do you think? How do you want to align Gecko? Making the error
> handling the same is certainly attractive.

I don't think I'm familiar enough with this issue to respond in an informed way. Moreover, I can't find the sentence that Hixie quoted in the spec, so I can't work out the context of the question.

I might be able to develop an opinion if I knew what case exactly we are talking about here.

Comment 4 Simon Pieters 2014-04-02 10:01:18 UTC

There are two cases which currently are handled differently in the spec:

Consider a document (with a http: base URL) with encoding windows-1251 which includes a link <a href="?&aring;"> and a form <form><input name=x value="&aring;"></form>. å is not representable in windows-1251. The former is turned into ?%3F and the latter is turned into ?%26%23229%3B. The proposal is to make both ?%26%23229%3B.

The proposal matches WebKit/Blink.

IE almost matches the current spec, it just doesn't percent-escape the "?".

Gecko switches to utf-8 for the whole URL and gets ?%C3%A5.

I think the <form> handling is interoperable already.

I think the relevant part of the spec is http://encoding.spec.whatwg.org/#concept-encoding-process
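
To make the outcomes in this comment concrete, here is a minimal Python sketch of the three behaviours for the query in the example. It is an illustration only, not the spec's algorithm: `encode_query` and its `mode` argument are hypothetical names, and percent-encoding every reserved byte with `quote(..., safe="")` is a simplification of what a real URL parser does.

```python
from urllib.parse import quote


def encode_query(text: str, encoding: str, mode: str) -> str:
    """Encode `text` with a legacy encoding, then percent-encode the bytes.

    `mode` (hypothetical) selects what happens to a code point that the
    encoding cannot represent:
      "url"  -> emit byte 0x3F ("?"), as the spec said at the time
      "form" -> emit a decimal character reference, e.g. "&#229;"
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            if mode == "url":
                out += b"?"
            elif mode == "form":
                out += f"&#{ord(ch)};".encode("ascii")
            else:
                raise ValueError(f"unknown mode: {mode}")
    # Simplification: percent-encode every reserved byte.
    return quote(bytes(out), safe="")


# U+00E5 (å) is not representable in windows-1251.
print("?" + encode_query("å", "windows-1251", "url"))   # spec's URL mode at the time
print("?" + encode_query("å", "windows-1251", "form"))  # <form> mode, WebKit/Blink
print("?" + quote("å".encode("utf-8"), safe=""))        # Gecko: whole query as UTF-8
```

These three lines correspond to ?%3F, ?%26%23229%3B and ?%C3%A5 respectively.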

Comment 5 Henri Sivonen 2014-04-03 13:02:25 UTC

(In reply to Simon Pieters from comment #4)
> Consider a document (with a http: base URL) with encoding windows-1251 which
> includes a link <a href="?&aring;"> and a form <form><input name=x
> value="&aring;"></form>. å is not representable in windows-1251. The former
> is turned into ?%3F and the latter is turned into ?%26%23229%3B. The
> proposal is to make both ?%26%23229%3B.
> 
> The proposal matches WebKit/Blink.
> 
> IE almost matches the current spec, it just doesn't percent-escape the "?".

I could live with matching WebKit/Blink. The probability of IE's approach ever leading to a useful URL seems low.

> Gecko switches to utf-8 for the whole URL and gets ?%C3%A5.

Hmm. Switching the encoding when one non-representable character is added doesn't seem like a good idea to me, especially if other browsers don't do the same. CCing bz in the hope of getting background info.

Comment 6 Boris Zbarsky 2014-04-03 14:46:28 UTC

I don't recall exactly what necko does in this situation and why...  Worth checking with one of the current necko peers.

Comment 7 Leif Halvard Silli 2014-04-04 05:16:33 UTC

(In reply to Henri Sivonen from comment #5)
> (In reply to Simon Pieters from comment #4)

> > Gecko switches to utf-8 for the whole URL and gets ?%C3%A5.
> 
> Hmm. Switching the encoding when one non-representable character is added
> doesn't seem like a good idea to me, especially if other browsers don't do
> the same. CCing bz in the hope of getting background info.

Well, compared with what they do for erroneous URLs with *representable* characters, both Firefox and WebKit/Blink switch the encoding when one non-representable character is added. They just do it in different ways.

Firefox just follows the normal procedure of representing code points higher than U+009F as UTF-8 percent-encoded characters. WebKit/Blink do the same - but only for the *representable* characters.

The issue here is, I guess, 'storing': the percent-encoding is decoded, e.g. when storing a form. And so, in WebKit/Blink, ?%26%23229%3B becomes ?&#229;, which is compatible even with Cyrillic encodings.

For IE, the character is stored as a representable, and therefore wrong, character.

For Firefox, unless it performs some extra encoding step after the decoding, the percent-encoded character is probably stored as UTF-8 encoded characters read through a non-UTF-8 decoder.

For storing a form, the WebKit/Blink behavior seems more fruitful. Inside a Web page, the Firefox method might be better?
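
The 'storing' argument can be sketched in Python, under the assumption that the receiving side percent-decodes the query and interprets the bytes as windows-1251. `stored_text` is a hypothetical helper for illustration; real servers may well do something different.

```python
import html
from urllib.parse import unquote_to_bytes


def stored_text(query: str, encoding: str = "windows-1251") -> str:
    """Percent-decode `query` and decode the bytes with the page encoding."""
    return unquote_to_bytes(query).decode(encoding)


webkit = stored_text("%26%23229%3B")  # character reference survives as ASCII
ie = stored_text("%3F")               # only a question mark is left
gecko = stored_text("%C3%A5")         # UTF-8 bytes misread as windows-1251

print(webkit, ie, gecko)

# The character reference can later be turned back into the original
# character; the other two forms cannot be recovered this way.
print(html.unescape(webkit))
```

Under these assumptions the WebKit/Blink form is stored as the text `&#229;`, the IE form as `?`, and the Gecko form as the two Cyrillic letters U+0413 U+0490, which is the mojibake described above.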

Comment 8 Anne 2014-04-10 11:36:12 UTC

I asked the Necko guys before (and just checked again on #necko, no response so far), and I believe nobody in charge at the moment really knows much about, or cares for, the URL parsing code.

Comment 9 Anne 2014-04-11 10:54:23 UTC

https://github.com/whatwg/encoding/commit/73874f405d64a061eee35f4211cb3fe1e903a934