Re: FW: single percent from Erik van der Poel on 2009-09-29 (public-iri@w3.org from September 2009)

From: Erik van der Poel <erikv@google.com>
Date: Tue, 29 Sep 2009 07:39:15 -0700
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <c07a32650909290739k4741787k48e0c2bb3b7565df@mail.gmail.com>

On Tue, Sep 29, 2009 at 3:50 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> Hello Erik, others,
>
> On 2009/09/26 3:23, Erik van der Poel wrote:
>>
>> On Fri, Sep 25, 2009 at 10:42 AM, Larry Masinter<masinter@adobe.com>
>>  wrote:
>>>
>>> Some mail didn't get sent to public-IRI which should have been:
>>>
>>> On 2009/09/03 7:33, Larry Masinter wrote:
>>>>
>>>> Sorry to be rehashing what I think are old topics, but the discussion of
>>>> these things seems to be scattered around on a zillion mailing lists:
>>>>
>>>>
>>>>   *   I'm not sure why  http://example.com/%<http://example.com/%25>
>>>>  should be illegal as an IRI. I remember some discussion of this, but not
>>>> the resolution. Why not update IRI to allow it, since it seems to work in
>>>> most systems?
>>
>> I think this got garbled along the way, but I assume you're talking
>> about a percent sign (%) in the path part that is not followed by two
>> hex digits. This does not "work in most systems". Our automated tests
>> show that IE8 will not send the HTTP request, Safari4 escapes % as
>> %25, while Firefox, Chrome and Opera leave the % as is.
>
> Oh, interesting. I think Larry and I were assuming that there was some
> uniform behavior at least for major browsers that we could document (instead
> of HTML5). If there's such variation, my first proposal would be to go with
> the most conservative variant (single percents are simply illegal -> don't
> send request,...). (My second proposal would be to mention more lenient
> processing only as a MAY.)

When there is a single percent in the query part, the browsers are a
bit more consistent. MSIE, Firefox, Chrome and Opera leave it as %
while Safari escapes it as %25.

The browsers are also a bit more consistent when there is a single
percent in the host part. MSIE, Firefox, Chrome and Opera don't emit
any DNS or HTTP packets, while Safari sends DNS and HTTP packets with
the % as is.
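
To make the three path/query behaviors concrete, here is a minimal
Python sketch. It is only an illustration: the function name
handle_lone_percent and the policy labels "reject", "escape" and
"pass" are hypothetical, chosen to mirror not sending the request,
rewriting % as %25, and leaving % as is. It does not model the host
part.

```python
import re

# A '%' that is NOT followed by two hex digits (a "single" or lone percent).
LONE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def handle_lone_percent(component, policy):
    """Apply one policy to a URL path or query component.

    The policy names are hypothetical labels for the reported behaviors:
      "reject" - refuse to send the request (returns None)
      "escape" - rewrite each lone % as %25
      "pass"   - leave the component unchanged
    """
    if policy not in ("reject", "escape", "pass"):
        raise ValueError("unknown policy: %r" % (policy,))
    if not LONE_PERCENT.search(component):
        # Only well-formed %XX escapes (or no % at all): nothing to decide.
        return component
    if policy == "reject":
        return None
    if policy == "escape":
        return LONE_PERCENT.sub("%25", component)
    return component
```

With this sketch, handle_lone_percent("/100%", "escape") gives
"/100%25", while a well-formed escape such as "/a%41" is returned
unchanged under every policy.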

>>> Martin:
>>>
>>> It's illegal in URIs, too. The URI and IRI syntaxes should be as
>>> parallel as possible. In terms of implementations, it may be easy for
>>> consumers, but for producers, it's not. It's much easier to just escape
>>> than to go and check whether (one or) two hex digits are following
>>> (which would change the meaning totally).
>>
>> Surely that depends on the type of producer. For HTML form
>> submissions, % should be escaped as %25,
>
> Yes, if you have a '%' which is simply data, you should convert it to '%25'.
>
>> but for HTML hrefs, the
>> producer is also a consumer and should first check whether two hex
>> digits follow.
>
> I'm not sure what you mean here by "the producer is also a consumer". Can
> you explain?

A browser consumes HTML and produces DNS and HTTP packets (requests).
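
The difference between the two kinds of producer can be sketched in
Python as follows. This is an assumption-laden illustration, not a
full form-encoding algorithm: escape_form_value and is_escape_at are
hypothetical names, and urllib.parse.quote is used only as a generic
percent-encoder.

```python
from urllib.parse import quote

HEX_DIGITS = set("0123456789abcdefABCDEF")


def escape_form_value(value):
    """Pure producer: every '%' is plain data, so it always becomes %25."""
    return quote(value, safe="")


def is_escape_at(href_part, index):
    """Consumer/producer: does the '%' at `index` start a valid %XX escape?"""
    if href_part[index] != "%":
        raise ValueError("no '%' at the given index")
    following = href_part[index + 1:index + 3]
    return len(following) == 2 and all(c in HEX_DIGITS for c in following)
```

A form value "%41" is data and becomes "%2541", whereas in an href
the same three characters are an existing escape that has to be
recognized and left alone.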

>> The big question is what to do about a % sign that is
>> not followed by two hex digits. The major browsers currently handle
>> this differently, so producers would be wise to avoid this,
>
> Very much so indeed. Even if major browsers handled this all the same way,
> there's much more than just major browsers that process URIs or IRIs.

Yes, I agree that single percents should be strongly discouraged (but
we need to document and try to standardize the behavior of producers
that are also consumers).

>> but it is
>> not clear to me what advice should be given to consumer/producer
>> implementers. Is it better to be conservative like IE and reject it?
>> Or is it better to be forgiving like Firefox and just send out the
>> lone % sign? (Note: this particular case is interesting, because IE is
>> usually the forgiving one, while Firefox is the conservative one.)
>
> Well, there's always the hope for progress.

Yes.

>>> Martin:
>>>
>>> I think the purpose is to %-encode '[' and ']' except for the authority
>>> part, where they are needed for IPV6. The encoding is done because '['
>>> and ']' are not allowed elsewhere than in IP-literal.
>>
>> I don't see why [ and ] should be disallowed in the path and query
>> parts,
>
> Well, currently the specs say so (the URI spec says so, and the IRI spec
> follows it).

That may be true, but is there a good reason why those are disallowed?

>> but the major browsers currently handle those characters
>> differently in the path/query. (Some browsers %-encode, others don't.)
>
> Can you give details?

Path: MSIE, Chrome and Opera leave [ and ] as is, while Firefox and
Safari %-encode them.

Query: MSIE, Firefox, Chrome and Opera leave [ and ] as is, while
Safari %-encodes them.
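
For producers who want to see what each browser would send, the
results above can be written down as a small Python table. This is a
rough sketch: the names OBSERVED, encode_brackets and as_sent are
hypothetical, and the table only restates the path and query results
listed here.

```python
# Reported handling of '[' and ']' per URL part.
# "leave" = sent as is, "encode" = %-encoded before sending.
OBSERVED = {
    "path": {
        "MSIE": "leave", "Chrome": "leave", "Opera": "leave",
        "Firefox": "encode", "Safari": "encode",
    },
    "query": {
        "MSIE": "leave", "Firefox": "leave", "Chrome": "leave",
        "Opera": "leave", "Safari": "encode",
    },
}


def encode_brackets(component):
    """Percent-encode '[' and ']' in a path or query component."""
    return component.replace("[", "%5B").replace("]", "%5D")


def as_sent(browser, part, component):
    """Return the component as the given browser is reported to send it."""
    if OBSERVED[part][browser] == "encode":
        return encode_brackets(component)
    return component
```

For example, as_sent("Firefox", "path", "/a[1]") gives "/a%5B1%5D",
while as_sent("Firefox", "query", "x[]=1") leaves the brackets
untouched.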

> Overall, I'm more and more wondering how we as editors, or a potential IETF
> IRI WG, would deal with the kind of variability between browsers that Erik
> is bringing up here. I thought we could just work from what HTML5 had,
> because that reflected wide current practice among browsers, but that
> doesn't seem to really be true.

My proposal is to document the differences (between the major
browsers). I can produce the differences for a number of test cases.
The hope is that the browsers will try to align. In some cases, it may
be a good idea to discuss how to align. In particular, we are not only
concerned about interoperability -- we are also concerned about
security.

Where the browsers do not align, the documented differences serve as
warnings to producers (to avoid those areas).

The specs might give recommendations, and also good reasons for those
recommendations. In particular, it would be nice to have good security
considerations.

Erik
Received on Tuesday, 29 September 2009 14:45:13 UTC