Re: Interpretation of %-escapes in IRIs [escapeInterpret-14]

Hello Bjoern,

Many thanks for all your questions.

Most of these questions, if not all of them, are answered
in the current draft. Please check it and tell me where you
think something is missing or not clear enough.

At 05:37 03/05/02 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >>IMO, the IRI draft should say that if %-escaping is used in an IRI, the
> >>escape sequence must be generated from UTF-8 octets and %-escapes must
> >>be interpreted as octets in a UTF-8 sequence.
> >
> >why should it say so? In that case, you should not really use
> >%-escaping in an IRI, you should use real characters.
>
>What if it is impossible to use "real" characters due to limitations of the
>transport media, the transport encoding,

Then preferably use a transport-specific escaping or encoding
(e.g. the various MIME mechanisms for email, numeric character
references for HTML and XML,...).


>if I need to escape a reserved character to avoid its special meaning,

Then use escaping. That's very clear in the draft.


>if the character is disallowed

Then use escaping. Again, the draft says so.


>or if I want to encode binary data that does not represent any
>character?

Then use escaping. Same thing again.


>What if my IRI-aware application receives an IRI containing %-escape
>sequences but needs characters in order to work, like some kind of
>server for file transfer expecting a file name or a database frontend
>expecting a search string?

Then the server will do the conversion from %-escapes to octets
the same way it currently does, and some servers (e.g. Apache and IIS
on WinNT/2000/XP), or server configurations, will convert further,
where possible, to whatever character encoding is used internally
in the server.


>Let's say there is an 'uri' URI scheme and an 'iri' IRI scheme

There is really no such difference. All URI schemes can be used
with IRIs. For some, the benefit of using IRIs is greater than
for others. I think what you wanted to say is that there are
two protocol slots, let's say
iri="http://www.example.org/search?Bj+APY-rn" and
uri="http://www.example.org/search?Bj+APY-rn". I'll assume
this for the following examples, but I'll not change your syntax.


>(the + in
>the query part has no special meaning and may thus stay unescaped):
>
>   uri://www.example.org/search?Bj+APY-rn
>   iri://www.example.org/search?Bj+APY-rn
>
>Decoding the query part of the URI I would get the octets
>
>   <42><6A><2B><41><50><59><2D><72><6E>

Yes.


>The database frontend would then search for "Bjo"rn",

Sorry to have to use "Bjo"rn" for your example due to my
Japanese mailer.


>since it decodes
>the octets represented by characters in the URL as UTF-7 octets.

If the database frontend is programmed that way, then that's correct.
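
As a minimal sketch of what such a frontend would do (Python is my
choice for illustration here, and the variable names are hypothetical),
the query part is first unescaped to octets and the octets are then
interpreted as UTF-7:

```python
from urllib.parse import unquote_to_bytes

# Query part taken from the example URI/IRI above.
query = "Bj+APY-rn"

# Step 1: %-unescaping. There are no %-escapes here, so each
# ASCII character simply becomes one octet.
octets = unquote_to_bytes(query)
hex_octets = " ".join(f"{b:02X}" for b in octets)

# Step 2: this particular frontend is assumed to treat the octets as UTF-7.
decoded = octets.decode("utf-7")

print(hex_octets)
print(decoded)
```

The second step is purely a convention of that frontend; nothing in the
URI or IRI itself says that UTF-7 is in use.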


>What
>about the IRI? Is the frontend supposed to search for "Bj+APY-rn" or
>for "Bjo"rn"?

If the same frontend is used, the same thing will happen.
The frontend has no way to distinguish whether it receives a URI
or an IRI.


>Is a data character in an IRI a character or is it a
>representation of an octet or even something else?

It is a character. That does not prohibit that these characters
are (mis)used to represent other characters, as in the case of
UTF-7.


>If an IRI data character is a "real" character, do %-escape sequences
>also refer to real characters? Are these IRIs equivalent:
>
>   iri://www.example.org/search?Bj%F6rn
>   iri://www.example.org/search?Bjo"rn

These are definitely not equivalent, because the %F6 is based
on Latin-1, not UTF-8.
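
A short sketch in Python (variable names are just for illustration)
shows why: the single octet F6 is the Latin-1 code for the character,
but it is not a valid UTF-8 sequence, and UTF-8 needs two octets for
the same character.

```python
from urllib.parse import unquote_to_bytes

# Unescape the query part of the first IRI to raw octets.
octets = unquote_to_bytes("Bj%F6rn")

# Read as Latin-1 (iso-8859-1), the octets give the intended name.
as_latin1 = octets.decode("iso-8859-1")

# Read as UTF-8, the lone F6 octet is rejected.
try:
    octets.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The UTF-8 octets for the same name: o-umlaut becomes C3 B6.
utf8_octets = "Bj\u00f6rn".encode("utf-8")

print(as_latin1, valid_utf8, utf8_octets)
```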


>just like these URIs are:
>
>   uri://www.example.org/search?a
>   uri://www.example.org/search?%61

If you read section 6 of
http://www.ietf.org/internet-drafts/draft-fielding-uri-rfc2396bis-03.txt
carefully, you'll see that these are equivalent
under certain definitions of equivalence, and for
those protocols/applications that use this definition
of equivalence.


>Are these equivalent:
>
>   iri://www.example.org/search?Bj%C3%B6rn
>   iri://www.example.org/search?Bjo"rn

These are equivalent under certain definitions of equivalence.
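
One such definition, sketched here in Python as an assumption rather
than as anything the draft mandates, is to escape the non-ASCII
characters via UTF-8 and then compare the resulting strings. The helper
name to_escaped and the choice of characters left untouched are mine.

```python
from urllib.parse import quote

def to_escaped(iri: str) -> str:
    # Escape non-ASCII characters as %-escaped UTF-8 octets.
    # ':', '/', '?' and existing '%' escapes are left as they are.
    return quote(iri, safe=":/?%")

with_escapes = "iri://www.example.org/search?Bj%C3%B6rn"
with_character = "iri://www.example.org/search?Bj\u00f6rn"
latin1_escape = "iri://www.example.org/search?Bj%F6rn"

# Equal after escaping, although the original strings differ.
same_after_escaping = to_escaped(with_character) == to_escaped(with_escapes)
same_as_strings = with_character == with_escapes

# The Latin-1 based escape stays different under this definition too.
latin1_matches = to_escaped(with_character) == to_escaped(latin1_escape)

print(same_after_escaping, same_as_strings, latin1_matches)
```

An application that compares character by character, without this
step, would treat the two IRIs as different.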


>and are these IRIs:
>
>   iri://www.example.org/search?a
>   iri://www.example.org/search?%61

They are as equivalent as the same URIs (see above).


>equivalent? If the latter two IRIs are equivalent, how would one then
>encode binary data in an IRI? What octets are represented in the query
>part of e.g.
>
>   iri://www.example.org/search?<U+20AC>
>   iri://www.example.org/search?<U+1D7F6>

The octets, when octets are needed, are based on UTF-8, i.e.
E2 82 AC in the first case, and F0 9D 9F B6 in the second case.
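
These octets can be checked with a few lines of Python (the helper name
utf8_octets is hypothetical):

```python
def utf8_octets(ch: str) -> str:
    # Return the UTF-8 octets of a string as space-separated hex.
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

euro = utf8_octets("\u20ac")       # U+20AC
digit = utf8_octets("\U0001D7F6")  # U+1D7F6

print(euro)
print(digit)
```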


>Suppose I want to send an IRI in a text/plain e-mail using us-ascii,
>but the IRI has non-ASCII characters, like
>
>   iri://www.example.org/bjo"rn

In the first place, you should not use us-ascii for sending this IRI.
There are many encodings, starting with iso-8859-1 and utf-8 that
can easily transfer the IRI.


>can I use %-escaping to encode the 'o"' and if yes, what would the IRI
>then look like? Would it be
>
>   iri://www.example.org/bj%F6rn
>   iri://www.example.org/bj%ECrn
>   iri://www.example.org/bj%C3%B6rn

If anything, it would be this one, with "bj%C3%B6rn", using UTF-8.
While this would not work for namespaces (i.e. XML parsers and
XSLT processors would treat the namespaces
iri://www.example.org/bjo"rn and iri://www.example.org/bj%C3%B6rn
differently), it would at least resolve to the same thing, e.g.
over http (exactly the same applies to http://www.example.org/search?a
and http://www.example.org/search?%61).


>   iri://www.example.org/bj%00%F6rn
>   ...
>
>Currently neither RFC 2396 nor the IRI draft gives any advice here. Is
>this a scenario not supported by IRIs?

Which scenario? The scenario of sending IRIs over US-ASCII?
Or another one?


>If yes, why do you think it is
>not necessary or not possible to support it,

If you mean sending IRIs over US-ASCII, then it's not possible in
the same way it's not really possible to send German or Japanese
email over US-ASCII.


>and why does the IRI draft
>not mention that %-escaping cannot be used for non-ASCII characters, but
>rather says it SHOULD NOT be used?

Because it depends on exactly what you are doing.


>If it is possible to use %-escaping
>for non-ASCII characters, the IRI draft must say how the non-ASCII
>characters have to be encoded (actually, how any character is to be
>encoded) and should say how one gets the characters back.

There are two very detailed sections in the draft discussing this.
For escaping, see section 3.1, "Mapping of IRIs to URIs".
For unescaping, see section 3.2, "Converting URIs to IRIs".
If you find anything that is unclear, please tell us, so that I can
fix it.
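
To give a rough idea of the two directions, here is a simplified sketch
in Python. It is only loosely modeled on those sections and is not the
draft's exact procedure: the function names are hypothetical, and a run
of escapes that contains any US-ASCII character is left escaped as a
whole, which is more conservative than necessary.

```python
import re

def iri_to_uri(iri: str) -> str:
    # Escape each non-ASCII character as its UTF-8 octets;
    # ASCII characters (including existing %-escapes) are kept.
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )

# One or more consecutive %-escapes.
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def uri_to_iri(uri: str) -> str:
    def unescape(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # not valid UTF-8 (e.g. %F6): keep the escapes
        if any(ord(ch) < 0x80 for ch in text):
            return run  # would expose ASCII such as '%' or '/': keep
        return text
    return _ESCAPED_RUN.sub(unescape, uri)

print(iri_to_uri("http://www.example.org/bj\u00f6rn"))
print(uri_to_iri("http://www.example.org/bj%C3%B6rn"))
print(uri_to_iri("http://www.example.org/bj%F6rn"))
```

Note that the Latin-1 based escape %F6 survives the second conversion
unchanged, because it cannot be interpreted as UTF-8.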


Regards,   Martin.

Received on Thursday, 26 June 2003 17:29:54 UTC