HTTP URI

Section 3.2.1 will, unfortunately, need careful wording to deal with the sticky issue of character sets. The problem is that the BNF of the URL specification is written in terms of characters, not octets. That is, the URL "http://foo.baz/%23bar" is defined as a sequence of characters that includes the characters "%", "2", "3", "b", etc. The translation of those character sequences into the octets used in on-the-wire protocols is left up to the individual scheme. The "ftp:" scheme, for example, calls for parsing the URL as a sequence of characters and then performing various operations in the FTP protocol using the decoded octets. This is how "ftp://foo.baz/afs/a/b/c", which is supposed to "CD afs", "CD a", "CD b" and then "RETR c", can be distinguished from "ftp://foo.baz/%2fafs%2fa%2fb/c", which is supposed to "CD /afs/a/b" and then "RETR c".
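
The "ftp:" behaviour described above can be sketched roughly as follows. This is only an illustration, assuming the path is split on literal "/" characters first and each segment is decoded afterwards; the function name ftp_commands is made up for this note, and the "CD"/"RETR" strings simply mirror the wording used above rather than any particular FTP client.

```python
from urllib.parse import urlsplit, unquote

def ftp_commands(url):
    """Hypothetical helper: list the FTP operations implied by an ftp: URL."""
    path = urlsplit(url).path          # urlsplit leaves %xx escapes untouched
    # The first "/" only separates the host from the path, so drop it.
    segments = path[1:].split("/")     # split on literal "/" characters only
    # Decode each segment *after* splitting, so "%2f" never acts as a separator.
    *dirs, name = [unquote(s) for s in segments]
    return [f"CD {d}" for d in dirs] + [f"RETR {name}"]

print(ftp_commands("ftp://foo.baz/afs/a/b/c"))
print(ftp_commands("ftp://foo.baz/%2fafs%2fa%2fb/c"))
```

The two URLs yield different command sequences only because decoding happens per segment; decoding the whole path before splitting would make them indistinguishable.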

However, for HTTP, the translation from characters to octets does not involve any decoding; the characters in the URL after the host name (and the entire URL when talking to a proxy) are turned into octets using the US-ASCII encoding.
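
By contrast, a minimal sketch of the HTTP case might look like the following. It assumes the part of the URL after the host name is the path plus any query; the function name request_octets and the via_proxy flag are invented for illustration.

```python
from urllib.parse import urlsplit

def request_octets(url, via_proxy=False):
    """Hypothetical helper: octets an HTTP client would send for this URL."""
    if via_proxy:
        target = url                   # a proxy is given the entire URL
    else:
        parts = urlsplit(url)
        target = parts.path or "/"     # everything after the host name
        if parts.query:
            target += "?" + parts.query
    # No %xx decoding takes place: "%", "2", "3" are sent as three characters.
    # Encoding raises UnicodeEncodeError if a character is not US-ASCII.
    return target.encode("us-ascii")

print(request_octets("http://foo.baz/%23bar"))
print(request_octets("http://foo.baz/%23bar", via_proxy=True))
```

Here the "%23" sequence reaches the wire unchanged, which is the difference from the "ftp:" scheme.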

The question does arise in special circumstances where octets that are not US-ASCII are used in a URL, e.g., after a "?" that introduces a query. The HTTP specification itself could be silent on this issue, but in the case of HTML forms with <ISINDEX>, the encoding might need to be specified as ISO-8859-1.
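
To illustrate why the character set of a query matters, here is a small sketch assuming the query octets are carried as %xx escapes and interpreted as ISO-8859-1. The search term and the helper names are hypothetical, and the use of escapes is an assumption of this example, not something the text above settles.

```python
from urllib.parse import quote, unquote_to_bytes

def encode_query(term):
    """Hypothetical helper: ISO-8859-1 octets of a term, written as %xx escapes."""
    return quote(term, encoding="iso-8859-1")

def decode_query(escaped):
    """Hypothetical helper: recover the term, assuming ISO-8859-1 octets."""
    return unquote_to_bytes(escaped).decode("iso-8859-1")

escaped = encode_query("café")         # "é" is not a US-ASCII character
print(escaped)
print(decode_query(escaped))
```

The receiver can only recover the original characters if it knows which character set the octets were produced with, which is the gap the ISO-8859-1 suggestion would close.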

http working group issues, 2/24.