W3C International Internationalization

This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.

URI encoding programs

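For worldwide interoperability, URIs have to be encoded uniformly. To map the wide range of characters used worldwide into the 60 or so allowed characters in a URI, a two-step process is used:

1. Convert the character string into a sequence of bytes using the UTF-8 encoding.
2. Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte.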

For example, the string

François

would be encoded as

Fran%c3%a7ois

(The "ç" is encoded in UTF-8 as the two bytes C3 (hex) and A7 (hex), which are then written as the three-character sequences "%c3" and "%a7" respectively.)
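The two steps can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name encode_uri_text is made up for this example, and it assumes that only ASCII letters and digits are left unescaped, which is stricter than the URI specifications require.

```python
import string

# Bytes copied through unchanged (assumption: ASCII letters and digits only).
SAFE_BYTES = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def encode_uri_text(text: str) -> str:
    """Encode text as UTF-8, then write every unsafe byte as %hh."""
    out = []
    for byte in text.encode("utf-8"):           # step 1: characters -> UTF-8 bytes
        if byte in SAFE_BYTES:
            out.append(chr(byte))               # safe byte: keep as-is
        else:
            out.append("%{:02x}".format(byte))  # step 2: byte -> %hh (lowercase hex)
    return "".join(out)


# "\u00e7" is the letter c with cedilla from the example above.
print(encode_uri_text("Fran\u00e7ois"))  # prints Fran%c3%a7ois
```

Python's standard urllib.parse.quote does the same job but writes the hexadecimal digits in uppercase ("Fran%C3%A7ois"); the two forms are equivalent, since the hex digits in an escape are case-insensitive.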

This can make a URI rather long (up to 9 ASCII characters for a single character in Unicode's Basic Multilingual Plane, and 12 for characters beyond it), but the intention is that browsers only need to display the decoded form, and many protocols can send UTF-8 without the %HH escaping.
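To make the size increase concrete, the following sketch counts how many ASCII characters the escaped form of a single character takes. It relies on Python's standard urllib.parse.quote; the helper name escaped_length and the sample characters are choices made for this example only.

```python
from urllib.parse import quote


def escaped_length(char: str) -> int:
    """Number of ASCII characters in the %HH-escaped UTF-8 form of one character."""
    # safe="" asks quote() to escape "/" as well; letters, digits and "_.-~" stay unescaped.
    return len(quote(char, safe=""))


# One sample character for each UTF-8 length from 1 to 4 bytes.
for char in ["a", "\u00e7", "\u20ac", "\U0001d11e"]:
    utf8_len = len(char.encode("utf-8"))
    print(f"U+{ord(char):04X}: {utf8_len} UTF-8 byte(s), "
          f"{escaped_length(char)} escaped character(s)")
```

Each escaped byte costs three ASCII characters, so the escaped length is three times the UTF-8 length for any character that has to be escaped.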

Program code

Here are some examples of program code for encoding and decoding:
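As an illustration of the decoding direction, here is a minimal Python sketch that reverses the two steps: each %HH sequence becomes one byte, and the resulting bytes are interpreted as UTF-8. It is not the code this page originally listed; the function name decode_uri_text and the choice to raise an error on malformed input are assumptions made for this example.

```python
import string


def decode_uri_text(escaped: str) -> str:
    """Turn %HH escapes back into bytes, then decode the bytes as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "%":
            hex_digits = escaped[i + 1:i + 3]
            # Require exactly two hexadecimal digits after the percent sign.
            if len(hex_digits) != 2 or any(c not in string.hexdigits for c in hex_digits):
                raise ValueError(f"malformed escape at position {i}")
            raw.append(int(hex_digits, 16))  # %HH -> one byte
            i += 3
        else:
            raw.extend(ch.encode("utf-8"))   # unescaped characters pass through
            i += 1
    return raw.decode("utf-8")               # raises UnicodeDecodeError on invalid UTF-8


print(decode_uri_text("Fran%c3%a7ois"))
```

Python's standard urllib.parse.unquote performs the same conversion; by default it substitutes a replacement character for invalid byte sequences instead of raising an error.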

Jigsaw

Jigsaw, the W3C demonstration server, is written in Java and could in principle serve resources with non-ASCII names. However, the current version (1.x) doesn't do so. Replacing the unescape routine in the file LookupState.java with the Java version above fixes that omission. (However, it is currently difficult to create non-ASCII resources interactively; that won't be fixed until Jigsaw 2.0.)


W3C Bert Bos, i18n coordinator
Last updated 2008/05/07 17:58:25