This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.
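For worldwide interoperability, URIs have to be encoded uniformly. To map the wide range of characters used worldwide into the 60 or so allowed characters in a URI, a two-step process is used: first, the character string is converted into a sequence of bytes using the UTF-8 encoding; second, each byte that is not an allowed ASCII character is written as %HH, where HH is the hexadecimal value of the byte.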
For example, the string
François
would be encoded as
Fran%c3%a7ois
(The "ç" is encoded in UTF-8 as the two bytes C3 (hex) and A7 (hex), which are then written as the three-character sequences "%c3" and "%a7" respectively.)
This can make a URI rather long (up to 9 ASCII characters for a single Unicode character in the Basic Multilingual Plane, and 12 for a character outside it), but the intention is that browsers only need to display the decoded form, and many protocols can send UTF-8 without the %HH escaping.
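The two steps can be sketched in Java as follows. This is only an illustrative sketch, not the code this page originally linked to: the class name UriUtf8Encoder, the method name encode, and the exact set of characters left unescaped are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

class UriUtf8Encoder {
    // Marks left unescaped in this sketch, in addition to ASCII letters and
    // digits. The exact safe set is an assumption, not taken from this page.
    private static final String SAFE_MARKS = "-_.~";

    static String encode(String text) {
        // Step 1: convert the character string into UTF-8 bytes.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            int value = b & 0xFF; // treat the byte as unsigned (0..255)
            boolean safe = (value >= 'a' && value <= 'z')
                    || (value >= 'A' && value <= 'Z')
                    || (value >= '0' && value <= '9')
                    || SAFE_MARKS.indexOf(value) >= 0;
            if (safe) {
                out.append((char) value);
            } else {
                // Step 2: write the byte as %HH with two lowercase hex digits.
                out.append('%');
                out.append(Character.forDigit(value >> 4, 16));
                out.append(Character.forDigit(value & 0x0F, 16));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e7 is the c with cedilla from the example above.
        System.out.println(encode("Fran\u00e7ois")); // prints Fran%c3%a7ois
    }
}
```

A character that needs three UTF-8 bytes, such as the euro sign (U+20AC), comes out as "%e2%82%ac", which is the 9-character case mentioned above.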
Here are some examples of program code for encoding and decoding:
The Java example, UTF8URL.java, is compiled with
javac UTF8URL.java
and run with
java UTF8URL
As you type in the upper box, the second box shows the encoded version, and the bottom box shows the decoded version of the second box (which, of course, should be exactly the same as what you typed).
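The decoding direction reverses the two steps: each %HH is turned back into a byte, and the resulting byte sequence is interpreted as UTF-8. The following is a minimal sketch under stated assumptions; it is neither the code this page originally linked to nor Jigsaw's own routine. The class name UriUtf8Decoder, the method name decode, and the choice to reject malformed escapes and raw non-ASCII input are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class UriUtf8Decoder {
    // Value of an ASCII hex digit, or -1 if the character is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static String decode(String escaped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%') {
                // A valid escape needs two hex digits after the percent sign.
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = hexValue(escaped.charAt(i + 1));
                int lo = hexValue(escaped.charAt(i + 2));
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad hex digits at index " + i);
                }
                bytes.write((hi << 4) | lo);
                i += 3;
            } else if (c < 0x80) {
                // Unescaped ASCII characters stand for themselves.
                bytes.write(c);
                i += 1;
            } else {
                // Assumption of this sketch: raw non-ASCII input is rejected.
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        // Interpret the collected bytes as UTF-8. Malformed byte sequences are
        // replaced with U+FFFD by this constructor rather than reported.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decode("Fran%c3%a7ois");
        // The decoded form should equal the originally typed string.
        System.out.println(decoded.equals("Fran\u00e7ois")); // prints true
    }
}
```

Hex digits are accepted in either case, so "%C3%A7" and "%c3%a7" decode to the same character.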
Jigsaw, the W3C demonstration server, is written in Java and could in principle serve resources with non-ASCII names. However, the current version (1.x) doesn't do so. Replacing the unescape routine in the file LookupState.java with the UTF-8-aware version from the examples above fixes that omission. (However, it is currently difficult to create non-ASCII resources interactively; that won't be fixed until Jigsaw 2.0.)