- 1 What is Happening with "International URLs"
- 1.1 Enter the URL Specification
- 1.2 What are the problems?
- 1.3 Internationalized Domain Name (IDNA) Issues
- 1.4 Bidirectional Languages
- 1.5 Other Possible Issues
What is Happening with "International URLs"
URLs were originally defined as ASCII only. Although it was desirable to allow non-ASCII characters in URLs, shoehorning UTF-8 into ASCII-only protocols seemed unapproachable. At that time, Unicode was not as dominant an encoding on the Web as it is today, so the tack taken was to leave "URI" alone and define a new protocol element, the "IRI" (Internationalized Resource Identifier). IRI is defined by RFC 3987, which was published in 2005 (in sync with the RFC 3986 update to the URI definition).
IRI defined an IRI-to-URI transformation. Unfortunately, it had options, such that the same IRI might have several different URI representations.
The URI-to-IRI transformation also isn't fully deterministic, since there is no guarantee that the %xx-encoded bytes in the source URI are actually the percent-encoded bytes of a UTF-8 encoded Unicode string.
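The ambiguity can be illustrated with a small sketch. The helper name `try_uri_to_iri_component` is hypothetical, and the function is deliberately simplified: it decodes every percent-escape, whereas a real converter would leave escapes for reserved characters alone.

```python
from urllib.parse import unquote_to_bytes


def try_uri_to_iri_component(component: str):
    """Return the decoded text, or None if the escaped bytes are not valid UTF-8.

    Hypothetical helper for illustration only; it is not part of any spec.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Bytes C3 A9 form the UTF-8 encoding of "é", so this decodes cleanly.
utf8_case = try_uri_to_iri_component("caf%C3%A9")

# A lone byte E9 ("é" in ISO-8859-1) is not valid UTF-8, so no IRI form is recoverable.
latin1_case = try_uri_to_iri_component("caf%E9")
```

Even when decoding succeeds, that only shows the bytes happen to be valid UTF-8, not that the author intended them as text.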
To address these issues, a new working group was established in IETF in 2009 (the "IRI working group"). Despite meeting several times, the group didn't get the attention of or support from browsers and other key developers. The IRI group was closed in early 2014, with the idea that the documents that were in the IRI working group could be updated using the individual submission process within the "applications area" of IETF. In particular, one of the IRI working group items was to update the "scheme guidelines and registration process", which was recently submitted here. This draft, of course, applies to IRIs as well.
Independently, HTML5 defined something called a "Web Address", in an attempt to describe (and harmonize) how browsers handled URLs. This definition (which focused on the parsing algorithm) was moved out into a separate WHATWG document called "URL".
There are other developments as well. ICANN approved non-ASCII top level domains as part of the IDNA effort. IDNA 2003 and 2008 didn't really address the problem of encoding non-ASCII domain elements into URLs. The Unicode Consortium developed UTS #46 to help implementers navigate issues that arose between the two IDNA versions.
Enter the URL Specification
At some point it was recognized that "Web addresses" and "IRIs" and "URIs" were all trying to accomplish effectively the same thing. With the demise of the IRI working group at IETF, the WHATWG "URL" spec is currently the focus of development effort. Development of this document appears to have better implementer support.
The URL spec focuses on the problem of defining rules for parsing, processing, and serializing URLs (the term chosen to represent all Web address/IRI/URI things).
It lists the following goals:
- Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process. (E.g. spaces, other "illegal" code points, query encoding, equality, canonicalization, are all concepts not entirely shared, or defined.) URL parsing needs to become as solid as HTML parsing.
- Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.
The IETF Working Group for IRI was shut down in January 2014 in recognition of the lack of progress or potential progress on a revision of the RFC 3987 document and URL is now the main vehicle for providing a solid reference that specifications (such as HTML) and implementations (such as browsers) can be based on. Similar to the HTML5 effort, URL's editors mainly focus on documenting what browsers actually do and, where interoperability is not currently present, working with implementers on resolving differences.
This means that the issues with IRI are unresolved. The URI/IRI transformation is not fully described and it is unclear if the URL spec will resolve all or only a part of the issues being worked on in the context of IRIs.
What are the problems?
There are a few problems that remain to be solved by URL. Some of the most visible of these are:
Internationalized Domain Name (IDNA) Issues
One class of issues relates to the advent of non-ASCII domain identifiers.
How to Encode IDNs
Non-ASCII domain names ("IDNA") rely on a special transfer encoding ("Punycode") to allow non-ASCII domain values inside the ASCII-only DNS nameserver hierarchy. Punycode encodes a domain name such as "faß.de" using special markup that looks like so: "xn--fa-hia.de"
By contrast, IRI defined the use of percent-encoding for the same case. So the same string would be "fa%C3%9F.de" in an IRI.
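As a rough illustration, both encodings of the same host can be produced with the Python standard library. This sketch applies the raw Punycode transfer encoding (RFC 3492) to a single label and adds the "xn--" prefix by hand. It performs none of the mapping or validation steps that IDNA requires, so it should not be read as a complete IDNA implementation.

```python
from urllib.parse import quote

label = "fa\u00df"  # "faß"

# Punycode form: encode the label, then add the ACE prefix "xn--".
punycode_host = "xn--" + label.encode("punycode").decode("ascii") + ".de"

# Percent-encoded form: UTF-8 bytes of the label, written as %xx escapes.
percent_host = quote(label + ".de")
```

The two results are different ASCII strings for what a user would consider the same domain name.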
Which form (Unicode characters, Punycode, or percent-encoded) should be used in plain text? On the wire? In document formats? Can one form substitute for the other(s)? Are IRIs that use different forms "identical"?
To add to the confusion, IDNA itself has two versions: IDNA2003 and IDNA2008. These differ in their handling of certain characters and a few other details. Unicode Technical Standard #46 (UTS #46) discusses the details and problems and summarizes the issues involved.
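One visible difference is the treatment of "ß". Python's built-in "idna" codec follows the IDNA2003 rules, so it can be used to sketch the older behavior. The IDNA2008 form shown in the comment is the one quoted earlier in this section.

```python
# The built-in "idna" codec implements IDNA2003, which maps "ß" to "ss".
idna2003_host = "fa\u00df.de".encode("idna").decode("ascii")

# IDNA2008 keeps "ß" as a distinct character, giving "xn--fa-hia.de" instead.
idna2008_host = "xn--fa-hia.de"

resolves_to_same_name = idna2003_host == idna2008_host
```

Depending on which version a processor follows, the same typed name can therefore resolve to two different domains.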
There are reports from Verisign that percent-encoded domain names are being sent to DNS (and failing), which suggests that the hex-encode-versus-punycode problem exists as an operational issue.
There is some evidence that some processors are %xx-hex-encoding the UTF-8 bytes of domain names in some circumstances.
If 'hostname' is a non-ASCII string (IDN), should a processor trying to convert the IRI to a URI use Punycode or %xx-hex-encoding for the authority segment?
Currently schemes aren't required to reserve the 'authority' field for DNS names only, so a URI might look like:
... for which the "non-ascii-but-not-dns-name" shouldn't be Punycode encoded: it's not a domain name.
Internationalized Email Addresses (EAI)
One specialized type of URL is the email address (presented with or without the 'mailto' scheme). Internationalized domain names can be multiply encoded (as above). In addition, it isn't clear how the "localpart" (e.g. the 'fred' in "fred@example.com") should be encoded or handled.
Mail agents, mail clients, plain text parsers, and many other applications, programs, and websites have built-in assumptions about what a "valid" email address is. Identifying, passing, and making functional addresses that contain non-ASCII values consistently end-to-end remains an issue.
Bidirectional Languages
Certain scripts in Unicode, such as Arabic and Hebrew, use a right-to-left writing direction. The Unicode Bidirectional Algorithm (UBA) defines how bidirectional strings are presented. The use of bidirectional text in URLs brings up a number of issues.
There is an IETF Internet-Draft with guidelines for the creation of Bidi IRIs, but it has not been updated recently.
Overall Presentation in a Bidirectional Language
Should a URL be presented in a default left-to-right presentation in all circumstances? Or should users in a predominantly right-to-left context see the URL presented with its base direction reversed? For example, which presentation should Arabic speakers expect to see:
Most bidi language speakers seem to find the top version "natural" for a URL, but there are anecdotal reports of some preferring the bottom presentation.
Unicode Bidirectional Algorithm Failure
The UBA itself is problematic when a URL is rendered as plain text (rather than in, say, the specialized environment of a browser's address bar). The separators in a URL include "." (U+002E FULL STOP) and "/" (U+002F SOLIDUS). These characters have no strong direction of their own: they're "neutral" and take their direction from neighboring characters. This means that strongly right-to-left or left-to-right text elements adjacent to a path or domain name separator affect the presentation of the overall string, and not benignly. Bidirectional text elements can appear to "swap position" when presented to the user, which is confusing and could enable difficult-to-detect phishing attacks.
For example, here is the Arabic string meaning "Egypt": مصر. And here is the Arabic string meaning "Tunisia": تونس.
Consider a URL that contains these two strings, one at the end of the domain name and one as the first element in the path. In the text here, "Egypt" is in the domain name and "Tunisia" is in the path. But that's not how they are displayed:
http://www.مصر/تونس
h t t p : / / w w w . \u0645 \u0635 \u0631 / \u062A \u0648 \u0646 \u0633
Note that this is a trivial example: elements can also "change places" visually in the path or between the path and query or fragment portions of a URL.
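The bidirectional classes involved can be inspected with Python's `unicodedata` module. This is only a sketch for examining the example URL, not an implementation of the UBA. In Unicode's terms, "." and "/" are reported as CS ("common separator"), a weak class rather than a strong one, which is consistent with their taking direction from their neighbors.

```python
import unicodedata

# Logical (stored) order: "Egypt" in the host, "Tunisia" in the path.
url = "http://www.\u0645\u0635\u0631/\u062a\u0648\u0646\u0633"

# Bidi class of every character, in logical order.
bidi_classes = [(ch, unicodedata.bidirectional(ch)) for ch in url]

# The two separators discussed above.
separators = {ch: unicodedata.bidirectional(ch) for ch in "./"}

# Characters with a strong right-to-left class ("R" or "AL").
strong_rtl = [ch for ch, cls in bidi_classes if cls in ("R", "AL")]
```

The "/" between the two Arabic words sits inside a single right-to-left run, which is why the host and path labels appear to trade places on screen.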
In plain text, Unicode bidirectional control characters are used to fix display issues like this, but inserting them into the URL would be unwelcome and potentially problematic when the address is broken into its constituent parts. Here's the same text with U+200E (LEFT-TO-RIGHT MARK or "LRM") inserted before the path separator:
http://www.مصر‎/تونس
h t t p : / / w w w . \u0645 \u0635 \u0631 \u200E / \u062A \u0648 \u0646 \u0633
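A short sketch shows why an embedded control character is a problem once the address is split apart. It uses Python's `urllib.parse.urlsplit` as a stand-in for "some URL parser". Other parsers may reject or strip the character, so the behavior shown is specific to this library.

```python
from urllib.parse import urlsplit

LRM = "\u200e"  # LEFT-TO-RIGHT MARK

# Same URL as above, with LRM inserted just before the path separator.
url_with_lrm = "http://www.\u0645\u0635\u0631" + LRM + "/\u062a\u0648\u0646\u0633"

parts = urlsplit(url_with_lrm)

# The invisible mark ends up inside the host component.
host_has_lrm = LRM in parts.hostname

# A consumer would have to strip it before resolving the name.
clean_host = parts.hostname.replace(LRM, "")
```

The mark fixes the display but becomes part of the host name, which no longer matches the intended domain unless every consumer knows to remove it.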
It has been proposed that the "address bar" of browsers and other user-agents be a special processing environment that provides an augmented bi-directional algorithm for processing and display. However, it should be noted that URLs are widely used as plain text: they appear on billboards and sides of buses, they are jotted down on napkins, and they are traded via email. These non-browser environments probably cannot all be fixed to provide special URL handling.
Other Possible Issues
Finding the boundaries of a URL might seem like a simple thing, but Unicode contains a number of characters that, while legal in an IRI, are not legal in an ASCII-only URI or which naive processors of text might not recognize as being "part of the URL". For example, certain kinds of space characters other than U+0020 SPACE might be legal, but, being spaces, might cause programs like an email client or editor program to treat them as the "end" of the address. This is outside the scope of the current URL specification but may be an area where additional documentation can be developed.
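The boundary problem is easy to reproduce with a naive tokenizer. The sketch below assumes a hypothetical address containing U+3000 IDEOGRAPHIC SPACE in its path and splits the surrounding text on whitespace, as many simple link detectors do.

```python
# Hypothetical text: the address contains U+3000 between "a" and "b".
text = "See http://example.com/a\u3000b for details"

# str.split() with no arguments splits on any Unicode whitespace.
naive_tokens = text.split()

# The "URL" the naive tokenizer finds stops at the ideographic space.
naive_url = next(tok for tok in naive_tokens if tok.startswith("http://"))
```

The tokenizer silently truncates the address, so the resulting link points somewhere other than where the author intended.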
IRI called for all data in an IRI to be encoded using Unicode UTF-8. However, for historical reasons, form data in the query string of a Web address is usually encoded in the character encoding of the form (that is, of the HTML page containing the form). This might not be UTF-8 and changing it to be UTF-8 would break Web servers that expect data to be in some other character encoding.
Update 2015-08-25: The URL spec defines this formally. By default the query string uses UTF-8. HTML forms allow the page author to supply an override (legacy) character encoding if needed, although UTF-8 is encouraged. If an override is used, there may be nothing in the URL itself that indicates what the override encoding is: the receiver just has to know.
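The effect of an override encoding can be sketched with `urllib.parse`. The field name "q" and the choice of ISO-8859-1 as the legacy encoding are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode

form_data = {"q": "fa\u00df"}  # "faß"

# Default behavior: the query string carries UTF-8 bytes.
utf8_query = urlencode(form_data, encoding="utf-8")

# Legacy override: the same value encoded as ISO-8859-1.
latin1_query = urlencode(form_data, encoding="iso-8859-1")

# A receiver that knows the override recovers the original text.
decoded_ok = parse_qs(latin1_query, encoding="iso-8859-1")

# A receiver that assumes UTF-8 gets a replacement character instead.
decoded_wrong = parse_qs(latin1_query)
```

Nothing in `latin1_query` itself identifies the encoding, so the receiver can only decode it correctly with out-of-band knowledge.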
Unicode contains many characters that look similar to other characters. Allowing the full range of Unicode into a URL means that characters that look similar, or even identical, to other characters could be used to spoof users.
IRI attempted to address certain aspects of confusable characters by requiring that URLs be in Unicode Normalization Form C (NFC). This eliminates differences related to the presence and order of combining marks. It doesn't remove most confusables, however.
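Both halves of this point can be shown with `unicodedata.normalize`. The strings are made-up examples: one pair differs only in how an accent is represented, the other substitutes a Cyrillic letter for a Latin one.

```python
import unicodedata

# Same text, two representations: base letter + combining accent vs. precomposed.
decomposed = "cafe\u0301"
precomposed = "caf\u00e9"
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Confusable pair: U+0430 CYRILLIC SMALL LETTER A in place of Latin "a".
latin = "example"
mixed = "ex\u0430mple"
still_different = unicodedata.normalize("NFC", mixed) != latin
```

NFC unifies the first pair but leaves the second pair distinct, even though the two strings may render identically.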
IDNA attempts to prevent confusables by disallowing mixed scripts in domain names and applying a much higher level of preparation and normalization to domain name values. This process works well for DNS because of the way that nameservers work, but it cannot be applied to the other portions of a URL easily (without breaking the functionality of URLs).
Should URL provide additional spoofing prevention? What rules should apply to the content of URLs? Should it remain NFC? Be relaxed? Be more stringent?