Difference between revisions of "IRIStatus"

From Internationalization

Revision as of 19:30, 15 April 2014

What's Going On With IRIs?

URLs were originally defined as ASCII only. Although it was quickly determined that it was desirable to allow non-ASCII characters in URLs, shoehorning UTF-8 into ASCII-only protocols seemed unapproachable. At that time, Unicode was not as dominant an encoding on the Web as it is today, so the tack taken was to leave "URI" alone and define a new protocol element, the "IRI" (Internationalized Resource Identifier). RFC 3987 was published in 2005 (in sync with the RFC 3986 update to the URI definition).

RFC 3987 defined an IRI-to-URI transformation. Unfortunately, it had options, so the result wasn't deterministic. The URI-to-IRI transformation was also only heuristic, since there was no guarantee that %xx-encoded bytes in the URI were actually meant to be the percent-encoded bytes of a UTF-8 encoded Unicode string.
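The URI-to-IRI ambiguity can be illustrated with a minimal Python sketch. It uses only the standard library; the helper name `percent_decode_as_utf8` and the sample strings are hypothetical, chosen for illustration, and this is not the conversion procedure from RFC 3987 itself.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_decode_as_utf8(escaped: str):
    """Return the decoded text if the %xx bytes form valid UTF-8, else None."""
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# IRI-to-URI direction: encode as UTF-8, then percent-encode each byte.
encoded = quote("é")  # quote() uses UTF-8 by default

# URI-to-IRI direction: the converter can only guess what the bytes meant.
from_utf8 = percent_decode_as_utf8("caf%C3%A9")  # valid UTF-8
from_latin1 = percent_decode_as_utf8("caf%E9")   # an ISO-8859-1 byte, not UTF-8

print(encoded, from_utf8, from_latin1)
```

The second URI is perfectly legal, but its escaped byte cannot be read as UTF-8, so a converter has to either leave it escaped or guess at another encoding.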

To address these issues, a new working group (the "IRI working group") was established in the IETF in 2009. Despite meeting several times, the group didn't get the attention of browser developers or support from key implementers. The IRI group was closed in 2014, with the idea that the documents that were in the IRI working group could be updated using the individual submission process within the "applications area" of the IETF. In particular, one of the IRI working group items was to update the "scheme guidelines and registration process", which was recently submitted as a draft. This draft, of course, applies to IRIs as well.

Independently, the HTML5 effort at WHATWG and W3C defined something called a "Web Address", in an attempt to describe (and harmonize) how browsers handled URLs. This definition (which focused on the parsing algorithm) was moved out into a separate WHATWG document called "URL".

The world has also moved on. ICANN has approved non-ASCII top-level domains as part of the IDNA effort. IDNA 2003 and 2008 didn't really address the problem of encoding non-ASCII domain elements into URLs. The Unicode Consortium developed UTS #46 to help implementers navigate issues that arose between the two IDNA versions.

This leaves the original issues with IRI unresolved. The transformation between URI and IRI is ambiguous. Other issues must also be addressed before a description of URLs can be considered complete.

--

The Internationalized Resource Identifiers (IRI) spec RFC 3987 was published by the IETF in January 2005, in lockstep with a revision to the URI spec RFC 3986. IRI defines Internet resource identifiers that can contain non-ASCII values, as well as the rules for converting to or from classical URIs, which were ASCII-only.

The IRI spec is not yet an IETF Standard. It has been stuck at "Proposed Standard" for some time, mainly because various issues were found with IRI almost immediately upon its publication. An IETF Working Group was established to address these issues in 2009, but progress was sporadic and the effort was finally abandoned because it was difficult to get enough active members who owned the key implementations involved.

Meanwhile, HTML5 defined something called a "Web address". Effectively, a "Web address" was an IRI appearing inside of an HTML document, such as in an href attribute.

Enter the URL Specification

At some point it was recognized that "Web addresses" and "IRIs" and "URIs" were all trying to accomplish effectively the same thing. With the demise of the IRI working group at IETF, the WHATWG "URL" spec is currently the focus of development effort. Development of this document appears to have better implementer support.

The URL spec focuses on the problem of defining rules for parsing, processing, and serializing URLs (the term chosen to represent all Web address/IRI/URI thingys).

It lists the following goals:

  • Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process. (E.g. spaces, other "illegal" code points, query encoding, equality, canonicalization, are all concepts not entirely shared, or defined.) URL parsing needs to become as solid as HTML parsing.
  • Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.
  • Define URL's existing JavaScript API in full detail and add enhancements to make it easier to work with. Add a new URL object as well for URL manipulation without usage of HTML elements. (Useful for Web Workers.)

The IETF Working Group for IRI was shut down in January 2014 in recognition of the lack of progress, or potential progress, on a revision of the RFC 3987 document. URL is now the main vehicle for providing a solid reference that specifications (such as HTML) and implementations (such as browsers) can be based on. Similar to the HTML5 effort, URL's editors mainly focus on documenting what browsers actually do and, where interoperability is not currently present, working with implementers on resolving differences.

What are the problems?

There are a few problems that remain to be solved by URL. Some of the most visible of these are:

International Domain Names: How to Encode

Non-ASCII domain names ("IDNA") rely on a special transfer encoding ("Punycode") to allow non-ASCII domain values inside the ASCII-only DNS nameserver hierarchy. Punycode encodes a domain name such as "faß.de" using special markup that looks like so: "xn--fa-hia.de"

By contrast, IRI defined the use of percent-encoding for the same case. So the same string would be "fa%C3%9F.de" in an IRI.
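As a rough illustration, both forms of the example host can be produced with Python's standard library. This is a sketch only: it applies the raw Punycode transfer encoding to a single label and skips the mapping and validation steps that a real IDNA implementation performs.

```python
from urllib.parse import quote

label = "faß"

# Punycode form: "xn--" prefix plus the Punycode encoding of the label.
# No IDNA mapping or validation is applied here.
ace_host = "xn--" + label.encode("punycode").decode("ascii") + ".de"

# Percent-encoded form: the UTF-8 bytes of the host, non-ASCII bytes as %xx.
pct_host = quote(label + ".de")

print(ace_host)
print(pct_host)
```

The two results share no visible structure, which is why a consumer that expects one form cannot simply be handed the other.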

Which form (Unicode characters, Punycode, or percent-encoded) should be used in plain text? On the wire? In document formats? Can one form substitute for the other(s)? Are IRIs that use different forms "identical"?

To add to the confusion, IDNA itself has two versions: IDNA2003 and IDNA2008. These differ in the handling of certain characters and a few other details. Unicode Technical Standard #46 (UTS #46) discusses the details and problems of this and has a summary of the issues involved.
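The "faß.de" example happens to be one of the cases where the two versions disagree. A small Python sketch shows this, relying on the fact that Python's built-in `idna` codec implements IDNA2003; the IDNA2008-style value is built by hand from the raw Punycode codec and is only an approximation of what an IDNA2008 library would do.

```python
# Python's built-in "idna" codec follows IDNA2003, which maps ß to "ss"
# before encoding, so the result is plain ASCII with no "xn--" label.
idna2003 = "faß.de".encode("idna")

# IDNA2008 keeps ß as a distinct character. Approximated here by applying
# the raw Punycode codec to the label (no validation performed).
idna2008_style = b"xn--" + "faß".encode("punycode") + b".de"

print(idna2003)
print(idna2008_style)
```

Depending on which version a client implements, the same typed name can therefore resolve to two different domains.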

There are reports from Verisign that percent-encoded domain names are being sent to DNS (and failing), which suggests that the hex-encode-versus-punycode problem exists as an operational issue.

Bidirectional Languages

Certain scripts in Unicode, such as Arabic and Hebrew, use a right-to-left writing direction. The Unicode Bidirectional Algorithm (UBA) defines how bi-directional strings are presented. The use of bi-directional text in URLs brings up a number of issues.

Overall Presentation in a Bidirectional Language

Should a URL be presented in a default left-to-right presentation in all circumstances? Or should users in a predominantly right-to-left context see the URL presented with its base direction reversed? For example, which presentation should Arabic speakers expect to see:

   http://left.to.right/example
   example/right.to.left//:http

Most bidi language speakers seem to find the top version "natural" for a URL, but there are anecdotal reports of some preferring the bottom presentation.

Unicode Bidirectional Algorithm Failure

UBA itself is problematic when a URL is rendered as plain text (rather than in, say, the specialized environment of a browser's address bar). The separators in a URL are the characters "." (U+002E FULL STOP) and "/" (U+002F SOLIDUS). These characters have no strong direction of their own: they're "neutral" and take their direction from neighboring characters. This means that strongly right-to-left or left-to-right text elements adjacent to a path or domain name separator have an effect on the presentation of the overall string, and not a benign effect either. Bi-directional text elements could be seen to "swap position" when presented to the user in a way that is confusing, potentially creating difficult-to-detect phishing attacks.

For example, here is the Arabic string meaning "Egypt": مصر. And here is the Arabic string meaning "Tunisia": تونس.

Consider a URL that contains these two strings, one at the end of the domain name and one as the first element in the path. In the text here, "Egypt" is in the domain name and "Tunisia" is in the path. But that's not how they are displayed:

   http://www.مصر/تونس
   h t t p : / / w w w . \u0645 \u0635 \u0631 / \u062A \u0648 \u0646 \u0633

Note that this is a trivial example: elements can also "change places" visually in the path or between the path and query or fragment portions of a URL.
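The directional properties involved can be inspected with Python's `unicodedata` module. The sketch below lists the Bidi_Class of each code point in the example URL, in logical (stored) order. Formally, "." and "/" carry the weak class CS (Common Separator) rather than a neutral class, but when they are not sitting between digits the UBA ends up treating them like neutrals, which is the behavior described above.

```python
import unicodedata

url = "http://www.\u0645\u0635\u0631/\u062A\u0648\u0646\u0633"

# Pair each code point, in logical order, with its Unicode Bidi_Class.
classes = [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in url]

for code_point, bidi_class in classes:
    print(code_point, bidi_class)
```

The Latin letters report L, the Arabic letters report AL, and the "/" that separates the domain from the path has Arabic letters on both sides, so nothing anchors it to a left-to-right position.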

In plain text, Unicode bidirectional control characters are used to fix display issues like this, but inserting them into the URL would be unwelcome and potentially problematic when the address is broken into its constituent parts. Here's the same text with U+200E (LEFT-TO-RIGHT MARK, or "LRM") inserted before the path separator:

  http://www.مصر‎/تونس
  h t t p : / / w w w . \u0645 \u0635 \u0631 \u200E / \u062A \u0648 \u0646 \u0633
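Why the inserted mark is problematic "when the address is broken into its constituent parts" can be seen in a short Python sketch using `urllib.parse.urlsplit`. The helper `strip_lrm` is hypothetical and only shows one possible clean-up step; it is not something any specification mandates.

```python
from urllib.parse import urlsplit

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK

plain = "http://www.\u0645\u0635\u0631/\u062A\u0648\u0646\u0633"
marked = "http://www.\u0645\u0635\u0631" + LRM + "/\u062A\u0648\u0646\u0633"

# The invisible mark becomes part of the host component after splitting.
plain_host = urlsplit(plain).hostname
marked_host = urlsplit(marked).hostname


def strip_lrm(text: str) -> str:
    """Hypothetical clean-up: drop LRM characters before processing."""
    return text.replace(LRM, "")


print(repr(plain_host))
print(repr(marked_host))
```

The two URLs look like they differ only in layout, yet they are different strings with different host components unless something removes the control character first.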

It has been proposed that the "address bar" of browsers and other user-agents be a special processing environment that provides an augmented bi-directional algorithm for processing and display. However, it should be noted that URLs are widely used as plain text: they appear on billboards and sides of buses, they are jotted down on napkins, and they are traded via email. These non-browser environments probably cannot all be fixed to provide special URL handling.

Link detection

Finding the boundaries of a URL might seem like a simple thing, but Unicode contains a number of characters that, while legal in an IRI, are not legal in an ASCII-only URI, or that naive processors of text might not recognize as being "part of the URL". For example, certain kinds of space characters other than U+0020 SPACE might be legal, but, being spaces, might cause programs like an email client or editor to treat them as the "end" of the address.
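A naive link detector shows the effect. The regular expression below is a deliberately simplistic, hypothetical pattern (not taken from any real mail client), and the sample URL is made up; U+3000 IDEOGRAPHIC SPACE is used because it falls inside the non-ASCII character range that RFC 3987 permits.

```python
import re

# Naive detector: scheme, then everything up to the first whitespace character.
NAIVE_URL = re.compile(r"https?://\S+")

# U+3000 IDEOGRAPHIC SPACE sits in the middle of the path.
text = "see http://example.org/a\u3000b for details"

found = NAIVE_URL.search(text).group(0)
print(found)
```

Because Python's `\S` treats U+3000 as whitespace, the detected link stops before it and the rest of the path is lost.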

Query encoding

IRI called for all data in an IRI to be encoded using Unicode UTF-8. However, for historical reasons, form data in the query string of a Web address is encoded in the character encoding of the form (that is, of the HTML page containing the form). This might not be UTF-8 and changing it to be UTF-8 would break Web servers that expect data to be in some other character encoding. Should query strings be encoded to UTF-8? What rules should be applied inside query strings?
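The difference is easy to reproduce with `urllib.parse.urlencode`, which accepts an `encoding` argument. In this sketch the field name `q`, the value, and the choice of windows-1252 as the page encoding are all assumptions made for illustration.

```python
from urllib.parse import urlencode

form = {"q": "café"}

# What the IRI rules call for: UTF-8 bytes, percent-encoded.
as_utf8 = urlencode(form, encoding="utf-8")

# What a form on a windows-1252 page would traditionally submit.
as_cp1252 = urlencode(form, encoding="windows-1252")

print(as_utf8)
print(as_cp1252)
```

A server written for the second form would misread the first, which is why simply switching everything to UTF-8 is not a safe change.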

Confusable characters

Unicode contains many characters that have similar appearance to other characters. Allowing the full range of Unicode into a URL means that characters which look similar—or even identical to—other characters could be used to spoof users.

IRI attempted to address certain aspects of confusable characters by requiring that URLs be in Unicode Normalization Form C (NFC). This eliminates differences related to the presence and order of combining marks. It doesn't remove most confusables, however.
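A brief Python sketch of what NFC does and does not do, using the standard `unicodedata` module. The strings are invented examples.

```python
import unicodedata

# Combining-mark difference: "e" + U+0301 versus precomposed U+00E9.
decomposed = "e\u0301"
precomposed = "\u00e9"
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# Confusable difference: Latin "a" versus Cyrillic U+0430, which look alike.
latin = "example"
mixed = "ex\u0430mple"
still_different = (
    unicodedata.normalize("NFC", latin) != unicodedata.normalize("NFC", mixed)
)

print(nfc_equal, still_different)
```

Normalization unifies the two spellings of "é", but the Latin and Cyrillic strings remain distinct after NFC even though they may render identically.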

IDNA attempts to prevent confusables by disallowing mixed scripts in domain names and applying a much higher level of preparation and normalization to domain name values. This process works well for DNS because of the way that nameservers work, but it cannot be applied to the other portions of a URL easily (without breaking the functionality of URLs).

Should URL provide additional spoofing prevention? What rules should apply to the content of URLs? Should it remain NFC? Be relaxed? Be more stringent?

Non-domain name authority fields

There is some evidence that some processors are %xx-hex-encoding the UTF-8 of domain names in some circumstances.

If 'hostname' is a non-ASCII string (IDN), should a processor trying to convert the IRI to a URI use Punycode or %xx-hex-encoding for the authority segment?

Currently schemes aren't required to reserve the 'authority' field for DNS names only, so a URI might look like:

   newscheme://non-ascii-but-not-dns-name/path

... for which the "non-ascii-but-not-dns-name" shouldn't be Punycode-encoded: it's not a domain name.
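One way a processor might approach this is sketched below in Python. Everything here is an assumption for illustration: the function name `authority_to_ascii`, the `DNS_AUTHORITY_SCHEMES` set, and the policy itself are hypothetical, and the Punycode branch applies only the raw transfer encoding without IDNA mapping or validation.

```python
from urllib.parse import quote

# Hypothetical list of schemes whose authority is known to hold a DNS name.
DNS_AUTHORITY_SCHEMES = {"http", "https", "ftp"}


def authority_to_ascii(scheme: str, host: str) -> str:
    """Pick an ASCII form for a host based on what the scheme says it is."""
    if host.isascii():
        return host
    if scheme.lower() in DNS_AUTHORITY_SCHEMES:
        # DNS name: encode each non-ASCII label with Punycode.
        labels = []
        for label in host.split("."):
            if label.isascii():
                labels.append(label)
            else:
                labels.append("xn--" + label.encode("punycode").decode("ascii"))
        return ".".join(labels)
    # Not known to be a DNS name: fall back to percent-encoded UTF-8.
    return quote(host, safe="")


print(authority_to_ascii("http", "faß.de"))
print(authority_to_ascii("newscheme", "faß.de"))
```

The sketch makes the underlying difficulty visible: the right answer depends on knowledge about the scheme that a generic IRI processor may not have.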