UriIriIssues

From W3C Wiki

URI / IRI Issues

This page tracks potential issues with the use of IRIs, especially regarding their relationship to URIs and use as a protocol element.

Issues

Canonicalisation

Unicode case folding is complex, with several context-dependent algorithms, some of which produce results that casual users would find astonishing. As a result, canonicalising domain names in URIs is no longer simple. Likewise, a character can be represented in more than one way (depending on what "same" means in a given context). Normalisation (whereby different representations of the same character are mapped to a single form) also comes in several variants, adding confusion and potential for problems.
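A minimal Python sketch of both effects, using only the standard library's unicodedata module and built-in string methods. The sample strings are illustrative choices and are not taken from any specification.

```python
import unicodedata

# Two representations of "é": one precomposed code point versus
# "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# A plain string comparison sees two different identifiers.
raw_equal = composed == decomposed

# After normalising both to NFC they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Case mapping can change the length of a string: lowercasing
# U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) yields "i"
# followed by a combining dot above.
dotted_capital_i = "\u0130"
lowered = dotted_capital_i.lower()

print(raw_equal, nfc_equal)
print(len(dotted_capital_i), len(lowered))
```

Whether the lowercased form is what a Turkish-speaking user would expect is exactly the kind of context dependence described above.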

Multiple encodings

IRIs define a specific encoding for non-ASCII characters, but in some contexts this encoding may not be used, e.g., because another encoding is more readily available. For example, in XML and HTML, authors may find it more familiar to use percent-encoding on the entire identifier; in HTTP headers, RFC 2047 encoding is defined as the appropriate encoding (although this is changing in HTTPbis).

Additionally, whether percent-encoding should be used in the host portion of a URI (not IRI) needs to be considered.
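To make the divergence concrete, here is a small Python sketch that encodes one non-ASCII string three ways: percent-encoded UTF-8, percent-encoded Latin-1, and an RFC 2047 encoded-word. The sample string is an assumption chosen for illustration, and the use of Python's email.header module for the RFC 2047 form is a convenience, not something the specifications above prescribe.

```python
from email.header import Header
from urllib.parse import quote, unquote

identifier = "r\u00e9sum\u00e9"  # "résumé", an illustrative value

# Percent-encoding of the UTF-8 octets.
utf8_form = quote(identifier)

# Percent-encoding of the Latin-1 octets: a different ASCII string
# for the same characters.
latin1_form = quote(identifier, encoding="latin-1")

# A receiver that assumes UTF-8 cannot recover the original from the
# Latin-1 form; the invalid octets become U+FFFD.
misread = unquote(latin1_form)

# RFC 2047 encoded-word, as used in some header contexts.
rfc2047_form = Header(identifier, "utf-8").encode()

print(utf8_form)
print(latin1_form)
print(rfc2047_form)
```

All three outputs are pure ASCII and all three stand for the same characters, yet none of them compares equal to another.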

"Confusables"

As has been widely noted, the considerable range of characters in Unicode makes it possible to mint an identifier that physically looks like another, so that a user can be deceived about the authority they're connecting to.
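The following Python sketch shows how two visually similar host names can differ at the code point level. The host names are made up, and the letter_scripts helper is a hypothetical, crude heuristic (it reads the first word of each character's Unicode name); it is not an implementation of any standard confusable-detection algorithm.

```python
import unicodedata

genuine = "paypal.example"
# Same appearance in many fonts, but both "a" letters are
# U+0430 CYRILLIC SMALL LETTER A.
spoofed = "p\u0430yp\u0430l.example"


def letter_scripts(label):
    """Return the first word of the Unicode name of each letter,
    used here as a rough stand-in for the letter's script."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


print(genuine == spoofed)
print(sorted(letter_scripts(genuine)))
print(sorted(letter_scripts(spoofed)))
```

A mixed-script result is only a hint: some legitimate identifiers mix scripts, and some confusable pairs exist within a single script.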

Use in Protocol Elements

There have been some instances of protocol elements specifying the use of IRIs even when they are not intended for display to end users.

Examples

Atom Links

Links in Atom are allowed to be IRIs. Some are serialised into HTTP requests; others are used for purposes of comparison (e.g., as link relation types). In the former case, the links will be converted to URIs; in the latter, they will be compared character-for-character in the XML, and therefore may not vary.

Background Notes

One of the nice things about ASCII is that there is no ambiguity about the letters themselves, or at least we successfully pretend that there isn't. The DNS does case-insensitive matching and it is generally accepted that, even if one writes DNS labels in a case-idiosyncratic way, only a madman would depend on the distinctions. There is also exactly one way to represent any given character. The decision to do case-insensitive matching is ancient history and there seems to be no way to change it, but it is not as uncontroversial as we would normally like to believe.

When one moves to Unicode (whether in UTF-8 coding or otherwise), those assumptions no longer hold. Matching characters that differ only by case requires a procedure or a choice of procedures. The one (toCaseFold) that the Unicode Technical Committee (UTC) recommends has become extremely controversial in some quarters and is context-dependent. Many characters can be represented in more than one way; some can be represented in many different ways (depending in part on what "same character" actually means, a subject that may require subjective or context-dependent judgments).

So, if one has a stored identifier (whether a FQDN or something in some other system), the question of what gets looked up is often the easy part. The more difficult problem is what rules are used to determine whether the identifier in the query and the stored identifier match. Stringprep and its various profiles and the Unicode operations NFC, NFKC, toCaseFold, toLowerCase can all be used to try to canonicalize (in the general sense, not the Unicode one) the two strings so that simplified matching is feasible, but there are many alternatives and arguments for each of them. When the identifiers used are user-facing, getting the matching rules wrong is likely to violate the law of least astonishment and other fundamental design principles.
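As a rough illustration of how the choice of rule changes the outcome, the Python sketch below applies several candidate canonicalisations to a stored identifier and a query and reports which ones make the pair match. Python's str.lower and str.casefold are used as approximations of toLowerCase and toCaseFold; Stringprep profiles are left out. The sample pairs and the matching_rules helper are assumptions made for this example.

```python
import unicodedata

# Candidate canonicalisations, keyed by the names used in the text.
CANONICALISERS = {
    "none": lambda s: s,
    "NFC": lambda s: unicodedata.normalize("NFC", s),
    "NFKC": lambda s: unicodedata.normalize("NFKC", s),
    "toLowerCase": str.lower,
    "toCaseFold": str.casefold,
}

# (stored identifier, identifier in the query)
PAIRS = [
    ("caf\u00e9", "cafe\u0301"),     # precomposed vs combining accent
    ("\uff57\uff57\uff57", "www"),   # fullwidth letters vs ASCII
    ("stra\u00dfe", "STRASSE"),      # sharp s vs "SS"
]


def matching_rules(stored, query):
    """Return the names of the rules under which the two strings match."""
    return [name for name, canon in CANONICALISERS.items()
            if canon(stored) == canon(query)]


for stored, query in PAIRS:
    print(repr(stored), repr(query), matching_rules(stored, query))
```

Each pair matches under a different subset of rules, and none matches under all of them, so two systems that pick different rules will disagree about which identifiers are "the same".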

To further complicate this, our most commonly-used user-facing identifiers are probably URIs in their various forms and variations. Many URIs are not expected to be resolved at all: they are simply identifiers whose purpose is realized by comparison with other identifiers. The URI spec implies, and the HTML and XML specs seem to be quite explicit, that two URIs match iff they are bit-string identical. It isn't even clear whether, at the URL level, http://www.iab.org/ and http://www.IAB.org/ are the same identifier. That means that there is a very strong case for canonical forms, and canonical forms only, for URIs. But users don't type canonical forms all the time, the URI spec doesn't identify the canonical form, the current IRI spec (which is arguably broken in other ways) is no help at all, and applications are very inconsistent about how these things are interpreted in practice... an interoperability problem waiting to happen as well as a user-confusion one.
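The two example URIs above can be compared both ways in a few lines of Python. The lowercase_authority helper is a hypothetical, deliberately narrow canonicaliser: it lowercases only the scheme and authority, and it assumes the authority contains no userinfo, since userinfo is case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit

a = "http://www.iab.org/"
b = "http://www.IAB.org/"


def lowercase_authority(uri):
    """Lowercase the scheme and authority, leaving the rest untouched.

    Assumes there is no userinfo in the authority component.
    """
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(scheme=parts.scheme.lower(),
                                     netloc=parts.netloc.lower()))


# Character-for-character comparison treats them as different identifiers.
print(a == b)

# After one possible canonicalisation they are identical.
print(lowercase_authority(a) == lowercase_authority(b))
```

Which of the two answers is "right" depends on whether the consumer compares identifiers as strings or intends to resolve them.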

The IRI part is both easier and more complicated. We don't have clear agreement on whether they are suitable for use in protocols (HTTP says "no", at least at present, HTML5 and XML apparently say "yes", others are mixed or vague), whether they are just a recommended part of a common user interface spec, or something in between. If they are anything more than recommended UI elements, the specification is far too permissive in places and far too vague in others. As just one example, if I correctly understand a recent comment of Martin's on the IDNA list, he believes that an IRI->URI processor that sees non-ASCII characters in a domain name field can legitimately map the relevant labels to either Punycode or to %-escaped UTF-8. Whatever the other issues are, it implies that, if the URI comparison rules are taken seriously, a single IRI can be mapped by different processors into a pair of URIs that don't compare equal -- obviously not a good situation.
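A short Python sketch of the two mappings, assuming a made-up host name. It uses the standard library's "idna" codec for the Punycode form (that codec implements the older IDNA 2003 rules, which is sufficient for this example) and urllib.parse.quote for the %-escaped UTF-8 form.

```python
from urllib.parse import quote

iri_host = "b\u00fccher.example"  # "bücher.example", an illustrative host

# Processor A: map the non-ASCII label to its Punycode (ACE) form.
punycode_host = iri_host.encode("idna").decode("ascii")

# Processor B: percent-escape the UTF-8 octets of the host.
percent_host = quote(iri_host)

punycode_uri = "http://" + punycode_host + "/"
percent_uri = "http://" + percent_host + "/"

print(punycode_uri)
print(percent_uri)
print(punycode_uri == percent_uri)
```

Both results are valid ASCII-only strings derived from the same IRI, but a character-for-character comparison treats them as different identifiers.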