Re: IRI guidance from Alex Hall on 2011-04-29 (public-rdf-wg@w3.org from April 2011)

From: Alex Hall <alexhall@revelytix.com>
Date: Fri, 29 Apr 2011 09:42:33 -0400
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: Ivan Herman <ivan@w3.org>, nathan@webr3.org, RDF WG <public-rdf-wg@w3.org>
Message-ID: <BANLkTikoXcogFMJbxHTi0AwqHjQ2ZYYxNA@mail.gmail.com>
On Fri, Apr 29, 2011 at 8:14 AM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Ivan Herman <ivan@w3.org> [2011-04-29 08:24+0200]
> >
> > On Apr 28, 2011, at 23:59 , Eric Prud'hommeaux wrote:
> > <snip/>
> > >>
> > >> Unfortunately this can lead to unexpected consequences, such as an
> > >> application dereferencing the IRI http://xn--rsum-bpad.example.org (not
> > >> sure how GMail will escape that -- that's the punycode version) and
> > >> getting a document with a description of some resource with IRI
> > >> http://résumé.example.org (Unicode version).  To help prevent this, we
> > >> could discourage the use of IRIs with encoded IDNs in RDF, similar to
> > >> how the existing spec discourages the use of URI Refs with
> > >> percent-escaped characters.
> > >
> > > I think this leads down the path of not using IRIs. When dereferencing
> > > an HTTP IRI, one has to punyify the domain name and percentulate the
> > > path, mapping http://伝言.example/?user=أكرم to
> > > http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85 . Any IRI
> > > with characters outside of the legal URI characters will map to a
> > > differently spelled URI, necessitating some typing of these respective
> > > strings. If we're taking away the sharp knives, we'll have to take
> > > away non-ascii characters and díäcrìtïcâl markç.
> >
> > Eric, I am not sure I understand that. The proposal is to say that, in
> > RDF, there should be a preference for the UTF version of the URI-s, ie, I
> > should, if possible, opt for http://伝言.example/?user=أكرم rather than
> > the other version. What happens underneath if I dereference that URI and
> > send it to tools for an HTTP get or anything similar is a separate issue.
> > Indeed, on an English keyboard typing something even as simple as
> > http://iván.herman.net is a pain for a user, but that is a practical
> > problem which is again outside the realm of RDF.
>
> Ahh, I interpreted "discourage … encoded IDNs" as discouraging
> UTF-8-encoded IRIs while the intent was discouraging punycode-encoded.
> Sorry.
>
>
No worries -- "encoded" is too vague a term, I should have been more
specific.
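
To make the two senses concrete, here is a rough Python sketch of the mapping
Eric describes (punycode the host, percent-encode the rest). The helper name
iri_to_uri and the SAFE character set are my own assumptions, it only loosely
follows RFC 3987 section 3.1, and it ignores userinfo, so treat it as an
illustration rather than a conforming implementation:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched by the percent-encoding step: URI reserved and
# unreserved punctuation, plus "%" so that existing escapes are not altered.
SAFE = "/:@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an HTTP IRI to the URI a client would request."""
    parts = urlsplit(iri)
    # Punycode each non-ASCII host label. Python's built-in "idna" codec
    # follows IDNA 2003; ASCII labels pass through unchanged.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Percent-encode the UTF-8 bytes of any non-ASCII character.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return urlunsplit((parts.scheme, host, path, query, fragment))


unicode_iri = "http://伝言.example/?user=أكرم"
punycode_iri = iri_to_uri(unicode_iri)

# Character-by-character comparison: two distinct RDF identifiers...
print(unicode_iri == punycode_iri)  # False
# ...that nevertheless map to one and the same URI on the wire.
print(iri_to_uri(unicode_iri) == iri_to_uri(punycode_iri))  # True
```

The punycode form is what I meant by "encoded": it is a perfectly legal IRI,
but it is a different string from the Unicode form, so an RDF store comparing
identifiers character by character will never treat the two as equal.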


>
> > Ie: saying that we keep to the current version of RDF, ie, equality of
> > IRI-s is based on a character-by-character comparison (like now) but
> > giving advice to, if possible, use the IRI without the punycode seems
> > to be a reasonable way of handling this... What else would you propose
> > instead?
>
> I'm all for character-by-character comparison. I think the emphasis should
> be on keeping track of the type. Here's a draft of a minimal change to the
> Concepts document:
> [[
> 6.2 RDF Graph
> An RDF triple contains three components:
>
>    * the subject, which is an IRI or a blank node
>    * the predicate, which is an IRI
>    * the object, which is an IRI, a literal or a blank node
> …
> 6.4 IRI
>
> An IRI within an RDF graph (an RDF URI reference) is a Unicode string
> [UNICODE] that conforms to the definition of an IRI in RFC3987 [IRI].
> Implementations may issue warnings concerning the use of RDF terms
> designated to be IRIs but which are not conformant to the IRI
> definition.
>

I wonder if it's too confusing to mention IRI and RDF URI reference in the
same breath, in the very first sentence no less?  I'd prefer to keep URIs
out of the discussion as much as possible.


>
> Note: RFC3987 Section 3.1. "Mapping of IRIs to URIs" specifies the
> mapping to URIs, which must be done, for instance, when constructing
> an HTTP GET request. This specification does not define a relationship
> between an IRI and the URI to which it is mapped.
>
> Note: RFC3987 Section 5.3.1. "Simple String Comparison" specifies
> equivalence for IRIs used as identity tokens, as they are in RDF
> graphs.
>
> Note: IRIs are compatible with the anyURI datatype as defined by XML
> schema datatypes [XML-SCHEMA2], constrained to be an absolute rather
> than a relative URI reference.
>
> Note: IRIs are compatible with International Resource Identifiers as
> defined by [XML Namespaces 1.1].
>
> Note: The restriction to absolute IRIs is found in this abstract
> syntax. When there is a well-defined base, concrete syntaxes, such as
> RDF/XML, may permit relative IRIs as a shorthand for such absolute IRIs.
> ]]
>

I think this part could use some clarification.  An IRI is, by definition,
absolute per section 2.2 of RFC3987.  IRI references may be absolute or
relative, but resolve to an absolute IRI (as described in section 1.3).

To muddy the waters even further, the "absolute-IRI" grammar construct in
section 2.2 omits the fragment identifier, but I cannot find any references
to this either internal or external to the RFC.

So I think we should (a) specifically call out the definition in section
2.2; and (b) avoid any mention of the terms "IRI reference" or "absolute
IRI" except in an informative context.
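
For what it's worth, the distinction is easy to see with a generic resolver.
This is only a sketch: the base and the relative reference are made-up
examples, and Python's urljoin implements RFC 3986 style resolution without
validating IRI syntax, so it stands in for the behavior rather than defining
it:

```python
from urllib.parse import urldefrag, urljoin

base = "http://résumé.example.org/docs/"  # hypothetical base IRI
reference = "cv#skills"                    # a relative IRI reference

# Resolving the reference against the base yields a full IRI,
# which may carry a fragment identifier.
resolved = urljoin(base, reference)
print(resolved)

# Stripping the fragment leaves the part that the "absolute-IRI"
# grammar production would cover.
without_fragment, fragment = urldefrag(resolved)
print(without_fragment, fragment)
```

Only the resolved form would ever appear in the abstract syntax; the relative
form is purely a convenience of a concrete syntax with a well-defined base.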

-Alex



>
> Note, I changed "RDF URI reference" to "IRI" instead of "RDF IRI" as I'm
> not convinced that an IRI which appears in an RDF document is of a different
> type than an IRI which appears in an email or in the location bar of my
> browser.
>
> Here I proposed saying that IRIs and their URIs are simply different
> things, eliding the syntactic hint
> x [[
> x Note: Because of the risk of confusion between RDF URI references that
> x would be equivalent if dereferenced, the use of %-escaped characters in
> x RDF URI references is strongly discouraged. See also the URI
> x equivalence issue of the Technical Architecture Group [TAG].
> x ]]
>
> I agree with Alex that punycoded domain names and %-escaped characters
> should be mentioned in the same breath. From a human-engineering
> perspective, I think any text specifying syntactic hints to help observers
> visually discriminate them discourages programmers from being conscientious
> about the distinction. However, if we want to encourage the world to mint
> IRIs which we can procedurally calculate from URIs (motivated perhaps by
> associating HTTP traffic with assertions about resources), we could add some
> text encouraging an unambiguous transformation:
>
> [[
> Note: RFC3987's mapping of IRIs to URIs does not alter "%25" or
> punycoded domain names, which means that the IRIs
> <http://伝言.example/R&D> and <http://xn--9oqp94l.example/R%25D> will
> both be transformed to the URI <http://xn--9oqp94l.example/R%25D>.
> RFC3987 section 3.2. "Converting URIs to IRIs" defines a function
> which produces a single IRI for any URI. When minting IRIs for RDF,
> it is encouraged to mint forms which can round trip to a URI form
> and back.
> ]]
>
>
> > Cheers
> >
> > Ivan
> >
> >
> > >
> > >
> > >> -Alex
> > >
> > > --
> > > -ericP
> > >
> >
> >
> > ----
> > Ivan Herman, W3C Semantic Web Activity Lead
> > Home: http://www.w3.org/People/Ivan/
> > mobile: +31-641044153
> > PGP Key: http://www.ivan-herman.net/pgpkey.html
> > FOAF: http://www.ivan-herman.net/foaf.rdf
> >
> >
> >
> >
> >
>
> --
> -ericP
>
Received on Friday, 29 April 2011 13:43:02 UTC