language-tagged literal datatypes from Pat Hayes on 2011-08-18 (public-rdf-wg@w3.org from August 2011)

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 18 Aug 2011 18:11:54 -0500
To: "public-rdf-wg@w3.org Group WG" <public-rdf-wg@w3.org>
Message-Id: <9D97BFD3-E419-4646-87C8-A6EC2CF3DF82@ihmc.us>

As promised (http://www.w3.org/2011/rdf-wg/track/actions/76), here is a summary of the options for handling language-tagged literals. It builds on and uses the terminology of [1].

Option 1 (minimalist). Language-tagged literals have no datatype and hence are distinct from all other literals, which are typed. rdf:LangString is a class name but not a datatype. DATATYPE("foo"@en) returns an error message.

Option 1a. Just as option 1, except that DATATYPE("foo"@en) returns rdf:LangString, even though it is not called a datatype and does not have a defined L2V mapping.

Option 2. All literals have a type. rdf:LangString is a special datatype whose L2V mapping takes a pair of strings as input and returns a language-tagged pair as output. This mapping is the identity mapping on pairs <string, tag>, just as xsd:string is the identity mapping on single strings. DATATYPE("foo"@en) returns rdf:LangString, following the normal rules for datatyping.

Option 3. All literals have a type. Each language tag defines a datatype which is unique to that tag, and whose L2V mapping takes a string and produces a language-tagged string tagged with that particular tag. These datatypes are conventional, but we would need to invent some kind of naming convention for them, perhaps rdf:LangString/en, rdf:LangString/fr, etc. These would all be subclasses of rdf:LangString, which would not itself be a datatype. DATATYPE("foo"@en) returns rdf:LangString/en, following the normal rules for datatyping.
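To make the differences concrete, here is a rough Python sketch of what DATATYPE would return for a language-tagged literal under each option. The Literal class and the datatype() function are hypothetical names invented for this illustration, and the rdf:LangString/en naming is only the tentative convention suggested above, not anything agreed.

```python
from dataclasses import dataclass
from typing import Optional

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal: a lexical form plus a language tag or a datatype IRI."""
    lexical: str
    lang: Optional[str] = None   # language tag such as "en", if any
    dtype: Optional[str] = None  # datatype IRI for ordinary typed literals


def datatype(lit: Literal, option: str) -> Optional[str]:
    """Illustrative DATATYPE() behaviour under options "1", "1a", "2" and "3"."""
    if lit.lang is None:
        # Ordinary typed literal: every option reports its declared datatype.
        return lit.dtype
    if option == "1":
        # Option 1: language-tagged literals have no datatype, so this is an error.
        raise TypeError("language-tagged literals have no datatype")
    if option in ("1a", "2"):
        # Options 1a and 2 give the same answer; they differ only in whether
        # rdf:LangString counts as a real datatype with an L2V mapping.
        return RDF + "LangString"
    if option == "3":
        # Option 3: one datatype per tag (assumed naming convention).
        return RDF + "LangString/" + lit.lang
    raise ValueError("unknown option: " + option)


foo = Literal("foo", lang="en")
for opt in ("1a", "2", "3"):
    print(opt, datatype(foo, opt))
```

Note that the sketch cannot tell options 1a and 2 apart from the query result alone, which is the point: the difference between them lies in the model, not in what SPARQL returns.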

The pros and cons of these, as far as I can see them:

option 1:
  + minimal change
  - does not resolve the muddle
  - causes needless SPARQL errors

option 1a:
  + almost-minimal change
  + removes SPARQL errors
  - introduces a confusing exception with no rationale

option 2:
  + simplifies literal syntax
  + removes SPARQL errors
  + theoretically clean
  - requires change to the datatyping model

option 3:
  + simplifies literal syntax
  + removes SPARQL errors
  + gives access to tag information
  + theoretically clean
  - requires an 'open extendable' RDF vocabulary

----------

A few other thoughts which might be worth taking into consideration.

= The change needed for option 2 really is semantically trivial, and it might have other uses. If we say that the L2V mapping takes as input all the syntactic 'components' of a literal, rather than forcing these to be all inside one string, then we allow such things as literals with latitude and longitude denoting positions, complex numbers with real and imaginary parts, etc., without forcing people to invent coding tricks (like the trailing '@' in rdf:PlainLiteral) to artificially map these into a single string. This might be a genuinely useful extension, in other words. We can also quietly deprecate rdf:PlainLiteral along with 8-track tape players.
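A minimal sketch of what such a generalized L2V mapping might look like, assuming each datatype's mapping receives a tuple of lexical components rather than a single string. The ex:complex datatype, the registry L2V and the value() helper are all made up for this illustration.

```python
from typing import Callable, Dict, Tuple

Components = Tuple[str, ...]


def l2v_string(parts: Components) -> str:
    # xsd:string: identity mapping on a single string.
    (s,) = parts
    return s


def l2v_lang_string(parts: Components) -> Tuple[str, str]:
    # rdf:LangString under option 2: identity mapping on <string, tag> pairs.
    s, tag = parts
    return (s, tag)


def l2v_complex(parts: Components) -> complex:
    # Hypothetical two-component datatype: real and imaginary parts.
    re, im = parts
    return complex(float(re), float(im))


# Hypothetical registry from datatype name to its L2V mapping.
L2V: Dict[str, Callable[[Components], object]] = {
    "xsd:string": l2v_string,
    "rdf:LangString": l2v_lang_string,
    "ex:complex": l2v_complex,
}


def value(datatype: str, *parts: str) -> object:
    """Apply the datatype's L2V mapping to all of the literal's components."""
    return L2V[datatype](parts)


print(value("rdf:LangString", "foo", "en"))
print(value("ex:complex", "1.5", "-2"))
```

With this shape, no component has to be smuggled into the lexical string and parsed back out again.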

= If a SPARQL querier wants to determine the actual language tag in use, option 2 requires them to look inside the returned value, while option 3 requires them to look inside the datatype URI, which can be obtained from a DATATYPE query. I have no idea which of these is harder to handle, but the difference might be worth thinking about if it matters to anyone.
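For comparison, the two ways of recovering the tag could be sketched like this. Both function names are hypothetical, the value under option 2 is assumed to be a (string, tag) pair, and the datatype name under option 3 is assumed to follow the tentative rdf:LangString/en convention.

```python
from typing import Optional, Tuple

LANGSTRING_PREFIX = "rdf:LangString/"


def tag_from_value(value: Tuple[str, str]) -> str:
    """Option 2: the tag is the second element of the returned <string, tag> value."""
    _string, tag = value
    return tag


def tag_from_datatype(datatype_name: str) -> Optional[str]:
    """Option 3: the tag is the suffix of the datatype name, if it has the assumed prefix."""
    if not datatype_name.startswith(LANGSTRING_PREFIX):
        return None  # not one of the per-tag language datatypes
    return datatype_name[len(LANGSTRING_PREFIX):]


print(tag_from_value(("foo", "en")))
print(tag_from_datatype("rdf:LangString/en"))
```

Under option 3 the querier ends up doing string surgery on a name, whereas under option 2 the tag is simply part of the value.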

Pat

PS. FWIW, I vote for either 2 or 3, and against 1 or 1a. I prefer 2, for the reason mentioned above, and because it seems to me to be the most elegant solution.

[1] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Jul/0048.html

------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 mobile
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes

Received on Thursday, 18 August 2011 23:12:24 UTC