Re: Proposal for ISSUE-12, string literals from Pat Hayes on 2011-05-12 (public-rdf-wg@w3.org from May 2011)

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 12 May 2011 18:17:41 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: RDF Working Group WG <public-rdf-wg@w3.org>
Message-Id: <C3ED5BB2-63AC-4FAE-9EBD-5A08909349BF@ihmc.us>

On May 12, 2011, at 11:49 AM, Richard Cyganiak wrote:

> On 12 May 2011, at 17:05, Pat Hayes wrote:
>> Hmm, on second thoughts and a more careful reading, I am no longer sure I like the "MAY replace any literal with a canonical form" idea.  
> 
> Note that the term “canonical forms” is defined *in the proposal*, and covers *only* plain string literals. So this does *only* license the replacement of funky string datatypes with plain literals. It does *not* license any other replacements, like those you mention below.
> 
> I mentioned the idea of *extending* this to also cover other XSD literals, but that clearly goes beyond ISSUE-12, and is not necessary to address ISSUE-12.
> 
> I offer a scaled-down rewording. So instead of this:
> 
>>> §8 “Some literals are canonical forms. Implementations MAY replace any literal with a canonical form if both are syntactically different, but have the same value. All plain literals, with or without language tag, are canonical forms.”
> 
> How about this:
> 
>>> §8 “Implementations MAY replace xsd:string typed literals and rdf:PlainLiteral typed literals with a plain literal that has the same value.”
> 
> That would be sufficient if we don't want to do anything about other XSD literals.

That reads better for me, if we want to do this at all. But I still don't really see the point. If we want to deprecate xsd:string, then let's just do that. There will be a loud squealing noise but it will go away fairly quickly, and then all these problems will be over. Or, if we want to keep the xsd:string option open, then let us say that clearly.
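
As a concrete reading of the scaled-down §8 wording, the replacement could be sketched as follows. This is only an illustration in Python, not text from the proposal: the Literal class and the to_plain function are hypothetical names, and the sketch assumes that an rdf:PlainLiteral lexical form is split at its last "@" (an empty tag meaning "no language tag").

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_PLAIN_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral"


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal: plain literals have datatype None."""
    lexical: str
    datatype: Optional[str] = None
    language: Optional[str] = None


def to_plain(lit: Literal) -> Literal:
    """Replace xsd:string and rdf:PlainLiteral typed literals with a
    plain literal of the same value; leave every other literal alone."""
    if lit.datatype == XSD_STRING:
        return Literal(lit.lexical)
    if lit.datatype == RDF_PLAIN_LITERAL:
        # Assumption: the text and the language tag are separated by the last "@".
        text, sep, tag = lit.lexical.rpartition("@")
        if not sep:
            return lit  # no "@" at all: ill-formed, so do not touch it
        return Literal(text, language=tag or None)
    return lit  # numeric and other typed literals are out of scope
```

Note that the function deliberately does nothing for other datatypes, which is the difference between the scaled-down wording and the original "any literal" wording.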

> That said, I find your argument against literal canonicalization not compelling.
> 
>> If this is a licence for some other engine to tidy up the literals in my RDF, then I vote against this idea. Who knows why I might have chosen to use a non-canonical form? Some people might use the number of leading zeros to encode precision information, for example.
> 
> We don't have to cater for inappropriate use of the technology. You could use the same kind of reasoning to argue that "foo" and "foo"^^xsd:string must be kept distinct because someone might use the difference to encode access control information. That's absurd. Show me someone who does it.
> 
> (You don't *actually* encode precision information this way in your own data, do you???)

No, and I confess that leading zeros are kind of a silly example. But for trailing zeros, yes. I have seen people (engineers) distinguish between 2.0 and 2.000 to indicate exactly the implied degree of precision. In fact I believe this convention is quite common, though obviously one would use several numbers to do the thing properly.
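
The trailing-zeros point can be illustrated with a small Python sketch. The tidy function below is a hypothetical stand-in for whatever an engine might do when it normalizes a decimal lexical form; it is not taken from any specification or from the proposal. The two forms have the same value, but only the original lexical forms carry the number of decimal places.

```python
from decimal import Decimal


def decimal_places(lexical: str) -> int:
    """Number of digits written after the decimal point."""
    return -Decimal(lexical).as_tuple().exponent


def tidy(lexical: str) -> str:
    """Hypothetical normalizer: drop trailing zeros from a decimal form."""
    return format(Decimal(lexical).normalize(), "f")


coarse, fine = "2.0", "2.000"

# Same value, so a value-based replacement treats them as interchangeable.
same_value = Decimal(coarse) == Decimal(fine)

# The written forms differ in how many decimal places they show.
places_before = (decimal_places(coarse), decimal_places(fine))

# After tidying, both collapse to one form and the difference is gone.
places_after = (decimal_places(tidy(coarse)), decimal_places(tidy(fine)))
```

Whether that loss matters depends on whether anyone relies on the convention, which is exactly the point under dispute.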

> 
>> It just seems inappropriate to give a global licence to 'tidy up' other people's data.
> 
> Why?

Basically because it presumes that you know everything about that data and how it is being used, which is rarely the case and not a good basis for interoperation. 

> 
>> And why do we need this? The datatype definitions already provide for the relevant equalities, if someone wants to keep their data semantically tidy.
> 
> It's about keeping it *syntactically* tidy -- removing unnecessary syntactic variation, so that the syntax matches the semantics. It gives implementers license to simplify implementations.

But this is a losing bet, always. You can never assume that the syntax will match the semantics this well. As soon as data has owl:sameAs in it, for example, there is no way to stop things having multiple names.

> And as you say, it's replacing one form with another form that “means” the same thing (under D-Entailment), so I really don't see the problem.

I guess I have a background assumption (which I thought might appeal to you :-) along the lines that while the RDF semantics is a really good idea, ultimately RDF meaning is determined by actual use. And given that, we ought to allow it to be used in ways that we might not have thought of yet. 

> Finally, it's a MAY. Implementers who think it's inappropriate don't have to do it. Users who think it's inappropriate can vote with their legs / with apt-get.

Well, true. I just think it will cause trouble down the road, with people protesting that their data is being modified without their permission/licence. One man's normalization is another's disaster. My own mathematical instincts, for example, would lead me to normalize 2.0 to 2, and I used to argue for this until a squad of programmers took me aside and threatened me with bignums. 

But maybe I am being oversensitive about this; and I can see how it makes things easier for developers to have normal forms wherever possible. 

Pat

> 
> Best,
> Richard
> 
> 
> 
>> 
>> Pat
>> 
>> =======
>> 
>> On May 11, 2011, at 4:23 PM, Richard Cyganiak wrote:
>> 
>>> I took an action today to draft text for RDF Concepts that resolves ISSUE-12. I put it on the wiki here:
>>> http://www.w3.org/2011/rdf-wg/wiki/StringLiterals/EntailmentProposal
>>> A plain text copy is attached below.
>>> 
>>> Best,
>>> Richard
>>> 
>>> 
>>> 
>>> SHORT SUMMARY
>>> 
>>> 1. RDF Concepts puts more emphasis on the distinction between (syntactic) “literal equality” and (semantic, important for applications) “value equality”
>>> 2. RDF Concepts explicitly points out the specific string value equalities that already arise from RDF Semantics
>>> 3. RDF Concepts declares one of the string literal forms as canonical
>>> 4. Implementations MAY canonicalize, but don't have to
>>> 5. The canonical form is plain literals.
>>> 
>>> 
>>> WHY?
>>> 
>>> 1. No changes to the abstract syntax required
>>> 2. No changes to any concrete syntax or parser required
>>> 3. No changes to any implementations of any of the existing entailment regimes required
>>> 4. Those who are ok with canonicalization can do that, and don't need to deal with entailment
>>> 5. Those who don't want to canonicalize, have the option of supporting only string value equality at query time, without RDFS- and D-Entailment
>>> 6. “MAY canonicalize” softly discourages the use of xsd:string typed literals, without abolishing them outright or declaring them archaic
>>> 7. Standardizing on xsd:string was never an option because of language tags
>>> 8. Standardizing on rdf:PlainLiteral was never an option because it MUST NOT be used in serializations that support plain literals
>>> 
>>> 
>>> CHANGES TO 6.5.2 The Value Corresponding to a Typed Literal
>>> http://www.w3.org/TR/rdf-concepts/#section-Literal-Value
>>> 
>>> 
>>> §1 Rename it to “6.5.1 The Value Corresponding to a Literal” and move it ahead of 6.5.1
>>> 
>>> §2 Add to the beginning:
>>> “The value of a plain literal without language tag is the same Unicode string as its lexical form.
>>> 
>>> The value of a plain literal with language tag is a pair consisting of 1. the same Unicode string as its lexical form, and 2. its language tag.
>>> 
>>> For typed literals, …” (continue with rest of section as is)
>>> 
>>> §3 Remove the Note at the end of the section
>>> 
>>> 
>>> CHANGES TO 6.5.1 Literal Equality
>>> http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality
>>> 
>>> 
>>> §4 Rename section to “6.5.2 Literal Equality and Canonical Forms”
>>> 
>>> §5 Add to the beginning:
>>> “Equality of literals can be evaluated based on their syntax, or based on their value.”
>>> 
>>> §6 Change “Two literals are equal …” to: “Two literals are syntactically equal …” in the current first paragraph.
>>> 
>>> §7 Add to the end:
>>> “In application contexts, comparing the values of literals (see section 6.5.1) is usually more helpful than comparing their syntactic forms. Literals with different lexical forms and with different datatypes can have the same value. In particular:
>>> 
>>> - A plain literal with lexical form aaa and no language tag has the same value as a typed literal with lexical form aaa and datatype IRI xsd:string
>>> - A plain literal with lexical form aaa and no language tag has the same value as a typed literal with lexical form aaa@ and datatype IRI rdf:PlainLiteral
>>> - A plain literal with lexical form aaa and language tag xx has the same value as a typed literal with lexical form aaa@xx and datatype IRI rdf:PlainLiteral”
>>> 
>>> §8 “Some literals are canonical forms. Implementations MAY replace any literal with a canonical form if both are syntactically different, but have the same value. All plain literals, with or without language tag, are canonical forms.”
>>> 
>>> 
>>> CHANGES TO 6.3 Graph Equivalence
>>> http://www.w3.org/TR/rdf-concepts/#section-graph-equality
>>> 
>>> 
>>> §9 Append this leftover sentence, which was removed from 6.5.1:
>>> “Note: For comparing RDF Graphs, semantic notions of entailment (see [RDF-SEMANTICS]) are usually more helpful than the syntactic equivalence defined here.”
>>> 
>>> 
>>> EXTENDING THIS TO NUMERIC LITERALS???
>>> 
>>> (While we're at it, we might also cover equalities between the built-in numeric XSD types, and between different lexical forms of the same built-in XSD datatype.)
>>> 
>> 
>> ------------------------------------------------------------
>> IHMC                                     (850)434 8903 or (650)494 3973   
>> 40 South Alcaniz St.           (850)202 4416   office
>> Pensacola                            (850)202 4440   fax
>> FL 32502                              (850)291 0667   mobile
>> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
> 
> 

Received on Thursday, 12 May 2011 23:18:14 UTC