Re: RDF dataset semantics again from Sandro Hawke on 2012-08-20 (public-rdf-wg@w3.org from August 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Mon, 20 Aug 2012 13:11:22 -0400
To: Antoine Zimmermann <antoine.zimmermann@emse.fr>
CC: RDF WG <public-rdf-wg@w3.org>
Message-ID: <50326FBA.2030304@w3.org>

On 08/20/2012 10:02 AM, Antoine Zimmermann wrote:
> Dear all,
>
>
> ==Post scriptum:==
> Sorry for the long email.
> *In summary:*  I describe 3 different families of dataset semantics 
> and argue that there are important use cases for each of them. I'd 
> like all of these semantics to be standardised, with a mechanism to 
> declare which semantics is assumed when exchanging datasets. There 
> are more arguments on this at the end if you want to skip the 
> discussion of the semantics.
> ====End of PS=====
>
>
> I come back to the topic of formal semantics for RDF datasets. I can 
> see that there are two issues that are almost orthogonal:
>
>  1. how the semantics of the triples inside the named graphs work.
>  2. how the graph "names" relate to the graph inside the (name,graph) 
> pairs.
>
>
> To discuss this, I'll use the following example (do not worry about 
> the meaning of the classes and properties; I am just trying to make 
> an example that looks a little realistic):
>
>
> # == EXAMPLE STARTS HERE ==
> :year1960  dc:date  "1960"^^xsd:gYear;  :endorsed  true .
> :year2000  dc:date  "2000"^^xsd:gYear;  :endorsed  true .
> :year2012  dc:date  "2012"^^xsd:gYear;  :endorsed  true .
> :myth  :endorsed  false .
>
> :year1960 {
>   ex:MarilynMonroe  a  ex:LivingPerson .
>   ex:LivingPerson  owl:disjointWith  ex:DeadPerson .
> }
> :year2000 {
>   ex:MarilynMonroe  a  ex:DeadPerson .
>   ex:DeadPerson  owl:disjointWith  ex:LivingPerson .
> }
> :year2012 {
>   ex:MarilynMonroe  a  ex:DeceasedPerson .
>   ex:DeceasedPerson  owl:equivalentClass  ex:DeadPerson .
> }
> :myth {
>   ex:MarilynMonroe  ex:livesIn  ex:desertIsland .
>   ex:livesIn  rdfs:domain  ex:LivingPerson .
> }
> # == EXAMPLE ENDS HERE ==
>
>
> Wrt item 1 above, there are essentially 3 cases:
>
>  a) The dataset is simply an RDF graph whose triples have been 
> partitioned. An interpretation of that dataset is an interpretation 
> of the graph made of all the triples found in all the named graphs 
> and the default graph. Depending on what is decided about item 2 
> above, there can be additional semantic constraints on what the 
> graph IRIs denote, but there could equally be none, so items 1 and 2 
> are essentially orthogonal issues in this case.
> Applications use the partitioning mechanism as they wish, e.g., for 
> optimisation, for documentation...
> If such is the semantics of datasets, then the example is 
> inconsistent, so it entails all possible datasets.
>
>
>  b) The dataset is interpreted in the same way as an RDF graph, where 
> the default graph must be true and the <name,graph> pairs are 
> interpreted as assertions that relate the name to the graph itself. 
> The actual relationship is to be determined, but what matters here is 
> the syntax of the graph. It matters that the term ex:DeceasedPerson is 
> used, not that the person denoted by ex:MarilynMonroe is dead.
> This is essentially the "quoting" semantics. The entailments depend 
> on what the relationship between the graph IRI and the graph is, but 
> a typical case is when the graph IRI denotes the graph, in which 
> case the example does not entail:
>
> :year2012 {
>   ex:MarilynMonroe  a  ex:DeadPerson .
> }
>
> nor does it entail:
>
> :myth {
>   ex:MarilynMonroe  a  ex:LivingPerson .
> }
>
> In this case, no conclusions are ever drawn from any assertion put 
> inside a named graph.
>
>
>  c) Each named graph describes a world according to the graph IRI. In 
> the example, the world according to :myth is one where 
> ex:MarilynMonroe is living somewhere. What matters is the truth of 
> the assertions rather than the fact that the term "deceased" or 
> "dead" was used.
> So one can draw the conclusions that:
>  - *in :year1960*, ex:MarilynMonroe is not an ex:DeadPerson;
>  - *in :year2012*, ex:MarilynMonroe is an ex:DeadPerson
> etc.
> In this case, the possibilities for the relationship between the 
> graph IRI and the graph are more limited than in the other cases. 
> For instance, if the IRI must be interpreted as the graph itself, 
> then a lot of inferences are prevented.
>
>
>
> I can see use cases for each of these semantics.
>  a- If one is managing data that are verified facts, then one would 
> like all of the triples to be true. Yet there are still reasons to 
> split the data into different parts, allowing users to query them 
> separately with the SPARQL GRAPH keyword.
>  b- for a Semweb search engine exchanging the dump of its crawl, it 
> makes sense to have an accurate "quote" of what has been crawled.
>  c- for situations involving the temporal evolution of facts, 
> integration of variously trusted sources, tracking provenance of 
> inferred knowledge, etc...
>
>
> I find it odd that semantics b is retained as the only valid one in 
> the "RDF graph identification" proposal. It sweeps away several 
> Priority A use cases, along with some of the Priority B ones.
>

I believe it's possible to handle the use cases that want (a) and (c) by 
standardizing on (b) and then defining additional RDF vocabulary terms 
(either now or later).
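
A minimal sketch of how that layering might look, written in Python 
rather than in any proposed syntax. The term `ex:assertsContents` is 
hypothetical, invented here purely for illustration; neither it nor the 
helper name comes from any WG proposal. Under (b) alone the named graphs 
are only quoted; a default-graph triple using the marker term opts a 
graph's triples into being asserted, which recovers (a) for the graphs 
that want it.

```python
# Hypothetical marker term, invented for this sketch only.
ASSERTS_CONTENTS = "ex:assertsContents"

def asserted_triples(default_graph, named_graphs):
    """Return the triples a consumer would treat as asserted.

    default_graph: set of (s, p, o) tuples, always asserted.
    named_graphs:  dict mapping a graph name to a set of (s, p, o)
                   tuples, quoted (not asserted) unless opted in.
    """
    asserted = set(default_graph)
    for s, p, o in default_graph:
        # Only graphs explicitly flagged in the default graph are merged.
        if p == ASSERTS_CONTENTS and o == "true" and s in named_graphs:
            asserted |= named_graphs[s]
    return asserted

default_graph = {
    (":year2012", ASSERTS_CONTENTS, "true"),
    (":myth", ":endorsed", "false"),
}
named_graphs = {
    ":year2012": {("ex:MarilynMonroe", "rdf:type", "ex:DeceasedPerson")},
    ":myth": {("ex:MarilynMonroe", "ex:livesIn", "ex:desertIsland")},
}

result = asserted_triples(default_graph, named_graphs)
```

Here the :year2012 triple ends up asserted while the :myth triple stays 
quoted. A term for (c) would instead have to say that the contents hold 
in the context the name denotes, which this sketch does not attempt.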

(As an aside: I don't think the priorities have any formal weight. The 
WG has never resolved to accept, reject, or prioritize any use cases as 
more important than any others.)

> Also, the condition ∀i: I(ui) = Gi is problematic. At first, it seems 
> to be natural to say that the graph IRI RDF-denotes the graph. But:
>
> http://www.w3.org/2011/rdf-wg/meeting/2011-04-14#resolution_1
>
> "RESOLVED: Named Graphs in SPARQL associate IRIs and graphs *but* they 
> do not necessarily "name" graphs in the strict model-theoretic sense. 
> A SPARQL Dataset does not establish graphs as referents of IRIs 
> (relevant to ISSUE-30)".
>
> I know this resolution is about SPARQL datasets, and it does not 
> necessarily apply to whatever structure we come up with in RDF, but 
> one of the Priority A use cases is to be able to dump a SPARQL store. 
> With this resolution, there is apparently a clash between the use case 
> requirement and the semantic condition.
>

I agree.  I'm pretty sure ∀i: I(ui) = Gi is wrong.  Most of the time, 
in practice, ui denotes a g-box, not a g-snap.  (Or, sometimes, it's 
something else associated with a g-box, like the primary subject.)  I 
don't see how SPARQL 1.1 UPDATE with the GRAPH keyword makes any sense 
if ui denotes Gi.
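
To illustrate the distinction with a rough Python sketch (the class and 
variable names are made up for this example): a g-snap is an immutable 
set of triples, while a g-box is a container whose contents can change. 
An update in the style of INSERT DATA on a named graph changes what the 
box holds, yet the name stays attached to the same box.

```python
class GBox:
    """A mutable container; its current contents are a g-snap."""

    def __init__(self, triples=()):
        self.snap = frozenset(triples)  # g-snap: immutable set of triples

    def insert(self, triple):
        # An update replaces the contents with a new g-snap.
        self.snap = self.snap | {triple}

# The store maps each graph name to a g-box.
store = {":g1": GBox([("ex:a", "ex:p", "ex:b")])}

box_before = store[":g1"]
snap_before = box_before.snap

store[":g1"].insert(("ex:a", "ex:p", "ex:c"))

same_box = store[":g1"] is box_before        # the name still picks out this box
same_snap = store[":g1"].snap == snap_before  # but the contents have changed
```

If the name denoted the g-snap, its denotation would have to change on 
every update; having it denote the g-box avoids that.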

>
> My proposal is to define several recommended semantics and allow the 
> concrete syntax to declare in a document what semantics is assumed 
> when exchanging a dataset.
>
> I find this idea appealing because it is in line with the fact that 
> information carried by HTTP is accompanied by a self description of 
> how it should be understood. For instance, we have MIME types, we have 
> <!DOCTYPE> declarations, etc. Since RDF is not a purely syntactic 
> data structure, it makes sense that it carries with it a reference to 
> the semantics it uses.
> Such practices of referencing the MIME type, charset, doctype, schema, 
> etc. have been a key enabler of interoperability on the Web. Why not 
> extend the pattern to the formal semantics?
> BTW, SPARQL services have a way to tell what inference regime they 
> support, and SPARQL queries have a way to ask for a particular regime. 
> I claim that my proposal is simply in agreement with notions already 
> accepted in the SPARQL world.
>

I see the appeal -- solving each kind of problem with an approach 
crafted directly for it -- but my sense is this would cause too much 
confusion in the market and result in a lack of interoperability.  I think 
we're better off standardizing (b) now, as long as I'm right that we can 
address the (a) and (c) use cases using just additional vocabulary.

       -- Sandro

>
> Best,
Received on Monday, 20 August 2012 17:11:36 UTC