TF-Graphs/Minimal-dataset-semantics

From RDF Working Group Wiki
Revision as of 15:35, 26 September 2012 by Tthibodeau (Talk | contribs)


This is a new attempt to provide a semantics of datasets. It is designed to cover a reasonable number of use cases, and further use cases can be covered by proper "semantic extensions". The formal semantics only describes what can be deduced from, or assumed to be true of, a dataset. It does not describe any mechanism by which an RDF dataset is changed. Implementations are free to ignore the semantics and manipulate the syntactic structure only (e.g., parsers, editors).

The first part of this page reflects a semantics that some of us (AZ, RC, IH) have agreed on. The second part contains test cases and entailment examples that illustrate how this formal semantics works. The third part contains some explicit design issues that the group may want to consider; the semantics put forward reflects choices on those issues, but the group may want to decide otherwise. The fourth part outlines some possible semantic extensions that could be defined on top of the presented semantics to realize various different or more powerful semantic behaviours.

1 Semantics

In order to decide how to interpret a dataset, it must be determined how plain RDF graphs are interpreted. According to the existing standards, there are several ways of interpreting a set of triples, each being tied to what SPARQL calls an entailment regime. As of today, the standard entailment regimes are Simple Entailment, RDF Entailment, RDFS Entailment, D Entailment, OWL Entailment with Direct Semantics, OWL Entailment with RDF Based Semantics and RIF Entailment.

The entailment regime is determined by the application and should be either fixed by it and described in the documentation, or changeable through the application setup.

1.1 Informal description of the semantics

Given an entailment regime E, the dataset is interpreted in the following way:

  • The default graph has the same meaning as an isolated RDF graph according to the regime E. So, a dataset with no <name,graph> pair can be identified with a plain RDF graph. This allows us to treat RDF graphs as if they were datasets, with a minimal abuse of notation.
  • Each <n,g> pair is interpreted as a relationship between the resource that the "name" n denotes and a certain graph, not necessarily g. This relationship is considered to be true for a dataset if the graph related to n E-entails g.

This means that the resource denoted by the name n is associated with a graph that says at least as much as g: whatever g asserts also follows from the associated graph under the chosen entailment regime.

1.2 Model-theoretic semantics

Let E be an entailment regime and V a vocabulary of IRIs and literals. An E-dataset-interpretation over vocabulary V is a pair I = <Id,IGEXT> such that:

  • Id is an E-interpretation over vocabulary V;
  • IGEXT is a function from the set of resources defined by Id to the set of RDF graphs.

Further, I is extended into a function assigning truth values to graphs, <name,graph> pairs and datasets as follows:

  • for a graph G, I(G) is true iff Id(G) is true;
  • for an IRI n and RDF graph g, I(<n,g>) is true iff IGEXT(Id(n)) is defined and E-entails g;
  • for a dataset D=(DG,<n1,G1>,…,<nk,Gk>), I(D) is true iff I(DG) is true and for all i in 1…k, I(<ni,Gi>) is true.
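The truth conditions above can be made concrete with a small sketch. It is an illustration only and not part of the proposal. It assumes ground graphs (no blank nodes), so that simple entailment reduces to subset inclusion; it represents Id merely as a mapping from IRIs to resources, and it takes the truth of the default graph as a given Boolean instead of computing it. The names `DatasetInterpretation`, `denotes`, `igext`, `pair_is_true` and `dataset_is_true` are hypothetical.

```python
from dataclasses import dataclass, field

# A triple is a (subject, predicate, object) tuple of strings;
# a graph is a frozenset of triples. Only ground graphs are modelled.

def simple_entails_ground(g1, g2):
    """Between ground graphs, g1 simple-entails g2 iff g2 is a subset of g1."""
    return g2 <= g1

@dataclass
class DatasetInterpretation:
    denotes: dict                               # stands in for Id on IRIs: IRI -> resource
    igext: dict = field(default_factory=dict)   # IGEXT: resource -> graph (partial)

    def pair_is_true(self, name, graph):
        """I(<n,g>) is true iff IGEXT(Id(n)) is defined and entails g."""
        resource = self.denotes.get(name)
        if resource not in self.igext:
            return False
        return simple_entails_ground(self.igext[resource], graph)

    def dataset_is_true(self, default_graph_is_true, named_graphs):
        """I(D) is true iff the default graph is true and every <n,g> pair is true."""
        return default_graph_is_true and all(
            self.pair_is_true(n, g) for n, g in named_graphs.items()
        )

T1 = (":s", ":p", ":o1")
T2 = (":s", ":p", ":o2")
interp = DatasetInterpretation(
    denotes={":g1": "r1", ":g2": "r2"},
    igext={"r1": frozenset({T1, T2})},   # r2 has no graph extension
)
pair_g1 = interp.pair_is_true(":g1", frozenset({T1}))
pair_g2 = interp.pair_is_true(":g2", frozenset())
```

Here `pair_g1` holds because the graph associated with the resource says more than the asserted graph, while `pair_g2` fails even for the empty graph because IGEXT is undefined for the resource that :g2 denotes.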

2 Test cases

The following test cases are examples for entailments, non-entailments, equivalences and contradictions among RDF datasets.

2.1 Basics

Just like RDF graphs, RDF datasets are assumed to be expressions that have truth, that is, can be true or false.

Since RDF datasets are logical expressions, we can speak of the same logical relationships that hold between RDF graphs:

  • Entailment: A entails B if, whenever A is true, B is true as well.
  • Equivalence: Two RDF datasets A and B are equivalent if they always have the same truth value, that is, A entails B and B entails A.
  • Contradiction: A single dataset is a contradiction if it cannot be true under any circumstances. Two RDF datasets A and B contradict each other if they cannot both be true at the same time.
  • Consistency: A single dataset is consistent if it is not a contradiction. Dataset A is consistent with dataset B if they can both be true at the same time.

Or more precisely:

  • A E-entails B when every E-dataset-interpretation that makes A true also makes B true.
  • A is E-equivalent to B when A E-entails B and B E-entails A.
  • A is an E-contradiction when A is false in every E-dataset-interpretation.
  • A is E-consistent if it is not an E-contradiction.

The truth of an RDF dataset is defined with respect to a graph extension, a relationship that associates RDF graphs with resources. Think of it as capturing a snapshot of the contents of all g-boxes in the universe of discourse.

An RDF dataset is true if its default graph is true and if all the named graphs are true. A named graph <n,G> is true if the resource denoted by n has an RDF graph that entails G as its graph extension. Note that it is not required that G be true.

2.2 Notation

These test cases assume the following notation.

This is an RDF dataset with one named graph and an empty default graph:

:g1 { :s :p :o }

This is an RDF dataset with one triple in the default graph and no named graphs:

{ :s :p :o }

This is an RDF dataset with one triple in the default graph, one triple in a named graph, and a second empty named graph:

{ :s :p :o }
:g1 { :s :p :o }
:g2 {}

This is an RDF graph (*not* an RDF dataset):

:s :p :o

2.3 The default graph is asserted

T1.1 Under simple dataset entailment:

{ :s :p :o }

is equivalent to

# This is an RDF graph, not an RDF dataset
:s :p :o

T1.2 Under OWL dataset entailment:

{ :o1 owl:differentFrom :o1 }

is a contradiction.

2.4 Entailment works within the default graph

T2.1 Under simple dataset entailment:

{ :s :p :o1, :o2 }

entails

{ :s :p :o1 }

T2.2 Under simple dataset entailment:

{ :s :p :o1 }

entails

{ :s :p [] }

T2.3 Under simple dataset entailment:

{ :s :p _:blank1 }

is equivalent to (and hence entails)

{ :s :p _:blank2 }

T2.4 Under simple dataset entailment:

{ :s :p _:blank1 }

is equivalent to (and hence entails)

{ :s :p _:blank1, _:blank2 }
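The blank-node cases T2.2 to T2.4 can be checked mechanically. The following sketch relies on the interpolation lemma of RDF Semantics: G1 simple-entails G2 exactly when some instance of G2 is a subgraph of G1. It is a brute-force illustration, exponential in the number of blank nodes, and the helper names `is_blank` and `simple_entails` are hypothetical. Terms are plain strings, and a leading `_:` marks a blank node.

```python
from itertools import product

def is_blank(term):
    """Blank nodes are written with a leading "_:" in this sketch."""
    return term.startswith("_:")

def simple_entails(g1, g2):
    """Return True if some instance of g2 is a subgraph of g1."""
    blanks = sorted({t for triple in g2 for t in triple if is_blank(t)})
    terms = sorted({t for triple in g1 for t in triple})
    # Try every assignment of g2's blank nodes to terms occurring in g1.
    for choice in product(terms, repeat=len(blanks)):
        mapping = dict(zip(blanks, choice))
        instance = {tuple(mapping.get(t, t) for t in triple) for triple in g2}
        if instance <= g1:
            return True
    return False

# T2.2: a named object entails an anonymous one.
t22 = simple_entails({(":s", ":p", ":o1")}, {(":s", ":p", "_:b")})

# T2.4, both directions: the extra blank-node triple adds no information.
lean = {(":s", ":p", "_:blank1")}
redundant = {(":s", ":p", "_:blank1"), (":s", ":p", "_:blank2")}
t24 = simple_entails(lean, redundant) and simple_entails(redundant, lean)
```

Both `t22` and `t24` evaluate to `True`. The converse of T2.2 fails, since a graph that only mentions a blank node cannot entail a triple naming :o1.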

2.5 Named graphs are not asserted

T3.1 Under simple dataset entailment:

:g1 { :s :p :o }

does not entail

{ :s :p :o }

T3.2 Under simple dataset entailment:

:g1 { :s :p :o }

does not entail

# This is an RDF graph, not an RDF dataset
:s :p :o

T3.3 Under OWL dataset entailment:

:g1 { :o1 owl:differentFrom :o1 }

is consistent.

Issue: Is this actually consistent according to the formalized semantics? I believe it is consistent because an interpretation with IGEXT={<:g1, { :o1 owl:differentFrom :o1 }>} satisfies this dataset. This “abuses” the fact that a contradiction entails every graph. —RC

2.6 Entailment works within named graphs

T4.1 Under simple dataset entailment:

:g1 { :s :p :o1, :o2 }

entails

:g1 { :s :p :o1 }

T4.2 Under simple dataset entailment:

:g1 { :s :p :o1 }

entails

:g1 { :s :p [] }

T4.3 Under simple dataset entailment:

:g1 { :s :p _:blank1 }

is equivalent to (and hence entails)

:g1 { :s :p _:blank2 }

T4.4 Under simple dataset entailment:

:g1 { :s :p _:blank1 }

is equivalent to (and hence entails)

:g1 { :s :p _:blank1, _:blank2 }

2.7 Empty named graphs are trivially true

T5.1 Under simple dataset entailment:

:g1 { :s :p :o }

entails

:g1 {}

T5.2 Under simple dataset entailment:

:g1 {}

entails

# empty default graph, no named graphs

2.8 An RDF dataset is the conjunction of the default + named graphs

T6.1 Under simple dataset entailment:

:g1 { :s :p :o }

entails

# empty default graph, no named graphs

T6.2 Under simple dataset entailment:

:g1 { :s :p :o }
:g2 { :s :p :o }

entails

:g1 { :s :p :o }

T6.3 Under simple dataset entailment:

{ :s :p :o }
:g1 { :s :p :o }

entails

:g1 { :s :p :o }

T6.4 Under simple dataset entailment:

{ :s :p :o }
:g1 { :s :p :o }

entails

{ :s :p :o }
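For ground graphs under simple entailment, the dataset entailments in T3.1, T5.1 and T6.1 to T6.4 can be reproduced with a purely syntactic check. This is a sketch under two stated assumptions: no blank nodes occur, and the regime is simple entailment, under which nothing can force two graph names to denote the same resource. Under those assumptions, A entails B when B's default graph is a subset of A's and every named graph of B is contained in the graph of the same name in A. The helpers `make_dataset` and `ground_simple_dataset_entails` are hypothetical.

```python
SPO = (":s", ":p", ":o")

def make_dataset(default=(), named=None):
    """Build a dataset as (default graph, {name: graph}), with graphs as frozensets."""
    named = named or {}
    return frozenset(default), {n: frozenset(g) for n, g in named.items()}

def ground_simple_dataset_entails(a, b):
    """Check whether ground dataset a simple-dataset-entails ground dataset b."""
    default_a, named_a = a
    default_b, named_b = b
    # The default graph is asserted, so b's default triples must appear in a's.
    if not default_b <= default_a:
        return False
    # Each named graph of b needs a graph of the same name in a that contains it.
    return all(n in named_a and g <= named_a[n] for n, g in named_b.items())

g1_only = make_dataset(named={":g1": [SPO]})
default_only = make_dataset(default=[SPO])
both = make_dataset(default=[SPO], named={":g1": [SPO]})
empty = make_dataset()

t31 = ground_simple_dataset_entails(g1_only, default_only)   # T3.1
t61 = ground_simple_dataset_entails(g1_only, empty)          # T6.1
t63 = ground_simple_dataset_entails(both, g1_only)           # T6.3
t64 = ground_simple_dataset_entails(both, default_only)      # T6.4
```

`t31` is `False`, matching "named graphs are not asserted", while `t61`, `t63` and `t64` are `True`. The check also reports that the empty dataset does not entail `:g1 {}`, which follows from the requirement that IGEXT(Id(n)) be defined.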

2.9 Different named graphs do not contradict each other

T7.1 Under simple dataset entailment:

:g1 { :s :p :o1 }

is consistent with

:g1 { :s :p :o2 }

2.10 The same entailment regime is active in default and named graphs

T8.1 Under OWL dataset entailment:

{ :s :p :o1. :o1 owl:sameAs :o2 }

entails

{ :s :p :o2 }

T8.2 Under OWL dataset entailment:

:g1 { :s :p :o1. :o1 owl:sameAs :o2 }

entails

:g1 { :s :p :o2 }

2.11 Named graphs are independent from each other

T9.1 Under OWL dataset entailment:

:g1 { :s :p :o1 }
:g2 { :o1 owl:sameAs :o2 }

is consistent with, but does not entail

:g1 { :s :p :o1, :o2 }

2.12 Named graphs are independent from the default graph

T10.1 Under OWL dataset entailment:

{ :o1 owl:sameAs :o2 }
:g1 { :s :p :o1 }

is consistent with, but does not entail

:g1 { :s :p :o1, :o2 }

T10.2 Under OWL dataset entailment:

{ :s :p :o1 }
:g1 { :o1 owl:sameAs :o2 }

is consistent with, but does not entail

{ :s :p :o1, :o2 }

2.13 Indirection between graph name and graph

T11.1 Under OWL dataset entailment:

{ :g1 owl:differentFrom :g2 }
:g1 { :s :p :o }
:g2 { :s :p :o }

is consistent.


T11.2 Under OWL dataset entailment:

{ :g1 owl:sameAs :g2 }
:g1 { :s :p :o }

entails

{ :g1 owl:sameAs :g2 }
:g1 { :s :p :o }
:g2 { :s :p :o }

(Note: This entailment would not hold under the IRI-IGEXT version of the semantics; see Design Decision 4 below.)


T11.3 Under OWL dataset entailment:

{ :g1 owl:sameAs "a" }
:g1 { :s  :p  :o }

is consistent.

2.14 Inconsistencies remain contained within their named graph

T12.1 Under OWL dataset entailment:

:g1 { :o1 owl:differentFrom :o1 }

entails

:g1 { :s :p :o }

T12.2 Under OWL dataset entailment:

:g1 { :o1 owl:differentFrom :o1 }

does not entail

{ :s :p :o }

T12.3 Under OWL dataset entailment:

:g1 { :o1 owl:differentFrom :o1 }

does not entail

:g2 { :s :p :o }

2.15 IRIs can denote different resources in different graphs

T13.1 Under OWL dataset entailment:

{ :s owl:sameAs "a" }
:g1 { :s owl:sameAs "b" }
:g2 { :s owl:sameAs "c" }

is consistent.

2.16 Brain twisters

T14.1 Under D-dataset-entailment with a datatype map containing the four mentioned datatypes:

{ :p  rdfs:range  xsd:boolean .
  :s  :p  :n, :m, :o . }
:n { :q  rdfs:range  xsd:string .
     :x  :q  :y }
:m { :q  rdfs:range  rdf:HTML .
     :x  :q  :y }
:o { :q  rdfs:range  rdf:langString .
     :x  :q  :y}

is inconsistent.

This is because xsd:boolean is known to have only two members. The :p triples assign all three graph names to the class xsd:boolean. Hence two of them must denote the same resource. But the three graphs are pairwise inconsistent because the ranges defined for :q are pairwise disjoint, so :y would have to be in two disjoint classes.

I don't think that example 2.16 is correct. It seems to me that any dataset with a consistent default graph is consistent. Any interpretation of the default graph can be extended to an interpretation of the dataset by mapping every resource to inconsistent graphs. This trivially satisfies "IGEXT(Id(n)) is defined and E-entails g" as an inconsistent graph entails every graph. —PFPS

3 Design Decisions (DDs)

These are design decisions that the group may have to consider to make the final consensus process clean.

3.1 DD0: Do we define a semantics for RDF datasets?

Design Decision 0: Should we say anything about the semantics of RDF datasets at all?

All test cases above assume the answer is “yes”. If the answer is “no”, then one cannot speak of RDF datasets in terms of entailment or contradiction. A separate notion of “equivalence” for RDF datasets would have to be defined; this would probably be “RDF dataset isomorphism”, that is, two datasets are equivalent if they only differ in their blank nodes.

If DD0 is answered “no”:

{ :s :p 42.0 }

would not be equivalent to (but consistent with):

{ :s :p +42.00 }
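The difference between the two literals is purely lexical. As a rough illustration, the sketch below compares the two lexical forms first as strings and then as values, using Python's `Decimal` as a stand-in for the xsd:decimal value space. That mapping is an assumption of the sketch and not something the proposal defines.

```python
from decimal import Decimal

lexical_a = "42.0"
lexical_b = "+42.00"

# Without any semantics, only the syntax is available for comparison.
same_syntax = lexical_a == lexical_b

# With datatype awareness, both lexical forms map to the same decimal value.
same_value = Decimal(lexical_a) == Decimal(lexical_b)
```

`same_syntax` is `False` and `same_value` is `True`: a datatype-aware entailment regime would treat the two datasets as equivalent, whereas a purely syntactic notion such as dataset isomorphism would not.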

3.2 DD1: Different regime for default graphs and named graphs?

Design Decision 1: Can the entailment regime of the default graph be different from the one of the <name,graph> pairs? More generally, could we assign a different entailment regime to each individual <name,graph> pair? Note that the SPARQL Entailment document allows for that, and it would be relatively easy, mathematically, to extend the semantics to do that. However, the result would be fairly complicated.

The test cases above assume that the same entailment regime holds for the default graph and for the named graphs. So, if a particular entailment holds in the default graph G, then it also holds in a named graph G. But potentially this could be changed, either by having one regime for the default graph and another one for the named graphs, or by having separate regimes for every named graph.

(No full proposal for this has been made yet, so no test case.)

3.3 DD2: No-Semantics

Design Decision 2: Do we want to allow an entailment regime that is “weaker” than Simple Entailment? Something like the "no-semantics" in one of our previous proposals. Note that this is relevant only if the answer to DD1 is “yes”. Otherwise this amounts to not applying any semantics at all to the dataset, which does not require any further formalism.

The proposal is to introduce a new “no-semantics” entailment regime, in which a graph G entails only isomorphisms of G.

Under no-dataset-semantics:

:g1 { :s :p :o1 }

contradicts

:g1 { :s :p :o2 }

Under no-dataset-semantics:

:g1 { :s :p :o1 }

contradicts

:g1 { :s :p :o1, :o2 }

Under no-dataset-semantics:

{ :s :p :o1 }

contradicts

{ :s :p :o2 }

This case assumes that the same entailment regime applies to default graph and named graph; otherwise it could be made consistent by applying a normal entailment regime to the default graph, and no-semantics only to the named graphs.

3.4 DD3: Let the dataset announce its assumed entailment regime?

Design Decision 3: Can a dataset declare what semantics it assumes, instead of letting the application decide in all cases? This was proposed as a possible extension in a previous proposal. Another alternative explored in the past was the usage of extra predicates in TriG.

No complete proposal for this is on the table yet, so no test case; but here's an example that uses the [SPARQL Service Description](http://www.w3.org/TR/sparql11-service-description/) vocabulary. This shows how an announcement of the used entailment regime could potentially look. This assumes a different answer for DD2.

<> a sd:Dataset;
   sd:defaultEntailmentRegime er:rdf;
   sd:namedGraph [
      a sd:NamedGraph;
      sd:name "http://example.com/g"^^xsd:anyURI;
      sd:entailmentRegime er:simple
   ].

3.5 DD4: Does the graph extension assign graphs to resources or to IRIs?

Design Decision 4: Should the relationship be between the "name" and the graph, or between the resource denoted by the "name" and the graph? The latter can be made a proper semantic extension of the former, but not the other way around. In formal terms, IGEXT could map IRIs, rather than resources, to graphs; in that case the formalism would refer to IGEXT(n) instead of IGEXT(Id(n)).

Under the RES-IGEXT variant of the semantics (formalized above):

{ :g1 owl:sameAs :g2 }
:g1 { :s :p :o }

entails

{ :g1 owl:sameAs :g2 }
:g1 { :s :p :o }
:g2 { :s :p :o }

Under the IRI-IGEXT variant of the semantics, this particular entailment would not hold, because the graph is associated with the IRI, and not with the resource that is declared as having two names.
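The difference between the two variants can be illustrated with a single interpretation in which :g1 and :g2 denote the same resource, as owl:sameAs requires. The sketch assumes ground graphs with subset inclusion standing in for entailment; the names `denotes`, `pair_true_res` and `pair_true_iri` are hypothetical.

```python
G = frozenset({(":s", ":p", ":o")})

# An interpretation satisfying the default graph: owl:sameAs forces both
# names to denote the same resource, here called "r".
denotes = {":g1": "r", ":g2": "r"}

def pair_true_res(igext_by_resource, name, graph):
    """RES-IGEXT: look up the graph via the denoted resource, IGEXT(Id(n))."""
    ext = igext_by_resource.get(denotes[name])
    return ext is not None and graph <= ext

def pair_true_iri(igext_by_iri, name, graph):
    """IRI-IGEXT: look up the graph via the IRI itself, IGEXT(n)."""
    ext = igext_by_iri.get(name)
    return ext is not None and graph <= ext

res_igext = {"r": G}      # keyed by resource
iri_igext = {":g1": G}    # keyed by IRI; :g2 has no graph extension

res_g1 = pair_true_res(res_igext, ":g1", G)
res_g2 = pair_true_res(res_igext, ":g2", G)
iri_g1 = pair_true_iri(iri_igext, ":g1", G)
iri_g2 = pair_true_iri(iri_igext, ":g2", G)
```

Under RES-IGEXT, making the :g1 pair true necessarily makes the :g2 pair true (`res_g1` and `res_g2` both hold). Under IRI-IGEXT the sketch exhibits a counter-model: `iri_g1` holds while `iri_g2` does not, so the entailment fails.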

3.6 DD5: Does the graph name denote the graph?

Design Decision 5: In <n,G>, does n denote G, or may n denote any resource? Note that this is related to DD4. In the terminology of the current RDF Semantics, the semantics presented here is not “denoting”, because Id does not map n to any graph.

Under the semantics formalized above:

{ :g1 owl:differentFrom :g2 }
:g1 { :s :p :o }
:g2 { :s :p :o }

is consistent, because :g1 and :g2 can be associated with the same graph even though they denote different resources.

Under an alternative semantics where :g1 and :g2 directly denote the graphs, the above would be a contradiction.


To give a more colorful example, assume we've crawled this from the web:

<http://example.com/people/bob> { <http://example.com/people/bob> a foaf:Person }

Under an alternative semantics where graph names directly denote graphs, it would be legitimate to conclude:

{ <http://example.com/people/bob> a sd:Graph }

This conclusion would be derived from the prose description of sd:Graph; the meaning of sd:Graph is not formalized. It means that, literally, <http://example.com/people/bob> denotes an RDF graph, a set of triples. Given that it's pretty clear that a person is not a set of triples, this would be inconsistent with the statement:

<http://example.com/people/bob> a foaf:Person.

But under the semantics formalized on this page, no assumption about the nature of <http://example.com/people/bob> is made, so we cannot conclude that it denotes an sd:Graph, and no inconsistency with the assumption that Bob is a person arises.

3.7 DD6: Open-graph or closed-graph semantics

Design Decision 6: Is it sufficient for the truth of I(<n,g>) that IGEXT(Id(n)) E-entails g, or should we require that IGEXT(Id(n)) is equivalent to g under E-entailment? This is open-graph versus closed-graph semantics.

The test cases above assume the open-graph version of the semantics. Under the open-graph version with simple entailment:

:g1 { :s :p :o1, :o2 }

entails

:g1 { :s :p :o1 }

Under the alternative closed-graph version of the semantics:

:g1 { :s :p :o1, :o2 }

contradicts

:g1 { :s :p :o1 }

Another example. Under open-graph semantics:

:g1 { :s :p :o1 }

is consistent with

:g1 { :s :p :o2 }

But under closed-graph semantics:

:g1 { :s :p :o1 }

contradicts

:g1 { :s :p :o2 }
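The two readings can be contrasted by asking whether a single graph extension for :g1 can make both <name,graph> pairs true. The sketch below assumes ground graphs under simple entailment, so that entailment is subset inclusion and equivalence is set equality; `open_witness` and `closed_witness` are hypothetical helpers that return such a graph extension, or `None` when none exists.

```python
def open_witness(g1, g2):
    """Open-graph: the union entails both graphs, so it can serve as IGEXT(Id(n))."""
    return g1 | g2

def closed_witness(g1, g2):
    """Closed-graph: IGEXT(Id(n)) must be equivalent to both graphs at once."""
    return g1 if g1 == g2 else None

A = frozenset({(":s", ":p", ":o1")})
B = frozenset({(":s", ":p", ":o2")})
AB = A | B

open_result = open_witness(A, B)              # the two-triple graph
closed_result = closed_witness(A, B)          # None: the pairs contradict
closed_subset_result = closed_witness(AB, A)  # None: even a subgraph contradicts
```

Under the open-graph reading a witness always exists for ground graphs, which is why the pairs are merely consistent. Under the closed-graph reading a witness exists only when the two graphs coincide.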

3.8 DD7: Is the default graph universally true?

Design Decision 7: Should the truth of a named graph require that the named graph satisfies the default graph?

The test cases above assume that the default graph has a truth value, and it is asserted; however, its truth is not presumed in the named graphs.

Under OWL dataset semantics:

{ :o1 owl:sameAs :o2.
  :s :o :o1 }

entails

# This is an RDF graph, not an RDF dataset
:o1 owl:sameAs :o2.
:s :o :o1, :o2

Under OWL dataset semantics:

{ :o1 owl:sameAs :o2. }
:g1 { :s :o :o1 }

is consistent with, but does not entail

{ :o1 owl:sameAs :o2. }
:g1 { :s :o :o1, :o2 }

Design Decision 7 asks whether the truth of the default graph should be presumed in the named graphs. If it were, the second example would be an entailment.

4 Addressing the Use Cases

TODO for TF-Graphs: Write up how the abstract syntax with the presented semantics can address (some of) the multigraph use cases. Our main use cases are:

5 Possible Semantic Extensions

This section sketches various possible semantic extensions, which shows the flexibility and extensibility of the proposal. Standardizing any of these extensions is not part of the proposal, but they may be considered as separate proposals.

5.1 owl:imports entailment

This gives a formal semantics to the owl:imports property as defined in OWL.

5.1.1 Example

Under simple-dataset-plus-owl:imports entailment:

{ :g1 owl:imports :g2 . }
:g1 { :p :q :r }
:g2 { :x :y :z }

entails:

{ :g1 owl:imports :g2 . }
:g1 { :p :q :r. :x :y :z }

5.1.2 Definition

An E-dataset-plus-owl:imports-interpretation is an E-dataset-interpretation that meets the following additional semantic conditions:

  • If <r1,r2> is in IEXT(I(owl:imports)) and IGEXT(r1) and IGEXT(r2) are both defined, then IGEXT(r1) E-entails IGEXT(r2).
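As a rough illustration of this condition, the following sketch filters candidate graph extensions for the example in 5.1.1. It assumes ground graphs under simple entailment, so that "E-entails" becomes subset inclusion; `satisfies_imports_condition`, `imports_ext` and the resource labels `r1` and `r2` are hypothetical.

```python
def satisfies_imports_condition(imports_ext, igext):
    """Check that IGEXT(r1) entails IGEXT(r2) for every imports pair <r1,r2>
    whose graph extensions are both defined."""
    return all(
        igext[r2] <= igext[r1]
        for r1, r2 in imports_ext
        if r1 in igext and r2 in igext
    )

PQR = (":p", ":q", ":r")
XYZ = (":x", ":y", ":z")
imports_ext = {("r1", "r2")}   # <r1,r2> is in IEXT(I(owl:imports))

without_import = {"r1": frozenset({PQR}), "r2": frozenset({XYZ})}
with_import = {"r1": frozenset({PQR, XYZ}), "r2": frozenset({XYZ})}

rejected = satisfies_imports_condition(imports_ext, without_import)
accepted = satisfies_imports_condition(imports_ext, with_import)
```

`rejected` is `False` and `accepted` is `True`: only interpretations in which the graph of the importing resource already contains the imported triples survive, which is why :g1 is entailed to contain :x :y :z in the example.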

5.2 Web entailment

This explains the follow-your-nose dereferencing mechanics as an entailment regime. Thus, the empty dataset web-entails the entire web of data.

An E-web-interpretation is an E-dataset-interpretation that meets the following additional semantic condition:

  • If r is a web resource with at least one representation in an RDF format, and parsing the representation yields graph G, then IGEXT(r) entails G.

(This perhaps should say, IGEXT(r) is equivalent to G. It depends on how one wants to treat resources with multiple representations where the representations encode different triples. Is this a contradiction, or does it mean that the state of the resource is the merge of everything said in the representations?)

5.3 Direct graph semantics

NOTE: This extends the dataset semantics non-monotonically, so the way it is done here is probably a bad idea.

This is an attempt at formalizing an extension that allows making statements directly about RDF graphs (rather than about resources that have an associated RDF graph, a.k.a. g-boxes). The idea is that one can type a resource as rdf:Graph in the default graph, and doing so asserts that the resource actually is the RDF graph given in a name-graph pair. This makes the graph name rigidly denote the RDF graph.

5.3.1 Example

{ :g1 a rdf:Graph. }
:g1 { :s :p :o }

Now, :g1 denotes the RDF graph consisting of the single triple :s :p :o.

5.3.2 Definition

An E-direct-graph-interpretation is an E-dataset-interpretation that meets the following additional semantic conditions:

  • g is in IG if and only if g is an RDF graph
  • IG = ICEXT(I(rdf:Graph))
  • If g is in IG then IGEXT(g) = g

The first and second conditions introduce IG, the set of all RDF graphs, denoted by rdf:Graph. The third condition requires that any RDF graph have itself as its graph extension.

Furthermore, we need to change the definition of the interpretation function I for <name,graph> pairs, in order to “disable” open-graph semantics when the resource in question is an RDF graph:

  • for an IRI n and RDF graph g, I(<n,g>) is true iff IGEXT(Id(n)) is defined and
    • equals g, if Id(n) is in IG
    • E-entails g, otherwise
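The modified truth condition can be sketched as follows, again assuming ground graphs, with set equality standing in for "equals g" and subset inclusion standing in for E-entailment. The function name `pair_is_true_direct` and the resource label `r` are hypothetical. Graphs are frozensets, so a graph can itself be used as a resource.

```python
def pair_is_true_direct(resource, graph, igext, ig):
    """Truth of <n,g> where `resource` is Id(n) and `ig` is the set IG of RDF graphs."""
    if resource not in igext:
        return False
    extension = igext[resource]
    if resource in ig:
        return extension == graph      # closed: the resource is an RDF graph
    return graph <= extension          # open: ordinary dataset semantics

G = frozenset({(":s", ":p", ":o")})
EMPTY = frozenset()
ig = {G}

# :g1 is typed as rdf:Graph, so it denotes G itself and IGEXT(G) = G.
igext_direct = {G: G}
# Plain dataset semantics: :g1 denotes some non-graph resource "r".
igext_plain = {"r": G}

direct_full = pair_is_true_direct(G, G, igext_direct, ig)
direct_empty = pair_is_true_direct(G, EMPTY, igext_direct, ig)
plain_empty = pair_is_true_direct("r", EMPTY, igext_plain, ig)
```

`plain_empty` is `True`, matching the ordinary entailment from `:g1 { :s :p :o }` to `:g1 {}`, while `direct_empty` is `False` once :g1 rigidly denotes the graph. This is the kind of case in which an entailment of the base semantics is lost under the extension.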

5.3.3 Issues

This is ugly because it cannot be stated purely as a constraint on interpretations but also requires a modification of the interpretation function.

There is also the question of whether this is a proper (monotonic) semantic extension of dataset entailment: In many cases where RDF dataset A dataset-entails RDF dataset B, the two datasets contradict each other under direct graph entailment. Is this still monotonic? (Answer: No, because some cases of “A entails B” turn into “A contradicts B” when the extension is switched on, despite A being consistent.)