Difference between revisions of "TF-Graphs/RDF-Datasets-Proposal"

From RDF Working Group Wiki
Revision as of 11:06, 29 June 2011

This is a proposal to address the “multiple graphs” work item for TF-Graphs. It is a minimalist proposal, based on SPARQL's RDF Dataset, compatible with existing SPARQL implementations, and in line with the “Named Graphs” concept.

Charter work item being addressed

This proposal addresses the following work item from the WG charter:

The RDF Community has used the term “named graphs” for a number of years in various settings, but this term is ambiguous, and often refers to what could rather be referred as quoted graphs, graph literals, URIs for graphs, knowledge bases, graph stores, etc. The term “Support for Multiple Graphs and Graph Stores” is used as a neutral term in this charter; this term is not and should not be considered as definitive. The Working Group will have to define the right term(s).

Required features: Standardize a model and semantics for multiple graphs and graphs stores (see the Workshop result page for further references)

Overview

  • The definition of RDF Dataset, as currently defined in SPARQL, would be lifted into the RDF Concepts document.
  • This is a set of zero or more <IRI, g-snap> pairs (named graphs), plus one unnamed (default) g-snap.
  • The exact nature of the relationship between IRI and g-snap in a pair is left unspecified.
  • The interpretation of the IRI, in the RDF Semantics sense, is left unspecified.
  • Serialization formats such as N-Quads, Qurtle, etc., could be specified as serializing an RDF Dataset.

Proposal text

The following text would be inserted into RDF Concepts. It is lifted, with minor adaptations, from Sections 12 and 17.1.2 of SPARQL 1.1.

The RDF data model expresses information as graphs consisting of triples with subject, predicate and object. Often, one wants to hold multiple RDF graphs and record information about each graph, allowing an application to work with datasets that involve information from more than one graph.

An RDF Dataset represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI.

An RDF Dataset may contain zero named graphs; an RDF Dataset always contains one default graph.

Formally, an RDF dataset is a set:

{ G, (<u1>, G1), (<u2>, G2), . . . (<un>, Gn) }

where G and each Gi are graphs, and each <ui> is an IRI. Each <ui> is distinct.

G is called the default graph. The pairs (<ui>, Gi) are called named graphs.
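As a rough illustration of the definition above, the following sketch models an RDF Dataset as one default graph plus a mapping from IRIs to graphs. The names Triple, Graph and Dataset are hypothetical and not part of the proposal, and terms are simplified to plain strings.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

# Simplifying assumption: every term (IRI, literal, blank node) is a string.
Triple = Tuple[str, str, str]   # (subject, predicate, object)
Graph = FrozenSet[Triple]       # a graph is a set of triples


@dataclass(frozen=True)
class Dataset:
    """Hypothetical container mirroring { G, (<u1>, G1), ..., (<un>, Gn) }."""
    # There is always exactly one default graph; it may be empty.
    default: Graph = frozenset()
    # Keying the named graphs by IRI makes each <ui> distinct by construction.
    named: Dict[str, Graph] = field(default_factory=dict)


ds = Dataset(
    default=frozenset({("ex:a", "ex:p", "ex:b")}),
    named={"http://example.com/g1": frozenset({("ex:c", "ex:q", "ex:d")})},
)
empty = Dataset()  # zero named graphs, but still one (empty) default graph
```

Using a mapping rather than a list of pairs is a design choice of this sketch: it enforces the requirement that no IRI names two graphs.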

Use cases

See TF-Graphs-UC for a collection of use cases for the “multiple graphs” work item.

To address some use cases, one needs to define additional RDF vocabulary that describes graphs and their relationships. These terms would be used to make statements about graphs, e.g., that :G1 is a snapshot of http://example.com/foo.rdf taken on a certain date.

The proposal directly addresses use cases that don't rely on exchange of graph sets, but use them just as a means of keeping triples separate. No additional vocabulary is needed to address those use cases. It provides a sound abstract model, even though in some cases g-boxes instead of g-snaps might be more appropriate.

Semantics

RDF Semantics would be updated to state that blank nodes in an RDF Dataset are scoped to the graph (default or named) they occur in. The same blank node cannot occur in two graphs at the same time.
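The scoping rule could be checked mechanically along the following lines. This is only a sketch: it assumes an in-memory representation in which blank nodes are strings starting with "_:" and equal strings mean the same node, and the function names are invented for illustration.

```python
from itertools import combinations


def blank_nodes(graph):
    """Collect the terms that use the '_:' blank-node prefix (assumed encoding)."""
    return {term for triple in graph for term in triple if term.startswith("_:")}


def shared_blank_nodes(graphs):
    """Return the blank nodes that occur in more than one graph.

    `graphs` maps a graph key (None for the default graph) to a set of triples.
    A dataset respects the scoping rule when the result is empty.
    """
    shared = set()
    for (_, g1), (_, g2) in combinations(graphs.items(), 2):
        shared |= blank_nodes(g1) & blank_nodes(g2)
    return shared


graphs = {
    None: {("_:b0", "ex:p", "ex:a")},                     # default graph
    "http://example.com/g1": {("_:b0", "ex:p", "ex:b")},  # reuses _:b0
    "http://example.com/g2": {("_:b1", "ex:p", "ex:c")},
}
violations = shared_blank_nodes(graphs)
```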

Interpreting datasets

The interpretation of an RDF Dataset (G, (<n1>,Gn1), ..., (<nk>,Gnk)) is a tuple (I, In1, ..., Ink) where I is an RDF-interpretation of G (the default graph) and for all i in [1..k], Ini is an RDF-interpretation of Gni.

A model of an RDF Dataset (G, (<n1>,Gn1), ..., (<nk>,Gnk)) is an interpretation (I, In1, ..., Ink) such that I is an RDF-model of G, and for all i in [1..k], Ini is a model of Gni.

We say that a dataset D=(G, (<n1>,Gn1), ..., (<nk>,Gnk)) entails a dataset (H, (<m1>,Hm1), ..., (<mp>,Hmp)) iff {m1, ..., mp} is included in {n1, ..., nk} and for all models (I, In1, ..., Ink) of D, I is an RDF-model of H and for all m in {m1, ..., mp}, Im is a model of Hm.
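Because the components of a dataset interpretation are independent of each other, the entailment condition can be checked graph by graph. The sketch below does this under two stated assumptions that go beyond the text: all graphs are ground (no blank nodes) and only simple entailment is considered, in which case one graph entails another exactly when it contains all of its triples. The function names and the (default, named) tuple layout are hypothetical.

```python
def ground_simple_entails(g, h):
    """For ground graphs, simple entailment reduces to a subset test."""
    return set(h) <= set(g)


def dataset_entails(d, e):
    """Check whether dataset d entails dataset e.

    Each dataset is a pair (default_graph, {name: graph}).
    """
    d_default, d_named = d
    e_default, e_named = e
    # The names used in e must be included in the names used in d.
    if not set(e_named) <= set(d_named):
        return False
    # The default graph of d must entail the default graph of e.
    if not ground_simple_entails(d_default, e_default):
        return False
    # Each named graph of e must be entailed by the graph of the same name in d.
    return all(ground_simple_entails(d_named[m], h) for m, h in e_named.items())


t1 = ("ex:a", "ex:p", "ex:b")
t2 = ("ex:c", "ex:q", "ex:d")
D = ({t1}, {"ex:n1": {t1, t2}})
E_ok = (set(), {"ex:n1": {t1}})    # same name, fewer triples
E_bad = (set(), {"ex:n2": set()})  # name not present in D
```

Note that E_bad is not entailed even though its only named graph is empty, because the name ex:n2 does not occur in D.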

Antoine's Semantic extension (informative)

With the semantics above, no assumption is made about the URI used as a "name" for a graph. According to the WG decision of 14th April 2011, this "name" must not be understood as denoting the graph. It merely "tags" the graph, and can denote anything in an RDF interpretation (such as a person, a document, a car, a concept, or an idea).

However, there are cases where information must be attached to graphs, or where graphs are interrelated. To account for use cases where the assertions of one graph influence the knowledge in another (for instance, one graph represents what is true during the year 2010, while another represents what is true in October 2010), an extension of the semantics could define a vocabulary that provides additional constraints on the models of a dataset. Assertions using this vocabulary would determine which RDF-models of the named graphs and default graph can be put together to form a model of the dataset.

First, there must be a means to identify the graph itself with a URI. This URI should be distinct from the "name" in the (id,Gid) pairs, but related. It can be done as follows:

:G1 graph:hasName "http://example.com/name"^^xsd:anyURI .

where graph:hasName relates a graph (a set of triples) to the "name" for the graph, which is a URI. To properly identify the URI (as opposed to identifying the thing denoted by the URI) we use the datatype xsd:anyURI.

To determine which graph is denoted by :G1, we can rely on a "graph map", which is a partial function from URIs to graphs and which can be application dependent. The graph map can be provided explicitly in a TriG file, an NQuads file, by HTTP-dereferencing or with a special index in a graph store. We can stay agnostic about it.

Now, let us assume that the term graph:imports can be used to specify that a graph imports another graph (more precisely, that the content of a g-box should include the content of another g-box):

ex:onto { :Person rdfs:subClassOf :Agent }
ex:me { ex:me a :Person }
ex:you { ex:you a :Person }
:G1 graph:hasName "http://example.com/me"^^xsd:anyURI .
:G2 graph:hasName "http://example.com/you"^^xsd:anyURI .
:G3 graph:hasName "http://example.com/onto"^^xsd:anyURI .
:G1 graph:imports :G3 .

The term graph:imports means that an interpretation (I, I1, I2, I3) is a model of this dataset if I RDF-satisfies the triple :G1 graph:imports :G3, I1 satisfies both :G1 and :G3, I2 satisfies :G2, and I3 satisfies :G3. This would entail:

ex:me { ex:me a :Agent }

but would not entail:

ex:you { ex:you a :Person }
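The effect of graph:imports on this example can be approximated in a few lines. The sketch assumes that the graph named ex:me imports the graph named ex:onto (the import that the ex:me entailment relies on), keys graphs directly by name instead of going through graph:hasName, handles direct imports only, and applies just the RDFS subclass rule rdfs9. All function and variable names are illustrative.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs9_closure(graph):
    """Apply rule rdfs9 (x type C, C subClassOf D => x type D) to a fixpoint."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        for (x, p, c) in list(closed):
            if p != RDF_TYPE:
                continue
            for (c2, q, d) in list(closed):
                if q == SUBCLASS and c2 == c and (x, RDF_TYPE, d) not in closed:
                    closed.add((x, RDF_TYPE, d))
                    changed = True
    return closed


def effective_graph(name, named, imports):
    """Union a named graph with the graphs it directly imports."""
    triples = set(named[name])
    for imported in imports.get(name, ()):
        triples |= named[imported]
    return triples


named = {
    "ex:onto": {(":Person", SUBCLASS, ":Agent")},
    "ex:me": {("ex:me", RDF_TYPE, ":Person")},
    "ex:you": {("ex:you", RDF_TYPE, ":Person")},
}
imports = {"ex:me": ["ex:onto"]}  # assumption: only ex:me imports the ontology

me_closure = rdfs9_closure(effective_graph("ex:me", named, imports))
you_closure = rdfs9_closure(effective_graph("ex:you", named, imports))
```

With these inputs, me_closure contains the triple typing ex:me as :Agent, while you_closure does not contain the corresponding triple for ex:you, since ex:you never sees the subclass axiom.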

Additionally, a semantic extension could define global restrictions. For instance, to interpret a dataset as a single RDF graph, one could define the following global restriction (let us call it SIMPLE): an interpretation (I, I1, ..., Ik) is a SIMPLE-model of a dataset if I, I1, ..., and Ik are all RDF-models of all the graphs in the dataset. This is equivalent to saying that the interpretation of an RDF Dataset is that of the RDF-merge of its constituent graphs (Richard's first proposal).
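Under the SIMPLE restriction, a dataset can be flattened into the RDF-merge of its graphs. The following sketch shows one possible way to compute such a merge, assuming blank nodes are strings starting with "_:" and renaming them per graph so that nodes from different graphs stay distinct. The function name and the renaming scheme are assumptions of this example.

```python
def rdf_merge(default, named):
    """Union all graphs, renaming blank nodes per graph to keep them apart."""
    merged = set()
    sources = [default] + list(named.values())
    for index, graph in enumerate(sources):
        def rename(term, index=index):
            # Prefix blank node labels with the position of their source graph.
            if term.startswith("_:"):
                return f"_:g{index}_{term[2:]}"
            return term

        merged |= {tuple(rename(term) for term in triple) for triple in graph}
    return merged


default = {("_:x", "ex:p", "ex:a")}
named = {"http://example.com/g1": {("_:x", "ex:p", "ex:b")}}
merged = rdf_merge(default, named)
subjects = {s for (s, _, _) in merged}
```

The two occurrences of _:x end up as two different blank nodes in the merge, which is what distinguishes an RDF-merge from a plain set union.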

Discussion

Pros

  • Can be used to serialize a snapshot of a set of named g-boxes
  • Essentially lifts what's already defined and deployed in SPARQL to the level of RDF
  • Is compatible with the community notion of named graphs
  • Uses an already established term from SPARQL: RDF Dataset
  • Provides the building blocks for solving many if not all TF-Graphs use cases in a simple way

Cons

  • Can't address use cases while staying inside the plain RDF data model
  • Doesn't address “small-scale” multi-graphs well, e.g., a set of triples where each has a different confidence value attached
  • There are use cases that would benefit from “nesting”, that is, having multiple parallel RDF datasets with the same graph names; these are not well handled

Etc

  • Role of the default graph is a bit underspecified. How should/would one use it?