TF-Graphs/RDF-Datasets-Proposal

From RDF Working Group Wiki
Revision as of 12:30, 22 August 2012 by Azimmerm


This is a proposal to address the “multiple graphs” work item for TF-Graphs. It is a minimalist proposal, based on SPARQL's RDF Dataset, compatible with existing SPARQL implementations, and in line with the “Named Graphs” concept.

Charter work item being addressed

This proposal addresses the following work item from the WG charter:

The RDF Community has used the term “named graphs” for a number of years in various settings, but this term is ambiguous, and often refers to what could rather be referred as quoted graphs, graph literals, URIs for graphs, knowledge bases, graph stores, etc. The term “Support for Multiple Graphs and Graph Stores” is used as a neutral term in this charter; this term is not and should not be considered as definitive. The Working Group will have to define the right term(s).

Required features: Standardize a model and semantics for multiple graphs and graphs stores (see the Workshop result page for further references)

Overview

  • The definition of RDF Dataset, as currently defined in SPARQL, would be lifted into the RDF Concepts document.
  • An RDF Dataset is a set of zero or more <IRI, g-snap> pairs (named graphs), plus one unnamed (default) g-snap.
  • The exact nature of the relationship between IRI and g-snap in a pair is left unspecified.
  • The interpretation of the IRI, in the RDF Semantics sense, is left unspecified.
  • Serialization formats such as N-Quads and Qurtle could be specified as serializing an RDF Dataset.
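
As an illustration of the last point, the following Python sketch writes a dataset out in an N-Quads-like line format: triples of the default graph are emitted without a graph label, and triples of a named graph carry the graph's IRI as a fourth term. The function name `to_nquads` and the input representation (terms already written in N-Triples syntax, graph names as bare IRI strings) are assumptions made for this sketch, not part of the proposal.

```python
def to_nquads(default_graph, named_graphs):
    """Serialize a dataset as N-Quads-style lines.

    Assumptions of this sketch: each triple is a tuple of three strings
    already in N-Triples term syntax, and named_graphs maps a bare IRI
    string to a set of such triples.
    """
    lines = []
    # Default graph: plain triples, no fourth term.
    for s, p, o in sorted(default_graph):
        lines.append(f"{s} {p} {o} .")
    # Named graphs: the graph IRI becomes the fourth term of every line.
    for name in sorted(named_graphs):
        for s, p, o in sorted(named_graphs[name]):
            lines.append(f"{s} {p} {o} <{name}> .")
    return "\n".join(lines)


default = {
    ("<http://example.com/age>",
     "<http://www.w3.org/2000/01/rdf-schema#domain>",
     "<http://example.com/Person>"),
}
named = {
    "http://example.com/n2": {
        ("<http://example.com/alice>",
         "<http://example.com/age>",
         '"31"^^<http://www.w3.org/2001/XMLSchema#integer>'),
    },
}
text = to_nquads(default, named)
print(text)
```

Note that a named graph with no triples produces no lines, so a purely line-based form like this one cannot round-trip empty named graphs.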

Proposal text

The following text would be inserted into RDF Concepts. It is lifted, with minor adaptations, from Sections 12 and 17.1.2 of SPARQL 1.1.

The RDF data model expresses information as graphs consisting of triples with subject, predicate and object. Often, one wants to hold multiple RDF graphs and record information about each graph, allowing an application to work with datasets that involve information from more than one graph.

An RDF Dataset represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI.

An RDF Dataset may contain zero named graphs; an RDF Dataset always contains one default graph.

Formally, an RDF Dataset is a set:

{ G, (<u1>, G1), (<u2>, G2), ..., (<un>, Gn) }

where G and each Gi are graphs, and each <ui> is an IRI. Each <ui> is distinct.

G is called the default graph. The pairs (<ui>, Gi) are called named graphs.
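
The formal definition maps onto a small data structure. The Python sketch below is only an illustration and not part of the proposed text: the class name `Dataset`, the type aliases and the example IRIs are assumptions. Keying the named graphs by IRI enforces the requirement that each <ui> is distinct.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

# Hypothetical types: a triple is (subject, predicate, object) as strings,
# and a graph is an immutable set of triples.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass
class Dataset:
    """One default graph G plus zero or more named graphs (<ui>, Gi)."""
    default: Graph = frozenset()
    named: Dict[str, Graph] = field(default_factory=dict)

    def add_named(self, iri: str, graph: Graph) -> None:
        # Each graph name may be used at most once in a dataset.
        if iri in self.named:
            raise ValueError(f"graph name already in use: {iri}")
        self.named[iri] = graph


ds = Dataset()  # zero named graphs, but always one (here empty) default graph
ds.add_named(
    "http://example.com/n1",
    frozenset({("http://example.com/alice", "http://example.com/age", '"31"')}),
)
```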

Use cases

See TF-Graphs-UC for a collection of use cases for the “multiple graphs” work item.

To address some use cases, one needs to define additional RDF vocabulary that describes graphs and their relationships. These terms would be used to make statements about graphs, e.g., that :G1 is a snapshot of http://example.com/foo.rdf taken on a certain date.

This proposal directly addresses use cases that do not rely on the exchange of graph sets and use them only as a means of keeping triples separate. No additional vocabulary is needed to address those use cases. The proposal provides a sound abstract model, even though in some cases g-boxes might be more appropriate than g-snaps.

Semantics

In substance, this formalization says that each RDF Graph in a Dataset is interpreted separately. This models the fact that different RDF Graphs hold in different contexts. As a result, graphs that have been put in different "named graph pairs" can contradict each other without making the Dataset inconsistent.

Like RDF interpretations, a dataset-interpretation is relative to a vocabulary V. Moreover, dataset interpretations are defined with respect to an entailment regime E, as defined in SPARQL 1.1 Entailment Regimes. Let KE be the set of all E-interpretations. The interpretation of an RDF Dataset (G, (<n1>,Gn1), ..., (<nk>,Gnk)) over vocabulary V is a pair (I,Con) where I is an E-interpretation of G (the default graph) and Con is a mapping from V to KE.

A dataset-interpretation (I,Con) of a vocabulary V with respect to an entailment regime E satisfies an RDF Dataset (G, (<n1>,Gn1), ..., (<nk>,Gnk)) iff I E-satisfies G and, for all i in [1..k], Con(ni) exists and E-satisfies Gni.

Following standard definitions, we say that a dataset D = (G, (<n1>,Gn1), ..., (<nk>,Gnk)) entails a dataset D' = (H, (<m1>,Hm1), ..., (<mp>,Hmp)) iff every dataset-interpretation (I, Con) that satisfies D also satisfies D'.
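
For a restricted case, these definitions reduce to a simple check. The Python sketch below assumes ground graphs (no blank nodes) under simple entailment, where one graph entails another exactly when the second is a subgraph of the first. Under that assumption, the first dataset entails the second iff the default graphs are in that relation and every graph name used in the second dataset also appears in the first with an entailing graph. The function names and the tuple representation are hypothetical; other entailment regimes need a different `graph_entails`.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
# Hypothetical representation: (default graph, {graph name IRI: graph}).
Dataset = Tuple[Graph, Dict[str, Graph]]


def graph_entails(g: Graph, h: Graph) -> bool:
    # Assumption: ground graphs under simple entailment, so entailment
    # is just the subgraph relation.
    return h <= g


def dataset_entails(d: Dataset, d2: Dataset) -> bool:
    g, named = d
    h, named2 = d2
    # The default graphs are compared with each other only.
    if not graph_entails(g, h):
        return False
    for name, h_m in named2.items():
        # Con(name) is only guaranteed to exist if d uses that name,
        # and each named graph is compared with its own counterpart.
        if name not in named or not graph_entails(named[name], h_m):
            return False
    return True


d = (
    frozenset({("a", "p", "b")}),
    {"n1": frozenset({("x", "q", "y"), ("x", "q", "z")})},
)
same_name_subset = (frozenset(), {"n1": frozenset({("x", "q", "y")})})
unknown_name = (frozenset(), {"n2": frozenset()})
default_leaks = (frozenset(), {"n1": frozenset({("a", "p", "b")})})

print(dataset_entails(d, same_name_subset))
print(dataset_entails(d, unknown_name))
print(dataset_entails(d, default_leaks))
```

The third check fails because a triple that appears only in the default graph is not assumed to hold in any named graph.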

Some RDF engines or graph store implementations may want to add constraints on valid interpretations, for instance that the default RDF Graph entails all the named RDF Graphs, but this is not mandated by this specification since there are use cases for the contrary. A typical restriction would be that every IRI must be interpreted identically by Con(u) for all u.


Examples (we use the TriG notation here):

@prefix : <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# This is the default graph
:age  rdfs:domain  :Person .
:n1 {
  :ageInText  rdfs:subPropertyOf  :age;
              rdfs:range  rdf:langString .
  :alice  :ageInText  "twenty-eight"@en .
}
:n2 {
  :age  rdfs:range  xsd:decimal .
  :alice  :age  31 .
}

This dataset is consistent since the incompatible statements of :n1 and :n2 are kept in distinct graphs. This dataset would entail the following dataset:

@prefix : <http://example.com/> .
:n1 {
  :alice  :age  "twenty-eight"@en .
}

but not the following:

@prefix : <http://example.com/> .
:n1 {
  :alice  a  :Person .
}

because the default graph does not necessarily hold in "the context of the named graphs".
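
The two outcomes can be reproduced with a toy forward-chaining sketch in Python. It applies only two RDFS rules, rdfs2 (rdfs:domain) and rdfs7 (rdfs:subPropertyOf), so it is sound but far from complete, and a triple it fails to derive is not in general a proof of non-entailment; here it merely matches the argument above. The function name `partial_rdfs_closure` and the abbreviated string terms are assumptions of this sketch.

```python
RDF_TYPE = "rdf:type"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"


def partial_rdfs_closure(graph):
    """Apply rules rdfs2 and rdfs7 until no new triple is produced."""
    closed = set(graph)
    while True:
        new = set()
        for (s, p, o) in closed:
            for (p2, rel, x) in closed:
                if p2 != p:
                    continue
                if rel == SUBPROP:    # rdfs7: (p subPropertyOf x), (s p o) => (s x o)
                    new.add((s, x, o))
                elif rel == DOMAIN:   # rdfs2: (p domain x), (s p o) => (s rdf:type x)
                    new.add((s, RDF_TYPE, x))
        if new <= closed:
            return closed
        closed |= new


default_graph = {(":age", DOMAIN, ":Person")}
n1 = {
    (":ageInText", SUBPROP, ":age"),
    (":ageInText", "rdfs:range", "rdf:langString"),
    (":alice", ":ageInText", '"twenty-eight"@en'),
}

# :n1 is closed on its own; the default graph is not merged into it.
closure_n1 = partial_rdfs_closure(n1)
print((":alice", ":age", '"twenty-eight"@en') in closure_n1)
print((":alice", RDF_TYPE, ":Person") in closure_n1)

# Only an implementation that chose to merge the default graph into
# every named graph would derive the type triple.
closure_merged = partial_rdfs_closure(n1 | default_graph)
print((":alice", RDF_TYPE, ":Person") in closure_merged)
```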

Discussion

Pros

  • Can be used to serialize a snapshot of a set of named g-boxes
  • Essentially lifts what's already defined and deployed in SPARQL to the level of RDF
  • Is compatible with the community notion of named graphs
  • Uses an already established term from SPARQL: RDF Dataset
  • Provides the building blocks for solving many if not all TF-Graphs use cases in a simple way

Cons

  • Can't address use cases while staying inside the plain RDF data model
  • Doesn't address “small-scale” multi-graphs well, e.g., a set of triples where each has a different confidence value attached
  • There are use cases that would benefit from “nesting”, that is, having multiple parallel RDF datasets with the same graph names; these are not well handled

Etc

  • Role of the default graph is a bit underspecified. How should/would one use it?