“RDF Graph Identification”

This document attempts to provide a minimal set of concepts around the vague term “Graph Identification” that may serve as the basis for a consensus in the RDF Working Group. By concentrating on that minimal level and on the issues listed in this document, the RDF WG may be able to move past its current deadlock on this subject.

The goal is to rely on and reuse the corresponding notions in the SPARQL 1.1 specification, introducing new notions only when necessary and for completeness. In particular, this specification introduces the notion of RDF spaces: modifiable places to store RDF triples. Examples of RDF spaces include an HTML page with embedded RDFa or microdata, a file containing RDF/XML or Turtle data, and a SQL database viewable as RDF using R2RML. RDF spaces provide a mutable counterpart of SPARQL’s named graphs appearing in datasets. Figure 1 gives an overview of the relationships among the different concepts as described in this document.

This document is not meant to be published as such. Instead, it collects editorial fragments which, together, may lead to a consensus of the RDF Working Group on the general issue of Named Graphs. Once this document is settled, individual sections may be spread over the final deliverables, i.e., RDF Concepts, RDF Semantics, and, probably, a new serialization document on named graphs. Each section is marked with the intended distribution of the material contained therein.

There are a number of issues on the details that have to be discussed and decided upon by the Working Group. An attempt has been made to clearly separate and highlight those in the document.

Introduction

There is no intended destination document for the material in this section - it is presented solely to facilitate discussion within the RDF Working Group. Document editors may pull from this material as they see fit.

The Resource Description Framework (RDF) provides a simple declarative way to store and transmit information. It also provides a trivial but effective way to combine information from multiple sources, with graph merging. This allows information from different people, different organizations, different units within an organization, different servers, different algorithms, etc., to be combined and used together, without any special processing or understanding of the relationships among the providers.

For some applications, the basic RDF merge operation is overly simplistic, as extra processing and an understanding of the relationships among the providers may be useful. This document specifies a way to conveniently handle information coming from multiple sources, by modeling each one as a separate space, and using RDF to express information about these spaces. In addition to this important concept, we provide a pair of languages (extensions to existing RDF syntaxes) which can be used to store or transmit in one document the contents of multiple spaces as well as information about them.

The RDF WG recognises that many existing implementations include the notion of modifiable places to store RDF triples for eminently practical reasons. Implementations using SPARQL 1.1, the SPARQL Protocol, the Linked Data API, Linked Open Data and various evolving forms of Linked Data for enterprises have created names for mutable RDF graphs that are coincident with their operational URLs. The RDF WG is thus encouraged to discover a formalisation of graph identification concepts that aligns with implementation experience.

Concepts

The intended destination document for this material is RDF 1.1 Concepts.

Figure 1 gives an overview of the relationships among the different concepts as described in this document.

Figure 1: Relationships among RDF spaces, graphs, datasets, and Graph Stores.

RDF Space

The term "space" might change. The final terminology has not yet been selected by the Working Group. Other candidates include "g-box", "data space", "graph space", "(data) surface", "(data) layer", "sheet", and "(data) page". The contributors also note that the term “resource” was considered, and could be used but for possible ambiguities with other, partially overlapping, uses of that term. The term “RDF space” is intended to be synonymous with the term “g-box”, as defined by the RDF Working Group.

This document is only concerned with resources that have state, and doesn’t take a particular stance on the question of what kinds of resources can have state. For more on this, see URI/Resource Relationships in AWWW.

An RDF space is anything that can reasonably be said to explicitly contain zero or more RDF triples and has an identity distinct from the triples it contains. Therefore, an RDF space is a mutable container, like a “set” data structure in programming. It may hold some RDF triples. Two spaces can happen to have the same contents (right now) while being distinct from each other. Spaces’ contents may change: today a particular space might contain the triples { my:a my:b _:x. my:a my:c _:x }, and tomorrow it might instead contain { my:a my:b _:x. my:a my:c2 _:x }.

The term “RDF space” is intended to be synonymous with the term “slot” used in SPARQL 1.1 Update (in place of the immutable RDF Graph currently used in that document) when used in the context of a SPARQL Graph Store and its contents. However, an RDF space is intended to be a more broadly applicable term to be used whenever referring to a mutable RDF container. The state of an RDF Space at any time is an RDF Graph.
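
The distinction between a space and its state can be made concrete with a small sketch. The Python below is illustrative only: the class name `Space` and its methods are hypothetical, triples are modeled as plain string tuples, and blank nodes are treated as ordinary labels, which glosses over blank node scoping.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]  # an RDF Graph: an immutable set of triples


class Space:
    """A mutable container of triples with its own identity (hypothetical name)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, triple: Triple) -> None:
        self._triples.add(triple)

    def remove(self, triple: Triple) -> None:
        self._triples.discard(triple)

    def state(self) -> Graph:
        """Return the current contents as an immutable graph value."""
        return frozenset(self._triples)


# Two distinct spaces that happen to hold the same triples right now.
s1, s2 = Space(), Space()
for s in (s1, s2):
    s.add(("my:a", "my:b", "_:x"))
    s.add(("my:a", "my:c", "_:x"))

same_contents = s1.state() == s2.state()  # equal graphs
same_space = s1 is s2                     # but not the same space

# The contents of a space may change; the earlier graph value does not.
before = s1.state()
s1.remove(("my:a", "my:c", "_:x"))
s1.add(("my:a", "my:c2", "_:x"))
after = s1.state()
```

In this sketch `before` and `after` are two different graphs, while `s1` remains the same space throughout.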

Examples of an RDF space include but are not limited to the following:

a human-readable Web page, such as an HTML page containing RDFa markup, microdata markup, or embedded Turtle.
a file, in a computer’s filesystem, containing RDF data expressed in RDF/XML, N-Triples, Turtle, etc.
a machine-readable Web page containing RDF data expressed in RDF/XML, N-Triples, Turtle, etc.
a SQL database which provides an RDF view of its data, perhaps using R2RML
the default graph or any of the named graphs available via a SPARQL endpoint
a Web Service that unambiguously returns RDF triples in some (serialization) format

...provided that the requirement for mutability is maintained. That is, each of the above examples would not be a space if it only met the definition of an RDF Graph.

Examples of things that are not spaces:

Natural language text. While it might be possible to extract some of the meaning of the text and express that meaning in RDF triples, those triples are not explicit and in practice might vary from one extractor to the next. Note that if the extractor is also well specified (e.g., Zemanta, Open Calais, etc.) so that the resulting RDF triples are unambiguously defined, then the same text combined with that specific engine can be considered an RDF space, too.
(Abstract) RDF Graphs, as defined in the RDF Concepts document. Since they are just mathematical sets of RDF triples, they have no distinct identity apart from their contents. For example, if two systems have in memory the RDF graph { <a> <b> <c> }, any metadata about the graph in one system logically applies to the graph in the other system, since technically it is the same graph. (If this seems counter-intuitive, you may be among the many who have been using the term “graph” to refer to what is called “space” in this document.)

Dataset

A dataset is defined by SPARQL 1.1 as a structure consisting of:

A distinguished RDF Graph called the default graph
A set of (name, graph) pairs, where name is an IRI and the graph is an RDF Graph. No two pairs in a dataset may have the same name.

This definition forms the basis of the SPARQL Query semantics; each query is performed against the information in a specific dataset.

Although the term is sometimes used more loosely, a dataset is a pure mathematical structure, like an RDF Graph or a set of integers, with no identity apart from its contents. Two datasets with the same contents are in fact the same dataset, and one dataset cannot change over time.
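
As an illustration of a dataset being a pure value, the following Python sketch models it as a frozen structure compared by contents. The class name `Dataset`, its field names, and the example IRIs are hypothetical and not taken from SPARQL 1.1; triples are simplified to string tuples.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass(frozen=True)
class Dataset:
    """An immutable dataset: a default graph plus (name, graph) pairs."""

    default_graph: Graph
    named_graphs: FrozenSet[Tuple[str, Graph]]

    def __post_init__(self) -> None:
        # Enforce the rule that no two pairs may share a name.
        names = [name for name, _ in self.named_graphs]
        if len(names) != len(set(names)):
            raise ValueError("no two pairs in a dataset may have the same name")


g = frozenset({("ex:a", "ex:b", "ex:c")})
d1 = Dataset(frozenset(), frozenset({("http://example.org/g1", g)}))
d2 = Dataset(frozenset(), frozenset({("http://example.org/g1", g)}))

# Built separately, but with the same contents: they compare as one value.
datasets_equal = d1 == d2
```

Because the structure is frozen, "changing" a dataset in this model always means constructing a different dataset value.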

The word “default” in the term “default graph” refers to the fact that, in SPARQL, this is the graph a server uses to perform a query when the client does not specify which graph to use. The term is not related to the idea of a graph containing default (overridable) information. The role and purpose of the default graph in a dataset varies with application.

Named Graph

SPARQL formally defines a named graph to be any of the (name, graph) pairs in a dataset.

In practice, the term is often used to refer to the graph part of those pairs. This is the usage we follow in this document, saying that a graph is a named graph in some dataset if and only if it appears as the graph part of a (name, graph) pair in that dataset. Note that “named graph” is a relation, not a class: we say that something is a named graph of a dataset, not simply that it is a named graph.
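
The "relation, not a class" reading can be sketched as a two-argument predicate. The function name `is_named_graph_of` and the example IRIs are hypothetical, and a dataset is reduced here to just its set of (name, graph) pairs.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
NamedPairs = FrozenSet[Tuple[str, Graph]]


def is_named_graph_of(graph: Graph, pairs: NamedPairs) -> bool:
    """True if `graph` is the graph part of some (name, graph) pair."""
    return any(g == graph for _, g in pairs)


g = frozenset({("ex:a", "ex:b", "ex:c")})
dataset_one = frozenset({("http://example.org/g1", g)})
dataset_two = frozenset({("http://example.org/g2", frozenset())})

# The same graph is a named graph of one dataset but not of the other.
in_one = is_named_graph_of(g, dataset_one)
in_two = is_named_graph_of(g, dataset_two)
```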

Graph Store

SPARQL 1.1 Update defines a mutable (time-dependent) structure corresponding to a dataset, called a Graph Store. It is defined as:

A distinguished slot for an RDF Graph
A set of (name, slot) pairs, where the slot holds an RDF Graph and the name is an IRI. No two pairs in a Graph Store may have the same name.

SPARQL's notion of a Graph Store is a “mutable container of RDF graphs managed by a single service” that can be manipulated through the SPARQL Update language and/or through the SPARQL HTTP Graph Store Protocol.

The definition in SPARQL 1.1 Update clearly refers to a mutable graph for a “slot”; in other words, a “slot” in this definition is actually an RDF space. The “distinguished slot” corresponds to the default graph of a dataset.

A dataset can be thought of as the state of a Graph Store, just like an RDF graph can be thought of as the state of an RDF space.
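
The relationship between a Graph Store and a dataset can be sketched as a mutable object that yields immutable snapshots. This is a minimal illustration, not the SPARQL 1.1 Update model itself: the class `GraphStore`, its `insert` and `snapshot` methods, and the example IRI are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
DatasetValue = Tuple[Graph, FrozenSet[Tuple[str, Graph]]]


class GraphStore:
    """Hypothetical in-memory Graph Store: one default slot plus named slots."""

    def __init__(self) -> None:
        self.default_slot: Set[Triple] = set()
        self.named_slots: Dict[str, Set[Triple]] = {}

    def insert(self, triple: Triple, name: Optional[str] = None) -> None:
        # A missing name targets the distinguished (default) slot.
        if name is None:
            self.default_slot.add(triple)
        else:
            self.named_slots.setdefault(name, set()).add(triple)

    def snapshot(self) -> DatasetValue:
        """Freeze the current state of every slot into a dataset value."""
        named = frozenset((n, frozenset(s)) for n, s in self.named_slots.items())
        return (frozenset(self.default_slot), named)


store = GraphStore()
store.insert(("ex:a", "ex:b", "ex:c"), name="http://example.org/g1")
earlier = store.snapshot()

store.insert(("ex:a", "ex:b", "ex:d"), name="http://example.org/g1")
later = store.snapshot()
```

Here `earlier` and `later` are two different datasets, both states of the one store; updating the store never alters a snapshot already taken.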

Note that the term “named graph” is also sometimes used to refer to the slot part of the (name, slot) pairs in a Graph Store. For example, the text of SPARQL 1.1 Update says, “This example copies triples from one named graph to another named graph”. For clarity, we avoid calling these “named graphs” (which refer to immutable content) and instead call them “named slots”, or RDF spaces, of the Graph Store.

Figure 1 gives an overview of the relationships among the different concepts.

Semantics

The intended destination document for this material is RDF 1.1 Semantics.

Interpretation of RDF Datasets

The interpretation of an RDF dataset is the interpretation of its default graph. The presence or absence of named graphs does not affect the truth of a dataset.

This semantics can also be referred to as “quoting” semantics, because an interpretation has no relevance to the triples inside the individual named graphs, only to the triples in the Default Graph. This quoting behavior is considered to be important; it avoids the “superman” effects that plagued RDF reification.
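
The quoting behavior can be illustrated with a deliberately simplified check. The sketch below assumes ground graphs only (no blank nodes), in which case simple entailment reduces to a subset test; the function name `dataset_entails` and the `ex:` terms are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def dataset_entails(default: Graph, named: Dict[str, Graph], query: Graph) -> bool:
    """Ground-graph approximation: only the default graph is consulted."""
    # `named` is deliberately ignored: named graphs are "quoted".
    return query <= default


default = frozenset({("ex:g1", "ex:assertedBy", "ex:LoisLane")})
named = {"ex:g1": frozenset({("ex:Superman", "rdf:type", "ex:FlyingThing")})}

quoted = frozenset({("ex:Superman", "rdf:type", "ex:FlyingThing")})
stated = frozenset({("ex:g1", "ex:assertedBy", "ex:LoisLane")})

quoted_follows = dataset_entails(default, named, quoted)
stated_follows = dataset_entails(default, named, stated)
```

Under these assumptions the triple inside the named graph does not follow from the dataset, while the triple in the default graph does.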

A semantic issue related to datasets, and not reflected by the statement above, is whether a “name” can be a blank node or not. This is a decision to be taken by the Working Group.

Interpretation of RDF Datasets (undecided material, for discussion)

This section needs revision by experts in formal semantics. It is intended to express the same interpretation as the preceding section, but may require more work to indeed do so. If no suitable mathematical formalism can be used, or if the resulting formalism would become too complicated, the Working Group may decide not to add anything more than the formal sentence above to the RDF Semantics.

This section suggests an interpretation of RDF Datasets, as a possible extension to the various RDF and RDFS interpretations defined in the RDF Semantics document.

In this section the “equality” of graphs in a dataset means that they are mutually inferable through simple entailment.

Let DS = (DG, (u₁,G₁),…,(u_n,G_n)) be a dataset. The vocabulary for the dataset is defined as V(DS) = V(DG) ∪ {u_i: i = 1,…,n} ∪ rdfV, where V(DG) is the vocabulary set of DG, and rdfV is the RDF Vocabulary (as defined in the RDF Semantics document). The following conditions on V(DG) also hold:

{G_i : i = 1,…,n} ∩ LV = ∅ (i.e., named graphs are not literals and literals cannot denote a graph)
∀i,j, i,j=1,…,n: if u_i = u_j then G_i and G_j are equal.
∀i, i=1,…,n: u_i is not a blank node (this constraint depends on the decision of the Working Group, see above.)

Let I be an RDF interpretation on V(DS) for which the following conditions also hold:

{G_i : i = 1,…,n} ⊂ range(I)
∀i: I(u_i) = G_i

then I is also an interpretation of the RDF Dataset. Replacing rdfV by the corresponding RDFS or OWL Vocabulary, the same definition automatically extends to these (in the case of OWL that means the RDF Compatible Semantics of OWL).

Semantic Extension Points (undecided material, for discussion)

There have been discussions in the group on (slightly) more complex semantics for datasets (see, e.g., one of the proposals). An earlier discussion occurred around a possible extension point that would give the possibility for different communities and/or applications to define their own semantics. If the group finds a consensus on this (or a similar mechanism) then this could end up in the final documents; otherwise the group may stay silent on this.

A possible extension point for the Semantics is to assign types to graph names. By default, in case of a named graph pair (n,G), the additional

n rdf:type rdf:Graph .

triple also holds (this must be added to the semantic constraints of the interpretation function). Further classes can be defined by communities; for example, a community may define

ex:nonQuote rdfs:subClassOf rdf:Graph .
n rdf:type ex:nonQuote .

which signals a reasoner that the content of G should be merged with the default graph for the purpose of graph interpretation and inference. Another example is

ex:GetSemantics rdfs:subClassOf rdf:Graph .
n rdf:type ex:GetSemantics .

which signals the RDF environment that doing an HTTP GET operation on 'n' should result in a serialization of the graph 'G'.

The RDF Working Group has not decided whether to define any of these additional classes. By default, the definition of these classes is intended to be left to communities.
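
To illustrate how a community-defined class such as ex:nonQuote might be acted upon, the following sketch computes the graph a reasoner would use. It rests on several assumptions not settled in this document: that the typing triple is looked up in the default graph, that no subclass reasoning is needed to find it, and that merging is plain set union (adequate only for ground triples). The function name `graph_for_inference` is hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def graph_for_inference(default: Graph, named: Dict[str, Graph]) -> Graph:
    """Merge into the default graph every named graph typed ex:nonQuote."""
    merged = set(default)
    for name, graph in named.items():
        # Assumption: the type assertion lives in the default graph.
        if (name, "rdf:type", "ex:nonQuote") in default:
            merged |= graph
    return frozenset(merged)


default = frozenset({("ex:g1", "rdf:type", "ex:nonQuote")})
named = {
    "ex:g1": frozenset({("ex:a", "ex:b", "ex:c")}),  # typed: merged
    "ex:g2": frozenset({("ex:d", "ex:e", "ex:f")}),  # untyped: stays quoted
}
result = graph_for_inference(default, named)
```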

Dataset Languages

The intended destination documents for this material are the individual syntax specification documents.

This section contains specifications of languages for serializing datasets. Dataset information may also be conveyed and manipulated using SPARQL or using RDF triple-based tools and languages.

TriG

Specification of TriG is possibly the subject of a separate Recommendation or Note to be published by the Working Group.

The current TriG grammar, slightly reformulated to link to the current Turtle Grammar, is as follows:

[1g]	trigDoc	::=	statement*
[2g]	statement	::=	directive "." | namedGraph | wrappedDefault
[3g]	namedGraph	::=	iri "="? "{" triples "}" "."? | "{" "}" "."?
[4g]	wrappedDefault	::=	"{" triples "}" "."? | "{" "}" "."?

The grammar symbols directive, triples, and iri are defined in the Turtle Grammar.

Some notes on this grammar:

It forbids directives between curly braces, and there is no syntax to express “nested” graphs. The latter is in line with the WG resolution that considers nested graphs out of scope.
It requires curly braces for the content of the default graph.
It allows for an optional “=” character between the name and the graph, and an optional “.” after the graph.
It allows for the expression of the empty graph.

An issue with the current grammar is its incompatibility with the SPARQL grammar. As Turtle has been brought together with SPARQL as a result of the WG resolution of ISSUE-1, a similar argument can be made for the TriG grammar: ensure, as much as possible, compatibility with SPARQL. The corresponding alternative syntax may therefore be:

[1g]	trigDoc	::=	statement*
[2g]	statement	::=	directive "." | triples | namedGraph | wrappedDefault
[3g]	namedGraph	::=	"GRAPH"? iri "="? "{" triples "}" "."? | "{" "}" "."?
[4g]	wrappedDefault	::=	"{" triples "}" "."? | "{" "}" "."?

This syntax:

Permits the usage of the GRAPH keyword preceding the graph name
Permits the default graph to be expressed without curly braces
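
The practical difference between the two grammars can be seen by serializing the same dataset both ways. The sketch below is a toy writer, not a conforming TriG implementation: the function names `block` and `to_trig` are hypothetical, terms are assumed to be already in Turtle syntax, each triple is written with its own terminating “.”, and an empty named graph is written as a name followed by “{ }”, which is one reading of rule [3g].

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def block(graph: Graph) -> str:
    """Render a graph as a brace-delimited block; empty graphs become "{ }"."""
    if not graph:
        return "{ }"
    body = " ".join(f"{s} {p} {o} ." for s, p, o in sorted(graph))
    return "{ " + body + " }"


def to_trig(default: Graph, named: Dict[str, Graph], sparql_style: bool = False) -> str:
    lines = []
    if sparql_style:
        # Alternative grammar: default-graph triples may appear without braces.
        lines.extend(f"{s} {p} {o} ." for s, p, o in sorted(default))
    else:
        # Current grammar: the default graph is always wrapped in braces.
        lines.append(block(default))
    keyword = "GRAPH " if sparql_style else ""
    for name in sorted(named):
        lines.append(f"{keyword}<{name}> {block(named[name])}")
    return "\n".join(lines)


default = frozenset(
    {("<http://example.org/a>", "<http://example.org/b>", "<http://example.org/c>")}
)
named = {"http://example.org/g1": frozenset()}

current = to_trig(default, named)
sparql = to_trig(default, named, sparql_style=True)
```

With this input, `current` wraps the single default-graph triple in braces, while `sparql` writes it bare and prefixes the named graph with GRAPH. The optional “=” is not emitted in either mode.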

Note that the usage of the “=” remains a possible source of incompatibility, but maintaining it ensures that deployed TriG content remains valid. (It is unclear how widely that particular idiom is used, i.e., how much deployed material would be broken if it were removed from the grammar.)

The Working Group has to make a decision on whether the SPARQL compatible syntax should be chosen over the current TriG syntax, and whether the usage of the "=" character should remain in case the SPARQL compatible syntax is chosen.

The current syntax allows for an empty graph to be expressed in TriG. That detail has to be reinforced or invalidated by a WG resolution.

Should we call this something other than TriG, since it’s a bit different? Also, to avoid confusion, it may be useful to refer to this language explicitly as an extension to Turtle. Qurtle? Mugr (multi-graph-rdf)? Turtle2? Turtle Full?

Are blank node labels scoped to the document, the curly-brace expression, or the graph name? Assuming document-scope for now. This is Issue-21.

If TriG is to be published as a document by the RDF Working Group, the Working Group should register a media type for TriG that is different from the media type of Turtle.

Several possible extensions to the TriG syntax were considered, but rejected because they would break compatibility with both SPARQL 1.1 and deployed TriG content. Some of these are:

Add a special symbol (e.g., DEFAULT) to be used in the naming production, to specify the Default Graph.
Allow for several (comma separated) IRIs in the naming production (something like [GRAPH] g1, g2, DEFAULT { ... }), meaning that triples are added to each corresponding named graph.
Can we allow people to re-use the subject, like:
g1 { ... }; :lastModified ....

JSON-LD

JSON-LD already has a syntax for datasets; this section is just a placeholder for further synchronization between the current JSON-LD terminology and the RDF Working Group's evolving notions.

RDF/XML

There are no plans to extend the RDF/XML syntax to include named graphs.

N-Quads

This document takes no position on syntactical changes to N-Quads, or on whether N-Quads should be standardized separately or published as a WG Note. This has to be decided by the Working Group.