Yet Another Dataset Proposal

From RDF Working Group Wiki
Revision as of 14:08, 1 June 2012 by Rcygania2

Informative Introduction

When working with RDF, it is often desirable to have a data structure that can comprise multiple RDF graphs. This is especially true when dealing with data on the web, where data may come from multiple sources with different levels of information quality, and it is important to keep track of what came from where. RDF datasets are a mechanism for achieving this.

IRIs in RDF graphs denote resources. In RDF datasets, IRIs can also have an additional associated RDF graph, called the state of the denoted resource.

RDF does not constrain what kind of resources can have state, or what exactly might be the state of any given resource. Given an IRI, there are no formal restrictions on the relationship that must hold between its denoted resource and the associated RDF graph.

Note: Although there is no formal restriction on the relationship between resources and state graphs, there are certainly good practices. For IRIs that dereference to a representation, if the representation can be parsed as RDF, it is considered good practice to treat the resulting RDF graph as the state of the resource. To promote interoperability, this interpretation should generally be followed when RDF datasets are used on the web.

Example: In the following example RDF dataset, the resources denoted by :g1 and :g2 are said to be valid in different years. Furthermore, both resources have associated state graphs. The graph for :g1 claims that Joe works for ACME, while the graph for :g2 claims that Joe works for Google.

{
    :g1 :valid "2008"^^xsd:gYear.
    :g2 :valid "2009"^^xsd:gYear.
}
:g1 {
    :Joe :worksFor :ACME_Inc.
}
:g2 {
    :Joe :worksFor :Google_Inc.
}

Abstract Syntax

An RDF dataset is a collection of RDF graphs and comprises:

  • Exactly one default graph, being an RDF graph. The default graph does not have a name and may be empty.
  • Zero or more state pairs. Each state pair consists of an IRI (called the graph IRI of the pair) and an RDF graph (called the named graph of the pair). There can be at most one state pair for any given IRI within an RDF dataset. The named graph is said to be the state of the resource denoted by the graph IRI.

An RDF dataset is a pure mathematical structure, with no identity apart from its contents. Two RDF datasets with the same contents are in fact the same single RDF dataset, and an RDF dataset cannot change over time.

An RDF dataspace is a structure that can change over time, and its state at any given time is an RDF dataset. It can be thought of as a mutable RDF dataset.
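
The distinction between the immutable dataset and the mutable dataspace can be illustrated with a minimal Python sketch. The class names `Dataset` and `Dataspace`, the methods `state_of`, `snapshot` and `put`, and the encoding of triples as tuples of strings are assumptions made for illustration only; they are not part of this proposal.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object), terms as plain strings
Graph = FrozenSet[Triple]       # an RDF graph as an immutable set of triples


@dataclass(frozen=True)
class Dataset:
    """Immutable value: two datasets with equal contents compare equal."""
    default_graph: Graph = frozenset()
    state_pairs: FrozenSet[Tuple[str, Graph]] = frozenset()

    def __post_init__(self) -> None:
        # Enforce "only one state pair for any given IRI".
        iris = [iri for iri, _ in self.state_pairs]
        if len(iris) != len(set(iris)):
            raise ValueError("at most one state pair per graph IRI")

    def state_of(self, iri: str) -> Optional[Graph]:
        """Return the named graph paired with iri, or None if there is none."""
        for name, graph in self.state_pairs:
            if name == iri:
                return graph
        return None


class Dataspace:
    """Mutable container whose state at any given time is a Dataset."""

    def __init__(self, initial: Dataset = Dataset()) -> None:
        self._state = initial

    def snapshot(self) -> Dataset:
        return self._state

    def put(self, iri: str, graph: Graph) -> None:
        # Replace any existing pair for iri, then swap in a new Dataset value.
        kept = {(n, g) for n, g in self._state.state_pairs if n != iri}
        kept.add((iri, frozenset(graph)))
        self._state = Dataset(self._state.default_graph, frozenset(kept))
```

In this sketch a change to the dataspace never modifies a dataset; it only replaces the dataspace's current state with a different dataset value, so a snapshot taken earlier remains unchanged.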

Note: Not all implementations keep track of empty named graphs in RDF datasets and RDF dataspaces. Therefore, to maximize interoperability, users and applications should not ascribe significance to the distinction between the presence of a state pair with empty RDF graph and the absence of said state pair.
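
One way an application might follow this advice is to compare datasets only after discarding state pairs whose graph is empty. The following Python sketch assumes named graphs are held in a dictionary from graph IRI to a set of triples; the function names `drop_empty` and `interoperably_equal` are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
NamedGraphs = Dict[str, Graph]   # graph IRI -> named graph


def drop_empty(named: NamedGraphs) -> NamedGraphs:
    """Discard state pairs whose named graph is empty."""
    return {iri: graph for iri, graph in named.items() if graph}


def interoperably_equal(default_a: Graph, named_a: NamedGraphs,
                        default_b: Graph, named_b: NamedGraphs) -> bool:
    """Compare two datasets without ascribing significance to empty named graphs."""
    return default_a == default_b and drop_empty(named_a) == drop_empty(named_b)
```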

Semantics of RDF datasets: Overview

The semantics of RDF datasets are defined as an extension to the semantics of RDF graphs. DS-interpretations extend the concept of simple interpretations by assigning truth values not just to RDF graphs, but also to state pairs and RDF datasets. The semantics introduce a new property rdf:entails that holds between resources A and B if the state of A entails the state of B.

Some inferences: The most interesting inferences that can be drawn with this semantics are:

  1. Two datasets are inconsistent if they assign different states to the same graph IRI.
  2. For any graph G2 entailed by some state graph G1, we can infer a new state pair of the form <skolemIRI,G2>. (If blank nodes were allowed as graph names, then the state pair would associate a fresh blank node with G2. Since they are not allowed, we use a skolem IRI in its place.)
  3. If one graph in the dataset entails another, we can infer an rdf:entails statement between their associated resources (e.g., G1 rdf:entails G2).
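
The three inferences above can be sketched in Python under two simplifying assumptions: all graphs are ground (no blank nodes), so that simple entailment reduces to the subset test, and graph IRIs denote distinct resources. The skolem base IRI is a made-up example, the IRI used for rdf:entails is simply the property proposed here written out in the RDF namespace, and all function names are hypothetical.

```python
import uuid
from typing import Dict, FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
NamedGraphs = Dict[str, Graph]   # graph IRI -> state graph

# The property proposed in this document, not an established RDF term.
RDF_ENTAILS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#entails"
# Hypothetical namespace for skolem IRIs.
SKOLEM_BASE = "http://example.org/.well-known/genid/"


def entails(g1: Graph, g2: Graph) -> bool:
    """Simple entailment for ground graphs only: g1 entails g2 iff g2 is a subset of g1."""
    return g2 <= g1


def inconsistent(a: NamedGraphs, b: NamedGraphs) -> bool:
    """Inference 1: the datasets assign different states to the same graph IRI."""
    return any(iri in b and b[iri] != graph for iri, graph in a.items())


def skolem_state_pair(g2: Graph) -> Tuple[str, Graph]:
    """Inference 2: pair an entailed graph with a freshly minted skolem IRI."""
    return (SKOLEM_BASE + uuid.uuid4().hex, g2)


def entails_statements(named: NamedGraphs) -> Set[Triple]:
    """Inference 3: rdf:entails triples between resources whose states entail each other."""
    return {(a, RDF_ENTAILS, b)
            for a, ga in named.items()
            for b, gb in named.items()
            if entails(ga, gb)}
```

Because every graph entails itself, this sketch also yields a reflexive rdf:entails triple for every graph IRI that has a state.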

Semantics of RDF datasets: Formal definition

Given an entailment regime E, a DS+E-interpretation is an E-interpretation extended with:

  • a state relationship S, which is a set of pairs <x,y> where x is in IR and y is an RDF graph, and where no x appears in more than one pair.

A DS+E-interpretation of vocabulary (V union {rdf:entails}) must satisfy the following semantic conditions:

  1. If SP is a state pair <i,G> then I(SP) = true if <I(i),G> is in S, and false otherwise.
  2. If DS is an RDF dataset, then I(DS) =
    • false if I(DG) is false for the default graph DG of DS
    • false if I(SP) is false for any state pair SP in DS
    • true otherwise.
  3. If <x,G1> is in S and G1 E-entails G2, then there exists a y such that <y,G2> is in S.
  4. <x,y> is in IEXT(I(rdf:entails)) if and only if <x,G1> is in S and <y,G2> is in S and G1 E-entails G2.

The first and second conditions assign truth values to state pairs and RDF datasets, respectively. The third condition ensures that for every graph G2 entailed by some state graph G1, a resource with state G2 actually exists. That resource may not have a name, but we could refer to it via a blank node or via a skolem IRI. The fourth condition fixes the extension of rdf:entails: it holds between two resources exactly when both have a state and the state of the first E-entails the state of the second.
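
As a rough illustration of the first two conditions, the following Python sketch evaluates state pairs and datasets against a given interpretation. It assumes the interpretation is supplied as a dictionary `denote` from IRIs to resources (stand-in strings here) and the state relationship S as a dictionary from resources to graphs. The truth value of the default graph is passed in as a boolean, since computing it belongs to the underlying E-interpretation. All names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def state_pair_true(denote: Dict[str, str], S: Dict[str, Graph],
                    iri: str, graph: Graph) -> bool:
    """Condition 1: I(<i,G>) is true iff <I(i),G> is in S."""
    # S.get returns None for a resource without state, which never equals a graph.
    return S.get(denote[iri]) == graph


def dataset_true(denote: Dict[str, str], S: Dict[str, Graph],
                 default_graph_true: bool,
                 state_pairs: Iterable[Tuple[str, Graph]]) -> bool:
    """Condition 2: false if the default graph or any state pair is false, true otherwise."""
    if not default_graph_true:
        return False
    return all(state_pair_true(denote, S, iri, graph) for iri, graph in state_pairs)
```

Representing S as a dictionary builds in the requirement that no x appears twice: each resource has at most one state.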

Note: The semantics leaves the meaning of the named graphs themselves entirely isolated from the other contents of the RDF dataset; statements made in the default graph or in another named graph do not affect the interpretation of a named graph. Semantic extensions can impose additional conditions if desired.

Semantics of web datasets

Here we will capture the intuitive meaning of IRIs and dereferencing on the Web by defining additional semantic conditions on DS+E-interpretations.

A WDS+E-interpretation is a DS+E-interpretation that must satisfy the following additional semantic conditions:

  1. If x is a dereferenceable IRI that is not a skolem IRI, then <I(x),G> is in S if and only if x dereferences to a representation that can be parsed as an RDF graph G.
  2. E is itself a WDS+E-interpretation with the same state relationship.

The first condition requires that the state relationship is the dereference+parse relationship. We exclude skolem IRIs so that they can be used as “throw-away” identifiers for inferred graphs without requiring them to be dereferenceable.
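
A check of this first condition might be sketched as follows. To keep the example free of network access, the outcome of dereferencing is assumed to be given as a dictionary `web`: a dereferenceable IRI maps to the parsed graph, or to None when its representation cannot be parsed as RDF, and IRIs absent from `web` count as not dereferenceable. Recognizing skolem IRIs by a `/.well-known/genid/` path segment is likewise an assumption of this sketch, as are all function names.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def is_skolem(iri: str) -> bool:
    """Assumed convention for this sketch: skolem IRIs contain a genid path segment."""
    return "/.well-known/genid/" in iri


def satisfies_web_condition(denote: Dict[str, str], S: Dict[str, Graph],
                            web: Dict[str, Optional[Graph]]) -> bool:
    """For every dereferenceable, non-skolem IRI, its state must be exactly the parsed graph."""
    for iri, parsed in web.items():
        if is_skolem(iri):
            continue  # skolem IRIs are exempt from the condition
        # Both sides are None when the resource has no state and nothing parses as RDF.
        if S.get(denote[iri]) != parsed:
            return False
    return True
```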

The second condition is a way of forcing all dereference+parseable IRIs across the default graph and all named graphs to have the same state, as all will be using the same state relationship. In practical terms this means that any rdf:entails triple that holds in the default graph will also hold in any of the named graphs.

@@ Problem! The formalism doesn't quite work here. The intention is that an IRI mentioned anywhere in the dataset—in the default graph, as a graph name, or in a named graph—have the same associated state (even though the formalism doesn't guarantee that they denote the same thing, which is incidental but perhaps a good thing). But the definition above is recursive and therefore doesn't work.


Example 1: The following two datasets are DS-inconsistent. S assigns each resource to only a single state graph. There can be no DS-interpretation whose S assigns I(:g1) to both given graphs.

:g1 { :x a :y. }
:g1 { :x a :y,:z. }

Example 2: The first dataset DS-entails the second. Let skolem: be a prefix that expands to a namespace of skolem IRIs.

:g1 { :x a :y,:z. }
skolem:g2 { :x a :z. }

Example 3: With RDFS entailment as the regime E, the first dataset (the first line below) DS-entails the second (the remaining three lines). Let skolem: be a prefix that expands to a namespace of skolem IRIs.

:g1 { :x a :y. :y rdfs:subClassOf :z. }
{ :g1 rdf:entails skolem:g2. }
:g1 { :x a :y. :y rdfs:subClassOf :z. }
skolem:g2 { :x a :z. }
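
The entailment behind Example 3 can be checked with a small Python sketch. It implements only the RDFS subclass rule (from x a C and C rdfs:subClassOf D, infer x a D) on ground graphs, so it is a sound but incomplete stand-in for RDFS entailment. Terms are written as prefixed names in plain strings, and the function names are hypothetical.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def subclass_closure(graph: Graph) -> Graph:
    """Repeatedly apply the subclass rule until no new rdf:type triples appear."""
    closed = set(graph)
    changed = True
    while changed:
        new = {
            (x, RDF_TYPE, sup)
            for (x, p, c) in closed if p == RDF_TYPE
            for (sub, q, sup) in closed if q == SUBCLASS_OF and sub == c
        }
        changed = not (new <= closed)
        closed |= new
    return frozenset(closed)


def fragment_entails(g1: Graph, g2: Graph) -> bool:
    """g1 entails g2 in this fragment iff every triple of g2 is in the closure of g1."""
    return g2 <= subclass_closure(g1)


# The two state graphs of Example 3.
g1 = frozenset({(":x", RDF_TYPE, ":y"), (":y", SUBCLASS_OF, ":z")})
g2 = frozenset({(":x", RDF_TYPE, ":z")})
```

Here `fragment_entails(g1, g2)` holds while the converse does not, which matches the single rdf:entails triple from :g1 to skolem:g2 in the second dataset.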