Supporting Change Propagation in RDF

In this submission we explore existing work around the issue of propagating changes to an RDF graph. We use this to identify two elements that would facilitate solutions to this class of problems. By establishing community agreement around these elements, rather than directly suggesting an infrastructure focused on solving the specific use case, we hope that the elements can be reused for other problems and, through widespread deployed functionality of RDF toolkits, reduce the need to redevelop such basic building blocks.

The common elements are: standardized formats for a collection of graphs and an approach to referring to blank nodes.

Use cases

Mirroring of a Graph

By recording changes to an RDF graph, a number of tasks can be achieved such as replicating a graph, monitoring changes or creating an up-to-date merge of several data sources. Graph data may be sufficiently large that replication of heavily used information or it may be sufficiently important that only relying on a remote copy is judged too high a risk, graph replication for caching is important.

Editing a Graph

Another task is that of remote editing of an RDF graph. Here, changes are sent to the graph. The focus is on being about to identify locations in the graph to change, anchoring the removal and additions to existing RDF terms in the graph.

Existing Approaches

In this section we review some work in the area that is directly related to RDF. There is a large body of work relating to diffs and synchronization of other data types that we do not review here.

Delta

In Delta [1], the authors considers the problem of updating or monitoring changes to a graph. After analyzing the case of using the serialized of the graph and considering textual diffs [2], the authors note that this can be unstable in the presence of small changes because of potentially large changes in the serialized form. They propose an ontology that handles a change to a graph at the triple level captured as a set of triples to be deleted and a set of triples to be added. They use the N3 [] extension for graph literals to record in a single document.

They also categorize changes as weak diffs, that must be applied to the same graph and strong diffs that carry sufficient information that they model the change of the information in an context-independent manner. Blank nodes are only considered where identifiable by functional or inverse functional properties.

ChangeSets

The approach taken by ChangeSets [3] is resource-centric. A ChangeSet refers to an resource being changed and all the triples should refer to the same subject. A change set is an RDF graph with data about the change and two multivalue properties to link to the statements to remove and statements to add. Reification is used to record the triples to be removed and added.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="http://example.com/changesets#change">
    <cs:subjectOfChange rdf:resource="http://example.com/res#thing"/>
    <cs:createdDate>2006-01-01T00:00:00Z</cs:createdDate>
    <cs:creatorName>Anne Onymous</cs:creatorName>
    <cs:changeReason>Change of title</cs:changeReason>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>Original Title</rdf:object>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>New Title</rdf:object>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

In addition, the triples to be removed must all be present in the target graph before the application of the Changset starts. This acts as a precondition for the change operation. Change set are applied by POSTing to a URL of the graph with content of MIME type application/vnd.talis.changeset+xml . Blank nodes are not supported.

RDFSync

In the work on RDFSync [4], the authors describe a mechanism inspired by rsync [ref]. The graph to be synchronized is decomposed into units (bounded descriptions), and the hash of each unit calculated. Units with different hashes are candidates for transfer and the receiver decides which units to transfer, allowing partial graph replication. The scheme is based on a custom protocol use of HTTP so both graph publisher and data consumer must provide the necessary machinery.

RDF Difference Models

In RDF Difference Models the author comes to a similar conclusion that a set of "forward difference statements" and a set of "reverse difference statements" are needed but also adds preconditions in the form of statements to be in the two graphs before the diff is applied. Precondition statements can be used as locks or version management.

They introduce a new parse type Statements to collect a number of triples together to make the object of a property (c.f. N3 graph literals and a container of reified statements). This is a collection of reified statements.

SemVersion

As part of the work SemVersion [6], there is a design for a diff format. It also uses reification to encode the addition and removal graphs into a single document. In order to handle blank nodes, a URI identifier is allocated so that multiple RDF diffs can be applied, referencing the same blank node. Like other systems, it is regarding the RDF graph as an abstract syntactic form where addressing terms in the graph makes sense.

Designing for Change

A different approach is taken [7] with the work of the UK Government. The data model has change and versioning built-in. URI schemes include the versioned URI as well as an unversioned URI but also the units of change are each a single named graph. It is not possible to change triples within a graph, only replace a named graph with another version. Each named graph records it's own provenance information, links what it replaces, and links to the unversioned URI. A set of changes is then a set of named graphs. The overall dataset is the RDF merge of the individual graphs together with rules for which information is authoritive and which is supplemental.

Requirements

These systems use collections of triples to represent change. A frequent pattern is to the have two collections of triples , one for the triples to remove, the other the collection of triples to add. There are several proposals for ways to combine the two collections into a single transferrable unit.

It is also desirable for the changes to be represented in a single file and to add information about the change itself into the same unit. There are a number of ways that can be considered, some of which have been tried in the systems surveyed above.

Blank nodes are either dealt with identifying information, which may be introduced in order to identify them.

Graph Packaging

Reification

Reification can be used to record information about statements without the statement referred to being "true" in the graph. It can form the basis of provenance. It has only found partial acceptance as a mechanism which we suggest is based on a number of factors.

Firstly, it is verbose, leading to more RDF triples in the graph than statements being described. Secondly the unit of reification is the statement yet the it is the graph that is the unit of information exchange on the web. Thirdly, to collect a number of reified statements together, the application must use some grouping mechanism, either repeated properties, RDF collections or RDF containers. Different choices lead to interoperability problems so there is no common representation of a quoted graph by using reification.

Graph literals

N3 extends RDF with graph literals. A graph literal, or formula, is an RDF term that has the value of an RDF graph. It's value is defined by it's contents. Formulae can be nested to arbitrary depth and provide a more compact reified graph that statement-level reification in standard RDF. Being an extension to RDF, not widely supported outside of the community of N3-like engines, there is a barrier to adoption of this approach based on the need for RDF toolkits to be upgraded.

SPARQL Update

One possibility is the upcoming SPARQL update language. The SPARQL working group is considering two update mechanisms, one that is concerned with the manipulation of whole graphs via the usual HTTP verbs, "SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs" and "SPARQL 1.1 Update" (SPARQL Update), a language that allows for manipulation of the content of graphs at the triple level.

As a language, SPARQL Update, has a syntactic form that can be used record the changes to a graph, or several graphs, in an RDF dataset.

One feature of this approach is that it is leveraging a format that is likely to be widely deployed as part of the SPARQL standardization work but it lacks the ability to carry metadata about the update itself because the SPARQL Update request does not contain RDF data about itself. Another issue with this approach is that the script may contain other update actions, including deleting data without explicitly giving it as triples, as shown above.

RDF Dataset

An RDF Dataset is usually encountered as the source of data for an RDF query. The concept arose out of the observation that a significant number of RDF storage systems at the time were already storing multiple graphs in one store. Using the work of Named Graphs, Provenance and Trust [8], Data Access Working Group codifed the practice into a lightweight framework covering a number of usage patterns. An RDF dataset is a default graph, which is always present, and zero or more named graphs. Graphs are named with URIs. The named graphs are accessed in a SPARQL query using the GRAPH keyword:

The current SPARQL-WG is taking this to forward with an update protocol that exposes the named graphs in a store on the web, making the individual graph within a dataset accessible via HTTP.

An RDF dataset need not be simply a storage concept. Systems already use "quad" formats to dump and load data from a number of graphs together. Formats include Trix, TriG and N-Quads, each allowing a single document to contain the information for an RDF dataset. These are not standards but are finding a de facto role.

There is no defined requirement for the default graph to have a specific relationship to the named graphs and while one common pattern is that the default graph if the union of the named graphs, it is not the only possibility. One possible pattern of use for an RDF dataset is for the default graph to contain details about the named graphs and so could be used for our graph difference example by stating which of the named graphs are to be taken as information to be removed from the target and which are to be taken as information to add to the target. Unlike a SPARQL Update request, information about the change itself (time, reason, version) can included in the maifest part of the document. Unlike SPARQL update, the changes are more limited in scope and subject more limited processing, reducing the security risks.

Blank Nodes

Some changes to a graph are made in order to complete the information. As the editor use case shows, sometimes the changes to the graph are at the syntactic level and so we need a mechanism to deal with changes involving blank nodes at this syntactic level, making such use clearly separate from the semantic information. This need only work when a blank node is identified in the context of the graph where it originated syntactically.

One pragmatic approach already adopted is to use a pseudo URIs of the form <_:label>, or some other URI scheme name. The advantage of the approach is that existing parsers and tool chains continue to work, resulting in a solution that is an agreement between sender and received of the information, tunnelled through existing infrastructure.

Summary

A common theme is the transmission as a single unit of a number of related graphs. We have looked at a number of existing and upcoming technologies for this and the named graph approach is the most flexible for fixed data, while SPARQL Update gives more detailed control of the changes to existing points in the graph.

Blank nodes remain a difficult area. Sometimes, in order to make changes to a graph, it is necessary to refer to blank nodes at a syntactic level, for example when editing to add the necessary identifying triples. We suggest accepting that for detailed manipulation, pragmatic identity is necessary and a mechanism to enable this is codified. This should not be a substitute for proper blank nodes. This needs to be tied to scoping label issues in multi-graph serializations.

References

[2] A file comparison program. Webb Miller, Eugene W. Myers; Software: Practice and Experience 15 (1985) 1025�1040

[8] Named graphs, provenance and trust Jeremy J. Carroll, Christian Bizer, Pat Hayes, Patrick Stickler
WWW '05: Proceedings of the 14th international conference on World Wide Web (2005), pp. 613-622.