Layers

From RDF Working Group Wiki

This is pretty much obsolete, replaced by RDF Data Layers and Datasets.

This is a proposal for how to address our charter items for "multiple graphs and graph stores". It is (optimistically) written something like a spec (or part of a spec). The design expressed here is basically the same as 6.3, but it has been turned inside out (focusing on Layers more than Datasets) and uses the term "layer" instead of "graph resource" or "g-box" (with much thanks to Dan Brickley's email). -- Sandro

RDF Layers and Datasets

1 Introduction

An RDF Layer is a mutable set of RDF Triples with its own identity. This simple concept is an important building block for the Semantic Web. RDF Layers are what some people have called Named Graphs, but they are different from both RDF Graphs and Named Graphs as those terms are defined in W3C specifications.

The difference between RDF Layers and RDF Graphs is that while they are both sets of triples, RDF Layers can conceptually change over time and have an identity separate from their contents. RDF Graphs, in contrast, are pure mathematical sets, so it makes no sense to talk about them "changing", and two graphs which contain the same triples are by definition the same graph. RDF Layers are different; they keep their own identity even when they happen to contain exactly the same triples. This distinction is important when there is data (or metadata) about the set of triples; if the metadata is about a set with its own identity, an RDF Layer should be used instead of a Graph.
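
The distinction can be illustrated with a small Python sketch. The Layer class and the tuple representation of triples are assumptions made for illustration; they are not part of any RDF specification or library.

```python
# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = tuple[str, str, str]


class Layer:
    """Hypothetical model of an RDF Layer: a mutable triple set with identity."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.triples: set[Triple] = set()

    def snapshot(self) -> frozenset[Triple]:
        """Return the RDF Graph (an immutable set) currently on this layer."""
        return frozenset(self.triples)


t = ("<a>", "<b>", "<c>")

# Two graphs with the same triples are the same graph.
graph1 = frozenset({t})
graph2 = frozenset({t})
same_graph = graph1 == graph2

# Two layers holding the same triples are still distinct layers.
layer_d = Layer("<d>")
layer_e = Layer("<e>")
layer_d.triples.add(t)
layer_e.triples.add(t)
same_contents = layer_d.snapshot() == layer_e.snapshot()
same_layer = layer_d == layer_e  # default object equality compares identity

# A layer can change over time; an earlier snapshot of it cannot.
before = layer_d.snapshot()
layer_d.triples.add(("<e>", "<f>", "<g>"))
after = layer_d.snapshot()
```

Here same_graph and same_contents are true while same_layer is false, and before differs from after even though both were taken from the same layer.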

The name "layer" is chosen to evoke the image of a transparent surface on which an RDF Graph can be temporarily drawn and then easily aligned with other layers to make a larger, combined graph. Sometimes we work with the combined graph, but when we want to, we can still separate the layers. This is particularly important if the data come from different sources which might potentially update their contribution; we need to be able to erase and redraw just their layer.

( Dan Brickley's pictures: http://www.flickr.com/photos/danbri/3472944745/ and http://farm4.static.flickr.com/3613/3384528143_8304792836_b.jpg )

A technical feature of layers is that they can potentially share RDF blank nodes. This allows a graph using blank nodes to be separated into layers, as illustrated in Example 2. In our image of layers as transparent sheets, blank nodes on different layers are the same if they lie in the same spot, one on top of another.

1.1 Datasets

SPARQL defines a Dataset as a structure consisting of a "default" graph and a set of (name, graph) pairs. We now give this structure declarative semantics, allowing datasets to be used as logical statements, like RDF Graphs. We define a dataset as being true if and only if (1) its default graph is true, and (2) for every (name, graph) pair in the dataset, the layer denoted by name contains every triple in graph.
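
As a rough sketch of this truth condition in Python: the function below checks both clauses against an in-memory picture of the world. The names dataset_is_true, known_facts and layers are hypothetical, and clause (1) is only approximated, since real RDF truth is defined over interpretations rather than a list of facts.

```python
def dataset_is_true(default_graph, named_graphs, known_facts, layers):
    """Check the two conditions of the dataset truth definition.

    ASSUMPTION: "the default graph is true" is approximated by requiring
    every default-graph triple to be among `known_facts`.
    """
    # (1) the default graph is true
    default_ok = default_graph <= known_facts
    # (2) each named layer contains every triple of the paired graph
    layers_ok = all(
        graph <= layers.get(name, frozenset())
        for name, graph in named_graphs.items()
    )
    return default_ok and layers_ok


abc = ("<a>", "<b>", "<c>")
efg = ("<e>", "<f>", "<g>")
hij = ("<h>", "<i>", "<j>")

# The actual state of the world: layer <d> holds two triples.
layers = {"<d>": {abc, efg}}

# <d> contains <a> <b> <c>; it may also contain other triples.
claim_1 = dataset_is_true(frozenset(), {"<d>": frozenset({abc})}, set(), layers)
# <d> does not contain <h> <i> <j>.
claim_2 = dataset_is_true(frozenset(), {"<d>": frozenset({hij})}, set(), layers)
```

Note that the check is containment, not equality: a true dataset may mention only some of the triples in a layer.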

These dataset semantics allow SPARQL — which is defined in terms of datasets — to be used for communication with a database of information about layers. In SPARQL 1.1 Update, the slots in a Graph Store are each a layer.

With these semantics, we can use a dataset syntax to convey both general information (as with RDF) and information about triples being in layers. In this document, we specify two languages for serializing datasets, N-Quads and TriG. They are briefly introduced here, used in the examples, and fully defined in section @@@.


1.1.1 N-Quads

N-Quads is a simple extension to N-Triples, where a fourth term is optionally added at the end of each line after the subject, predicate, and object. In the original N-Quads specification, this fourth field is called the "context", but we now define it to be the identifier of the layer in which that triple is being declared to occur. For example, the N-Quads line:

  <a> <b> <c> <d>.

is now defined to mean that the triple <a> <b> <c> is in layer <d>. This does not preclude:

  <a> <b> <c> <e>.

which means that the triple <a> <b> <c> is also in the layer <e>. N-Quads lines which do not include the layer term are considered to be in an anonymous layer, called the "default layer", and the graph on that layer is called the "default graph".
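
A minimal sketch of this reading of N-Quads, in Python. It assumes every term is an IRI in angle brackets (real N-Quads also allows literals and blank nodes), and the names read_nquads and DEFAULT_LAYER are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: terms are IRIs in angle brackets only; literals and blank
# nodes, which real N-Quads allows, are left out of this sketch.
LINE = re.compile(
    r"^\s*(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>)(?:\s+(<[^>]*>))?\s*\.\s*$"
)

DEFAULT_LAYER = None  # hypothetical key for the anonymous default layer


def read_nquads(text):
    """Group the triple on each line under the layer named by its fourth term."""
    layers = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        s, p, o, layer = match.groups()
        # A missing fourth term places the triple in the default layer.
        layers[layer if layer is not None else DEFAULT_LAYER].add((s, p, o))
    return dict(layers)


layers = read_nquads("""
<a> <b> <c> <d>.
<a> <b> <c> <e>.
<h> <i> <j>.
""")
```

After this runs, the same triple appears in both layer <d> and layer <e>, and the third line lands in the default layer.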

1.1.2 TriG

The syntax of TriG is similar to Turtle, except that triples can be grouped with curly braces, and those groups given a label. For example:

  <d> { <a> <b> <c>.  <e> <f> <g> }

is now defined to mean that the triples <a> <b> <c> and <e> <f> <g> are in layer <d>. This expression does not preclude the possibility of those triples also being in other layers, and other triples being in layer <d>. Later in the same TriG file, or in a different TriG file, we might encounter

  <d> { <h> <i> <j> }

which tells us that a third triple, <h> <i> <j>, is also in layer <d>.

TriG and N-Quads are interchangeable, equivalently expressive ways to convey the same information, like Turtle and N-Triples. In most situations, Turtle and TriG documents are shorter and easier for people to read and write, while N-Triples and N-Quads documents are easier for programs to parse.
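
The interchangeability can be sketched in Python as a pair of conversions between flat quads and labelled groups. This assumes terms are already serialized strings and ignores prefixes and abbreviations; quads_to_trig and trig_groups_to_quads are hypothetical helper names.

```python
from collections import defaultdict


def quads_to_trig(quads):
    """Render (s, p, o, layer) quads as TriG-style labelled groups.

    ASSUMPTION: terms are already serialized strings such as "<a>", and a
    layer of None stands for the default graph.
    """
    groups = defaultdict(list)
    for s, p, o, layer in quads:
        groups[layer].append(f"{s} {p} {o}.")
    blocks = []
    for layer, lines in groups.items():
        label = f"{layer} " if layer is not None else ""
        blocks.append(f"{label}{{ {' '.join(lines)} }}")
    return "\n".join(blocks)


def trig_groups_to_quads(groups):
    """Flatten {layer: [(s, p, o), ...]} back into a set of quads."""
    return {
        (s, p, o, layer)
        for layer, triples in groups.items()
        for s, p, o in triples
    }


quads = [("<a>", "<b>", "<c>", "<d>"), ("<e>", "<f>", "<g>", "<d>")]
trig_text = quads_to_trig(quads)
```

Going from groups back to quads loses nothing, which is the sense in which the two syntaxes carry the same information.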

2 Examples

2.1 Federated Phonebook

As a first example of how to use layers, consider an organization which has 25 different divisions around the world, each with its own system for managing the list of its employees and their phone numbers. The parent organization wants to create a unified "HQ" directory with all this information. With their HQ directory, they will be able to look up an employee's phone number without first knowing the employee's division.

They decide to use RDF layers. Each division is asked to publish its phonebook on an internal website, in a W3C-Recommended RDF syntax, using the vcard-rdf vocabulary. Each division submits the URL at which this file will appear. For example, the uswest division might publish the RDF version of its phonebook at http://uswest.internal.example.com/employees.rdf and the Japan division might publish theirs at http://ja.example.com/hr/data/export371. The URL itself doesn't matter, but the division must be able to maintain the content served there and HQ must be able to easily fetch the content.

The HQ staff assembles this list of 25 feed URLs and puts them into the default graph of a SPARQL database, so the database looks like this:

   @prefix hq: <http://example.com/hq-vocab#>.
   # default graph
   {
      hq:parentCo hq:division hq:div1, hq:div2, hq:div3, ...
      <http://uswest.internal.example.com/employees.rdf> 
         hq:feedFrom hq:div1.
      <http://ja.example.com/hr/data/export371>
         hq:feedFrom hq:div2.
      ...
   }

Then they write a simple Web client which looks in the database for those feed URLs, dereferences them, and parses the RDF. It then puts the parse-result into the database in a layer whose name is the same as the name of the feed. This makes sense, because in this deployment each feed is considered to be a layer; the name of the feed is the same as the name of the layer. The HQ client is copying data about the layer from the division databases to the HQ database, but it's still the same information about the same layers.

For performance reasons, the client is designed to use HTTP caching information. This will allow it to efficiently re-fetch the information only when it has changed. To make this work, the client will need to store the value of the "Last-Modified" HTTP header and also store (or compute, in some configurations) the value of the "Expires" header.
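
The caching bookkeeping might look something like the following Python sketch, which uses only the standard library. The function names and the one-hour fallback lifetime are assumptions for illustration; a real client would also honour Cache-Control and handle a 304 Not Modified response.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.request import Request


def conditional_request(feed_url, last_modified):
    """Build a GET that asks for the feed only if it changed since `last_modified`."""
    request = Request(feed_url)
    if last_modified is not None:
        # HTTP dates are expressed in GMT; `last_modified` must be UTC-aware.
        request.add_header(
            "If-Modified-Since", format_datetime(last_modified, usegmt=True)
        )
    return request


def cache_metadata(headers, fetched_at, default_ttl=timedelta(hours=1)):
    """Extract the two values the client stores for each layer.

    ASSUMPTION: when no Expires header is present, the expiry is computed
    as `fetched_at + default_ttl`; the one-hour default is illustrative.
    """
    last_modified = None
    if "Last-Modified" in headers:
        last_modified = parsedate_to_datetime(headers["Last-Modified"])
    if "Expires" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
    else:
        expires = fetched_at + default_ttl
    return last_modified, expires


stored = datetime(2012, 3, 14, 2, 22, 10, tzinfo=timezone.utc)
request = conditional_request(
    "http://uswest.internal.example.com/employees.rdf", stored
)
```

The request is only constructed here, not sent; passing it to urllib.request.urlopen would perform the actual refresh.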

In the end, the database looks something like this:

 @prefix hq:  <http://example.com/hq-vocab#>.
 @prefix v:   <http://www.w3.org/2006/vcard/ns#>.
 @prefix ht:  <http://example.org/http-vocab#>.
 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
 @prefix xs:  <http://www.w3.org/2001/XMLSchema#>.
 <http://uswest.internal.example.com/employees.rdf> {
    # an employee
    [ a v:VCard ;
      v:fn "John Wayne" ;
      v:email "wayne@uswest.example.com" ;
      v:tel [ a v:Work, v:Pref ;
              rdf:value "+213 555 5555" ]
    ] .
    # another employee
    ...
 }
 <http://ja.example.com/hr/data/export371> {
    # an employee
    [ a v:VCard ;
      v:fn "Toshiro Mifune" ;
      v:email "wayne@uswest.example.com" ;
      v:tel [ a v:Work, v:Pref ;
              rdf:value "+81 75 555 5555" ]
    ] .
    # another employee
    ...
 }
 ...    other divisions
 # default graph, with all our metadata
 {
   hq:parentCo hq:division hq:div1, hq:div2, hq:div3, ...
   # stuff we need to know to efficiently keep our copy in sync
   <http://uswest.internal.example.com/employees.rdf> 
     hq:feedFrom hq:div1;
     ht:last-modified "2012-03-14T02:22:10"^^xs:dateTime;
     ht:expires "2012-04-29T00:15:00"^^xs:dateTime.
   <http://ja.example.com/hr/data/export371> 
     hq:feedFrom hq:div2;
     ht:last-modified "2012-04-01T22:00:00"^^xs:dateTime;
     ht:expires "2012-04-29T00:35:00"^^xs:dateTime.
 }

The URL of each layer appears in four different roles in this example:

1. It is used as a label for a graph. Here, it says which layer the triples in that graph are in. That is, the triples about employee "John Wayne" are in the layer named "http://uswest.internal.example.com/employees.rdf". Information about what triples are in that layer originates in the master database for each division, then is copied to the slave database at HQ.

2. It is used as the subject of an hq:feedFrom triple. This information is manually maintained (or maintained through a corporate WebApp) and used to help guide the HQ fetching client. Because in this deployment we are equating layers and feeds, the name of the layer is also the URL of the feed.

3. It is used as the subject of an ht:last-modified triple. The information in this triple comes from the HTTP Last-Modified header. The meaning of this header in HTTP lines up well with its intuitive meaning here: this is the most recent time the set of triples in this layer changed. (This header can be used during a refresh, with the If-Modified-Since header, to make a client refresh operation very simple and fast if the data has not changed.)

4. It is used as the subject of an ht:expires triple. This information also comes from HTTP headers, although some computation may be needed to turn it into the absolute datetime form used here. Strictly speaking, what is expiring at that time is this copy of the information about the layer, not the layer itself. This slight conflation seems like a useful and unproblematic simplification.

Given this design, it is straightforward to write SPARQL queries to find the phone number of an employee, no matter what their division. It is also easy to find out which layer is about to expire or has expired and should be refreshed soon.
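
The two lookups can be sketched in Python over an in-memory copy of the dataset, as a stand-in for the SPARQL queries (which are not shown here). The dictionary layout, the prefixed-name strings and the function names are all assumptions made for illustration.

```python
from datetime import datetime

# ASSUMPTION: the dataset is held in memory as {layer_name: set of triples},
# with blank nodes written as "_:..." labels and the default graph under None.
FN = "v:fn"
TEL = "v:tel"
VALUE = "rdf:value"
EXPIRES = "ht:expires"


def phone_numbers(dataset, full_name):
    """Return (layer, number) pairs for every employee with the given v:fn."""
    results = []
    for layer, triples in dataset.items():
        if layer is None:
            continue  # skip the metadata in the default graph
        people = {s for s, p, o in triples if p == FN and o == full_name}
        tels = {o for s, p, o in triples if p == TEL and s in people}
        results += [(layer, o) for s, p, o in triples if p == VALUE and s in tels]
    return sorted(results)


def expired_layers(dataset, now):
    """Return the layers whose ht:expires time in the default graph has passed."""
    return sorted(
        s
        for s, p, o in dataset.get(None, set())
        if p == EXPIRES and datetime.fromisoformat(o) <= now
    )


USWEST = "http://uswest.internal.example.com/employees.rdf"
dataset = {
    USWEST: {
        ("_:e1", "v:fn", "John Wayne"),
        ("_:e1", "v:tel", "_:t1"),
        ("_:t1", "rdf:value", "+213 555 5555"),
    },
    None: {(USWEST, "ht:expires", "2012-04-29T00:15:00")},
}
found = phone_numbers(dataset, "John Wayne")
```

The caller never names a division: the search runs over every layer, and the layer name in each result says where the number came from.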

Some alternative designs:

  • Divisions could push their data, instead of waiting to be polled. That is, the divisions could be given write access to the HQ database and do SPARQL UPDATE operations to their own layers. This is simpler in some ways but may require more expertise from people in each division. It also requires trusting people in each division or having a SPARQL server that can be configured to grant certain users write access to only certain layers. This also turns HQ into more of a bottleneck and single point of failure. With the polling approach, other systems could be given the list of feed URLs and then offer an alternative combined directory, or use the same data for other purposes, without any involvement from the divisions.
  • The HQ client could fetch or query all the divisions at query time, rather than gathering the data in advance. This might use the SPARQL 1.1 Federated Query features. Which approach is superior will depend on the particulars of the situation, including how large the data is, how often it changes, and the frequency of queries which need data from different divisions. Federated Query would probably not be ideal for the situation described here, but should be considered by people building systems like this.

2.2 Storing Derived Information

The Federated Phonebook example shows several features of layers, but leaves out a few. In this example we will show the use of privately-named layers and of sharing blank nodes between layers.

The scenario is this: some divisions use only vcard:n to provide structured name information (keeping given-name and family-name separate), while others use only vcard:fn to provide a formatted-name (with both parts combined). The politics of the organization make it impractical to tell the divisions to all use vcard:n or all use vcard:fn, or both. Meanwhile, several different tools are being written to use this employee directory, including a WebApp, a command-line app, and apps for several different mobile platforms. Does each app have to be written to understand both vcard:n and vcard:fn?

HQ decides the solution is for them to run a single process which normalizes the data, making sure that every entry has both vcard:n and vcard:fn data, no matter what the division provided. The process is fairly simple; after any layer is reloaded, a program runs which looks at that layer and fills in the missing name data.

Because of the tricky politics of the situation, however, HQ decides it would be best to keep this "filled in" data separate. In some cases their program might not fill in the data properly. For example, how can a program tell from the formatted name "Hillary Rodham Clinton" that "Rodham Clinton" is the family-name? The solution is to keep the output of the program in a separate layer, so clients (and people trying to debug the system) can tell this filled-in data did not come from the division itself.

The result is a dataset like this:

 @prefix hq:  <http://example.com/hq-vocab#>.
 @prefix v:   <http://www.w3.org/2006/vcard/ns#>.
 @prefix ht:  <http://example.org/http-vocab#>.
 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
 <http://uswest.internal.example.com/employees.rdf> {
    # an employee
    _:u331 a v:VCard ;
           v:fn "John Wayne" ;
           v:email "wayne@uswest.example.com" ;
           v:tel [ a v:Work, v:Pref ;
                   rdf:value "+213 555 5555" ].
    ...
 }
 hq:namefill602 {
    _:u331 v:n [
           v:family-name "Wayne";
           v:given-name "John"
    ]
 }
 ...
 # default graph has metadata
 {
   hq:parentCo hq:division hq:div1, hq:div2, hq:div3, ...
   <http://uswest.internal.example.com/employees.rdf> 
     hq:feedFrom hq:div1;
     hq:namefillLayer hq:namefill602
  ...
 }

In serializing this, we needed to introduce a blank node label ("_:u331"), because that blank node (representing the employee) occurs in two different layers.

The example also shows the creation of a new layer name (hq:namefill602) for the layer filled in by our namefill program. We use one new layer for each feed, instead of one layer for all the output of the namefill program, so we have less work to do when a single feed layer is reloaded.
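
A namefill step along these lines might be sketched in Python as follows. Only the v:fn to v:n direction is shown, the "last word is the family name" rule is a deliberately naive assumption, and fill_names and the in-memory triple sets are hypothetical.

```python
import itertools

_bnode_ids = itertools.count(1)  # source of fresh blank node labels


def fill_names(feed_triples):
    """Derive v:n triples for employees that only have v:fn.

    ASSUMPTION: the last whitespace-separated word is the family name.
    This heuristic is wrong for names such as "Hillary Rodham Clinton",
    which is why its output is kept in a layer of its own.
    """
    has_n = {s for s, p, o in feed_triples if p == "v:n"}
    derived = set()
    for s, p, o in sorted(feed_triples):
        if p != "v:fn" or s in has_n:
            continue
        given, _, family = o.rpartition(" ")
        name_node = f"_:n{next(_bnode_ids)}"  # fresh blank node for the v:n value
        # `s` is reused as-is, so the employee's blank node is shared
        # between the feed layer and the derived layer.
        derived |= {
            (s, "v:n", name_node),
            (name_node, "v:family-name", family),
            (name_node, "v:given-name", given),
        }
    return derived


feed = {("_:u331", "a", "v:VCard"), ("_:u331", "v:fn", "John Wayne")}
dataset = {
    "http://uswest.internal.example.com/employees.rdf": feed,
    "hq:namefill602": fill_names(feed),  # derived data stays separate
}
```

Because the derived triples never enter the feed layer, reloading that feed only requires discarding and recomputing the one namefill layer paired with it.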

The techniques in this example apply equally well to information that is derived as part of logical inference, such as done by an RDFS, OWL, or RIF reasoner. In these more general cases, it may be that one layer can be used for all derived information, or, at the other end of the granularity spectrum, that a new layer is used for the triples derived in each step of the process.

2.3 Archival (Static) Layers

One more variation on the federated phonebook scenario: what if HQ wants to be able to look up old information? For instance, what happens when an employee leaves and is no longer listed in a division phonebook? It could be nice if the search client could report that the employee is gone, rather than leaving people wondering if they've made a mistake with the name.

To address this, HQ's data-loading client will not simply delete a layer before reloading it. Instead, it will first copy the data to a new, archival layer. After three reloads, the database might look something like this:


 @prefix hq: <http://example.com/hq-vocab#>.
 @prefix hqa: <http://example.com/hq/archive/>.
 @prefix v:  <http://www.w3.org/2006/vcard/ns#>.
 @prefix ht: <http://example.org/http-vocab#>.
 hqa:0001 {
    ... oldest version
 }
 hqa:0002 {
    ... middle version
 }
 <http://uswest.internal.example.com/employees.rdf> {
    ... current version
 }
 # default graph
 {
   hqa:0001 hq:startValidTime ...time...  ;
            hq:endValidTime  ...time... .
   hqa:0002 hq:startValidTime ...time...  ;
            hq:endValidTime  ...time... .
   <http://uswest.internal.example.com/employees.rdf> 
       hq:snapshot hqa:0001, hqa:0002.
   ....
 }

This model uses static layers, whose contents are never supposed to change. (They are still different from RDF Graphs in that they retain their own identity; two static layers containing the same triples can have different metadata.) For each static layer, we record the time interval for which it was current (its valid time) and what it is a snapshot of.
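
The archive-then-reload step could be sketched in Python like this. The in-memory dataset, the opaque time strings and the reload_layer signature are assumptions for illustration; the hqa: names and hq: properties follow the example above.

```python
import itertools


def reload_layer(dataset, metadata, feed, new_triples, loaded_at, now, counter):
    """Replace the contents of `feed`, archiving the old contents first.

    ASSUMPTION: `dataset` maps layer names to triple sets, `metadata` is the
    default graph as a set of triples, `loaded_at` is when the outgoing
    contents were loaded, and times are stored as opaque strings.
    """
    if feed in dataset:
        # Generate the archive name by incrementing a sequence counter.
        archive = f"hqa:{next(counter):04d}"
        dataset[archive] = frozenset(dataset[feed])  # static copy
        metadata.update({
            (archive, "hq:startValidTime", loaded_at),
            (archive, "hq:endValidTime", now),
            (feed, "hq:snapshot", archive),
        })
    dataset[feed] = set(new_triples)


FEED = "http://uswest.internal.example.com/employees.rdf"
dataset, metadata, counter = {}, set(), itertools.count(1)

# First load: nothing to archive yet.
reload_layer(dataset, metadata, FEED, {("_:e1", "v:fn", "John Wayne")},
             "t0", "t0", counter)
# Second load: the previous contents move to hqa:0001.
reload_layer(dataset, metadata, FEED, set(), "t0", "t1", counter)
```

After the second load the employee is gone from the current layer but can still be found in hqa:0001, together with the interval during which that snapshot was current.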

The URL for each static layer is generated by incrementing a sequence counter. To follow Linked Data principles, HQ should provide RDF serializations of the contents of each layer in response to dereferences of these URLs. (When the state of layers is obtained like this, with separate HTTP responses for each one, a blank node appearing on multiple layers will appear as multiple blank nodes. For blank node sharing to work, the dataset which serializes the contents of all the relevant layers must be transmitted or queried as a unit.)

There is nothing about this architecture that prevents the archival data from being modified. The people maintaining the system simply agree not to change it. If this is not sufficient, other approaches could be designed, such as generating the URL using a cryptographic hash of the layer contents.
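
One way the hash-based naming might work is sketched below in Python. It assumes the layer contains no blank nodes (hashing graphs with blank nodes requires a canonical labelling step that is not attempted here), and content_name and is_unmodified are hypothetical helpers.

```python
import hashlib

ARCHIVE_BASE = "http://example.com/hq/archive/"


def content_name(triples, base=ARCHIVE_BASE):
    """Derive a layer URL from a SHA-256 hash of the layer's triples.

    ASSUMPTION: every term is an already-serialized string and the layer
    has no blank nodes.
    """
    # Sorting makes the hash independent of the order the triples arrive in.
    lines = sorted(f"{s} {p} {o} ." for s, p, o in triples)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return base + digest


def is_unmodified(name, triples, base=ARCHIVE_BASE):
    """Check that the triples still hash to the name they were stored under."""
    return content_name(triples, base) == name


snapshot = {("<a>", "<b>", "<c>"), ("<e>", "<f>", "<g>")}
name = content_name(snapshot)
```

With names generated this way, any change to an archived layer becomes detectable, because the modified contents no longer hash to the layer's URL.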

Another variant on this design is to put the feed data directly into an archival layer, instead of having the current data in the same layer as the feed. If the data is likely to grow stale (not be kept in sync with the feed master data), this may be a better approach, reducing the possibility of people unknowingly using outdated information.