Redland RDF Storage and Retrieval

Dave Beckett
The Institute for Learning and Research Technology (ILRT)
University of Bristol, UK.

Abstract

An overview of the Redland storage and retrieval architecture and implementation along with the current problems and issues.

1. Redland Introduction

Redland was designed and implemented from early 2000 with a modular approach to the internal parts - each part has an interface and a set of implementations of that interface (details of how to do this in C and across languages are described in [1]). The storage architecture similarly abstracts the triple storing (the librdf_storage class in Redland) away from the user's access to the RDF graph (the librdf_model class).
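
The split can be seen in how a graph is constructed: a storage instance is created first and then wrapped in a model. The following is a minimal sketch assuming the Redland C API, using the simple in-memory store; error handling is omitted.

  #include <librdf.h>

  int main(void)
  {
    librdf_world *world = librdf_new_world();
    librdf_world_open(world);

    /* librdf_storage decides how the triples are kept
       (here: the simple in-memory store) ... */
    librdf_storage *storage = librdf_new_storage(world, "memory", NULL, NULL);

    /* ... while librdf_model is the user-facing RDF graph API on top of it */
    librdf_model *model = librdf_new_model(world, storage, NULL);

    /* add, query and remove triples via the model here */

    librdf_free_model(model);
    librdf_free_storage(storage);
    librdf_free_world(world);
    return 0;
  }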

2. Identity in Redland

In Redland an RDF triple is a librdf_statement whose three terms - subject, predicate and object - are nodes (librdf_node) of three types: RDF URI references (also known as resources), blank nodes and literals (including datatyped literals and literals with or without language tags). The nodes are independent of any graph and carry no state other than their respective identity or, for literals, their strings.

Redland takes advantage of this and interns both nodes and (RDF) URIs so that triples in the system are three unique pointers to the node objects. All nodes and triples are identified by their content and not attached to any particular graph. This allows triples and nodes to be copied by simple reference counting. When stored, no other information need be recorded for a triple.
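
As an illustration of this identity model, two nodes created from the same URI string compare equal, and copying a node is a cheap reference-counted operation. A hedged fragment, assuming the world from the earlier sketch and a hypothetical example URI:

  librdf_node *a = librdf_new_node_from_uri_string(world,
      (const unsigned char *)"http://example.org/thing");
  librdf_node *b = librdf_new_node_from_uri_string(world,
      (const unsigned char *)"http://example.org/thing");

  /* nodes are identified by content alone, so these compare equal */
  if(librdf_node_equals(a, b)) {
    /* true: the two nodes have the same identity */
  }

  /* copying a node is cheap - a reference-counted copy, not a deep copy */
  librdf_node *c = librdf_new_node_from_node(a);

  librdf_free_node(c);
  librdf_free_node(b);
  librdf_free_node(a);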

RDF graphs do not provide any structure between the triple level and the entire graph, such as sub-graphs or "quoted" triples. As discussed later in section 5, Redland introduces a facility called Redland Contexts [4] to aid with the common tasks of graph merging and tracking the original triple/graph sources.

3. Triple retrieval API in Redland

The user API presented in Redland is a triple-based one, where triples can be added and removed, and searches are done by passing in triples with any of the fields allowed to be omitted. This is typically called a "triples matching" query and is a common retrieval metaphor seen in many RDF APIs and applications. Redland does not provide any way to store standalone nodes and arcs, but navigating the graph formed from the triples can be done in a node-centric way by using model methods that return the object of a triple with a given (subject, predicate), that is, the node at the end of an arc, such as the pseudo code:

node=model.get_target(subject, predicate)
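
In the C API this corresponds to the librdf_model_get_target method. A hedged fragment, assuming the world and model from the earlier sketch and hypothetical example URIs:

  librdf_node *subject = librdf_new_node_from_uri_string(world,
      (const unsigned char *)"http://example.org/doc");
  librdf_node *predicate = librdf_new_node_from_uri_string(world,
      (const unsigned char *)"http://purl.org/dc/elements/1.1/title");

  /* node-centric query: the node at the end of the (subject, predicate) arc */
  librdf_node *object = librdf_model_get_target(model, subject, predicate);
  if(object) {
    /* use the object node, then free the returned copy */
    librdf_free_node(object);
  }

  librdf_free_node(predicate);
  librdf_free_node(subject);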

Retrieval of information from an RDF graph can take the form of asking for triples that match a certain pattern - asking for some parts of the triples given other parts that are known. These queries return RDF concepts such as triples or nodes.
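
A triples-matching query is expressed in the C API by building a statement whose unknown fields are left empty and streaming the matches. A hedged fragment, again assuming the world and model from the earlier sketch, with a hypothetical predicate URI; a NULL field acts as a wildcard:

  librdf_statement *query = librdf_new_statement_from_nodes(world,
      NULL,                                                  /* any subject */
      librdf_new_node_from_uri_string(world,
          (const unsigned char *)"http://purl.org/dc/elements/1.1/title"),
      NULL);                                                 /* any object  */

  librdf_stream *results = librdf_model_find_statements(model, query);
  while(!librdf_stream_end(results)) {
    librdf_statement *match = librdf_stream_get_object(results);
    /* use the matching triple; it is owned by the stream, do not free it */
    librdf_stream_next(results);
  }
  librdf_free_stream(results);
  librdf_free_statement(query);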

In relational databases, the SQL language provides retrieval via a textual query language rather than concepts in software, and the results are a set of variable bindings. This more classic query-language style of support is not currently available directly in Redland, but it is under way and appears in the original design in [1] as another software module. There are many experimental RDF query languages, as discussed in [10], but only a few with multiple interoperable implementations, such as Squish[9] and RDQL[11] (not a complete list, see [10] for more).

4. Triple storing in Redland (librdf_storage class)

The two store classes that were implemented (and are always present) were an in-memory store (storage name "memory") and a store using key:value hashes ("hashes"), another class provided by Redland internally (librdf_hash). The Redland librdf_hash class provides key:value maps with duplicate values allowed for a key (aside: GDBM does not support this, which is why it was dropped, see section 4.2). The duplicate values allow duplicate triples to be stored, but this is not the primary reason for allowing them. They are needed to provide Redland Contexts, described later in section 5.

The current hash implementations are an in-memory one, based on a custom hash implementation, and a persistent one using Sleepycat/Berkeley DB[2] (BDB); others could be added behind the same interface. This means that an in-memory store, an indexed in-memory store or a persistent indexed store are all possible.

The in-memory store was designed to be simple to understand so that its operation was trivially verifiable and formed the basis for the storage unit tests. The indexed store was then developed, providing a more sophisticated storage solution.

Table 1 shows the current use of hashes, evolved from the indexes suggested in personal communications with Dan Brickley and informed by earlier work such as rdfDB[5] and earlier projects at the ILRT. The key observation was that disk access is so much slower than transfer or processor time that it is more expensive and complex to maintain multiple hashes, identify the keys and values uniquely and assign short IDs than simply to store the full triples in each hash, which is the approach taken here. It is not clear whether this balance still applies several years after the original work.

When a query is done using the triples-matching style, Redland picks the most appropriate hash if it is available. When it is not, Redland lists all the (key, value) pairs of one of the hashes that stores the full S, P, O and matches against the resulting set of triples. In this way, Redland can answer queries even when a particular hash is not available, with no user intervention. The store can thus work with only one hash, since the full S, P, O are stored in all of hashes 1-4.

Hash | Key | Value | Use
1 - SP2O | Subject and Predicate | Object | Optimises node-centric query: get targets of (source, arc)
2 - PO2S | Predicate and Object | Subject | Optimises node-centric query: get sources of (arc, target)
3 - SO2P | Subject and Object | Predicate | Optimises node-centric query: get arcs of (source, target)
4 - P2SO (optional) | Predicate | Subject and Object | Finding nodes using a predicate - useful for structural predicates such as rdf:type or ones important to a particular application, such as foaf:knows in FOAF
5 - contexts (optional) | - | - | Used for contexts, see section 5

Table 1: Current Hash Storage Indexes

The hashes are used both for the statement queries and the node-centric ones. The former are provided by serialising a hash and filtering against the query statement. This can be very slow for large models, so the node-centric indexes are used when only one of the elements of the statement is blank. Node-centric queries mean querying the model relative to a particular resource node or arc. The SP2O hash finds outgoing nodes from a resource with a given arc, the PO2S hash finds incoming nodes with a given arc and destination, and the SO2P hash finds the arcs between two given nodes. These combinations of indexes have been found to be quite useful in experiments and testbeds implemented previously at ILRT, without the need for the full combination of indexes.
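
These three index uses map directly onto the model's node-centric methods, each of which returns an iterator of nodes. A hedged fragment, assuming an existing model and existing source, arc and target nodes:

  /* SP2O: all targets of (source, arc) */
  librdf_iterator *targets = librdf_model_get_targets(model, source, arc);

  /* PO2S: all sources of (arc, target) */
  librdf_iterator *sources = librdf_model_get_sources(model, arc, target);

  /* SO2P: all arcs between (source, target) */
  librdf_iterator *arcs = librdf_model_get_arcs(model, source, target);

  while(!librdf_iterator_end(targets)) {
    librdf_node *node = (librdf_node *)librdf_iterator_get_object(targets);
    /* use node; it is owned by the iterator */
    librdf_iterator_next(targets);
  }
  librdf_free_iterator(targets);
  librdf_free_iterator(sources);
  librdf_free_iterator(arcs);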

Table 2 shows an alternate set of hashes which more closely reflects typical use of the triple APIs. It is more often the case that nodes and properties are followed "forward", i.e. from subject and predicate to object, rather than backwards. The SP2O hash would indeed be the best candidate for a one-hash store solution. As with any choice, this is very application specific, and the P2SO hash might be critical to the performance of some work.

Hash | Key | Value | Use
1 - SP2O | Subject and Predicate | Object | Most common use of triples: asking for the object given a subject and predicate
2 - S2P | Subject | Predicate | Next most common use: from a given subject, find its predicates
3 - contexts (optional) | - | - | Still optionally needed for contexts, see section 5

Table 2: Alternate Hash Storage Indexes

Redland uses a dynamic hash configuration so that different stores can set up the hashes they want in any combination. This has not been exposed in the user API at this point due to the complexity. It might be better instead to offer a set of hash indexing choices, from small to large, or optimised for certain uses.

4.1 In-memory stores

Reviewing what has already been mentioned and outlining the issues with the current in-memory stores, some further extrapolations can be made on possible approaches for improving the store; a construction sketch for the three variants is given after the list.

1. Store "memory"

This is implemented as a doubly linked list, which is very simple to develop and understand and trivial to verify for correctness. This approach allowed test cases to be built.

It is very fast with small models due to its simplicity, but O(n) in all searches since there is no indexing, and it is inappropriate for graphs with even a moderate number of triples.

2. Store "hashes" with hash type "memory"

Using in-memory hash types, this provides an indexed and fast in-memory store.

3. Store "hashes" with hash type "bdb" and no filename

BDB has a feature that if no filename is provided for the persistent store, it will create an in-memory one instead. This is possible but has not been tested.
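
The three in-memory variants above differ only in the storage name and options string passed when the store is constructed. A hedged fragment, assuming the world from the earlier sketch; the option strings are illustrative and should be checked against the Redland documentation for the version in use:

  /* 1. simple list-based store */
  librdf_storage *s1 = librdf_new_storage(world, "memory", NULL, NULL);

  /* 2. indexed in-memory store built on in-memory hashes */
  librdf_storage *s2 = librdf_new_storage(world, "hashes", "mem",
                                          "hash-type='memory'");

  /* 3. BDB hashes with no file backing (untested, as noted above) */
  librdf_storage *s3 = librdf_new_storage(world, "hashes", NULL,
                                          "hash-type='bdb'");

  /* each can then be wrapped in a librdf_model as usual */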

Future
Use the advantages of the interned triples to provide an in-memory store with a regular memory structure. It could even be memory-mapped to files or made fast to load. An in-memory store that was also persistent while in use would have advantages in certain cases, for hot triples such as schema information.

4.2 Persistent Stores

The goal of portability for Redland meant that the on-disk store was developed to use either entirely standalone resources or widely available and stable databases. Basing the first store on hashes for storing and indexing let the system build on earlier work on persistent btree-based storage such as BDB.

BDB is widely available on many systems (Unix, Linux, POSIX, others) and has a long history, having been developed in the early 1990s. There are four major versions of BDB seen in deployed systems, each with different and slightly incompatible APIs; some of the minor versions also have API changes. This meant a lot of tortuous configuration and options to support, but Redland can currently use BDB from version 2.4 to 4.1 (or later).

The persistent store using on-disk BDB hashes originally also had a GNU DBM (GDBM) implementation, but that persistent store proved inadequate - it did not support duplicate values and was not being regularly maintained.

1. Store "hashes" with BDB: Redland Hashed Indexes
As described in section 4 using BDB hashes on-disk. Generates very large files since the full triples including full URIs and literals are in each triple. Limited by disk size and speed, but simple to index. Has proved fast enough up to low millions of triples.
2. Store "3store": 3store
When 3store[3],[6] triplestore from the AKTs project is available, Redland can use it via its triple API. This API is not where most of the 3store development is currently focused, and it is not from after personal communications with the developers, if this API will continue to be maintained. In particular, the RDF schema development is less complete for the triple API, compared to the RDQL one.
3. Store "mysql": MySQL
A new MySQL store has been developed and operated by Morten Frederiksen and recently added to the main Redland sources, in CVS, after the 0.9.14 Redland release
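
The BDB-backed hashes store (item 1) is opened with an options string naming the hash type and directory. A hedged fragment, assuming the world from the earlier sketch; the database name, directory and option values are illustrative only, with contexts='yes' enabling the optional contexts hash of section 5:

  librdf_storage *storage = librdf_new_storage(world, "hashes", "db",
      "new='yes',hash-type='bdb',dir='/var/lib/redland',contexts='yes'");
  librdf_model *model = librdf_new_model(world, storage, NULL);

  /* add and query triples; the data persists in the BDB files on disk */

  librdf_free_model(model);
  librdf_free_storage(storage);
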
Future

Nothing further is planned at present. The MySQL schema might be optimised to take advantage of other schema ideas from the workshop. It might be worth creating a BDB store that matched the binary format used by others such as rdflib.

For SWAD Europe we have already written in detail in [7] and [8] about other storage systems and APIs, and looked at alternate ways to represent the triples.

5. Contexts

A facility called Redland Contexts[4] was added in Redland version 0.9.12 to support managing and merging multiple RDF graphs. The use case was to allow triples to be asserted from multiple sources (documents) and then, while using the union graph, for the information returned by the Redland model calls to include some link to the original source.

This feature was implemented by allowing, when triples are being added to the graph, an additional librdf_node to be passed in as another parameter. This is the context node, and it can be retrieved with the result of any Redland triple query by using a method on the librdf_iterator or librdf_stream objects.
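
In the C API this is done with the context-aware model methods and the stream's context accessor. A hedged fragment, assuming a world, a model created on a store with contexts enabled, a statement st to add and a query statement built as in section 3; the source document URI is a hypothetical example and the exact context accessor has varied between Redland releases:

  librdf_node *context = librdf_new_node_from_uri_string(world,
      (const unsigned char *)"http://example.org/source-document");

  /* add the triple together with its context node */
  librdf_model_context_add_statement(model, context, st);

  /* any triple query can then return the context alongside each result */
  librdf_stream *results = librdf_model_find_statements(model, query);
  while(!librdf_stream_end(results)) {
    librdf_statement *match = librdf_stream_get_object(results);
    librdf_node *ctx = (librdf_node *)librdf_stream_get_context(results);
    /* ctx links the matching triple back to its original source */
    librdf_stream_next(results);
  }
  librdf_free_stream(results);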

This was a major update to the Redland internals for the librdf_iterator and librdf_stream classes and required changes to all of the implemented store APIs and implementations.

This feature can be used for a variety of things depending on how the context nodes are used with the triples. The following is not an exhaustive list:

Enable true graph merging / updating / demerging
Identify the subgraphs (sets of triples from particular sources) with context nodes.
Statement Identity
Add each triple with a different context node; the RDF model does not assign identity to triples. Reification might also be used with this approach.
Statement Provenance
Use the context node as the subject of other statements about the statement that is returned (see the sketch after this list).
Subgraphs
Similar to the merging approach but consider the RDF graph to be a set of graphs and manipulate them as such. (Aside: Redland is gaining model aggregation to do this explicitly).
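
For the statement provenance use, the context node itself becomes the subject of further triples. A hedged sketch, assuming the world and model from earlier and an existing data_statement to record; all URIs, including the dc:source provenance property, are illustrative examples:

  /* a context node standing for "this assertion" */
  librdf_node *ctx = librdf_new_node_from_uri_string(world,
      (const unsigned char *)"http://example.org/assertion/1");

  /* add the data triple under that context node */
  librdf_model_context_add_statement(model, ctx, data_statement);

  /* then describe the assertion itself, using the context node as subject */
  librdf_statement *prov = librdf_new_statement_from_nodes(world,
      librdf_new_node_from_node(ctx),
      librdf_new_node_from_uri_string(world,
          (const unsigned char *)"http://purl.org/dc/elements/1.1/source"),
      librdf_new_node_from_uri_string(world,
          (const unsigned char *)"http://example.org/original-document"));
  librdf_model_add_statement(model, prov);
  librdf_free_statement(prov);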

Contexts used this way do have a downside in that they cannot be transferred outside Redland to other systems using the RDF/XML syntax without extending the syntax. They are also probably not the same as N3 formulae which, amongst other things, quote the triples in the contexts as not asserted in the containing graph. Redland context node information should also be made available via any network API.

6. Benchmarking

The current applications of Redland have handled up to 3.5M triples using the BerkeleyDB stores on typical desktop PC hardware[1], with acceptable interactive response, but this is not considered a large triplestore. There are no standard data sets typically used across RDF triple stores to compare load and query times (possibly the NCI cancer ontology[12], but that is rather specialised).

7. Issues

References

[1]
The Design and Implementation of the Redland RDF Application Framework, Dave Beckett, The Institute for Learning and Research Technology (ILRT), University of Bristol. In proceedings of WWW10, Hong Kong, May 2001.
[2]
Berkeley DB Datastore, Sleepycat Software Inc.
[3]
3store: Efficient Bulk RDF Storage, Stephen Harris and Nicholas Gibbins, University of Southampton, UK. In Proceedings 1st International Workshop on Practical and Scalable Semantic Systems (PSSS'03), pages 1-15, Sanibel Island, Florida.
[4]
Redland Contexts, Dave Beckett, The Institute for Learning and Research Technology (ILRT), University of Bristol, Redland Technical Note, last updated 28 October 2003.
[5]
rdfDB: An RDF Database, R.V. Guha, 2001-2002.
[6]
3store, Stephen Harris and Nicholas Gibbins, AKTors project, University of Southampton, UK
[7]
Mapping Semantic Web Data with RDBMSes, Dave Beckett and Jan Grant, The Institute for Learning and Research Technology (ILRT), University of Bristol, report for SWAD Europe, last updated 18 February 2003.
[8]
Scalability and Storage: Survey of Free Software / Open Source RDF storage systems, Dave Beckett and Jan Grant, The Institute for Learning and Research Technology (ILRT), University of Bristol, report for SWAD Europe, last updated 17 February 2003.
[9]
SquishQL, Libby Miller, The Institute for Learning and Research Technology (ILRT), University of Bristol
[10]
Databases, Query, API, Interfaces report on Query languages, Libby Miller, The Institute for Learning and Research Technology (ILRT), University of Bristol, report for SWAD Europe, last updated 1 April 2003.
[11]
RDQL - RDF Data Query Language, Andy Seaborne, HP Laboratories, Bristol, UK
[12]
National Cancer Institute Thesaurus, hosted by MINDlab, University of Maryland, College Park, Maryland, USA.

Dave Beckett