An overview of the Redland storage and retrieval architecture and implementation along with the current problems and issues.
Redland was designed and implemented from early 2000 with a modular approach to the internal parts - each has an interface and a set of implementations of that interface (details of how to do this in C and across languages are described in [1]). The storage architecture similarly abstracts the triple storage (the librdf_storage class in Redland) away from user access to the RDF graph (librdf_model).
In Redland an RDF triple (librdf_statement) and the three triple terms - subject, predicate and object - are nodes (librdf_node) of three types: RDF URI references (also known as resources), blank nodes, or literals (including datatyped literals, with or without language tags). The nodes are independent of any graph and carry no state other than their respective identity or (for literals) strings.
Redland takes advantage of this and interns both nodes and (RDF) URIs so that triples in the system are three unique pointers to the node objects. All nodes and triples are identified by their content and not attached to any particular graph. This allows triples and nodes to be copied by simple reference counting. When stored, no other information need be recorded for a triple.
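The interning scheme described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Redland implementation: a pool maps each node's content to a single shared object, so copying is a reference-count bump and a stored triple is just three shared references.

```python
# Hypothetical sketch of node interning (not the librdf internals):
# each distinct node is created once and shared, so equality of nodes
# reduces to identity of the shared objects.

class NodePool:
    def __init__(self):
        self._pool = {}  # (type, content) -> shared node object

    def intern(self, kind, value):
        key = (kind, value)
        node = self._pool.get(key)
        if node is None:
            node = {"kind": kind, "value": value, "refcount": 0}
            self._pool[key] = node
        node["refcount"] += 1  # "copying" a node only bumps a count
        return node

pool = NodePool()
a = pool.intern("uri", "http://example.org/subject")
b = pool.intern("uri", "http://example.org/subject")
assert a is b  # same object: identity is determined by content alone
```

Because nodes are identified purely by content, a triple stored in any graph needs no per-graph state beyond its three interned pointers.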
RDF graphs do not provide any other structure inside the graph between the triple level and the entire graph; such as sub-graphs or "quoted" triples. As will be discussed later in section 5, Redland introduces a facility called Redland Contexts[4] to aid with the common task of graph merging and tracking the original triple/graph sources.
The user API presented in Redland is a triple-based API in which triples can be added and removed, and searches are done by passing in triples with any of the fields allowed to be omitted. This is typically called a "triples matching" query and is a common retrieval metaphor seen in many RDF APIs and applications. Redland does not provide any way to store standalone nodes and arcs, but the graph formed from the triples can be navigated in a node-centric form by using model methods that return the object of a triple with a given (subject, predicate), that is, the node at the end of an arc, such as the pseudo code:
node=model.get_target(subject, predicate)
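The pseudo code above can be expressed in terms of triples matching. The following is an illustrative Python sketch, not the librdf API: `None` plays the role of an omitted field, and the node-centric `get_target` is a matching query with only the object left unbound.

```python
# Illustrative sketch of triples matching (names are hypothetical):
# None acts as a wildcard for any omitted field of the query triple.

def find_statements(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def get_target(triples, subject, predicate):
    """Node-centric query: the node at the end of an arc."""
    matches = find_statements(triples, s=subject, p=predicate)
    return matches[0][2] if matches else None

graph = [("alice", "knows", "bob"), ("alice", "name", "Alice")]
get_target(graph, "alice", "knows")  # -> "bob"
```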
Retrieval of information from an RDF graph can be in the form of asking for triples that match a certain value - asking for some parts of the triples given other parts that are known. These return RDF concepts such as triples or nodes.
In relational databases, the SQL language provides retrieval via a textual query language rather than concepts in software, and the results are a set of variable bindings. This more classic query-language style of support is not currently available directly in Redland, but it is underway and was part of the original design in [1] as another software module. There are many experimental RDF query languages, as discussed in [10], but only a few with multiple interoperable implementations, such as Squish[9] and RDQL[11] (not a complete list; see [10] for more).
The two store classes that were implemented (and are always present) were an in-memory store (storage name "memory") and a store using hashes of key:value pairs ("hashes"), built on another class provided internally by Redland (librdf_hash). The librdf_hash class provides key:value maps in which duplicate values are allowed for a key (aside: GDBM does not support this, which is why it was dropped; see section 4.2). Allowing duplicate values lets duplicate triples be stored, but this is not the primary reason for them: they are needed to provide Redland Contexts, described later in section 5.
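The duplicate-value behaviour can be sketched briefly. This is a minimal Python sketch in the spirit of librdf_hash, with illustrative names only, not the real class: the same (key, value) pair may be stored more than once and both copies are retained.

```python
# Minimal sketch of a key:value hash permitting duplicate values per key
# (hypothetical names; not the librdf_hash implementation).

class MultiHash:
    def __init__(self):
        self._map = {}

    def put(self, key, value):
        # Duplicates are kept: the same (key, value) pair may recur.
        self._map.setdefault(key, []).append(value)

    def get(self, key):
        return list(self._map.get(key, []))

h = MultiHash()
h.put("sp", "o1")
h.put("sp", "o1")  # a duplicate value for the same key is retained
assert h.get("sp") == ["o1", "o1"]
```

This is the property GDBM lacks; it matters less for storing duplicate triples than for keeping one entry per context when the same triple is asserted from several sources.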
The current hash implementations are an in-memory one based on a custom hash implementation and a persistent one using Sleepycat/Berkeley DB[2] (BDB), with others possible. This means that an in-memory store, an in-memory indexed store or a persistent indexed store are all possible.
The in-memory store was designed to be simple to understand so that its operation was trivially verifiable and formed the basis for the storage unit tests. The indexed store was then developed, providing a more sophisticated storage solution.
Table 1 shows the current use of hashes, evolved from the indexes suggested in personal communications with Dan Brickley and informed by earlier work such as rdfDB[5] and earlier projects at the ILRT. The key insight was that disk access is so much slower than transfer or processor time that it is more expensive and complex to maintain multiple hashes with uniquely identified keys and values mapped to short IDs than simply to store all of the triples in each hash, which is the approach taken here. It is not clear whether this balance still applies several years after the original work.
When a query is done using the triples-matching style, Redland picks the most appropriate hash if one is available. When it is not, Redland lists all the (key, value) pairs of one of the hashes that stores all of S, P and O and matches against the resulting set of triples. In this way, Redland can perform queries even when a hash is not available, without user intervention. This store model can thus work with only one hash, since S, P and O are stored in all of the hashes (1-4).
Hash | Key | Value | Use |
---|---|---|---|
1 - SP2O | Subject and Predicate | Object | Optimises node-centric query: get targets of (source, arc) |
2 - PO2S | Predicate and Object | Subject | Optimises node-centric query: get sources of (arc, target) |
3 - SO2P | Subject and Object | Predicate | Optimises node-centric query: get arcs of (source, target) |
4 - P2SO (optional) | Predicate | Subject and Object | Finding nodes using a predicate - useful for structural predicates such as rdf:type, or predicates important to a particular application, such as foaf:knows in FOAF. |
5 - contexts (optional) | - | - | Used for contexts, see section 5. |
The hashes are used both for statement queries and node-centric ones. The former are provided by serialising a hash and filtering the results against the query statement. This can be very slow for large models, so the node-centric indexes are used when only one of the elements of the statement is blank. Node-centric queries mean querying the Model relative to a particular resource node or arc. The SP2O hash finds outgoing nodes from a resource with a given arc, the PO2S hash finds incoming nodes with a given arc and destination, and the SO2P hash finds the arcs between two given nodes. These combinations of indexes were found to be quite useful in experiments and testbeds implemented previously at the ILRT, without needing the full combination of indexes.
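The index choice described above can be sketched as a dispatch on which fields of the query statement are bound. This is an illustrative Python sketch mirroring Table 1, with hypothetical names rather than the librdf internals: a pattern with exactly one blank term maps to the hash keyed on the two bound terms; anything else falls back to serialising a hash and filtering.

```python
# Sketch of choosing an index from the bound fields of a triple pattern
# (illustrative only). None marks a blank/omitted field.

def choose_hash(s, p, o):
    bound = (s is not None, p is not None, o is not None)
    return {
        (True, True, False): "SP2O",   # get targets of (source, arc)
        (False, True, True): "PO2S",   # get sources of (arc, target)
        (True, False, True): "SO2P",   # get arcs of (source, target)
    }.get(bound, "scan")               # serialise one hash and filter

assert choose_hash("s", "p", None) == "SP2O"
assert choose_hash(None, "p", "o") == "PO2S"
assert choose_hash("s", None, None) == "scan"
```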
Table 2 shows an alternate set of hashes which more closely reflects the use of the triple APIs. It is more often the case that nodes and properties are followed "forward", i.e. from subject and predicate to object, rather than backwards. The SP2O hash would indeed be the best candidate for a one-hash store solution. As with any choice, this is very application-specific, and the P2SO hash might be performance-critical for some work.
Hash | Key | Value | Use |
---|---|---|---|
1 - SP2O | Subject and Predicate | Object | Most common use of triples: asking for the object given a subject and predicate |
2 - S2P | Subject | Predicate | Next most common use: from a given subject, find its predicates |
3 - contexts (optional) | - | - | Still optionally needed for contexts, see section 5. |
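The "forward" navigation that Table 2 optimises can be sketched by chaining the two indexes. This is an illustrative Python sketch with hypothetical data, not the librdf internals: S2P lists the predicates leaving a subject, and SP2O then yields the objects for each (subject, predicate) pair.

```python
# Sketch of forward traversal over the Table 2 indexes (illustrative):
# S2P gives a subject's predicates, SP2O gives objects per pair.

s2p = {"alice": ["knows", "name"]}
sp2o = {("alice", "knows"): ["bob"], ("alice", "name"): ["Alice"]}

def outgoing(subject):
    """Yield every (predicate, object) arc leaving the subject."""
    for predicate in s2p.get(subject, []):
        for obj in sp2o.get((subject, predicate), []):
            yield (predicate, obj)

assert list(outgoing("alice")) == [("knows", "bob"), ("name", "Alice")]
```

Backwards navigation (object to subject) has no index in this configuration, which is the trade-off the paragraph above accepts.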
Redland uses a dynamic hash configuration so that different stores can set up whatever combination of hashes they want. This has not been exposed in the user API at this point due to the complexity. It might be better instead to offer a set of hash-indexing choices ranging from small to large, or optimised for certain uses.
Reviewing what has already been mentioned and outlining the issues with the current in-memory stores, some further extrapolations can be made on possible approaches for improving the store.
This is implemented as a doubly linked list, which is very simple to develop and understand, and trivial to verify for correctness. This approach allowed test cases to be built.
It is very fast with small models due to its simplicity, but O(n) in all searches due to the lack of indexing, and inappropriate for graphs with a moderate number of triples.
Using in-memory hash types, this provides an indexed and fast in-memory store.
BDB has a feature that if no filename is provided for the persistent store, it will create an in-memory one instead. This is possible but has not been tested.
The goal of portability for Redland led to an on-disk store being developed that used either entirely standalone resources or widely available and stable databases. The first store, based on hashes for storing and indexing, let the system build on earlier work on persistent btree-based storage such as BDB.
BDB is widely available on many systems (Unix, Linux, POSIX, others) and has a long history, having been developed in the early 1990s. There are four major versions of BDB seen in deployed systems, each with different and slightly incompatible APIs; some of the minor versions also have API changes. This meant a lot of tortuous configuration and options to support, but Redland can currently use BDB from version 2.4 to 4.1 (or later).
The persistent store using on-disk BDB hashes originally also had a GNU DBM (GDBM) implementation, but that store proved inadequate: it did not support duplicate values and was not being regularly maintained.
Nothing is currently planned to extend this. The MySQL schema might be optimised to take advantage of other schema ideas from the workshop. It might be worth creating a BDB store that matched the binary format used by others such as rdflib.
For SWAD Europe we have already written in [7] and [8] in detail about other storage systems, APIs and looked at alternate ways to represent the triples.
A facility called Redland Contexts[4] was added to Redland in version 0.9.12 to support managing multiple RDF graphs and merging graphs. The use case was to allow triples to be asserted from multiple sources (documents) and then, during use of the union graph, for the information returned by the Redland model calls to include some link to the original source.
This feature was implemented by allowing, when triples are added to the graph, an additional librdf_node to be passed in as another parameter. This is the context node, and it can be retrieved with the result of any Redland triple query by using a method of the librdf_iterator or librdf_stream objects.
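The mechanism can be sketched as storing quads rather than triples. This is an illustrative Python sketch with hypothetical names, not the librdf implementation: each added statement carries an optional context node, and queries yield it alongside the matched triple, in the spirit of retrieving the context from a stream.

```python
# Illustrative sketch of context nodes (not the librdf internals):
# statements are stored as (subject, predicate, object, context) quads.

class Model:
    def __init__(self):
        self._quads = []

    def add_statement(self, s, p, o, context=None):
        self._quads.append((s, p, o, context))

    def find_statements(self, s=None, p=None, o=None):
        # Yield (triple, context) so callers can recover the source.
        for qs, qp, qo, ctx in self._quads:
            if (s is None or qs == s) and (p is None or qp == p) \
                    and (o is None or qo == o):
                yield (qs, qp, qo), ctx

m = Model()
m.add_statement("alice", "knows", "bob",
                context="http://example.org/doc1")
[(triple, ctx)] = m.find_statements(p="knows")
assert ctx == "http://example.org/doc1"
```

Because the underlying hashes allow duplicate values, the same triple asserted from two documents is kept once per context node, so removing one document's triples leaves the other's intact.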
This was a major update to the Redland internals for the librdf_iterator and librdf_stream classes and required changes to all the implemented store SPIs and implementations.
This feature can be used for a variety of things depending on how the context nodes are used with the triples. The following is not an exhaustive list:
Contexts used this way do have a downside: they cannot be transferred using the RDF/XML syntax outside Redland to other systems without extending the syntax. They are also probably not the same as N3 formulae, which, amongst other things, quote the triples in contexts as not asserted in the containing graph. Redland context node information should also be made available by any network API.
The current applications of Redland have handled up to 3.5M triples using the BerkeleyDB stores on typical desktop PC hardware[1], with acceptable interactive response, but this is not considered a large triplestore. There are no standard data sets typically used across RDF triple stores for comparing load and query times (possibly the NCI cancer ontology[12], but that is rather specialised).