Triple Store

Jack Rusher, Radar Networks, jack@rusher.com

Introduction

A triple store is designed to store and retrieve statements constructed from three-part collections of strings (character sequences). Each such triple represents a subject-predicate-object relationship that more or less corresponds to the definition put forth by the RDF standard.
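
As a concrete illustration of that data model, a triple can be represented as an immutable value with three string fields. The class below is a hypothetical sketch in Java for illustration only, not the store’s actual API:

    import java.util.Objects;

    // Hypothetical sketch of the data model: an immutable subject-predicate-object value.
    public final class Triple {
        public final String subject;
        public final String predicate;
        public final String object;

        public Triple( String subject, String predicate, String object ) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        // Value equality, so that a java.util.Set of triples collapses duplicate assertions.
        @Override public boolean equals( Object other ) {
            if( !(other instanceof Triple) ) return false;
            Triple t = (Triple)other;
            return subject.equals( t.subject )
                && predicate.equals( t.predicate )
                && object.equals( t.object );
        }

        @Override public int hashCode() {
            return Objects.hash( subject, predicate, object );
        }

        @Override public String toString() {
            return "( " + subject + ", " + predicate + ", " + object + " )";
        }
    }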

The problem space of storing this sort of data has been explored by the graph database, object database, and Prolog language communities and, more recently, by the semantic web community. A more thorough background is provided by the work on Datalog, Jena, Dave Beckett’s Redland, the AT&T Research Communities of Interest project, Ora Lassila’s Wilbur, and the activities of the W3C Semantic Web project, among others.

Set Theory

It turns out that this logical basis for storing and querying relationships is nearly ideally suited to the human-friendly indexing of textual data (this web site, for instance). The basic properties of such a system are set-theoretic, following the rules given here:

Let T = { t1, t2, t3, ... } (the set of all triples)

Let t = ƒ( s, p, o ), where s, p and o are strings representing, respectively, the subject, predicate and object of the triple
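
Stated a little more formally, and assuming only that each component of a triple is a string over some alphabet Σ (as the introduction suggests), a minimal sketch of these definitions is:

    \[
      t = (s, p, o) \in \Sigma^{*} \times \Sigma^{*} \times \Sigma^{*},
      \qquad
      T = \{\, t_1, t_2, t_3, \ldots \,\} \subseteq \Sigma^{*} \times \Sigma^{*} \times \Sigma^{*}
    \]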

Performance

The performance of most similar systems is bounded by the speed of an external relational database (with the attendant overhead of IPC and double buffering betwixt the two processes) or of a linked-in library (usually Berkeley DB) that is tuned for good general-case performance, but not for this workload. My approach has been to build a from-scratch database with an on-disk layout designed specifically for this purpose.

The hard part, as is always the case in computer science, is less the creation of a conformant system than the creation of a conformant system that is also performant. I have thus far achieved this order of complexity for basic searches:

Let N = the number of triples.

Let M = the number of triples matched.

However, this sort of classical analysis isn’t the whole story. The underlying structures include B+Trees that attempt to pack as much data as possible onto each B-Page in order to minimize the number of disk seeks required to collect the requested data and maximize the efficiency of the page cache maintained by the database.
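
As a rough illustration of why this packing matters, the back-of-the-envelope sketch below computes how many fixed-width index entries fit on one page; the 4096-byte page size and the entry layout are assumptions made for the sketch, not the store’s actual on-disk format:

    // Hypothetical arithmetic: entries per B-Page for an assumed page size and
    // entry layout (three 8-byte string-table ids plus an 8-byte pointer).
    public class PagePacking {
        static final int PAGE_SIZE_BYTES   = 4096;
        static final int PAGE_HEADER_BYTES = 32;         // assumed per-page overhead
        static final int ENTRY_BYTES       = 3 * 8 + 8;  // (s, p, o) ids + pointer

        public static void main( String[] args ) {
            int entriesPerPage = (PAGE_SIZE_BYTES - PAGE_HEADER_BYTES) / ENTRY_BYTES;
            System.out.println( "Entries per B-Page: " + entriesPerPage );
            // The wider the fan-out per page, the shallower the tree and the
            // fewer disk seeks are needed to reach any given triple.
        }
    }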

All text nodes are stored in a separate data structure from the graph itself. Graph nodes store long integers that represent the indices of string values in an on-disk table. This is more efficient in time and space and allows a number of optimizations based on persistent tries to reduce the size of the string table.
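
A minimal sketch of such a string table, using a plain in-memory map in place of the on-disk structures and persistent tries described above (the class and method names are hypothetical):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: each distinct string is stored once and graph nodes
    // refer to it by a long integer index.
    public class StringTable {
        private final Map<String, Long> idsByString = new HashMap<String, Long>();
        private final List<String> stringsById = new ArrayList<String>();

        // Return the existing id for a string, or assign the next free one.
        public synchronized long intern( String s ) {
            Long id = idsByString.get( s );
            if( id == null ) {
                id = (long) stringsById.size();
                stringsById.add( s );
                idsByString.put( s, id );
            }
            return id;
        }

        // Look a string back up by its id.
        public synchronized String lookup( long id ) {
            return stringsById.get( (int) id );
        }
    }

Fixed-width integer ids keep graph nodes small and make comparisons cheap, which is the time and space saving described above.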

A random triple can be retrieved, on a cold cache, in around sixty milliseconds. Organic workloads perform very well because the system attempts to place data together on disk so that, for instance, the string table for OWL Lite can be stored in one data page that is kept in cache by frequent accesses.

Disk Usage

The current implementation takes the 488K of XML source for this website and transforms it into a 750K database. These numbers are slightly misleading because the database does not grow linearly with the growth of the source data (a database with many unique string literals will grow more quickly than one with few).

Example Pseudo-Code

The following code fragment would query the triple store for a list of triples concerning the things that man is known to be (e.g. mortal) and print them:


    // Obtain the shared store and look up all triples of the form ( "man", "is", ? );
    // a null argument in the object position matches any object.
    TripleStore ts = TripleStore.getInstance();
    TripleSet tset = ts.find( "man", "is", null );
    Iterator i = tset.iterator();  // iterate over the matched set, not the whole store
    while( i.hasNext() ) {
        Triple t = (Triple)i.next();
        System.out.println( t.toString() );
    }
    

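Assuming the find interface treats null as a wildcard in any position (the example above only exercises it in the object position), a query for everything known to be mortal would look like this, continuing from the fragment above:

    // Hypothetical: null in the subject position would match any subject,
    // returning every triple of the form ( ?, "is", "mortal" ).
    TripleSet mortals = ts.find( null, "is", "mortal" );
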
A complex solution to a simple problem. The source code will be available within the next few months under the LGPL license. The included toolkit will feature a servlet wrapper to ease integration of the triple store with custom semantic web applications.


Original version at rhetoricaldevice.com