RDF Storage and Retrieval with rdflib

Daniel Krech,
Mindswap, University of Maryland College Park,
eikeon@eikeon.com

rdflib is a Python library for working with RDF. The library attempts to follow the terminology and concepts found in the RDF Semantics document as closely as possible (with a few notable exceptions such as the use of "TripleStore" instead of "Model"). Among other functionality, the library implements a TripleStore interface tailored to the Python language; for example, the methods for retrieving information from the TripleStore are implemented using Python generators.

There are several TripleStore backends, ranging from an in-memory backend to a persistent backend that uses Sleepycat BTrees. There are also a number of parsers and serializers -- RDF/XML, NTriples, and a few others in the works -- for getting RDF into and out of the TripleStore.

The initial implementation focus for rdflib was on the TripleStore interface, and the initial TripleStore backend used nested, in-memory dictionaries for its indices (for example, spo, pos, and osp). Because of this initial focus, rdflib is partially optimized for small tasks where in-memory storage is sufficient.
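To make the nested-dictionary idea concrete, here is a minimal sketch in plain Python. It is not rdflib's code: the class name MemoryStore, the method names, and the use of None as a wildcard are assumptions made for illustration. It keeps three indices named after the orderings mentioned above and answers pattern queries with a generator.

```python
class MemoryStore:
    """Toy in-memory triple store; an illustrative sketch, not rdflib's API."""

    def __init__(self):
        # Each index is a dict of dicts of sets, keyed in the order its name gives.
        self.spo = {}
        self.pos = {}
        self.osp = {}

    @staticmethod
    def _index(index, a, b, c):
        index.setdefault(a, {}).setdefault(b, set()).add(c)

    def add(self, triple):
        s, p, o = triple
        self._index(self.spo, s, p, o)
        self._index(self.pos, p, o, s)
        self._index(self.osp, o, s, p)

    def triples(self, pattern=(None, None, None)):
        """Yield each triple matching pattern; None matches anything."""
        s, p, o = pattern
        if s is not None:
            # Subject bound: walk the spo index.
            for p2, objects in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objects:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            # Predicate bound, subject free: walk the pos index.
            for o2, subjects in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjects:
                        yield (s2, p, o2)
        elif o is not None:
            # Only the object is bound: walk the osp index.
            for s2, predicates in self.osp.get(o, {}).items():
                for p2 in predicates:
                    yield (s2, p2, o)
        else:
            # Nothing bound: enumerate everything through spo.
            for s2, by_predicate in self.spo.items():
                for p2, objects in by_predicate.items():
                    for o2 in objects:
                        yield (s2, p2, o2)


store = MemoryStore()
store.add(("ex:alice", "foaf:knows", "ex:bob"))
store.add(("ex:alice", "foaf:name", "Alice"))
store.add(("ex:bob", "foaf:name", "Bob"))

# The generator does no work until it is consumed.
names = sorted(store.triples((None, "foaf:name", None)))
```

Because triples() is a generator, a caller that only needs the first match can stop early without the store building a full result list.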

After the initial implementation experience, and as applications started to outgrow the simple in-memory TripleStore, a persistent backend using nested ZODB BTrees was added, followed by a persistent backend using flat (non-nested) Sleepycat BTrees.

Some applications, which might otherwise have used a TripleStore fruitfully, require additional interface support for aggregation and provenance. A simple use case for this additional data is a web spider that needs to be able to update data that it has already spidered. As a result of these applications, rdflib has a TripleStore variant which supports the idea of assertional contexts -- what rdflib calls an InformationStore.
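The spider use case can be sketched as follows. This is a hypothetical illustration of assertional contexts, not the InformationStore interface itself: the names ContextStore, remove_context, and respider are invented here, and treating the fetched URL as the context is an assumption.

```python
class ContextStore:
    """Toy store that remembers which context asserted each triple."""

    def __init__(self):
        # context -> set of (s, p, o) triples asserted in that context
        self.by_context = {}

    def add(self, triple, context):
        self.by_context.setdefault(context, set()).add(triple)

    def remove_context(self, context):
        # Forget everything that was asserted in this context.
        self.by_context.pop(context, None)

    def triples(self, pattern=(None, None, None), context=None):
        """Yield matching triples from one context, or from all of them.

        A triple asserted in several contexts is yielded only once.
        """
        contexts = [context] if context is not None else list(self.by_context)
        seen = set()
        for c in contexts:
            for triple in self.by_context.get(c, ()):
                if triple in seen:
                    continue
                if all(q is None or q == t for q, t in zip(pattern, triple)):
                    seen.add(triple)
                    yield triple

    def contexts(self, triple):
        """Yield every context in which the triple was asserted."""
        for c, asserted in self.by_context.items():
            if triple in asserted:
                yield c


def respider(store, url, fresh_triples):
    """Replace what was previously learned from url with freshly fetched data."""
    store.remove_context(url)
    for triple in fresh_triples:
        store.add(triple, url)


t1 = ("ex:alice", "foaf:name", "Alice")
t2 = ("ex:alice", "foaf:knows", "ex:bob")
t3 = ("ex:alice", "foaf:knows", "ex:carol")

cstore = ContextStore()
respider(cstore, "http://example.org/a", [t1, t2])
respider(cstore, "http://example.org/b", [t1])
# The page at /a changed: its old statements go away, /b is untouched.
respider(cstore, "http://example.org/a", [t3])
remaining = sorted(cstore.triples())
```

Without the context, the store could not tell that t2 came from the first page and must be retracted, while t1 is still asserted by the second page and must stay.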

The pluggable TripleStore backend capability of rdflib -- in addition to some of Python's charms -- has made it an ideal platform for experimentation, particularly with regard to different backend strategies and implementations. For example, rdflib has an in-memory backend that uses KDTrees. Other experiments have drawn on ideas from Donald Knuth's Sorting and Searching[1], the Reiser4 white paper[2], and Dan Gusfield's Algorithms on Strings, Trees, and Sequences[3].

References

[1] Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition) by Donald E. Knuth http://www.amazon.com/exec/obidos/ASIN/0201896850/102-8454493-8075306

[2] Reiser4 http://www.namesys.com/v4/v4.html

[3] Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield http://www.amazon.com/exec/obidos/tg/detail/-/0521585198/102-8454493-8075306

Appendix: Storage models and database schemas

Persistent backend using nested Zope BTrees.

cspo
cpos
spo
pos
c

Persistent backend using Sleepycat BTrees (flat, not nested).

i2k
k2i
c
cspo
cpos
cosp
spo
pos
osp
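One way to read the flat layout is sketched below. Several things here are assumptions rather than facts from the listing: that i2k and k2i are the two directions of a mapping between terms and short identifiers, that each index key is the three identifiers concatenated with a separator, and that a sorted Python list is an acceptable stand-in for a BTree with range scans. Contexts (c, cspo, cpos, cosp) are left out to keep the sketch short.

```python
import bisect

SEP = "^"  # assumed separator; it terminates every identifier in a key


class FlatIndex:
    """Sorted list of string keys, standing in for one flat BTree."""

    def __init__(self):
        self.keys = []

    def insert(self, key):
        pos = bisect.bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:
            self.keys.insert(pos, key)

    def prefix_scan(self, prefix):
        # Seek to the first key >= prefix, then read while the prefix holds.
        pos = bisect.bisect_left(self.keys, prefix)
        while pos < len(self.keys) and self.keys[pos].startswith(prefix):
            yield self.keys[pos]
            pos += 1


class FlatStore:
    """Toy flat-key store; an illustrative sketch, not rdflib's schema."""

    def __init__(self):
        self.k2i = {}  # term -> identifier (assumed direction)
        self.i2k = {}  # identifier -> term (assumed direction)
        self.indices = {"spo": FlatIndex(), "pos": FlatIndex(), "osp": FlatIndex()}

    def _intern(self, term):
        if term not in self.k2i:
            ident = str(len(self.k2i))
            self.k2i[term] = ident
            self.i2k[ident] = term
        return self.k2i[term]

    @staticmethod
    def _key(*idents):
        return "".join(ident + SEP for ident in idents)

    def add(self, triple):
        ids = dict(zip("spo", (self._intern(term) for term in triple)))
        for order, index in self.indices.items():
            index.insert(self._key(*(ids[name] for name in order)))

    def triples(self, pattern=(None, None, None)):
        bound = {}
        for name, term in zip("spo", pattern):
            if term is not None:
                if term not in self.k2i:
                    return  # a term never stored cannot match anything
                bound[name] = self.k2i[term]

        def leading(order):
            # Number of leading key positions that the pattern binds.
            n = 0
            for name in order:
                if name not in bound:
                    break
                n += 1
            return n

        # With spo, pos, and osp, every pattern is a key prefix of some index.
        order = max(self.indices, key=leading)
        prefix = self._key(*(bound[name] for name in order[:leading(order)]))
        for key in self.indices[order].prefix_scan(prefix):
            ids = dict(zip(order, key.split(SEP)))
            yield tuple(self.i2k[ids[name]] for name in "spo")


flat = FlatStore()
flat.add(("ex:alice", "foaf:knows", "ex:bob"))
flat.add(("ex:alice", "foaf:name", "Alice"))
flat.add(("ex:bob", "foaf:name", "Bob"))
flat_names = sorted(flat.triples((None, "foaf:name", None)))
```

Interning terms keeps the index keys short, and ending every identifier with the separator ensures that a prefix scan for identifier 1 does not also match identifier 10.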
