Our experience of constructing Semantic Web applications has involved large datasets of up to twenty-five million RDF triples. For our CS AKTiveSpace application [9], we used the hyphen.info dataset, a constantly updated knowledge base describing computer science research in the UK. The dataset was constructed by members of the AKT project as a testbed for Semantic Web research, and as a foundation on which to build Semantic Web applications and investigate tools. It is expressed in RDF using the AKT Reference Ontology [2], which consists of around 200 classes and 150 properties. We expect other applications under development to use separate ontologies and sub-ontologies of a similar size.
From these experiences we determined the base scale requirements for an RDF store, and decided to construct a system able to handle at least 20 million triples and 5000 classes and properties. Our specific requirements relate to the performance of such a system in query processing, in its ancillary capabilities and in its inferential capabilities.
Many of the applications under development in AKT require interactive-level performance while evaluating queries containing significant numbers of constraints. For example, the CS AKTiveSpace user interface uses RDQL queries with between four and twelve triple patterns in the WHERE clause, typically returning a few hundred result rows. These applications are commonly based around a Web browser interface, and most users expect such an interface to be no less responsive than simple Web browsing, so the response time for these queries must be kept to the order of a few milliseconds on available hardware.
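For illustration, a query of this general shape might be used to retrieve people and their affiliations; this is an indicative example only, with class and property names in the style of the AKT Reference Ontology rather than taken verbatim from the application:

    SELECT ?person, ?name
    WHERE (?person, <rdf:type>, <akt:Lecturer-In-Academia>),
          (?person, <akt:full-name>, ?name),
          (?person, <akt:has-affiliation>, ?org)
    USING rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
          akt FOR <http://www.aktors.org/ontology/portal#>

Each additional triple pattern adds a constraint (in relational terms, a join), which is why the number of patterns per query bears directly on the query-processing performance required of the store.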
An orthogonal concern is the time taken to assert new knowledge. The knowledge sources that the AKT project uses as a testbed for its research applications are gathered on schedules ranging from daily to monthly. Maintaining the integrity of the data while it is being reasserted is important, so the period during which the knowledge base is potentially inconsistent or incomplete should be kept to a minimum. To deal with rapidly changing knowledge bases, the store must also be able to import and replace RDF data quickly enough to refresh its contents daily. Given this minimum gathering schedule, it should be possible to assert the entire AKT RDF knowledge base in a few hours, and to replace significant portions of it within the window of an overnight batch process.
Finally, if an RDF store is to be more than a database for storing triples, it should be able to perform some inference over the data asserted within it. Given our requirements for scale and performance, and the expected inference requirements of our applications, we required only the entailments described in the RDF semantics [6], rather than those of a more expressive language such as DAML+OIL or OWL [8].
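As a concrete illustration of the level of inference this implies, the RDFS entailment rules license conclusions such as the following (written in N3-style shorthand; the instance and class names are purely illustrative):

    # Asserted:
    ex:jones    rdf:type        ex:Lecturer .
    ex:Lecturer rdfs:subClassOf ex:Person .

    # Entailed under the RDF semantics (rule rdfs9):
    ex:jones    rdf:type        ex:Person .

Entailments of this kind can be supported at the scales we require, whereas the description-logic reasoning demanded by DAML+OIL or OWL is considerably more expensive.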
3store is implemented in C as an RDF abstraction layer on top of an RDBMS, using a schema designed for the efficient storage and retrieval of triples, and uses the Raptor toolkit [3] for parsing RDF/XML syntax. Our implementation records the origin of the triples it contains at the granularity of (RDF/XML) files. This simple notion of context provides the basis of a mechanism for managing provenance. In our experience of building Semantic Web applications, we have found file-level provenance to be sufficient, particularly when set against the cost of maintaining triple-level provenance.
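To make the storage model concrete, the following is a minimal sketch of the kind of relational schema such an abstraction layer might use; the table and column names are our own illustration, not 3store's actual schema:

    -- Triples stored as fixed-width rows of identifiers; the source
    -- (RDF/XML) file of each triple is recorded per row to give
    -- file-level provenance.
    CREATE TABLE triples (
        subject   BIGINT NOT NULL,  -- identifier of the subject resource
        predicate BIGINT NOT NULL,  -- identifier of the predicate
        object    BIGINT NOT NULL,  -- identifier of the object
        model     BIGINT NOT NULL   -- identifier of the file of origin
    );

    CREATE TABLE resources (
        id  BIGINT PRIMARY KEY,     -- identifier used in the triples table
        uri TEXT NOT NULL           -- the URI or lexical form it denotes
    );

Keeping the triples table to fixed-width identifier columns keeps it compact, and allows a multi-pattern query to be evaluated as a series of self-joins on that table.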
The inferential capabilities of 3store are implemented as a hybrid of forward- and backward-chaining production rules, in which we have tried to find a compromise between eager entailment evaluation (which leads to efficient querying) and lazy entailment evaluation (which reduces the size of the stored data).
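As an illustration of how such a split might look (this example is indicative, not a description of 3store's actual rule set), the transitivity of rdfs:subClassOf could be forward-chained, since schema triples are comparatively few and the materialised closure stays small, while the propagation of rdf:type through the class hierarchy could be backward-chained, since the entailed instance triples are numerous and need not be stored:

    # Forward-chained at assertion time (rule rdfs11):
    (?a rdfs:subClassOf ?b), (?b rdfs:subClassOf ?c) => (?a rdfs:subClassOf ?c)

    # Backward-chained at query time (rule rdfs9):
    (?x rdf:type ?a), (?a rdfs:subClassOf ?b) => (?x rdf:type ?b)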
The previous (prototype) version of our 3store software supported an RDF-based dialect of the Open Knowledge Base Connectivity (OKBC) [5] API, using HTTP as its transport layer. This was intended as a lightweight interface by which RDF-aware clients could invoke the knowledge base through a set of Web services providing specific competences for manipulating a knowledge base in a frame-based manner. Typical examples of such services are get-class-subclasses (return the subclasses of a class), slot-has-value (return the value of a property on an object) and get-frame-sentences (return the assertions involving a given object). The OKBC-HTTP interface was used in a number of our existing applications, so maintaining it was a requirement for backwards compatibility with previous versions, as well as an opportunity to reimplement it more efficiently.
In addition to this OKBC API, we felt it appropriate to implement a more natural and versatile RDF query interface, based on the RDQL query language [7]. Our existing applications made heavy use of the stored procedure capability of the OKBC API, indicating that the simple API calls were not expressive enough to support the development of sophisticated applications. The RDQL interface is exposed in two forms: an HTTP interface that returns results in an XML format, and a database-style C API that queries the knowledge base directly and could be used to provide bindings for other languages.
Our development of 3store has identified an expanded set of requirements, which present a number of avenues for future development: