Jena position paper

SWAD-Europe workshop on Semantic Web storage and retrieval

Authors:

Kevin Wilkinson, HP Laboratories, Palo Alto, USA
Dave Reynolds, HP Laboratories, Bristol, UK
Andy Seaborne, HP Laboratories, Bristol, UK

Summary

Jena2 is a second generation RDF toolkit. It uses a simple abstraction of the RDF graph as its central internal interface. This is used uniformly for in-memory triple stores, database backed RDF data, and inference engines. Two rich interrelated presentation APIs are layered in top of this graph abstraction: the Jena Model API provides convenient access to the underlying RDF; the Ontology API upgrades the earlier DAML support (which is still available to support porting of legacy code).

Both Jena1 and Jena2 provide RDF persistence by mapping triples to a relational DBMS. Any number of relational engines are supported. However, Jena2 uses a different database layout than Jena1. The Jena2 implementation replaces the normalized schema used in Jena1 by a denormalized schema which yields a significant performance improvement at the cost of storage space. The additional storage space required depends on the degree of denormalization (set by configuration options) and the RDF being stored. Initial experiments indicate that Jena2 can come within a factor of two of the Jena1 space usage.

To augment the denormalized schema, in the future Jena2 will support custom layouts for particular properties or classes to exploit regularities in access patterns. In particular, property tables and property-class tables (see below) should offer high performance when properties are retrieved in groups. Currently, Jena2 uses a property-class table to store reification quads. This provides a performance speed-up of 4x over storing the quads in a triple table.

Jena2, like Jena1, supports the RDQL language. Unlike Jena1 the matching of sets of triple patterns can be delegated en masse to the graph layer so that specific storage implementations can optimize implementations. The database-backed stores can compile such patterns directly in to the appropriate SQL queries so that the RDBMS performs the required joins.

Jena2 supports inference graphs to support RDF entailment. An inference graph can be configured to access a range of inference engines and Jena2 ships with both a forward-chaining engine and a backward-chaining engine. The inference graph can work over persistent graphs but currently, inferences are performed in memory. In the future, optimizations to push down some inference operations to the database will be investigated.

Database storage

Storage schema

In Jena1 database-backed models were stored using a normalized schema. Resources, literals and namespaces were each stored in separate tables. The main triple table referenced the resources and literals through database identifiers (either hashes or sequence numbers depending on application preference). This normalized schema offers good space efficiency but in order to retrieve a single triple three joins must be performed. Some caching of resources alleviated this somewhat but it still performed poorly compared to the Berkeley DB implementation which avoided such joins.

The Jena2 schema trades-off space for time. It uses a denormalized schema in which resource URIs and simple literal values are stored directly in the statement table. Separate resource and literal tables are only used to store long literals or URIs whose length exceeds a threshold. This threshold is a configuration parameter which enables applications to vary the degree of denormalization.

Statement table:

subject	predicate	object
mylib:doc1	dc:title	Jena2
mylib:doc1	dc:creator	HP Labs Bristol
mylib:doc1	dc:description	101
201	dc:title	Jena2 persistence

Literal and resource tables:

ID	Value
101	The description - a very long literal that might be stored as a blob

ID	Value
201	hp:aResourceWithAnExtremelyLongURI

This approach avoids joins in retrieving most statements but at the cost of some storage space overhead. This space overhead is addressed in several ways. First, common prefix URIs such as namespaces are compressed. They are stored in a separate table and the prefix replaced by a database reference. It expected that the number of prefixes will be small enough to be cacheable in memory thus avoid database joins. Second, long values are only stored once. Third, Jena2 will support property tables (see below) which offer a modest reduction in space consumption.

Both Jena1 and Jena2 permit multiple graphs to be stored in a single database instance. In Jena1 all graphs were stored in a single statement table. However, Jena2 supports the use of multiple statement tables in a single database so that applications can flexibly map graphs to different tables. In this way, graphs that are often accessed together may be stored together (using a separate graph identifier column not shown) while graphs that are never accessed together may be stored separately. This capability is used internally to store system metadata as RDF statements in its own table.

Like Jena1, Jena2 supports transactions using the underlying RDBMS transaction machinery. Use of a transaction wrapper can substantially improve performance for loading of large sets of statements.

Optimizations

An RDF graph will typically have a number of common statement patterns. For example, classes which follow some frame structure with a core set of properties always being stored for each class instance. One particular example of that within RDF itself is the representation of reified statements by instances of rdf:Statement.Jena2 supports optimization of such important cases through the use of property tables. A property table stores all instances of a given property in the graph.

A special type of property table is the property-class table. It is a property table that stores properties of a class as well as a boolean column indicating if the subject has been explicitly asserted as a class member. Currently, Jena2 implements reification using a property-class table with three columns for rdf:subject, rdf:predicate and rdf:object. This saves space compared to the explicit storage of the four reification triples but easily supports partially reified statements.

Support for arbitrary property and property-class tables is future work. They significantly complicate query processing in two ways. First, the properties for a subject may be stored across multiple tables and the optimizer needs techniques to determine which tables to scan. Second, property tables store statements in a compressed form so it is difficult to directly join statements in a property table with the denormalized statement table. But, it is worth noting that once generalized property tables are supported, it is a relatively small step to extend them to access arbitrary legacy relational database tables directly from Jena2. A technique such as that proposed in D2R could be used.

Performance notes

To demonstrate the storage overheads of the denormalized schema the following table compares space and load times for 22,000 RDF statements in Jena1 and Jena2 with different configurations.

Configuration	Space	Time	Comments
Jena1	4 MB	71 s	normalized schema
Jena2 (time-opt)	16 MB	30 s	no URI compression, 256 = short threshold
Jena2 (space-opt)	8 MB	55 s	URI compression, 10= short threshold
Jena2 (time-opt)	9 MB	24 s	URI compression, 256 = short threshold

We can see that the Jena2 is at best double the storage cost of Jena1 but that URI prefix compression has a big impact on the space/time tradeoff.

To demonstrate the overhead of normalization on simple retrieval the following table compares the time to retrieve 1000 statements when the object value is stored in the statement table (denormalized case) and when the object value is stored in the literals table (normalized case).

Retrieve 1000 stmts	Normalized	Denormalized	Speed-up
First run	3270 ms	2850 ms	1.2
Last run	870 ms	420 ms	2.0

The initially slow run is due to hardware caching effects and once the caches are warm we see the denormalized case is twice as fast, just as expected (since only one table select operation is needed as opposed to two for the normalized case).

To illustrate the impact of property tables the next result table compares the effect of storing reification quads using property-class tables compared to storing them in the normal statement table.

Reified stmts retrieved	Property-class	Stmt table	Speedup
200 (first run)	1000 ms	1860 ms	1.8
200 (last run)	270 ms	1470 ms	5.4
1000 (first run)	1330 ms	7380 ms	5.5
1000 (last run)	700 ms	6970 ms	10.0
5000 (first run)	4220 ms	34380 ms	8.1
5000 (last run)	3470 ms	34270 ms	9.9

Interestingly the speed-up is more that the 4x that one might expect. Perhaps this is due to the effect of database caching since the property-class tables improve locality of reference for the lookups and also make better use of the cache since the property URIs are not explicitly stored.

Query processing

Jena2 separates the syntactic and API issues of the query language from query processing. Further, inference, whether for OWL, RDFS or for application rule sets, is part of the graph abstraction and not part of the query language or execution.

RDQL provides information access by finding solutions to a query expression returning results as sets of variable bindings. This is a common and convenient paradigm for an application to interact with RDF and there is also an RDF vocabulary for expressing result sets.

However, this is not composable because the result of query is not another RDF graph extracted from the original knowledge base. No RDF statements from the original KB are part of the result set, only the graph labels. In Joseki, the result of any query is a subgraph of the original graph. This enables a client to extract part of a large KB, then process it locally with an RDF API of its choice. The alternative of remotely providing the same RDF API as is available locally would create a large number of fine grain network operations.

For any RDQL query on KB, there is a unique smallest subgraph that yields the same answers for the query as executing the query on the large KB. Optimal calculation of this subgraph is not necessarily by executing a query then taking the variable binding to calculate the subgraph because one small part of the graph may contribute to many query solutions, especially if the query consists of disjoint graph pattern components. There is a trade-off to be made between exploiting mature relational database technology against explicit calculation of the subgraph.

Current Work

Current work on Jena2 is focused on combining the fast-path query architecture, that allows delegation of all or large parts of a query down to storage subsystems, with constraint expressions so that it is possible to delegate constraints to the storage if possible, or to split query processing into stages that the storage can handle, with constraints executed at a suitable point in the query pipeline, not just at the end.

Fast-path is complicated by the introduction of property tables. A basic fast-path query processing algorithm is implemented but it does not yet optimize some common and important cases. The implementation of constraints is complicated for two reasons. First, the use of datatypes enables new types of optimizations and operators, such as range comparisons (see below). Second, the different database engines supported by Jena2 have varying degrees of support for constraint evaluation. Constraint expressions efficiently processed on one database system may not be supported on another and so take much longer.

RDF Datatypes

The Jena2 API provides support for RDF typed literals. A registry enables mapping of arbitrary datatype URIs to serialization/deserialization code. Registered implementations for all the the relevant XML Schema datatypes is provided using internal Xerces library routines. The design separates equality of the java datatypes from semantic equality at the RDF layer.

Jena2 supports datatypes on literals by storing the URI for the datatype along with the string representation of the literal value. This enables equality matching but does not leverage database support for known datatypes such as numerics. In the future, Jena2 will investigate storing datatyped literals using the native database types.

RDF Entailment

Jena2 supports the notion of inference graphs. An inference graph implements the standard graph interface and so can be manifested through any of the user level access APIs (RDF API, Ont API, legacy DAML API). It also supports additional interfaces for consistency checking, access to deduction traces and query with posits. In turn the inference graphs can be layered on top of any graph implementation - including plain in-memory graphs and database backed graphs.

The inference graph API can be used to access a range of different inference engines. A registry mechanism is used to allow run time addition and querying of available inference engines. For example, there has been some early experimental work on connecting to the Racer description logic engine.

Jena2 ships with a pair of generic rule engines - a forward chaining engine based on the standard RETE algorithm and backward chaining engine which uses SLD style backtracking with with tabling. These engines can be connected in a hybrid mode which allows the forward rules to assert additional backward chaining rules which are then used to answer queries. Jena2 includes built-in rule sets which enable these rule engines to provide RDFS and a useful subset of OWL inferences.

The current rule engine implementations will work over database stores, and have been used that way, but there is clearly scope for further optimization. The rule languages are essentially syntactically-sugared datalog and so could be passed directly onto a database storage layer so long as that layer can implement recursive queries. However, at the time of writing, there has been no specific investigation into direct deductive database implementations and all current inferences are performed in memory.