ReMark - A Strawman Proposal for an RDF Benchmark Suite

Kevin Wilkinson
HP Labs - Palo Alto

Introduction

The past few years have seen the development of many viable first- and second-generation RDF storage systems. Although they all provide the same basic functionality of RDF storage and retrieval, the systems are tuned and optimized for different application requirements. With the recent interest in developing practical and scalable Semantic Web applications, researchers and developers face the challenge of choosing an RDF storage system that best meets their application needs. Many have observed the need for tools and benchmarks that could help users make these decisions. Indeed, discussion at the recent Workshop on Practical and Scalable Semantic Systems at the 2003 International Semantic Web Conference (ISWC) motivated this strawman proposal.

This proposal argues for the creation of a small working group (3-5 persons) to define and implement an initial RDF benchmark. Such a benchmark would provide a comparative evaluation of different RDF stores. It could also be used to compare different versions of the same store, e.g., to evaluate the impact of a tuning knob or a new feature. The purpose is not to identify a single best system. Rather, it is to show the range of performance possible for various operations under different configurations and to illustrate what performance trade-offs are possible. To the extent that the benchmark identifies correctable deficiencies in RDF stores, it will also indirectly improve the overall performance of those systems.

For discussion purposes, this proposal includes a strawman for what an RDF benchmark might look like. The next three sections describe the benchmark data set, the benchmark operations and the benchmark metrics. A final section addresses miscellaneous topics such as packaging. In the remainder of this section, we explain the rationale for the overall approach taken in this strawman proposal.

We feel that in the area of comparative benchmarking, the Semantic Web community would do well to follow the example of the database community. In the early 1980s, database users were faced with many implementations of SQL and had few objective ways to compare them. The Wisconsin database benchmark was a breakthrough success for both users and developers. We feel it was successful for two reasons: it was simple and it was well packaged. By being simple, it was easy to understand, easy to implement, and its results were easy to interpret. It is tempting to design a realistic benchmark in the hope of modeling a real-world application, but too often this yields a benchmark whose results are hard to interpret. The Wisconsin benchmark did not model a real-world application. It used synthetic data and measured the performance of small database operations over that data.

The Wisconsin benchmark was also well packaged and documented. This made it easy to download and run or to port and modify. It also simplified the process of verifying the correctness of the benchmark (and made it hard to cheat). In summary, it was an intuitive benchmark that did not take a large amount of effort to use, and it served the purpose of bootstrapping the benchmarking process. It was subsequently followed by the formation of the Transaction Processing Performance Council (TPC) and its industry-standard benchmarks that modeled application domains such as transaction processing (TPC-A, TPC-C) and decision support (TPC-D). These application-specific benchmarks are more realistic but significantly more complicated: they took much longer to define and require companies to spend significant resources just to run them.

The lesson is to start small and get a baseline for comparison, then build on that initial effort with subsequent benchmarks that address specific domains. For this reason, we feel the initial focus should be on RDF storage and retrieval. Although the performance of inferencing is of high interest, we feel it is too early to devote much effort to evaluating it. We propose an initial benchmark designed around simple RDF graph operations (add, remove) and relatively simple queries. The data set would be generated synthetically and would be scalable to enable evaluation across a wide range of problem sizes. It is desirable to also include some simple inferencing queries if possible. For want of a name, we propose the term ReMark, for Resource Benchmark. The initial benchmark suite could be RDF ReMark, perhaps followed by Lite ReMark, DL ReMark, or application benchmarks such as Lehigh ReMark (based on the Lehigh benchmark introduced at ISWC 2003).

RDF ReMark Data Set

The RDF ReMark data set should be a scalable, synthetic data set. Synthetic means that the data is programmatically created and does not have realistic values, e.g., random strings are generated for names. Scalable means there is a range of data set sizes. Of course, every generated data set, regardless of size, has identical characteristics, e.g., the same relative numbers of class instances, identical property value distributions, etc. Scalable data sets are convenient for development: the smaller sizes can be loaded quickly for development and testing, while the larger sizes enable estimates of system scalability.

The ontology for the specified data sets should describe a few classes. We propose an ontology of documents (technical papers?) based on Dublin Core. There would be three classes (perhaps a few more if needed). The primary class would be Document, comprising a number of Dublin Core properties. A second class would be Author, referenced by the creator property of Document. This class would provide identifying information for authors, using a blank node for a compound name and/or address. A third class would be Subject, referenced by the subject property of Document. The Subject class instances would have a self-referencing property, IS-A, that relates subjects in a topic hierarchy; this will enable ancestor-type queries. The Document class instances may include a bibliography property that references another Document instance. In this way, queries that follow chains of document references and links among authors may be created. For scalability, there would be a fixed ratio among the number of Document, Author and Subject class instances.
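For concreteness, a minimal sketch of a single instance of this data, built with a Jena-style Model API, might look as follows. The remark: namespace and the isA, references and name property URIs are hypothetical placeholders, not part of the proposal; the method and package names follow the current Apache Jena distribution and may differ in other versions.

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC;
    import org.apache.jena.vocabulary.RDF;

    public class RemarkDataExample {
        // Hypothetical namespace for the benchmark ontology.
        static final String NS = "http://example.org/remark#";

        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            Resource docClass = m.createResource(NS + "Document");
            Resource authorClass = m.createResource(NS + "Author");
            Resource subjClass = m.createResource(NS + "Subject");
            Property isA = m.createProperty(NS, "isA");               // topic hierarchy link
            Property references = m.createProperty(NS, "references"); // bibliography link

            // One Author instance; a blank node carries the compound name.
            Resource author = m.createResource(NS + "author/1").addProperty(RDF.type, authorClass);
            Resource name = m.createResource()                        // blank node
                    .addProperty(m.createProperty(NS, "given"), "A.")
                    .addProperty(m.createProperty(NS, "family"), "Author1");
            author.addProperty(m.createProperty(NS, "name"), name);

            // A small Subject hierarchy: subject/11 isA subject/1.
            Resource topic = m.createResource(NS + "subject/1").addProperty(RDF.type, subjClass);
            Resource subTopic = m.createResource(NS + "subject/11")
                    .addProperty(RDF.type, subjClass)
                    .addProperty(isA, topic);

            // One Document instance using Dublin Core properties.
            Resource doc = m.createResource(NS + "doc/1")
                    .addProperty(RDF.type, docClass)
                    .addProperty(DC.title, "Synthetic title 1")
                    .addProperty(DC.creator, author)
                    .addProperty(DC.subject, subTopic);

            m.write(System.out, "N-TRIPLE");
        }
    }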

The properties of Document, Author, Subject may be either literal values or references (to be determined). For each class property, the number of instances of that property in a class instance would be specified by a property cardinality distribution. This enables optional, single-valued and multi-valued properties. For literal property values, the distribution of the property values would be specified by a property value distribution. Varying this distribution enables different selectivities for properties, e.g., the values could be unique, could be one of n values, etc.  For referential property values, the referenced class must be specified along with the class cardinality and the distribution of references, e.g., uniform, Zipfian, etc.
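To make the distribution parameters concrete, the following sketch shows one way a generator might sample a property cardinality, a literal value of controlled selectivity, and a Zipfian-skewed reference target. The specific percentages, seed and skew are illustrative assumptions, not values fixed by this proposal, and this is not the Jena team's actual tool.

    import java.util.Random;

    // Illustrative sampling helpers for a synthetic data generator.
    public class PropertyDistributions {
        private final Random rnd = new Random(42);   // fixed seed for repeatable data sets
        private final double[] zipfCdf;              // cumulative distribution over reference targets

        // Precompute a Zipfian distribution over the instances of a referenced class so
        // that a few instances are referenced far more often than the rest.
        public PropertyDistributions(int referencedClassSize, double skew) {
            zipfCdf = new double[referencedClassSize];
            double norm = 0;
            for (int i = 0; i < referencedClassSize; i++) norm += 1.0 / Math.pow(i + 1, skew);
            double acc = 0;
            for (int i = 0; i < referencedClassSize; i++) {
                acc += (1.0 / Math.pow(i + 1, skew)) / norm;
                zipfCdf[i] = acc;
            }
        }

        // Property cardinality distribution: here 20% of instances omit the property,
        // 60% carry one value and 20% carry two (optional / single-valued / multi-valued).
        public int sampleCardinality() {
            double p = rnd.nextDouble();
            if (p < 0.2) return 0;
            if (p < 0.8) return 1;
            return 2;
        }

        // Literal value distribution with controlled selectivity: one of n distinct values.
        public String sampleLiteral(int n) {
            return "value" + rnd.nextInt(n);
        }

        // Referential value distribution: index of the referenced instance, Zipfian skew.
        public int sampleReferenceTarget() {
            double p = rnd.nextDouble();
            for (int i = 0; i < zipfCdf.length; i++) if (p <= zipfCdf[i]) return i;
            return zipfCdf.length - 1;
        }
    }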

For simplicity, the synthetic data would not include reified statements, containers or collections, nor anything that would enable interesting inferencing. However, if desired, the ontology could include some simple subclasses, e.g., sub-types of documents or authors, to support queries that perform simple inferences. In terms of implementation, the RDF storage system is free to choose whichever storage method it wishes. However, the reporting section will require full disclosure of the methods used and the storage overhead. As an aside, we should place as few requirements as possible on the RDF storage system; e.g., use of an underlying relational database is not required, nor, for that matter, is persistence.

The Jena team have developed a tool for synthetic data generation that supports many of the capabilities described here. This tool could be made freely available. Distributing the tool rather than the data seems desirable so that developers could modify or extend the data sets as needed. For example, some systems may have limited support for long object values or may not support datatypes or language tags. Additionally, users may wish to augment the generated data with new classes that demonstrate additional features of their system. Of course, any changes to the data would need to be documented in the results.

ReMark Operations

The specific benchmark operations remain to be determined. This section outlines what is needed. The set of benchmark operations would include many simple queries and updates along with a few more complicated query and update operations. The simple queries would retrieve a property value given the subject or vice versa. The number of values (or subjects) returned depends on the property queried (since different properties have different selectivities). The simple updates would add a single triple to the graph or remove a triple from the graph.
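As an illustration only, the simple operations might be phrased against a triple-level API roughly as follows. The sketch assumes a Jena-style Model and reuses the hypothetical remark: namespace from the data-set sketch; the actual encoding of operations is an open question addressed below.

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC;

    public class SimpleOperations {
        static final String NS = "http://example.org/remark#"; // hypothetical namespace

        // Simple query: all values of a property for a given subject.
        static void valuesForSubject(Model m, Resource doc) {
            NodeIterator it = m.listObjectsOfProperty(doc, DC.title);
            while (it.hasNext()) System.out.println(it.next());
        }

        // Simple query in the other direction: all subjects having a given value.
        static void subjectsForValue(Model m) {
            ResIterator it = m.listSubjectsWithProperty(DC.title, "Synthetic title 1");
            while (it.hasNext()) System.out.println(it.next());
        }

        // Simple updates: add a single triple to the graph, then remove it.
        static void addAndRemove(Model m, Resource doc) {
            Statement st = m.createStatement(doc, DC.date, m.createLiteral("2003-10-20"));
            m.add(st);
            m.remove(st);
        }
    }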

More complicated queries would retrieve all property values for a class instance or would follow chains of property references, e.g., authors of papers referenced by a class instance. Complicated update operations might include adding a complete Document instance, removing a part of the subject hierarchy, etc.
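A sketch of one of the more complicated chain queries (the creators of the documents referenced by a given document), again assuming the hypothetical references property introduced earlier:

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC;

    public class ChainQuery {
        static final String NS = "http://example.org/remark#"; // hypothetical namespace

        // For each document cited in the bibliography of 'doc', print its creators.
        static void authorsOfReferencedDocuments(Model m, Resource doc) {
            Property references = m.createProperty(NS, "references");
            StmtIterator refs = m.listStatements(doc, references, (RDFNode) null);
            while (refs.hasNext()) {
                Resource cited = refs.nextStatement().getObject().asResource();
                NodeIterator creators = m.listObjectsOfProperty(cited, DC.creator);
                while (creators.hasNext()) System.out.println(cited + " -> " + creators.next());
            }
        }
    }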

Because there is no common API across RDF storage systems, an open question is how to encode the benchmark operations and their expected results. The obvious choice would be to encode the operations and results as RDF graphs. However, for convenience, it may also be desirable to include equivalent specifications of the queries in languages such as RDQL, RQL, and so on.

One complication in benchmarking simple operations is that the response time may be undetectable if it is smaller than the granularity of the available timer. For example, within Java programs, the usual technique for measuring response time is System.currentTimeMillis(), which has a granularity of milliseconds. Operations that take on the order of a few milliseconds cannot be reliably measured by this technique. Consequently, for accurate measurement, it may be necessary to aggregate simple operations into groups of identical simple operations. For example, instead of measuring the time to get a single property value, measure the time to get 50 values of that property for randomly chosen subjects.
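A sketch of the aggregation idea: time a batch of 50 identical lookups with System.currentTimeMillis() and report the average per operation. The batch size and the choice of property are illustrative; the lookup itself reuses the Jena-style API assumed above.

    import java.util.List;
    import java.util.Random;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.DC;

    public class TimedBatch {
        // Measure the time to fetch one property for 50 randomly chosen subjects and
        // report the average per-operation time; a single lookup would be lost in the
        // millisecond granularity of the timer.
        static double averageTitleLookupMillis(Model m, List<Resource> docs) {
            Random rnd = new Random();
            final int batch = 50;
            long start = System.currentTimeMillis();
            for (int i = 0; i < batch; i++) {
                Resource doc = docs.get(rnd.nextInt(docs.size()));
                m.listObjectsOfProperty(doc, DC.title).toList();  // force evaluation
            }
            long elapsed = System.currentTimeMillis() - start;
            return (double) elapsed / batch;   // average milliseconds per lookup
        }
    }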

ReMark Metrics and Reporting

Users of the benchmark cannot be required to publish their results. However, if results are reported, they should conform to a prescribed template. The results should include the hardware and software environment under which they were obtained, e.g., CPU type and clock speed, bus clock, disk speed and size, number of CPUs and disks, main memory capacity, OS, DBMS (if any), relevant middleware, etc. The results should completely describe all configuration options used for the RDF storage system, including the storage scheme and indexes, if any. In terms of metrics, the initial benchmark should focus on memory usage and response time; throughput measures are too difficult for now. The report should include the scaling factor for the database. Scale-up results that show relative performance at multiple scaling factors are optional but encouraged.

Specific metrics of interest include the memory footprint of the RDF storage system (both main-memory and on-disk), the total number of triples created, the total storage required for the triples (including space for indexes), the number of bytes per triple (the ratio of the two previous numbers) and the total load time for the data set. For each operation, the response time should be reported. It remains to be determined whether this should be an average of several executions (warm) or a first-time (cold) measurement. It would be desirable to normalize the response time measurements to simplify comparison across disparate systems. For example, some TPC benchmarks are reported in dollars per transaction. An open issue is how to do this accurately and easily for ReMark (it may not be possible, which is fine).

Finally, an open issue is whether the response time should be measured from the command line or from the API, i.e., within a program. The former seems more portable. The latter seems more accurate in terms of measuring just the operation of interest (no process creation overhead, etc.).

An ontology will be provided for the reporting sections, and it is expected that benchmark results would be published in a document using markup from this ontology. Additionally, the measurements of the benchmark operations could be published as an RDF graph.
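As an example of publishing the measurements themselves as RDF, the sketch below records one operation's average response time using a hypothetical reporting namespace; the actual reporting ontology remains to be defined.

    import org.apache.jena.rdf.model.*;

    public class ReportWriter {
        // Hypothetical reporting vocabulary; the real ontology is yet to be defined.
        static final String RPT = "http://example.org/remark/report#";

        static void reportResponseTime(String operationName, double averageMillis) {
            Model report = ModelFactory.createDefaultModel();
            Resource run = report.createResource(RPT + "run/1");
            Resource measurement = report.createResource()   // blank node for one measurement
                    .addProperty(report.createProperty(RPT, "operation"), operationName)
                    .addLiteral(report.createProperty(RPT, "avgMillis"), averageMillis);
            run.addProperty(report.createProperty(RPT, "measurement"), measurement);
            report.write(System.out, "RDF/XML");
        }
    }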

Miscellaneous

Due to the variety of platforms and languages used for RDF stores, it will be difficult to offer a complete implementation of the benchmark that can be downloaded and executed. The portable parts are the synthetic data generator, the ontologies and the reporting templates. The benchmark operations and test harness, however, would be specific to the storage engine. Still, it should be possible to develop generic test harnesses for specific programming languages, e.g., one for Java, one for C, etc. These harnesses could be shared among users.
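One possible shape for a shared Java harness is sketched below; the interface and class names are illustrative assumptions about what "generic" might mean in practice. The harness drives named operations through a small interface, and each storage engine supplies its own implementations.

    import java.util.Map;

    // A minimal sketch of a generic Java test harness; not part of the proposal itself.
    interface BenchmarkOperation {
        String name();
        void run();            // execute one (possibly batched) benchmark operation
    }

    class Harness {
        // Run every registered operation and print its elapsed time in milliseconds.
        static void runAll(Map<String, BenchmarkOperation> operations) {
            for (BenchmarkOperation op : operations.values()) {
                long start = System.currentTimeMillis();
                op.run();
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(op.name() + ": " + elapsed + " ms");
            }
        }
    }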

Conclusions

This proposal argues for the formation of a small working group to develop a benchmark suite for RDF storage systems. The benchmark would measure the response time of relatively simple RDF operations over a synthetically generated data set. The motivation is to keep the benchmark simple in order to facilitate rapid design, implementation and deployment. This initial benchmarking effort would serve as a learning experience leading to more complicated, application-specific benchmarks in the future.