RDF Store Benchmarking
This page collects references to RDF benchmarks, benchmarking results and papers about RDF benchmarking.
At the end of the page, we collect use cases for RDF benchmarking and offer ideas for future discussion on benchmarking triple stores.
- Berlin SPARQL Benchmark (BSBM) compares the performance of RDF and Named Graph stores, RDF-mapped relational databases, and other systems that expose SPARQL endpoints. It is designed around an e-commerce use case. SPARQL and SQL versions are available.
- Lehigh University Benchmark (LUBM) is developed to facilitate the evaluation of Semantic Web repositories in a standard and systematic way. The benchmark is intended to evaluate the performance of those repositories with respect to extensional queries over a large data set that commits to a single realistic ontology.
- University Ontology Benchmark (UOBM) extends the LUBM benchmark in terms of inference and scalability testing. UOBM ontology and data set.
- The SP2Bench SPARQL Performance Benchmark provides a scalable RDF data generator and a set of benchmark queries, designed to test typical SPARQL operator constellations and RDF data access patterns.
- Social Network Intelligence Benchmark (SIB) is a benchmark suite developed by people at CWI and OpenLink. It takes its schema from social networks to generate test areas where RDF/SPARQL can truly excel, and it challenges query processing over a highly connected graph.
- DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data.
- FedBench - Benchmark for measuring the performance of federated SPARQL query processing (ISWC2011 paper about FedBench).
- Linked Data Integration Benchmark (LODIB) is a benchmark for comparing the expressivity as well as the runtime performance of Linked Data translation/integration systems.
- JustBench analyses the performance of OWL reasoners based on justifications for entailments (project website).
- The ISLab Instance Matching Benchmark supports benchmarking of instance matching and identity resolution tools.
- THALIA Testbed for testing the expressiveness of relational-to-RDF mapping languages.
- A Benchmark for Spatial Semantic Web Systems, by Dave Kolas, extends the LUBM with sample spatial data.
- Linked Open Data Quality Assessment (LODQA) is a benchmark for comparing data quality assessment and data fusion systems.
- LinkBench - A Database Benchmark Based on the Facebook Social Graph (SIGMOD, 2013 Paper)
- Waterloo SPARQL Diversity Test Suite (WatDiv) is a highly customizable stress testing tool for RDF data management systems (ISWC, 2014 Paper).
Results provided by store implementers themselves:
- Virtuoso BSBM benchmark results (native RDF store versus mapped relational database)
- Jena TDB BSBM benchmark results (native RDF store)
- OWLIM Benchmark results (LUBM, BSBM and Linked Data loading/inference)
- SemWeb .NET library BSBM benchmark results
- Virtuoso LUBM benchmark results
- AllegroGraph 2.0 Benchmark for LUBM-50-0
- Sesame NativeStore LUBM benchmark results
- RacerPro LUBM benchmark results
- SwiftOWLIM benchmark results for the LUBM and City benchmark (from slide 27 onwards)
- Oracle 11g benchmark results for the LUBM and Uniprot benchmark (from slide 20 onwards)
- Jena SDB/Query performance and SDB/Loading performance
- Bigdata BSBM V3 Reduced Query Mix benchmark results
- Stardog LUBM, BSBM, and SP2B benchmark results
Results provided by third parties:
- Cudre-Mauroux, et al.: NoSQL Databases for RDF: An Empirical Evaluation (November 2013, Uses the BSBM benchmark with workloads from 10 million to 1 billion triples to benchmark several NoSQL databases).
- Peter Boncz, Minh-Duc Pham: Berlin SPARQL Benchmark Results for Virtuoso, Jena TDB, BigData, and BigOWLIM (April 2013, 100 million to 150 billion triples, Explore and Business Intelligence Use Cases).
- Christian Bizer, Andreas Schultz: Berlin SPARQL Benchmark Results for Virtuoso, Jena TDB, 4store, BigData, and BigOWLIM (February 2011, 100 and 200 million triples, Explore and Update Use Cases).
- Christian Bizer, Andreas Schultz: Berlin SPARQL Benchmark Results for Virtuoso, Jena TDB and BigOWLIM (November 2009, 100 and 200 million triples).
- L. Sidirourgos et al.: Column-Store Support for RDF Data Management: not all swans are white. An experimental analysis along two dimensions – triple-store vs. vertically-partitioned and row-store vs. column-store – individually, before analyzing their combined effects. In VLDB 2008.
- Christian Bizer, Andreas Schultz: Berlin SPARQL Benchmark Results. Benchmark along an e-commerce use case comparing Virtuoso, Sesame, Jena TDB, D2R Server and MySQL with datasets ranging from 250,000 to 100,000,000 triples and setting the results into relation to two RDBMS. 2008. (Note: As discussed in Orri Erling's blog, the SQL mix results did not accurately reflect steady-state of all players, and should be taken with a grain of salt. Warm-up steps will change for future runs.)
- Michael Schmidt et al.: SP2Bench: A SPARQL Performance Benchmark. Benchmark based on the DBLP data set comparing current versions of ARQ, Redland, Sesame, SDB, and Virtuoso. TR, 2008 (short version of the TR to appear in ICDE 2009).
- Michael Schmidt et al.: An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario. Benchmarking Relational Database schemes on top of SP2Bench Suite. In ISWC 2008.
- Atanas Kiryakov: Measurable Targets for Scalable Reasoning
- Baolin Liu and Bo Hu: An Evaluation of RDF Storage Systems for Large Data Applications
- Christian Becker: RDF Store Benchmarks with DBpedia comparing Virtuoso, SDB and Sesame. 2007
- Kurt Rohloff et al.: An Evaluation of Triple-Store Technologies for Large Data Stores. Comparing Sesame, Jena and AllegroGraph. 2007
- Christian Weiske: SPARQL Engines Benchmark Results
- Ryan Lee: Scalability Report on Triple Store Applications comparing Jena, Kowari, 3store, Sesame. 2004
- Martin Svihala, Ivan Jelinek: Benchmarking RDF Production Tools Paper comparing the performance of relational database to RDF mapping tools (METAmorphoses, D2RQ, SquirrelRDF) with native RDF stores (Jena, Sesame)
- Michael Streatfield, Hugh Glaser: Benchmarking RDF Triplestores, 2005
Publications about RDF Benchmarking
- Carlos Rivero, Andreas Schultz, Christian Bizer, David Ruiz: Benchmarking the Performance of Linked Data Translation Systems. 5th Linked Data on the Web Workshop (LDOW2012), April 2012.
- Songyun Duan et al: Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets, SIGMOD 2011.
- Y. Guo, Z. Pan, and J. Heflin: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2), 2005, pp. 158-182
- Christian Bizer, Andreas Schultz: The Berlin SPARQL Benchmark. In: International Journal on Semantic Web & Information Systems, Vol. 5, Issue 2, Pages 1-24, 2009.
- Orri Erling and Ivan Mikhailov: Towards Web-Scale RDF
- Yuanbo Guo et al: A Requirements Driven Framework for Benchmarking Semantic Web Knowledge Base Systems
- Li Ma et al.: Towards a Complete OWL Ontology Benchmark (UOBM)
- Timo Weithöner et al.: What’s Wrong with OWL Benchmarks?
Blog Posts about RDF Benchmarking
- Peter Boncz: Big Data RDF Store Benchmarking Experiences (May 2013)
- Orri Erling: Suggested Extensions to the BSBM (September 2010)
- Orri Erling: Virtuoso Vs. MySQL: Setting the Berlin Record Straight
- Orri Erling: ISWC 2008: The Scalable Knowledge Systems Workshop
- Orri Erling: Virtuoso - Are We Too Clever for Our Own Good?
- Orri Erling: Virtuoso Update, Billion Triples and Outlook
- Orri Erling: A quick look at the SP2B SPARQL Performance Benchmark
- Orri Erling: Configuring Virtuoso for Benchmarking
- Orri Erling: BSBM With Triples and Mapped Relational Data
- Orri Erling: Virtuoso Optimizations for the Berlin SPARQL Benchmark
- Orri Erling: DBpedia Benchmark Revisited
- Orri Erling: What's Wrong With LUBM?
- Orri Erling: LUBM results with Virtuoso 6.0
- Orri Erling: Latest LUBM Benchmark results for Virtuoso
- Orri Erling: Virtuoso LUBM Load Update
- Orri Erling: RDF Data Integration Benchmarking
- Orri Erling: RDF Benchmarking, Role, Motives and Rationale
- Orri Erling: Social Web RDF Store Benchmark
Large real-world data sets that could be used for benchmarking
- Web Data Commons Dataset 2012: 7.3 billion quads of current RDFa, Microdata and Microformat data extracted from the Common Crawl.
- Billion Triple Challenge Dataset 2011: 2 billion RDF triples that have been crawled from the Web in May/June 2011.
- UNIPROT: 380 million triples describing proteins.
- US Census data set: 1 billion triples of census data.
- DBpedia data set: 100+ million triples derived from Wikipedia.
- HCLS Data from Banff Demo: 450 million triples of life science data.
Workshops and Events
- 9th International Workshop on Scalable Semantic Web Knowledge Base Systems at ISWC2013, October 21, 2013.
- 1st International Workshop On Benchmarking RDF Systems (BeRSys 2013) at ESWC2013, May 26, 2013.
- 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2008) at ISWC2008, Karlsruhe, Germany, October 26-30, 2008
- 3rd International Workshop On Scalable Semantic Web Knowledge Base Systems (SSWS 2007), Vilamoura, Portugal, Nov 27, 2007. (Accepted Papers)
- Sixth International Workshop on Evaluation of Ontology-based tools and the Semantic Web Service Challenge (EON-SWSC2008), at ESWC 2008, Tenerife, Spain, 1st June 2008
Use Cases and Future ideas
- Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.
- Recursive rule application. The simpler cases of this are things like transitive closure.
- Mapping of relational data to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the TPC-D, TPC-H, and TPC-R benchmarks consist exclusively of aggregates and grouping, but SPARQL 1.0 does not have these.
Benchmarking Triple Stores
An RDF benchmark suite should meet the following criteria:
- Have a single scale factor.
- Produce a single metric, "queries per unit of time", for example. The metric should be concisely expressible, for example, "10 qpsR at 100M, options 1, 2, 3". Due to the heterogeneous nature of the systems under test, the result's short form likely needs to specify the metric, scale and options included in the test.
- Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.
- Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.
- Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person, etc., can be measured from real files.
- Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run, and continuous query load.
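The short result form described in the criteria above could be produced mechanically. The following is only a minimal sketch in Python: the function names, the K/M/B scale suffixes, and the idea of numbering options as integers are assumptions made for illustration, not part of any agreed specification.

```python
def format_scale(triples):
    """Abbreviate a triple count, e.g. 100_000_000 -> '100M' (suffixes are an assumption)."""
    for factor, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
        if triples >= factor and triples % factor == 0:
            return f"{triples // factor}{suffix}"
    return str(triples)


def format_result(rate, metric, triples, options):
    """Build the short form: metric value, scale, and the optional parts that were run."""
    text = f"{rate:g} {metric} at {format_scale(triples)}"
    opts = ", ".join(str(o) for o in sorted(options))
    return f"{text}, options {opts}" if opts else text


print(format_result(10, "qpsR", 100_000_000, [1, 2, 3]))
```

Keeping the metric name, scale, and options in one string makes results from heterogeneous systems comparable at a glance, which is the point of the criterion.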
The query load should illustrate the following types of operations:
- Basic lookups, such as would be made for filling in a person's home page in a social networks app. List data of user plus names and emails of friends. Relatively short
- Graph operations like shortest path from individual to individual in a social network
- Selecting data with drill down, as in faceted browsing. For example, start with articles having tag t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2, and so forth.
- Retrieving all closely related nodes, as in composing a SIOC snapshot over a person's post in different communities, the recent activity report for a forum, etc. These will be construct or describe queries. The coverage of describe is unclear, hence construct may be better.
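As an illustration of the drill-down operation in the list above, here is a minimal in-memory sketch in Python. The article identifiers, tags, and the `drill_down` function are hypothetical; a real benchmark would express the same steps as SPARQL queries against the store under test.

```python
# Hypothetical sample data: article id -> set of tags.
ARTICLES = {
    "a1": {"rdf", "sparql", "benchmark"},
    "a2": {"rdf", "owl"},
    "a3": {"rdf", "sparql"},
    "a4": {"sql", "benchmark"},
}


def drill_down(articles, selected):
    """Return (articles carrying all selected tags, other distinct tags on those articles)."""
    matches = {a for a, tags in articles.items() if selected <= tags}
    co_tags = set().union(*(articles[a] for a in matches)) - selected
    return matches, co_tags


# First step: articles with tag t, and the distinct tags that co-occur with it.
matches, tags = drill_down(ARTICLES, {"rdf"})
print(sorted(matches), sorted(tags))

# Second step: narrow by a second tag t2 chosen from the previous result.
matches, tags = drill_down(ARTICLES, {"rdf", "sparql"})
print(sorted(matches), sorted(tags))
```

Each step narrows the match set and recomputes the remaining facets, so the cost of a session depends on how quickly the selected tags reduce the candidate articles.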
If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries-per-second metric, we can define the mix similarly to TPC-C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
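The counting rule in the preceding paragraph can be sketched as follows. This is a hedged illustration in Python: the query names, the log format (one entry per executed query), and both function names are assumptions, not part of a defined benchmark.

```python
from collections import Counter


def main_query_rate(executions, main_query, run_seconds):
    """Executions of the main query divided by the running time in seconds."""
    main_count = sum(1 for name in executions if name == main_query)
    return main_count / run_seconds


def secondary_per_ten(executions, main_query):
    """Executions of each secondary query per 10 executions of the main query."""
    counts = Counter(executions)
    main = counts[main_query]
    if main == 0:
        raise ValueError("the main query was never executed")
    return {name: 10 * n / main for name, n in counts.items() if name != main_query}


# Hypothetical execution log for a 20 second run.
log = ["profile"] * 100 + ["friends_of_friends"] * 30 + ["shortest_path"] * 10
print(main_query_rate(log, "profile", 20))
print(secondary_per_ten(log, "profile"))
```

Only the main query counts toward the metric; the secondary queries run inside the same interval and are reported as a ratio so that the mix can be checked against the specification.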
Full Disclosure Report
The report contains basic TPC-like items such as:
- Metric qps/scale/options
- Software used, DBMS, RDF toolkit if separate
- Hardware - number, clock, and type of CPUs per machine; number of machines in cluster; RAM and disk per machine; manufacturer; price of hardware/software
These can go into a summary spreadsheet that is just like the TPC ones.
Additionally, the full report should include:
- Configuration files for DBMS, web server, other components.
- Parameters for test driver, i.e., total number of clicks, how many clicks concurrently. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as function of concurrent clients is a lot of work and maybe not necessary here.
- Duration in real time. Since for any large database with a few G of working set the warm up time is easily 30 minutes, the warm up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a "steady state," as defined in the TPC rules.
- Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages, or any other server side software-like thing that exists or is modified for the purpose of the test.
- Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.
- Database sizes. For a preallocated database of n G, how much was free after the initial load, and how much after the test run? How many bytes per triple?
- CPU and I/O. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For I/O, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them.
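Two of the report items above, the warm-up exclusion and the bytes-per-triple figure, reduce to simple arithmetic. The sketch below is in Python and assumes that the driver records one completion timestamp (in seconds since the start of the run) per query; the function names and the input format are illustrative assumptions.

```python
MIN_MEASURED_SECONDS = 3600  # the measured interval should not be less than 1h


def steady_state_rate(completion_times, warmup_seconds, run_end):
    """Queries per second over [warmup_seconds, run_end), with the warm-up excluded."""
    measured = run_end - warmup_seconds
    if measured < MIN_MEASURED_SECONDS:
        raise ValueError("measured interval is shorter than 1h")
    counted = sum(1 for t in completion_times if warmup_seconds <= t < run_end)
    return counted / measured


def bytes_per_triple(allocated_bytes, free_bytes, triple_count):
    """Space actually used by the database divided by the number of triples."""
    return (allocated_bytes - free_bytes) / triple_count


# Hypothetical run: one completion every 0.5 s for 90 minutes, 30 minutes of warm-up.
times = [i * 0.5 for i in range(10800)]
print(steady_state_rate(times, warmup_seconds=1800, run_end=5400))

# Hypothetical sizes: 10 GiB preallocated, 4 GiB free after loading 100M triples.
print(bytes_per_triple(10 * 2**30, 4 * 2**30, 100_000_000))
```

The warm-up time is still reported, but it does not contribute to the rate. Whether the remaining interval really is a steady state has to be judged separately, for example by comparing the rate across sub-intervals.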
OpenLink has a multi-threaded 'C' program that simulates n web users multiplexed over m threads, e.g., 10000 users over 100 threads. Each user has its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart, etc. This can be modified to browse a social network database and produce the desired query mix. This generates HTTP requests; hence, it would work against a SPARQL end-point or any set of dynamic web pages.
The program produces a running report of the clicks-per-second rate, and a full set of statistics at the end, listing the min/avg/max times per operation.
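The OpenLink program itself is not shown here. The following Python sketch only illustrates the general idea of multiplexing n simulated users over m threads and collecting min/avg/max times per operation. The usage pattern, the `fake_request` stand-in (which sleeps instead of issuing an HTTP request), and all names are assumptions, and the sketch prints a final summary only, not a running report.

```python
import queue
import threading
import time
from collections import defaultdict

# Hypothetical usage pattern; every simulated user walks through it in order.
USAGE_PATTERN = ["check_mail", "browse_catalogue", "add_to_cart"]


def fake_request(operation):
    """Stand-in for an HTTP request to a SPARQL endpoint or dynamic web page."""
    time.sleep(0.001)


def run_driver(n_users, m_threads, request=fake_request):
    """Simulate n_users over m_threads; return (clicks, clicks/s, per-operation stats)."""
    ready = queue.Queue()
    for user_id in range(n_users):
        ready.put((user_id, 0))  # user state: index of the next operation
    timings = defaultdict(list)
    lock = threading.Lock()

    def worker():
        while True:
            try:
                user_id, step = ready.get_nowait()
            except queue.Empty:
                return  # no user is waiting for this thread
            operation = USAGE_PATTERN[step]
            start = time.perf_counter()
            request(operation)
            elapsed = time.perf_counter() - start
            with lock:
                timings[operation].append(elapsed)
            if step + 1 < len(USAGE_PATTERN):
                ready.put((user_id, step + 1))  # the user rejoins the queue

    threads = [threading.Thread(target=worker) for _ in range(m_threads)]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_seconds = time.perf_counter() - started
    clicks = sum(len(v) for v in timings.values())
    stats = {op: (min(v), sum(v) / len(v), max(v)) for op, v in timings.items()}
    return clicks, clicks / total_seconds, stats


clicks, rate, stats = run_driver(n_users=50, m_threads=5)
print(f"{clicks} clicks, {rate:.1f} clicks/s")
for op, (lo, avg, hi) in stats.items():
    print(f"{op}: min={lo:.4f}s avg={avg:.4f}s max={hi:.4f}s")
```

Because each thread handles one request at a time, no more than m requests are in flight at any moment, while the per-user step index keeps the users independent of each other.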
This can be packaged as a separate open source download once the test spec is agreed upon.
For generating test data, a modification of the LUBM generator is probably the most convenient choice.
Benchmarking Relational-to-RDF Mapping
This area is somewhat more complex than triple storage.
At least the following factors enter into the evaluation:
- Degree of SPARQL compliance. For example, can one have a variable as a predicate? Are there limits on
- Are the data being queried split over multiple RDBMS and JOINed between them?
- Type of use case: Is this about navigational lookups (OLTP) or about statistics (OLAP)? It would be the former, as SPARQL does not really have aggregation (except through extensions). Still, many of the interesting queries are about comparing large data sets.
The rationale for mapping relational data to RDF is often data integration. Even in simple cases like OpenLink's ODS applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.
A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.
A real world case is OpenLink's ongoing work for mapping Wordpress, Mediawiki, phpBB, and possibly other popular web applications, into SIOC.
Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above-proposed triple-store benchmark. The query mix might have to be somewhat tailored.
Another "enterprise style" scenario might be to take the TPC-C and TPC-D databases (after all both have products, customers, and orders), and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.
Considering the times and the audience, the Wordpress/Mediawiki scenario might be culturally more interesting and more fun to demo.
The test has two aspects: Throughput and coverage. I think these should be measured separately.
Throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."
Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known at run time only, or if the graph is not given -- we get a union of everything joined with another union of everything and many of the joins between the terms of the different unions are identically empty but the software may not know this.
In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".
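To make the "union of everything joined with another union of everything" effect concrete, here is a small Python sketch. The mapping table, the table names, the IRI prefixes, and the pruning rule are all hypothetical; the sketch does not describe how any particular mapping product works.

```python
from itertools import product

# Hypothetical mapping: table -> (subject IRI prefix, predicates the table can produce).
MAPPING = {
    "wp_posts": ("http://example.org/wp/post/", {"sioc:content", "dc:title", "sioc:has_creator"}),
    "mw_page": ("http://example.org/mw/page/", {"sioc:content", "dc:title"}),
    "phpbb_posts": ("http://example.org/phpbb/post/", {"sioc:content", "sioc:has_creator"}),
}


def candidates(predicate):
    """Tables that can answer a triple pattern; None means the predicate is a variable."""
    return [t for t, (_, preds) in MAPPING.items() if predicate is None or predicate in preds]


def expand_join(pred_a, pred_b):
    """Expand `?s pred_a ?x . ?s pred_b ?y` into table pairs: (naive, after pruning)."""
    naive = list(product(candidates(pred_a), candidates(pred_b)))
    # Subjects minted from different IRI prefixes can never be equal, so a join
    # on ?s across two different tables is identically empty and can be dropped.
    pruned = [(a, b) for a, b in naive if MAPPING[a][0] == MAPPING[b][0]]
    return naive, pruned


# Both predicates known only at run time: every table joins with every table.
naive, pruned = expand_join(None, None)
print(len(naive), len(pruned))

# Both predicates given: far fewer branches even before pruning.
naive, pruned = expand_join("dc:title", "sioc:has_creator")
print(len(naive), pruned)
```

With three tables the naive expansion is already nine joins, of which six are identically empty. A mapping layer that can prove this from the IRI patterns only needs to run the remaining three.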
It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.