Difference between revisions of "RdfStoreBenchmarking"

From W3C Wiki
== Blog Posts about RDF Benchmarking ==

* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1802 In Hoc Signo Vinces (part 15): TPC-H and the Science of Hash] (2014-05)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1800 In Hoc Signo Vinces (part 14): Virtuoso TPC-H Implementation Analysis] (2014-05)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1798 In Hoc Signo Vinces (part 13): Virtuoso TPC-H Kit Now on V7 Fast Track] (2014-04)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1796 In Hoc Signo Vinces (part 12): TPC-H: Result Preview] (2014-04)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1793 In Hoc Signo Vinces (part 11): TPC-H Q2, Q10 - Late Projection] (2014-04)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1789 In Hoc Signo Vinces (part 10): TPC-H Q9, Q17, Q20 - Predicate Games] (2014-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1779 In Hoc Signo Vinces (part 9): TPC-H Q18, Ordered Aggregation, and Top K] (2014-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1756 In Hoc Signo Vinces (part 8): TPC-H: INs, Expressions, ORs] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1755 In Hoc Signo Vinces (part 7): TPC-H Q13: The Good and the Bad Plans] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1753 In Hoc Signo Vinces (part 6): TPC-H Q1 and Q3: An Introduction to Query Plans] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1747 In Hoc Signo Vinces (part 5): The Return of SQL Federation] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1744 In Hoc Signo Vinces (part 4): Bulk Load and Refresh] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1742 In Hoc Signo Vinces (part 3): Benchmark Configuration Settings] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1741 In Hoc Signo Vinces (part 2): TPC-H Schema Choices] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1739 In Hoc Signo Vinces (part 1): Virtuoso meets TPC-H] (2013-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1733 E Pluribus Unum, or, Star Schema Meets Cluster] (2013-07)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1732 Annuit Coeptis, or, Star Schema and The Cost of Freedom] (2013-06)
* Peter Boncz: [http://lod2.eu/BlogPost/1584-big-data-rdf-store-benchmarking-experiences.html Big Data RDF Store Benchmarking Experiences] (2013-05)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1726 LDBC: A Socio-technical Perspective] (2012-12)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1724 LDBC - the Linked Data Benchmark Council] (2012-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1720 LDBC Technical User Community Meeting] (2012-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1684 Benchmarks, Redux (part 15): BSBM Test Driver Enhancements] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1683 Benchmarks, Redux (part 14): BSBM BI Mix] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1682 Benchmarks, Redux (part 13): BSBM BI Modifications] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1681 Benchmarks, Redux (part 12): Our Own BSBM Results Report] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1678 Benchmarks, Redux (part 11): The Substance of RDF Benchmarks] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1677 Benchmarks, Redux (part 10): LOD2 and the Benchmark Process] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1675 Benchmarks, Redux (part 9): BSBM With Cluster] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1673 Benchmarks, Redux (part 8): BSBM Explore and Update] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1671 Benchmarks, Redux (part 7): What Does BSBM Explore Measure?] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1669 Benchmarks, Redux (part 6): BSBM and I/O, continued] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1667 Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1665 Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1663 Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore] (2011-03)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1660 Benchmarks, Redux (part 2): A Benchmarking Story] (2011-02)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1658 Benchmarks, Redux (part 1): On RDF Benchmarks] (2011-02)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1631 Suggested Extensions to the BSBM] (2010-09)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1484 Virtuoso Vs. MySQL: Setting the Berlin Record Straight] (2008-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1471 ISWC 2008: The Scalable Knowledge Systems Workshop] (2008-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1465 Virtuoso - Are We Too Clever for Our Own Good?] (2008-10)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1448 Virtuoso Update, Billion Triples and Outlook] (2008-10)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1422 A quick look at the SP2B SPARQL Performance Benchmark] (2008-09)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1418 Configuring Virtuoso for Benchmarking] (2008-08)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1409 BSBM With Triples and Mapped Relational Data] (2008-08)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1400 Virtuoso Optimizations for the Berlin SPARQL Benchmark] (2008-08)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1358 DBpedia Benchmark Revisited] (2008-05)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1312 What's Wrong With LUBM?] (2008-02)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1308 LUBM results with Virtuoso 6.0] (2008-02)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1304 Latest LUBM Benchmark results for Virtuoso] (2008-02)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1284 Virtuoso LUBM Load Update] (2007-12)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1276 RDF Data Integration Benchmarking] (2007-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1274 RDF Benchmarking, Role, Motives and Rationale] (2007-11)
* Orri Erling: [http://www.openlinksw.com/weblog/oerling/?id=1269 Social Web RDF Store Benchmark] (2007-11)

Revision as of 14:47, 21 August 2014

RDF Store Benchmarking

This page collects references to RDF benchmarks, benchmarking results and papers about RDF benchmarking.

At the end of the page, we collect use cases for RDF benchmarking and offer ideas for future discussion on benchmarking triple stores.

RDF Benchmarks

Benchmarking Results

Results provided by store implementers themselves:

Results provided by third parties:

Publications about RDF Benchmarking

Blog Posts about RDF Benchmarking

Large real-world data sets that could be used for benchmarking

SPARQL Compliance

SPARQL Implementation Coverage Report (results of running the DAWG SPARQL test cases against different RDF stores)

Workshops and Events

Use Cases and Future ideas

Use Cases

  • Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.
  • Recursive rule application. The simpler cases of this are things like transitive closure.
  • Mapping of relational data to RDF. Relational benchmarks such as the TPC family are well established, so the schemas and test data generation can come from there. The problem is that the TPC-D/H/R benchmarks consist exclusively of aggregates and grouping, but SPARQL does not have these.

Benchmarking Triple Stores

An RDF benchmark suite should meet the following criteria:

  • Have a single scale factor.
  • Produce a single metric, "queries per unit of time", for example. The metric should be concisely expressible, for example, "10 qpsR at 100M, options 1, 2, 3". Due to the heterogeneous nature of the systems under test, the result's short form likely needs to specify the metric, scale and options included in the test.
  • Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.
  • Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.
  • Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person, etc., can be measured from real files.
  • Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run, and continuous query load.
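
The short form of a result described above could be produced mechanically from the three things it has to name: the metric, the scale, and the options included. The following is a minimal sketch under that assumption; the class and field names are hypothetical and not part of any existing benchmark kit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical container for the three parts of a result's short form."""
    rate: float           # executions of the main query per unit of time
    scale: str            # the single scale factor, e.g. "100M" triples
    options: tuple = ()   # optional parts of the benchmark that were included
    metric: str = "qpsR"  # name of the metric, as in the example above

    def short_form(self) -> str:
        text = f"{self.rate:g} {self.metric} at {self.scale}"
        if self.options:
            text += ", options " + ", ".join(str(o) for o in self.options)
        return text


result = BenchmarkResult(rate=10, scale="100M", options=(1, 2, 3))
print(result.short_form())  # 10 qpsR at 100M, options 1, 2, 3
```

Keeping the options in the short form matters because two results at the same scale are only comparable when the same optional parts were run.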

The query load should illustrate the following types of operations:

  • Basic lookups, such as would be made for filling in a person's home page in a social network app. List data of user plus names and emails of friends. Relatively short JOINs, UNIONs, and OPTIONALs.
  • Graph operations like shortest path from individual to individual in a social network.
  • Selecting data with drill down, as in faceted browsing. For example, start with articles having tag t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2, and so forth.
  • Retrieving all closely related nodes, as in composing a SIOC snapshot over a person's posts in different communities, the recent activity report for a forum, etc. These will be CONSTRUCT or DESCRIBE queries. The coverage of DESCRIBE is unclear, hence CONSTRUCT may be better.
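
To make the drill-down operation concrete, here is a small in-memory sketch of the tag example: given the tags selected so far, it returns the matching articles and the distinct remaining tags offered for the next step. The data and the function name are invented for illustration; a real benchmark would issue the equivalent SPARQL queries against the store.

```python
def drill_down(article_tags, selected):
    """Return (matching articles, distinct other tags) for the selected tags.

    article_tags maps an article id to the set of its tags (illustrative data).
    """
    selected = set(selected)
    matching = {a for a, tags in article_tags.items() if selected <= set(tags)}
    remaining = set()
    for article in matching:
        remaining |= set(article_tags[article]) - selected
    return matching, remaining


# Made-up sample data, not taken from any real data set.
ARTICLES = {
    "a1": {"rdf", "sparql", "benchmark"},
    "a2": {"rdf", "sioc"},
    "a3": {"sparql", "benchmark"},
    "a4": {"rdf", "benchmark", "virtuoso"},
}

articles, tags = drill_down(ARTICLES, ["rdf"])               # first step: tag t
articles2, tags2 = drill_down(ARTICLES, ["rdf", "benchmark"])  # then t and t2
print(sorted(articles), sorted(tags))
print(sorted(articles2), sorted(tags2))
```

Each step narrows the article set and recomputes the distinct tags, which is what makes this workload heavier than a plain lookup.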

If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries-per-second metric, we can define the mix similarly to TPC-C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
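
As a rough sketch of that arithmetic, the metric counts only executions of the main query, while the secondary queries are reported as a ratio per 10 main executions. The query names and counts below are hypothetical; the actual mix would have to come from the agreed specification.

```python
def queries_per_second(execution_counts, main_query, elapsed_seconds):
    """Metric: executions of the main query divided by running time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return execution_counts.get(main_query, 0) / elapsed_seconds


def mix_ratios(execution_counts, main_query, per=10):
    """Secondary-query executions per `per` executions of the main query."""
    main = execution_counts.get(main_query, 0)
    if main == 0:
        raise ValueError("the main query was never executed")
    return {q: per * n / main
            for q, n in execution_counts.items() if q != main_query}


# Hypothetical counts from a one-hour measured interval.
counts = {"profile_lookup": 36000, "shortest_path": 3600, "facet_drill_down": 7200}
print(queries_per_second(counts, "profile_lookup", 3600))  # 10.0
print(mix_ratios(counts, "profile_lookup"))
```

Checking the observed ratios against the specified mix is what keeps a run from inflating the metric by skipping the more complex secondary queries.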

Full Disclosure Report

The report contains basic TPC-like items such as:

  • Metric qps/scale/options
  • Software used, DBMS, RDF toolkit if separate
  • Hardware - number, clock, and type of CPUs per machine; number of machines in cluster; RAM and disk per machine; manufacturer; price of hardware/software

These can go into a summary spreadsheet that is just like the TPC ones.

Additionally, the full report should include:

  • Configuration files for DBMS, web server, other components.
  • Parameters for test driver, i.e., total number of clicks, how many clicks concurrently. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as function of concurrent clients is a lot of work and maybe not necessary here.
  • Duration in real time. For any large database with a working set of a few GB, the warm-up time is easily 30 minutes, so the warm-up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a "steady state," as defined in the TPC rules.
  • Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages, or any other server side software-like thing that exists or is modified for the purpose of the test.
  • Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.
  • Database sizes. For a preallocated database of n GB, how much was free after the initial load, and how much after the test run? How many bytes per triple?
  • CPU and I/O. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For I/O, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them.
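
Two of the report items above reduce to simple arithmetic: throughput over the measured interval with warm-up excluded, and bytes per triple from the database sizes. The sketch below assumes the test driver logs a completion time (seconds since the start of the run) for every main-query execution; the function names are hypothetical, and the code only enforces the minimum interval length, not the TPC definition of steady state.

```python
def steady_state_qps(completion_times, warmup_seconds, run_seconds,
                     min_interval_seconds=3600):
    """Main-query executions per second, counting only the measured interval."""
    measured = run_seconds - warmup_seconds
    if measured < min_interval_seconds:
        raise ValueError("measured interval is shorter than the required minimum")
    counted = sum(1 for t in completion_times
                  if warmup_seconds < t <= run_seconds)
    return counted / measured


def bytes_per_triple(allocated_bytes, free_bytes, triple_count):
    """Space actually used by the database divided by the number of triples."""
    return (allocated_bytes - free_bytes) / triple_count


# Illustrative run: one completion per second for 90 minutes, 30 minutes warm-up.
times = range(1, 5401)
print(steady_state_qps(times, warmup_seconds=1800, run_seconds=5400))  # 1.0
print(bytes_per_triple(10 * 2**30, 4 * 2**30, 100_000_000))
```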

Test Drivers

OpenLink has a multi-threaded C program that simulates n web users multiplexed over m threads, e.g., 10000 users over 100 threads. Each user has its own state, so users carry out their respective usage patterns independently and are served as soon as the server is available, while no more than m requests are in flight at any time. The usage pattern is something like: check the mail, browse the catalogue, add to shopping cart, etc. This can be modified to browse a social network database and produce the desired query mix. The program generates HTTP requests; hence, it would work against a SPARQL end-point or any set of dynamic web pages.

The program produces a running report of the clicks-per-second rate, and a full set of statistics at the end, listing the min/avg/max times per operation.
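
The following Python sketch illustrates the multiplexing idea only; it is not OpenLink's C program and does not reproduce its interface. Each simulated user is a small piece of state (its next step in the usage pattern) held in a queue, and m worker threads each take a user, issue that user's next request, and put the user back. The `send_request` callable is a hypothetical stand-in for the HTTP request.

```python
import queue
import threading
import time
from collections import defaultdict


def run_driver(n_users, m_threads, pattern, send_request):
    """Multiplex n_users simulated users over m_threads worker threads.

    pattern: operation names every user performs in order.
    send_request: callable(user_id, operation), a stand-in for an HTTP request.
    Returns ({operation: (min, avg, max) seconds}, clicks per second).
    """
    if not pattern:
        return {}, 0.0
    ready = queue.Queue()
    for user_id in range(n_users):
        ready.put((user_id, 0))  # (user, index of its next step)
    timings = defaultdict(list)
    lock = threading.Lock()

    def worker():
        while True:
            try:
                user_id, step = ready.get_nowait()
            except queue.Empty:
                return  # nothing left for this thread to do
            operation = pattern[step]
            start = time.perf_counter()
            send_request(user_id, operation)
            elapsed = time.perf_counter() - start
            with lock:
                timings[operation].append(elapsed)
            if step + 1 < len(pattern):
                ready.put((user_id, step + 1))  # user continues later

    began = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(m_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - began
    clicks = sum(len(v) for v in timings.values())
    stats = {op: (min(v), sum(v) / len(v), max(v)) for op, v in timings.items()}
    return stats, (clicks / wall if wall > 0 else 0.0)


# Demo with a no-op request; a real run would call an end-point here.
stats, rate = run_driver(100, 10, ["check_mail", "browse", "add_to_cart"],
                         lambda user_id, operation: None)
print(sorted(stats), round(rate))
```

Because a user is re-queued only after its request completes, each user's steps stay in order, and the number of worker threads bounds the number of requests in flight.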

This can be packaged as a separate open source download once the test spec is agreed upon.

For generating test data, a modification of the LUBM generator is probably the most convenient choice.

Benchmarking Relational-to-RDF Mapping

This area is somewhat more complex than triple storage.

At least the following factors enter into the evaluation:

  • Degree of SPARQL compliance. For example, can one have a variable as a predicate? Are there limits on OPTIONALs and UNIONs?
  • Are the data being queried split over multiple RDBMS and JOINed between them?
  • Type of use case: Is this about navigational lookups (OLTP) or about statistics (OLAP)? It would be the former, as SPARQL does not really have aggregation (except through extensions). Still, many of the interesting queries are about comparing large data sets.

The rationale for mapping relational data to RDF is often data integration. Even in simple cases like OpenLink's ODS applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.

A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.

A real world case is OpenLink's ongoing work for mapping WordPress, MediaWiki, phpBB, and possibly other popular web applications, into SIOC.

Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above-proposed triple-store benchmark. The query mix might have to be somewhat tailored.

Another "enterprise style" scenario might be to take the TPC-C and TPC-D databases (after all both have products, customers, and orders), and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.

Considering the times and the audience, the WordPress/MediaWiki scenario might be culturally more interesting and more fun to demo.

The test has two aspects: Throughput and coverage. I think these should be measured separately.

Throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."

Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known only at run time, or if the graph is not given, we get a union of everything joined with another union of everything. Many of the joins between the terms of the different unions are identically empty, but the software may not know this.
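
The blow-up can be illustrated with a toy model. Below, each triple pattern expands into one alternative per mapped (table, predicate) pair, and two patterns sharing a subject are joined pairwise. The mapping is invented for illustration, and the pruning rule rests on a loudly stated assumption: subject IRIs minted from different tables never coincide, so a cross-table join on the subject is identically empty.

```python
from itertools import product

# Hypothetical mapping: which predicates each relational table is mapped to.
MAPPING = [
    ("wp_posts", "sioc:content"),
    ("wp_posts", "dcterms:created"),
    ("wp_users", "foaf:name"),
    ("phpbb_posts", "sioc:content"),
    ("phpbb_posts", "dcterms:created"),
    ("phpbb_users", "foaf:name"),
]


def expand(predicate=None):
    """Alternatives for one triple pattern; None means the predicate is unknown."""
    return [m for m in MAPPING if predicate is None or m[1] == predicate]


def join_on_subject(left, right, prune=True):
    """Pairs of alternatives to evaluate for two patterns sharing a subject.

    With prune=True, pairs from different tables are dropped, under the
    assumption that subjects of different tables are always distinct.
    """
    pairs = product(left, right)
    if prune:
        return [(a, b) for a, b in pairs if a[0] == b[0]]
    return list(pairs)


unknown = expand()  # predicate known at run time only
print(len(join_on_subject(unknown, unknown, prune=False)))  # 36
print(len(join_on_subject(unknown, unknown)))               # 10
known = join_on_subject(expand("sioc:content"), expand("dcterms:created"))
print(len(known))                                           # 2
```

Even in this small model, 26 of the 36 unpruned join terms can never produce a row, which is the work a mapping layer wastes when it cannot prove the emptiness.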

In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".

It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.