LargeTripleStores

This page collects references to signed quotes about deployments of large triple stores, rather than predictions of what some software might scale to.

Oracle Spatial and Graph with Oracle Database - 1.08 Trillion triples (edges)

LUBM 4400k: 1.08 Trillion triples loaded, inferenced, and queried executing the Lehigh University Benchmark (LUBM) on an Oracle Exadata Database Machine X4-2.
A graph containing 1.08 trillion triples (edges) about universities and their departments was created and organized into 4.4 million named graphs by expanding the triples into quads, one named graph per university. The overall graph included 605.4 Billion unique asserted quads, and inference produced another 475.6+ Billion quads.
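
A minimal sketch of the quad expansion described above (illustrative URIs only, not the benchmark's actual generator): each triple is tagged with the named graph of its university, turning (s, p, o) into (s, p, o, g).

  # Sketch of expanding LUBM-style triples into quads, one named graph per
  # university. All URIs below are hypothetical.
  def graph_for_university(university_id):
      # One named graph URI per university (assumed naming scheme).
      return "http://example.org/graphs/university%d" % university_id

  def expand_to_quad(triple, university_id):
      s, p, o = triple
      return (s, p, o, graph_for_university(university_id))

  triple = ("http://example.org/univ42/dept3/prof7",
            "http://example.org/ontology#worksFor",
            "http://example.org/univ42/dept3")
  print(expand_to_quad(triple, 42))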

  • Data Loading Performance: 1.420 million QLIPS (Quads Loaded* and Indexed Per Second), a total of 115.2 hours to load 605.4 billion quads and create two indexes. (*Loading included checking the quads were well formed and removing duplicates.)
  • Inference Performance: 1.527 million TIIPS (Triples Inferred and Indexed Per Second), a total of 86.5 hours to infer over 475.6 billion triples and create two indexes.
  • SPARQL Query Performance: 1.130 million QRPS (Query Results Per Second), a total of 22.5 hours to generate 92.5 billion answers.

Setup:
Hardware: Oracle Exadata Database Machine X4-2 High Capacity full rack with eight Exadata compute nodes; ZS3-2 storage with 2 controllers and 8 trays of disks
Software: Oracle 12.1.0.1 DB standard install on Exadata, SGA_TARGET=132GB and PGA_AGGREGATE_TARGET=100G, Open cursors=1000, Processes=1000, 32K blocksize given to all graph tablespaces, TEMP group created with 3 bigfile tablespaces.
The test was performed September 2014.

LUBM 200k: 48+ Billion triples
A graph containing over 48 Billion triples about universities and their departments was created and organized into 200,000 named graphs by expanding the triples into quads, one named graph per university. The overall graph included 26.6 Billion unique asserted quads, and inference produced another 21.4 Billion quads.

  • Data Loading Performance: 273K QLIPS (Quads Loaded* and Indexed Per Second) - 27.4 billion quads loaded* in 13 hrs 11 min. + two indexes created in 11 hrs 18 min. = 24 hrs 29 min. (*Loading included checking the quads were well formed and removing 0.8 billion duplicates.)
  • Inference Performance: 327K TIIPS (Triples Inferred and Indexed Per Second) - 21.4 billion triples Inferred in 12 hrs 56 min. + two indexes created in 5 hrs. = 17 hrs 56 min.
  • SPARQL Query Performance: 459K QRPS (Query Results Per Second) - 4.18 Billion answers in 2.53 hrs.

Setup:
Hardware: One node of a Sun Server X2-4, 3-node Oracle Real Application Cluster (RAC)

  • The node was configured with 1TB RAM, and 4 CPUs (2.4GHz 10-Core Intel E7-4870) having 40 total Cores and 80 Parallel Threads.
  • Storage: Dual-node Sun ZFS Storage 7420, both heads configured with 4 CPUs (2.00GHz 8-Core Intel E7-4820), 256GB memory, 4x 512GB SATA2 SSD (READZ), and 2x 500GB SATA 10K disks; 4 disk trays with 20 x 900GB disks @ 10Krpm and 4x 73GB SSD (WRITEZ)

Software: Oracle Database 11.2.0.3.0, SGA_TARGET=750G and PGA_AGGREGATE_TARGET=200G
The test was performed April 2013.

LUBM 25k: 6.1 Billion triples

  • Data Loading Performance: 539.7K TLIPS - 3.4 Billion triples loaded* and indexed in 105 min. (*Loading included checking the triples were well formed, removing duplicates and creating two indexes on the graph.)
  • Inference Performance: 281.3K TIIPS - 2.7 Billion triples inferred in 160 min. (Inference included creating two indexes on the inferred data in the graph.)
  • SPARQL Query Performance: 900.4K QRPS (470+ Million answers in 8.7 min.)

Setup:
Hardware: A Sun M8000 was configured with 512 GB RAM, 16 CPUs (SPARC64 VII+ 3.0 GHz) having 64 total Cores and 128 Parallel Threads, and Dual F5100 Flash Arrays having 160 total drives.
Software: Oracle Database 11.2.0.2.0 + Patch 9825019: SEMANTIC TECHNOLOGIES 11G R2 FIX BUNDLE 3, SGA_TARGET=256G and PGA_AGGREGATE_TARGET=206G

LUBM 8k: 1.969 Billion triples

  • Data Loading Performance: 650.5K TLIPS - 1.1 Billion triples loaded* and indexed in 28 min. 11 sec. (*Loading included checking the triples were well formed, removing duplicates and creating two indexes on the graph.)
  • Inference Performance: 233.6K TIIPS - 869 Million triples inferred in 62 min. (Inference included creating two indexes on the inferred data in the graph.)
  • SPARQL Query Performance: 577.5K QRPS (149+ Million answers in 4.3 min.)

Setup: See the LUBM 25k setup.

Please go to the Oracle Technology Network for more information about Oracle Spatial and Graph support for RDF graphs.

AllegroGraph (1+ Trillion)

Franz announced at the June 2011 Semtech conference a load and query of 310 Billion triples as part of a joint project with Intel. In August 2011, with the help of Stillwater SC and Intel we achieved the industry's first load and query of 1 Trillion RDF Triples. Total load was 1,009,690,381,946 triples in just over 338 hours for an average rate of 829,556 triples per second.

The driving force has been Amdocs and their AIDA platform. Here are two video presentations: Semtech 2011 and Semtech 201.

We currently load LUBM 8000 in just over 36 minutes. Query times are also very fast. We do not preprocess the strings, and we do not need to apply the graph (or named context) to Universities in order to gain better performance (see Note 1, below). The benchmark section of our website is updated with each product release.

Franz is in late-stage development on a clustered version of AllegroGraph that will push storage into trillions of triples. We use hash-based partitioning for our triples so that the query engines don't have to engage in map/reduce operations. The two biggest challenges we are addressing right now are [1] to develop smarter query techniques to limit trips across machine boundaries, and [2] to keep the database ACID in a clustered environment.
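
To illustrate the hash-based partitioning idea in general terms (a sketch of the technique, not AllegroGraph's actual scheme or API): each triple is routed to a shard by hashing a partition key, so a lookup that binds that key touches exactly one shard and needs no map/reduce-style scatter-gather.

  # Generic sketch of hash-based triple partitioning across shards.
  # Not AllegroGraph's actual implementation.
  import hashlib

  NUM_SHARDS = 8
  shards = [set() for _ in range(NUM_SHARDS)]

  def shard_for(key):
      # Stable hash of the partition key (here: the subject).
      digest = hashlib.sha1(key.encode("utf-8")).digest()
      return int.from_bytes(digest[:4], "big") % NUM_SHARDS

  def add_triple(s, p, o):
      shards[shard_for(s)].add((s, p, o))

  def triples_about(s):
      # A subject-bound lookup touches exactly one shard.
      return {t for t in shards[shard_for(s)] if t[0] == s}

  add_triple("ex:alice", "ex:knows", "ex:bob")
  add_triple("ex:alice", "ex:worksFor", "ex:acme")
  print(triples_about("ex:alice"))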

Note 1: AllegroGraph provides dynamic reasoning and DOES NOT require materialization. AllegroGraph's RDFS++ engine dynamically maintains the ontological entailments required for reasoning; it has no explicit materialization phase. Materialization is the pre-computation and storage of inferred triples so that future queries run more efficiently. The central problem with materialization is its maintenance: changes to the triple-store's ontology or facts usually change the set of inferred triples. In static materialization, any change in the store requires complete re-processing before new queries can run. AllegroGraph's dynamic materialization simplifies store maintenance and reduces the time required between data changes and querying. AllegroGraph also has RDFS++ reasoning with built-in Prolog.
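
The distinction can be sketched with a toy rdfs:subClassOf example (purely illustrative, not any vendor's engine): forward chaining materializes and stores the entailed triples up front, while backward chaining derives them only when a query asks.

  # Toy contrast between materialization (forward chaining) and dynamic,
  # query-time reasoning (backward chaining) over rdfs:subClassOf.
  # Illustrative only; single-parent class chains, hypothetical data.
  subclass_of = {"Student": "Person", "GradStudent": "Student"}
  facts = {("alice", "rdf:type", "GradStudent")}

  def superclasses(cls):
      while cls in subclass_of:
          cls = subclass_of[cls]
          yield cls

  def materialize(facts):
      # Precompute and store every entailed rdf:type triple; any change to
      # the facts or ontology means re-running this step.
      inferred = set()
      for s, p, o in facts:
          if p == "rdf:type":
              inferred.update((s, "rdf:type", sup) for sup in superclasses(o))
      return facts | inferred

  def is_instance_of(s, c, facts):
      # Answer the question at query time; nothing extra is stored.
      for subj, pred, obj in facts:
          if subj == s and pred == "rdf:type":
              if obj == c or c in superclasses(obj):
                  return True
      return False

  print(("alice", "rdf:type", "Person") in materialize(facts))  # True
  print(is_instance_of("alice", "Person", facts))               # True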

Stardog (50B)

Clark & Parsia announced that the 2.1 release of Stardog can scale up to 50 Billion triples on a $10k server (32 cores, 256GB of RAM), with load speeds of 500k triples/sec for 1B triples and over 300k triples/sec for 20B triples.

Stardog is a pure-Java RDF database which supports all of the OWL 2 profiles using a dynamic (backward-chaining) approach. It also includes unique features such as Integrity Constraint Validation and explanations (i.e., proofs) for inferences and integrity constraint violations, and it integrates a full-text search index based on Lucene.
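
As a rough illustration of what integrity constraint validation means (a generic closed-world check over triples, not Stardog's actual constraint syntax or API): a constraint such as "every Employee must have a worksFor value" flags data that merely lacks the property, something open-world OWL reasoning alone would not treat as an error.

  # Generic sketch of a closed-world integrity constraint check.
  # Hypothetical data and constraint; not Stardog's ICV language or API.
  triples = {
      ("ex:alice", "rdf:type", "ex:Employee"),
      ("ex:alice", "ex:worksFor", "ex:acme"),
      ("ex:bob",   "rdf:type", "ex:Employee"),   # no ex:worksFor -> violation
  }

  def violations(triples, cls="ex:Employee", required="ex:worksFor"):
      instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
      satisfied = {s for s, p, o in triples if p == required}
      return instances - satisfied               # closed world: absence fails

  print(violations(triples))   # {'ex:bob'}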

OpenLink Virtuoso v6.1 - 15.4B+ explicit; uncounted virtual/inferred

LOD Cloud Cache is a live instance now serving more than 15.4 Billion Triples (and counting) — including the entire data.gov Catalog — on an 8-host share-nothing Virtuoso cluster, with one (1) Virtuoso Server Process per host, and each host with two (2) quad-core processors, 16GB RAM, and four (4) 1TB SATA-II Disks, each disk on its own channel. Latest bulk load added ~3 Billion triples in ~3 hours — roughly 275Ktps (Kilotriples-per-second) — with partial parallelization of load.

Previously, the LOD Cloud Cache handled 8.5 Billion+ Triples on a 2-host share-nothing Virtuoso cluster with 4 Virtuoso Server Processes per host, and each host with one (1) quad-core processor, 32GB RAM, and four (4) SATA Disks, each disk on its own channel.

As of August 11, 2009, the LUBM 8000 load speed was 160,739 triples-per-second on a single machine with 2 x Xeon 5520 and 72G RAM. Adding a second machine with 2 x Xeon 5410 and 16G RAM, and 1 x 1GigE interconnect, the load rate increased to 214,188 triples-per-second. The software is Virtuoso 6 Cluster, set up with 8 partitions per host. No inference was made. More run details are in the Virtuoso blog post discussing the original run on the smaller host, which delivered 110,532 triples-per-second load rate on its own.

The Towards Web-Scale RDF white paper discusses why triple scale is a function of cluster configuration: 100 Billion triples with sub-second response time can be achieved with the right cluster configuration (primarily the total memory pool delivered by the cluster).

Inferred triples are uncounted because they will vary with the query; backward-chaining is the preferred method. Virtuoso's reasoning currently (v6.1.1; May 2010) includes support for owl:sameAs, rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty, owl:InverseFunctionalProperty, owl:TransitiveProperty, owl:SymmetricProperty, and owl:inverseOf.

Benchmark data sources

Older comments

A Bitmap Indexing white paper shows how OpenLink Virtuoso handles loading the 1 billion triple LUBM benchmark set at a sustained rate of 12,692 triples/s and the 47M triple Wikipedia data set at a rate of 20,800 triples/s. Kingsley Idehen, OpenLink Software.

"The single query stream rate with 100K triples is 14 qps at 100K triples and 11 qps at 1G triples" -- LUBM and Virtuoso

Open source.

http://virtuoso.openlinksw.com/wiki/main/

BigOWLIM (12B explicit, 20B total); 100,000 queries per $1

BigOWLIM is a native RDF database engine with extensive reasoning support. It is a pure Java implementation; users are given a choice to use it either through Sesame or through Jena - in both cases getting its full performance and functionality.

Multiple independent evaluations of RDF repositories indicate superior overall performance and reasoning capabilities of OWLIM. Commercial license; freely available for research, evaluation and development.

Scalability and Loading Speed

Back in June 2009, BigOWLIM 3.1 demonstrated reasoning against 12.0 Billion explicit statements, loading the LUBM(90k) dataset. The reasoning over this data materialized an additional 8.4 Billion statements; the total number of indexed statements was 20.4 Billion. Loading and inference took 290 hours on a single server worth less than $5,000.

BigOWLIM can deal with 1 Billion statements on a desktop machine worth $1000: it takes less than 5 hours to load the LUBM(8k) dataset at an average speed of 66 thousand statements per second. The "full cycle" run of LUBM(8k), including loading, inference, and query evaluation, takes 15.2 hours.

Details about the LUBM benchmark runs are available here. While more recent BigOWLIM versions and new hardware allow for measurable improvements in both scale and speed, we have not re-run these experiments, because: (i) the LUBM datasets are not sufficiently challenging and representative and (ii) for real-world applications even these figures make loading performance and scalability a no-brainer.

Query Performance, Horizontal Scalability in the Cloud

OWLIM Replication Cluster can handle 5 Million SPARQL queries per hour – in a recent test against BSBM 100M, a cluster of 100 Amazon EC2 instances scored 200,000 BSBM QMpH. The total Amazon EC2 cost for evaluating 100,000 SPARQL queries was $1!

OWLIM Replication Cluster has also proven itself in the highest-profile real-world application of an RDF database – the BBC's World Cup 2010 web site. Here, OWLIM demonstrated OWL reasoning with continuously changing data (hundreds of updates per hour) while handling millions of queries per day.

Performance features

BigOWLIM includes a range of features that considerably improve overall performance and usability:

  • Geo-spatial extensions: special-purpose indices allow queries involving constraints such as 'nearby point' and 'within region' to be evaluated efficiently. For example, finding airports within 50 miles of London in GeoNames becomes 500 times faster (try the sample query at FactForge).
  • Robust and transparent reasoning: RDFS, OWL 2 RL and QL are supported with no compromises to standards compliance throughout the entire "life cycle" of the data: loading, incremental updates, deletion of facts, schema and ontology updates.
  • Optimized owl:sameAs handling delivers dramatic improvements in performance and usability when data from multiple sources are integrated. It allows FactForge to materialize only 0.9 Billion statements instead of 8.4 Billion and to deliver concise, non-expanded query results when executing queries against 8 of the most central LOD datasets, while carefully considering their semantics (see the sketch after this list).
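
A minimal sketch of the idea behind optimized owl:sameAs handling (illustrative only, not OWLIM's implementation): sameAs-connected resources are grouped into equivalence classes and triples are indexed against one canonical representative, instead of materializing a copy of every triple for every member of the class.

  # Sketch of owl:sameAs canonicalization via union-find.
  # Illustrative only; not OWLIM's actual data structures.
  parent = {}

  def find(x):
      parent.setdefault(x, x)
      while parent[x] != x:
          parent[x] = parent[parent[x]]   # path halving
          x = parent[x]
      return x

  def same_as(a, b):
      parent[find(a)] = find(b)           # merge the two equivalence classes

  def canonical_triple(s, p, o):
      # One stored triple per class representative, not per member.
      return (find(s), p, find(o))

  same_as("dbpedia:London", "geonames:2643743")
  same_as("geonames:2643743", "freebase:london")
  print(canonical_triple("dbpedia:London", "ex:locatedIn", "dbpedia:England"))
  print(canonical_triple("freebase:london", "ex:locatedIn", "dbpedia:England"))
  # Both calls yield the same tuple, so no duplicate materialization.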

Garlik 4store (15B)

The store is called 4store. Currently we have 4 KBs (knowledge bases) of 3-4 GT (gigatriples) each loaded in our production systems - a cluster of 9 low-end servers running CentOS Linux. Loading time for one 4GT KB is about 8 hours, but it's an interactive process that involves running lots of queries and doing small inserts.

As of 2009-10-21 it's running with 15B triples in a production cluster to power the DataPatrol application.

4store is now available under the GPLv3 license from 4store.org.

Bigdata(R) (12.7B)

We are in a shakedown period on the scale-out system and will post results as we get them.

6/30/2009: 1B triples stable on disk in 50 minutes (333k tps). 12.7B triples loaded. The issue with clients dying off has been resolved, as has the high client CPU utilization issue.

5/25/2009: 10.4 billion LUBM triples loaded in 47 hours (61k tps) on a 15 blade cluster (this run used 9 data servers, 5 clients, and 1 service manager). Max throughput was just above 241k triples per second. 1 billion triples was reached in 71 minutes, 2 billion in 161 minutes, 5 billion in 508 minutes. The clients are still the bottleneck and started failing one by one after 7.8B triples (throughput at that point was 141k tps).

5/22/2009: 9 billion LUBM triples loaded in 31 hours using the same hardware. The bottleneck was the clients, which were not able to put out enough load. By the end of the trial the clients were at 100% utilization while the data services were at less than 10% utilization.

5/21/2009: 5 billion LUBM triples loaded in 10 hours (135k tps) on a 15 blade cluster (10 data servers, 4 clients, 1 service manager).

Bigdata is an open-source general-purpose scale-out storage and computing fabric for ordered data (B+Trees). Scale-out is achieved via dynamic key-range partitioning of the B+Tree indices. Index partitions are split (or joined) based on partition size on disk and moved across data services on a cluster based on server load. The entire system is designed to run on commodity hardware, and additional scale can be achieved by simply plugging in more data services dynamically at runtime, which will self-register with the centralized service manager and start managing data automatically. Much like Google's BigTable, there is no theoretical maximum scale.
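
A minimal sketch of dynamic key-range partitioning in general (not Bigdata's actual code): a table of separator keys routes each key to a partition, and an oversized partition is split by promoting its median key to a new separator.

  # Generic sketch of dynamic key-range partitioning with splits.
  # Not Bigdata's actual implementation; real systems split on size on disk.
  import bisect

  MAX_PARTITION_SIZE = 4   # toy threshold
  separators = []          # sorted boundary keys between partitions
  partitions = [[]]        # partitions[i] holds keys below separators[i]

  def partition_index(key):
      return bisect.bisect_right(separators, key)

  def insert(key):
      i = partition_index(key)
      bisect.insort(partitions[i], key)
      if len(partitions[i]) > MAX_PARTITION_SIZE:
          split(i)

  def split(i):
      # Promote the median key to a new separator and divide the partition.
      part = partitions[i]
      mid = len(part) // 2
      separators.insert(i, part[mid])
      partitions[i:i + 1] = [part[:mid], part[mid:]]

  for k in ["c", "a", "f", "b", "e", "d", "g"]:
      insert(k)
  print(separators, partitions)   # ['c', 'e'] [['a','b'], ['c','d'], ['e','f','g']]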

The bigdata RDF store is an application written on top of the bigdata core. It is fully persistent, Sesame 2 compliant, supports SPARQL, and supports RDFS and limited OWL inference. The single-host RDF database is stable and is used at the core of an open-source harvesting system for the intelligence community. We are working towards a release of the scale-out architecture.

Please come see our presentation at the Semantic Technologies Conference in San Jose on June 18th.

More information on bigdata can be found here:

http://www.bigdata.com/blog/

And in this presentation at OSCON 2008:

http://bigdata.sourceforge.net/pubs/bigdata-oscon-7-23-08.pdf

Open source

YARS2 (7B)

YARS2: A Federated Repository for Querying Graph Structured Data from the Web describes the distributed architecture of the YARS2 quad store, with scalability experiments up to 7 billion synthetically generated statements - LUBM(50000).

Proprietary, not distributed.

Jena TDB (1.7B)

TDB is a persistent graph storage layer for Jena. TDB works with the Jena SPARQL query engine (ARQ) to provide complete SPARQL together with a number of extensions (e.g. property functions, aggregates, arbitrary-length property paths). It is pure Java, employing memory-mapped I/O, a custom implementation of B+Trees, and optimized range filters for XSD value spaces (integers, decimals, dates, dateTimes).
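
As a rough sketch of what range filters over XSD value spaces buy you in general (not TDB's actual node encoding): if typed literals are indexed by their typed value rather than their lexical form, a numeric FILTER such as ?age >= 18 becomes an index range scan.

  # Sketch of value-space range filtering: literals kept in a value-ordered
  # index so numeric FILTERs become range scans. Generic illustration only;
  # not TDB's actual node table or encoding.
  import bisect

  # (typed value, subject) pairs, sorted by value rather than lexical form.
  age_index = sorted([(9, "ex:carol"), (18, "ex:bob"), (42, "ex:alice")])

  def subjects_with_age_at_least(threshold):
      # Binary-search to the first entry >= threshold, then scan the tail.
      start = bisect.bisect_left(age_index, (threshold,))
      return [subj for value, subj in age_index[start:]]

  print(subjects_with_age_at_least(18))   # ['ex:bob', 'ex:alice']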

TDB has been used to load UniProt v13.4 (1.7B triples, 1.5B unique) on a single machine with 64 bit hardware (36 hours, 12k triples/s).

TDB 0.5 Results for the Berlin SPARQL Benchmark (August 2008).

Open Source: License: Apache Software License

Jena SDB (650M)

SDB is a SPARQL database for graphs/named graphs for Jena. It can load UniProt (650M triples) and uses PostgreSQL, MySQL, Oracle, or MS SQL Server as the backing store; HSQLDB and Apache Derby are also supported.

Open Source: License: Apache Software License

http://jena.apache.org/

Mulgara (500M)

"The Mulgara triple store is scalable up to 0.5billion triples (with 64-bit Java)" -- Norman Gray

Open source

http://www.mulgara.org/

RDF Gateway (262M)

"the UniProt protein database (262 million triples) and RDF Gateway." -- Geoff Chappell, Intellidimension

Commercial

http://www.intellidimension.com/

Jena with PostgreSQL (200M)

"Our store is pretty big -- its about 200M triples.

We're currently using Jena on Postgres. For our needs this worked out better than Jena/MySQL, Sesame, and Kowari." -- Leigh Dodds, Ingenta

Open source

http://jena.apache.org/

Kowari (160M)

"My own testing has been in the 10-20M triple range." -- Chris Wilper

Addendum from Chris on Nov 7th, 2005: Since this was written, we have successfully loaded over 160M triples into Kowari on a 64-bit machine with 6GB physical memory. A 64-bit machine is really required to bring Kowari up to this level because it uses mapped files and needs a lot of address space. In our experience in this environment, simple queries still perform fairly well (a few seconds) and complex queries involving 8-10 triple patterns perform worse (a few minutes to an hour).

Open source, unmaintained (See Mulgara fork).

http://www.kowari.org/

3store with MySQL 3 (100M)

"The store my consortium produces (3store) is used successfully up to 100M triples or so. Beyond that it gets a bit sketchy. I'm currently looking at ways to make it scale to 10^9+ without specialising the store to a particular schema."

More specifically, one user is running it with 120M triples in MySQL 4.1. At that size query works fine, but assertion time is down to about 300 triples/second, which makes growing it any bigger too painful. I should note that 3store is an RDFS store, in version 3 it's possible to disable the inference, which should make it scale to much larger sizes, but there are plenty of other stores that can run vanilla RDF storage well. -- Steve Harris, AKT

Open source (GNU GPL).

http://threestore.sourceforge.net/

Sesame (70M)

(10-20 million triples) " is a lot, but most serious triple stores can handle this I'd say. Sesame certainly can, ..." -- Jeen Broekstra, Aduna

Addendum from Jeen on Feb 10 2006: The above comment should be taken as a minimum of what the store can handle. We have recently run a few scalability tests on Sesame's Native Store (Sesame 2.0-alpha-3). Using the Lehigh University Benchmark we successfully added a LUBM-500 dataset (consisting of about 70 million RDF triples). The machine used was a 2.8GHz P4 (32-bit) with 1GB RAM, running Suse Linux 10.0 (kernel 2.6), Sun J2SE 1.5.0_06. Upload took about 3 hours. Query performance on the LUBM test-queries was adequate to good: unoptimized, the worst query (Q2) took 1.3 hours to complete, but most queries completed within tens of milliseconds (Q4,5,6,7,8,10,12,13) or 1-5 minutes (Q1,3,9,11,14) - though some of these queries are just fast because they return no results (the native store does no RDFS/OWL inferencing). We have yet to explore larger datasets and performance using RDFS inferencing but it seems that 70M is not the ceiling and that Sesame can easily cope with even larger sets, especially when we use bigger hardware. But that's prediction not fact so I'll leave it at that for now ;)

Open source.

http://www.openrdf.org/

Others who claim to go big

Claims without signatures or quotes. Please move them from this section when they can be linked to a signed specific capacity measurement.

Questions

I know storing 200M triples is cool. But which store can handle simultaneous queries of about 10,000 users using RDFS inferencing? -- Anonymous

200M-300M or so seems to be about the max that anybody has reported. It would be very helpful if people could state whether they tried to scale further, and if not able to, what the problems were -- i.e., does it become too slow to add, perform trivial queries, perform complex queries, all of the above, etc. Additionally, it would be extremely helpful if hardware specs were included. Anyway, this is a great resource. -- CS

It would be nice if the postings here commented on the level of inference supported. Loading with forward-chaining and materialization is *much* heavier than just loading the data. The more general question is what part of the semantics of the loaded ontology/dataset is supported by the system. There are subtle differences in what "loading" means for the four systems with the highest results above. RDF Gateway supports the semantics of UNIPROT through backward-chaining. OWLIM supports the semantics of LUBM through forward-chaining. The sort of reasoning required in the UNIPROT load of RDF Gateway is much more complex than that necessary for passing LUBM. Finally, Virtuoso and AllegroGraph are fairly undetermined with respect to the reasoning involved in the experiments they report on. For instance, Virtuoso reports results on LUBM but says nothing about the completeness of the query evaluation. -- Atanas Kiryakov

Related