Persistent RDF Databases

Eric's Notes

Eric Prud'hommeaux <eric@w3.org>

Status of Document

This document represents a preliminary proposal for a persistent RDF database schema. It does not represent the views of the W3C team or member companies.



The semantic web provides a wealth of opportunity to enable structured searches for search engines and web agents. This assumes the overhead and latency of a query-time search. A large-scale database makes this data immediately available for use by the infrastructure and searches.


Beyond immediate search gratification, the RDF database can be used in the infrastructure, for example to store web server configuration or workflow information. The purpose of this design and implementation is to create the underlying database and API.

Scalability and Speed

Good database design and @@@word for uniquing strings in a DB@@@ should automatically produce a database that can store huge numbers of statements (RDF node relationships). Each statement is a set of 6 integers. The expense of string storage is unavoidable, but most RDF data is self-referential and therefore has a comparatively small number of unique strings. As the database grows larger, new documents will find a greater fraction of their strings already present in the database. Indexing allows the administrator to push the slider between speed and space efficiency.
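The string uniquing described above can be sketched with a unique index and a get-or-create helper. This is only an illustration using Python's sqlite3 module; the column names (id, string) and the helper name intern_string are assumptions, not taken from the actual schema.

```python
import sqlite3

def intern_string(conn, text):
    """Return the integer id for text, inserting a row only if it is new.

    Hypothetical helper: the table and column names are assumed.
    """
    # The UNIQUE constraint makes a repeated insert a no-op.
    conn.execute("INSERT OR IGNORE INTO Strings (string) VALUES (?)", (text,))
    row = conn.execute(
        "SELECT id FROM Strings WHERE string = ?", (text,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Strings (id INTEGER PRIMARY KEY, string TEXT NOT NULL UNIQUE)"
)

blue_1 = intern_string(conn, "blue")
blue_2 = intern_string(conn, "blue")  # same string, same id
red = intern_string(conn, "red")
count = conn.execute("SELECT COUNT(*) FROM Strings").fetchone()[0]
```

Statements can then refer to strings by integer id, and a string that occurs in many documents is stored once.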

Table Structure


@@@ s/uniqueness/that word for uniqueness in databases/g

The tables are constructed to provide as many unique indexes as possible. A breakdown of the tables can be seen in the machine-generated table description.


Following is a table-by-table breakdown of the relationships in the RDF persistent database.


The Uris table stores a list of the URIs without their fragments.


Because a URI may refer to data that changes over time, Revisions are an attempt to allow multiple versions of a document to co-exist in the same RDF database. One benefit of having Revisions is that data may be half-parsed from a new version without interfering with queries on triples in the previous version. This is pretty dicey and is likely to go away.


The Fragments table stores the Fragment portion of the URIs as well as the URI element which is the base name for the full fragment identifier.


More recent work on attributions is available at Source Attribution in RDF.

In an environment of heterogeneous trust requirements, each statement needs some way to identify its source. Attributions stores the External ID of the document being parsed, including the most recent fragment ID encountered when the Attribution is stored.

The typical storage mechanism is:

  1. The RDF Parser gets a SAX startDocument call with the External ID http://www.w3.org/1999/07/example.rdf for the document.
  2. The RDF Parser asks the RDF database for the Attribution for http://www.w3.org/1999/07/example.rdf and stores it as the current attribution.
  3. The RDF Parser gets a startElement call that causes the parser to insert some Statement into the database. The current attribution is stored with this Statement so it is known where this Statement was encountered.
  4. The XML Parser encounters an element with an RDF:ID="id1" in it and sets the current attribution to the Attribution for http://www.w3.org/1999/07/example.rdf#id1.
  5. Subsequent Statements are stored with this new current attribution.
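The steps above can be sketched with a SAX handler that tracks the current attribution. This is not the parser described here, just a minimal illustration using Python's xml.sax: the class name AttributionTracker, the sample document, and the ex: namespace are all made up, and the database lookup is replaced by a plain string. It also assumes that the element carrying the rdf:ID is itself recorded under the new attribution.

```python
import io
import xml.sax
import xml.sax.handler

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

class AttributionTracker(xml.sax.handler.ContentHandler):
    """Hypothetical handler: records which attribution each element falls under."""

    def __init__(self, external_id):
        super().__init__()
        self.external_id = external_id
        self.current = None
        self.log = []  # (element local name, attribution in force)

    def startDocument(self):
        # Steps 1-2: the document's External ID becomes the current attribution.
        self.current = self.external_id

    def startElementNS(self, name, qname, attrs):
        # Step 4: an rdf:ID switches the current attribution to the fragment.
        rdf_id = attrs.get((RDF_NS, "ID"))
        if rdf_id is not None:
            self.current = self.external_id + "#" + rdf_id
        # Steps 3 and 5: anything stored now carries the current attribution.
        self.log.append((name[1], self.current))

doc = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/joe">
    <ex:hasMood>blue</ex:hasMood>
  </rdf:Description>
  <rdf:Description rdf:ID="id1">
    <ex:hasColor>blue</ex:hasColor>
  </rdf:Description>
</rdf:RDF>"""

handler = AttributionTracker("http://www.w3.org/1999/07/example.rdf")
parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
parser.setContentHandler(handler)
parser.parse(io.BytesIO(doc))
```

After parsing, handler.log shows the first Description and its property under the document attribution, and the elements from the rdf:ID="id1" Description onward under the fragment attribution.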

Used in:

search for statements by attribution:
SELECT Statements.id
  FROM Statements,Attributions,Fragments,Uris
 WHERE Statements.attribution=Attributions.id
   AND Attributions.fragment=Fragments.id
   AND Fragments.uri=Uris.id
   AND Uris.uri='http://www.w3.org/1999/07/example.rdf'
 GROUP BY Statements.id
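To see the attribution query run end to end, here is a small sketch against an in-memory SQLite database. Only the columns that appear in the query come from this document; the rest of the schema (for instance Fragments.fragment), the sample rows, and the ORDER BY added for a stable result are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Uris (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
CREATE TABLE Fragments (id INTEGER PRIMARY KEY, uri INTEGER, fragment TEXT);
CREATE TABLE Attributions (id INTEGER PRIMARY KEY, fragment INTEGER);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, attribution INTEGER);

-- Two documents; the first has a bare URI and one fragment (#id1).
INSERT INTO Uris VALUES (1, 'http://www.w3.org/1999/07/example.rdf'),
                        (2, 'file:/test.rdf');
INSERT INTO Fragments VALUES (1, 1, ''), (2, 1, 'id1'), (3, 2, '');
INSERT INTO Attributions VALUES (1, 1), (2, 2), (3, 3);
-- Statements 10 and 11 come from example.rdf, 12 from test.rdf.
INSERT INTO Statements VALUES (10, 1), (11, 2), (12, 3);
""")

rows = conn.execute(
    """
    SELECT Statements.id
      FROM Statements, Attributions, Fragments, Uris
     WHERE Statements.attribution = Attributions.id
       AND Attributions.fragment = Fragments.id
       AND Fragments.uri = Uris.id
       AND Uris.uri = ?
     GROUP BY Statements.id
     ORDER BY Statements.id
    """,
    ("http://www.w3.org/1999/07/example.rdf",),
).fetchall()
ids = [row[0] for row in rows]
```

Because the join goes through Fragments to Uris, statements attributed to any fragment of the document are found along with those attributed to the document itself.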

Other APIs have other names for Attributions and similar concepts. See also:


The Strings table stores the uniquified strings encountered in the database. This has the interesting side-effect that (<identifier for Joe's car> --hasColor-> "blue") refers to the same blue node as (<identifier for Joe> --hasMood-> "blue"). Good? Bad? I don't know, but it will probably be useful to limit queries on nodes with an object that is a String. This will be especially pertinent in the rdf_browser where you won't want to see zillions of nodes pointing to "Blue".


Anonymous nodes are given GenIds to uniquely identify them for query relationships.


The RdfIds table is a convenience that simplifies the Statement table. Since the predicate is a fragment identifier, the subject is a fragment identifier or an anonymous node, and the object is any of those or a string, queries where one Statement's object is another Statement's subject tend to get very cumbersome (like this sentence). The solution is to store each of these elements as an RdfId.


In an environment of heterogeneous data sources, it is likely (and even hoped) that a statement may come from multiple Attributions. It is, however, important to allow statements made by some data source to be rescinded, for instance, when the document is replaced. All Nodes referring to the same RdfId represent all the data sources that have made assertions about that same object.
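A query where one Statement's object is another Statement's subject can then be written as a join through Nodes on the shared RdfId, whatever kind of thing the RdfId stands for. The sketch below is only an illustration under an assumed, reduced schema: Nodes(id, rdfId, attribution) and a Statements table trimmed to the columns used here. The sample ids are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nodes (id INTEGER PRIMARY KEY, rdfId INTEGER, attribution INTEGER);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, predicate INTEGER,
                         subjectNode INTEGER, objectNode INTEGER,
                         attribution INTEGER);

-- Nodes 1 and 2 are the same resource (rdfId 7) seen by sources 100 and 200.
INSERT INTO Nodes VALUES (1, 7, 100), (2, 7, 200), (3, 8, 100), (4, 9, 200);
-- Source 100 asserts: rdfId 8 --predicate 20--> rdfId 7
-- Source 200 asserts: rdfId 7 --predicate 21--> rdfId 9
INSERT INTO Statements VALUES (1, 20, 3, 1, 100), (2, 21, 2, 4, 200);
""")

# Pairs (s1, s2) where the object of s1 is the subject of s2,
# even though the two statements come from different sources.
chains = conn.execute("""
    SELECT s1.id, s2.id
      FROM Statements AS s1, Nodes AS n_obj, Nodes AS n_subj, Statements AS s2
     WHERE s1.objectNode = n_obj.id
       AND s2.subjectNode = n_subj.id
       AND n_obj.rdfId = n_subj.rdfId
""").fetchall()
```

Keeping one Node per (RdfId, Attribution) pair means a source's assertions can be removed by attribution without touching what other sources said about the same RdfId.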

@@@ can I have a unique on rdfId and attribution? If so, change add('Nodes') to ensure('Nodes', 'u_rdfId_attribution'). While I'm at it, may I assert that the same statement made multiple times in the same document is redundant and should be counted only once?


The Statements table stores the Statements made in the parsed RDF.


Some statements are not made about a specific subject, but instead about each of the elements in a container. This is implemented as a generic mapping from one (p,s,o) triple to another. In this case, the mapped-from triple is the membership of X in a bag Y. The MappedNodes ...

Typical data:

| id | rdfId | mapFrom | description | container | attribution |
|  1 |    37 |      59 |          42 |        32 |          13 |
which corresponds to:
file:/test.rdf#B1 map from http://www.w3.org/1999/02/22-rdf-syntax-ns#li


The MappedStatements table stores the MappedStatements.

Typical data:

| id | predicate | subjectNode | subjectMap | objectNode | objectMap | containerId | reifiedId | attribution |
|  1 |        24 |        NULL |          1 |        128 |      NULL |          32 |        52 |          13 |
which corresponds to:
 file:/test.rdf#B1 map from http://www.w3.org/1999/02/22-rdf-syntax-ns#li,
reified as file:/test.rdf#hata 
in file:/test.rdf#pre by file:/test.rdf

Used in:

insert pending nodes for a container insertion:
given (http://www.w3.org/1999/02/22-rdf-syntax-ns#li <59>,  file:/test.rdf#B1 <129 with RdfId 37>,  file:/test.rdf#U1 <130>) by file:/test.rdf

execute the SQL:

INSERT INTO Statements (predicate,subjectNode,objectNode,containerId,reifiedId,attribution)
 SELECT MappedStatements.predicate,130,MappedStatements.objectNode,
        MappedStatements.containerId,MappedStatements.reifiedId,MappedStatements.attribution
 FROM MappedStatements,MappedNodes
 WHERE MappedStatements.subjectMap=MappedNodes.id AND MappedNodes.rdfId=37 AND MappedNodes.mapFrom=59
which produces:
| predicate | 130 | objectNode | containerId | reifiedId | attribution |
|        24 | 130 |        128 |          32 |        52 |          13 |
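The container insertion can be reproduced with the typical data shown for MappedNodes and MappedStatements. This sketch uses an in-memory SQLite database; the column types and the Statements id column are assumptions, and the select list is completed with containerId, reifiedId and attribution to match the six columns of the result shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MappedNodes (id INTEGER PRIMARY KEY, rdfId INTEGER,
    mapFrom INTEGER, description INTEGER, container INTEGER,
    attribution INTEGER);
CREATE TABLE MappedStatements (id INTEGER PRIMARY KEY, predicate INTEGER,
    subjectNode INTEGER, subjectMap INTEGER, objectNode INTEGER,
    objectMap INTEGER, containerId INTEGER, reifiedId INTEGER,
    attribution INTEGER);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, predicate INTEGER,
    subjectNode INTEGER, objectNode INTEGER, containerId INTEGER,
    reifiedId INTEGER, attribution INTEGER);

-- The "typical data" rows from the tables above.
INSERT INTO MappedNodes VALUES (1, 37, 59, 42, 32, 13);
INSERT INTO MappedStatements VALUES (1, 24, NULL, 1, 128, NULL, 32, 52, 13);
""")

# New member node 130 joins the container with RdfId 37 via predicate 59
# (rdf:li), so every statement mapped onto that container is copied with
# 130 as its subject.
conn.execute(
    """
    INSERT INTO Statements
        (predicate, subjectNode, objectNode, containerId, reifiedId, attribution)
    SELECT MappedStatements.predicate, ?, MappedStatements.objectNode,
           MappedStatements.containerId, MappedStatements.reifiedId,
           MappedStatements.attribution
      FROM MappedStatements, MappedNodes
     WHERE MappedStatements.subjectMap = MappedNodes.id
       AND MappedNodes.rdfId = ?
       AND MappedNodes.mapFrom = ?
    """,
    (130, 37, 59),
)
inserted = conn.execute(
    "SELECT predicate, subjectNode, objectNode, containerId, reifiedId, "
    "attribution FROM Statements"
).fetchall()
```

The single inserted row carries the mapped predicate and object with the new member as its subject.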

or (http://www.w3.org/schema/certHTMLv1/hasAccessTo,file:/test.rdf#U1,file:/test.rdf#http://resource/a) reified as file:/test.rdf#pre in file:/test.rdf#hata for @@@


The Descriptions table stores the information that comes with an RDF:Description tag.

Used in:

search for nodes by type:
SELECT Nodes.id FROM Nodes,Descriptions WHERE Nodes.description=Descriptions.id AND Descriptions.type IN (40) GROUP BY Nodes.id
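The type search generalizes to a list of type ids by building the IN list from placeholders. A minimal sketch, assuming reduced Nodes(id, description) and Descriptions(id, type) tables; the helper name nodes_of_types and the sample rows are invented for illustration.

```python
import sqlite3

def nodes_of_types(conn, type_ids):
    """Return ids of Nodes whose Description has one of the given types."""
    placeholders = ",".join("?" for _ in type_ids)
    sql = (
        "SELECT Nodes.id FROM Nodes, Descriptions "
        "WHERE Nodes.description = Descriptions.id "
        "AND Descriptions.type IN (%s) "
        "GROUP BY Nodes.id ORDER BY Nodes.id" % placeholders
    )
    return [row[0] for row in conn.execute(sql, list(type_ids))]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Descriptions (id INTEGER PRIMARY KEY, type INTEGER);
CREATE TABLE Nodes (id INTEGER PRIMARY KEY, description INTEGER);
INSERT INTO Descriptions VALUES (1, 40), (2, 41), (3, 40);
-- Node 13 has no Description and is never returned.
INSERT INTO Nodes VALUES (10, 1), (11, 2), (12, 3), (13, NULL);
""")

only_40 = nodes_of_types(conn, [40])
both = nodes_of_types(conn, [40, 41])
```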

The table stores the

Why not all RDF

The general goal in designing a relational database is to accurately represent the acquired data. This includes splitting the data into as many fields as can be made useful in a typical query. However, since RDF is a way of storing data, and that data can be stored in triples, any fields beyond those necessary to store the predicate, subject, object and their types may be regarded as extraneous.

One possible schema is to store the attributions in the Statements table. It could be argued that one should not have to make queries on a statement's reifiedId. I put the reification information in separate Statements fields as I wanted to be able to implement a trust policy without doubling my number of joins. A nice benefit of this is that it is very easy to avoid presenting the triples that are the product of reification. I will probably not get around to experimenting with it the other way, but it probably should be done.


This section outlines the process for locating or creating nodes in the SQL database.


Links to background materials

Here are links to relevant specifications and products:

Last revised: $Date: 2003/11/12 12:09:45 $ by: $Author: eric $
