Completed report, published 2003-01-23. Updates will be made over the lifetime of the SWAD-Europe project as tools change, new tools emerge and other relevant materials are added.
Comments on this document are welcome and should be sent to the authors or to the public-esw@w3.org list. An archive of this list is available at http://lists.w3.org/Archives/Public/public-esw/
1 Introduction
2 Existing Work
3 Triple Stores Implemented with Databases
4 Database Schemas used for Triple Stores
5 Mapping RDBMS schemas onto RDF
6 Mapping Tools
7 Summary and Conclusions
A References
B Changes
This report is part of SWAD-Europe Work package 10: Tools for Scalability and Storage and addresses how semantic web data storage relates to using relational database systems, in particular those that are licensed as Free Software / Open Source (FS/OS). This work package builds on SWAD-Europe Deliverable 10.1 Tools for Semantic Web Scalability and Storage: Survey of Free Software / Open Source RDF storage systems that surveyed systems providing general triple storage and did not consider the detail of existing database approaches. This report convers relational and other database systems that might be appropriate to use; any form that is not a general triple store.
There are two main thrusts in this workpackage dealing with the main issues that arise when dealing with RDF triple stores and relational databases. These can be summarised as the following Frequently Asked Questions (FAQs):
These two questions can be taken as parts of the more general question of how to map RDF data and its schema vocabulary to and from a relational database.
The analysis of this question and how to approach it requires the details of what an RDF triple store requires and what features typical relational systems provide. The first steps are therefore to look at triple store requirements and the features of RDBMSes that apply to performing this mapping between these data forms.
These may not all be required, depending on the application but good support for some of these triple store requirements may be crucial for efficient systems:
web of truststyle applications and also allows RDF graph merging/demerging to work better.
There have been only a few surveys of existing work on using RDBMSes for semantic web data storage and approaches taken to mapping between these data forms. These include the online survey by Melnik in Storing RDF in a relational database[RDFRELATIONAL1] done in 2000, with submissions from the authors of the various applications and various smaller reports [RDFQCASE],[W3CACLSQL],[DBVIEW] There has been no major work on methods of extracting semantic web data from existing RDBMSes or on detailed issues of using RDBMSes for storing semantic web data.
The Jena[JENA] project team analysed various approaches to database schemas as part of the implementation of the Jena RDBMS storage backend. One goal was to make this work over the standard Java database abstraction system JDBC, and thus allow support for many RDBMSes. This involved various tradeoffs, so specific support was added for several particular RDBMSes, to provide optimised support (PostgreSQL, MySQL, Interbase and others). On top of this, several different database schemas were taken (and all of them with/without support for multiple models per single database). and the resulting systems tested for performance.
In Jena relational database interface - performance notes[JENARDBPERF] Reynolds summarises the analysis in section Summary of observations as follows:
- Partitioning the statement table into attribute tables has no measurable performance benefit on small scale tests.
- The use of content hashes instead of database ID's has some performance cost on load and no performance benefit on these queries but does lead to mergable databases.
- The Berkeley DB storage manager (in non-transaction mode) is nearly an order of magnitude faster than most SQL database options. Though of those tested MySQL comes the closest. We assume transaction support is the primary overhead.
- For indexing the main statement table the two indices commonly used, namely subject+predicate and object, are indeed the best tradeoffs.
- So long as the database query plans are not pathological then allowing multiple models in a single database has no performance impact if you don't use it. Be careful on PostgreSQL though.
- Stored procedures can save around 25% in load times.
- Caching JDBC prepared statements can give a 2-3x improvement in performance for some databases.
These results are useful for comparing different database backends for RDBMS storage of semantic web data and hint at the tradeoffs that can be made.
In The RDFSuite: Managing Voluminous RDF Description Bases[MANVOLUME] the authors describe using the RSSDB - RDF Schema Specific DataBase (RSSDB)[RSSDB] part of ICS-FORTH RDFSuite[RDFSUITE] and comparing using the same RDBMS with a generic representation database schema using triples approach (with URI interning) versus a RDF schema specific representation database schema with indexes, where the tables are customised for the data. In all cases the schema-specific approach used less storage, was faster in loading and querying, and in some queries, very substantially quicker. This work was initially done over JDBC to PostgreSQL since it provided the support for subsumption that the project required. Later, support for MySQL was added but it isn't clear if the two approaches were checked over that database.
This project has already reported in SWAD-Europe Deliverable 10.1 Tools for Semantic Web Scalability and Storage: Survey of Free Software / Open Source RDF storage systems on existing approaches to storing semantic web data in triple form, rather than specifically for RDBMSes. An evaluation of queries will be done later in this project in SWAD-Europe Deliverable 7.2 Report comparing existing RDF query language functionality, documenting different scenarios and users for RDF query languages
This section outlines the features of the major semantic web data storage systems based on RDBMS or other backends. The main features considered are support for schemas (RDF, ontologies), inference, indexing and searching of literal strings and what are the dependencies or requirements such as implementation language or underlying relational databases. The next section describes the relational database schemas in use by the stores, where appropriate.
Jena[JENA] is a Java semantic web toolkit that provides a rich API including storage over using either Sleepycat / Berkeley DB or via JDBC to talk to a variety of RDBMSs including MySQL, PostgreSQL, Oracle, Interbase and others. Many of the support databases have specific optimisations driven from specialised configuration files, along with multiple forms off database schema that allow the user to pick an appropriate one for the application, such as allowing multiple models in a single RDBMS database, or choosing to use stored procedures. The RDBMS api can be easily customised for other databases, and a generic schema is available that will likely work with standard SQL. The performance of the approaches and databases has been analysed and optimised, and advice on the tradeoffs of these are documented[JENARDBPERF]. The Jena RDBMS backend is restricted to work only via JDBC directly, but a more specific RDBMS database storage system could be written relatively easily, which could enable the optimising of, for example, query operations into more easily optimizable forms.
License: Apache/BSD-style license without advertising.
The KAON project's[KAON] RDF Server as described in Karlsruhe Ontology and Semantic Web Infrastructure Developer's Guide[KAONDEV] provides an RDF repository as an Enterprise Java Beans (EJB) that can be used on J2EE application servers, with the data persisted in a relational database via JDBC. The default persistence is provided by the EJB container mechanism but an enhanced "Engineering" relational schema is provided as part of the Engineering Server for use when building ontologies, with a complex indexing custom scheme over the raw content. The KAON Server[KAON-SERVER] (produced as part of IST Project 2001-33052 WonderWeb) provides a high-level onotological interface including modules accessing triple store or RQL-based repositories and also allows pluggable inferencers to be added for higher level logics. The 2002-10-02 version works with any SQL2-compatible RDB, tested with MS SQL Server, and run on PostgreSQL, IBM DB2 and Oracle 9i.
License: LGPL
The Parka Database[PARKASW],[PARKADB] - part of the Parka-KB is a knowledge representation system based on semantic networks, layered on frames using a relational database beneath (presently an internal one, although it has been used over Oracle). It uses fixed table sizes for predicates for speed of operation and has similar optimised structures for handling inheritance of classes and properties, which are always stored in memory. The property tables are moved between disk and memory on demand. The system has been used with over 2M frames (not quite assertions) and although applied to the KR world, is now being updated for semantic web data (with URIs) such as RDF.
License: MIT License with advertising not required.
ICS-FORTH's Java RDFSuite[RDFSUITE] as described in The RDFSuite: Managing Voluminous RDF Description Bases[MANVOLUME] describes how it was designed and implemented and the approache taken to create an efficient persistent store based on an ORDBMS model. Two approaches were taken - a generic store or a specific RDBMS schemas for the application. It creates the latter from schema knowledge to automatically generate an Object-Relational (SQL3) representation of RDF metadata. Internally it has tables for Class, Property, SubClass, SubProperty and the particular classes (instances) and properties (source, target) in the RDF schema being used. It also handles XML Schema data types for literal values, grouping and filtering primitives and sorting.
License: Under the RDFSuite License (C-Web license on the web page) which allows free (price) use of the software as long as credit is kept.
Sesame[SESAMEPROJ] as described in Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema[SESAMETR][SESAME] and [BenchRDFS] was designed to use existing storage systems such as the various DBMSes available that which have different strengths and weaknessess. These can then be used via sesame in various ways to store RDF data. This variety was captured by the concentration of the DBMS-specific code into a Storage And Inference Layer (SAIL) which interfaces between the RDF-specific methods and the database API. There is also, an in-memory SAIL implementation which uses files for persistent storage.
The SAIL provides interfaces to sesame modules that implement the RQL query language, administration (including loading of RDF data and schema, deleting) and exporting RDF as RDF/XML. SAIL is a high-level and lightweight Java API that includes support for RDF schema semantics and data-streaming operation. At the time it was designed the existing Java APIs didn't support all of these features. SAIL layers stack so that actions are passed between layers until they are handled, for example a schema-caching SAIL was created without disturbing other layers.
SAIL can work on top of any RDBMS, ODBMS, existing RDF stores, RDF files or network services. In the current version (Sesame 0.7.1), Sesame supports using any DBMS with a JDBC-driver and currently explicitly supports PostgreSQL (7.0.2 or later), MySQL (3.23.47 or later) and Oracle9i. The PostgreSQL store uses the object-relational RDBMS features to support (transitive) subtable relations between tables, which are appopriate for providing RDF Schema class and property subsumption. The MySQL store also supports subsumption by another mechanism.
The SAIL uses a dynamic database schema inspired by that described in Querying Community Web Portals[QCOMWEBPOR] in which new tables are added for each new class and subclass relationships cause the table to be a subtable of the superclass tables. The same method is also used for properties.
Sesame also has an RDF model theory inferencer that can work directly on the SQL schema used, which although it use knowledge of the tables that are in use, it is independent of the actual RDBMS. A variant implementation of this inferencer also tracks which statements were used to infer other statements, called the Truth Maintainance System (TMS). This is available for the PostgreSQL and MySQL backing stores only.
The query module turns the RQL query into a model, optimises it to another model. The evaluation is then done in the RQL query engine over calls to the SAIL. The RQL query itself could have been passed to the repository, which could then be used directly if it supported it, but this would restrict it to only repositories that supported RQL. RDFSuite took the opposite approach and requires all repositories to support RQL for querying.
This provides adding/removing data, emptying a repository, adding a stream of data (such as from an RDF parser) and includes RDF schema information such as checking statements versus their schema and infering implied information.
Allows serializing a repository to the RDF exchange format, RDF/XML.
The PostgreSQL store is rather slow and in particular, schema changes since table creation is very expensive. Adding subclasses between two existing classes is very slow since it cannot be done on existing tables.
License: GPL/LGPL
The TAP Project[TAPPROJ] is a system written in C as an Apache module (TAPache) that implements an RDF storage, query and network/web service API and client libraries in Perl and Java. It supports RDF schema subclass and sub property queries and indexing in an efficient manner based on either a BerkeleyDB store or MySQL. The MySQL schema is optimised for the TAP Knowledge Base[TAPKB] properties hard-coded into the source, but could be customised to index additional schemas. It has a query language GetData[GETDATA] implemented as a SOAP XML interface that allows cross-server queries with RDFS subsumption across TAP servers.
License: Apache/BSD-style license with advertising.
Several database schemas are available for Jena over multiple
relational database backends. The detail of the implementation is described in
Jena relational database interface - introduction[JENARDBINTRO]
and shows that the flavours that can be chosen are related to
using MD5 hashes for resource and literal IDs (Hash
),
multiple models in a single database (MM
), particular
optimised database schemas for an implementation (PostgreSQL, MySQL,
Interbase, Oracle).
Table RDF_STATEMENTS | ||
Column name | Type | Comments |
---|---|---|
subject | id-ref | |
predicate | id-ref | |
object | id-ref | |
object_isliteral | smallint | flags whether "object" is in literal or resource table |
model | id-ref | only used in multiple-model variants |
isreified | smallint | not used at present |
Table RDF_LITERALS | ||
Column name | Type | Comments |
---|---|---|
id | id-ref | |
language | varchar | xml:lang value if available |
literal_idx | varchar | the literal itself or the largest subset of that which is indexable by the database |
literal | blob | the full literal value if the literal won't fit in literal_idx |
int_ok | smallit | flag to indicate that an parse of the literal into an integer is available |
int_literal | int | the integer value of the literal, only valid if int_OK=1 |
well_formed | smallint | preserve jena flag that the literal is well-formed xml |
Table RDF_RESOURCES | ||
Column name | Type | Comments |
---|---|---|
id | id-ref | |
namespace | id-ref | pointer to namespace table |
localname | varchar |
Table RDF_NAMESPACES | ||
Column name | Type | Comments |
---|---|---|
id | id-ref | |
uri | varchar |
Table RDF_MODELS | ||
Column name | Type | Comments |
---|---|---|
id | id-ref | |
name | varchar | Used when reopening a persistent model in a database that supports more than one model. |
Table RDF_LAYOUT_INFO - name/value pairs which define the layout properties | ||
Column name | Type | Comments |
---|---|---|
name | varchar | |
val | varchar |
The layouts currently defined are:
Layout | Supports multiple-models? | Uses hash ids? | Comments |
---|---|---|---|
Generic | no | no | See above for details |
MMGeneric | yes | no | |
GenericProc | no | no | Variant on generic that uses stored procedures for updates |
MMGenericProc | yes | no | Variant on generic that uses stored procedures for updates |
Hash | no | yes | |
MMHash | yes | yes |
Taken from Jena relational database interface - introduction[JENARDBINTRO]
Jena uses variants of the schemas given in the layouts described in Schema 1 the tradeoffs of which were analysed in Jena relational database interface - performance notes[JENARDBPERF] for typical operations.
KAON uses the standard Java Entity Bean storage for standard applications - either the container managed persistence or bean managed. For more powerful applications such as working on a complex ontology, there is a specialised "engineering server" relational database schema.
One of the storage systems is described as suitable for Ontology Engineering - Engineering Server which is a rich database schema which is highly indexed and unlike Sesame, does not have a table per class so does not require rebuilding as they are added and related.
It requires a SQL2-compatible DBMS and has been tested with MS SQL Server 2000, PostgreSQL, IBM DB2 and Oracle 9i.x
From EJB clases in Java package edu.unika.aifb.rdf.rdfserver.ejb
Table Model (edu.unika.aifb.rdf.rdfserver.ejb.ModelBean ) |
||
Column name | Type | Comments |
---|---|---|
alias | java.lang.String | model name |
LogicalURI | java.lang.String | URI of model |
Table Counter (edu.unika.aifb.rdf.rdfserver.ejb.CounterBean ) |
||
Column name | Type | Comments |
---|---|---|
type | varchar(255) | type of bean; used to give a counter for each bean type |
counter | int |
Table Statement (edu.unika.aifb.rdf.rdfserver.ejb.StatementBean ) |
||
Column name | Type | Comments |
---|---|---|
id | java.lang.Integer | |
modelKey | java.lang.Integer | pointer to Model table (alias) |
subKey | java.lang.Integer | pointer to Resource table (id) |
predKey | java.lang.Integer | pointer to Resource table (id) |
objKey | java.lang.Integer | pointer to Resource table (id) or Literal table (id) |
literal | java.lang.Integer | true if objKey is a literal (in Literal table, else in Resource) |
Table Resource (edu.unika.aifb.rdf.rdfserver.ejb.ResourceBean ) |
||
Column name | Type | Comments |
---|---|---|
id | java.lang.Integer | |
label | java.lang.String | resource URI |
Table Literal (edu.unika.aifb.rdf.rdfserver.ejb.LiteralBean ) |
||
Column name | Type | Comments |
---|---|---|
id | java.lang.Integer | |
label | varchar(255) | literal string |
The KAON engineering server schema additionally provides (from 2002-02-10 version) an enhanced schema with models, supermodels as well as indexing of advanced ontological relationships and constraints.
Table PKCounter | ||
Column name | Type | Comments |
---|---|---|
type | varchar(255) | |
counter | int |
Table OIModel | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
logicalURI | varchar(255) |
Table IncludedOIModel | ||
Column name | Type | Comments |
---|---|---|
includingModelID | int | |
includedModelID | int |
Table AllIncludedOIModels | ||
Column name | Type | Comments |
---|---|---|
includedModelID | int | |
includingModelID | int |
Table OIModelEntity | ||
Column name | Type | Comments |
---|---|---|
entityID | int | |
modelID | int | |
entityURI | varchar(255) | |
conceptVersion | int | |
propertyVersion | int | |
isAttribute | smallint | |
isSymmetric | smallint | |
isTransitive | smallint | |
inversePropertyID | int | |
inversePropertyModelID | int | |
instanceVersion | int |
Table ConceptHierarchy | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
superConceptID | int | |
subConceptID | int |
Table ConceptInstance | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
conceptID | int | |
instanceID | int |
Table PropertyHierarchy | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
superPropertyID | int | |
subPropertyID | int |
Table PropertyDomain | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
propertyID | int | |
conceptID | int | |
minimumCardinality | int | |
maximumCardinality | int |
Table PropertyRange | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
propertyID | int | |
conceptID | int |
Table RelationInstance | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
propertyID | int | |
sourceInstanceID | int | |
targetInstanceID | int |
Table AttributeInstance | ||
Column name | Type | Comments |
---|---|---|
modelID | int | |
propertyID | int | |
sourceInstanceID | int | |
textValue | varchar (255) |
Parka is a frame-based system and does not use an external relational store using SQL schema. It uses a lightweight internal mini-relational store designed in the frames/slots KR approach predating RDF-style triples. The frames (RDF triple subjects) are identified by integers and contain frame properties that are the non-hierachical relationships. The relational tables store the domain and range of the properties. The persistent store of properties is complemented with an in-memory cache of the structural links or hierarchical properties; the relations ISA, SUBCAT, INSTANCE-OF, INSTANCE so that class and property queries can be carried out more efficiently.
RDFSuite has a persistent RDF store based on an object-relational DBMS (ODBMS) with two types of schema - The RDF Schema Specific Database (RSSDB) also called SpecRepr and a generic, simple list of triples called GenRepr
The core model of the SpecRepr schema is represented by 4 tables with separation of data and schema information. Class tables store URIs of the resources (instances) of that class. Property tables store the uris of the source and target nodes of the property. It also has tables Class, Property, SubClass, SubProperty for recording the class and property details and their relationships. Sub tables are used for relating tables that are subclass/properties.
Table Class | ||
Column name | Type | Comments |
---|---|---|
id | int | |
nsid | int | pointer to namespace URI of the class |
lpart | text | local name of the class |
Table Property | ||
Column name | Type | Comments |
---|---|---|
id | int | |
nsid | int | pointer to namespace URI of the property |
lpart | text | local name of the property |
domaind | int | pointer to the domain class |
rangeid | int | pointer to the range class |
Table SubClass | ||
Column name | Type | Comments |
---|---|---|
subid | int | pointer to the sub-class |
superid | int | pointer to the super-class |
Table SubProperty | ||
Column name | Type | Comments |
---|---|---|
subid | int | pointer to the sub-property |
superid | int | pointer to the super-property |
Table NameSpace | ||
Column name | Type | Comments |
---|---|---|
id | int | |
URI | text | URI of namespace (of class/property) |
Table Types | ||
Column name | Type | Comments |
---|---|---|
type | text | names of RDF/S built-in-types such as rdf:Property, rdf:Bag, ... and literal types (string, integer, date) |
Instance Table (for a particular property) | ||
Column name | Type | Comments |
---|---|---|
source | text | URI of property source |
target | text | URI of property target |
Class Table (for a particular class) | ||
Column name | Type | Comments |
---|---|---|
URI | text | URI of class instance |
Indices are constructed on the attributes URI, source and target of the tables in schema 3 in order to speed up joins and the selection of specific tuples of the tables. Indices are also constructed on the attributes lpart, nsid and id of the tables Class and Property and on the attribute subid of the tables SubClass and SubProperty. Instance tables also connected throught the subtable relationship of ORDBMSs.
Table Triples | ||
Column name | Type | Comments |
---|---|---|
predid | int | |
subid | int | |
objid | int | Triple object resource pointer |
objvalue | text | Triple literal value string. |
Table Resources | ||
Column name | Type | Comments |
---|---|---|
id | int | |
uri | text | Resource URI |
The SAIL uses a dynamic database schema inspired by [QCOMWEBPOR] in which new tables are added for each new class and subclass relationships cause the table to be a subtable of the superclass tables. Similarly for properties.
The schema in schema 5 is from Sesame 0.7.1 (with MySQL, but it applies to PostgreSQL or other RDBMSes with the appropriate SQL changes). All resources and literal values are mapped to a unique ID (resources and literals tables). These are then used to form the triples (triples table) and the relationships between the classes and properties. Indexes are made on several of the tables to enhance domain, range, superclass etc. lookups.
Explicitly added statements during a transaction (can have duplicates)
Table addedTriples | ||
Column name | Type | Comments |
---|---|---|
subject | unsigned integer | |
predicate | unsigned integer | |
object | unsigned integer | |
explicit | bool |
All statements that were inferred by one inferrence rule
Table allInferred | ||
Column name | Type | Comments |
---|---|---|
subject | unsigned integer | |
predicate | unsigned integer | |
object | unsigned integer | |
explicit | bool |
All statements that were added during a transaction
Table allNewTriples | ||
Column name | Type | Comments |
---|---|---|
subject | unsigned integer | |
predicate | unsigned integer | |
object | unsigned integer | |
explicit | bool |
Table class | ||
Column name | Type | Comments |
---|---|---|
id | unsigned integer | pointer to class instance (resource) identifier |
All direct rdfs:subClassOf relations
Table direct_subclassof | ||
Column name | Type | Comments |
---|---|---|
sub | unsigned integer | |
super | unsigned integer |
All direct rdfs:subPropertyOf relations
Table direct_subpropertyof | ||
Column name | Type | Comments |
---|---|---|
sub | unsigned integer | |
super | unsigned integer |
Table domain | ||
Column name | Type | Comments |
---|---|---|
property | unsigned integer | |
class | unsigned integer |
All statements that were inferred by one inferrence rule.
Table inferred | ||
Column name | Type | Comments |
---|---|---|
subject | unsigned integer | |
predicate | unsigned integer | |
object | unsigned integer | |
explicit | bool |
All instance relations
Table instanceof | ||
Column name | Type | Comments |
---|---|---|
inst | unsigned integer | |
class | unsigned integer |
Table literals | ||
Column name | Type | Comments |
---|---|---|
id | unsigned integer | |
language | varchar(4) binary | |
value | (text) |
Table namespaces | ||
Column name | Type | Comments |
---|---|---|
id | unsigned integer | |
prefix | varchar(16) binary | |
name | (text) | Namespace URI |
export | bool |
All added statements that are actually new (i.e. not yet in the triples table)
Table newTriples | ||
Column name | Type | Comments |
---|---|---|
subject | unsigned integer | |
predicate | unsigned integer | |
object | unsigned integer | |
explicit | bool |
All direct instance relations.
Table proper_instanceof | ||
Column name | Type | Comments |
---|---|---|
inst | unsigned integer | |
class | unsigned integer |
Table property | ||
Column name | Type | Comments |
---|---|---|
id | unsigned integer |
Table range | ||
Column name | Type | Comments |
---|---|---|
property | unsigned integer | |
class | unsigned integer |
Repository metadata
Table rep_metadata | ||
Column name | Type | Comments |
---|---|---|
mkey | character varying(255) | |
mvalue | character varying(255) |
Table resources | ||
Column name | Type | Comments |
---|---|---|
id | unsigned integer | |
namespace | unsigned integer | |
localname | character varying(255) |
Table subclassof | ||
Column name | Type | Comments |
---|---|---|
sub | unsigned integer | |
super | unsigned integer |
Table subpropertyof | ||
Column name | Type | Comments |
---|---|---|
sub | unsigned integer | |
super | unsigned integer |
Table triples | ||
Column name | Type | Comments |
---|---|---|
subject | unsigned integer | |
predicate | unsigned integer | |
object | unsigned integer | |
explicit | bool |
TAP uses a mostly hard-coded database schema for MySQL only, but can create tables for particular predicates. The standard form is to index only the type (rdf:type) predicate; the remaining triples live in the 'tp' table. It also has built-in knowledge of namespaces which are stripped out of the URIs before they are added to the tables; so although namespaces are used, they are not in the database schema.
Table tp | ||
Column name | Type | Comments |
---|---|---|
source | text | namespace-shortened URI |
arc | text | namespace-shortened URI |
target | text | namespace-shortened URI |
The table 'io' (which TAP interally calls type)
is always present and stores statements with the rdf:type
property in the schema of the 'tp' table above.
Table lexicon (title, label, plural, adjective arcs) | ||
Column name | Type | Comments |
---|---|---|
source | text | namespace-shortened resource URI |
str | text | each word of literal |
fulltitle | string | full literal |
This tables is used to perform the literal indexing for certain
property arcs. It is generally queried like:
select source from 'lexicon' where str="word"
to identify candidate relevant resources.
There is a large quantity of existing data that is stored using relational database technology. We now compare relational schemas and idioms to features in RDF, and consider how best to expose relationally-stored data to RDF processors and other Semantic Web tools.
At the moment we only consider the problems of a data-dump into some serialised RDF format. Issues arising from wrapping existing relational database schemas with RDF triple-matching and querying APIs will be dealt with elsewhere in SWAD-Europe Deliverable 7.2 Report comparing existing RDF query language functionality, documenting different scenarios and users for RDF query languages.
Clear descriptions of the relational algebra[CODD] and its relationship to modern RDBMS are readily available elsewhere. Here we present a two-line "executive summary":
A simple approach to making data available in a format suitable for import or further manipulation with native RDF tools is obvious: simply generate a node for every row of every relation. To that node, attach property arcs (one per column in the relation / field in the table) with the field content as the property value.
Such an approach is reasonably straightforward to extend to provide a typical triple-based API. Such APIs need to associate private (model-specific) identifying information with blank nodes in order to preserve their identity. That mechanism can be used to associate with the generated blank nodes a row identifier for the underlying table (either using a ROWID feature, if present, or by encapsulating the values of sufficient fields to uniquely identify the row).
There are some conceptual problems with this simple approach, however. These may be effectively summed up as follows: the approach exposes an RDF description of the relational database, not the conceptual entities which the relational description is attempting to capture.
In particular, the expression of various problem-domain concepts within the idiom of relational algebra (particularly as implemented in a RDBMS) may end up using the same small set of RDBMS constructions to model several different problem-domain constraints. That is, there is not necessarily a perfect mapping from the problem-domain into the relational model (nor is it reasonable to expect there to be).
In addition, a developer of a RDBMS application may make a number of assumptions and optimisations that are possible due to (tacit) knowledge of the problem domain and/or experience with efficient database application implementation.
Therefore, a fully automatic solution to convert data from a relational schema into an RDF schema together with instance data can never prove to be fully satisfactory (although it may suffice in some circumstances). Instead, there are additional steps that may be taken (guided by knowledge of the initial problem domain) that can produce a more "natural" expression of the data using an idiom more familiar to RDF users.
We look at the case of a single relation first. Consider the relation:
(p1, ..., pn, a1, ..., am)
where the pi are the attributes that form the primary key, and the aj are the other attributes in the relation.
Where a primary key exists, we may consider a table row as describing the properties of an entity associated with that row in the table.
The first question that arises is how to name those entities within RDF. There are basically two ways that resources are represented within RDF.
The first is a simple solution but it poses problems: do we construct the URIrefs such that they encode primary key information? Do we need a separate table to map URIrefs to rows?
The second approach seems a more natural way to expose the information in a single relation, particularly where no obvious URIref exists to name a particular entity. This approach is the one adopted here; but should a relation describe a resource that has a "natural" or obvious URIref, it makes sense to use the first option.
We therefore transform the row into the following RDF (expressed here in an N-Triples-like format [NTRIPLES]):
_:x(p1, ..., pn) <eg:p1> p1 .
...
_:x(p1, ..., pn) <eg:pn> pn .
_:x(p1, ..., pn) <eg:a1> a1 .
...
_:x(p1, ..., pn) <eg:am> am .
where _:x(p1, ..., pn) is a blank node. For the purposes of serialising a graph, the blank node is allocated a local name using a function based on the primary key from this relation, thereby uniquely identifying the entity or resource being described. (For the purposes of wrapping a RDBMS within an RDF API, a "blank node" object would similarly have to carry such identifying information "beneath the hood".)
In fact, where tables are not completely normalised, several entities that exist in a one-to-one relationship may be described by the same relation. Let us assume, for example, that attributes aj ... ak actually describe a conceptually separate entity. Then the following RDF might be a more natural expression:
_:x(p1, ..., pn) <eg:r1> _:y(p1, ..., pn) .
_:y(p1, ..., pn) <eg:aj> aj .
...
_:y(p1, ..., pn) <eg:ak> ak .
Here, _:y(p1, ..., pn) is a second blank node generated similarly, with sufficient identifying information to be able to locate the source row.
Such denormalisation is common in production databases. In general, this sort of consideration is not generally reflected using the usual machinery of a RDBMS; recourse to an underlying ER diagram or other specific knowledge of the problem-domain is required to make the determination that multiple entities may be represented in a single relation.
In the above example, we used arbitrary <eg:p1> ... <eg:am> arc labels to connect a resource to its attributes. The selection of more appropriate attribute labels (possibly from existing schemas) is an integral part of the translation process.
It is conceivable that the same property label may arise naturally in several places, from several source relations. There are RDFS considerations arising from this, since range and domain declarations in RDFS are not class-contextualised.
Although a full treatment of adapting a legacy RDBMS to a "triple-matching" API is beyond the scope of this document (we consider primarily the export of data here), it is worthwhile noting that having the same property arc arise from several translation rules will likely give rise to (potentially expensive) union queries when trying to answer questions of the form:
Find all X and Y such that
X <property-name> Y .
There is a natural expression in RDF of a two-table join:
Table 1: (p1, ..., pn, a1, ..., ar)
Table 2: (f1, ..., fn, b1, ..., bs)
where the (fi) form a foreign key. The simplest approach is to create multiple arcs as follows (with the _:x(p1, ..., pn) generated as above, naming the entity in table 1).
For each row in table 2 with (fi) = (pi), generate
_:x(p1, ..., pn) <eg:b1> b1 .
...
_:x(p1, ..., pn) <eg:bs> bs .
Such a simple approach may well suffice in many cases. However, where a one-to-many relationship exists between table 1 and table 2, this approach provides no way of distinguishing between corresponding groups of triples produced by multiple rows from table 2. Therefore, an accurate expression of these relations in RDF requires the creation of a (blank) node for each row in table 2:
For each row in table 2 with (fi) = (pi), generate
_:x(p1, ..., pn) <eg:t1t2> _:y(p1, ..., pn) .
_:y(p1, ..., pn) <eg:b1> b1 .
...
_:y(p1, ..., pn) <eg:bs> bs .
Here, <eg:t1t2> is an arc linking the entity described in table 1 to a set of values from a row in table 2.
We also note in passing that RDF and RDFS can say little or nothing about any cardinality constraints here - more expressive languages (for example, OWL) are required to express such meta-information.
Occasionally, and where the underlying RDBMS does not support a better expression of an enumerated type, a database may utilise a foreign key to capture that notion. In such a situation it may be better to map the foreign key directly to either: a datatyped literal, or: a URIref-named resource denoted the value of the enumerated type. Determining when such an alternative mapping may be appropriate is not generally automatable from a data dictionary.
Where a many-to-many relationship exists between entities, a RDBMS generally has to model this using a joining table containing foreign keys from each of the tables representing the entities in the many-many relationship.
In the absence of additional fields in the joining table (that is, where the joining table is simply an artifact of the expression of the many-many relationship between the other two tables), such a relationship may be expressed directly in RDF via the use of a simple property arc. In such a situation the only choice remaining is of the name (and direction) of that property arc.
Should additional information exist on the joining table (ie, there are fields that are not part of a foreign key), or should the joining table relate more than two other tables, an additional node will have to be introcuded into the resulting RDF to model the relationship directly.
That is, we have two alternatives: either -
Table 1: (p1, ..., pk, a1, ..., al)
Table 2: (q1, ..., qn, b1, ..., bm)
Joining table: (f1, ..., fk, g1, ..., gn)
Generate the many-many triple as follows -
For each row in the joining table with (fi) = (pi) and (gj) = (qj), generate
_:x(p1, ..., pk) <eg:join> _:y(q1, ..., qn) .
or the more explicit alternative:
Generate the many-many triples as follows -
For each row in the joining table with (fi) = (pi) and (gj) = (qj), generate
_:x(p1, ..., pk) <eg:join> _:join(p1, ..., pk, q1, ..., qn) .
_:y(q1, ..., qn) <eg:join> _:join(p1, ..., pk, q1, ..., qn) .
The relational model does not impose an ordering on the rows comprising a relation: the rows of a relation form a flat, unordered bag (or set). Instead, a RDBMS permits queries to specify ordering criteria using the contents of fields in a query. To improve query efficiency, indexes on fields often used for ordering results may be created.
RDF has two constructs which may be utilised for ordering purposes: containers (the rdf:_1, ... construct) and collections.
Mapping the flat relational data onto either of these RDF constructs is likely to be counter-productive. If the original data is orderable, then it must contain fields which permit its ordering. To permit the most simple reflection of such data into RDF, it simply suffices to ensure that all the field contents that are used to create the ordering are reflected into RDF. By the manipulation of the resulting RDF, the data may then be extracted in the original order (assuming the RDF querying technology used permits the ordering of results).
With the recent revision of the RDF specifications (at time of writing, approaching last call)[RDFCORE], a mechanism for specifying a datatype to be attached to a literal has been introduced.
The adoption of such a datatyping mechanism has practially zero cost from the point of view of mapping relational data into RDF; and it has the additional benefit that typing information from the original database can be preserved.
For example, where a field exists which contains an integer value, the resulting RDF may utilise the XSD integer datatype rather than the "untyped" RDF literal. That is:
_:x(p1, ..., pn) <eg:intValuedProperty> "10"^^<xsd:integer> .
rather than:
_:x(p1, ..., pn) <eg:intValuedProperty> "10" .
For the basic types most commonly found within a relational database, existing datatypes (for example, from the XML Schema Datatypes[XSD]) will suffice.
Thus far, attention has been on the mapping of instance data, not on meta-data (the relational schema). Clearly, some mapping is possible. For example, it is trivial to create a class for the entities described in a particular relation, and to generate rdf:type arcs as required.
It is natural, in performing a relational-RDF mapping, to consider what support RDF gives for expressing schema-level constraints. In this regard, RDFS is expressively weak in many aspects.
There is no support in RDFS for expressing cardinality constraints. While a number of tools are capable of identifying two blank nodes as denoting the same entity (on the basis of equality conditions expressed over one or more properties of those nodes), the rules for such an inference are not expressible in RDFS.
The semantics for rdfs:range and rdfs:domain are also universal and conjuntive, not class-contextualised. Therefore, care needs to be taken while making range and domain assertions.
RDF does not have a mechanism for declaring new datatypes, although it can represent literals with arbitrary datatypes.
However, RDF and RDFS form a foundation on which a number of more expressive languages are expected to be built. It is hoped that these languages will fill in some of the missing features outlined here.
In the description above of the process of mapping from a relational database to RDF, some of the features present in RDF have not been used:
We finish with a rundown of some features in modern RDBMS that may present additional complications in the relational-RDF mapping.
Many modern RDBMS permit a wide variety of field types. Some of these may be mapped directly to datatyped literals for preexisting datatypes (eg, xsd:integer). For some types, however, it is possible that no corresponding XSD datatype exists, and a new datatype may be required to complete the mapping.
Some modern systems include compound types for fields - for example, "array of integers". Mapping such a construct into a single, large datatyped literal is somewhat unsatisfactory. Instead, RDF containers (or even collections) may be used.
Some systems include features which might be described as "object-relational". These include the "inheritance" of one relation's properties by another. While the detail of such features varies between products, the notion of class inheritance may be adequately modelled using RDFS.
The NULL value does not appear directly in relational algebra. With sufficient normalisation, it it possible to "factor out" the need for NULL values. However, pragmatic concerns such as a desire for efficient operation often lead to NULL values becoming necessary.
Unfortunately, the NULL value may be tacitly overloaded with a number of meanings: "not appropriate", "not available", "don't know", even such things as "this person is an adult" (when present in a "date-of-birth" field), and so on; the semantics for a particular NULL value are often only captured by application code.
Therefore, no automatic rule is possible to deal with the many uses and misuses of NULL. Instead, the semantics of each occurence must be individually considered, and an appropriate expression in RDF found. In many cases, it suffices to not emit triples whose object would be a NULL value.
KAON REVERSE[KAONREVERSE]
from the
KAON project[KAON]
is an early prototype for mapping relational database content to
ontologies enabling both storage of instance data in such databases
and querying the database through the conceptualisation of the
database
.
The mapper is intended to be merged with the
Harmonise Mapping Framework[HMAFRA]
tool and the user interface into the KAON OIModeler.
The work is ongoing at the current date 2003-02-18.
D2R MAP[D2RMAP] is a mapping language for turning a relational database into RDF along with a LGPL-licensed processor that implements the mapping, emitting RDF/XML, N3 or N-Triples.
We surveyed existing work mapping on triple stores and databases, especially relational ones and gave an overview of the major implementations with their schemas.
We have looked at the mapping of relational data into RDF. This discussion has been principally from the point of view of exposing relational data in a format suitable for import and processing by RDF tools. However, similar considerations apply when wrapping a legacy RDBMS for use with an RDF triple-matching API.