SQL-RDF

Authors:
Andy Seaborne, Hewlett-Packard Laboratories
Damian Steer, Institute for Learning and Research Technology, University of Bristol, UK
Stuart Williams, Hewlett-Packard Laboratories

1 Introduction

We describe a prototype system that extracts data from a number of relational databases and integrates that data into a business process.  Our approach has two parts: a direct mapping of the databases into RDF and a set of rules to perform the ontology mapping.

We are interested in exploring the architectural and design best practices in this area with a possible view to work within W3C.

2 Business Problem

Exact details of the business situation have been removed. While this comes from a real situation, with a real issue being addressed, the description here has been abstracted.

The business is involved in the production of physical goods at a number of different geographical sites. Each site has installed equipment. There is a cycle of investment in such equipment whereby all the sites are studied to see where the investment would be best used.  There are many factors involved in production and production capacity at each site, as well as limitations and geographical considerations. It may be that a site is running at capacity for its current equipment, so significant new capital equipment would be required to increase production from the site. It may be that only one factor currently limits production, so investment at that site might cause a significant increase in production. It may be that a site is in the wrong geographical location, so that increased production is costly to distribute given current demand. Each site has its own cost profile based on the equipment installed.

The different aspects of each site, such as equipment installed, geographical considerations, production history, and current equipment state, are managed by several different applications, some developed in-house and some based on a commonly used industry-specific platform. Most information was recorded in SQL databases, but further information, such as detailed site descriptions, existed in PDF documents (including diagrams).

The business process involves a review of each site on a regular basis. The original process involved gathering data from the existing applications and incorporating it into a report, together with relevant documents.

To improve this business process, changes were being made. Firstly, the review process was split into a general overview to select promising sites for investment, followed by detailed analysis. This reduces the amount of information that needs to be coordinated, because many of the text documents relate only to the detailed analysis. Secondly, integration within the enterprise was evolving around business objects, defined in XML.

The problem, therefore, is how to populate these business objects from a diverse set of data sources each of which was originally built in support of particular applications, not for general reuse.

3 Prototype

3.1 Design

The prototype consisted of three parts:

This approach is driven top-down by the requirements. It does not attempt to model the information in the original databases independently of how that information is used.

An alternative approach would have been to model the databases to give a richer interface at the RDF level. This would create a more reusable information resource, but at the cost of investing more time in modeling the databases and in capturing the meaning of the database schema.

3.2 Data Sources

The Semantic Web does not change these problems except to note that the provenance and origin of data in a data integration exercise is important.

Keys

The data sources identified common concepts, specifically the different sites, with a common identifier scheme.  This scheme had been extended over time, so that earlier applications used keys of a different length from those used by more recently created applications.  The key scheme changes had been designed so that old keys remained valid as identifiers when transformed to the new form.

Using these keys across systems was critical to the experiment. Here, the data systems did agree, because the linking was based on identifiers that mattered to the applications.  Such identifiers, once created, were not modified later, so when they were copied between systems they remained correct and accurate.
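The kind of length-extending key transformation described above can be sketched as follows. The concrete formats are invented for illustration (the paper does not give them); here we assume older applications used 4-digit site keys and the newer scheme uses 6-digit keys, with old keys zero-padded so they remain valid.

```python
def normalize_site_key(key: str) -> str:
    """Map an old-style short site key to the newer, longer form.

    Hypothetical scheme: 4-digit legacy keys are zero-padded to the
    6-digit current format, so legacy identifiers stay valid after
    transformation (the real formats differed only in length).
    """
    digits = key.strip()
    if not digits.isdigit():
        raise ValueError(f"not a site key: {key!r}")
    if len(digits) > 6:
        raise ValueError(f"key too long: {key!r}")
    return digits.zfill(6)
```

A transformation of this shape is what lets data linked under the old keys join cleanly with data recorded under the new ones.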

Data Maintenance

We found all the usual problems of inconsistent and divergent information in systems that had been built for different and independent purposes.

For example, two systems held geographical location data.  In one system (the database of company assets) it was a significant piece of information; in the other it was ancillary. We surmise that the data was obtained by different routes, because the location data was often close but different. These were known issues, and the organization took one database to be definitive.

Descriptions and display labels also differed in spelling and capitalization. This did not matter as long as a human was reading both. For integration purposes, we needed these labels as identifiers, so any choice was sufficient; in practice, choosing one source over another, where possible, gave more consistent business objects.

3.3 Rules

From conversations with the current maintainers of the systems, we developed a mapping from the business vocabulary to the concrete database vocabulary. SPARQL queries in the business vocabulary could then be automatically translated at query time into SQL queries on the databases.

Four classes of rule rewrite were used:

  1. Mapping a value key from one format to another
  2. Mapping a property of the business ontology to a definitive property in a database
  3. Linking data by join across databases by key
  4. Extracting information from multiple sources for a given value
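The second class of rule above, mapping a business property to a definitive database property, can be sketched as a simple rewrite over triple patterns. The vocabulary names (`biz:`, `assetdb:`) are invented for illustration; the paper does not give the real ones.

```python
# Hypothetical rewrite table: business vocabulary -> the definitive
# database property for that piece of information (rule class 2).
PROPERTY_RULES = {
    "biz:siteName":     "assetdb:site_label",
    "biz:siteLocation": "assetdb:site_geo",
}

def rewrite_bgp(bgp):
    """Rewrite each (subject, predicate, object) pattern from the
    business vocabulary into the concrete database vocabulary,
    leaving unmapped predicates untouched."""
    return [(s, PROPERTY_RULES.get(p, p), o) for (s, p, o) in bgp]
```

A real rule engine would also handle the other three classes (key reformatting, cross-database joins, and merging multiple sources), but the basic shape is the same: patterns in, patterns out.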

We built two rule translation systems. One was custom code; the other used the Jena rule engine. To use the rule engine, basic graph patterns of the SPARQL query were used as the input to the rule engine. Jena can handle variables in this context, so the data model is RDF extended with named variables. Both systems were of comparable complexity, but the rule-based one was easier to modify and maintain.

Developing the mapping proved to be quite simple in practice. While it was possible that the exact semantics of a database field could not be easily determined, the use of a requirements-driven process meant that the rules were "fit for purpose".

Although only 4 classes of rule were needed in this prototype, we do not believe that these classes would cover every integration scenario.

There was no need to split the SPARQL query at this point into different basic graph patterns for different databases, because that was handled by the database wrapper itself.

4 SquirrelRDF

The SQL-RDF adapter is part of SquirrelRDF.  SquirrelRDF provides access to relational databases, LDAP servers and IMAP servers. Like all SquirrelRDF adapters, it provides a SPARQL interface to a non-RDF store by extending the basic ARQ query engine. By intercepting query evaluation at key points, in particular the execution of basic graph patterns, arbitrarily complex queries can be supported. This approach ensures a full SPARQL implementation over the foreign data source.
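The extension point described above can be shown schematically: the query engine evaluates an algebra tree as usual, and the adapter overrides only the evaluation of basic graph patterns. This Python sketch mirrors the shape of that hook; it is not the real (Java) ARQ API, and the translator here is a placeholder.

```python
class QueryEngine:
    """Base engine: evaluating a BGP against a default store would
    happen here. An adapter replaces just this one step."""
    def evaluate_bgp(self, bgp, bindings):
        raise NotImplementedError

class SqlAdapterEngine(QueryEngine):
    """Adapter in the style described above: BGP evaluation is
    intercepted and delegated to a BGP-to-SQL translator, while the
    rest of SPARQL evaluation (OPTIONAL, UNION, FILTER, ...) is left
    to the surrounding engine."""
    def __init__(self, translator):
        self.translator = translator

    def evaluate_bgp(self, bgp, bindings):
        sql = self.translator(bgp)  # placeholder translation step
        return {"sql": sql, "bindings": bindings}

engine = SqlAdapterEngine(lambda bgp: f"-- SQL for {len(bgp)} pattern(s)")
result = engine.evaluate_bgp([("?s", "ex:people_name", "?n")], {})
```

Because only BGP evaluation is replaced, every SPARQL construct above the BGP level works unchanged, which is what gives full SPARQL coverage over the foreign store.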

The SQL adapter takes basic graph patterns (BGPs) and converts them to SQL queries. The conversion is based on Tim Berners-Lee's "Relational Databases on the Semantic Web" design issue from 1998 (RDF-RDB). The major features are:

For example:

  People
id  |  name
------------
 1  | Damian
 2  | Libby

Becomes:

ex:people;id=1 a ex:People ;
	ex:people_id 1 ;
	ex:people_name "Damian" .
ex:people;id=2 a ex:People ;
	ex:people_id 2 ;
	ex:people_name "Libby" .
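The direct mapping illustrated above can be sketched as a small function: one resource per row, with the row URI formed from the table name and primary key, a type triple for the table, and one property per column prefixed with the table name. The `ex:` namespace and plain-tuple triples are simplifications for illustration.

```python
def table_to_rdf(table, pk, rows, ns="ex:"):
    """Direct mapping in the style shown above: each row becomes a
    resource whose URI is built from the table name and primary-key
    value; each column becomes a table-prefixed property."""
    triples = []
    for row in rows:
        subj = f"{ns}{table.lower()};{pk}={row[pk]}"
        triples.append((subj, "rdf:type", f"{ns}{table}"))
        for col, val in row.items():
            triples.append((subj, f"{ns}{table.lower()}_{col}", val))
    return triples

people = [{"id": 1, "name": "Damian"}, {"id": 2, "name": "Libby"}]
triples = table_to_rdf("People", "id", people)
```

Running this on the People table reproduces the triples shown above (modulo literal typing, which a real mapper derives from the column types).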

SquirrelRDF uses JDBC's metadata facilities to configure itself, finding the tables, columns, and keys in the database. In practice, it is more useful to have a realised configuration file to consult, and to permit tweaks. This is automatically created using a command line tool.

The configuration file is an augmented RDF schema, with some additional statements declaring the classes to be mapped and database connection details. Using the example given above:

CREATE TABLE People (
	id int,
	name varchar(30),
	PRIMARY KEY (id)
);

ex:People
	a rdfs:Class ;
	db:primaryKey ex:people_id ;
	db:database <jdbc:mysql://localhost/conference> ;
	db:table "People" .

ex:people_id
	a rdf:Property ;
	rdfs:domain ex:People ;
	db:col "id" ;
	db:colType "int" .
	
ex:people_name
	a rdf:Property ;
	rdfs:domain ex:People ;
	db:col  "name" ;
	db:colType "varchar" .
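The configuration-generation step can be sketched with database metadata in place of JDBC's facilities. SquirrelRDF's tool uses JDBC metadata; here Python's built-in sqlite3 module and its `PRAGMA table_info` stand in, and the emitted text is a simplified fragment of the configuration format shown above (connection details and key declarations omitted).

```python
import sqlite3

def generate_config(conn, ns="ex:"):
    """Derive a mapping-configuration fragment from database
    metadata: one class per table, one property per column."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in tables.fetchall():
        lines.append(f"{ns}{table}")
        lines.append("\ta rdfs:Class ;")
        lines.append(f'\tdb:table "{table}" .')
        for cid, col, ctype, *_ in conn.execute(
                f"PRAGMA table_info({table})"):
            lines.append(f"{ns}{table.lower()}_{col}")
            lines.append("\ta rdf:Property ;")
            lines.append(f'\tdb:col "{col}" ;')
            lines.append(f'\tdb:colType "{ctype.lower()}" .')
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (id int PRIMARY KEY, name varchar(30))")
config = generate_config(conn)
```

Generating the file once and then editing it by hand gives the "realised configuration to consult, and to permit tweaks" that the text describes.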

Specifying primary keys is optional, but important if data needs to be round-tripped, since the row URI is formed from the primary key(s).

Some SPARQL patterns can't be supported:

We did not find either of these to be a limitation in the prototype.

Incoming BGPs are executed by first performing static analysis. Many queries can immediately be discounted by domain inference: if the same subject must have more than one type, then the pattern cannot be satisfied. Similarly (mutatis mutandis) for range analysis and datatyping. Finally, the BGP is split into further BGPs, grouped by relevant database, and each is translated into an SQL query. By this point the translation is straightforward.
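The domain-inference check above can be sketched directly: the configuration gives the domain (class) of each mapped property, so a BGP whose subject would need two distinct types is rejected without touching the database. The vocabulary names here are illustrative.

```python
# Hypothetical property-to-domain table, as would be read from the
# mapping configuration (rdfs:domain statements).
DOMAIN = {
    "ex:people_name": "ex:People",
    "ex:sites_name":  "ex:Sites",
}

def satisfiable(bgp):
    """Static analysis by domain inference: infer a class for each
    subject from the properties applied to it; two conflicting
    classes on one subject means the BGP can match nothing."""
    inferred = {}
    for s, p, o in bgp:
        cls = DOMAIN.get(p)
        if cls is None:
            continue  # unmapped property: no domain constraint
        if inferred.setdefault(s, cls) != cls:
            return False  # subject would need two distinct types
    return True
```

Range analysis and datatype checking follow the same pattern, applied to objects rather than subjects.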

5 Directions

In this prototype, we took a direct mapping of the original data into RDF, then accessed that mapping through rules that translated between the business model and the data model. Each component is a possible area for ongoing discussion in W3C and the Semantic Web community.

  1. A direct mapping of a database into RDF based on keys, tables and existing names.
  2. A language to express the mapping of one ontology into another (and back again).

Such a mapping approach is not specific to SQL databases and suggests that a useful, common technology in this area can be developed separately.

While RIF use case 8 (Vocabulary Mapping for Data Integration) covers the requirement for the rule interchange framework to be able to express this situation, there remains the opportunity for an actual language. The next step would be to scope the requirements.

6 Links

Jena -- A Semantic Web Framework for Java

ARQ -- a SPARQL query engine for Jena.

SquirrelRDF -- a SQL and LDAP access engine

Relational Databases on the Semantic Web Tim Berners-Lee, Sept 1998 revised April 2006.

D2RQ -- Chris Bizer, Richard Cyganiak, Jörg Garbers, Oliver Maresch