Declaring RDF Views of SQL Data

Orri Erling (Program Manager, OpenLink Virtuoso)
oerling@openlinksw.com

OpenLink Software
10 Burlington Mall Road Suite 265
Burlington, MA 01803 U.S.A.
+3130 2733199

Introduction

OpenLink is a developer of data access middleware, from data access drivers to federated databases. In addition to presenting heterogeneous RDBMSs as a single, consistent, SQL-queryable data universe, OpenLink Virtuoso is also a full-function RDBMS and RDF quad store. As part of its combined SQL/RDF functionality, Virtuoso maps local and remote relational data into RDF for querying with SPARQL.

OpenLink sees the emergence of RDF into the mainstream of information technology as the concluding chapter of the enterprise information integration saga. As enterprise information is stored in a diversity of relational repositories, mapping these to RDF for unified ad hoc access becomes a key enabler of the agile enterprise. The emergence of ontologies for domains such as financial reporting and CRM is a key step towards having information where it is needed, within the enterprise as well as between partners. Also, serving structured information to consumers may become a factor in e-commerce, by exposing anything from inventories to transportation schedules for use by intelligent agents. All this requires mapping business assets, from OLTP systems to data warehouses, into RDF while dealing with a broad spectrum of schema heterogeneity, security, scalability, and other issues.

While RDF opens a world of new possibilities by potentially making any data joinable with any other, this freedom poses new challenges to query processing, especially when merging heterogeneous, largely disjoint data sources. In the Present Functionality section we discuss how we have approached these issues.

In the Future Directions and Goals section we discuss gains to be achieved through standardization and vendor cooperation.

Present Functionality

Virtuoso RDF Views allow mapping arbitrary collections of relational tables, views, procedures, or web services into SPARQL-accessible RDF. The RDF data is constructed on demand by evaluating SQL queries and stored procedures generated on the fly as part of a SPARQL query-processing pipeline.

Mapping

The mapping consists of a set of quad patterns, each of which maps a (graph, subject, predicate, object) quad into expressions involving columns, sets of columns, and/or constants for each part of the quad. Multi-part keys and arbitrary mapping of key columns to IRIs via format strings and/or conversion functions are supported. Given the IRI mapping patterns, the system can automatically infer which quad patterns may potentially join with which other ones.
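As a rough illustration of the idea, a quad pattern can be modeled as four expressions evaluated against each relational row. The sketch below is not Virtuoso's declaration syntax; the class and helper names (QuadPattern, iri, column, constant), the order_line table, and all example.com IRIs are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Dict[str, object]
Quad = Tuple[object, object, object, object]
Part = Callable[[Row], object]


def constant(value: object) -> Part:
    """A quad part that is the same for every row."""
    return lambda row: value


def column(name: str) -> Part:
    """A quad part taken directly from one column (a literal value)."""
    return lambda row: row[name]


def iri(fmt: str, *key_columns: str) -> Part:
    """A quad part built from one or more key columns via a format string."""
    return lambda row: fmt.format(*(row[c] for c in key_columns))


@dataclass(frozen=True)
class QuadPattern:
    graph: Part
    subject: Part
    predicate: Part
    obj: Part

    def apply(self, row: Row) -> Quad:
        return (self.graph(row), self.subject(row),
                self.predicate(row), self.obj(row))


def quads(rows: Iterable[Row], patterns: Iterable[QuadPattern]) -> Iterator[Quad]:
    """Evaluate every pattern against every row, yielding one quad each."""
    patterns = list(patterns)
    for row in rows:
        for pattern in patterns:
            yield pattern.apply(row)


# Hypothetical order_line table with a two-part key (order_id, line_no).
GRAPH = constant("http://example.com/sales")
LINE = iri("http://example.com/order/{}/line/{}", "order_id", "line_no")
ORDER_LINE_PATTERNS = [
    QuadPattern(GRAPH, LINE,
                constant("http://example.com/schema#quantity"), column("qty")),
    QuadPattern(GRAPH, LINE,
                constant("http://example.com/schema#partOf"),
                iri("http://example.com/order/{}", "order_id")),
]
```

Calling quads(rows, ORDER_LINE_PATTERNS) on a list of row dictionaries yields two quads per row, with the subject IRI assembled from both key columns.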

SPARQL queries where the predicate is not constant are difficult to map to SQL. They easily result in large unions, where every possible predicate is retrieved and a table may be scanned multiple times. To deal with such cases, Virtuoso introduces a SQL construct that translates a multi-column result set into a narrower multi-row set. Consider:
SELECT ?person ?p ?v WHERE { ?person a person . ?person ?p ?v };
This is conceptually equivalent to:
SELECT * FROM person;
except that the result presents multiple rows per person instead of multiple columns (i.e. wide person rows).
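The column-to-row translation can be sketched outside the database as follows. This is only an illustration of the effect, not the Virtuoso SQL construct itself; the person table layout, the predicate IRIs chosen for each column, and the function names are assumptions. The point is that the table is scanned once and each wide row is fanned out into one narrow row per non-NULL column, rather than issuing one query per predicate.

```python
import sqlite3

# Hypothetical column-to-predicate assignments for a person table.
SUBJECT_FMT = "http://example.com/person/{}"
PREDICATES = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "email": "http://xmlns.com/foaf/0.1/mbox",
}


def demo_connection() -> sqlite3.Connection:
    """Build a small in-memory person table for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO person (id, name, email) VALUES (?, ?, ?)",
        [(1, "Alice", "alice@example.com"), (2, "Bob", None)],
    )
    return conn


def person_triples(conn: sqlite3.Connection):
    """Scan person once; yield (subject, predicate, value) per mapped column."""
    cursor = conn.execute("SELECT id, name, email FROM person ORDER BY id")
    names = [d[0] for d in cursor.description]
    for values in cursor:
        row = dict(zip(names, values))
        subject = SUBJECT_FMT.format(row["id"])
        for col, predicate in PREDICATES.items():
            if row[col] is not None:  # a NULL column yields no triple
                yield (subject, predicate, row[col])
```

Iterating over person_triples(demo_connection()) gives the bindings for ?person, ?p and ?v in the query above, one row per person and column.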

As the graph IRI is a part of a quad map pattern, it may also need to be correlated with a relational column. This allows, for example, presenting a graph per person, per product family, or per any other division that makes sense in the relational schema. Thus, a single graph does not always correspond one to one to a set of quad patterns and relational tables; rather, one mapping may be perceived as a set of multiple graphs.
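A minimal sketch of a graph correlated with a column, assuming a person table with id and name columns and hypothetical example.com IRI formats: the same row supplies both the graph IRI and the subject IRI, so one mapping yields one graph per person.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical IRI formats; both are driven by the same key column.
GRAPH_FMT = "http://example.com/graph/person/{}"
SUBJECT_FMT = "http://example.com/person/{}"
NAME = "http://xmlns.com/foaf/0.1/name"


def graphs_per_person(rows: Iterable[dict]) -> Dict[str, List[Triple]]:
    """Group the generated triples by a graph IRI computed from each row."""
    graphs: Dict[str, List[Triple]] = defaultdict(list)
    for row in rows:
        graph = GRAPH_FMT.format(row["id"])
        subject = SUBJECT_FMT.format(row["id"])
        graphs[graph].append((subject, NAME, row["name"]))
    return dict(graphs)
```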

Beyond Relational

Not all data of interest in Enterprise Information & Data Integration (EII / EDI) is accessible as a relational table that can be linked into a federated database. For example, most enterprise applications have an API that offers canned reports through a web services interface. Virtuoso allows representing these as SQL tables that can also be mapped into RDF for transparent query access via SPARQL. This is done by declaring a procedure view, also called a table-valued function, which accesses a web service and presents its data for use in a SQL query. A SQL cost model interface allows the query optimizer to make join order choices for these. Thus Virtuoso is not limited to presenting relational data to RDF data consumers but can act as a front end to any collection of web services, semi-structured data, or text data.
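The idea of a procedure view can be sketched as a function that turns a service response into rows, which can then be joined like any other table. This is an analogy only, not Virtuoso's procedure view syntax; the report format, the fake_fetch stand-in for the web service call, and all names are assumptions for the example.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def sales_report(fetch: Callable[[], str]) -> Iterator[Tuple[int, float]]:
    """Present a canned JSON report as rows of (customer_id, total)."""
    for item in json.loads(fetch()):
        yield (int(item["customer"]), float(item["total"]))


def join_customers(customers: Dict[int, str],
                   report: Iterable[Tuple[int, float]]) -> List[Tuple[str, float]]:
    """Join report rows to a local customer table on the customer id."""
    return [(customers[cid], total) for cid, total in report if cid in customers]


def fake_fetch() -> str:
    """Stand-in for the web service call; returns a fixed response."""
    return '[{"customer": 1, "total": "120.50"}, {"customer": 3, "total": "9.00"}]'
```

Here join_customers({1: "Acme", 2: "Globex"}, sales_report(fake_fetch)) joins the service rows to local data, and the resulting rows could be mapped to RDF in the same way as rows from an ordinary table.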

Application Experience

The more complex applications of Virtuoso RDF Views are the OpenLink Data Spaces (ODS) collaborative applications suite, the Musicbrainz.org database, and the OpenLink CRM package. These represent a total of several hundred tables and some thousands of quad patterns. We have found that the choice of key-to-IRI mapping is crucial to limiting the set of possible joins, which in turn is essential for manageable response times. For example, when mapping ODS to the SIOC ontology, any application, from photo album to weblog to mail to news reader, can provide a post. If we then ask for all posts, we get a union of all the eligible tables across all applications. If we then join users to posts to topics, we can end up joining all post tables to all tables containing tags for posts, which is a lot of useless joining. Judicious choice of key-to-IRI mappings allows the system to automatically prune out combinations such as looking for a blog post's tag in a photo album's tags table.
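One conservative way to picture this pruning, as a sketch rather than a description of Virtuoso's actual inference: if two IRI format strings begin with literal prefixes such that neither is a prefix of the other, no IRI can match both, so the corresponding tables never need to be joined. The table names and IRI formats below are hypothetical.

```python
from typing import Dict, List, Tuple


def literal_prefix(fmt: str) -> str:
    """Return the constant text before the first placeholder of a format."""
    return fmt.split("{", 1)[0]


def may_join(fmt_a: str, fmt_b: str) -> bool:
    """True unless the two formats provably produce disjoint IRI sets."""
    a, b = literal_prefix(fmt_a), literal_prefix(fmt_b)
    return a.startswith(b) or b.startswith(a)


def join_candidates(left: Dict[str, str],
                    right: Dict[str, str]) -> List[Tuple[str, str]]:
    """List the table pairs whose IRI formats do not rule out a join."""
    return [(lt, rt)
            for lt, lf in left.items()
            for rt, rf in right.items()
            if may_join(lf, rf)]


# Hypothetical per-application IRI formats for posts and for tagged posts.
POST_TABLES = {
    "blog_post": "http://example.com/blog/post/{}",
    "photo": "http://example.com/photo/image/{}",
}
TAG_TABLES = {
    "blog_tag": "http://example.com/blog/post/{}",
    "photo_tag": "http://example.com/photo/image/{}",
}
```

With these formats, join_candidates(POST_TABLES, TAG_TABLES) keeps only the blog-to-blog and photo-to-photo pairs. Had every application used one shared format such as http://example.com/post/{}, all four combinations would remain.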

Making a mapping that works well and runs efficiently is not trivial. While some parts are easy to generate, even automatically, this is seldom sufficient, specifically when the target ontology is given, as it most often is; after all, the value proposition of the semantic web is to reuse existing vocabularies. More design-time automation is needed.

Requirements on SPARQL

For RDF to deliver on its promise as the lingua franca of all data integration, it will need extensions on the query side, specifically aggregates, grouping and text search. There is hardly any business intelligence query without aggregation and grouping, yet SPARQL itself does not standardize this even if implementations may offer this as an extension. Also, for any search, text indexing is essential but this too is an extension, not a part of the language. In the context of mapping, these are not technical problems since the underlying systems do offer these capabilities. However, the lack of standardization on the SPARQL side will cause implementations to lack interoperability, potentially hindering uptake. Thus, specifically when talking about reporting, even a de facto consensus on these matters would be very desirable.

Future Directions and Goals

Defining RDF views for moderately complex schemas of over 100 tables can be quite complex. Especially given the lack of training in the corporate IS world, new tools are needed for interactive design and testing. These tools must combine the features of a schema designer with those of a report writer, all adapted to RDF.

To foster the creation of such tools, convergence at the back end is desirable. Defining a core of shared mapping concepts and codifying this as an RDF ontology would be a first step. The core mappings should be powerful enough to translate given relational schemas into given RDF vocabularies, including creating and collapsing structures of intermediate blank nodes and mapping, assembling, and disassembling multi-part keys, yet should not be too complex to implement. This is a difficult requirement.

The take-off of the relational database was tied to the widespread appearance of query builders. The same will apply to RDF in the form of query builders, specifically ones that deal with mapping the existing world of all relational data into commonly accepted ontologies. This requires industry cooperation and consensus and we see the present workshop as an important first step in this direction.