Enabling Semantic Access to Enterprise RDB Data

 

(Position Paper for W3C workshop on RDF Access to Relational Databases)

 

Jun Yuan, David H. Jones

{jun.yuan, david.h.jones}@boeing.com

 

Mathematics & Computing Technology

Boeing Phantom Works

P.O. Box 3707, M/C 7L-70

Seattle, Washington, 98124, U.S.A

 

Motivations & Benefits

 

Relational databases have long been an important part of enterprise IT infrastructure, mainly because of their proven efficiency in handling huge volumes of data. However, the advent of the Internet has brought many end users into the information cyberspace. These users usually have little, if any, training in query languages or database systems, and thus have difficulty using database languages such as SQL. A semantic query language such as SPARQL is much preferred, because it enables users to retrieve information based solely on their semantic understanding of the applicable domain.

 

Another drawback of relational systems is that database schemas change over time, even when the semantics of the data have not changed. Such non-semantic changes have many causes, including schema normalization or de-normalization, migration from one DBMS product to another, a change of data type for a particular field, or a change of technique, for example using a stored procedure instead of a traditional view. We have also observed that adding a new set of data or archiving existing data may result in schema changes.

 

For any software-based information consumer, all these non-semantic changes imply one thing: the existing pre-defined or pre-cached query statements have to be modified accordingly. Otherwise, an exception may be returned instead of the query answers. Given that a database usually has many software applications on top of it, modifying all of them correctly and in a timely fashion is challenging and expensive. The larger question, then, is this: is there an approach that can hide these non-semantic changes from software applications, so that an application need not be modified as long as there is no change in semantics?

 

With the ever-growing information sharing requirement in almost every enterprise, retrieved information (query answers) needs to be shared among many information consumers, not only humans but also computers and their software components. This implies that the semantics of query answers must be understandable to both humans and machines. Query answers from databases are usually represented as a flat table with multiple rows, each row having a number of fields. Suppose that you receive a table with two columns, one being an aircraft tail number and the other a part number. Would you be able to understand what the data really means without asking anyone?

 

Semantic access to existing RDB data holds the promise of bringing explicit semantics at several different levels and for different parties: the semantics of the domain, the semantics of the data content, the semantics of a user-defined query itself, and the semantics of the returned query answers. It not only gives information consumers a more convenient and user-friendly interface for retrieving information, but also offers a foundation for better system maintainability, better semantic interoperability among multiple data systems, and, hence, better data leverage.

 

Some Challenges

 

1.      Ontology

It is obvious to this audience that an ontology plays a very important role in semantic access to RDB data. Where and how shall we obtain this ontology? Is it difficult to derive? As a matter of fact, a semantic model is commonly used in database design. People are familiar with three-level database design, which starts with the conceptual design, proceeds to the logical design, and ends with the physical design. Each design phase has its own data schema: conceptual, logical, and physical. A conceptual schema, usually an Entity-Relationship diagram, is actually a kind of semantic model. It may therefore appear that every data system should have a conceptual schema that could serve as the basis of the ontology. Unfortunately, in practice there are two major reasons why this is not the case.

 

First, people usually start the database design with a conceptual model, but seem to forget about it entirely as the implementation goes on. In fact, it is common, even though incorrect, for database developers to update the physical model directly when a semantic change is required in a later phase of database design, without referring back to the conceptual model. The original conceptual model therefore quickly becomes out of date.

 

Second, the semantic model disappears or becomes embedded in the implementation as a normal part of the implementation process, due to such factors as schema normalization. Schema normalization is a very common practice in database design, and is mostly driven by functional dependencies. Without getting into the details, the main point here is that schema normalization is introduced mainly to guarantee the integrity of data in the database. While there is no doubt that data integrity is one of the most important aspects of databases, the end result is usually that the implementation and maintenance process makes the conceptual schema obsolete, even though data integrity is not an issue for select-only operations.

 

2.      Mappings

Because of the proven efficiency of an RDB query engine, it is both desirable and reasonable to push the semantic query evaluation down into the RDB query engine as much as possible. This often requires advanced query rewriting techniques to generate either high-level SQL query statements or low-level relational algebra expressions that are semantically equivalent (if possible) to the original semantic query. We argue that mappings are the key enablers that make such rewriting successful.

 

Mappings fall into two categories: one maps the semantic model (ontology) to the underlying data model, and the other maps the semantic query primitives to the relational query primitives. For the second category, instead of reinventing the wheel, we can draw on many existing results from previous deductive database research. For example, there has been extensive research on mapping first-order logic or description logic to the relational calculus. In order to push the semantic query evaluation down into an RDB query engine as much as possible, a generic mapping structure needs to be developed between ontologies and relational data models, and a theoretical foundation needs to be built between the semantic query formalisms and the relational calculus/algebra.
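
As an illustration only, the two kinds of mapping can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a proposed mapping structure: the property names (ex:hasTailNumber, ex:hasInstalledPart), the table and column names, and the helper rewrite are all hypothetical, and the query form is limited to a conjunction of simple patterns with no filters or optional parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyMapping:
    """Maps one ontology property to a table and its subject/object columns."""
    table: str
    subject_column: str
    object_column: str


# Hypothetical mapping from ontology properties to a relational schema.
MAPPINGS = {
    "ex:hasTailNumber": PropertyMapping("aircraft", "aircraft_id", "tail_no"),
    "ex:hasInstalledPart": PropertyMapping("installation", "aircraft_id", "part_no"),
}


def rewrite(patterns, select_vars):
    """Rewrite (subject_var, property, object_var) patterns into one SQL SELECT.

    A variable that occurs in several patterns becomes an equality join condition.
    """
    from_items = []
    conditions = []
    bound = {}  # variable -> first column expression bound to it
    for i, (subj, prop, obj) in enumerate(patterns):
        m = MAPPINGS[prop]
        alias = f"t{i}"
        from_items.append(f"{m.table} AS {alias}")
        for var, column in ((subj, m.subject_column), (obj, m.object_column)):
            expr = f"{alias}.{column}"
            if var in bound:
                conditions.append(f"{bound[var]} = {expr}")
            else:
                bound[var] = expr
    select_list = ", ".join(f"{bound[v]} AS {v.lstrip('?')}" for v in select_vars)
    sql = f"SELECT {select_list} FROM {', '.join(from_items)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Semantic query: which parts are installed on which tail numbers?
query = [
    ("?a", "ex:hasTailNumber", "?tail"),
    ("?a", "ex:hasInstalledPart", "?part"),
]
sql = rewrite(query, ["?tail", "?part"])
print(sql)
```

In this sketch a non-semantic schema change, such as renaming a column, would be absorbed by editing MAPPINGS alone, while the semantic query stays the same.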

 

3.      Result transformation

Query answer transformation, i.e. converting query answers from the RDB engine into an instantiation of the ontology, is another challenge. One of the most important questions is how to formulate the URI for each instance. Each database table may have a primary key defined, which in many cases is a good source for generating URIs. However, this is not always enough. A more challenging situation arises when accessing multiple databases that do not share the same primary key for the same real-world entity.
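
The URI question can be made concrete with a small Python sketch. Everything named here is an assumption made for illustration: the base namespace, the class names, the database names, and the cross-reference table CANONICAL_KEYS, which stands in for whatever entity-matching information an enterprise actually has.

```python
from urllib.parse import quote

BASE = "http://example.org/resource"  # hypothetical namespace


def mint_uri(class_name, key_values):
    """Build a URI from an ontology class name and a row's ordered primary-key values."""
    # Percent-encode every key value so that characters such as "/" or spaces
    # cannot alter the structure of the URI.
    key_path = "/".join(quote(str(v), safe="") for v in key_values)
    return f"{BASE}/{class_name}/{key_path}"


# Hypothetical cross-reference: (database, local primary key) -> shared identifier.
CANONICAL_KEYS = {
    ("maintenance_db", "AC-17"): "N12345",
    ("engineering_db", "4711"): "N12345",
}


def mint_shared_uri(class_name, database, local_key):
    """Mint one URI per real-world entity when databases use different keys."""
    canonical = CANONICAL_KEYS.get((database, local_key))
    if canonical is None:
        # No known match: fall back to a database-scoped URI.
        return mint_uri(class_name, [database, local_key])
    return mint_uri(class_name, [canonical])


print(mint_uri("Aircraft", ["N12345"]))
print(mint_shared_uri("Aircraft", "maintenance_db", "AC-17"))
print(mint_shared_uri("Aircraft", "engineering_db", "4711"))
```

The single-database case needs only the primary key. The multi-database case depends entirely on the cross-reference, which is the hard part and is simply assumed to exist in this sketch.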

 

Another big challenge in result transformation is how to preserve intermediate query results efficiently. Traditional RDB engines usually discard intermediate data during query evaluation. To recover the semantics of query answers, however, the system must return not only the answers themselves but also much of the intermediate data. These intermediate data show exactly how the data elements in the final query answers are semantically related, and they need to be preserved throughout query execution.
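
A small Python sketch, using the standard sqlite3 module, illustrates one simple way such intermediate data could be carried along: the join key is kept in the select list even though the user asked only for tail numbers and part numbers. The schema, the sample rows, and the ex: property names are hypothetical, and this is only one possible reading of the problem, not a general solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aircraft (aircraft_id INTEGER PRIMARY KEY, tail_no TEXT);
    CREATE TABLE installation (aircraft_id INTEGER, part_no TEXT);
    INSERT INTO aircraft VALUES (1, 'N12345');
    INSERT INTO installation VALUES (1, 'P-100');
""")

# The join key aircraft_id is intermediate data: a plain SQL answer would
# project it away, but it is what ties the two output columns together.
rows = conn.execute("""
    SELECT a.aircraft_id, a.tail_no, i.part_no
    FROM aircraft AS a
    JOIN installation AS i ON a.aircraft_id = i.aircraft_id
    ORDER BY a.aircraft_id, i.part_no
""").fetchall()


def rows_to_triples(rows):
    """Turn result rows into triples, using the preserved join key as the subject."""
    triples = []
    for aircraft_id, tail_no, part_no in rows:
        subject = f"ex:aircraft/{aircraft_id}"
        for triple in (
            (subject, "ex:hasTailNumber", tail_no),
            (subject, "ex:hasInstalledPart", part_no),
        ):
            if triple not in triples:  # skip duplicates across rows
                triples.append(triple)
    return triples


triples = rows_to_triples(rows)
for t in triples:
    print(t)
```

Without the preserved key, the answer would be a bare pair of values with nothing that states how the tail number and the part number are related.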

 

4.      Performance

Performance is key! The challenge is to keep semantic query performance comparable to that of SQL. It has two aspects: one is how to push down query execution as much as possible, and the other is how to efficiently transform query answers back into an ontological format.