Requirements for Relational to RDF Mapping
Attn: W3C RDB2RDF XG
By: Orri Erling, OpenLink Software
Revision 2, September 18, 2008
Motivation
The RDB2RDF XG has reviewed literature and practices in the realm of relational data to RDF mapping. The need for the mapping is evident, as a large part of the world's data is stored in diverse RDBMS with highly heterogeneous schema, terminology, identifiers, and granularity of representation. RDF and ontologies, on the other hand, promise to offer a framework for integrating all this data.
Most such mapping to date has been done on an ETL (Extract Transform Load) basis. Unsurprisingly, interviews with end user organizations, both during the RDB2RDF XG and previously, show that there is benefit in on-demand translation from relational data to an RDF representation. RDF offers a systematic ontology and a query language which can be used against the mapped data, without concern for the semantic heterogeneity inherent in independently arisen relational databases. Among other benefits, on-demand mapping removes the burden of maintaining yet another data warehouse. Further, for very large databases, materializing an RDF translation of the content would make prohibitive demands on infrastructure.
Consequently, the RDB2RDF XG is contemplating recommending that a working group be formed to develop a standard for describing relational-to-RDF mappings. This document summarizes the requirements for such a language.
Current State
Presently, relational-to-RDF mapping is done with two main goals:
Bootstrapping a data web. There are lightweight solutions for exposing typically small to mid-size relational stores as RDF. There are RDF-izers for translating web service or document web output of large Web 2.0 sites into RDF.
Integration. End user organizations dealing with complex data integration issues have discovered RDF as a means of unifying the data from different sources and/or making its meaning explicit. Queries that would be very hard to express against a combination of relational stores and unstructured data become expressible if all the content is mapped into RDF.
These use cases share some basic requirements but their ultimate implications are very different. We believe that a single working group can serve both needs if the needs are made explicit.
Definitions
In the following:
ETL means physically storing triples produced from relational data in an RDF store.
On-demand mapping or mapping means translating a SPARQL query into one or more SQL queries at query-time, evaluating these against a set of unmodified relational databases, and constructing a SPARQL result set from the result set(s) of such SQL queries.
Mapping language means the deliverable of the working group, a language for specifying ETL and on-demand mappings.
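As a minimal illustration of the distinction (the prefix, schema, and IRI pattern below are hypothetical), consider the SPARQL query

    PREFIX ex: <http://example.com/>
    SELECT ?name WHERE { ?c a ex:Customer . ?c ex:name ?name . }

Under ETL, triples such as { <http://example.com/customer/42> a ex:Customer ; ex:name "Acme" } are materialized in advance and the query runs against the RDF store. Under on-demand mapping, the query is translated at query time into something like

    SELECT c.name FROM customers c

with the bindings of ?c composed on the fly from customers.customer_id, and no triple is ever stored.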
Intended Use Cases
We expect the typical starting situation to involve a pre-existing RDB with its own schema. Applications for maintaining it will exist and will typically not involve RDF. We expect the RDF ontology that is the target of the mapping to most often exist before the mapping is undertaken. In some cases the target ontology will evolve during the mapping project, but most often it will not.
Update Sensitivity
The time sensitivity and volume of the data to be mapped will vary significantly from application to application. Just as with business intelligence, it is sometimes preferable to go to the OLTP side for the information and sometimes to build a warehouse that is periodically updated from the online systems.
Both scenarios must be kept in mind when designing a mapping language.
Relative Desirability of Mapping and ETL
We expect cases favoring on-demand mapping to be characterized by any of:
- High rate of change of the data
- Very large volume of data
- Relatively straightforward translation between RDF and the data
- Relatively few RDBs being integrated.
We expect cases favoring ETL to be characterized by:
- Large number of heterogeneous sources of data
- Complex application logic needed for transforming the data
- RDF reasoning being performed on the mapped data
- Queries with variables in class or predicate positions
Desiderata
This section discusses requirements from the ETL, on-demand, and mixed ETL/on-demand points of view.
General
The criteria below apply to all cases of RDB to RDF mapping, whether ETL or on-demand:
- Representation neutrality - The mapping should be declarative; one mapping should suffice for both ETL and on-demand query translation.
- Simplicity - It should be possible to generate a trivial one-to-one mapping of a relational database from the SQL data dictionary alone, using information such as primary and foreign key constraints (see the sketch after this list).
- Readability - A mapping should be human-readable and allow simple edits such as name substitution without extensive understanding of the databases and ontologies involved.
- Machine processable representation - For purposes of automatic generation, discovery, and maintenance of mappings, the mapping language should have a standardized machine representation, much as XQuery has the XQueryX XML syntax and SQL has the Information Schema views.
- Smooth learning curve - We recognize that trivial cases of mapping are almost self-evident and that complex cases can be very intricate. We expect users to start with trivial mappings and to discover more complex needs as they proceed. If the mapping language is to be divided into levels, then the simpler language should be a subset of the more complex one. For example, simple cases could be expressed as shorthands of a more powerful representation. Having said this, a trivial mapping should look self-evident to anyone with knowledge of the relational schema and ontology involved.
- Comprehensiveness - The mapping must not be bound to a class-per-table and property-per-column model. The mapping must support multiple graphs. The components of any mapped quad, including graph, subject, predicate, and object, must be able to come from different SQL objects, joined by arbitrary SQL conditions. Since the SQL schema will usually be a given, the mapping language will at the minimum need to work with any legitimate SQL-92 schema, possibly enhanced with vendor-specific data types.
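As an illustration of the simplicity requirement, consider a generator that reads nothing but the data dictionary. For a hypothetical table

    CREATE TABLE customers (
      customer_id  INTEGER PRIMARY KEY,
      name         VARCHAR (100),
      country_id   INTEGER REFERENCES countries (country_id)
    );

a trivial one-to-one mapping would declare a class ex:customers with subject IRIs minted from the primary key, e.g. <http://example.com/customers/%d>, a literal-valued property ex:customers_name for the name column, and an IRI-valued property ex:customers_country_id whose objects are the countries IRIs implied by the foreign key. Anyone who knows the schema should be able to read such a mapping at a glance.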
Mapping on Demand
The criteria below apply to scenarios where the mapping is performed on demand.
- SQL Portability and Extensibility - The SQL generated from a mapping, whether for ETL or on-demand mapping, should be standards compliant. We recognize, however, that many of the most relevant use cases rely on vendor specific SQL, whether for XML storage, full text, user defined data types, GIS extensions, or other features. Therefore provisions should be made for exposing such features via mapping. This is especially so when translating queries for real-time mapping.
- Composability - We expect most mapping use scenarios to involve integration between multiple pre-existing relational databases. While this does not per se increase the complexity of ETL, it should be taken into account when specifying real-time mappings. Specifically, the mapping should allow specification of which URIs may come from which tables/databases, so as to minimize the use of automatically generated UNIONs and JOINs of UNIONs at run time. Insofar as possible, combining mappings should not result in a combinatorial explosion of across-database joins, except when this is specifically desired for the application. This is in part the responsibility of the information architect, but any mapping system should have features supporting this goal.
- SPARQL fidelity - The use of mapping should not reduce the expressivity of SPARQL when used against mapped data. This means, for example, that variables in the predicate position should be permitted, DESCRIBE and CONSTRUCT operations should remain possible, and so forth (see the sketch after this list). This is not an issue with ETL, but needs to be considered when dealing with on-demand mapping.
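As an example of what fidelity demands (the IRI is hypothetical), a mapped endpoint should answer

    SELECT ?p ?o WHERE { <http://example.com/customer/42> ?p ?o . }

even though the variable in the predicate position means that, in the worst case, the generated SQL must union one SELECT per mapped column of every table whose rows can produce IRIs of this pattern. The key-pattern declarations discussed below are what keep such unions from growing without bound.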
The Union Bomb
We expect use cases to involve mapping multiple RDBs into one ontology. For example, in the event of an acquisition, there may be two CRM systems that have to be queried side by side. In order to produce a sales report, one might make a query joining salespeople, customers, orders, and order lines. Naively done, this would be a join of the union of both salespeople tables, of both customer tables, and so forth.
In such a situation, the mapping must understand that an order in CRM1 can only refer to a customer and salesperson in CRM1. Hence the result of the mapping must be two SQL queries, one to each CRM, with only the final aggregation step performed by the mapping middleware.
We have seen that when many tables may be translated into RDF objects of a given class, we are liable to get SQL queries with large unions. For example, the pattern { ?s a post . ?s has_author <person1> . ?s ?p ?o }, when applied to the OpenLink Data Spaces application, can produce thousands of lines of SQL. There are many types of posts, and these may have many different sets of properties, all retrieved from different tables of different applications.
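A sketch of the desired pruning, with hypothetical schemas: for a sales report joining orders to salespeople across CRM1 and CRM2, the naive translation is a join of unions,

    SELECT s.name, SUM (o.total)
    FROM (SELECT * FROM crm1.orders UNION ALL SELECT * FROM crm2.orders) o
    JOIN (SELECT * FROM crm1.salespeople UNION ALL SELECT * FROM crm2.salespeople) s
      ON o.salesperson_id = s.salesperson_id
    GROUP BY s.name;

whereas a mapping that knows order and salesperson IRIs are minted per CRM can emit one query per source,

    SELECT s.name, SUM (o.total)
    FROM crm1.orders o JOIN crm1.salespeople s ON o.salesperson_id = s.salesperson_id
    GROUP BY s.name;

likewise against CRM2, with the middleware merging and re-aggregating the two partial results.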
A key can be translated into a URI with a pattern with placeholders for the key parts. For example, Virtuoso uses a syntax like the C printf format string. If two patterns are declared to be "customer/%d" and "order/%d", it is obvious that these will never produce equal values. Hence a key that is mapped as a customer IRI will never be used to refer to a foreign/primary key that is mapped as an order IRI. In the underlying RDB both may be single integers, and nothing there says that a join could not be made between the two.
Mapping keys to strings in this manner makes it possible to eliminate joins that make no sense. Any language for on-demand mapping must have a feature from which disjointness of keys may be inferred.
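A minimal sketch of such a declaration, in a hypothetical printf-style syntax:

    iri_pattern ex:Customer <http://example.com/customer/%d> (crm.customers.customer_id) ;
    iri_pattern ex:Order    <http://example.com/order/%d>    (crm.orders.order_id) ;

From the format strings alone, a query compiler can prove that no customer IRI ever equals an order IRI, and so drop every candidate join between the two at compile time, even though both keys are plain integers in the RDB.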
Reversibility of Identifier Mapping
There will be cases where different identifiers in different databases refer to the same thing. For example, consider making a sales report by region combining CRM1 and CRM2. The region keys must be translated before the aggregation. This is a one-way translation: the local region IDs are translated to a common set of IDs and the aggregation is done.
Then consider an alarm for a low level of inventory for parts needed for product x. Suppose there are many inventory systems being integrated. Does one begin by looking for parts with a low level of inventory or for parts needed for product x? The answer will depend on the statistics of the data. In one case, one must translate the part ID in the products database to its equivalent in the different inventory systems. In the other case, the translation is reversed. A mapping system must know whether a mapping between identifiers is reversible and must be able to take this into account in query plans.
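A hedged sketch of the one-way case, with a hypothetical translation table region_map: the mapping can fold the identifier translation into the SQL before aggregating,

    SELECT m.common_region_id, SUM (o.total)
    FROM crm1.orders o JOIN region_map m ON m.crm1_region_id = o.region_id
    GROUP BY m.common_region_id;

and this works even if region_map is many-to-one and thus has no inverse. The inventory alarm, by contrast, requires the part-ID translation to be declared bijective, so that the optimizer may apply it in whichever direction the statistics favor.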
Combining On Demand Mapped RDB With Stored Triples
Following is a discussion of specific issues encountered when combining physical triples with relational data mapped to RDF on demand.
For example, there is a relational CRM system and there is an RDF store with emails. We wish to find large orders with complaints received during the first 10 days after delivery. The emails have been processed in order to extract named entities, and these are stored as triples along with the text and headers. Logically, a URI in a triple will correspond to a key in a relational table. Physically, the system must be able to decide whether to join from the email triples to the customers in the CRM or the other way around.
One way of doing this is to have function indexes on the triples, as well as on any column or set of columns used in composing a URI. A more efficient way is to have a query optimizer which knows that the mapping between the internal ID of an IRI in the triple store and the set of relational column values to which the IRI refers is a bijective function. Any mapping language must have a means of specifying such characteristics of key-to-IRI mappings.
Consider { ?complaint a email . ?complaint <from> ?customer . ?customer <has_service_contract> ?ct . }
The system must be able to infer from the mapping declarations that the service contract is to be looked up in the CRM RDB. Hence the sender IRI must be transparently translated into the customer key in the CRM. If the from IRI of the email does not match this pattern, the join will simply fail.
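A hedged sketch of a possible translation, with hypothetical table and function names: if physical triples live in a table rdf_quad (g, s, p, o), iri_to_id turns an IRI into its internal ID, and customer_id_of_iri is the declared inverse of the customer key-to-IRI mapping, the pattern could compile to

    SELECT e.s, ct.contract_id
    FROM rdf_quad e
    JOIN crm.customers c ON c.customer_id = customer_id_of_iri (e.o)
    JOIN crm.service_contracts ct ON ct.customer_id = c.customer_id
    WHERE e.p = iri_to_id ('http://example.com/from');

The declared bijection between customer IRIs and crm.customers.customer_id is what lets the optimizer choose this direction, or the reverse one starting from the CRM, depending on the statistics.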
Use Case Driven Process
The development of a mapping language needs to be use case driven. As initial, relatively easy use cases, we may use benchmarks developed in the SQL and RDF worlds. Specifically, TPC-H from the Transaction Processing Performance Council can offer a typical decision support workload, and the Berlin SPARQL Benchmark from Chris Bizer et al can offer an OLTP-style, read-only workload.
For both benchmarks, a mapping implementation already exists from OpenLink Software.
We expect the harder challenges to come from real-life data-integration use cases. Accordingly, we invite the participation of end user organizations. A mapping working group could for example take the GIS database of the Ordnance Survey of Great Britain and a suitable set of biomedical databases as use cases, and test any developed mapping language against real queries found in these environments. This is not unreasonable since mappings of benchmark data sets are already part of pre-existing work.
Relation To Other Work
We expect many of the RDB to RDF mapping use cases to involve analytics or discovery. Such workloads can hardly be conceived of without aggregation, grouping, and nesting of queries in the query language. Fortunately, there are SPARQL implementations, notably HP's Jena and OpenLink Virtuoso, which offer such features, and a SPARQL 2 working group standardizing such extensions would almost certainly run concurrently with a mapping working group.
This has no necessary implications for the mapping language per se, but it is an important part of the environment in which the work will proceed.
Implications for Implementations
From the above it follows that an implementation of an on-demand mapping language must have substantially the capabilities of a SQL optimizer, plus logic that is specific to mapping. A SQL optimizer, when it sees a join, knows exactly what is to be joined. A mapping system, on the other hand, must determine which tables could match the pattern based on what is known of the format of the IRIs involved, the classes of the IRIs that may be bound to the query variables, and so forth.
Experience in implementing such functionality suggests that this is at least as complex as a good SQL implementation and a notch harder than a basic SPARQL implementation over a physical triple store. We expect the complexity inherent in the matter to limit the number of implementations. While the basic ideas are not difficult as such, a large number of optimizations are required for practical applications. Also, the coupling to the target SQL implementations needs to be fairly close, extending to optimizer statistics and the like.
Criteria of Success
At the end of the term of a mapping working group, there should exist at least two interoperable implementations of the mapping language providing at least ETL. Aside from this, implementors are encouraged to support on-demand mapping.
The mapping language should accommodate the TPC-H, Berlin SPARQL Benchmark, and chosen use case SQL schemas. It should be possible to make a mapping with no loss of information. Given multiple instances of the relational schemas, it should be easy to merge them all into one RDF mapping without loss of either semantics or performance.
It should be possible to run all the benchmark queries and the typical lookup and analytics queries pertinent to the use cases against a mapped database in both ETL and on-demand mapped variants on at least one implementation. In the event that there is no SPARQL standard for aggregation, grouping, etc., doing this on mapped data via any supporting SPARQL extension will be sufficient.