UpdatingRelationalDataViaSPARUL

From W3C Wiki
Jump to: navigation, search

Updating Relational Data Via SPARUL (Updatable RDF Views)

Ivan Mikhailov, OpenLink Software

imikhailov at openlinksw.com , iv_an_ru at #swig channel of irc.freenode.net

Objective

  • We can retrieve relational data by SPARQL requests.
  • We could edit relational data by SPARUL commands.
  • We could even edit a mix of relational and RDF data by one SPARUL command.

Definitions

This proposal does not describe SPARQL and SPARUL languages, please refer to original specs if needed. Being one of developers of OpenLink Virtuoso team, I'll describe technical details in terms of Virtuoso SPARQL extensions. One of good starting points is Carl Blakeley's Mapping Relational Data to RDF with Virtuoso's RDF Views, more details can be found in Virtuoso User's Guide

Consider a very traditional Northwind database: "customers" make "orders", orders consists of "order lines" that mention different "products" of different "categories", "products" are shipped by "shippers" from "suppliers". We can map all data from this database into single RDF graph with very straightforward RDF View.

  • Every database row will be identified by an IRI that consists of table name and values of primary key fields, such as 'http://e5e.com/orderlines?order=12&product=45'. The exact format does not matter for the proposed functionality, what's important is that resulting IRIs are distinguishable by their syntax (easy to say that "this IRI may correspond to an order line" and "this IRI can not correspond to a product") and that they defines a bijection ("this IRI is made from SQL values 12 and 45").
  • The value of any foreign key field is an IRI of the referred row.
  • The value of any other field is used unchanged as an RDF literal.
  • Field names become names of predicates.
  • Table names are used in related class IRIs.

For every field of the schema (including individual fields of compound primary keys) we produce a rule (called "quad map pattern") that form triples { <pk-iri> <field-name-iri> field-value } , e.g.,

<http://e5e.com/orderlines?order=12&product=45> quantity 4 .

<http://e5e.com/orderlines?order=12&product=45> order <http://e5e.com/orders?order=12> .

For every table of the schema we produce a quad map pattern that form triples { <pk-iri> rdf:type <table-iri> } , e.g.,

<http://e5e.com/orderlines?order=12&product=45> rdf:type <http://e5e.com/orderline> .

Turning a Read-Only View into Simple Updateable View

The resulting RDF View is very simple in use. One may query it via SPARQL without any restrictions on possible query values. The internals of SPARQL to SQL front-end are not as simple as its API, but only three features of the front-end should be kept in mind:

  1. Given quad (or triple) and set of RDF Views, front-end can list all rules that may in principle form the quad in question.
  2. Given IRI and a description of a bijection format and the number of source field in the list of source fields, front-end can print an SQL expression that will return the field value for given IRI.
  3. SPARQL compiler knows which columns form primary keys and other unique keys.

These features let us make a set of triples by constructor clause of a SPARUL statement and turn it into set of SQL data manipulation statements. That's all we need for updating relational data by SPARUL. Of course, for SPARUL MODIFY there will be pair of sets of triples made by pair of DELETE/INSERT constructors, but the processing of every set will be quite similar.

Constructor of SPARUL statement should work as usual, no changes needed. Next, the constructed set of quads (of size N) should be partitioned into groups of triples such that every group describes one tuple of relational data. This can be done by a variant of topological sorting, taking O(N<sup>2</sup>) fixed-time comparisons. After partitioning every group should be checked for conflicting values (values of same table field from different items of a group should not contradict to each other) and time cost is O(N<sup>2</sup>), more correctly, O(max-size-of-group<sup>2</sup> * number-of-groups). One more check should prevent the routine from adding incomplete rows: if a table column is NOT NULL and has no defaults then some triple should provide its value, O(N) time. The rest is straightforward: a group is converted into SQL data manipulation operation. Operations should be done in order of cascading of foreign keys (e.g., orders are inserted before their order lines but deleted after order lines)

Consider the following set of triples:

<http://e5e.com/orderlines?order=12&product=45> order <http://e5e.com/orders?order=12> .

<http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=12> .

<http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=98> .

<http://e5e.com/orderlines?order=12&product=45> quantity 4 .

<http://e5e.com/orderlines?order=98&product=65> quantity 3 .

<http://e5e.com/orderlines?order=98&product=65> quantity 6 .

<http://e5e.com/orderlines?order=77&product=33> rdf:type <http://e5e.com/orderline> .

The partitioning will recognize following groups:


group 1: "orderlines" table, "order"=12, "product"=45

<http://e5e.com/orderlines?order=12&product=45> order <http://e5e.com/orders?order=12> .

<http://e5e.com/orderlines?order=12&product=45> quantity 4 .


group 2: "orderlines" table, "order"=98, "product"=65

<http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=12> .

<http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=98> .

<http://e5e.com/orderlines?order=98&product=65> quantity 3 .

<http://e5e.com/orderlines?order=98&product=65> quantity 6 .


group 3: "orderlines" table, "order"=77, "product"=33

<http://e5e.com/orderlines?order=77&product=33> rdf:type <http://e5e.com/orderline> .


group 4: "orders" table, "order"=12

<http://e5e.com/orderlines?order=12&product=45> order <http://e5e.com/orders?order=12> .

<http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=12> .


group 5: "orders" table, "order"=98

<http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=98> .


The check for contradictions in values of fields will find problems in groups 2 and 4. Triple <http://e5e.com/orderlines?order=98&product=65> order <http://e5e.com/orders?order=12> contradicts to itself, because "order" is equal to 98 in subject but 12 in object. Triples <http://e5e.com/orderlines?order=98&product=65> quantity 3 and <http://e5e.com/orderlines?order=98&product=65> quantity 6 specify different values for same 'quantity' field. Depending on configuration and type of operation, the procedure should either signal an error or remove these triples from the rest of the processing. Note that if check for self-contradiction triples comes before check for conflicts between different triples then group 4 will remain nonempty. This may be better for the rest of the processing because in some cases it will not prevent the rest of procedure from inserting new record in "orders" table that may be required for foreign key integrity.

The check for incomplete groups will detect that group 3 does not specify the NOT NULL quantity field, this is an error in case of SPARUL INSERT operation. Depending on configuration, the procedure should either signal an error or remove the group from the rest of the processing. Group 2 become incomplete after triple removal at previous check and should be removed like group 3.

If a group become empty because all its triples were removed due to contradictions then the group is very-very incomplete.

In SQL UPDATE, a table row is either updated or unchanged, SPARUL MODIFY may in principle cause DELETEs without 'pair' INSERTs, changing total number of rows in a table. If the operation is SPARUL MODIFY then its behavior will become funny as soon as some groups are removed. In current version of Virtuoso, SPARUL UPDATE and SPARUL MODIFY are synonyms, but it may be useful to make semantic of SPARUL UPDATE more strict and prohibit some weird cases by changing the logic: when there is a primary key such that it is associated with some DELETE group but there is no appropriate INSERT group then an error may be signaled (same for INSERT without DELETE). It should not matter whether 'required' groups are missing in the original couple of sets or they are filtered out by checks.

Finally, survived rows of survived groups should form insert, delete or update statements. This part seems to be trivial, in other words it will cause only subtle and hard-to-debug problems.

Complications and Obfuscations

Some quads in the set may not match any quad map pattern. They could be ignored but this will masquerade possible typo errors so depending on configuration the procedure may signal an error.

Some quads in the set may be stored only in default quad storage. These quads may alter the 'native' RDF storage but this may also masquerade possible typo errors so depending on configuration the procedure may also signal an error.

An additional setting is required to specify whether the operation should preserve integrity of foreign keys.

DELETE constructor of SPARUL DELETE or MODIFY may provide redundant information about rows that should be deleted or modified: it's enough to specify only fields of primary key but some triple(s) may contain more fields. It is an open question what should be done with additional fields. Should any attention be paid to them at all? If their values do not match actual values of fields of a row identified by primary key, should an error be signaled? I'd prefer to never worry about redundant fields on DELETE side but tastes may differ.

Some applications may define IRI formats using user-defined procedures. One stored procedure will be used to compose an IRI from N given SQL field values, each of N inverse procedures returns the field value for given IRI. These procedures may access tables, e.g. a procedure may get grantee ID, look at SYS_USERS and print either user IRI or role IRI depending on value of U_IS_ROLE, it may even refuse to compose an IRI at all if the ID value is not found in the table. In addition, the inverse function may refuse to parse an IRI that contains ID that is not found in the table. Obviously such 'strict' formats will prevent the updatable view from any insert of new rows. This it may be convenient to label such 'strict' formats as insert-unfriendly in order to improve run-time error diagnostics: it's much better to explicitly exclude quad map pattern from algorithm than to get strange field values and mystique errors at the end.

Required SPARUL extensions

SPARUL syntax should be extended in two directions: constructors should be more flexible and there should be room for options.

Constructors should be able to form set of quads in different graphs and graph names should be expressions, not necessarily literals, as well as all other expressions.

Virtuoso extends the 'group pattern' syntax with 'QUAD MAP rdf-view { group-pattern } ' clause that looks like 'GRAPH expression { group-pattern } ' and forces the SPARQL compiler to use only one RDF View to bind values of variables in group-pattern. Depending on situation, that restriction will either produce shorter SQL or improves diagnostics of errors in the group-pattern, e.g., a developer may intend to use triple pattern for data from only one given RDF View but the compiler may prove that no one triple generated by the view could match the pattern. The same clause could be useful at constructor patterns of the SPARUL statement.

There should be some syntax to specify the variety of options. There are too may 'depending on configuration' phrases in the above text. Virtuoso extends the SPARQL syntax with 'define parameter-name parameter-value' clauses at the very beginning of the query text, there may be more extensions, e.g., there may be different behavior of SPARUL MODIFY and SPARUL UPDATE. An extra headache is that Virtuoso allows configuration parameters at the beginning of the SPARQL text but not between individual data manipulation statements of a SPARUL request.

Comments