Abstract

RDf is a mechanism to universalize relational data. The vast majority of relational data is stored in conventional SQL data stores. This documents implementation experience mapping RDF queries to SQL data stores. This mechanism can be used to automatically join data from relational databases in different administration (creative) domains.

Status

This is a very rough document and represents the results of my studies and not those of the greater community.

Goals

Unlike much of the current work on SQL RDF data stores, this approach attempts to provide RDF access to data in specialized SQL data stores rather than provide an SQL data store for RDF. Such a mechanism will allow the joining of vast repositories of relational data currently isolated by administrative barriers and only occasionally exported via tedious interfaces.

Efficiency

Data access efficiency is critical in many domains. While conveniently general, a generic triple store forces a join for each term in a complex query. Accessing multiple fields in a record represented in a single table is more efficient.

Following is an ACLs (access control list) example which is used in the W3C web server. For optimization reasons, the ACLs are grouped -- the relevent fields for ACL group 6 are shown here:

SELECT acl,id,access FROM acls WHERE acl=6
|        acls        |
+-----+-----+--------+
| acl |  id | access |
+-----+-----+--------+
| ... | ... |    ... |
|   6 | 100 |   3122 |
|   6 | 102 |   3955 |
| ... | ... |    ... |
+-----+-----+--------+

In a generic triple store, each field above would constitute another entry in the list of triples:

|           triples            |
+-----------+---------+--------+
| predicate | subject | object |
+-----------+---------+--------+
|       ... |     ... |    ... |
|    ...acl | ...2138 |      6 |
|     ...id | ...2138 |    100 |
| ...access | ...2138 |   3122 |
|    ...acl | ...9287 |      6 |
|     ...id | ...9287 |    102 |
| ...access | ...9287 |   3955 |
|       ... |     ... |    ... |
+-----------+---------+--------+

Querying this data requires self-joins for each field you wish to constrain:

SELECT t0.predicate,t0.subject,t0.object,
       t1.predicate,t1.subject,t1.object,
       t2.predicate,t2.subject,t2.object,
FROM triples AS t0, triples AS t1, triples AS t2
WHERE t0.subject=t1.subject AND t0.subject=t2.subject AND 
      t0.predicate="...acl" AND t1.predicate="...id" AND t2.predicate="...access" AND 
      t0.object=6
|       triples as t0          |       triples as t1          |       triples as t2          |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
| predicate | subject | object | predicate | subject | object | predicate | subject | object |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
|    ...acl | ...2138 |      6 |     ...id | ...2138 |    100 | ...access | ...2138 |   3122 |
|    ...acl | ...9287 |      6 |     ...id | ...9287 |    102 | ...access | ...9287 |   3955 |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+

The added cost of the self joins must be balanced against the flexibility of a generic triple store when selecting a database. This cost will not be high in indexed databases with sparse data sets. The cost goes up when multiple initial constraints select large amounts of rows in the triples table which are eliminated by later constraints.

For instance, if the constraints on the above query were re-ordered so that the table (self-join) with the object=6 constraint were taken last:

WHERE t1.subject=t2.subject AND t1.subject=t0.subject AND 
      t1.predicate="...id" AND t2.predicate="...access" AND t0.predicate="...acl" AND 
      t0.object=6

t1 is constrained by having the predicate "...id". If there were 10000 things with the "...id" property in the triple store, this would select 10000 rows. The next (self-)join is on t2 where t2.predicate is "...access" and t2.subject = t1.subject. This would likely eliminate no rows at all, and, if some had more than one "...access" predicate, could cause the row set to multiply. The last constraints say that our object also has an "...acl" property and that its value is 6, trimming our row set down to 2 and throwing away the rest of the rows that had been kept in memory to this point.

Steps

In exporting an RDF view of a relational database (or business process), it is rare that one would choose to export the actual implementation details. It is more likely that the exporter would have an abstract model of the data that would represent the interface provided by the exporter. Agents outside of the exporter's administrative control would be invited to use this interface. The back end database would be mapped to the exported interface. For instance, W3C's access control system exports an RDF interface that exports RDF data and accepts POSTed RDF data in a particular schema.

Once the data is translated from the exported interface is translated to the internal implementation-specific RDF schema, it must be translated to actually SQL queries. In this opperation, an RDF query like

 namespace '(a http://localhost/SqlDB#)
    attach '(\"W3C::Rdf::SqlDB\" (\"properties:/usr/local/perl/modules/Conf/chacl.prop\" \"name:a::W3Cacls\"))       
    ask '(a::W3Cacls
     (a::uris.uri ?uri0 http://www.w3.org/Member/Overview.html)
     (a::uris.acl ?uri0 ?t1)
     (a::acls.acl ?acl0 ?t1)
     (a::acls.access ?acl0 ?access)
     (a::ids.value ?id0 \"eric\")
     (a::ids.info ?id0 ?info)
     (a::idInclusions.id ?idInc0 ?id0)
     (a::acls.id ?acl0 ?t0)
     (a::idInclusions.groupId ?idInc0 ?t0)
    )

is tranlated to SQL like

SELECT 
  b.acl as acl0_acl,b.id as acl0_id,b.access as acl0_access,
  a.id as uri0_id,
  c.id as id0_id,
  b.access as access,
  b.id as t0,
  d.id as idInc0_id,d.groupId as idInc0_groupId,
  a.acl as t1
 FROM uris AS a,acls AS b,ids AS c,idInclusions AS d
 WHERE (a.uri="http://www.w3.org/Member/Overview.html")
   AND (a.acl=b.acl)
   AND (c.value="eric")
   AND (c.id=d.id)
   AND (b.id=d.groupId);
+----------+---------+-------------+---------+--------+--------+-----+-----------+----------------+----+
| acl0_acl | acl0_id | acl0_access | uri0_id | id0_id | access | t0  | idInc0_id | idInc0_groupId | t1 |
+----------+---------+-------------+---------+--------+--------+-----+-----------+----------------+----+
|        6 |     100 |        3122 |     810 |   2112 |   3122 | 100 |      2112 |            100 |  6 |
|        6 |     102 |        3955 |     810 |   2112 |   3955 | 102 |      2112 |            102 |  6 |
+----------+---------+-------------+---------+--------+--------+-----+-----------+----------------+----+

Some steps to translation:

Table Identification

Tables are identified by prefixing the name with a base URI supplied in the configuration.

Field Identification

Fields are identified by prefixing the name with the table identifier.

Row Identification

Identification of a row in a table was done by identifying the row by a unique index on that table. For instance, the acls table has a single unique index on it formed by the combination of the acl, id and access fields:

+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
| acls  |          0 | PRIMARY  |            1 | acl         | A         |        NULL |     NULL | NULL   |         |
| acls  |          0 | PRIMARY  |            2 | id          | A         |        NULL |     NULL | NULL   |         |
| acls  |          0 | PRIMARY  |            3 | access      | A         |         241 |     NULL | NULL   |         |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+

Therefor, a row in the acls table could be identified by

http://localhost/SqlDB#acls.acl=6&id=100&access=3122

which indentifies the one row in the table where acl = 6, id = 100 and access = 3122.

External Keys

Fields that represent external keys to another table are identified by the row identifier in that other table. For instance, the id field in the acls

+--------+------------------+------+-----+---------+----------------+
| Field  | Type             | Null | Key | Default | Extra          |
+--------+------------------+------+-----+---------+----------------+
| acl    | int(10) unsigned |      | PRI | NULL    | auto_increment |
| type   | char(1) binary   |      |     |         |                |
| id     | int(10) unsigned |      | PRI | 0       |                |
| access | int(10) unsigned |      | PRI | 0       |                |
| last   | timestamp(14)    | YES  |     | NULL    |                |
+--------+------------------+------+-----+---------+----------------+

table constitutes a reference to the id field in the ids table. Therefor, the range of the #acls.id field is an http://localhost/SqlDB#ids table row represented by a URI like

http://localhost/SqlDB#ids.id=100

Variable Bindings

While walking through the RDF to be translated, it proved easiest (after trying several other approaches) to record each place where a variable was used in the constructed SQL query and construct the constraints from list of references to that variable. Furthur, any place that variable was used as an object (the second position in the algae syntax above), represents the appropriate identifier for that node (row). Alternatively, one could say that the any and all object (field value) refrences to such a node should be documented as being external keys. This approach puts more of an assumption on the modeling completeness of the database relations than is otherwise necessary.

External RDF to Internal RDF Mapping

Mapping RDF schema to each other is (appears to be) as simple as defining the rules for the mapping. For instance, The W3C acls SQL database implies a structure where a uri object has a reference to a set of acl objects which each grant a specific access to a specific group.

nodes and arcs diagram of ACLs schema

This needs to be translated to the less pretty internal structure

nodes and arcs diagram of ACLs relational database schema

by a simple set of translation rules

( namespace '(sqlDB http://localhost/SqlDB# \
             acl http://www.w3.org/2001/02/acls/ns# \
             rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#) \
 attach '(\"W3C::Rdf::SqlDB\" \
               (\"properties:/usr/local/perl/modules/Conf/chacl.prop\" \
                \"name:sqlDB::W3Cacls\")) \
 interface (sqlDB::W3Cacls
            API  '((acl::login                   ?u       ?name)) \
            impl '((sqlDB::ids.value             ?u       ?name))) \
 interface (sqlDB::W3Cacls
            API  '((acl::memberOf                ?sub     ?super)) \
            impl '((sqlDB::idInclusions.groupId #?g1      ?super) \
                   (sqlDB::idInclusions.id       ?g1      ?sub))) \
 interface (sqlDB::W3Cacls
            API  '((acl::accessor                ?a       ?principal)) \
            impl '((sqlDB::acls.id               ?a       ?principal))) \
 interface (sqlDB::W3Cacls
            API  '((acl::accessTo                ?a       ?uri)) \
            impl '((sqlDB::acls.acl              ?a      #?acl) \
                   (sqlDB::uris.acl             #?urisRow ?acl) \
                   (sqlDB::uris.uri              ?urisRow ?uri))) \
 interface (sqlDB::W3Cacls
            API  '((acl::access                  ?a       ?access)) \
            impl '((sqlDB::acls.access           ?a       ?access))) \
)

in order to be processed by the SQL translator. This produces a graph matching the original internal graph:

nodes and arcs diagram of acls-internal-ex1 translated to the internal schema

This allows us to answer the query

(ask '(sqlDB::W3Cacls
       (acl::accessTo   ?acl       http://www.w3.org/Member/Overview.html) 
       (acl::access     ?acl       ?access) 
       (acl::accessor   ?acl       ?accessor) 
       (acl::memberOf   ?u1        ?accessor) 
       (acl::login      ?u1        "eric")) 
 collect '(?acl))

by translating it to

((http://localhost/SqlDB#uris.uri ?urisRow http://www.w3.org/Member/Overview.html)
(http://localhost/SqlDB#uris.acl ?urisRow ?aacl)
(http://localhost/SqlDB#acls.acl ?acl ?aacl)
(http://localhost/SqlDB#acls.access ?acl ?access)
(http://localhost/SqlDB#ids.value ?u1 "eric")
(http://localhost/SqlDB#idInclusions.id ?g1 ?u1)
(http://localhost/SqlDB#idInclusions.groupId ?g1 ?accessor)
(http://localhost/SqlDB#acls.id ?acl ?accessor))

and getting

(<http://localhost/SqlDB#acls.acl=6&id=100&access=3122>)
(<http://localhost/SqlDB#acls.acl=6&id=102&access=3955>)

ToDo

Rectify model diffs
Converge extra uri node pointing at the accessTo resource (http://www.w3.org/Member/Overview.html). Perhaps this should be done with a daml:uniqueProperty on uris.uri.
expand access bitmask
The access privileges represented by the access bitmask represent a set of access privileges that should be represented by a repeated property.

Done

External to internal translation rules
implement interface
Implement the code that parses and dispatches interface directives on incoming queries and outgoing proofs.

See Also

current work in RDF/SQL mappings
DBview - Dan Connolly, Tim Berners-Lee
DBview provides a serialization of the contents of SQL tables.
squish - Libby Miller, Dan Brickley et al.
Squish is a practical implementation of a generic RDF statement store in SQL.
relational database encoding in XML

Valid XHTML 1.0!

Eric Prud'hommeaux
Last modified: Wed Nov 14 18:57:55 EST 2001