W3C

Notes on Adding SPARQL to MySQL

SPASQL Example · ② SPASQL-MySQL · ③ SPASQL XTech Paper · ③ XTech Slides · ⑤ SPASQL CVS Tree

This is an experiment to add SPARQL support to MySQL. It's based on query translation work currently implemented in OO perl. An example of that translation is describe in a write-up on of an interface to WWW 2005 conference data. If anyone is interested in participating, please contact <eric+mysql@w3.org>.

Table of Contents

  1. Interface
  2. Query Construction
  3. Table Structure
  4. Referenced Code Snippets
  5. Example: parts flow
  6. Next Steps
  7. Building on Debian
  8. Building from MySQL sources
  9. See Also
  10. Change Log

Interface

The results of a SPARQL select are largely expressable in an SQL result set. It makes a preverse kind of sense to make SPARQL queries available to existing MySQL clients by re-using the MySQL wire protocol.

mysqld offers two interfaces, a named pipe (probably something equivilent on Windows, if mysqld runs on Windows), and a socket interface. msyql clients speak a simple line protocol that consists of a one-byte command and parms. In the case of a query, the command is known by the enum COM_QUERY (3) and looks like (in gdb escaping):

p packet
$16 = 0x8b45fe0 "\003SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 2"

The MySQL command line client, mysql, passes commands that it does not recognize as queries. For instance,

mysql> SPARQL:SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 2;

got to the server as

\003SPARQL:SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 2

The command stack for a successfully parsed query

#1  0x0814acdd in handle_select (thd=0x8b44ff0, lex=0x8b4e0b0, 
    result=0x8b4e2e8) at sql_select.cc:183
#2  0x08129c7d in mysql_execute_command () at sql_parse.cc:1509
#3  0x0812e879 in mysql_parse (thd=0x8b44ff0, inBuf=0x8b45120 "\001", 
    length=146034684) at sql_parse.cc:3080
#4  0x081289c8 in dispatch_command (command=COM_QUERY, thd=0x8b44ff0, 
    packet=0x8b45fe1 "SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 4", packet_length=146034672) at sql_parse.cc:1113

presents several potential places to intercept SPARQL queries. Options:

Query Construction

10.2.3 Adding a New Native Function has some pointers for adding functionality to MySQL. The impedance mismatch between SPARQL and SQL will require more effort by the parser/post-processor than the current parser code in sql_yacc.yy . The query construction will build an internal datastructure similar to that of SqlDB.pm . It will transform that into a set of calls to something like

void join_on(Field *f1, Field *f2) {
    Item *exp = new Item_func_eq(new Item_field(NullS, f1.tableName, f1.fieldName), 
				 new Item_field(NullS, f2.tableName, f2.fieldName))
    add_join_on(f2.tableName, exp)

Table Structure

SQL queries don't care about foreign keys or your entity relationship diagram or any of that. It is no more natural to join Person.address ON Address.primaryKey than to join Person.shoeSize ON Address.streetNumber. (Relational calculus cares about domains, but SQL doesn't reflect that, as far as I've seen.) SQL treats foreign keys as integrity constraints at modification time. Consequentially, many databases do not have any hints about how tables/fields are linked.

RDF encourages us to model relationships conceptually. Joe's address is not 21853, but instead the data in the tuple with the primary key 21853. Query tranlation leans heavily on a table of foreign keys. They are detected by structures like ?x p ?KEY . ?KEY p2 ?z where ?KEY is the object of one arc and the subject of another. SPASQL introduces a new primary key Item to MySQL which resolves it's field reference after the schema is loaded for the tables.

OPTIONALs

Unlike SQL LEFT OUTER JOINS, SPARQL's OPIONALs are potentially complex graph patterns, with multiple joins, references to new variables and previously bound variables, and FILTER constraints. The general recipe for this is to keep track of the variables (and therefor, the table attributes) introduced by an OPTIONAL graph pattern. The WHERE constraints need to enforce the FILTER conmstraints only if all the introduced variables are NOT NULL. Despite the fact that the relational engine may have been able to find some bindings for optional variabls, we can only report them if all the variables in the pattern were bound. For example

SELECT ?orderDate ?postCodePos ?districtPos 
 WHERE { ?order <Orders.orderDate> ?date . 
    OPTIONAL { ?order <Orders.shippingAddress> ?address . 
               ?address <Addresses.pattern> ?inf . 
               ?inf <AddressPattern.postCodePosn> ?postCodePos . 
               ?inf <AddressPattern.districtPosn> ?districtPos 
               FILTER ( ?postCodePos != "N/A" && 
                        ?districtPos != "N/A"
                      )
             }
       }

translates to

SELECT ?orderDate, 
       if(?address IS NULL || ?inf IS NULL || ?postCodePos IS NULL || ?districtPos IS NULL, NULL, ?postCodePos ) AS ?postCodePos, 
       if(?address IS NULL || ?inf IS NULL || ?postCodePos IS NULL || ?districtPos IS NULL, NULL, ?districtPos ) AS ?districtPos
WHERE ( ?address IS NULL || ?inf IS NULL || ?postCodePos IS NULL || ?districtPos || 
        (?postCodePos != "N/A" && ?districtPos != "N/A")
      )

The remaining problem is that one may get duplicate rows like

?orderDate?postCodePos?districtPos
2002-09-08 00:00:00NULLNULL
2002-09-08 00:00:00NULLNULL

if there were, say, two AddressPatterns @@need backward pattern to illustrate properly?@@ for a given address, but some other binding in the OPTIONAL fails.

Example: parts flow

An example that nests UNIONs and OPTIONALs is essential for working out the GUT (grand unified theory). This example shows what parts will pass through shipping and recieving:

Verbose SPARQL parts flow query

SELECT ?part ?descr ?given ?family ?posn ?rma ?billCity ?billState ?shipCity ?shipState
 WHERE { { ?order <Orders.product> ?prod .
           ?order <Orders.customer> ?cust .
           OPTIONAL { ?order <Orders.shippingAddress> ?shipAddr .
                      ?shipAddr <Addresses.city> ?shipCity .
                      ?shipAddr <Addresses.state> ?shipState }
         } UNION {
           ?order <Returns.product> ?prod .
           ?order <Returns.customer> ?cust .
           ?order <Returns.RMAnumber> ?rma
         }
         ?prod <Products.partNo> ?part .
         ?prod <Products.description> ?descr .
         { ?cust <Customers.id> ?num .
           ?cust <Customers.givenName> ?given .
           ?cust <Customers.familyName> ?family .
           ?cust <Customers.billingAddress> ?addr
         } UNION {
           ?cust <Employees.id> ?num .
           ?cust <Employees.givenName> ?given .
           ?cust <Employees.familyName> ?family .
           ?cust <Employees.position> ?posn
         }
         ?cust <Employees.homeAddress> ?addr .
         ?addr <Addresses.city> ?billCity .
         ?addr <Addresses.state> ?billState }

or, factored into predicates that can map to one of two tables:

Terse SPARQL parts flow query

SELECT ?part ?descr ?given ?family ?posn ?rma ?billCity ?billState ?shipCity ?shipState
 WHERE { ?order <Orders-Returns.product> ?prod .
         ?order <Orders-Returns.customer> ?cust .
         { OPTIONAL { ?order <Orders.shippingAddress> ?shipAddr .
                      ?shipAddr <Addresses.city> ?shipCity .
                      ?shipAddr <Addresses.state> ?shipState }
         } UNION {
           ?order <Returns.RMAnumber> ?rma
         }
         ?prod <Products.partNo> ?part .
         ?prod <Products.description> ?descr .
         ?cust <Customers-Employees.id> ?num .
         ?cust <Customers-Employees.givenName> ?given .
         ?cust <Customers-Employees.familyName> ?family .
         { ?cust <Customers.billingAddress> ?addr
         } UNION {
           ?cust <Employees.position> ?posn
         }
         ?cust <Employees.homeAddress> ?addr .
         ?addr <Addresses.city> ?billCity .
         ?addr <Addresses.state> ?billState }

These correspond to:

SQL SPARQL parts flow query

SELECT Products_0.partNo AS part,
       Products_0.description AS descr,
       U1.given,
       U1.family,
       U1.posn,U0.rma,
       Addresses_1.city AS billCity,
       Addresses_1.state AS billState,
       U0.shipCity,
       U0.shipState
  FROM  (  
              SELECT¿↔  Orders_0.customer AS cust,
                     Orders_0.product AS prod,
                     NULL AS rma,
                     IF(Addresses_0.city IS NULL OR Addresses_0.state IS NULL, NULL, Addresses_0.city) AS shipCity,
                     IF(Addresses_0.city IS NULL OR Addresses_0.state IS NULL, NULL, Addresses_0.state) AS shipState
                FROM Orders AS Orders_0
                     LEFT OUTER JOIN Addresses AS Addresses_0 ON Orders_0.shippingAddress=Addresses_0.id
               WHERE Orders_0.customer IS NOT NULL AND Orders_0.product IS NOT NULL
        UNION ALL  
              SELECT Returns_0.customer AS cust,
                     Returns_0.product AS prod,
                     Returns_0.RMAnumber AS rma,
                     NULL AS shipCity,
                     NULL AS shipState 
                FROM Returns AS Returns_0
               WHERE Returns_0.customer IS NOT NULL AND Returns_0.product IS NOT NULL AND Returns_0.RMAnumber IS NOT NULL
    )        AS U0  
    INNER JOIN Products AS Products_0 ON U0.prod=Products_0.id
    INNER JOIN (
              SELECT Customers_0.id AS num,
                     Customers_0.givenName AS given,
                     Customers_0.familyName AS family,
                     NULL AS posn,
                     Customers_0.billingAddress AS addr
                FROM Customers AS Customers_0
               WHERE Customers_0.id IS NOT NULL AND Customers_0.givenName IS NOT NULL AND Customers_0.familyName IS NOT NULL AND Customers_0.billingAddress IS NOT NULL
        UNION ALL
              SELECT Employees_0.id AS num,
                     Employees_0.givenName AS given,
                     Employees_0.familyName AS family,
                     Employees_0.position AS posn,
                     Employees_0.homeAddress AS addr
                FROM Employees AS Employees_0
               WHERE Employees_0.id IS NOT NULL AND Employees_0.givenName IS NOT NULL AND Employees_0.familyName IS NOT NULL AND Employees_0.position IS NOT NULL AND Employees_0.homeAddress IS NOT NULL
    ) AS U1 ON U1.num=U0.cust
    INNER JOIN Addresses AS Addresses_1 ON U1.addr=Addresses_1.id
 WHERE Products_0.partNo IS NOT NULL AND Products_0.description IS NOT NULL AND Addresses_1.city IS NOT NULL AND Addresses_1.state IS NOT NULL;
partdescrgivenfamilyposnrmabillCitybillStateshipCityshipState
G1013poolBiffThompsonEdgeCityAV
G0max5skateboardChipThompsonEdgeCityAV
G0jwl13nose ringChipThompsonEdgeCityAV
G0jpl56other ringChipThompsonEdgeCityAVEdgeCityAV
G1grl5grillWillieWonkamad scientist湘南台神奈川
G0rk5rebelious musicChipThompsonfickle2EdgeCityAV

Query Federation

The SPARQL language supports named graphs which identify the source of information. SPARQL queries can use named graphs to allow the query to direct portions of the data collection/unification to occur at any data source which supports a query protocol. These sources may be SPARQL endpoints or other databases that support SPARQL or some similarly expressive language. The FeDeRate for Drug Research demonstrates such a query in the pharmaceutical domain.

Practically, the engine doing the federation needs to know what language to use to speak to the remote data source. If the remote is a SPARQL endpoing (something that speaks the SPARQL Protocol, the protocol will be HTTP. If it's another MySQL database, the federating engine can use either SQL or SPARQL over the MySQL protocol.

Query Dispatch — Iteration

One way to federate queries is to ask the federated query one time for each row in the current working set, substituting in all the bound variables in each row. This process is described in FeDeRate for Drug Research. This process works for SQL, SPARQL and XQuery (untried, but theoretically true) remote query services.

I have no eartly idea how to implement this in MySQL. Fortunately, the two methods described below allow all of the relevent bindings to be shipped to the remote query service at once.

SPARQLfed

The SPARQLfed query language allows one to bundle the relevent variables from the result set with a query. Instead of the client asking a query one time per binding, the server can parse the query either construct a temporary table with the passed variable bindings, or iterate over the substitutions locally. Either way, it is quicker than having the client ask the question over and over again.

SQLfed

SQLfed is like SPARQLfed except that the bindings are specifically passed as a temporary table creation, inserts into that table, and a join against that table.

CREATE TABLE T23123 (name VARCHAR(255), planNo INTEGER);
INSERT INTO T23123 (name, planNo) VALUES ("bob", 23), ("sue", 18), ("trina", 23);
SELECT Person_0.email FROM Person AS Person_0
       INNER JOIN HealthPlan AS HealthPlan_0 ON HealthPlan_0.planID=Person_0.healthPlan
       INNER JOIN T23123 ON T23123.name=Person_0.email
                            AND HealthPlan_0.planID=T23123.planNo;
DROP TABLE T23123;

This requires table creation, insert (, select) and table drop privileges; pretty much the keys to the castle.

Plan

Apparently, Brian Miezejewski has some code that may be helpful here. Pinging him now.

Progress/Milestones

Next Steps

Building on Debian

The sparql directory has all the pieces needed to patch and re-build a MySQL with SPARQL support. One can build from the debian source:

  1. @@debian build foo@@

    Each time you make a mistake, you can rm configure-stamp to force autoconf to run again.

  2. install the sparql directory under the mysql-dfsg-5.0-5.0.?? directory.
  3. build the binary:

or just install a pre-built binary:

Building from MySQL sources

Dynamic Parser

The dynamically loadable parser provides a very nice way to develop and update the SPASQL/MySQL implementation (or any other language) without compiling MySQL. It extends the client/server protocol with flags for 3 user parsers in addition to the regular SQL parser. The user and set and query the dynamic parser through the system variables dymanic_parser1, dymanic_parser2, dymanic_parser3. Parsers are loaded by setting the dynamic_parserN global variable.

SET GLOBAL dynamic_parser2='libsparql.so'
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like '%parse%';
+-----------------+--------------+
| Variable_name   | Value        |
+-----------------+--------------+
| dynamic_parser1 |              | 
| dynamic_parser2 | libsparql.so | 
| dynamic_parser3 |              | 
+-----------------+--------------+
3 rows in set (2.61 sec)

The mysql client has been extended to spot queries starting with "parser1", "parser2" or "parser3" and pass them to the server with the appropriate request identifier.

API

The dynamic parsers are derived from the class Dynamic_parser, defined in include/dyn_parser.h. Each library object has an initialize function which takes a THD pointer and a pointer to the string to be parsed. (In this implementation, the parser may modify or truncate the string; sql_parse simply deletes it when the parser is done.) initialize returns a parser object. mysqld calls the bool parse() function and continues throgh the rest of the code path as it would have after calling parse_sql.

API extensions seem likely, however, they do not affect the stability of the rest of MySQL.

See Also

Change Log

 
$Log: Overview.html,v $
Revision 1.24  2008/02/18 15:37:23  eric
+ Next Steps

Revision 1.23  2008/02/15 03:22:14  eric
~ found 5.2 patches

Revision 1.22  2007/09/05 20:01:51  eric
~ reflect updates to SPASQL/MySQL impl (5.1, 5.2)

Revision 1.21  2007/06/19 23:06:58  eric
+ (most of?) Query Federation plan

Revision 1.20  2007/06/19 22:12:33  eric
+ W3C icon

Revision 1.19  2007/06/19 22:08:16  eric
~ navigation bar

Revision 1.18  2007/05/25 20:18:57  eric
+ more specific patch instructions

Revision 1.17  2006/05/27 13:51:12  eric
+ started Building on Debian section

Revision 1.16  2006/05/15 10:47:54  eric
+ subsections for the parts flow example

Revision 1.15  2006/05/14 19:23:35  eric
fix city and state

Revision 1.14  2006/05/06 23:20:49  eric
~ fixed parts flow SQL

Revision 1.13  2006/05/06 00:47:38  eric
added states to the parts flow example

Revision 1.12  2006/05/06 00:40:09  eric
+ details from SQL -> RDF model mapping

Revision 1.11  2006/05/01 21:29:24  eric
+ parts flow example



Eric Prud'hommeaux, W3C Team Contact for the RDF Data Access Working Group <eric@w3.org>
$Date: 2008/02/18 15:37:23 $