W3C

SPASQL: SPARQL Support In MySQL

SPASQL Example · ② SPASQL-MySQL · ③ SPASQL XTech Paper · ③ XTech Slides · ⑤ SPASQL CVS Tree

Abstract

FeDeRate [FED] provides a mapping between RDF queries and SQL queries over conventional relational databases. SPASQL provides similar functionality, but eliminates the query rewriting phase by parsing the SPARQL [SPARQL] directly in the database server. This approach is efficient, and allows applications linked to currently deployed MySQL client libraries to execute SPARQL queries and get the results back through the conventional MySQL protocol.

The parallels between the SPARQL Result Set and SQL relational results allow applications to issue SPARQL queries and process the results as they would process SQL query results. In practical terms, this would entail changing one's MySQL server and changing a query in a PHP script to issue a SPARQL query instead of a SQL query, greatly simplifying the deployment requirements of systems like FeDeRate.

Relational schemas tend to be brittle; they require maintenance when new attributes need to be stored. Storing odd attributes in a generic triple store allows a database to efficiently handle the uniform data, and provide flexible storage for irregular data in the generic triple store. Queries on conventional relations can be joined with triple store queries, providing an efficiency where possible and flexibility where needed.

Table of Contents

1. Introduction

RDF data is expressed as triples, each consisting of a subject, predicate and object. Relational data is stored in relations (more commonly known as tables). SPARQL is a query language for RDF data and SQL is a query language for relational databases. Both query languages are tailored to their own data domains, but they have similar pattern matching expressivity. RDF grounds terms on URIs, which makes it convenient to express data in a universal context. A SPARQL query can express a query federated over multiple relational databases, web documents, and other kinds of stores.

SPASQL is a modified MySQL server which is able to parse both SQL and SPARQL queries. This allows one to embed SPARQL queries in any MySQL client. For instance, a PHP page may include SPARQL queries where it used SQL before.

<? mysql_connect(localhost,$username,$password);
   @mysql_select_db($database) or die( "Unable to select database");

   mysql_query("SPARQL: 
SELECT ?address ?apt
 WHERE { ?o <Orders.shippingAddress> ?address .
         ?address <Addresses.apt> ?apt };");
?>

Because the server continues to process SQL queries (with no change in performance), one may install the SPASQL server without breaking existing infrastructure. This is exciting both because it allows one to choose either language to query relational data, and because it opens up a world of database federation possibilities,

2. SQL - SPARQL Mapping

SQL is a language for querying relational data. SPARQL is a language for querying RDF data. Of course, Mapping SPARQL to SQL requires first mapping the relational data to RDF data. This mapping should be intuitive and useful. Given a particular data mapping, a correctly mapped SQL query is one that yields the same results when executed over the mapped relational data as the original SPARQL query over the original RDF data.

RDF is a graph-based language. That is, it has links built into it. Where data in SQL encodes links as coincident attributes, RDF incorporates that link into the data model. For example, a SPARQL query does an implicit join:

SELECT ?d ?apt
 WHERE { <Orders.id=3183> <Orders.shippingAddress> ?d .
         ?d <Addresses.apt> ?apt }

where an SQL query requires an explicit foreign key/primary key equivalence test:

SELECT A.id AS d, A.apt AS apt
  FROM Orders AS O_0
 INNER JOIN Addresses AS A ON O.shippingAddress=A.id
 WHERE A.id=3183

2.1 Data Mapping

MySQL compiles SQL queries into a data structure that includes the joined tables and the constraints (coming from the SQL WHERE clause). SPASQL compiles SPARQL queries into the same data structures, allowing the MySQL engine to execute them the same as SQL queries.

An engine executing SQL queries does not know about foreign keys or entity relationship diagram or anything like that. It is no more natural to join Person.address ON Address.primaryKey than to join Person.shoeSize ON Address.streetNumber. (Relational calculus cares about domains, but SQL does not reflect that, as far as I've seen.) SQL treats foreign keys as integrity constraints at modification time. Many databases operate without any metadata about how tables/fields are linked.

This mapping is most easily expressed as a relational to RDF mapping. In relational calculus, a table (AKA "relation") has some number of fields (AKA "attributes"). The mapping uses an RDF predicate to identify a given field in a table. For example, the RDF triple <Orders.id=2185> <Orders.orderDate> "2002-09-07 00:00:00" corresponds to a table with a row (AKA "tuple") in the Orders table with id=2185 and orderDate="2002-09-07 00:00:00". RDF encourages us to model relationships conceptually. Joe's address is not 2, but instead the data in the tuple with the primary key 2. SPASQL query translation infers primary keys from the structure of the query.

Note: this implementation is limited to single-field primary and foreign keys (see Multi-field Keys in the Future Work section).

2.2 Query Construction

The expressivity of the queries exceeds that of SQL in at least one place. As mentioned above, SQL forces the query to express links as foreign key / primary key equivalence tests. The SPARQL user need not know anything about primary keys. In fact, they can be considered to be simply a data encoding detail, and not a property of the conceptual data model. Mechanically, links to tables must be resolved to be links to the foreign key of that table, so it is ultimately the responsibility of SPASQL to construct query trees that efficiently discover the foreign key and test equivalence to the foreign key.

MySQL only knows the structure of a database (indexes, primary keys, ...) when it is executing the query and has loaded the table descriptors into memory. At parse time, SPASQL constructs a late-binding field reference, which gets the name of the field after the table schemas are loaded. This code is executed in the Item_primary_key_field::fix_fields, which is a virtual function that MySQL calls on all atoms in the query structure after the tables are loaded. This results in a small number of extra clock cycles as SPASQL walks through the list of tables, and then the list of indexes for the desired table to find the primary key. In practice, the number of joined tables tends to be small (not at all worth the overhead of changing the list to a hash) and the primary key tends to be the first index in the list.

The ability to join on fields that are selected at query execution represents an expressivity greater than that of SQL, or even, relational calculus. However, unlike most higher level extensions to predicate calculus, it only trivially changes the performance.

2.3 Feature Mapping

The overlap between SPARQL expressivity and SQL expressivity is very high. The following table show how the features are mapped, and the SPASQL implementation state:

SPASQLSQLstatus
SELECTSELECTimplemented
ASKSELECT COUNT(*) > 0not
CONSTRUCTserialize RDF graphnot
DESCRIBEreturn tuple identifiernot
s p o data modeltuple with attribute corresponding to pimplemented
FILTERWHEREimplemented
OPTIONAL patternLEFT OUTER JOINimplemented
UNIONUNIONpartial, see UNION Limitations
named graphsnamed databases and federated querynot
Table Result Modifiers
DISTINCTDISTINCTimplemented
ORDER BYORDER BYimplemented
LIMITLIMITimplemented
OFFSETOFFSETimplemented
Operators
|| && + - * / < <= > >= sameimplemented
BOUNDIS NOT NULLimplemented
isIRIN/A
isBlankN/A
isLiteralN/A
strN/A
langN/A
datatypenot a dynamic question
langMatchesN/A
regexregexnot

may be applicable in another mapping of relational data to RDF.

2.4 Conjunction Mapping

The basic SPARQL form pattern1 . pattern2 denotes a conjunction of pattern1 and pattern2. If pattern1 and pattern2 have the same subject and the predicates correspond to columns in the same table, these constraints identify non-NULL properties of the same tuple; no table join is implied. If the object of pattern1 is the subject of pattern2, pattern1's property is assumed to be a foreign key to the primary key of the table indicated by pattern2's predicate. This foreign key / primary key test is used as the join constraint, similar to that produced by parsing the expression in JOIN table ON expression.

2.5 OPTIONAL Mapping

In SPARQL, the keyword OPTIONAL introduces new graph pattern which might not be bound. This corresponds to an OUTER JOIN in SQL, however, extra constraints are needed to assure the integrity of any conjunctions in the graph pattern. This includes extra (OUTER) JOINs implied by the graph pattern. For example, all variables bound by UNION { pattern1 . pattern2 } must be not-NULL. In addition, any FILTER constraints in the graph pattern must be met if the introduced variables are to be bound.

2.6 FILTER Mapping

SPARQL FILTERs are like SQL's WHERE constraints, except that they may be scoped by being in an OPTIONAL pattern. SPASQL accumulates a set of constraints from all the FILTERs, adds the appropriate graph integrity constraints for any FILTERs in OPTIONAL graph patters, and assembles a constraint datastructure similar to that produces by parsing an SQL WHERE clause.

2.7 UNION Mapping

SQL has UNIONs, but the semantics are much simpler than those of SPARQL's UNION. Two problems arise:

3. Protocol

SPASQL uses the existing MySQL protocol with existing MySQL client libraries. Practically, this means that installing SPASQL requires only installation of the augmented server. This is motivated by simplicity of migration and the opportune re-use of MySQL's result structure.

3.1 Deployment

The MySQL protocol is well deployed and supported. MySQL client libraries are available for most computer langages on most platforms. Organizations typically have a small number of database installations and a large number of clients for those databases. W3C, for instance, has one relational database but many varied clients. It is much simpler for such an organization to offer new query services by upgrading the database server.

3.2 SPARQL XML Results Similar to MySQL Tables

The results of a SPARQL select are largely expressable in an SQL result set. It makes a perverse kind of sense to make SPARQL queries available to existing MySQL clients by re-using the MySQL wire protocol.

The SPARQL XMLres reports an ordered set of variable bindings. This is very similar to the table results of an SQL query. Unbound variables in SPARQL correspond naturally to NULLs in SQL. The primary distinction between the two result formats, non-uniform datatypes, is generally not an issue when the data comes from a relational database, as the column types are homogeneous. The exception is that SQL UNIONs can produce columns of data with non-uniform type. This is one of the reasons that UNION is considered only partially done in this implementation.

3.2.1 Data Type Uniformity

Relational databases make a set of practical assumptions to help performance. One of those is that each value in a column of a database has the same type. For instance, every part number in a partNumber field will be an unsigned integer. Some systems require variable datatypes, but this is encoded at an application level, for instance, by creating another field that identifies the datatype of a particular field. Datatypes that are opaque to the database are beyond the scope of SPASQL. Any application code that worked with variably typed fields in a relational database can employ the same logic to differentiate data reported by SPARQL queries.

RDF literals have attached datatypes. One persons' height may be recorded in cm, while another, in the same RDF database, may be recorded in inches. However, RDF data coming from a relational database will have consistent datatypes, allowing SPASQL to utilize the MySQL results format which associates a datatype with every value in a column.

3.3 Multiplexing SPARQL and SQL

mysqld offers two interfaces, a named pipe (probably something equivalent on Windows, if mysqld runs on Windows), and a socket interface. MySQL clients pass the server a simple request that consists of a one-byte command and parameters. In the case of a query, the command is known by the enum COM_QUERY (3) and looks like (in gdb escaping):

p packet
$16 = 0x8b45fe0 "\003SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 2"

The MySQL command line client, mysql, passes commands that it does not recognize as queries. For instance,

mysql> SPARQL:SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 2;

got to the server as

\003SPARQL:SELECT * FROM __Holds__ INNER JOIN __Nodes__ ON __Holds__.p=__Nodes__.id  LIMIT 2

If we wish to re-use the SPARQL protocol, we can:

3.4 Encoding Issues

SPASQL's multiplexing approach requires all SPARQL queries to go across the same data connection as SQL queries. Following are some consequential issues and the resolution:

3.4.1 Differentiating SPARQL Queries

The chosen interface accepts SPARQL queries on the same channel as MySQL's other queries. The codes paths are identical except for the parser. The two languages do not overlap (no SPARQL query is a SQL query and visa versa), and it is possible to execute one parser and then the other, looking for the one that parses a particular query. However, both parsers introduce side-effects associated as they parse, potentially corrupting the query structure for a query meant for the other parser.

It is thus practical to know which parser to execute. This also allows the correct parser return helpful error messages to the user in the event of a query that is invalid in both languages.

Any query with the leading text "SPARQL:" was outside of the existing MySQL grammar. This text serves to distinguish SPARQL queries in SPASQL. The simple code

    if (!strncmp(inBuf, "SPARQL:", 7)) {
      lex->ptr = (uchar*)inBuf+7;
      sparqlFrob frob(thd, (char**)&lex->ptr);
      parse_res = frob.parse();
    } else {
      parse_res = yyparse((void *)thd);
    }

can be easily modified to change the default query language or select another SPARQL or MySQL query indicator. Note: this is a rather small and resilient patch; it is unlikely to break with normal MySQL evolution. The table structures are probably similarly resilient, though the future is hard to predict.

3.4.2 Escaping the MySQL Line Terminator

MySQL uses an unquoted ';' as a line terminator. This character also appears in the SPARQL grammar (note the PropertyListNotEmpty production). This character must be escaped ("\;") when it is not inside quotes.

4. Implementation Challenges

4.1 Complexity

MySQL is an industrial-grade SQL engine. Its design emphasizes performance over ease of modification. This made SPASQL implementation more fun in a variety of ways.

Yacc Collisions

MySQL was not designed to use multiple parsers. Fortunately, the SPARQL grammar used came from the yacker [YAC], which produces modules that are meant to be used together. A relatively small amount of work was required to fit them together.

5. Performance

Since the SPASQL interpreter runs in the MySQL process, SPASQL queries incur no additional network cost. SPASQL currently only expresses queries that could also be expressed in SQL (see heterogeneous databases for a contrast). The performance of SPASQL can be evaluated by considering the additional processing required to parse and evaluate a SPASQL query as opposed to an SQL query.

The SPASQL translation during the parsing stage requires variable lookup. The current implementation uses arrays. It would be worth using hash tables on a service answering queries with hundreds of variables.

5.1 Query Size

The size of a query affects, though minutely, the transmission and parsing time. Simple SQL queries tend to be shorter than SPARQL queries because the names of predicates are longer than those of the table fields they denote. For instance, the simple join of Orders on Addresses is more tersely expressed in SQL:

SELECT id, apt FROM Orders, Addresses WHERE Orders.id=3183

than in SPARQL:

SELECT ?d ?apt WHERE { <Orders.id=3183> <Orders.shippingAddress> ?d }

Queries with joins, however, are more likely to have coincident attribute names and require explicit differentiation. For example, the primary key of both Addresses and Orders is called id. Qualifying all the field names with table names (only the id fields in this query actually require full qualification) yields an SQL query:

SELECT Addresses.id, Addresses.apt FROM Orders. Addresses WHERE Orders.shippingAddress = Addresses.id AND Addresses.id=3183

that is longer than the SPARQL equivalent:

SELECT ?d ?apt WHERE { <Orders.id=3183> <Orders.shippingAddress> ?d . ?d <Addresses.apt> ?apt }

The more explicit form:

SELECT Addresses_0.id AS d, Addresses_0.apt AS apt FROM Orders AS Orders_0 INNER JOIN Addresses AS Addresses_0 ON Orders_0.shippingAddress = Addresses_0.id WHERE Addresses_0.id=3183

is more verbose than the SPARQL version, but is not common for simple queries.

6. Conformance

The SPARQL Query specification defines the query language and a notional set of results. The SPARQL Protocol [SPROT] is specified as a WSDL interface. It offers only one interface: query. The input requires one input query string and zero or more default and named graph URIs:

Abstractly, the contents of the In Message of SparqlQuery's query operation is an instance of an XML Schema complex type, called st:query-request in Excerpt 1.0, composed of two further parts: one SPARQL query string; and zero or one RDF dataset descriptions. The SPARQL query string, identified by one query type, is defined by [SPARQL] as "a sequence of characters in the language defined by the [SPARQL] grammar, starting with the Query production".

A SPASQL query can be seen as a SPARQL query with only a query string. SPASQL does not support CONSTRUCT or DESCRIBE at this time.

The response message is defined by:

Abstractly, the contents of the Out Message of SparqlQuery's query operation is an instance of an XML Schema complex type, called query-result in Excerpt 1.2, composed of either:

  1. a SPARQL Results Document [SRD] (for SPARQL Query for RDF query forms SELECT and ASK); or,
  2. an RDF graph [RDF-Concepts] serialized, for example, in the RDF/XML syntax [RDF-Syntax], or an equivalent RDF graph serialization, for SPARQL Query for RDF query forms DESCRIBE and CONSTRUCT).

The SPARQL XML Results Format is the specified way to move return SPARQL query results, SPASQL uses the existing MySQL protocol. This can be viewed as equivalent, but is not allowed by the current specifications.

7. Future work

Of course, this is just the beginning. Data modeling is a succession of measures with diminishing returns. However, there are some obvious features to add to SPASQL:

7.1 Heterogeneous Databases

RDF data is much less brittle than relational databases; also, much less efficient. The data currently addressed by queries is necessarily as brittle and efficient as conventional RDBs. A nice compromise between the two is to keep the uniformly structured data in conventional relational stores, but store data that does not fit in that schema in an associated RDF triple store, nd generate queries that consult the triple store when the data is known to be outside the conventional schema for that database.

7.2 Multi-field Keys

The MySQL InnoDB database has foreign key metadata. This implementation does not use InnoDB's metadata, but future versions could use this metadata to allow multiple fields in keys.

7.3 Disjunction

The differences between SPARQL UNION and SQL UNION require additional work. The disjunction mapping notes page [UNION] should have the latest work on this mapping.

8. References

[FED]
RDF Access to Relational Databases, E. Prud'hommeaux, http://www.w3.org/2003/01/21-RDF-RDB-access/ .
[NOTES]
Notes on Adding SPARQL to MySQL, E. Prud'hommeaux, http://www.w3.org/2005/05/22-SPARQL-MySQL/ .
[RDF-Syntax]
RDF/XML Syntax Specification (Revised) , D. Beckett, Editor, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/ . Latest version available at http://www.w3.org/TR/rdf-syntax-grammar .
[RDF-Concepts]
Resource Description Framework (RDF): Concepts and Abstract Syntax, Klyne G., Carroll J. (Editors), W3C Recommendation, 10 February 2004. This version is http://www.w3.org/TR/2004/REC-rdf-primer-20040210/. The latest version is http://www.w3.org/TR/rdf-concepts/.
[SRD]
SPARQL Query Results XML Format , D. Beckett, Editor, W3C Working Draft (work in progress), 27 May 2005, http://www.w3.org/TR/2005/WD-rdf-sparql-XMLres-20050527/ . Latest version available at http://www.w3.org/TR/rdf-sparql-XMLres/ .
[SPARQL]
SPARQL Query Language for RDF, E. Prud'hommeaux, A. Seaborne, Editors, W3C Candidate Recommendation, 6 April 2006. This document is http://www.w3.org/TR/2006/CR-rdf-sparql-query-20060406/ . The latest version is available at http://www.w3.org/TR/rdf-sparql-query/ .
[SPROT]
SPARQL Protocol for RDF, K. Clark, Editor, W3C Working Draft (work in progress), 27 May 2005, http://www.w3.org/TR/2005/WD-rdf-sparql-protocol-20050527/ . Latest version available at http://www.w3.org/TR/rdf-sparql-protocol/ .
[UNION]
RDF to SQL Query Conversion UNION Notes, E. Prud'hommeaux, http://www.w3.org/2003/11/01-RDF-SQL-tmp/notes/ .
[YAC]
Yacker User Guide, E. Prud'hommeaux, http://www.w3.org/1999/02/26-modules/User/Yacker .

9. Acknowledgment

I would like to express special thanks to Uldis Bojars, who helped me geek through the details of the mapping to SQL's LEFT OUTER JOIN. Also, thanks to Dave Beckett, Jeen Broekstra, Kendall Clark and Andy Seaborne for all their work on SPARQL.


Eric Prud'hommeaux, W3C Team Contact for the RDF Data Access Working Group <eric@w3.org>
$Date: 2007/06/19 22:12:52 $