W3C

Matching Graph Patterns Against Stem Graphs

Status of this Document

This is a collection of implementation notes by ericP.

Abstract

RDF queries and rules match data that corresponds to a given graph pattern. Data from many non-RDF sources, e.g. relational databases, may be trivially transformed to a stem graph; an RDF graph which reflects native graph model. Stem graphs represent the trivial expression in RDF of non-RDF data formats e.g. relational, spreadsheets, textual. Stem data is brittle in the face of implementation changes, and requires consumers to learn the stem schema. An interface graph, a more Semantic-Web-friendly form of the data, can be expressed by applying deduction rules to re-use popular RDF predicates and graph structures. Consumers of the data will want to express queries and rules in terms of the interface graph.

Any system that requires materializing data transformation involves synchronicity problems. The deduction rules used to map the stem graph to the interface graph can be used to transform queries that match the interface graph to instead match the stem graph, which can then be trivially expressed as queries on the native data format, e.g. SQL or a relational execution plan. This document describes this process, deriving the transformations that are needed to match in the stem graph, the same data that would be matched in a materialized interface graph. The process is intended to be sound and complete with respect to the SPARQL logical connectives and sound with respect to datatype support.

Introduction

Both RDF and Relational Databases are cyclic graph structures (e.g. an Employee table in a database may have a manager field which identifiers another Employee). While RDF is based on triples, relational data is based on n-ary tuples, where each position in the tuple has a prescribed name and data type. The expression of these tuples as triples is intuitive and practical, enabling simple tools to make relational data available to RDF tools. Earlier work ([SPASQL], [D2R], [SquirrelRDF]) has shown how RDF queries can be executed on relational databases, providing unambiguous querying on databases, despite other databases having coincidently named, though not necessarily similar, attributes. The default minimal installation for many of these systems required only a stem URI to serve as a base name for the relations, attributes and tuple identifiers.

[D2R], [SquirrelRDF] and [Virtuoso] provide languages for mapping queries over specific relations and attributes to particular classes and predicates. [SPASQL] is using deduction rules to provide a non-materialized view of the data, here called the interface graph, which allows custodians of the data to provide access to their proprietary data with common, shared attribute identifiers. These rules can be applied to materialize an interface graph, but that raises the usual concurrency issues associated with any data warehouse. These issues are avoided if the rules are used to transform the interface queries to work on the original data source, enabling a query expressed in terms of common, public taxonomies to perform with the same efficiency as a proprietary SQL query.

[SPASQL] is a set of libraries which can transform and express [SPARQL] queries. Its use has been prototyed in MySQL, where it functions as user-loadable parser which parses SPARQL queries and produces a MySQL-native execution plan which then executes like any execution plan produced by the SQL parser. [SPASQL] can also express queries as SQL (strings), though that is not discussed here.

Rule Expressivity

The deduction rules used in this paper are equivalent to SPARQL CONSTRUCTs extended with named graphs in the constructed pattern, e.g.

CONSTRUCT {
  <graph1> { ?s <foo> ?o }
  <graph2> { ?s <bar> ?p }
} WHERE {
  ?s <baz> ?o 
}

While many use cases can be met with the conventional SPARQL CONSTRUCT grammar, the full expressivity of the transformations in this paper can be parsed by changing the ConstructTemplate production to

[30]   ConstructTemplate   ::=   '{' TriplesBlock? ( ( 'GRAPH' VarOrIRIref GraphTemplate ) '.'? TriplesBlock? )* '}'

Other languages can be used to express these deduction rules, but this may limit the expressivity. For instance, N3 doesn't have anything corresponding to a SPARQL OPTIONALs, though some optionals can be expressed as multiple rules. Similarly, N3 rules don't have SPARQL's syntactically sequestered constraints, though it has a wide array of built-in logical predicates, which can be factored out and re-expressed as algebraic expressions if the query is transformed into e.g. SPARQL, SQL. Other candidate rule languages include RIF Basic Logic Dialect, SWRL, RuleML, and OWL.

Single Rule on Query

Consider a deduction rule transforming a stem graph to an interface graph:

  stem            interface  
?B pX ?A           ?A p1 ?B  
?C pY ?B =>        ?B p2 ?C  
?C pZ []    <x>: { ?C p3 ?A }
?A pW ?D           ?B p4 ?D  
  ...                ...     
    

and an interface query that, when expressed in disjunctive normal form, has a component:

{ ?x p1 ?y } ∧ { ?y p2 ?z } ∧ ?g: { ?z p3 ?x } ∨ …

Given a sample query graph, we can compute the constructed interface graph:

B1 pX A1           A1 p1 B1  
C1 pY B1 =>        B1 p2 C1  
C1 pZ []    <x>: { C1 p3 A1 }
B1 pW D1           A1 p4 D1  

If we reverse the rule, we can re-write the query instead of the data. The interface query may match any subset of the interface graph, so the antecedent must make each arc optional (caveats discussed in a moment):

          interface        stem      
       OPT ?A p1 ?B      ?B pX ?A    
       OPT ?B p2 ?C   => ?C pY ?B
<x>: { OPT ?C p3 ?A }    ?C pZ []    
       OPT ?B p4 ?D      ?A pW ?D    
             ...           ...       
    

Matching the interface pattern against the pattern "asserted" by the query yields the following result set:

?A?B?C?D
?x?y?zunbound

Thus, extending SPARQL to allow variables in the input graph, we could use the CONSTRUCT semantics to produce a corresponding stem graph pattern:

{ ?y pX ?x } ∧ { ?z pY ?y } ∧ ?g: { ?z pZ [] }

However, this query throws away a potentially critical constraint in the antecedent of the mapping rule: ?A pw ?D. This may or may not represent a real integrity constraint. If it does not, the antecedent should relfect this by enclosing it in an OPTIONAL. Assuming now that it does, the query needs to include the constaint ?x pW ?gen1. Because we don't need the value of the variable ?D, we use a generated variable. If this variable ?D appears elsewhere in the pattern, the same variable (?gen1) is substituted. (Blank nodes could be used, except the SPARQL semantics stipulates that "labels for blank nodes are scoped to the basic graph pattern".) The final stem query:

{ ?y pX ?x } ∧ { ?z pY ?y } ∧ ?g: { ?z pZ [] } ∧ { ?x pW ?gen1 }

is generated by filling out all NULLs in the result set with generated symbols and performing a CONSTRUCT with the antecedent of the mapping rule. Just as apparently uninteresting triple and graph patterns (those that do not bind or restrict variables used in the portion of the interface rule tha corresponds to the interface query) represent potential integrity constraints, so to do any FILTERs applying to the antecedent of the mapping rule. ORDER BY is meaningless as the interface graph has no order. A real treatment of LIMIT and OFFSET in the rule antecedent is left for futre work.

Filters, limits and order in the interface query may be passed through to the stem query.

Optionals in the Interface Query

If our example deduction rule handles data that is potentially incomplete, we may add optional patterns to accomodate the potentially missing data.

Mapping Rule with Optionals:

        stem              interface  
      ?B pX ?A             ?A p1 ?B  
OPT { ?C pY ?B   =>        ?B p2 ?C  
      ?C pZ [] }    <x>: { ?C p3 ?A }
      ?A pW ?D             ?B p4 ?D  
        ...                  ...     
    

The SPARQL CONSTRUCT rules for unbound specify that construct triples patterns referencing unbound variables are silently ignored. Thus, the triple patterns ?B p2 ?C and <x>: { ?C p3 ?A } are consequent to the optional stem pattern which binds ?C. If we add that optional to the (yet empty) list of dependencies for ?C, we can add optionals to the stem query for each of the variables that is optionally bound in the interface graph. For instance, the following modification of our earlier interface query binds ?z optionally:

SELECT ?x ?y ?z ?g
 WHERE {
   ?x p1 ?y .
   OPTIONAL { ?y p2 ?z .
              GRAPH ?g: { ?z p3 ?x }
   }
 }

The corresponding graph pattern that matches the stem graph is:

SELECT ?x ?y ?z ?g
 WHERE {
   ?y pX ?x .
   OPTIONAL { ?z pY ?y .
              ?z pZ [] .
   }
   ?x pW ?gen1
 }

We can verify this by creating data that does and does not match the optional clause in the stem pattern, and examining which stem data is retrieved by this stem query:

sample stem and query graph:

B1 pX A1           A1 p1 B1  
C1 pY B1 =>        B1 p2 C1  
C1 pZ []    <x>: { C1 p3 A1 }
B1 pW D1           A1 p4 D1  

B2 pX A2           A2 p1 B2  
C2 pY B2          # no triple
B2 pW D2           A2 p4 D2  
    

Both the interface query, applied to the interface graph, and the stem query, applied to the stem graph, arrive at the same result set:

?x?y?z
A1B1C1
B2B2unbound

Unsatisfiable Optionality

In an interface query, an optional pattern involving only variables which were elsewhere non-optional would be a meaningless optional as there would be no graph constructed from the stem=>interface rule which would produce the non-optional pattern but not the optional pattern. For instance, using the same example mapping rule, the subject of the p3 arc is not actually optional in the following interface query:

SELECT ?x ?y ?z ?g
 WHERE {
   ?x p1 ?y .
   ?y p2 ?z .
   OPTIONAL { GRAPH ?g: { ?z p3 ?x } }
 }

Translating that OPTIONAL to the stem query:

SELECT ?x ?y ?z ?g
 WHERE {
   ?y pX ?x .
   ?z pY ?y .
   OPTIONAL { ?z pZ [] }
 }

would be an error as it would result in a binding of ?z=>C2 in the above sample stem graph, while C2 would never appear in the interface graph.

?x?y?z
A1B1C1
B2B2C2/unbound

Dependent Optionals

An optional graph pattern in the mapping rule antecedent may introduce variables that are used as constraints in later graph patterns. In the following stem pattern, C is introduced in an optional pattern.

        stem    
      ?B pX ?A  
OPT { ?C pY ?B } ← may bind C.
      ?C pZ []   ← more restrictive if C is bound.
      ?A pW ?D  
        ...     
    

A binding of C in the optional pattern constrains the subsequent match of ?C pZ [].

B1 pX A1
C1 pY B1 ← matches optional pattern and binds ?C to C1.
C2 pZ [] ← no match for C1 pZ [] so the potential solution is eliminated.
B1 pW D1
    

Including any graph pattern in the stem query requires the prior inclusion of all optional patterns which introduce variables used in that pattern. This can be handled by generating a mapping from variable to the graph pattern that introduces that variable. (This dependency list is also needed by an optimizer to ensure that optionals are not reordered in a way that changes the query solution.)

Treatment of Optionals

Any OPTIONAL graph patterns in the antecedent of the mapping rule are included in a stem query if some triple patterns which are introduced in and consequent to that OPTIONAL graph pattern match triple patterns in the interface query. Further, when expressing that graph pattern in stem query, the pattern is not optional unless every corresponding matched triple pattern in the interface query is OPTIONAL. Any optionals which introduce a variable used in an included portion graph pattern are also included. All other OPTIONALs may be safely omitted from the stem query.

Multiple Interface Graph Matches to Interface Query

The interface graph can matche the interface query in mutiple ways. For example, the interface triples with p2 as a predicate in:

  stem      interface 
?B pX ?A     ?A p1 ?B 
?C pY ?B =>  ?B p2 ?C 
?C pZ []     ?C p2 ?A 
?A pW ?D     ?B p4 ?D 
  ...          ...    
    

match this pattern:

{ ?y p2 ?z } ∨ …

with two distinct sets of bindings:

?A?B?C?D
unbound?y?zunbound
?zunbound?yunbound

This corresponds to a situation where a single match of the antecedent of the mapping rule produces an interface graph which is matched two ways by the interface query. The stem query that fulfills this query completely is a disjunction of all of the ways in which the antecedent could produce the appropriate interface query. It is also possible that some or all of these antecedent patterns are optional by the rules described above in Optionals in the Interface Query.

Query Mapping Algorithm

Given a stem graph SG, a set of mapping rules MRs, IG is an interface graph that is produced by executing MRs on SG and adding the product to an empty graph. An interface query QI, performed on IG produces a result set RSi. The following steps give results in a stem query QS that, when executed on QI, also produce RSi:

Stem Query Mapping

While there are many applications where a specific stem graph can be mapped to a non-RDF data model or query language, the most obvious is probaby SQL. Specific stem queries can be simply expressed as SQL or an execution plan in a relational databse. There are many possible mapping from relational data to RDF; the following one requires as input, a stem URI, a database, and some miscellaneous punctuation for separating components in constructed URIs:

Sample Stem Graph Mapping

The stem graph is composed of the set of tuple maps for each tuple in the database. A tuple map assigns an identifier for a tuple and makes RDF triple assertions expressing each attribute of that tuple. For any attribute in the tuple, the subject, predicate and object of each triple are:

For example, the following table can be converteed by supplying a stem URL http://hr.example/DB/:

Employee
id lastName manager
18 Johnson NULL
253 Smith 18
@prefix empl: <http://hr.example/DB/Employee/> .
empl:id.18    empl:lastName   "Johnson" .       
empl:id.253   empl:lastName   "Smith" .         
empl:id.253   empl:manager    empl:id.18 .      

Literal Maps

The above relational strings (or varchars, or however the last names might be encoded) are expressed as RDF literals. For Common relational datatypes have corresponding W3C XML Schema (XSD) datatypes, which are available for a literal map, e.g. a mapping from a relational date to an xsd:date:

Employee
id omitted columns birthday
18 ... 1969-11-08
253 ... 1979-01-18
@prefix empl: <http://hr.example/DB/Employee/> .       
empl:id.18    empl:birthday   "1969-11-08"^^xsd::date .
empl:id.253   empl:birthday   "1979-01-18"^^xsd::date .

The literal mapping used in examples in this paper is:

relational RDF
type SQL type SPARQL
varchar "text" literal "text"
integer 123 xsd::integer 123
float 1.23 xsd::float 1.23
date "2008-08-26" xsd::date "2008-08-26"

The core SPARQL language demands only a subset of these XSD datatypes, however, support for additional datatypes is well-defined, and trivial to implement by pushing down into the relational database. For example, a SPARQL constraint manager.birthday="1969-11-08"^^xsd::date is easy to convert to an SQL constraint manager.birthday="1969-11-08".

Representations Friendly to Linked Data

The above rule mapping has a few simplifications which make it difficult to use in the web of Linked Data [LD] (due to practical considerations when using HTTP). Examining the above stem graph from the perspective of a web browser, we see that:

It is both convenient and scalable if the schema for a given data set is available in one comprehensive page, or a set of pages that describe similar properties. Thus, it is preferable if lastName and manager are identified as http://hr.example/DB/Employee#lastName and http://hr.example/DB/Employee#manager respectively. A schema page should be served at http://hr.example/DB/Employee, describing both properties.

Because RDF has no prescribed way to use the same identifier in the distinct roles of a web page and the thing identified by that web page, conventions have been established to produce identifiers that do not identify web pages, but, when placed in a web browser, will resolve to an informative web page. Extensive deliberation has resulted in the recommendation to use either a hash fragment in an RDF document (e.g. http://hr.example/DB/Employee/id.18#record) or any identifier (the one discussed, http://hr.example/DB/Employee/id.18, would work) that is served with an HTTP 303 redirect. If we select the first approach, we get this graph, which is easily navigated by conventional Linked Data tools:

@prefix emplP: <http://hr.example/DB/Employee#> .
@prefix emplN: <http://hr.example/DB/Employee/> .

emplN:id.18#record    emplP:lastName   "Johnson" .
emplN:id.253#record   emplP:lastName   "Smith" .
emplN:id.253#record   emplP:manager    emplN:id.18#record .

The URIs used in the example were composed from:

subject map:stem'/'table'/'attribute'.'value#fragment
predicate map:stem'/'table'#'attribute

Keys that are composed from more than one attribute may be addressed as well. For example, http://hr.example/DB/Employee/key1.value1.key2.value2#record uses the pattern ['.'attribute'.'value]* to address the attributes in the key.

Stem Query as SQL

Given the mapping from relational data to stem graphs, it is possible to convert stem queries to relational queries. A mapped stem query is an SQL query that, when applied to any possible relational data, produces the same result set as would the stem query performed on the stem graph for that data. The following sections examine the logical components of SPARQL queries and illustrate how they are mapped to SQL.

Basic Graph Patterns

A Basic Graph Pattern is set of triple patterns and filter constraints which have similar expressivity to their SQL counterparts, joins and restrictions. Just as tuple maps produce triples from tuple attributes, the process can be reversed, converting triple patterns to queries on tuples.

A triple pattern is attached to an attribute in an instance (alias) of a table, where the table and attribute are determined by parsing the predicate name , and the alias name is a function of the pair of the subject and the table. Variable predicates are not treated in this document, though an explicit enumeration of table attributes would probably suffice. Any two triple patterns with the same subject and predicates that differ only in attribute name are traversing properties of the same tuple. Triple patterns on the same table, but with different subjects, are addressing potentially different tuples in the same table, and thus require an extra join (and alias name) against that table. In this SPARQL query to select the names of employees and their managers,

PREFIX emplP: <http://hr.example/DB/Employee#>

SELECT ?empName ?managName
 WHERE { ?emp      emplP:lastName   ?empName .
         ?emp      emplP:manager    ?manager .
         ?manager  emplP:lastName   ?managName }

the third triple pattern is on the same table, but is intended to match managers, rather than their employees. This can be expressed using different table aliases for each tuple map with the same table and different subjet. For readability in this document, these table names are derived from the variable name in the subject of the triple pattern:

SELECT emp.lastName, manager.lastName
  FROM Employee as emp
       INNER JOIN Employee as manager ON emp.manager=manager.id
 WHERE emp.lastName IS NOT NULL AND manager.lastName IS NOT NULL

The SPARQL graph pattern had a coreference on ?manager; it was used as the object of the second triple pattern, and the subject of the third. This coreference is enforced in the join constraint emp.manager=manager.id.

The IS NOT NULL constraints in the WHERE clause account for the differences in expression of missing attributes in relational data and RDF. In relational data, missing attributes are given a value of NULL, while in RDF, the triple including them is not expressed. In SQL, NULL is not equal to anything else, including NULL, so variables and constants used in constraints have the same behavoir in SQL and SPARQL. non-NULL rule: any variables that are mentioned only once in non-optional parts of the query must be restricted to non-NULL values.

Parsing Tuple Identifiers

Queries may include node identifiers, such as this query to list the names of the employees who work for Employee 18:

PREFIX emplP: <http://hr.example/DB/Employee#>
PREFIX emplN: <http://hr.example/DB/Employee/>

SELECT ?empName
 WHERE { ?emp      emplP:lastName   ?empName .
         ?emp      emplP:manager    emplN:id.18 }

The includesion of a foreign key (manager) introduces a join, and the value of that foreign key (18) produces a constraint manager=18. The query would be inconsistent if http://hr.example/DB/Employee#manager and http://hr.example/DB/Employee/id.18 did not both refer to the same table.

SELECT emp.lastName
  FROM Employee as emp
 WHERE emp.manager=18 AND emp.lastName IS NOT NULL

If emplN:id.18 were replaced with a variable, say ?manager, the SQL would need a join Employee AS manager only if ?manager were selected or if some property of manager were referenced, e.g. ?manager emplP:lastName ?manName.

Parsing Literal and Numeric Constraints

Literal and numeric value constraints in a graph pattern are represented as SQL constraints, expressed either in join constraints, on in an SQL WHERE expression. For example, a constraint on the manager's last name:

PREFIX emplP: <http://hr.example/DB/Employee#>

SELECT ?empName
 WHERE { ?emp      emplP:lastName   ?empName .
         ?emp      emplP:manager    ?manager .
         ?manager  emplP:lastName   "Johnson" }

may be used to restriction of the candidate tuples for manager:

SELECT emp.lastName
  FROM Employee as emp
       INNER JOIN Employee as manager ON emp.manager=manager.id
                                       AND manager.lastName="Johnson"
WHERE emp.lastName IS NOT NULL

or as a restriction on the final product of the join:

SELECT emp.lastName
  FROM Employee as emp
       INNER JOIN Employee as manager ON emp.manager=manager.id
 WHERE manager.lastName="Johnson" AND emp.lastName IS NOT NULL

Filters on Basic Graph Patterns

SQL filters and SPARQL filters have similar arithmetic and boolean evaluation, e.g. manager.birthday > "1969-01-01"^^xsd::date is expressed in SQL as manager.birthday > "1969-01-01". This document does not discuss the edges of datatype compatibility (e.g. maximum integer or float value) as this would require a survey of the SQL implementations.

Complex filter expressions, involving conjunction and disjunction, can be expressed in conjunctive normal form, and each conjunct can be moved into the earliest join that binds the variables in that conjunct. To demonstrate this, our example tables are extended to include many to many relations for employees and managers:

Example Tables:

Employee
id lastName birthday
18 Johnson 1969-11-08
253 Smith 1979-01-18
255 Jones 1981-03-24
19 Xu 1966-11-08
254 Ishita 1971-10-31
Manage
manager manages
18 253
253 255
19 255
253 254

A query for third-line employees include multiple joins. A constraint on the second-line manager can constrain the join of that table. For example, in this query with many joins:

PREFIX emplP: <http://hr.example/DB/Employee#>

SELECT ?empName ?grandManagName
 WHERE { ?emp          emplP:lastName   ?empName .
         ?emp          emplP:birthday   ?empBday .
         ?lower        managP:manages   ?emp .
         ?lower        managP:manager   ?manager .
         ?manager      emplP:birthday   ?manBday .
         ?upper        managP:manages   ?manager .
         ?upper        managP:manager   ?grandManager .
         ?grandManager emplP:birthday   ?grandManBday .
         ?grandManager emplP:lastName   ?grandManagName
         FILTER (?manBday < ?empBday && ?grandManBday < ?manBday) }

the filter constraint conjuncts (?manBday < ?empBday and ?grandManBday < ?manBday), can be sorted into the join constraints:

SELECT emp.lastName, grandManager.lastName
  FROM Employee as emp
       INNER JOIN Manage AS lower ON lower.manages=emp.id
       INNER JOIN Employee AS manager ON manager.id=lower.manager
                                         AND manager.birthday < emp.birthday
       INNER JOIN Manage AS upper ON upper.manages=manager.id
       INNER JOIN Employee AS grandManager ON grandManager.id=upper.manager
                                         AND grandManager.birthday < manager.birthday
 WHERE emp.lastName IS NOT NULL AND grandManager.lastName IS NOT NULL

Optionals and Disjunctions can produce SQL subqueries. Filters on these subqueries should be treated as applying to the ON constraint when the subquery in joined to the rest of the query. Constants and variables introduced in an OPTIONAL or UNION may be pushed down into the subquery when generating SQL strings. When generating execution plans which exceed SQL expressivity, it may be possible to push into the subquery constraints against variables bound outside of the subquery.

Disjunctions

In SQL, UNIONS are represented as subqueries. For example, a query for employees above and below employee 253:

SELECT ?name
 WHERE { ?who emplP:lastName "Smith"
         { ?above   manageP:manages ?who .
           ?above   manageP:manager ?manager .
           ?manager emplP:lastName  ?name }
         UNION
         { ?below   manageP:manager ?who .
           ?below   manageP:manages ?managed .
           ?managed emplP:lastName  ?name } }

The first constraint is handled as it is in basic graph patters; the Employee table is aliased to who, the variable ?who is attached to the attribute who.id, and who.lastName="Smith" is added to the constraints

The SQL UNION representing the SPARQL UNION has two disjoints, each expressed as a subselect. The joins and constraints are processed normally, yielding these two join patterns:

      FROM Manage AS above
      INNER JOIN Employee as manager ON above.manager=manager.id
WHERE manager.lastName IS NOT NULL
      FROM Manage AS below
      INNER JOIN Employee as managed ON below.manages=managed.id
WHERE managed.lastName IS NOT NULL

Both sides of the SPARQL UNION reference the same two variables which are used ouside of the union: ?who, which is coreferenced elsewhere in the graph pattern, and ?name, which is a result of the query. Therefor, these subqueries must select the attributes to which the variables are attached, i.e. SELECT manager.lastName AS name, above.manages AS who for the first and SELECT managed.lastName AS name, below.manager AS who for the second.

Finally, the union must be uniquely named, AS union1, and the coreference on ?who must be enforced by constraining the attribute to which it is attached in the subquery against the the attribute to which it was attached in the initial SPARQL constraint, i.e. ON union1.who=who.id.

This is a mapped stem query for the SPARQL query above.

SELECT union1.name
  FROM Employee AS who
       INNER JOIN (
         SELECT manager.lastName AS name, above.manages AS who
                FROM Manage AS above
                INNER JOIN Employee as manager ON above.manager=manager.id
          WHERE manager.lastName IS NOT NULL
       UNION
         SELECT managed.lastName AS name, below.manager AS who
                FROM Manage AS below
                INNER JOIN Employee as managed ON below.manages=managed.id
          WHERE managed.lastName IS NOT NULL
       ) AS union1 ON union1.who=who.id
 WHERE who.lastName="Smith"

Violating SQL Scoping

As mentioned in Filters on Basic Graph Patterns, when compiling an execution plan directly, it might be possible to violate the SQL scoping rules and push the constraints inside the subqueries. In the mapped stem query, the constraint imposed by the coreference to ?who is enforced on the union two subselects, union1.who=who.id. Some relational databases have execution plans that exceed SQL expressivity, specifically, they allow constraints in the subselects to reference attributes in aliases joined outside of the subselect. While there is no SQL expression for this, it can be thought of as looking like this:

SELECT union1.name
  FROM Employee AS who
       INNER JOIN (
         SELECT manager.lastName AS name
                FROM Manage AS above
                INNER JOIN Employee as manager ON above.manager=manager.id
          WHERE above.manages=who.id AND manager.lastName IS NOT NULL
       UNION
         SELECT managed.lastName AS name
                FROM Manage AS below
                INNER JOIN Employee as managed ON below.manages=managed.id
          WHERE below.manager=who.id AND managed.lastName IS NOT NULL
       ) AS union1 ON TRUE
 WHERE who.name="Smith"

Care should be taken when violating SQL scoping rules as this may exploit unexplored code paths as well as violate assumptions made while optimizing. Pushing constraints into subselects can be thought of as an optimization, and may be inconsequential on some systems where the optimizer performs steps like this automatically.

General Mapping for Unions

The first query in Disjunctions was a convenient query in that the constraints imposed on the subselects by coreferences in the patterns were identical in all disjoints. Specifically, both disjoints include a constraint imposed by a coreference on ?who. This query finds all of Smith's immediate managers, and those immediate employees that also share Smith's birthady (perhaps to support some favoritism on the part of Smith). The structure of this query provides more rigorous requirements for a mapping solution from SPARQL UNIONs to SQL UNIONs:

SELECT ?lastName
 WHERE { { ?above   manageP:manages ?who .
           ?above   manageP:manager ?manager .
           ?manager emplP:lastName  ?name }
         UNION
         { ?below   manageP:manager ?who .
           ?below   manageP:manages ?managed .
           ?managed emplP:lastName  ?name .
           ?managed emplP:birthday  ?bday } 
         ?who emplP:name "Smith" .
         ?who emplP:birthday ?bday }

As in the earlier query, both disjoints have coreferences on ?who. The second disjoint also has a coreference on ?bday. If one can violate the SQL scoping as above, these constraints are managed trivially in the expression of the subselects. However, to serialize this as SQL, the constraints on the SQL UNION must be treated individually for each disjoint. This can be accomplished by adding a disjoint ordinal to the subselect, e.g. 2 AS _DISJOINT_, and expressing the constraints on the SQL UNION in terms of the disjoint ordinal, e.g. _DISJOINT_=2 AND union1.who=who.id AND union1.bday=who.birthday.

SELECT union1.lastName
  FROM ( SELECT 1 AS _DISJOINT_, manager.lastName AS lastName, above.manages AS who, NULL AS bday
                FROM Manage AS above
                INNER JOIN Employee as manager ON above.manager=manager.id
          WHERE manager.lastName IS NOT NULL
       UNION
         SELECT 2 AS _DISJOINT_, managed.lastName AS lastName, below.manager AS who, managed.birthday AS bday
                FROM Manage AS below
                INNER JOIN Employee as managed ON below.manages=managed.id
          WHERE managed.lastName IS NOT NULL
       ) AS union1
       INNER JOIN Employee AS who ON 
                      (_DISJOINT_=1 AND union1.who=who.id) OR
                      (_DISJOINT_=2 AND union1.who=who.id AND union1.bday=who.birthday)
 WHERE who.lastName="Smith"

Optionals

SPARQL optional graph patterns segregate a portion of the match which, by failing, will not eliminate rows. This behaves the same way as the SQL left outer join. A common use of SPARQL OPTIONAL is to bind a new variable to the value of a single attribute that may or may not be in the graph:

PREFIX emplP: <http://hr.example/DB/Employee#>

SELECT ?empName ?birthday
 WHERE { ?emp      emplP:lastName   ?empName .
         OPTIONAL { ?emp      emplP:birthday   ?birthday } }

If a SPARQL OPTIONAL introduces no variables that bind to a foreign key, and no introduces no other new join aliases, the query can be represented in SQL by simply ommiting the IS NOT NULL constraints and including tests that allow for NULL in subsequent matches:

SELECT emp.lastName, emp.birthday
  FROM Employee as emp
 WHERE emp.lastName IS NOT NULL

When converting this to a result set, any columns that are bound to NULL are simply not included in the bindings.

Equivalence of Variables Bound in Optionals

This query looks for any set of three employees with lexically increasing names (e.g. Ishita, Johnson, Jones, Smith) who all have the same birthday if known (typical in scenarios where talent scouts recruit from hospital nurseries). If the first two employees' birthdays may be unknown, in which case, only the remaining two have to match.

PREFIX emplP: <http://hr.example/DB/Employee#>

SELECT ?emp1Name ?emp2Name ?emp3Name
 WHERE { ?emp1     emplP:lastName   ?emp1Name .
         OPTIONAL { ?emp1     emplP:birthday   ?birthday }
         ?emp2     emplP:lastName   ?emp2Name .
         OPTIONAL { ?emp2     emplP:birthday   ?birthday }
         ?emp3     emplP:lastName   ?emp3Name .
         ?emp3     emplP:birthday   ?birthday .
         ?emp4     emplP:lastName   ?emp4Name .
         ?emp4     emplP:birthday   ?birthday .
         FILTER ( ?emp1Name < ?emp2Name && ?emp2Name < ?emp3Name && ?emp3Name < ?emp4Name) }

The first two utterances of ?birthday are optional and the next two are not. If the initial binding to birthday were not optional, we could attach the variable ?birthday to the Employee attribute emp1.birthday, and use that attribute in subsequent constraints. Treating the sequence of constraints in the query, we can loosely attach ?birthday to emp1.birthday, which means that, as long as it is loosely bound, each constraint on ?birthday must not fail if it is unbound. After it is used in a non-optional pattern, ?birthday is attached to the corresponding table attribute, emp3.birthday and the restriction emp3.birthday IS NOT NULL is a restriction on the solution. (Note that in the next binding, emp4, tests for emp4.birthday=emp3.birthday, so this IS NOT NULL restriction is again dropped.

SELECT emp1.lastName AS emp1Name, emp2.lastName AS emp2Name, emp3.lastName AS emp3Name
  FROM Employee AS emp1
       INNER JOIN Employee AS emp2 ON emp2.lastName > emp1.lastName AND 
          ( emp2.birthday IS NULL OR emp1.birthday IS NULL OR emp2.birthday=emp1.birthday )
       INNER JOIN Employee AS emp3 ON emp3.lastName > emp2.lastName AND 
          (    emp1.birthday IS NULL AND emp2.birthday IS NULL
            OR emp3.birthday=emp1.birthday
            OR emp3.birthday=emp2.birthday )
       INNER JOIN Employee AS emp4 ON emp4.lastName > emp3.lastName AND 
          emp4.birthday=emp3.birthday

Any constrain on a loosely attached variable must include a disjoint allowing the attribute to which that variable is attached to be NULL, e.g. emp2.birthday IS NULL for the join of emp2. Any constraint of a previously loosely attached variable must include a disjoints for

Optionals Introducing Joins

SPARQL optional patterns may introduce multiple joins, which must be treated as a subquery. Following is a query of employees and, if they are a third-line employee, their manager and manager's manager (a "grand manager", in the spirit of "grandparents"):

PREFIX emplP: <http://hr.example/DB/Employee#>
PREFIX mangP: <http://hr.example/DB/Manage#>

SELECT ?empName ?managName ?grandManagName
 WHERE {          ?emp            emplP:lastName   ?empName .
         OPTIONAL { ?mang         mangP:manages    ?emp .
                    ?mang         mangP:manager    ?manager .
                    ?manager      emplP:lastName   ?managName .
                    ?grandMang    mangP:manages    ?manager .
                    ?grandMang    mangP:manager    ?grandManager .
                    ?grandManager emplP:lastName   ?grandManagName } }

This introduces multiple joins within the OPTIONAL (in this case, of the Manage table) These two additional joins can be treated as a subquery, similar to a subquery in a UNION, and following the same rules for enforcing coreferences:

SELECT emp.lastName AS empName, opt1.managName AS managName, opt1.grandManagName AS grandManagName
       FROM Employee AS emp
            LEFT OUTER JOIN (
    SELECT grandManager.lastName AS grandManagName, manager.lastName AS managName, mang.manages AS emp
           FROM Manage AS mang
                INNER JOIN Employee AS manager ON manager.id=mang.manager
                INNER JOIN Manage AS grandMang ON grandMang.manages=mang.manager
                INNER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager
             ) AS opt1 ON opt1.emp=emp.id
 WHERE emp.lastName IS NOT NULL

These Optionals Introducing Joins Results: are produced by the above query on the Example Tables.

empNamemanageNamegrandManageName
JohnsonNULL NULL
Smith NULL NULL
Jones Smith Johnson
Xu NULL NULL
Ishita Smith Johnson

Erroneous Simplification of Joins in an Optional

It is tempting to treat SPARQL OPTIONALs as a series of LEFT OUTER JOINs in the same query

SELECT emp.lastName AS empName, manager.lastName AS managName, grandManager.lastName AS grandManagName
  FROM Employee AS emp
       LEFT OUTER JOIN Manage AS mang ON mang.manages=emp.id
       LEFT OUTER JOIN Employee AS manager ON manager.id=mang.manager
       LEFT OUTER JOIN Manage AS grandMang ON grandMang.manages=mang.manager
       LEFT OUTER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager
 WHERE emp.lastName IS NOT NULL

but that does not properly handle the cardinality in the case where an employee has multiple managers, but only one of them has a manager:

empName grandManageName
Johnson NULL
Smith NULL
Jones Johnson
JonesNULL
Xu NULL
Ishita Johnson

The tuple {Jones, NULL} was erroneously included because it partially satisfaction of an OPTIONAL pattern. OPTIONAL patterns may be represented as a sequence of simple outer joins if:

Note that, in both SPARQL and SQL, OPTIONALs (LEFT OUTER JOINS) introduce an order constraint on the sequence of joins. A variable that is bound in an optional can constraint future non-optional joins.

Leading Optionals

Pushing the ?emp emplP:lastName ?empName constraint to the bottom is a very different, but illustrative query:

PREFIX emplP: <http://hr.example/DB/Employee#>

SELECT ?empName ?grandManagName
 WHERE { OPTIONAL { ?manager      emplP:manager    ?emp .
                    ?manager      emplP:manager    ?grandManager .
                    ?grandManager emplP:lastName   ?grandManagName } 
                  ?emp            emplP:lastName   ?empName }

The SPARQL semantics assert that WHERE { OPTIONAL { X } } is equivalent to WHERE { { } OPTIONAL { X } } and that the result set after processing { } contains one soltion with no bindings. SQL has not syntax for starting off with an optional, but these semantics can be emulated by creating a table with one solution with one binding, the name of which is unique so that it will not affect any joins: SELECT 12817 AS _IGNORE_ATTRIBUTE__) AS _IGNORE_RELATION_ . The selected value is unimportant; 12817 was chosen to be conspicuous in case of implementation error. The alias name must not collide with another alias.

The joins and constraints for the subselect representing the OPTIONAL are expressed normally:

FROM Manage as emp
     INNER JOIN Manage as manager ON manager.manages=emp.manager
     INNER JOIN Employee as grandManager ON grandManager.id=manager.manager

The variable ?emp in the OPTIONAL has corefrences outside of the OPTIONAL. To enforce this, the resulting subselect, AS opt1, must produce an attribute manages. ?emp is initially loosely bound to opt1.manages, and follows the according the rules described in Equivalence of Variables Bound in Optionals, ultimately producing the constraint opt1.manages IS NULL OR opt1.manages=emp.id.

The variable ?grandManagName is loosely attached to opt1.grandManagName and is not subsequently strongly attached. As specified by the non-NULL rule, there is no opt1.lastName IS NOT NULL constraint.

SELECT emp.lastName AS empName, opt1.grandManageName
  FROM (SELECT 12817 AS _IGNORE_ATTRIBUTE__) AS _IGNORE_RELATION_
       LEFT OUTER JOIN (
    SELECT emp.manages AS manages, grandManager.lastName AS grandManageName
      FROM Manage as emp
           INNER JOIN Manage as manager ON manager.manages=emp.manager
           INNER JOIN Employee as grandManager ON grandManager.id=manager.manager
                        ) AS opt1 ON TRUE
       INNER JOIN Employee AS emp ON opt1.manages IS NULL OR opt1.manages=emp.id
 WHERE emp.lastName IS NOT NULL

This is a very different query than the one with with the ?emp emplP:lastName ?empName pattern leading the optional: it has either the number of solutions in the OPTIONAL, or, if the optional fails, one solution with no bindings joined against each entry in Eployee. In fact, we know from the two solutions in the Optionals Introducing Joins Results that both Jones and ishita had grand managers.

empName grandManageName
Jones Johnson
Ishita Johnson

Nested Optional Mapping

Optionals nested inside another graph pattern, e.g. a UNION or another OPTIONAL, are treated as OPTIONALs at the top level.

PREFIX emplP: <http://hr.example/DB/Employee#>
PREFIX mangP: <http://hr.example/DB/Manage#>

SELECT ?empName ?managName ?grandManagName
 WHERE {          ?emp            emplP:lastName   ?empName .
       OPTIONAL { ?mang           mangP:manages    ?emp .
                  ?mang           mangP:manager    ?manager .
                  ?manager        emplP:lastName   ?managName .
         OPTIONAL { ?grandMang    mangP:manages    ?manager .
                    ?grandMang    mangP:manager    ?grandManager .
                    ?grandManager emplP:lastName   ?grandManagName } } }
    

These can be respresented as nested OUTER subselects.

SELECT emp.lastName AS empName, opt1.managName AS managName, opt1.grandManagName AS grandManagName
       FROM Employee AS emp
            LEFT OUTER JOIN (
    SELECT opt1.grandManagName AS grandManagName, manager.lastName AS managName, mang.manages AS emp, mang.manager AS manager
           FROM Manage AS mang
                INNER JOIN Employee AS manager ON manager.id=mang.manager
                LEFT OUTER JOIN (
        SELECT grandManager.lastName AS grandManagName, grandMang.manages AS manager
               FROM Manage AS grandMang
                    INNER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager
                 ) AS opt1 ON opt1.manager=mang.manager
             ) AS opt1 ON opt1.emp=emp.id
 WHERE emp.lastName IS NOT NULL

Note that, while similar to the Optionals Introducing Joins query, the fact that the grandManager is optional within the optional manager clause, permits a solution of an employee who has a manager but no grand manager.

empNamemanageNamegrandManageName
JohnsonNULL NULL
Smith Johnson NULL
Jones Smith Johnson
Jones Xu NULL
Xu NULL NULL
Ishita Smith Johnson

A system could safely reduce this query to the simple set of left outer joins described in Erroneous Simplification of Joins in an Optional by examining the possible cardinalities for each of the outer joins:

Conjunction

All graph patterns in a SPARQL query that are not separated by {}s scoping an OPTIONAL or UNION graph pattern are conjunctions. These may be treated as if all their triple patterns were in the same basic graph pattern. For example, these three patterns:

{ { ?emp      emplP:lastName   ?empName .
    ?emp      emplP:manager    ?manager }
    ?manager  emplP:lastName   ?managName
  FILTER ?managName = "Johnson"}
{ { ?emp      emplP:lastName   ?empName .
    ?emp      emplP:manager    ?manager }
  { ?manager  emplP:lastName   ?managName }
  FILTER ?managName = "Johnson"}
{ { ?emp      emplP:lastName   ?empName .
    ?emp      emplP:manager    ?manager }
  FILTER ?managName = "Johnson"
  { ?manager  emplP:lastName   ?managName } }

are all equivalent to

{ ?emp      emplP:lastName   ?empName .
  ?emp      emplP:manager    ?manager .
  ?manager  emplP:lastName   ?managName
  FILTER ?managName = "Johnson" }

SQL Mapping Algorithm

Given a set of relationals DB and a stem query QS:



    

Optional Graph Warning

Following the SPARQL rule on CONSTRUCTs with unbound variables (¶2), variables not bound in the stem graph would produce no triples in the interface graph. Thus, triples using variables that are bound in OPTIONALs may be annotated as at risk:

        stem              interface  
      ?B pX ?A             ?A p1 ?B  
OPT { ?C pY ?B   =>        ?B p2 ?C   <- At risk because the dependent bindings
      ?C pZ [] }    <x>: { ?C p3 ?A } <- might not come from the optional stem.
      ?A pW ?D             ?B p4 ?D  
        ...                  ...     
    

We can see that the query is at risk for missing many potential solutions as the constraint ?g: { ?z p3 ?x } is not going to eliminate some solutions.

Acknowledgements

This work was all done with funding from Eli Lilly and Lincoln Labs. I would like to thank Susie Stephens (Eli Lilly) and Daniel Van Hook (Lincoln Labs) for their commitment to making the Semantic Web practical and easy to use. I would also like to thank C. M. Sperberg-McQueen for his rigorous examples and bug discoveries.

CVS Log

$Log: Overview.html,v $
Revision 1.44  2009/09/01 18:02:07  eric
~ moved PatientDataExample out

Revision 1.43  2009/01/06 05:43:25  eric
~ incorporated feedback mid:1225706364.12028.64.camel@tweetie08

Revision 1.42  2009/01/05 20:53:49  eric
+ Acknowledgements

Revision 1.41  2008/09/28 17:32:51  eric
+ tinkering on Nested Optional Mapping

Revision 1.40  2008/09/26 18:29:38  eric
+ Nested Optional Mapping

Revision 1.39  2008/09/23 22:28:19  eric
~ fixed SPARQL query for optJoin1

Revision 1.38  2008/09/08 12:48:51  eric
+ ids to generate stem queries from MappingRules

Revision 1.37  2008/08/28 06:58:50  eric
~ improved Leading Optionals

Revision 1.36  2008/08/28 06:20:19  eric
~ fixed layout in Conjunction

Revision 1.35  2008/08/28 05:51:34  eric
~ more rigorous
+ coreference
+ General Mapping for Unions

Revision 1.34  2008/08/27 14:27:30  eric
+ Literal Maps
+ variables attached to attributes
+ Equivalence of Variables Bound in Optionals
+ Leading Optionals

Revision 1.33  2008/08/26 05:07:53  eric
~ mellowed border boxes
~ re-labeled rules in Query Mapping Algorithm
+ new rules in SQL Mapping Algorithm

Revision 1.32  2008/08/25 06:46:06  eric
+ filter sorting example

Revision 1.31  2008/08/25 05:09:22  eric
+ @Disjunctions: analysis of external constraints in subqueries

Revision 1.30  2008/08/25 04:03:29  eric
+ Disjunctions

Revision 1.29  2008/08/25 00:12:27  eric
+ Optionals

Revision 1.28  2008/08/24 11:18:36  eric
+ filters

Revision 1.27  2008/08/24 10:48:08  eric
~ more rigorous treatment of examples

Revision 1.26  2008/08/24 08:02:16  eric
- forward refs in Introduction
+ Stem Query as SQL

Revision 1.25  2008/08/18 06:27:50  eric
~ validate

Revision 1.24  2008/08/18 06:25:59  eric
+ note to handle dependent optionals in Query Mapping Algorithm

Revision 1.23  2008/08/18 06:14:42  eric
~ expanded Treatment of Optionals to include dependent optionals.

Revision 1.22  2008/08/18 03:28:13  eric
+ Dependent Optionals

Revision 1.21  2008/08/14 03:40:44 eric
+ Multiple Interface Graph Matches to Interface Query

Revision 1.20  2008/08/12 11:28:51 eric
+ fabulous rule for when to include OPTIONALs

Revision 1.19  2008/08/07 18:20:32 eric
~ typos
~ s/required/included/

Revision 1.18  2008/08/07 04:13:29 eric
+ more rules

Revision 1.17  2008/08/06 23:56:54 eric
+ more rules

Revision 1.16  2008/08/06 14:48:28 eric
+ start on algorithm

Revision 1.15  2008/08/05 18:42:05 eric
+ author list

Revision 1.14  2008/08/04 18:58:26 eric
+ response to 9CACCBCF-B1CC-4986-976F-9C2711364304@acm.org

Revision 1.13  2008/08/04 15:08:02 eric
+ pW/p3 productions
+ anchors

Revision 1.12  2008/08/04 14:54:01 eric
+ anchors

Revision 1.11  2008/08/04 14:46:55 eric
+ intro

Revision 1.10  2008/08/04 04:20:59 eric
+ introduction

Revision 1.9  2008/08/04 02:42:16 eric
+ stem graph transformation

Revision 1.8  2008/08/03 14:47:05 eric
+ entailment mapping

Revision 1.7  2008/08/03 05:09:51 eric
~ documented rules

Revision 1.6  2008/08/03 00:16:18 eric
+ ActivityHeaderID homonym

Revision 1.5  2008/08/02 17:57:44 eric
~ split out mapping options by expressivity

Revision 1.4  2008/07/30 22:42:02 eric
+ annotation

Revision 1.3  2008/07/30 22:40:50 eric
+ hospital example

Revision 1.2  2008/07/30 17:41:47 eric
+ tables

Revision 1.1  2008/07/24 21:33:19 eric
created


ericP
$Id: Overview.html,v 1.44 2009/09/01 18:02:07 eric Exp $