SPARQL Queries to Validate Shape Expressions (informative)

This document presents an algorithm for expressing Shape Expressions as SPARQL queries. There are two variants of the algorithm: one for validation, and one for finding the remaining triples after validation. The accompanying Shape Expressions Definition provides a formal definition of the Shape Expressions language. In the interest of brevity, the pseudocode uses NULLable tuples rather than monads like Optional() and Some(). Likewise, functions like map and flatmap use the Scala syntax list.map{element -> code}.

Validation Queries

The four states returned from the validityφ⟦ , , , ⟧ function are (𝕡|𝕗|∅|𝜃). These are mapped to a combination of SPARQL state and return codes from the function generating the SPARQL. Recall that a UnaryExpression with the optional flag set makes all nested rules optional. For non-optional rules, the states are only pass (𝕡) and fail (𝕗). These manifest in SPARQL queries which evaluate to either a non-empty or empty result set respectively. Optional rules convey cardinality to nesting AndRules, OrRules and GroupRules which impose FILTERs to test the data's conformance to the schema.

In order to simplify the task of mapping ShEx to SPARQL, we use a set of operations in SPARQL which map to portions of the abstract syntax. Most basic is the enforcement of the constraints to select a set of arcs. In the SPARQL translation presented here, the set of arcs matching a particular ArcRule is reduced to a count. The behaviors dependent on the difference between ∅ and 𝜃 are implemented in nesting subqueries.

The basic counter function finds the number of arcs matching the input.

counter : (shapeAlias, predicate, predicateTest, object, objectTest, semanticAction, minCardinality, maxCardinality) → (sparqlString, counterVar)

It produces a query block like this, where the HAVING clause is elided if there are no cardinality constraints:

counter = nextCounter(shapeAlias)
having = minCardinality ? "HAVING (COUNT(*) >= "+minCardinality
         +(maxCardinality ? " && COUNT(*) <= "+maxCardinality : "")+")" : ""
return ("{ SELECT "+shapeAlias+" (COUNT(*) AS "+counter+") {\
           "+shapeAlias+" "+predicate+" "+object+" .\
           FILTER ("+predicateTest+" && "+objectTest+")\
           "+semanticAction+"\
         } GROUP BY "+shapeAlias+" "+having+"}", counter)

We also need a function that produces variable names unique within the query. (It's nice to supply some context string to make the queries more readable.)

nextCounter : String → String 
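As a concrete illustration, here is a minimal Python sketch of both helpers. The module-level sequence and the exact whitespace are illustrative assumptions; SimpleShExDemo may name and format things differently.

```python
import itertools

# Illustrative per-query sequence for unique variable names.
_seq = itertools.count()

def next_counter(context):
    """Produce a variable name unique within the query, e.g. ?IssueShape_c0."""
    return "%s_c%d" % (context, next(_seq))

def counter(shape_alias, predicate, predicate_test, obj, object_test,
            semantic_action, min_card=None, max_card=None):
    """Emit a subselect counting the arcs matching the given constraints.

    The HAVING clause is elided when there are no cardinality constraints.
    Returns (sparqlString, counterVar).
    """
    counter_var = next_counter(shape_alias)
    having = ""
    if min_card is not None:
        having = "HAVING (COUNT(*) >= %s" % min_card
        if max_card is not None:
            having += " && COUNT(*) <= %s" % max_card
        having += ")"
    sparql = ("{ SELECT %s (COUNT(*) AS %s) {\n"
              "  %s %s %s .\n"
              "  FILTER (%s && %s)\n"
              "  %s\n"
              "} GROUP BY %s %s}") % (shape_alias, counter_var,
                                      shape_alias, predicate, obj,
                                      predicate_test, object_test,
                                      semantic_action,
                                      shape_alias, having)
    return sparql, counter_var
```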

Invocation

The SPARQLvalidation algorithm described below takes a rule, a shape alias, and an optionality flag, and essentially mirrors the validityφ⟦rule,optionality,point,graph⟧ function defined in the semantics, except that the graph (and thereby the point) are not known during query generation. The actual algorithm used in SimpleShExDemo passes many arguments to the query generation function in order to pretty-print the query. Here, we elide those parameters for indentation, reuse of prefixes from the input schema, and predictable names for variables. To begin, we call SPARQLvalidation(start rule, start rule label, false).

SPARQLvalidation(Rule, shapeAlias:String, opt:Bool) → (counter:String, min:Integer, max:Integer, sparql:String)

  • case: Rule a ArcRule
    1. We count the arcs matching the NameClass constraints, eliding the ValueClass constraints and semantic actions. The NameClass and ValueClass produce constraints to validate the predicate and object respectively. As a gesture towards efficiency, the SimpleShExDemo implementation uses constant predicates and objects when possible in order to reduce query costs.

      If within an optional, the selected count and the min and max cardinality will be returned to the calling operation so we don't enforce cardinality in the calls to counter:

      (nameSparql, nameCounter) = counter(shapeAlias, nameClass.predicate(), nameClass.predicateTest(), "?o", "true", "", NULL, NULL)
      { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c5) {
        ?IssueShape ex:reproducedOn ?o . FILTER (true && true)
      } GROUP BY ?IssueShape}

      Otherwise, the min and max are enforced immediately in a HAVING clause at the end of the subselect.

      (nameSparql, nameCounter) = counter(shapeAlias, nameClass.predicate(), nameClass.predicateTest(), "?o", "true", "", minCardinality, maxCardinality)
      { SELECT ?EmployeeShape (COUNT(*) AS ?EmployeeShape_c0) {
        ?EmployeeShape foaf:givenName ?o . FILTER (true && true)
      } GROUP BY ?EmployeeShape HAVING (COUNT(*)>=1)}
    2. Count the arcs matching the NameClass, the ValueClass and any semantic actions. Note that SPARQL does not validate the lexical forms of typed literals. In principle, casting to the goal type would provide some lexical validation, but for full XSD validation, extension functions will be required.

      (valSparql, valCounter) = counter(shapeAlias, nameClass.predicate(), nameClass.predicateTest(), valueClass.object(), valueClass.objectTest(), action, NULL, NULL)
      { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c6) {
        ?IssueShape ex:reproducedOn ?o . FILTER (true && datatype(?o) = xsd:dateTime)
        ?IssueShape ex:reportedOn ?rpt . FILTER (?o > ?rpt) 
      } GROUP BY ?IssueShape}
    3. Make sure the ValueClass constraints and semantic actions didn't eliminate any solutions from those selected by the NameClass:
      nameValEquiv = "FILTER ("+nameCounter+" = "+valCounter+")"
      This test could be elided by using the same variable name for both counts, but this idiom is more transparent to the reader.
    4. All of the NameClasses and all except one of the ValueClasses can be expressed as either a constant term or a variable paired with some FILTER constraints. The exception to this is the ValueReference, which additionally tests the integrity of the shape expression referenced by ValueReference's l:Label:

      refCounter = nextCounter(shapeAlias+"_c");
      (_, _, _, nestedSelects) = SPARQLvalidation(findRule(l), l, false);
      refSparql = "{ SELECT "+shapeAlias+" (COUNT(*) AS "+refCounter+") {\
                     { SELECT "+shapeAlias+" ?"+l+" {\
                       "+shapeAlias+" "+nameClass.predicate()+" ?"+l+" .\
                       FILTER (true && (isIRI(?"+l+") || isBlank(?"+l+")))\
                     } }"
                     + nestedSelects +
                   "} GROUP BY "+shapeAlias+" }\
                   FILTER ("+nameCounter+" = "+refCounter+")"
      sparqlQuery = nameSparql + valSparql + nameValEquiv + refSparql

      The sparqlQuery above is sufficient for strict validation, but it cannot report the nodes that correspond to the referenced shapes. Adding a clause which enumerates them, by virtue of not grouping on the referring shape variable, lets the user know not only that some constellation of shapes validated but also which nodes were involved. Since there may be more than one reference to any shape expression, the referenced alias needs a unique variable alias.

      refdAlias = nextCounter(shapeAlias+"_"+l+"_ref")
      sparqlQuery += "OPTIONAL {\
                       "+shapeAlias+" "+nameClass.predicate()+" "+refdAlias+" .\
                       FILTER (true && (isIRI("+refdAlias+") || isBlank("+refdAlias+")))\
                     }"
      

      Note that SPARQL is non-recursive and is unable to validate cyclic Shape Expressions. The SimpleShExDemo elides the extra component after warning the user that cyclic expressions cannot be expressed as SPARQL.

    5. if (opt) return (nameCounter, minCardinality, maxCardinality, sparqlQuery)
      else return (NULL, NULL, NULL, sparqlQuery)
  • case: Rule a AndRule

    Iteratively call SPARQLvalidation on each conjoint. If called in an optional, add a filter which implements the cardinality logic:

    vs = conjoints.map{conj -> SPARQLvalidation(conj, shapeAlias, opt)}
    sparqlStr = join("\n", vs.map{(counter, min, max, sparql) -> sparql})
    if (opt)
      elided = join(" && ", vs.map{(counter, min, max, sparql) -> counter+"=0"})
      included = join(" && ", vs.map{(counter, min, max, sparql) ->
                   counter+">="+min+(max ? " && "+counter+"<="+max : "")})
      return (vs[0].counter, vs[0].min, vs[0].max,
              sparqlStr + "FILTER (" + elided + " || " + included + ")")
    else
      return (NULL, NULL, NULL, sparqlStr)
    

    for example:

    FILTER (?IssueShape_c2=0 && ?IssueShape_c5=0
            || ?IssueShape_c2>=1 && ?IssueShape_c2<=1
               && ?IssueShape_c5>=1 && ?IssueShape_c5<=1)
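    The AndRule case can be sketched in Python as follows, assuming each conjoint has already been translated into a (counter, min, max, sparql) tuple; the function and parameter names are illustrative:

```python
def and_rule(conjoint_results, opt):
    """Combine conjoint subqueries; inside an optional, add a FILTER that
    accepts either all counters at zero or all within their cardinalities."""
    sparql_str = "\n".join(s for (_, _, _, s) in conjoint_results)
    if not opt:
        # Non-optional conjoints enforce cardinality in their own HAVING clauses.
        return (None, None, None, sparql_str)
    elided = " && ".join("%s=0" % c for (c, _, _, _) in conjoint_results)
    included = " && ".join(
        "%s>=%s%s" % (c, mn, (" && %s<=%s" % (c, mx)) if mx is not None else "")
        for (c, mn, mx, _) in conjoint_results)
    c0, mn0, mx0, _ = conjoint_results[0]
    return (c0, mn0, mx0,
            "%s\nFILTER (%s || %s)" % (sparql_str, elided, included))
```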
  • case: Rule a OrRule

    We must ensure that exactly one disjoint passes:

    vs = disjoints.map{disj -> SPARQLvalidation(disj, shapeAlias, opt)}
    sparqlStr = "{ SELECT "+shapeAlias+" WHERE {\n    {"
                + join("} UNION {", vs.map{(counter, min, max, sparql) -> sparql})
                + "    }\n} GROUP BY "+shapeAlias+" HAVING (COUNT(*) = 1)}"
    return (NULL, NULL, NULL, sparqlStr)
    

    for example:

    { SELECT ?UserShape WHERE {
        {
            # disjoint1
        } UNION {
            # disjoint2...
        }
    } GROUP BY ?UserShape HAVING (COUNT(*) = 1)}
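    A corresponding Python sketch of the OrRule case, again assuming the disjoints have already been translated (names are illustrative):

```python
def or_rule(disjoint_results, shape_alias):
    """UNION the disjoint subqueries and require exactly one to pass."""
    bodies = "\n    } UNION {\n".join(s for (_, _, _, s) in disjoint_results)
    sparql_str = ("{ SELECT %s WHERE {\n    {\n%s\n    }\n"
                  "} GROUP BY %s HAVING (COUNT(*) = 1)}"
                  % (shape_alias, bodies, shape_alias))
    return (None, None, None, sparql_str)
```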
    
  • case: Rule a GroupRule

    We must test the nested rule. Additionally, if the GroupRule carries the optional flag and we are not already inside an optional, we impose a FILTER implementing the cardinality logic:

    (counter, min, max, sparql) = SPARQLvalidation(r,shapeAlias,opt)
    if (opt) {
      return(counter, min, max, sparql)
    } else {
      if (GroupRule.o)
        sparql += "FILTER ("+counter+"=0 || "+counter+">="+min+(max ? " && "+counter+"<="+max : "")+")";
      return(NULL, NULL, NULL, sparql)
    }
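    Under the same assumptions as the sketches above, the GroupRule case might look like:

```python
def group_rule(nested_result, opt, group_is_optional):
    """Pass cardinality through when nested in an optional; otherwise, if the
    GroupRule itself is optional, accept zero or a conformant count."""
    counter_var, mn, mx, sparql = nested_result
    if opt:
        # Propagate cardinality to the nesting rule's FILTER.
        return (counter_var, mn, mx, sparql)
    if group_is_optional:
        sparql += "\nFILTER (%s=0 || %s>=%s%s)" % (
            counter_var, counter_var, mn,
            (" && %s<=%s" % (counter_var, mx)) if mx is not None else "")
    return (None, None, None, sparql)
```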

Queries for extra triples are constructed from the Validation Query, joined with a large UNION selecting the triples matching each of the refdAliases created in ArcRule step 4.
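As a hedged sketch of that construction (the exact query shape produced by SimpleShExDemo may differ), the function name, parameters, and the `SELECT *` projection below are illustrative assumptions:

```python
def extra_triples_query(validation_sparql, refd_aliases):
    """Join the validation query with a UNION matching the triples whose
    subjects were bound to the referenced-shape aliases."""
    union = "\n  } UNION {\n".join(
        "    %s ?p ?o ." % alias for alias in refd_aliases)
    return ("SELECT * {\n%s\n  {\n%s\n  }\n}"
            % (validation_sparql, union))
```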