Warning:
This wiki has been archived and is now read-only.

ISSUE-47: Can SPARQL-based constraints access the shape graph

From RDF Data Shapes Working Group

This page collects use cases of ?shapesGraph access in the current SHACL spec as well as outside of the spec, and discusses design alternatives.

Page started by Holger Knublauch

Use Cases for the Core Vocabulary

sh:allowedValues

SPARQL behind sh:allowedValues currently requires walking the allowed values, which are stored in an rdf:List present in the shapes graph.

Design Alternative: SPARQL code generation with FILTER NOT IN (...). This would work OK because the allowed values cannot really be blank nodes. However, it means that we need a different mechanism here (code injection) compared to how other templates are executed.
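
For illustration, the generated query for a constraint with allowed values ex:Red, ex:Green and ex:Blue (hypothetical values, inlined into the query text at generation time) might look like this:

    SELECT ?this (?this AS ?subject) ?predicate (?value AS ?object)
    WHERE {
        ?this ?predicate ?value .
        FILTER (?value NOT IN (ex:Red, ex:Green, ex:Blue)) .
    }

Note that this query no longer needs ?shapesGraph at all, because the rdf:List has been baked into the query during code generation.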

sh:AndConstraint, sh:OrConstraint

SPARQL behind sh:AndConstraint currently requires shapes graph access for two things: to walk the rdf:List of operands, and to make the recursive sh:hasShape call for each.
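
A sketch of that pattern, assuming ?andConstraint is pre-bound to the constraint being evaluated (the property sh:shapes holding the operand list is illustrative; the essential parts are the rdf:List traversal inside GRAPH ?shapesGraph and the sh:hasShape callback):

    SELECT ?this (?this AS ?subject)
    WHERE {
        GRAPH ?shapesGraph {
            ?andConstraint sh:shapes ?list .
            ?list rdf:rest*/rdf:first ?shape .
        }
        FILTER (!sh:hasShape(?this, ?shape, ?shapesGraph)) .
    }

Each operand shape that the focus node fails to match produces a violation row.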

Design Alternative 1 (Dimitris): Break/split all the AND/OR/NOR/XOR/... constraints into groups of shapes and move the operand execution into the SHACL engine.

Answer (HK): I believe you are contradicting yourself. On one hand you are saying that SHACL should work with SPARQL endpoints so that performance is best and then you are suggesting to turn everything into little queries again, controlled by an outside loop. The solution with sh:hasShape allows a SPARQL processor to see the big picture and do the actual work itself. We should assume that future databases will have native support for sh:hasShape and not expect that everything will work perfectly well with the status quo in 2015, before SHACL was invented. Also I am unclear how you imagine things like nested AND/OR that are used within sh:valueShape. How would such things work without a callback to the engine? Finally, your proposal requires special, hard-coded treatment of certain templates which is driving up implementation costs and the complexity of the spec.

Answer (DK): (trying to answer all comments at once) This is exactly what sh:hasShape is doing right now (executing small query chunks within the SPARQL engine), which IMHO could be moved to the SHACL engine with the same functionality and perhaps better control over operand priority/order. I was always in favor of small independent queries; big queries usually result in timeouts on big databases. Regarding nesting, the simplest approach is flattening. Future databases could indeed have native SHACL support regardless of the existence of sh:hasShape. Personally I do not have the expertise to say which option is better suited for optimizing execution in a SPARQL engine: having complete queries, or query parts executed through a function.

Answer (HK): We seem to be talking about very similar approaches now. Your approach to controlling the execution within the SHACL engine will have essentially the same effect as my approach based on sh:hasShape. In cases where the data graph and the shapes graph do not live in the same dataset, my proposal is to wrap them into a virtual dataset where "safe" queries can simply be forwarded into the target database while queries requiring ?shapesGraph remain in the local dataset (e.g. Jena ARQ's own engine). This meets the requirement of using SPARQL as much as possible for the spec. With your approach we would need to break the consistency of the language, because certain Core elements that are currently templates would no longer be templates but have some special "hard-coded" implementation as part of the engine. The latter is not ideal, but I agree that this could be made to work assuming we are convinced that all other use cases of ?shapesGraph mentioned here can either be solved differently or are not relevant enough. My preference remains to provide two definitions of these templates (such as sh:valueShape): one based on SPARQL, and one that pushes control into the engine as you outline. We could either put those two variations into the same document or move the latter into a separate document.

Design Alternative 2 (Dimitris): Create a single SPARQL query similar to the ShEx approach

Answer (HK): Do you have a reference for that please? "Similar to" no longer works, we need to work out details that take SHACL features such as templates into consideration.

Answer (DK): I am not in favor of this approach but listed it for completeness. Eric/Jose could provide more details on this.

Answer (SSt): This might be related to your question, Holger -> ShEx2SPARQL (but I don't know if this document is still up to date).

sh:ClosedShape

SPARQL behind sh:ClosedShape currently requires walking the current shape declaration in the shapes graph, to find out which properties have been declared.

Design Alternative: SPARQL code generation with FILTER NOT IN (properties). While doable in principle, this is again a "hard-coded" custom mechanism compared to how other templates are defined.
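
As a sketch, for a shape that declares the properties ex:firstName and ex:lastName (hypothetical), the generated query could inline the permitted predicates:

    SELECT ?this (?this AS ?subject) ?predicate (?value AS ?object)
    WHERE {
        ?this ?predicate ?value .
        FILTER (?predicate NOT IN (rdf:type, sh:nodeShape, ex:firstName, ex:lastName)) .
    }

Any triple whose predicate falls outside the inlined list is reported as a violation.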

sh:NotConstraint

SPARQL behind sh:NotConstraint currently uses sh:hasShape.
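
A sketch of the current pattern, assuming ?notConstraint is pre-bound to the constraint being evaluated and sh:shape (illustrative property name) points at the negated shape:

    SELECT ?this (?this AS ?subject)
    WHERE {
        GRAPH ?shapesGraph {
            ?notConstraint sh:shape ?shape .
        }
        FILTER (sh:hasShape(?this, ?shape, ?shapesGraph)) .
    }

A row (i.e. a violation) is produced exactly when the focus node does match the shape that it must not match.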

Design Alternative: ?

sh:hasShape: General Recursion and Mixing Execution Languages

The sh:hasShape function is used by several core vocabulary features as a means of calling back to the SHACL engine to evaluate a node against a given shape. This is related to ?shapesGraph access. We could in principle remove the ?shapesGraph argument and assume that the surrounding engine "knows" which shapes graph it is supposed to use (e.g. via some ThreadLocal variable trick in Java). However, if we can do these callbacks then we are also making certain assumptions that the SPARQL engine and the SHACL processor can communicate with each other. This assumption holds in Dataset-like scenarios (left-hand side of my diagram) but not in SPARQL endpoint scenarios. As a result, it is unclear how SPARQL endpoints would handle recursion, or cases in which a SPARQL query calls out to a JavaScript-based template.

Design Alternative: ?

sh:qualifiedValueShape

SPARQL behind sh:qualifiedValueShape currently uses a recursive call to sh:hasShape (nested in a helper function sh:valuesWithShapeCount) to count the number of property values that have the given shape.
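
A sketch of that pattern, with ?shape, ?minCount and ?maxCount pre-bound from the template arguments (the argument order of sh:valuesWithShapeCount shown here is illustrative):

    SELECT ?this (?this AS ?subject) ?predicate
    WHERE {
        BIND (sh:valuesWithShapeCount(?this, ?predicate, ?shape, ?shapesGraph) AS ?count) .
        FILTER (?count < ?minCount || ?count > ?maxCount) .
    }

The helper function itself iterates over the property values and calls sh:hasShape for each of them, which is where the shapes graph access happens.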

Design Alternative: ?

sh:XorConstraint

SPARQL behind sh:XorConstraint currently requires rdf:List traversal in the shapes graph and uses recursive calls to sh:hasShape to test whether exactly one of the property values has the given shape.

Design Alternative: ?

Use Cases outside of the Core Vocabulary

General use of sh:hasShape

sh:hasShape is arguably a useful feature for all kinds of constraint templates. Given that it is used in several Core templates, it is plausible to assume that other templates will also benefit from it, extending the expressivity of high-level languages. It would be poor language design if the Core Vocabulary could do significantly different things that are not also available to end users.

Template arguments that are rdf:Lists

Some Core templates take rdf:Lists as arguments. These are stored in the shapes graph. Some algorithms need to traverse those lists at run-time. Some values in those lists may be blank nodes. Any SHACL user can define templates that also take lists as arguments.
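
The standard SPARQL idiom for walking such a list is a property path over rdf:rest and rdf:first, which only works if the list lives in a graph the query can see (ex:listArgument is a hypothetical template argument, and ?constraint is assumed to be pre-bound to the constraint being evaluated):

    SELECT ?member
    WHERE {
        GRAPH ?shapesGraph {
            ?constraint ex:listArgument ?list .
            ?list rdf:rest*/rdf:first ?member .
        }
    }

Without ?shapesGraph access, a user-defined template has no portable way to reach these list members at run-time.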

Design Alternative: ?

Constraints requiring background data

If we cannot modify the named graphs that are visible to the executing SPARQL engine, we need to keep this information in the shapes graph. Typical examples include QUDT unit conversion, look-up tables of country codes and other reference data. The shapes graph would typically include those as sub-graphs and they would logically belong to the ?shapesGraph (which is the imports closure). Example (did not check QUDT details, but assume that QUDT has the conversion factors from, say, feet to the base unit of centimetres):

   ex:MaxLengthWithUnitConversionConstraint
       a sh:ConstraintTemplate ;
       rdfs:subClassOf sh:PropertyConstraint ;
       sh:argument [
           rdfs:label "The expected maximum value in cm" ;
           sh:predicate ex:maxLength ;
           sh:nodeKind sh:Literal ;
       ] ;
       sh:argument [
           rdfs:label "The unit of measurement of the existing data value" ;
           sh:predicate ex:unit ;
           sh:valueClass qudt:Unit ;
       ] ;
       sh:sparql """
           SELECT ?this (?this AS ?subject) ?predicate (?value AS ?object)
           WHERE {
               ?this ?predicate ?value .
               GRAPH ?shapesGraph {
                   ?unit qudt:conversionFactor ?factor .
               }
               FILTER (?value * ?factor > ?maxLength) .
           } """ .
   
   ex:MyShape
      sh:property [
          a ex:MaxLengthWithUnitConversionConstraint ;
          sh:predicate ex:lengthInFeet ;
          ex:unit qudt:Feet ;
          ex:maxLength 100 ;  # Centimetres
      ] .

Conditional constraints

Constraints may depend on information attached to the type of an instance. In the example below, Birds are generally assumed to be able to fly (and may accumulate bonus miles). However, some subclasses of ex:Bird may declare a boolean flag ex:flightless which is queried by the constraint to flag exceptions.

   ex:Bird
       a sh:ShapeClass ;
       sh:property [
           sh:predicate ex:bonusMiles ;
           sh:datatype xsd:integer ;
       ] ;
       sh:constraint [
           sh:message "Flightless birds cannot have bonus miles" ;
           sh:sparql """
               SELECT *
               WHERE {
                   ?this a ?type .
                   GRAPH ?shapesGraph {
                       ?type ex:flightless true .
                   }
                    ?this ex:bonusMiles ?bonusMiles .
                   FILTER (?bonusMiles > 0)
               }
               """ ;
       ] .
   
   ex:Emu
       a sh:ShapeClass ;
       rdfs:subClassOf ex:Bird ;
       ex:flightless true .
   
   ex:Emma
       a ex:Emu .

A work-around that is sometimes possible would be to put the flightless constraint onto a parallel superclass, e.g. ex:FlightlessBird, and have ex:Emu rdfs:subClassOf ex:FlightlessBird. However, this loses declarativeness and only works for 'boolean' scenarios; it would not work for cases where the constraint depends on something like an integer stored at the subclass.

Variations of sh:ClosedShape

The implementation of sh:ClosedShape assumes one specific interpretation: currently, the permitted properties are all predicates mentioned via sh:property at the class, with rdf:type and sh:nodeShape excluded from the check. It is quite plausible that not everybody will agree with this particular definition and will want to cover additional cases, e.g. walking super-shapes too. Some platforms (such as the current TopBraid/Jena implementation) have no problem accessing the ?shapesGraph, so why prohibit this for everyone? At least it could be an optional feature.

Form Builders and similar algorithms

Similar to sh:ClosedShape, it will be helpful for many tools to be able to dynamically discover which properties are defined for a given shape. This includes walking the sh:property definitions, value types, cardinalities etc.

Question (Dimitris): Is this access needed during execution time from SHACL or from external tools that can read the SHACL graph outside of the actual validation context?

Answer (HK): You are correct that form building would be done outside of the SHACL engine, by an external tool that has access to the shapes graph (e.g. as a named graph). However, a real-world example that we are seeing in TopBraid (e.g. in the EVN and RDM toolkits) is something like "Must have exactly one of the enumerated description properties", where the description properties are skos:definition, rdfs:comment and similar properties. The TopBraid form builder (an external tool as above) places them into a section "Labels and Description". However, there is also a validation aspect to this, which we would like to implement in SHACL in the future: it needs to verify that at least one of these description properties is present. The alternative of using something like an Xor does not work here, because these properties may be inherited from superclasses or added by subclasses, i.e. the list is different each time.

Definition of property paths using the high-level vocabulary

If property paths are added to the high-level vocabulary, as requested by UC32: Non-SPARQL based solution to express constraints between different properties, they have to be resolved recursively.

Question (Dimitris): I am not sure I understand the exact problem here. Are the property paths needed during execution time? If not, could possibly the SHACL engine parse them (even recursively) in advance?

Answer (HK): They need to be accessible at execution time. For example, imagine a template ex:PathMinCount that takes two arguments: a Path expression and a minCount integer. The Path expression may be something like a SPIN blank node tree which is kept in the shapes graph together with the shape declaration. The SPARQL body of the template would use helper functions etc to walk this structure dynamically. The work-around to have the SHACL engine parse these trees in advance would need to be a hard-coded language extension (which is obviously "cheating" as our end users don't have this luxury).

Answer (DK): Can you explain what "hard-coded" and "cheating the end users" means? A SHACL engine has to hard-code many things either way, to parse shapes, scopes, filters, facets, etc.

Answer (HK): Simon wanted to implement this extension himself, without changing the SHACL language. Of course we could hard-code path support into the core language, if the WG decides to do that.

Answer (SSt): What I obviously wasn't able to achieve was to define a generic SHACL function that takes a focus node and a path as arguments and returns the set of respective property values, using only a SPARQL template. Hard-coding path support into SHACL would basically mean that the specification just specifies how sh:Paths (similar to SPIN TriplePaths) are internally mapped to SPARQL property paths, and how those path constructs must behave when used within constraints.

Constraints with helper objects (OSLC Example)

Resource Shapes had a vocabulary to express cardinalities using URIs, e.g. oslc:occurs oslc:Exactly-One. In my original SPIN implementation of OSLC, I solved this by attaching the min and maxCounts to these values, e.g.

   oslc:Exactly-One
       oslc:minCount 1 ;
       oslc:maxCount 1 .

which then allows the constraint to be expressed generically as follows (?occurs comes in as an Argument, pointing at a value such as oslc:Exactly-One).

   SELECT *
   WHERE {
       GRAPH ?shapesGraph {
           ?occurs oslc:minCount ?minCount .
           ?occurs oslc:maxCount ?maxCount .
       }
       BIND (spl:objectCount(?this, ?predicate) AS ?count) .
       FILTER (?count < ?minCount || ?count > ?maxCount) .
   }

This is an elegant and powerful pattern as it allows users to define their own structured objects and reuse them across shape definitions, without requiring changes to the actual data graphs.

Constraints with helper objects (sh:valueScheme example)

See Jerven's email

To implement such a structured parameter object, the query needs to walk into the shapes graph, unless all of its properties are moved up to become top-level arguments.

Use Cases that we haven't thought of yet

We could easily try to convince ourselves that the SHACL WG in 2015 already knows what people will want to do with SHACL over the next few years. Sorry, but we don't. It is a perfectly normal situation in RDF-based applications to have generic queries that react dynamically to whatever information they find in the class/properties model. The fact that RDFS/OWL classes are also just triples makes this an attractive value proposition. The equivalent in the SHACL world is the shapes graph.