ValidationCode

From Semantic Web Standards

Obsolete - please see ShEx Semantics


Validation code with unit tests

This validation code has been created to support the discussion on the SHEX standard. A suite of test cases has been included to support for discussion related to SHEX.

You can see ValidationPseudoCode here the old original page with the pseudo code that under pined this code.

This validation code validates RDF Subject agains a SHEX definition given in RDF SHEX.

This document and code is based on [1],[2],[3].

See below the validation code of interest, the full code is available at https://github.com/jessevdam/shextest

This code contains a ruby server which can accessed on http://localhost:4567. RDF SHEX code together with RDF document can be given for validation. The RDF SHEX code is validated againt the RDF SHEX format after which the RDF document is validated against the given RDF SHEX format.

A set of test cases is included in the testset.txt, however the test still need to be documented. (work in progess)

The simple format for this file is a following:

#prefix definition for all test cases

*** Name of the test
#shex defenition goes here
---
#data to be test goes here
=== Success code (OPEN,NONE,FAIL,PASS,ALL)
*** Name of the following test set
... 

See filedoc.txt for a short describtion of the different files.

Overview

The matching of an Subject against a Shape is based on the behavior used in a regular expression, with the following differences (Eric)

  1. Shape Expression rules match elements in a unordered set (of triples) while regular expressions match elements in a ordered sequence (of characters). Since no order exists in the object, are all matching triples and associated objects of the subject matched and not only the next character in the sequence.
  2. Arc rules in Shape Expressions discriminate between "not found" and "found with the wrong value", whereas in a regular expression there is only "not found"

In comparison with XML schema there is the following difference

  1. An XML document has one root, wheres as a RDF document there no Root, but many subjects -> any subject(s) can be a root
NOTE: definition for closed/open shapes has to be added
NOTE: semantic actions has still to be added 

Structure overview

The complete SHEX template contain one or more Shapes.

A shape is a 'And' 'Rule group' A shape can only be referred by a Shape Property whereas a 'Rule Group' can not

Rule groups

Each 'Rule group' contains zero or more rules, which can be either a

  • 'Rule groups' named as a sub group
  • 'Rule arcs' named as a shape property

Each 'Rule group' can have the following cardinality

  • 1, match must be found and satisfied
  • ?, match may be satisfied, but if found and the values mismatch then its a failure (see 'OrRuleGroup' truth table below)
NOTE (Eric): The cardinality of groups is limited to a maximum of one because the associations of repeated groups get lost when represented in a set,
e.g {Bob, 23, Sue, 18} because indistinguishable from {18, 23, Bob, Sue}.) 

Two different types of 'Rule groups' exists

  1. And rule groups, each rule must be satisfied
  2. Or rule groups, only one of rules must be satisfied

See group truth tables below for further detail

Rule ARCs

Five different types of 'Rule ARC' exists

  1. Value Type, the value type of the object must comply with one of the given value type(s)
  2. Enumeration Set, the value must be equal to (one off) the given value(s)
  3. Shape Property, the value must be a RDF subject and the subject should comply with (one of) the given shape(s)
  4. Stem Property, the value must be a IRI and the IRI should comply to the stem given.
  5. Any, any value is accepted, or must be either a IRI, literal or blank note
NOTE: For Value Type, I allow for matching against multiple value types. (E.G. the property 'age' may be either a Integer or a String)  

For Shape Property, I allow for matching against one of the shapes. (E.G. the property 'address' may be either a Dutch or American address)

Each 'Rule Arc' can have the following cardinality

  • 1, only 1 occurrence must be found and satisfied
  • ?, only 1 occurrence may be found, but if found and the values mismatch then its a failure (see 'OrRuleGroup' truth table below)
  • +, on least 1 occurrence must be found, but multiple are allowed, however all values must match otherwise its a failure
  • *, any number occurrences may be found, however all values must match otherwise its a failure
  • {N,M}, number of occurrences found must between N(cardinalityMin) and M(cardinalityMax) and all values must match otherwise its a failure

Root elements

There is a root definition that can contain one or more root items and contains the following property

  • items, a set of root items

Multiple shape type can form the root of a document and for each shape type it can be defined how many times it is expected as root, which will be most commonly exactly-one.

A root item contains the following properties

  • occurs, indicating how many times the item may be found
  • shape, the resourceshape that is a root item

If a root matching fails is not counted, however it does not fail the final result, as mismatches will occur, however if the occurs count mismatches it will fail.

TODO: This still needs to be further discussed

Truth tables

AndRuleGroup.truthTable = ...
Seq NONE OPEN FAIL PASS
NONE NONE NONE FAIL FAIL
OPEN NONE OPEN FAIL PASS
FAIL FAIL FAIL FAIL FAIL
PASS FAIL PASS FAIL PASS

(taken from [4])

OrRuleGroup.truthTable = ...
Seq NONE OPEN FAIL PASS
NONE NONE OPEN FAIL* PASS
OPEN OPEN OPEN FAIL* PASS
FAIL FAIL* FAIL* FAIL PASS*
PASS PASS PASS PASS* PASS
NOTE: *In a or rule group one of the elements must match. If one passes no matter what the other results do is will pass,
however if something is optional and returns OPEN should it fail if an  other item fails? 
Please see also eric test case here [[5]]

Validate all

A match ALL is returned if all subject in the database are matched against a SHEX shape. Note however that this is not testing if all triples in the DB are matched.

NOTE: TODO: The test code in not perfect yet, in some cases it will return ALL while it did not match all.

Recursion

There are 2 types of recursion in the validation code, for which code is included to prevent loops. This achieved by including 2 kinds of chain.

1. Subgroup property. If a group is already in the chain of group that is being check it is assumed that it will pass and so a PASS is returned. 2. Shape arc. If a certain subject/shape combination is in the chain is being assumed that is will pass and so true is returned.

Properties of types used in pseudo code

A RuleGroup has the following properties

  • rules, all rules within a RuleGroup, which can be either a sub group(type = RuleGroup) or a shape property (type = ArcRule)
  • cardinality, which can be either '1' or '?'
  • type, which is either OrGroup or AndGroup
    • truthTable, each type has it own truthTable (see below)

A ArcRule has the following properties

  • cardinality, which can be either '1', '?','+','*','{N,M}'
  • cardinalityMin, min count of occurrences
  • cardinalitymax, max count of occurrences
  • type, which can be either EnumerationSet,ValueType,ValueProperty,StemProperty,AnyType
    • matchValue, perform the match of a value against the ArcRule, true->match and false->no match

A EnumerationSet(subclass of ArcRule) has the following property

  • allowedValues, a list of allowed values

A ValueType(subclass of ArcRule) has the following property

  • allowedTypes, a list of allowed types

A ValueProperty(subclass of ArcRule) has the following property

  • shapes, a list of references to shapes of which one must match

A StemProperty(subclass of ArcRule) has the following property

  • pattern, the pattern to which the object IRI must match

A AnyProperty(subclass of ArcRule) has the folowing property

  • superType(optional), allowed supertype (IRI,Literal, Blank Note)

Pseudo code

The pseudo code below matches RDF dataset against a SHEX definition.

NOTE: that the pseudo code contain 2 types of recursion. One for the sub groups and one for the Shape Property.

The result of matchRuleGroup and matchArc can be either

  • NONE = 0, no result found while expected
  • OPEN = 1, no result found however arc/rule group is optional
  • FAIL = 2, result found but one or more values are mismatching
  • PASS = 3, result found and all values are matching
  • ALL = 4, all subjects in the DB are matched against a SHEX shape
/*!
* Copyright 2014 Jesse van Dam and other contributors
*/
var getAllSubject = function(DB)
{ 
 //>SELECT DISTINCT ?subjecti
 //FROM $DB
 //>WHERE
 //>{
 //>  ?subject ?predicate ?values .
 //>}	
 ...
};

var getValues = function(DB,subject,predicate) {
 //>SELECT ?values
 //FROM $DB
 //>WHERE
 //>{
 // >  $subject $predicate ?values .
 //>}
 ...
};
 
validator = function(DB)
{
 var NONE = 0; var OPEN = 1;var FAIL = 2; var PASS = 3; var ALL = 4;
 var resNames = {0:"NONE",1:"OPEN",2:"FAIL",3:"PASS"};
 var truthTable = {};
 //ResourceShape is always a AndRuleGroup
 //The truthTables, does this contain the behaviour we want to have
 truthTable["ResourceShape"] = truthTable["AndRuleGroup"] = 
      [[NONE, NONE, FAIL, NONE],
	[NONE, OPEN, FAIL, PASS],
	[FAIL, FAIL, FAIL, FAIL],
	[NONE, PASS, FAIL, PASS]];
 truthTable["OrRuleGroup"] = 
      [[NONE, OPEN, FAIL, PASS],
	[OPEN, OPEN, FAIL, PASS],
	[FAIL, FAIL, FAIL, PASS],
	[PASS, PASS, PASS, PASS]]; 
	
 var _1 = "http://open-services.net/ns/core#Exactly-one";
 var _0_or1 = "http://open-services.net/ns/core#Zero-or-one";
 var _1_ormore = "http://open-services.net/ns/core#One-or-many";
 var _0_ormore = "http://open-services.net/ns/core#Zero-or-many";
 
 var stLiteral = "http://www.w3.org/2013/ShEx/Definition#Literal";
 var stIRI = "http://www.w3.org/2013/ShEx/Definition#IRI";
 var stBNode = "http://www.w3.org/2013/ShEx/Definition#BNode";
 var stMap = {"IRI":stIRI,"RDFLiteral":stLiteral,"BNode":stBNode};
 
 var arcMatchFunctions = {};
 
 //variable used to remember which object we are checking against which ruleGroup, this it needed to prevent loops
 var visiting = [];
 
 var subjectsValidatedMap = null;
 
 this.matchStart = function(root)
 {
   //try to match against all available subjects
   var subjects = getAllSubject(DB);
   //Make a map of all subjects present in the DB, so we can check if we have matched all of them (so there is no remaining subject)
   //Remaining subject - predicate is not checked in the current version of the script
   //TODO: fix problem of matching via probe routes that fail
   subjectsValidatedMap = {};
   _.each(subjects, function(subject) { subjectsValidatedMap[subject.lex] = false; });
   var finalRes = PASS;
   //iterate over all start items defined in the shex file
   if(to_array(root.items).any(function(item){
     var matchCount = 0;
     //for each item the number of matches should match with the occurs count
     //NOTE: if a fail occurs it does not trigger a failure, otherwise is most cases it will always fail, since not all subject will fit with the start elements
     //NOTE: TODO some functionality could be added, if a subject fits with a 'smaller' shape it must also fit with the 'complete' shape
     _.each(subjects,function(subject)
     {
   	res = matchRuleGroup(subject,item.shape);
   	if(res == OPEN || res == PASS)
   	{
   	  matchCount += 1;
   	  subjectsValidatedMap[subject.lex] = true;
   	}
     });
     if(matchCount == 0)
     {
     	if(item.occurs == _1 || item.occurs == _1_ormore)
     	 return true;//FAIL
     }
     else if(matchCount > 1)
     {
        if(item.occurs == _1 || item.occurs == _0_or1)
     	    return true;//FAIL  	
     }
     else if(item.minoccurs != undefined && (matchCount < item.minoccurs || matchCount > item.maxoccurs))
     {
       return true;//FAIL 
     }             	
   }))
     return FAIL;
   if(_.any(subjectsValidatedMap,function(validated,item) {
     if(!validated)
     {
     	console.log("remaining: " + item);
     	return true;
     }
   }))
     return PASS;
   return ALL;
 };
 
 var to_array = function(val)
 {
   if(val === undefined)
     return _([]);
   var res = Array.isArray(val) ? val : [val];
   return _(res);
 };
 
 //Peform the matching of all the rules in the group, the 'truthTable' table is used to combine them (see truth tables below)
 var matchRuleGroup = function(subject,ruleGroup,exclude)
 {
   //mark the subject that is being checked against the ruleGroup
   visiting.push({"value":subject,"shape":ruleGroup});
   if(exclude == undefined)
     exclude = [ruleGroup];
   else
     exclude = _.union(exclude,[ruleGroup]);
   var sumaryRes = null;
   var todo = to_array(ruleGroup.subGroup);
   var todo2 = todo.difference(exclude);
   //If we exclude some to prevent loops assume they pass, parent must otherwise is fail anyhow
   //NOTE: TODO some hick ups with the NONE condition, to test and define further
   if(todo2.size() < todo.size())
   {
      sumaryRes = PASS;
   }
   todo2.each(function(item)
   {
     var res = matchRuleGroup(subject,item,exclude);
     sumaryRes = evaluate(sumaryRes,res,truthTable[ruleGroup["@type"]]);
   });
   to_array(ruleGroup.property).each(function(item)
   {
     var res = matchArc(subject,item);
     sumaryRes = evaluate(sumaryRes,res,truthTable[ruleGroup["@type"]]);
   });   

   //check done unmark again
   _.remove(visiting,function compare(val){ return val["value"] === subject && val["shape"] === ruleGroup;});
   
     //Empty shape should pass
   if(sumaryRes == null) 
     sumaryRes = PASS;     
   else if(sumaryRes == NONE && ruleGroup.occurs == _0_or1)
     sumaryRes = OPEN;
   //else if(sumaryRes == NONE && ruleGroup.occurs == _1)
   //  return NONE;
   return sumaryRes;   
 };

 var evaluate = function(l,r,truthTable)
 {
   if(l == null)
     return r;
   //Combine subsequent result with the previous one using the truth table (see below)
   return truthTable[l][r];
 };
 
 var matchArc = function(subject,arc)
 {
   //retrieve all values for a given predicate in the subject
   //TODO get @id if propertyDefinition points to an object
   var values = getValues(DB,subject.lex,arc.propDefinition);
   //check if occurrence is correct  //	console.log(JSON.stringify(, null, 2));
   if(values.length == 0)
     return arc.occurs == _0_or1 || arc.occurs == _0_ormore ? OPEN : NONE;
   if(values.length > 1 && (arc.occurs == _1 || arc.occurs == _0_or1))
     return FAIL; //failed on multiplicity
   if(arc.minoccurs != undefined && (values.length < arc.minoccurs || values.length > arc.maxoccurs))
     return FAIL; //NOTE when length == 0 NONE was returned
   //each value should match other wise its a fail
   if(_.any(values,function(value) {
     //use the matchValue function of the arc type
     //if match == false return FAIL
     return !arcMatchFunctions[arc["@type"]](value,arc);
   }))
     return FAIL;
   return PASS;
 };
 
 arcMatchFunctions["EnumerationSetProperty"] = function(value,arc)
 {
   //value must be an IRI
   //NOTE: is that what want to have ??
   if(value._ != "IRI")
     return false;
   //check if value in contained in one of the 'allowed values' given in the 'arc' rule
   return getIRI(to_array(arc.allowedValue)).contains(value.lex);
 };
 
 arcMatchFunctions["ValueProperty"] = function(value,arc)
 {
   //It must be a literal to have a value type
   if(value._ != "RDFLiteral")
     return false;
   //get data type of the value
   //NOTE: should we default to xsd:string if no datatype is given
   var type = "http://www.w3.org/2001/XMLSchema#string";
   if(value.datatype != undefined)
     type = value.datatype.lex;
   //check if the data type in the list of allowed data types
   return getIRI(to_array(arc.valueType)).contains(type);
 };
 
 arcMatchFunctions["ShapeProperty"] = function(value,arc)
 {
   //must be a subject so its must be an IRI
   if(value._ != "IRI" && value._ != "BNode")
     return false;
   //must match against one the shapes given in the Arc rule (most commonly only one given)
   if(to_array(arc.valueShape).any(function(shape) { 
     //recursively call the rule matching with the new subject and shape as rule group
     //NOTE the difference with the subgroup recursion!
     //First check if do a recursive check otherwise we loop forever
     if(_.any(visiting,function compare(val){ return val["value"] === value && val["shape"] === shape;}))
     {
     	//the value is already being checked against the shape in the chain so here we return true.
     	//if it fails is will fail in the 'first' check
     	//NOTE! TODO check impact on NONE,OPEN,PASS and FAIL status
       return true;	
     }
     else
     {
       res = matchRuleGroup(value,shape);
       //res == OPEN should be impossible, because the property valueShape should always reference a ResourceShape
       //and a ResourceShape must always have rs:occurs = Exactly-one  
       //except if all arc and subgroup within rule returned OPEN
       if(res == PASS || res == OPEN) 
       {
         //not perfect
         subjectsValidatedMap[value.lex] = true;
         return true;
       }
     }
   }))
     return true; //return true as soon one matches 
   return false;
 };

 arcMatchFunctions["StemProperty"] = function(value,arc)
 {
   //must be an IRI
   if(value._ != "IRI")
     return false;
   if(getIRI(to_array(arc.stem)).any(function(stem) {
     return value.lex.indexOf(stem) == 0;
   }))
     return true;
   return false;
 };
 
 arcMatchFunctions["AnyTypeProperty"] = function(value,arc)
 {
   //NOTE: TODO minus operator to be defined and implemented
   //any value allowed if no supertype is defined otherwise limit to given supertypes
   if(arc.superType == undefined)
     return true;
   //is super type in the list off allowed super types
   return getIRI(to_array(arc.superType)).contains(stMap[value._]);
 };
};