ShEx/CurrentDiscussion

From Semantic Web Standards

Current Discussion and work

Below is a list of active, ongoing discussions. The most active topics are at the top.

Validation algorithm

Currently there are 3 modes of validation:

  • State based
  • Backtracking based: Follows the inference rules semantics. It can be used for testing but it is very inefficient
  • Regular Expression derivatives: Efficient implementation inspired by James Clark's RelaxNG algorithm.
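
As an illustration of the derivative idea only (not the actual implementation discussed here), the following minimal Python sketch matches an ordered sequence of predicates against a rule expression. All class and function names are hypothetical. It assumes ordered matching for simplicity, whereas arcs in a shape are unordered, so a real validator would also need an interleave-style operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:      # matches nothing
    pass


@dataclass(frozen=True)
class Eps:        # matches the empty sequence
    pass


@dataclass(frozen=True)
class Sym:        # matches exactly one symbol, e.g. a predicate name
    value: str


@dataclass(frozen=True)
class Seq:
    left: object
    right: object


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    inner: object


def nullable(r):
    """Return True if r accepts the empty sequence."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Sym)):
        return False
    if isinstance(r, Seq):
        return nullable(r.left) and nullable(r.right)
    return nullable(r.left) or nullable(r.right)  # Alt


def deriv(r, s):
    """Derivative of r with respect to symbol s: what remains to be matched."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Sym):
        return Eps() if r.value == s else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.left, s), deriv(r.right, s))
    if isinstance(r, Star):
        return Seq(deriv(r.inner, s), r)
    # Seq: consume s from the left part, or from the right part if left is nullable
    first = Seq(deriv(r.left, s), r.right)
    if nullable(r.left):
        return Alt(first, deriv(r.right, s))
    return first


def matches(r, symbols):
    """Take one derivative per symbol, then check that nothing is left to match."""
    for s in symbols:
        r = deriv(r, s)
    return nullable(r)
```

For example, `matches(Seq(Sym(":p"), Star(Sym(":q"))), [":p", ":q"])` consumes one symbol at a time and never backtracks, which is the source of the efficiency gain over the backtracking mode.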

A set of test cases has been developed to check the different validation algorithms.

For a list of current discussion topics related to the State based implementation, see ShEx/CurrentDiscussion/StateBased

Naming of the standard, currently named SHape EXpression (SHEX)

The current name, however, makes people think of shapes in geometry. We think we should come up with a better name.

RDF Schema would be a nice name; however, this name has already been used by another project.

Anyone is welcome to suggest names here.

  • (jesse) I would like to suggest Graph Schema, but then we must also capture schema matching for ordered graphs
  • I (labra) have no problem with the current (ShEx)y name.
  • A W3C charter group is being created with the name RDF Data Shape (maybe, in the end, it will be called RDS)

Closed/open shapes and exclusion of matched triples

There has already been a discussion about open and closed shapes and about excluding already matched triples from being rematched by any following rule. This discussion is important as it could relate to the previous discussion.

When matching a shape against a piece of RDF data, the matched triples can be excluded from further matching against any other rules. However, this would make it impossible to redefine an arc and make it stricter, because the stricter rule cannot match the triples already matched by the less strict rule. Such a redefinition is especially useful for defining the allowed values of the rdf:type predicate.

A shape can either be defined as open or closed.

An open shape would match a subject if all rules in the shape are passed; however, not all triples have to be matched. For a closed shape, each triple of the subject has to be matched. When a shape is defined as closed it cannot be further extended.
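
The difference between the two modes can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function name `validate` and the data layout are hypothetical, each rule is taken to require at least one matching triple, and cardinalities and shape references are ignored.

```python
def validate(triples, rules, closed=False):
    """Check the triples of one subject against a shape.

    triples: list of (predicate, object) pairs for the subject.
    rules:   dict mapping a predicate to a check function on the object.
    closed:  if True, every triple must be matched by some rule.
    """
    matched = set()
    for predicate, check in rules.items():
        hits = [i for i, (p, o) in enumerate(triples)
                if p == predicate and check(o)]
        if not hits:
            return False          # a rule of the shape is not passed
        matched.update(hits)
    if closed:
        # closed shape: no triple may remain unmatched
        return len(matched) == len(triples)
    return True                   # open shape: leftover triples are allowed
```

With the rule `:p xsd:integer`, a subject that has an extra `:q` triple passes as an open shape but fails as a closed shape.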

Arguments in favour of open semantics

In my (jesse) opinion it would be best if a shape is open by default and can be defined as closed.

Arguments in favour of closed semantics

  • Using closed shape semantics seems more natural to define an operational semantics.
  • In contrast with other formalisms, like RDF Schema, OWL, etc. Shape Expressions employs a closed world semantics which allows a more strict validation.
  • Using a closed semantics, it is possible to mimic open shapes by adding: '. . *'. For example, to express a concept with property ':p' and any other property:
   <a> { :p xsd:integer ,
         . . * 
       }
  • Closed concepts cannot be expressed with open semantics. For example, if one wants to express a concept that has one and only one property :p, with closed semantics, it is:
   <a> { :p . }

Is it possible to declare it using an open semantics?

We could add the following statement to indicate that a shape is closed

   <a> { 
     :p .
     {}
   }

Use of multiple value types

Currently the SHEXc syntax only allows for the following

  • put only one type of value class: ValueType, ValueStem, ValueAny or ValueReference except for a ValueSet.
  • use of - only in the ValueAny
    • - terms can only be of type ValueStem

I would like to propose a syntax in which multiple value classes can be defined and in which the minus can be used in any value definition, such that for example

:s {
 (:a @type1 | :a @type2 | :a xsd:string)
}

can be defined as

:s {
  :a (@type1 @type2 xsd:string)
}

When only one ValueClass is used it can be stated without the (); otherwise the () has to be used. The syntax for each of the ValueClasses is as follows

ValueClass      Definition                                               Example
Value*                                                                   ex:val1
ValueType*      %                                                        %xsd:integer
ValueReference  @                                                        @ex:type1
ValueStem       ...~                                                     ex:item~
ValueNodeType   IRI | Literal | BNode | NonIRI | NonLiteral | NonBNode   IRI
ValueAny        .                                                        .
ValueRegExp*    /..../                                                   /\d*/

Note that the ValueSet (val1 val2) is the same as in the previous definition; however, to distinguish between values and value types, I propose the use of the % sign for value types.

I further propose that, instead of having the - only in the ValueAny definition, it applies to the complete line, so that anything coming after the minus sign is negated. We can then define something like

:s {
  :a (ex:val~ - ex:val1 @ex:type), #match ex:val~ and must be not equal to ex:val1 and not be of ex:type
  :b (. - ex:val~) #match all except that what is ex:val~
}
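
A minimal Python sketch of the proposed minus semantics is given below. It treats terms as plain strings and only covers exact values, stems and ValueAny; shape references such as @ex:type are left out. The helper names (`stem`, `exact`, `any_value`, `value_matches`) are hypothetical.

```python
def stem(prefix):
    """ValueStem, e.g. ex:val~ : matches any term starting with the prefix."""
    return lambda value: value.startswith(prefix)


def exact(term):
    """Value, e.g. ex:val1 : matches exactly this term."""
    return lambda value: value == term


def any_value():
    """ValueAny, written as '.' : matches every term."""
    return lambda value: True


def value_matches(value, included, excluded=()):
    """A value matches if some class before the minus accepts it
    and no class after the minus accepts it."""
    return (any(test(value) for test in included)
            and not any(test(value) for test in excluded))
```

For instance, `(ex:val~ - ex:val1)` corresponds to `value_matches(v, [stem("ex:val")], [exact("ex:val1")])`, and `(. - ex:val~)` corresponds to `value_matches(v, [any_value()], [stem("ex:val")])`.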

This solution would allow for the following use cases (please add use cases that you think are useful)

  • (ex:~ - ex:val1 ex:val2)
  • (ex:~ - /ex:val\d*/)
  • (@type1 @type2)

However, this solution would also allow the following cases, which make it easy for the user to define sloppy schemas

  • (%xsd:string @foaf:person)

To prevent this kind of usage we can change the definition as follows.

Only one type of value class can be used within a list; however, after the minus sign different value classes can be used. The value class type can be defined in front of the list as follows.

@(type1 type2 - ex:ref~) #reference to either type1 or type2, but the IRI may not start with ex:ref
%(xsd:string xsd:int - /1\d*/) #either of type string or int, but the number or string may not start with 1

The definition of the nameClass for the predicate matching can stay as it is.

Language tags

A language tag can be defined by adding @en to an element, which is the same as in Turtle

:s {
  :a ="name"@en, #match all strings that are equal to "name" and have the English language tag
  :b /[a-z]*/@en #match all strings that contain only lowercase letters and have the English language tag
}

We could also support a definition stating that the value should be in any one of the listed languages

:s {
  :a = %xsd:string@(en,es), #match all strings that are either English or Spanish
}
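
The language tag checks above could be sketched in Python as follows. The `Literal` type and the `lang_matches` function are hypothetical, and it is an assumption of this sketch that a regular expression has to match the whole lexical form and that language tags are compared case-insensitively.

```python
import re
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    lexical: str                  # the string value, e.g. "name"
    lang: Optional[str] = None    # the language tag, e.g. "en"


def lang_matches(literal, allowed_langs, pattern=None):
    """Return True if the literal carries one of the allowed language tags
    (given in lowercase) and, when a pattern is given, its lexical form
    matches the pattern completely."""
    if literal.lang is None or literal.lang.lower() not in allowed_langs:
        return False
    if pattern is not None and re.fullmatch(pattern, literal.lexical) is None:
        return False
    return True
```

Here `@(en,es)` corresponds to `allowed_langs={"en", "es"}` and `/[a-z]*/@en` to `lang_matches(lit, {"en"}, r"[a-z]*")`.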

Definition of start tag

When no start tag is defined then:

For each subject a report is generated stating whether it matched any of the defined types. I think support for this mode should be an optional feature for a validator (jesse).

When a start tag is defined then:

A final matching result should be given that can be either

  • Pass -> The root shape successfully matches some of the subjects in the database
  • Pass All -> The root shape successfully matches the subjects in the database such that all subjects have been visited

The 'Pass All' validation is useful to make sure that all items in a database are linked to the root(s) and are validated.

Note: When a start element is defined, matching is no longer done against any of the other shapes that are linked to or called from the root node.

I (jesse) think the following 2 items should be added.

  • There should be an option to define multiple root elements
  • For each root element a definition can be given of whether 1 subject or multiple subjects may be found in the database

This syntax could look like

start = ex:shape1+ ex:shape2+

The following steps are performed during matching

  • Each subject in the database is matched against each defined root element.
    • When a match is found the occurrence rate is increased by one
    • When a mismatch or no match is found nothing happens
  • At the end the occurrence rate is checked against the allowed multiplicity of each root element
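    • If the occurrence rate does not fit the multiplicity, the validation result is FAIL (for example, 0 occurrences where 1..N are required, or N occurrences where exactly 1 is required)
  • If all subjects have been visited then return 'Pass All', otherwise return 'Pass'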
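
The steps above can be sketched in Python as follows. This is a simplified illustration, not a definitive implementation: the name `validate_roots` and the data layout are hypothetical, a multiplicity is given as a (min, max) pair with `None` for unbounded, and a subject only counts as visited when it directly matches a root element (subjects reached through shape references are not tracked).

```python
def validate_roots(subjects, roots):
    """subjects: dict mapping a subject to its data.
    roots:    dict mapping a root shape name to (matcher, (min, max)),
              where matcher(data) returns True or False and max may be None.
    Returns 'FAIL', 'Pass' or 'Pass All'."""
    counts = {name: 0 for name in roots}
    visited = set()
    # match each subject against each defined root element
    for subject, data in subjects.items():
        for name, (matcher, _bounds) in roots.items():
            if matcher(data):
                counts[name] += 1       # a match increases the occurrence rate
                visited.add(subject)
            # on a mismatch nothing happens
    # check the occurrence rate against the allowed multiplicity
    for name, (_matcher, (low, high)) in roots.items():
        n = counts[name]
        if n < low or (high is not None and n > high):
            return "FAIL"
    return "Pass All" if visited == set(subjects) else "Pass"
```

In this sketch `start = ex:shape1+ ex:shape2+` would correspond to two root entries, each with the multiplicity `(1, None)`.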

Support for graphs

Currently there is no support for graphs in SHEX; this is therefore a first proposal for supporting graphs in SHEX.

The proposal consists of the following 5 definitions

  • Graph definition
  • Value type that references a graph
  • Reference to shape within another graph
  • Triple stored within another graph
  • Use Graph as Shape

First we define the shape expressions for each graph in the database using the following proposed syntax

ex:GraphUsers [[ #Shape expression definition for the graph itself
  ex:UserShape {
   foaf:name xsd:string .
 }
]]

ex:GraphReport [[ 
  ex:ReportShape {
   ex:title xsd:string+ .
 }
]]

Using the following syntax we can add support for referencing a graph from a subject

ex:GraphUsers [[
  ex:UserShape {
   foaf:name xsd:string .
   ex:reportSet []ex:GraphReport . #reference to the GraphReport Graph
 }
]]

ex:GraphReport [[ 
  ex:ReportShape {
   ex:title xsd:string+ .
 }
]]

Example data

ex:allusers {
  ex:user1 foaf:name "user1" ;
           ex:reportSet ex:reportSet1 .
  ex:user2 foaf:name "user2" ;
           ex:reportSet ex:reportSet2 .
}

ex:reportSet1 {
  ex:report1 ex:title "report1" .
  ex:report2 ex:title "report2" .
}

ex:reportSet2 {
  ex:report3 ex:title "report3" .
  ex:report4 ex:title "report4" .
}

Using the following syntax we can reference a shape definition inside another graph.

NOTE: The 'default shape graph' is the 'current shape graph'

ex:GraphUsers [[ 
  ex:UserShape {
   foaf:name xsd:string .
   ex:report [@ReportShape+ -> []ex:GraphReport] . #reference to shape ReportShape in the graph GraphReport
 }
]]

ex:GraphReport [[
  ex:ReportShape {
   ex:title xsd:string .
 }
]]

Example data

ex:allusers {
  ex:user1 foaf:name "user1" ;
           ex:report ex:report1, ex:report2.
  ex:user2 foaf:name "user2" ;
           ex:report ex:report3, ex:report4.
}

ex:reportSet1 {
  ex:report1 ex:title "report1" .
  ex:report2 ex:title "report2" .
  ex:report3 ex:title "report3" .
  ex:report4 ex:title "report4" .
}

Using the following syntax we can define that, for a certain arc, the triples are stored in another graph

NOTE: the 'default graph' for an arc is the 'current graph'

ex:GraphUsers [[
  ex:UserShape {
   foaf:name xsd:string .
   [ex:age xsd:integer] -> []ex:AgeGraph . #this triple is stored within the specified graph
 }
]]

ex:AgeGraph [[ 
   ex:AgeShape { #extra/double validation inside the AgeGraph itself
     ex:age xsd:integer 
   }
]]

Example data

ex:allusers {
  ex:user1 foaf:name "user1" .
  ex:user2 foaf:name "user2" .
}

ex:ageGraph {
  ex:user1 ex:age 24.
  ex:user1 ex:age 28.
}
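
One way to read the `[arc] -> []graph` construct is as a check on quads instead of triples. The following Python sketch shows that reading only; the function `arc_in_graph` and the representation of the data as (subject, predicate, object, graph) tuples are assumptions made for illustration.

```python
def arc_in_graph(quads, subject, predicate, graph_name, check):
    """Return True if the given graph contains a triple
    (subject, predicate, o) whose object o passes the check."""
    return any(
        s == subject and p == predicate and g == graph_name and check(o)
        for s, p, o, g in quads
    )
```

With the example data, `arc_in_graph(quads, "ex:user1", "ex:age", "ex:ageGraph", lambda o: isinstance(o, int))` holds, while the same check against `ex:allusers` does not.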

Using the following syntax we can use the graph as a subject to say something about the graph itself

ex:GraphUsers [[
  ex:UserShape {
   foaf:name xsd:string .
   [ex:age xsd:integer] -> []ex:AgeGraph . #this triple is stored within the specified graph
 }
 []ex:AgeGraph { #some extra information on the age graph
   ex:source xsd:string .
 }
]]

ex:allusers {
  ex:user1 foaf:name "user1" .
  ex:user2 foaf:name "user2" .
  ex:ageGraph ex:source "world wide web" .
}

ex:ageGraph {
  ex:user1 ex:age 24.
  ex:user1 ex:age 28.
}

When combining several of these items we can define the following

ex:GraphUsers [[ 
  ex:UserShape {
   foaf:name xsd:string .
   [ex:report [@ReportShape+ -> []ex:GraphReport]] -> []ex:GraphReportLink . #reference to shape ReportShape in the graph GraphReport and triple is stored in the GraphReportLink graph shape
   }
]]

ex:GraphReportLink [[
  ex:ReportLink { #double validation
    ex:report [@ReportShape+ -> []ex:GraphReport]
  }
]]

ex:GraphReport [[
  ex:ReportShape {
   ex:title xsd:string .
 }
]]

example data:

ex:allusers {
  ex:user1 foaf:name "user1" .
  ex:user2 foaf:name "user2" .
}
  
ex:reportLink {
  ex:user1 ex:report ex:report1, ex:report2.
  ex:user2 ex:report ex:report3, ex:report4.
}

ex:reportSet1 {
  ex:report1 ex:title "report1" .
  ex:report2 ex:title "report2" .
}
ex:reportSet2 {
  ex:report3 ex:title "report3" .
  ex:report4 ex:title "report4" .
}

Problems

This initial proposal can do quite a lot; however, there are several problems.

  • No method exists to define that the graph and the shape subject must be the same

In the following use case we would like to state that the graph and the subject should be the same; however, that is not possible with the initial proposal given here.

ex:interaction123 { 
  ex:interaction123 :upregulates ex:protein456
}
ex:protein456 { 
  ex:protein456 ex:name "lexa"
}

Schema definition:

ex:Interaction [[
  []ex:Interaction {
    ex:protein []ex:Protein, #reference to graph
    ex:protein [@[]ex:Protein -> ex:Protein] #as well reference to the shape within 
  }
]]

ex:Protein [[
  []ex:Protein { #could be IRI of an other protein graph
    ex:name xsd:string
  }
]]

An option to solve this would be to use some kind of variable binding as done in SPARQL; however, this could dramatically increase the expressiveness and the complexity of the validation process. It would look something like this.

ex:Interaction [[
  []ex:Interaction {
    ex:protein []ex:Protein, #reference to graph
    ex:protein [@[]ex:Protein -> ex:Protein] #as well reference to the shape within 
  }
]]

ex:Protein ?uri [[ #bind uri of the graph
  []ex:Protein ?uri { #shape uri should be the same of the one of the graph
    ex:name xsd:string
  }
]]

However, the use of bindable variables in SHEX would be a discussion on its own.

  • Defining the (reverse)multiplicity between a subject and the graph is impossible

For the following definition both data sets are acceptable; there is no way to say anything about the multiplicity between a graph and a subject.

ex:GraphUsers [[
  ex:UserShape {
   foaf:name xsd:string .
   [ex:report [@ReportShape+ -> []ex:GraphReport]] -> []ex:GraphReportLink . #reference to shape ReportShape in the graph GraphReport and triple is stored in the GraphReportLink graph shape
   }
]]

ex:GraphReportLink [[
  ex:ReportLink { #double validation
    ex:report [@ReportShape+ -> []ex:GraphReport]
  }
]]

ex:GraphReport [[
  ex:ReportShape {
   ex:title xsd:string .
 }
]]

example data:

ex:allusers {
  ex:user1 foaf:name "user1" .
  ex:user2 foaf:name "user2" .
}
  
ex:reportLink {
  ex:user1 ex:report ex:report1, ex:report2.
  ex:user2 ex:report ex:report3, ex:report4.
}

ex:reportSet1 {
  ex:report1 ex:title "report1" .
  ex:report2 ex:title "report2" .
}
ex:reportSet2 {
  ex:report3 ex:title "report3" .
  ex:report4 ex:title "report4" .
}

However, the following data will also fit, which could be unwanted

ex:reportSet1 {
  ex:report1 ex:title "report1" .
  ex:report2 ex:title "report2" .
  ex:report3 ex:title "report3" .
  ex:report4 ex:title "report4" .
}

We could solve this by defining some kind of multiplicity after the -> sign, which would look something like

ex:GraphUsers [[
  ex:UserShape {
   foaf:name xsd:string .
   [ex:report [@ReportShape+ ->1 []ex:GraphReport]] ->1:1 []ex:GraphReportLink . #all ex:report triples to be found in one graph that graph might not contain any other triples matching this arc and subject
   }
]]

ex:GraphReportLink [[
  ex:ReportLink { #double validation
    ex:report [@ReportShape+ ->1:1 []ex:GraphReport]  #all definitions are to be found in one graph, and that graph may only contain references from this arc and subject
  }
]]

ex:GraphReport [[
  ex:ReportShape {
   ex:title xsd:string .
 }
]]

However the exact details, complexity and related problems are not clear at the moment.

  • Has complex effect on the validation process

References to And and Or rule groups

The ValidationCode script is based on the RDF Shex format, which allows referencing named Or and And rule groups. However, this is not yet possible in the SHEX syntax. In the current RDF Shex definition there is a difference between the ResourceShape and the AndRuleGroup. A ResourceShape is an extension of the AndRuleGroup. Only a ResourceShape can be referenced by a ShapeArc, whereas an AndRuleGroup cannot. A ResourceShape, however, must have an occurrence of exactly one.

*Discussion point: Should we have separate ResourceShape and AndRuleGroup definitions, or should these be merged into one?

Some extensions

Language tags

Add @lang to value objects so that language tags can be expressed. For example, declare a concept that has two 'rdfs:label' values, one in English and the other in Spanish.

   <concept> { rdfs:label (@en), 
               rdfs:label (@es) 
             }

Regular expressions

Add regular expressions to the definition of value objects.

For example, declare that a concept has a 'rdfs:label' that has two consecutive a's:

   <concept> { rdfs:label (/.*aa.*/) }

Regular expressions can be combined with language tags. For example, declare a concept that has 'rdfs:label' with two consecutive a's in Spanish.

   <concept> { rdfs:label (/.*aa.*/@es) }

Reverse arcs

Declare arcs that point to a concept. For example, declare that an agent must be known by some person:

   <concept> { ^ foaf:knows @<Person> }

Adding validation of reverse arcs can interact with a semantics that only takes into account the triples in which a node is the subject.
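
A reverse arc check can be sketched in Python as a search over incoming triples. The function `has_reverse_arc` is hypothetical, and the shape reference @<Person> is approximated here by an arbitrary test function on the subject, which is an assumption of this sketch.

```python
def has_reverse_arc(graph, node, predicate, subject_ok=lambda s: True):
    """Return True if some triple (s, predicate, node) exists in the graph
    whose subject s passes subject_ok."""
    return any(
        p == predicate and o == node and subject_ok(s)
        for s, p, o in graph
    )
```

Note that the check has to look at triples in which the node is the object, so a validator that only collects the triples with the node as subject would not find them.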