Validation Orientation

RDF Data Shapes WG F2F
30 October, 2014

Eric Prud'hommeaux, <eric@w3.org>
- Sanitation Engineer, W3C

Problem Statement

Useful data needs consistent structure:

Advertise value
- What's there?
- How's it spelled?
- How's it linked?
Generate interfaces
Direct query generation
Direct query optimization

Detect and correct errors:

missing properties,
missing/bad type arcs,
missing referents,
inconsistent state (e.g. dates),
value set violations.

@prefix : <http://www.w3.org/2012/12/rdf-val/SOTA-ex#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<issue7> a :Issue , :SecurityIssue ;
    :state :unassigned ;
    :reportedBy <user6> , <user2> ; # cardinality 1
    :reportedOn "2012-12-31T23:57:00"^^xsd:dateTime ;
    :reproducedBy <user2>, <user1> ;
    :reproducedOn "2012-11-31T23:57:00"^^xsd:dateTime ;
                       # reproduced before being reported
    :related <issue4>, <issue3>, <issue2> .
                       # referenced issues not included

<issue4> # a ???         missing type arc
    :state :unsinged ; # misspelled
    # :reportedBy ??? -  missing
    :reportedOn "2012-12-31T23:57:00"^^xsd:dateTime .

<user2> a foaf:Person ;
    foaf:givenName "Alice" ;
    foaf:familyName "Smith" ;
    foaf:phone <tel:+1.555.222.2222> ;
    foaf:mbox <mailto:alice@example.com> .

<user6> a foaf:Agent ; # should be foaf:Person
    foaf:givenName "Bob" ; # foaf:familyName "???" - missing
    foaf:phone <tel:+.555.222.2222> ; # malformed tel: URL
    foaf:mbox <mailto:alice@example.com> .
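The error classes listed above can be sketched as plain checks over triples. This is a minimal illustration only: the tuple encoding, the `values` and `find_errors` helpers and the error messages are assumptions made for this sketch, not part of any proposal discussed here.

```python
# Hypothetical tuple rendering of the issue data above; names are illustrative.
TRIPLES = [
    ("issue7", "rdf:type", ":Issue"),
    ("issue7", ":state", ":unassigned"),
    ("issue7", ":reportedBy", "user6"),
    ("issue7", ":reportedBy", "user2"),
    ("issue7", ":reportedOn", "2012-12-31T23:57:00"),
    ("issue7", ":reproducedOn", "2012-11-31T23:57:00"),
    ("issue7", ":related", "issue4"),
    ("issue7", ":related", "issue3"),
    ("issue7", ":related", "issue2"),
    ("issue4", ":state", ":unsinged"),
    ("issue4", ":reportedOn", "2012-12-31T23:57:00"),
]
STATES = {":unassigned", ":assigned"}


def values(node, predicate):
    """All objects of (node, predicate, ?o) in TRIPLES."""
    return [o for (s, p, o) in TRIPLES if s == node and p == predicate]


def find_errors():
    described = {s for (s, _, _) in TRIPLES}
    errors = set()
    for node in described:
        if not values(node, "rdf:type"):
            errors.add((node, "missing type arc"))
        reporters = values(node, ":reportedBy")
        if len(reporters) != 1:
            errors.add((node, "expected 1 :reportedBy, found %d" % len(reporters)))
        for state in values(node, ":state"):
            if state not in STATES:
                errors.add((node, "bad :state " + state))
        for other in values(node, ":related"):
            if other not in described:
                errors.add((node, "missing referent " + other))
        # Same-format ISO 8601 strings order correctly as plain text.
        for reported in values(node, ":reportedOn"):
            for reproduced in values(node, ":reproducedOn"):
                if reproduced < reported:
                    errors.add((node, "reproduced before reported"))
    return errors
```

Each check is a separate "point" test; a schema language would let the same rules be stated once and reused.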

Expectations

What do users get elsewhere?

	SQL	XML
missing properties	reportedBy UNSIGNED INT NOT NULL	element reportedBy { User },
missing/bad type arcs	N/A	N/A
missing referents	FOREIGN KEY (reportedBy) REFERENCES Users(ID)	<keyref refer="UserID">
inconsistent state	CHECK(reproducedOn>reportedOn)	[schematron]
value set violations	ENUM('unassigned', 'assigned')	attribute state { "unassigned" | "assigned" }
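The SQL column of the table can be tried out with Python's bundled `sqlite3` module. This is a sketch under assumptions: the `Issues` and `Users` table layout and the `try_insert` helper are invented for illustration, and because SQLite has neither `ENUM` nor `UNSIGNED INT`, a `CHECK ... IN` clause and `INTEGER` stand in for them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute("CREATE TABLE Users (ID INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE Issues (
        ID INTEGER PRIMARY KEY,
        state TEXT CHECK (state IN ('unassigned', 'assigned')),
        reportedBy INTEGER NOT NULL REFERENCES Users(ID),
        reportedOn TEXT NOT NULL,
        reproducedOn TEXT,
        CHECK (reproducedOn IS NULL OR reproducedOn > reportedOn)
    )
""")
conn.execute("INSERT INTO Users (ID) VALUES (2)")
conn.commit()


def try_insert(row):
    """Insert (state, reportedBy, reportedOn, reproducedOn).

    Returns None on success, else the database's error message.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Issues (state, reportedBy, reportedOn, reproducedOn)"
                " VALUES (?, ?, ?, ?)",
                row,
            )
        return None
    except sqlite3.IntegrityError as err:
        return str(err)
```

For example, `try_insert(("unsinged", 2, "2012-12-31T23:57:00", None))` is rejected by the database itself, with no application code involved; that is the expectation users bring to RDF.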

Existing data

Ontology separate from constraints separate from instance
- instance and ontology routinely separate.
- constraints tied to OSLC services, not to instance data.
One type used for multiple purposes.
(Data not annotated with fully-discriminating types.)
- foaf:Person could be a friend, invitee, patient
One type arc used through different states.
- e.g. schema for interface A for as-yet unassigned issues
  vs. schema for interface B for resolved issues.
Objects sometimes strings, sometimes objects.
- dc:creator a string or a foaf:Person in the same data.
- likewise wikipathways:complex or metacyc:…
Value set inclusion rules vary:
- enumerated set
- dereferenced enumerated set
- too large to dereference
- inclusion by lexical constraints (e.g. IRI stem)
- product of OWL union/intersection
schema.org:
- {domain,range}Includes
- ObjectProperties can be literals
- datatype "equivalences": "schema/true", "true"^^xsd:boolean, "True"
- Role member rules
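Some of the value set inclusion rules above (enumerated set, IRI stem, union/intersection) can be sketched as composable membership tests. The function names and the `http://example.org/` IRIs are hypothetical; dereferencing and sets too large to dereference are not modelled.

```python
def enumerated(*members):
    """Value set given by listing its members."""
    allowed = frozenset(members)
    return lambda value: value in allowed


def iri_stem(stem):
    """Value set given lexically: every IRI starting with `stem`."""
    return lambda value: value.startswith(stem)


def union(*tests):
    return lambda value: any(test(value) for test in tests)


def intersection(*tests):
    return lambda value: all(test(value) for test in tests)


# Hypothetical example: two enumerated states, or anything under a local stem.
valid_state = union(
    enumerated("http://example.org/unassigned", "http://example.org/assigned"),
    iri_stem("http://example.org/custom-state/"),
)
```

Here `valid_state("http://example.org/custom-state/triaged")` is accepted by the stem rule even though no enumeration lists it.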

Technologies

in order of specificity:

SPARQL
path-indexed ASK
SPARQL as SPIN
OWL/ICV
Resource Shapes/
Description Set Profiles/
TQ schema
Shape Expressions

SPARQL

targeted "point" tests (like schematron).
can be compiled.
expressive.
ubiquitous.
idiomatic -- many ways to code the same constraint.

ASK {
    { SELECT ?S (COUNT(*) AS ?S_c0) {
      ?S foaf:givenName ?o .
    } GROUP BY ?S}
    { SELECT ?S (COUNT(*) AS ?S_c1) {
      ?S foaf:givenName ?o .
    } GROUP BY ?S}
    FILTER (?S_c0 = ?S_c1 &&
            ?S_c0 = 1)
    { SELECT ?S (COUNT(*) AS ?S_c2) {
      ?S ex:state ?o .
    } GROUP BY ?S HAVING (COUNT(*)=1)}
    { SELECT ?S (COUNT(*) AS ?S_c3) {
      ?S ex:state ?o .
      FILTER ((?o = ex:unassigned ||
               ?o = ex:assigned))
    } GROUP BY ?S HAVING (COUNT(*)=1)}
    FILTER (?S_c2 = ?S_c3 &&
            (?S_c2 = 0 || ?S_c2 = 1))
}
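The query's counting idiom (count all arcs for a predicate, count the arcs whose value is acceptable, then compare the two counts) can be shown in a loose Python analogue. It is not a translation of the SPARQL: the tuple encoding and helper names are assumptions, and unlike the `HAVING` clauses it treats a missing `ex:state` as acceptable.

```python
from collections import Counter

ALLOWED_STATES = {"ex:unassigned", "ex:assigned"}


def count_arcs(triples, predicate, value_test=lambda o: True):
    """Per-subject count of `predicate` arcs whose object passes value_test."""
    return Counter(
        s for (s, p, o) in triples if p == predicate and value_test(o)
    )


def passes(triples, subject):
    c0 = count_arcs(triples, "foaf:givenName")[subject]
    c2 = count_arcs(triples, "ex:state")[subject]
    c3 = count_arcs(triples, "ex:state", lambda o: o in ALLOWED_STATES)[subject]
    # c2 == c3 means every ex:state value is in the allowed set.
    return c0 == 1 and c2 == c3 and c2 in (0, 1)
```

A `Counter` returns 0 for subjects it has never seen, which is what lets the zero-or-one case work without a separate branch.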

Path-indexed ASK

per Simister, Brickley

separate the selector.
imposes more "parsability" on the SPARQL.
used to break up the structure of a SPARQL query.

{
  "@context": { … },
  "constraints": [{
    "context": "ex:status",
    "constraint": "ASK { ?s ex:assignee ?o }",
    "severity": "warning",
    "message": "a status of assigned requires an assignee"
  }]
}
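One way to read the structure above: the "context" selects nodes, and the constraint is then asked of each selected node. The sketch below follows that reading with a Python callable standing in for the ASK query; how the actual proposal binds `?s` to the selected node is an assumption here, as are the `run` function and the tuple encoding.

```python
def has_assignee(triples, node):
    """Stand-in for: ASK { ?s ex:assignee ?o } with ?s bound to node."""
    return any(s == node and p == "ex:assignee" for (s, p, _) in triples)


CONSTRAINTS = [{
    "context": "ex:status",
    "check": has_assignee,
    "severity": "warning",
    "message": "a status of assigned requires an assignee",
}]


def run(triples, constraints):
    """Return (severity, node, message) for every selected node that fails."""
    reports = []
    for constraint in constraints:
        selected = sorted(
            {s for (s, p, _) in triples if p == constraint["context"]}
        )
        for node in selected:
            if not constraint["check"](triples, node):
                reports.append(
                    (constraint["severity"], node, constraint["message"])
                )
    return reports
```

Keeping the selector outside the query is what makes the constraint set indexable by path without parsing the SPARQL.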

SPARQL as SPIN

Similar to SPARQL but written in RDF.
Associated with RDF term or type.
powerful.
complex.

ASK {
    { SELECT ?this (COUNT(*) AS ?this_c0) {
      ?this foaf:givenName ?o .
    } GROUP BY ?this}
    { SELECT ?this (COUNT(*) AS ?this_c1) {
      ?this foaf:givenName ?o .
    } GROUP BY ?this}
    FILTER (?this_c0 = ?this_c1 &&
            ?this_c0 = 1)
    { SELECT ?this (COUNT(*) AS ?this_c2) {
      ?this ex:state ?o .
    } GROUP BY ?this HAVING (COUNT(*)=1)}
    { SELECT ?this (COUNT(*) AS ?this_c3) {
      ?this ex:state ?o .
      FILTER ((?o = ex:unassigned ||
               ?o = ex:assigned))
    } GROUP BY ?this HAVING (COUNT(*)=1)}
    FILTER (?this_c2 = ?this_c3 &&
            (?this_c2 = 0 || ?this_c2 = 1))
}

SPIN as templates

Vocabulary associated with user-defined functions defined in SPARQL or JavaScript.
Syntax like Resource Shapes/Description Set Profiles (described later).
SPARQL or JS as extensibility mechanism.

:Issue a owl:Class ;
  rdfs:subClassOf owl:Thing ;
  spin:constraint [
      a spl:ObjectCountPropertyConstraint ;
      arg:property ex:name ;
      arg:minCount 1 ;
      arg:maxCount 1 ;
    ] ;
  spin:constraint [
      a spl:ObjectCountPropertyConstraint ;
      arg:property ex:state ;
      arg:minCount 0 ;
      arg:maxCount 1 ;
    ] ;
  spin:constraint [
      a spl:UntypedObjectPropertyConstraint ;
      arg:property ex:state ;
    ] .

ex:name a owl:DatatypeProperty ;
  rdfs:domain my:name-status ;
  rdfs:range xsd:string .

:ValidState a owl:Class ;
  rdfs:label "Valid state" ;
  rdfs:subClassOf owl:Thing ;
.
:state a owl:ObjectProperty ;
  rdfs:domain my:name-status ;
  rdfs:range ex:ValidState .
:unassigned a ex:ValidState .
:assigned a ex:ValidState .
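The template idea (one parameterized constraint definition, instantiated with arguments such as property, minCount and maxCount) can be illustrated with a Python closure. This is an analogy only, not the SPIN API: `object_count_constraint`, `ISSUE_CONSTRAINTS` and `conforms` are hypothetical names.

```python
def object_count_constraint(prop, min_count, max_count):
    """Template: returns a check specialised to the given arguments."""
    def check(triples, node):
        count = sum(1 for (s, p, _) in triples if s == node and p == prop)
        return min_count <= count <= max_count
    return check


# Two instantiations of the same template, mirroring the arg: values above.
ISSUE_CONSTRAINTS = [
    object_count_constraint("ex:name", 1, 1),
    object_count_constraint("ex:state", 0, 1),
]


def conforms(triples, node):
    return all(check(triples, node) for check in ISSUE_CONSTRAINTS)
```

The template body is written once; new kinds of constraint are added by defining another template rather than by extending the vocabulary.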

OWL

Combine OWL with a premise type associated with data.

sound, standard logic
cannot enforce e.g. minimum cardinality (open world)
no defined extensibility route

Datatype: rdfs:Literal 
DataProperty: ex:name 
ObjectProperty: ex:status 
    
Class: ex:name-status 
    SubClassOf: 
        ex:name exactly 1 rdfs:Literal ,
        ex:status max 1 ({ ex:assigned , ex:unassigned }) ,
        ex:status min 0 owl:Thing 
    
Individual: ex:assigned 
Individual: ex:unassigned

Not a real proposal; instead used with a different interpretation…

OWL/ICV

OWL with unique name assumption and closed world

realizes common user expectations
terse syntax (Manchester)
no particular extensibility mechanism

Datatype: rdfs:Literal 
DataProperty: ex:name 
ObjectProperty: ex:status 
    
Class: ex:name-status 
    SubClassOf: 
        ex:name exactly 1 rdfs:Literal ,
        ex:status max 1 ({ ex:assigned , ex:unassigned }) ,
        ex:status min 0 owl:Thing 
    
Individual: ex:assigned 
Individual: ex:unassigned
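The difference between the two readings of the same axioms can be sketched as follows. Under the open-world reading, `max 1` with two values yields an inference (the values denote the same individual) and a missing `ex:name` yields nothing; under the closed-world, unique-name reading both are violations. Function names and the tuple encoding are assumptions for this sketch, and no reasoner is involved.

```python
STATUS_SET = {"ex:assigned", "ex:unassigned"}


def _values(triples, node, prop):
    return sorted({o for (s, p, o) in triples if s == node and p == prop})


def open_world_inferences(triples, node):
    """'ex:status max 1 {...}': extra values are inferred equal, not flagged."""
    statuses = [v for v in _values(triples, node, "ex:status") if v in STATUS_SET]
    return [(a, "owl:sameAs", b) for a, b in zip(statuses, statuses[1:])]


def closed_world_violations(triples, node):
    """Same axioms with unique names and a closed world."""
    violations = []
    if len(_values(triples, node, "ex:name")) != 1:
        violations.append("ex:name exactly 1")
    statuses = [v for v in _values(triples, node, "ex:status") if v in STATUS_SET]
    if len(statuses) > 1:
        violations.append("ex:status max 1")
    return violations
```

Note that the qualified `max 1` only counts values inside the enumerated class, so under either reading a misspelled status is not caught by this axiom alone.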

Resource Shapes/Description Set Profiles/TQ schema

specs: vocab and use cases
no particular extensibility mechanism

my:name-status a rs:ResourceShape ;
    rs:property [
        rs:name "name" ;
        rs:propertyDefinition foaf:name ;
        rs:valueType xsd:string ;
        rs:occurs rs:Exactly-one ;
    ] ;
    rs:property [
        rs:name "state" ;
        rs:propertyDefinition ex:state ;
        rs:allowedValue ex:unassigned , ex:assigned ;
        rs:occurs rs:Zero-or-one ;
    ] .

Shape Expressions

specs: semantics and compact syntax
provides apparent semantics for Resource Shapes (plus OR and semantic actions)
purpose-built user syntax
no commercial support
defined extensibility mechanism

my:name-status {
  ex:name xsd:string ,
  ex:status ( ex:unassigned ex:assigned )?
}

Semantics

my:name-status {
  ex:name xsd:string ,
  ex:status ( ex:unassigned ex:assigned )?
}

Datatype: rdfs:Literal 
DataProperty: ex:name 
ObjectProperty: ex:status 
    
Class: ex:name-status 
    SubClassOf: 
        ex:name exactly 1 rdfs:Literal ,
        ex:status max 1 ({ ex:assigned , ex:unassigned }) ,
        ex:status min 0 owl:Thing 
    
Individual: ex:assigned 
Individual: ex:unassigned

my:name-status a rs:ResourceShape ;
    rs:property [
        rs:name "name" ;
        rs:propertyDefinition foaf:name ;
        rs:valueType xsd:string ;
        rs:occurs rs:Exactly-one ;
    ] ;
    rs:property [
        rs:name "state" ;
        rs:propertyDefinition ex:state ;
        rs:allowedValue ex:unassigned , ex:assigned ;
        rs:occurs rs:Zero-or-one ;
    ] .

ASK {
    { SELECT ?S (COUNT(*) AS ?S_c0) {
      ?S foaf:givenName ?o .
    } GROUP BY ?S}
    { SELECT ?S (COUNT(*) AS ?S_c1) {
      ?S foaf:givenName ?o .
    } GROUP BY ?S}
    FILTER (?S_c0 = ?S_c1 &&
            ?S_c0 = 1)
    { SELECT ?S (COUNT(*) AS ?S_c2) {
      ?S ex:state ?o .
    } GROUP BY ?S HAVING (COUNT(*)=1)}
    { SELECT ?S (COUNT(*) AS ?S_c3) {
      ?S ex:state ?o .
      FILTER ((?o = ex:unassigned ||
               ?o = ex:assigned))
    } GROUP BY ?S HAVING (COUNT(*)=1)}
    FILTER (?S_c2 = ?S_c3 &&
            (?S_c2 = 0 || ?S_c2 = 1))
}

Issues:

Interaction with reasoning:
- Frequently bad if ontology "instantiates" items that should be flagged as missing.
- Can be useful post-unification/discrimination:
```
<Foo> { foaf:parent NONLITERAL{2} }
```
```
<X> foaf:parent [ foaf:mbox <mailto:a@example.com> ],
                [ foaf:mbox <mailto:a@example.com> ].
```
validating data as-is
- normal data doesn't have type arcs discriminating every possible use.
- don't require modification of the data
constraints crossing named graphs.
"type annotations"
- W3C XML Schema adds "types" to the PSVI
- Interacts with mutability.
- Resource Shapes/Shape Expressions don't talk about rdf:type
  OWL and SPIN do.
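The foaf:parent example under "Interaction with reasoning" can be made concrete. Counted as-is, `<X>` has two parents; after unifying nodes that share a foaf:mbox (an inverse-functional property in FOAF), only one distinct parent remains and the `{2}` cardinality fails. A minimal sketch, assuming a tuple encoding and a hypothetical `unify_by_key` helper that handles only the simple case of one key value per node:

```python
def unify_by_key(triples, key_predicate):
    """Merge nodes sharing a value for key_predicate (treated as inverse-functional)."""
    canonical = {}  # key value -> first node seen with that value
    rename = {}     # node -> node it is merged into
    for (s, p, o) in triples:
        if p == key_predicate:
            rename[s] = canonical.setdefault(o, s)
    return {(rename.get(s, s), p, rename.get(o, o)) for (s, p, o) in triples}


def parent_count(triples, node):
    return len({o for (s, p, o) in triples if s == node and p == "foaf:parent"})


DATA = [
    ("X", "foaf:parent", "_:b1"),
    ("X", "foaf:parent", "_:b2"),
    ("_:b1", "foaf:mbox", "mailto:a@example.com"),
    ("_:b2", "foaf:mbox", "mailto:a@example.com"),
]
```

With this data, `parent_count(DATA, "X")` and `parent_count(unify_by_key(DATA, "foaf:mbox"), "X")` differ, which is the case where validating after unification is the useful order.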

Terminology

context-free
- DTD: Person/Title is identical to Book/Title
context-sensitive
- Person/Title can have an "authority attribute"
regexp vs. Perl regexp
- recursive
```
(<(?:[^<>]++|(?1))*>)
```
- SPARQL regexps extremely limited
unique particle attribution
- predictable "regexp capture groups"
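The recursive Perl pattern above matches angle brackets nested to any depth, which a classical regular expression cannot do. Python's standard `re` module has no recursive patterns, so the equivalent test is sketched here with an explicit depth counter; `is_balanced_group` is a hypothetical helper that accepts a string only if the whole string is one balanced `<...>` group.

```python
def is_balanced_group(text):
    """True if text is exactly one '<...>' group with balanced nesting."""
    if not text.startswith("<"):
        return False
    depth = 0
    for index, char in enumerate(text):
        if char == "<":
            depth += 1
        elif char == ">":
            depth -= 1
            if depth == 0:
                # The outermost group closed; it must also end the string.
                return index == len(text) - 1
    return False  # ran out of input with brackets still open
```

The counter plays the role of the recursion stack, which is the extra power that separates recursive patterns from regular ones.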

Requirements

review "small" requirements group.

Questions

?