HCLS/ClinicalObservationsInteroperability/FDATherapeuticAreaOntologies/Validation

From W3C Wiki

Goals

Whether trials are submitted in RDF or submitted in e.g. SDTM and then converted to RDF, the TA ontologies define a form of submission "validity". This document describes how the TA ontologies define validation constraints and how those constraints may be invoked directly or translated to another system. Existing tooling can be used for:

  • Schema (ontology) Validation:
    • Detect errors in ontologies.
    • Enable error-free distributed development of ontologies.
  • Instance validation:
    • Leverage ontologies to perform validation.
    • Control inferences performed before validation.

The end of this document describes some use cases taken from OpenCDISC and identifies the simplest validation technology required to meet each of them.


State of the Art

Standardization work on validation is only just beginning. To date, vendor-neutral solutions involve piecemeal testing with SPARQL. While SPARQL is widely deployed, comprehensive queries to validate conformance with an ontology or profile are tedious to develop and maintain. Fortunately, a working group will convene in a month to begin developing standard solutions for validation of RDF instance data. In the meantime, several proprietary and non-standard solutions are available, described below.


Ontology validation

The goal of ontology validation is to detect, where possible, logically impossible assertions in the model. For instance, the following subclass of bridg:Diagnosis is logically inconsistent and is a good example of a modeling mistake:

 bridg:Diagnosis 
   rdfs:subClassOf bridg:PerformedObservationResult .
 my:ImpossibleObservation 
   rdfs:subClassOf bridg:Diagnosis , bridg:PerformedObservation .
 bridg:PerformedObservation 
   rdfs:disjointFrom bridg:PerformedObservationResult .
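The inconsistency above can be caught mechanically: compute each class's superclass closure and test it against the declared disjointness axioms. A minimal sketch in Python, using toy string names in place of the BRIDG IRIs:

```python
# Toy ontology mirroring the bridg:Diagnosis example above
# (plain strings stand in for the IRIs).
subclass = {
    "Diagnosis": {"PerformedObservationResult"},
    "ImpossibleObservation": {"Diagnosis", "PerformedObservation"},
}
disjoint = [("PerformedObservation", "PerformedObservationResult")]

def superclasses(cls):
    """All superclasses of cls, including cls itself (transitive closure)."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subclass.get(c, ()))
    return seen

def unsatisfiable(cls):
    """True if cls is forced under two classes declared disjoint."""
    supers = superclasses(cls)
    return any(a in supers and b in supers for a, b in disjoint)

print(unsatisfiable("ImpossibleObservation"))  # True
print(unsatisfiable("Diagnosis"))              # False
```

A real reasoner does far more (it handles the full OWL semantics), but this captures why my:ImpossibleObservation is unsatisfiable: its closure contains both members of a disjoint pair.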


Core ontology validation

The core.ttl file places the types from Charlie's metamodel into a hierarchy extended from BRIDG. It is quite easy to make modeling errors which will confuse or break systems. OWL reasoners can detect logically impossible ontologies, reporting either unsatisfiable classes or, if some instance claims to be of an unsatisfiable class, an inconsistent ontology. The latter is difficult to work with as it tends to contaminate all connected assertions, rendering all of the data "inconsistent".

Protégé flags unsatisfiable classes in red and labels them as subclasses of owl:Nothing. TopBraid Composer doesn't detect unsatisfiable classes without specifically testing with instance data.


Shared and TA ontology validation

Hand development of the TA ontologies could easily lead to the same sorts of mistakes, but the grammatical constraints of the .TA language prevent this sort of error (so long as the compilation process notices potentially conflated names).


Governance of ontology development

In principle, queries over the ontology can be used to detect e.g. multiple definitions of the same concept. If two TA ontologies (or two shared library ontologies) define a particular test with expected results in the same value set, those definitions are possibly redundant. When developing a new ontology, one can use this workflow to detect some redundancies:

  1. Query for all pairs of core:Observations with shared value set as an expected result, excluding those for which the value set is labeled sharedBetweenObservations.
  2. For each result, if the value set is context-neutral, label it with sharedBetweenObservations, else eliminate the definition in the new ontology and re-use the one from an existing ontology.

Similar processes can look for redundancies on observation codes (bridg:Activity.identifier).
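The workflow above amounts to grouping observation definitions by their expected-result value set and flagging the groups not yet labeled as shared. A sketch over toy records (the ontology, observation, and value-set names are illustrative, not from the actual TA ontologies):

```python
from collections import defaultdict

# Toy observation definitions: (ontology, observation, value_set),
# plus the value sets already labeled sharedBetweenObservations.
observations = [
    ("RenalTx", "GraftRejectionGrade", "RejectionGradeVS"),
    ("Hepatitis", "BiopsyRejectionGrade", "RejectionGradeVS"),
    ("RenalTx", "SerumCreatinine", "CreatinineVS"),
]
shared_between_observations = {"CreatinineVS"}

def possibly_redundant(defs, shared):
    """Group observations by shared value set, excluding labeled sets (step 1)."""
    by_vs = defaultdict(list)
    for onto, obs, vs in defs:
        if vs not in shared:
            by_vs[vs].append((onto, obs))
    return {vs: obs for vs, obs in by_vs.items() if len(obs) > 1}

print(possibly_redundant(observations, shared_between_observations))
```

Each reported group is then triaged by hand as in step 2: either the value set is context-neutral (label it sharedBetweenObservations) or one definition should be eliminated.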


Validating Instance Data

Given a simple bit of ontology:

 core:EndpointAssessment 
   rdfs:subClassOf 
     [ owl:onProperty core:hasObservationTime ; owl:allValuesFrom xsd:dateTime ] ,
     [ owl:onProperty core:hasObservationTime ; owl:cardinality 1 ] ,
     [ owl:onProperty core:beforeIntervention ; owl:allValuesFrom core:Observation ] ,
     [ owl:onProperty core:beforeIntervention ; owl:minCardinality 1 ] ,
     [ owl:onProperty core:hasIntervention    ; owl:allValuesFrom core:Intervention ] ,
     [ owl:onProperty core:hasIntervention    ; owl:minCardinality 1 ] ,
     [ owl:onProperty core:afterIntervention  ; owl:allValuesFrom core:Observation ] ,
     [ owl:onProperty core:afterIntervention  ; owl:minCardinality 1 ] .

the objective is that instance documents with

  • the wrong object type
  • the wrong object count
  • missing mandatory properties

be flagged as invalid. Descriptive error messages are certainly a big bonus.
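Conceptually, every approach described below performs some variant of three checks: value type, value count, and presence of mandatory properties. A minimal closed-world sketch in Python over a toy node representation (the constraint table mirrors the core:EndpointAssessment fragment above; identifiers are abbreviated):

```python
# Each constraint: (property, expected type, min count, max count or None).
constraints = [
    ("hasObservationTime", "xsd:dateTime",      1, 1),
    ("beforeIntervention", "core:Observation",  1, None),
    ("hasIntervention",    "core:Intervention", 1, None),
    ("afterIntervention",  "core:Observation",  1, None),
]

def validate(node, types):
    """Return readable errors; `types` maps each value to its type."""
    errors = []
    for prop, expected, lo, hi in constraints:
        values = node.get(prop, [])
        for v in values:
            if types.get(v) != expected:
                errors.append(f"{prop}: {v} is not a {expected}")
        if len(values) < lo:
            errors.append(f"{prop}: expected at least {lo}, found {len(values)}")
        if hi is not None and len(values) > hi:
            errors.append(f"{prop}: expected at most {hi}, found {len(values)}")
    return errors

# A node resembling the Closed World Assumption example below:
# beforeIntervention is absent.
node = {
    "hasObservationTime": ["2013-07-07T19:00:00Z"],
    "hasIntervention": [":adminImmunosuppressantB"],
    "afterIntervention": [":subjectsPostOpHour36GFR"],
}
types = {
    "2013-07-07T19:00:00Z": "xsd:dateTime",
    ":adminImmunosuppressantB": "core:Intervention",
    ":subjectsPostOpHour36GFR": "core:Observation",
}
print(validate(node, types))
# ['beforeIntervention: expected at least 1, found 0']
```

Note the "expected at least" error is only possible under a closed-world reading, which is the subject of the next section.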

Integrity Constraint Validator (ICV)

The most direct mapping from existing ontologies to executable instance validation is Stardog's ICV. It simply re-interprets the standard OWL semantics as constraints to be applied to a graph.


Closed World Assumption (CWA)

Standard OWL semantics assume that related information may exist in some document the processor has not seen. For this reason, it is impossible to flag missing assertions (the missing triples might exist somewhere). The CWA asserts that if the processor hasn't seen a triple, it should treat it as absent. Under the CWA, a validator should signal a missing core:beforeIntervention on the following data:

 :subjectsKidneyCSAR1 a core:EndpointAssessment ;
     core:hasObservationTime "2013-07-07T19:00:00Z"^^xsd:dateTime ;
     core:hasIntervention :adminImmunosuppressantB-2013-07-06T11-20 ;
     core:afterIntervention :subjectsPostOpHour36GFR .



Unique Name Assumption (UNA)

OWL only treats http://a.example/foo and http://b.example/bar as distinct when specifically told so:

 <http://a.example/foo> owl:differentFrom <http://b.example/bar> .

Such an enumeration must be exhaustive, including even typos, which is an unrealistic requirement. The opposite tactic, more appropriate for instance validation, is to assume that names are distinct unless they are asserted to be equivalent:

 <http://a.example/foo> owl:sameAs <http://b.example/bar> .
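The UNA reading is straightforward to implement as a union-find over owl:sameAs assertions: names denote distinct individuals unless explicitly merged. A sketch (the example IRIs are illustrative):

```python
def distinct_individuals(names, same_as):
    """Count individuals under UNA: distinct unless linked by owl:sameAs."""
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for a, b in same_as:
        parent[find(a)] = find(b)      # merge the two equivalence classes
    return len({find(n) for n in names})

names = ["http://a.example/foo", "http://b.example/bar", "http://c.example/baz"]
print(distinct_individuals(names, []))  # 3 (distinct by default under UNA)
print(distinct_individuals(names, [("http://a.example/foo",
                                    "http://b.example/bar")]))  # 2
```

This is the opposite of standard OWL, where the first call would be unanswerable without an exhaustive set of owl:differentFrom assertions.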

Issues

  • It may not be appropriate to assume that every ontology expresses validation constraints.

For instance, it may be an error if an EndpointAssessment is missing a beforeIntervention observation, but not if a study subject isn't explicitly associated with some protocol version. Without inference, it can be hard to know which types to validate. For instance, the data in the above example is more likely to be asserted to be of type RenalX:KidneyGraftCSARAssessment than of type core:EndpointAssessment. One possibility is to perform at least RDFS closure and label the inferred assertions so that none of them trigger a max cardinality error.

  • Overloads OWL semantics.

It's hard to know whether an owl:imports directive is importing an ontology or constraints (i.e. whether it should be interpreted with an open or closed world).

See Stardog ICV submission to RDF Validation Workshop and Stardog ICV documentation.

SPARQL

SPARQL is a powerful graph-matching language which can be used to test constraints. An external framework can invoke the queries and test the results. These queries enumerate the constraints for a core:EndpointAssessment:

 # core:hasObservationTime 1 xsd:dateTime
 ASK { { SELECT (COUNT(*) AS ?c)
          WHERE { ?this core:hasObservationTime ?time } }
       { SELECT (COUNT(*) AS ?c)
          WHERE { ?this core:hasObservationTime ?time .
                  FILTER (datatype(?time) = xsd:dateTime) } }
       FILTER (?c = 1) }
 # core:beforeIntervention 1+ core:Observation
 ASK { { SELECT (COUNT(*) AS ?c)
          WHERE { ?this core:beforeIntervention ?before } }
       { SELECT (COUNT(*) AS ?c)
          WHERE { ?this core:beforeIntervention ?before .
                  ?before a core:Observation } }
       FILTER (?c >= 1) }
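The external framework itself is simple: bind ?this to each candidate node, run each ASK, and collect failures. A sketch with Python check functions standing in for the compiled ASK queries (no SPARQL engine assumed; the graph is a plain list of triples):

```python
# Each check stands in for one ASK query: it receives the graph and the
# focus node bound to ?this, and returns True when the constraint holds.
def has_one_observation_time(graph, this):
    times = [o for s, p, o in graph
             if s == this and p == "core:hasObservationTime"]
    return len(times) == 1

def has_before_intervention(graph, this):
    return any(s == this and p == "core:beforeIntervention"
               for s, p, o in graph)

def run_checks(graph, nodes, checks):
    """Invoke every check for every node; return (node, check name) failures."""
    return [(n, c.__name__) for n in nodes for c in checks if not c(graph, n)]

graph = [
    (":ea1", "core:hasObservationTime", "2013-07-07T19:00:00Z"),
    (":ea1", "core:afterIntervention", ":obs2"),
]
print(run_checks(graph, [":ea1"],
                 [has_one_observation_time, has_before_intervention]))
# [(':ea1', 'has_before_intervention')]
```

This per-node invocation is exactly the role SPIN (next section) standardizes with its ?this variable.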


Issues

  • Exhaustive enumeration of elements in ontology.
  • Requires external framework to invoke queries and look for expected results.


SPARQL Inferencing Notation (SPIN)

SPIN extends the SPARQL language with the notion of a ?this variable (or spin:_this when in RDF graphs). SPIN constraints attached to classes by the spin:constraint predicate imply a constraint which is universal for that type. For example, one can use SPIN constraints to invoke the above constraints for core:EndpointAssessment:

 core:EndpointAssessment 
   spin:constraint 
     [ a sp:Ask ;
       sp:text 
         """ASK { { SELECT (COUNT(*) AS ?c)
                     WHERE { ?this core:hasObservationTime ?time } }
                  { SELECT (COUNT(*) AS ?c)
                     WHERE { ?this core:hasObservationTime ?time .
                             FILTER (datatype(?time) = xsd:dateTime) } }
                  FILTER (?c = 1) }""" ] ,
     [ a sp:Ask ;
       sp:text 
         """ASK { { SELECT (COUNT(*) AS ?c)
                     WHERE { ?this core:beforeIntervention ?before } }
                  { SELECT (COUNT(*) AS ?c)
                     WHERE { ?this core:beforeIntervention ?before .
                             ?before a core:Observation } }
                  FILTER (?c >= 1) }""" ]
     # etc ...
     .

TopBraid Composer has an extension function called spin:objectCount which, combined with declared SPIN functions, enables a terse expression like that of Resource Shapes (following):

 core:EndpointAssessment 
   spin:constraint 
     [ a spl:ObjectCountPropertyConstraint ;
       arg:property core:hasObservationTime ;
       arg:count 1 ;
       rdfs:range xsd:dateTime
     ] ,
     [ a spl:ObjectCountPropertyConstraint ;
       arg:property core:beforeIntervention ;
       arg:minCount 1 ;
       rdfs:range core:Observation
     ] 
     # etc ...
     .


Issues

  • Overloads RDFS and OWL semantics (e.g. rdfs:range above).

This raises problems like understanding which interpretation is expected from an owl:imports.

  • Tying constraints to particular types may be too global, e.g. if some TA ontologies have different constraints for shared classes.


Resource Shapes

Resource Shapes is a vocabulary for describing how a service uses a collection of ontologies, e.g. for input to or output from a service. It can require that a specific type arc be present, but in the general case does not require particular type annotations:

 my:EndpointAssessmentShape a rs:ResourceShape ;
   rs:property 
     [ rs:propertyDefinition core:hasObservationTime ;
       rs:valueType xsd:dateTime ;
       rs:occurs rs:Exactly-one 
     ] ,
     [ rs:propertyDefinition core:beforeIntervention ;
       rs:range core:Observation ;
       rs:occurs rs:One-or-many 
     ] 
     # etc ...
     .


Issues

  • Existing tooling embedded in large OSLC infrastructure.
  • Does not directly leverage ontology.


Shape Expressions Compact Syntax (ShExC)

ShExC provides a human-facing syntax for Resource Shapes. Its expressivity is slightly greater than that of Resource Shapes (adding Or groups and Optional groups):

 my:EndpointAssessmentShape {
   core:hasObservationTime xsd:dateTime,
   core:beforeIntervention core:Observation+,
   core:hasIntervention    core:Intervention+,
   core:afterIntervention  core:Observation+
 }


Issues

  • Not supported by major vendors.
  • Expressivity much lower than SPARQL (like ICV and Resource Shapes).


Compiling to SPARQL

ICV and Shape Expressions compile to SPARQL queries. This provides a ubiquitous execution environment, albeit with a potential performance sacrifice. The resulting SPARQL queries are not intended to be maintained by hand, so adopting a language which compiles to SPARQL commits one to doing future maintenance in that source language.


Instance Validation Use Cases

OpenCDISC SDTM rules provide examples of conventional instance validation. The majority of these fall into five categories. Below each example is the simplest of the validation approaches which can be used to detect the error.

Value set

Value of a variable is from a value set (CT0001-CT0080).

CT0024: "Value for SCSTRESC not found in (MARISTAT) CT codelist". Character Result/Finding in Std Format (SCSTRESC) variable values should be populated with terms found in the 'Marital Status' (C76348) CDISC controlled terminology codelist when Subject Characteristic Short Name (SCTESTCD) is 'MARISTAT'.

Assume this fragment of an ontology with a property maritalStatusCode using a value set cdisc:C76348:

cdisc:C76240 rdfs:label "ANNULLED" .
cdisc:C51776 rdfs:label "DIVORCED" .
cdisc:C53262 rdfs:label "DOMESTIC PARTNER" .
cdisc:C76241 rdfs:label "INTERLOCUTORY" .
cdisc:C51777 rdfs:label "LEGALLY SEPARATED" .
cdisc:C51773 rdfs:label "MARRIED" .
cdisc:C51774 rdfs:label "NEVER MARRIED" .
cdisc:C76242 rdfs:label "POLYGAMOUS" .
cdisc:C51775 rdfs:label "WIDOWED" .

cdisc:C76348 a owl:Class ;
  rdfs:label "MARISTAT" ;
  owl:oneOf (cdisc:C76240 cdisc:C51776 cdisc:C53262 cdisc:C76241
  cdisc:C51777 cdisc:C51773 cdisc:C51774 cdisc:C76242 cdisc:C51775) .

bridg:Person rdfs:subClassOf 
  [ owl:onProperty bridg:Person.maritalStatusCode ; owl:allValuesFrom cdisc:C76348 ] .

We can test the constraint with the above technologies:

OWL/ICV requires no additional assertions; however, the CWA/UNA interpretation requires non-standard (and currently proprietary) software.

ShEx (minimum expressivity):

bridg:Person {
  bridg:Person.maritalStatusCode (cdisc:C76240 cdisc:C51776 cdisc:C53262 cdisc:C76241 
                     cdisc:C51777 cdisc:C51773 cdisc:C51774 cdisc:C76242 cdisc:C51775)
}

SPIN:

bridg:Person
  rdf:type owl:Class ;
  spin:constraint [
      rdf:type sp:Ask ;
      sp:text """ASK { ?this bridg:Person.maritalStatusCode ?status
      FILTER (?status IN (cdisc:C76240, cdisc:C51776, cdisc:C53262,
              cdisc:C76241, cdisc:C51777, cdisc:C51773,
              cdisc:C51774, cdisc:C76242, cdisc:C51775))
      }"""^^xsd:string ;
    ] .
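Whichever framework invokes it, the underlying test is simple set membership. A minimal sketch, with local codes standing in for the cdisc: IRIs above:

```python
# MARISTAT value set (cdisc:C76348) from the ontology fragment above.
MARISTAT = {"C76240", "C51776", "C53262", "C76241",
            "C51777", "C51773", "C51774", "C76242", "C51775"}

def check_marital_status(code):
    """CT0024: maritalStatusCode must come from the MARISTAT codelist."""
    return code in MARISTAT

print(check_marital_status("C51773"))  # True  (MARRIED)
print(check_marital_status("C99999"))  # False (not in the codelist)
```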

Graph consistency

Mutual properties both set or neither set.

SD0009: "No qualifiers set to 'Y', when AE is Serious". When Serious Event (AESER) variable value is 'Y', then at least one of the seriousness criteria variables is expected to have value 'Y' (Involves Cancer (AESCAN), Congenital Anomaly or Birth Defect (AESCONG), Persist or Signif Disability/Incapacity (AESDISAB)).

Most of these can be captured as structural constraints on the data, such as those described in Validating Instance Data, noting especially that not all of the ontology may be treated as constraints. The minimum expressivity will usually be ShEx and the minimum redundancy will be OWL/ICV.
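Rule SD0009 above is a conditional co-occurrence check. A sketch over a toy adverse-event record, using the three seriousness qualifiers cited in the rule (the real rule lists more):

```python
SERIOUSNESS_QUALIFIERS = ["AESCAN", "AESCONG", "AESDISAB"]

def check_sd0009(ae):
    """If AESER is 'Y', at least one seriousness qualifier must also be 'Y'."""
    if ae.get("AESER") != "Y":
        return True  # the rule only applies to serious events
    return any(ae.get(q) == "Y" for q in SERIOUSNESS_QUALIFIERS)

print(check_sd0009({"AESER": "Y", "AESCAN": "Y"}))  # True
print(check_sd0009({"AESER": "Y", "AESCAN": "N"}))  # False (violation)
print(check_sd0009({"AESER": "N"}))                 # True (not serious)
```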

Reporting format consistency

SDTM consistency with define.xml, duplicate sequence numbers, etc.

SD0063: "SDTM/dataset variable label mismatch". Variable Label in the dataset should match the variable label described in SDTM. When creating a new domain, Variable Labels could be adjusted as appropriate to properly convey the meaning in the context of the data being submitted.

These are mostly not required for RDF. Consistency between resources is covered by a graph consistency test over the merge of the submission data and metadata.

Lexical validation

Inappropriate length, invalid ISO 8601 value, excessive precision.

SD0015: "Negative value for --DUR". Non-missing Duration of Event, Exposure or Observation (--DUR) value must be greater than or equal to 0.

Most of these can be handled by RDF's use of XSD datatypes. For instance, xsd:dateTime has a definition which restricts the lexical space to ISO 8601-compliant forms. OWL axioms can extend this to e.g. test the year range:

my:TwentyfirstCenturyTime a rdfs:Datatype ;
  owl:onDatatype xsd:dateTime ;
  owl:withRestrictions ( [ xsd:pattern "20[0-9]{2}-[0-9]{2}-[0-9]{2}T.*" ] ) .
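The same facet can be checked directly with a regular expression; the sketch below anchors the year-range pattern used above and checks only the date part (xsd:dateTime's own lexical rules would catch the rest):

```python
import re

# Date part must fall in 2000-2099, per the restriction above.
TWENTYFIRST_CENTURY = re.compile(r"^20[0-9]{2}-[0-9]{2}-[0-9]{2}")

def in_range(lexical):
    """True if the dateTime's date part falls in the 21st-century range."""
    return bool(TWENTYFIRST_CENTURY.match(lexical))

print(in_range("2013-07-07T19:00:00Z"))  # True
print(in_range("1999-12-31T23:59:59Z"))  # False
```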

Value validation

date comparison

SD0012: "--STDY is after --ENDY". Study Day of Start of Event, Exposure or Observation (--STDY) must be less than or equal to Study Day of End of Event, Exposure or Observation (--ENDY).

SPARQL/SPIN is the only way to capture the value comparison (or ShEx semantic actions, but that's effectively the same expressivity). An analogous test to make sure that the time of a bridg:PerformedObservationResult is not before its corresponding bridg:PerformedObservation would look like:

bridg:PerformedObservationResult 
  rdf:type owl:Class ;
  spin:constraint [
      rdf:type sp:Ask ;
      sp:text """ASK { ?obs a bridg:PerformedObservation ;
                         bridg:PerformedActivity.dateRange ?obsdt ;
                         bridg:PerformedObservation.resultsInPerformedObservationResult ?this .
                       ?this bridg:PerformedObservationResult.reportedDate ?resDate
      FILTER (?resDate >= ?obsdt)
      }"""^^xsd:string ;
    ] .
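Outside SPARQL, the same comparison is straightforward once the ISO 8601 lexical forms are parsed. A sketch, assuming UTC timestamps with a trailing 'Z' as in the examples above:

```python
from datetime import datetime

def parse(ts):
    """Parse a UTC ISO 8601 timestamp of the form YYYY-MM-DDThh:mm:ssZ."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def result_not_before_observation(obs_dt, result_dt):
    """The result's reported date must not precede the observation's date."""
    return parse(result_dt) >= parse(obs_dt)

print(result_not_before_observation("2013-07-06T11:20:00Z",
                                    "2013-07-07T19:00:00Z"))  # True
print(result_not_before_observation("2013-07-07T19:00:00Z",
                                    "2013-07-06T11:20:00Z"))  # False
```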

Impact on TA ontologies

Some validation rules suggest there is value in adding generalizations to the TA ontologies. As an example, the current ontologies don't have a flag for baseline observations (they are simply used as the beforeIntervention observation in core:EndpointAssessments). Adding such a marker would enable this rule, though it's probably redundant with rules on core:EndpointAssessments:

SD0006: "No baseline result in [Domain] for subject". All subjects should have at least one baseline observation (--BLFL = 'Y') in the EG, LB, QS, and VS domains, except for subjects who failed screening (ARMCD = 'SCRNFAIL') or were not fully assigned to an Arm (ARMCD = 'NOTASSGN').

Resources