Difference between revisions of "User Stories"

From RDF Data Shapes Working Group
(resolve https://www.w3.org/2014/data-shapes/track/issues/6)

== Status of this Document ==
 
 
Per [http://www.w3.org/2014/12/18-shapes-minutes#resolution02 decision 18 Dec 2014], new user stories are added only by WG decision.
 
These are neither final nor approved; the WG will continue to refine them.
 
 
 
== Stakeholders ==

* Data modelers
Input ontology: the ontology that represents RDFS (or OWL) syntax

==== OWL constraints (Stardog ICV?) ====

Each property has to have a specified domain that is a class:
Created by: Peter F. Patel-Schneider

For a tool that builds a list of names for named entity resolution of people to work correctly, every person has to have one or more names specified, each of which is a string.  Constraints can be used to verify that particular sets of data have such names for each person.
 
Before an RDF graph was fed into the tool, the constraints would be run over the graph.  Only if the constraints were not violated would the tool actually be run.  The constraints would be separate from the ontology involved, as other uses of the same ontology would not necessarily have the requirement that all instances of person have their names specified.

The envisioned calling setup would be something like

<code>
verify(<graph containing information about people>,<person ontology>,<constraints>)
</code>

This is probably the simplest user story, both in terms of setup and requirements.  It illustrates the basic idea of checking to see whether for some RDF graph all nodes that belong to some type have the right kind of information specified for them.  The same story can be repeated for other tools that need particular information specified for the tool to work correctly.
  
 
==== OWL constraints (Stardog ICV) ====
  
 
==== ShExC ====

<code>
  my:PersonShape [http://www.w3.org/Submission/2014/SUBM-shapes-20140211/#describes oslc:describes] rdf:Person .

  my:PersonShape { ex:name xsd:string }
</code>

(PFPS: This needs some indication on how ShExC indicates that all and only instances of person need to satisfy this shape.)

(ericP: propose that '''{my:PersonShape oslc:describes rdf:Person .}''' resolves this.)
 
=== S3: Communicating back to users, kindly ===
 
=== S4: Issue repository ===

Created by: Eric Prud'hommeaux

An LDP Container <http://PendingIssues> accepts an IssueShape with a status of "assigned" or "unassigned". The LDP Container is an interface to a service storing data in a conventional relational database. The shapes are "closed" in that the system rejects documents with any triples for which it has no storage. The shapes validation process (initiated by the receiving system or a sender checking) rejects any document with "extraneous" triples.

Any node in the graph may serve multiple roles, e.g. the same node may include properties for a SubmittingUser and for an AssignedEmployee.

Later the issue gets resolved and is available at <http://OldIssues> without acquiring new type arcs. The constraints for <http://PendingIssues> are different from those for Issues at <http://OldIssues>.

(PFPS:  A story is needed here!)

(ericP: propose that this story is now sufficient.)

==== ShEx ====
 
Created by: Holger Knublauch

EPIM Project - petroleum operators on the Norwegian continental shelf need to produce environmental reports of what chemicals were dumped into the sea and what gases were released to the air. There is a need for access rules on which operators can see which data from which oil and gas fields, and for complex constraints to run during import of XML files. SPIN was used to represent and evaluate those constraints.

This is an example of very complex constraints that require many features from SPARQL to represent model-specific scenarios, including the comparison of incoming values against a controlled fact base, transformations from literal values to URIs, string operations, date comparisons etc. User-defined SPIN functions were used to make those complex queries maintainable.

Details: [[EPIM ReportingHub]]

=== S6: Closed-world recognition, e.g. for partial ontology import ===
 
  ex:Bond <= all ex:valid ( exists ex:endTime xsd:date )

</code>

The first constraint says that every instance of ex:Contract has to have a provided value for ex:valid that is an ex:TimeInterval.  The second constraint says that every provided value for ex:valid for every instance of ex:Bond has to have a provided value for ex:endTime that is an xsd:date.

(HK: Looks easy to represent in SPIN but I do not understand the syntax above, so I cannot provide an example at this stage)
  
 
=== S11: Model-Driven UI constraints ===

There is a need for constraints to provide model-driven validation of permissible values in user interfaces. A number of solutions and applications have been deployed which use SPIN to check constraints on permissible values in user interfaces. This overcomes the software debt that comes from using JavaScript that can readily become out-of-sync with the underlying models.

The major requirement here is a declarative model of
  
 
A meta-requirement here is to be able to make use of the information above without having to run something like SPARQL queries, i.e. the model should be sufficiently high level so that all kinds of tools can use that information. However, at the same time there are many advanced constraints that need to be validated (either on server or client) before a form can be submitted. These constraints are not necessarily "structural" information, but rather executable code that returns error messages.

Details about an existing implementation in TopBraid: [[Ontology-Driven Forms]].
  
 
(PFPS: Why can't an ontology be used to provide this information?  What makes constraints/shapes better than an ontology?)

ArthurRyman: @PFPS the W3C standards for RDFS and OWL do not define them in terms of constraints. Shapes are explicitly defined in terms of expected properties and constraints.

PFPS: RDFS and OWL do not define what in terms of constraints?  I see no reason why an OWL ontology would be inappropriate for providing information about what information is or can be present in a particular set of RDF graphs.  Perhaps there is something here that cannot be handled by OWL, but without examples it is very hard to determine just what is needed.

=== S20: Creation Shapes ===
  
 
(PFPS:  Why are constraints/shapes better here than an ontology?  Is it the difference between what is expected to be provided and what ends up being inferred?  If so, what gains come from being able to make this distinction?)

(kc: The difference I see is that these are closed world requirements, and your ontology may be intended for open world use. So this is the difference between validation and inferencing, CWA and OWA, NUNA and UNA. Many of the CWA's in library data are not available or useful in the OW. I can provide examples. I'm not assuming that people create closed world RDF or OWL ontologies, since to me that is a contradiction.)

(PFPS:  For recognition, the open and closed worlds are much closer, so there needs to be something here that indicates where open world doesn't work.)

=== S21: SKOS Constraints ===
 
Modified by: Karen Coyle

Can we express validating rdf:Lists in our framework? This is more than just a stress test: a variation of this can be used to check whether all members of a list have certain characteristics.

Libraries have a number of resources that are issued in ordered series. Any library may own or have access to some parts of the series, either sequential or with broken sequences. The list may be very long, and it is often necessary to display the list of items in order. The order can be nicely numerical, or not. Another ordered list use case is that of authors on academic journal articles. For reasons of attribution (and promotion!), the order of authors in article publishing can be significant. This is not a computable order (e.g. alphabetical by name). There are probably other cases, but essentially there will definitely be a need to have ordered lists for some data. Validation could be: the list must have a beginning and end; there can be/cannot be gaps in the list.
  
 
HK: A variation of this is very well a real story. We often have the requirement to formalize that a given rdf:List should only have values of certain types in it. It's a bit like with Java generics, where you can write List<Person> to parameterize a generic List class. This is currently missing from the RDF syntax, but could be represented as an additional constraint on a property that has rdf:List value type.

PFPS:  How about someone then put the story information at the beginning of the section?

HK: Ok, done.
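
A minimal sketch of such a check, not taken from the original page, written as a SPARQL ASK constraint in the SPIN style used elsewhere in this document (ex:Document, ex:authorList and ex:Person are hypothetical names):

    ex:Document
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if any member of the ex:authorList rdf:List is not an ex:Person
                ASK {
                    ?this ex:authorList ?list .
                    ?list rdf:rest*/rdf:first ?member .
                    FILTER NOT EXISTS { ?member a ex:Person }
                }"""
        ] .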
 
  
 
=== S27: Relationships between values of multiple properties ===
 
=== S27: Relationships between values of multiple properties ===
Line 679: Line 657:
 
Story: (kc) Cultural heritage data is created in a distributed way, so when data is gathered together in a single aggregation, quite a bit of checking must be done. One of the key aspects of CH data is the identification of persons and subjects, in particular relating them to historical contexts. For persons, a key context is their own birth and death dates; for events, there is often a date range representing a beginning and end of the event. In addition, there are cultural heritage objects that exist over a span of time (serial publications, for example). In each of these cases, it is desirable to validate the relationship of the values of properties that have temporal or other ordered characteristics.

Details: [[Constraining the order of different properties|Relationships between values of different properties]]

[PFPS: It would be nice to have a story here.]

[HK: Why is this not a story - the use cases in schema.org are obvious and real].

[PFPS:  I suppose that this could be turned into a story, but the motivational part is missing from this document.]

[HK: I have clarified that this is essentially about input validation; not sure what else to do here.]

pfps: kc's story appears to fit the bill here.
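
A hedged sketch of one such ordering check, not taken from the original page; ex:birthDate and ex:deathDate are hypothetical property names, and the SPIN ASK style follows the other examples in this document:

    ex:Person
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if a person's death date precedes their birth date
                ASK {
                    ?this ex:birthDate ?birth .
                    ?this ex:deathDate ?death .
                    FILTER (?death < ?birth)
                }"""
        ] .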
  
 
=== S28: Self-Describing Linked Data Resources ===
 
 
Created by: Holger Knublauch

In Linked Data related information is accessed by URI dereferencing.  The information that is accessible this way may represent facts about a particular resource, but also typing information for the resource. The types can themselves be used in a similar way to find the ontology describing the resource. It should be possible to use these same mechanisms to find constraints on the information provided about the resource.

For example, the ontology could include constraints or could point to another document that includes constraints. Or the first document accessed might include constraints or point to another document that includes constraints.

(Old version: This is probably the default requirement from a Linked Data perspective: Given a resource URI, tell me all you know about it. The standard procedure is to look up the URI to retrieve the triples for this URI. The next step in RDF/OWL is to look for rdf:type triples and then follow those URIs to look up the class definitions. In OWL, those class definitions often carry owl:Restrictions. In SPIN, those class definitions would carry spin:constraints.)

DCMI story: For some properties there is a requirement that the value IRI resolve to a resource that is a skos:Concept. The resource value is not limited to a particular skos:Concept scheme.

[PFPS: This may be something that should be done by a constraint system, but there doesn't appear to be any constraint or shape story here.]

[HK: The shape story here is that the linked data produced by the server would not only return the class definition, but also the properties and further constraints of that class. This information can then be used in many ways, constraint checking among them. The point of my story here is that this architecture should be linked data friendly, i.e. have transparent mechanisms to retrieve missing information before constraint checking can happen.]

[PFPS: So this is not about the constraints, but about how constraints are accessed?]

[HK: Yes, about how constraints are associated with the starting point (Resource) and then accessed for execution.]

=== S29: Describing interoperable, hypermedia-driven Web APIs (with Hydra) ===
 
Created by: Holger Knublauch

Hydra http://www.hydra-cg.com/ is a lightweight vocabulary to create hypermedia-driven Web APIs. By specifying a number of concepts commonly used in Web APIs it enables the creation of generic API clients. The Hydra core vocabulary can be used to define classes and "supported properties" which carry additional metadata such as whether the property is required and whether it is read-only. This feels very similar to the OSLC Resource Shapes story and uses similar constructs. It is also possible to express the supported properties as a SPIN constraint check, as implemented here: http://topbraid.org/spin/spinhydra

[PFPS: This appears to be very similar to S11.  Only one of them should survive.]

[HK: I am sure this story and S11 will produce similar requirements. My understanding of the Stories step was to ground the requirements on real use cases, so here is one. Even if it produces the same requirements, it is helpful to have it written up.]

=== S30: PROV Constraints ===
  
 
The PROV Family of Documents http://www.w3.org/TR/prov-overview/ defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web. One of these documents is a library of Constraints http://www.w3.org/TR/2013/REC-prov-constraints-20130430/ which defines valid PROV instances. The actual validation process is quite complex and requires a normalization step that can be compared to rules. Various implementations of this validation process exist, including a set of SPARQL INSERT/SELECT queries sequenced by a Python script (https://github.com/pgroth/prov-check/blob/master/provcheck/provconstraints.py), an implementation in Java (https://provenance.ecs.soton.ac.uk/validator/view/validator.html) and in Prolog (https://github.com/jamescheney/prov-constraints). Stardog also defines an "archetype" for PROV, which seems to be implemented in SPARQL using their ICV engine (http://docs.stardog.com/admin/#sd-Archetypes).

PFPS:  It would be useful to pull out a few examples from this story to show what expressive power is needed for this story.

=== S31: LDP: POST content to Container of a certain shape ===
 
# if status is "Complete" end time is required.

The client side does not have access to any triple store/LDP container. If these validations can be expressed in a higher-level language which makes it simpler for clients to implement them, constraint systems will be useful in more places.

PFPS:  I'm having a very hard time trying to figure out why clients working in a disconnected mode need a higher-level language.  I'm also having a very hard time trying to figure out why the need for a higher-level language is tied to constraints between different properties.
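
A hedged sketch of the check from the list above (a status of "Complete" requires an end time), not taken from the original page; ex:Task, ex:status and ex:endTime are hypothetical names:

    ex:Task
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if a completed task has no end time
                ASK {
                    ?this ex:status "Complete" .
                    FILTER NOT EXISTS { ?this ex:endTime ?end }
                }"""
        ] .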
  
 
=== S33: Structural validation for queriability ===

  </nowiki>

There is no path from the acc:AccessContextList node to either of the acc:AccessContext nodes. There is an implicit containment relation of acc:AccessContext nodes in the acc:AccessContextList by virtue of these nodes being in the same information resource. However, the designers of this representation were attempting to eliminate clutter and appeal to Javascript developers, so they did not define explicit containment triples.

PFPS: I don't see how the "containment relation" can be considered to be something that can take part in an RDF discussion.  I would need a lot of convincing that this story should be considered at all.

DK: This user story is also covered by ''S36: Large-scale dataset validation''; in large databases the graph may not be connected.

=== S36: Support use of inverse properties ===

Revision as of 19:32, 8 January 2015


Stakeholders

  • Data modelers
  • software developers (because they may be doing ontology-driven applications)
  • clinical informaticians
  • data creators
  • data stewards
  • systems analysts/analysts
  • data scientists
  • data re-users
  • ontology modelers
  • user interface (UI) developers
  • anyone on the web creating web pages?
  • API designers
  • API consumers (not necessarily RDF knowledgeable)
  • tool developers (consume shapes as metadata)
  • Business Analysts
  • Devices/tools/services in the IoT
  • system security engineers
  • people with non-RDF legacy systems
  • data aggregators
  • data migration engineers
  • test engineers (software testers but also application conformance testers)
  • integration test engineers
  • W3C standards creators
  • data quality engineers
  • reference data managers

Themes

T1: Recursive Structures (S4, S9, S21)

T2: Model/Data Validation (S1, S2, S21)

T3: Dataset Partitions (S17, S18)

T4: Compliance/Governance (S21)

T5: Closed-world recognition for access control (S5), for partial ontology import (S6)

T7: Value Validation (S11)

T8: Interoperability (S12, S14)

T9: Nuanced validation (S3)

Stories

S1: The model's Broken!

Created by: Dean Allemang

Validate RDFS (maybe also OWL) models

The basic issue here is to ensure that the right kind of information is given for each property (or class) in the model, for example, to require that each property has to have a domain, or that classes have to be explicitly stated to be under some decomposition.

Input data: the RDF representation of an RDFS (or OWL) ontology

Input ontology: the ontology that represents RDFS (or OWL) syntax

OWL constraints (Stardog ICV?)

Each property has to have a specified domain that is a class:

  rdf:Property <= exists rdfs:domain rdfs:Class

Each class has to be specified to be under the top-level decomposition:

  rdfs:Class <= { rdfs:Class, [and the other built-in classes] } union fills rdfs:subClassOf { ex:Endurant, ex:Perdurant }

Note: Because this story works with the built-in RDF, RDFS, and OWL vocabulary, the prohibition of using this vocabulary in OWL axioms would have to be lifted.

SPIN

Example: Each property has to have a domain

   rdf:Property
       spin:constraint [
           sp:text "ASK { NOT EXISTS { ?this rdfs:domain ?anyDomain } }"
       ]
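
A sketch of the second constraint (every class must sit under the top-level decomposition) in the same SPIN style; this block is not from the original page, it reuses ex:Endurant and ex:Perdurant from the OWL constraints example above, and it ignores the built-in classes for brevity:

    rdfs:Class
        spin:constraint [
            sp:text """
                # Violated if the class is not (a subclass of) ex:Endurant or ex:Perdurant
                ASK {
                    FILTER NOT EXISTS {
                        ?this rdfs:subClassOf* ?top .
                        FILTER (?top IN (ex:Endurant, ex:Perdurant))
                    }
                }"""
        ]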

S2: What's the name of that person?

Created by: Peter F. Patel-Schneider

For a tool that builds a list of names for named entity resolution of people to work correctly, every person has to have one or more names specified, each of which is a string. Constraints can be used to verify that particular sets of data have such names for each person.

OWL constraints (Stardog ICV)

 Person <= exists name xsd:string & all name xsd:string

SPIN

   ex:Person
       spin:constraint [
           sp:text "ASK { FILTER NOT EXISTS { ?this ex:name ?anyName } }" 
       ] ;
       spin:constraint [
           sp:text "ASK { ?this ex:name ?name . FILTER (datatype(?name) != xsd:string) }" 
       ] ;

or using Resource Shapes as a SPIN template

   ex:Person spin:constraint [  # or oslc:property 
       a oslc:Property ;
       oslc:propertyDefinition ex:name ;
       oslc:occurs oslc:One-or-many ;
       oslc:valueType xsd:string ;
   ]

ShExC

 my:PersonShape { ex:name xsd:string }

(PFPS: This needs some indication on how this is turned into a constraint.)

S3: Communicating back to users, kindly

Rather than giving a simple yes/no answer, rejecting a lot of data and discouraging users, have a number of responses that inform users of ways they could improve their data, while still accepting all but the truly unusable data. This requires levels of "validation".

SPIN

In TopBraid EVN (a web-based data entry tool), we have instance edit forms with an OK button. When OK is pressed, a server callback is made to verify whether any constraints have been violated. If violations exist, they are presented to the user and, depending on the severity and server settings, the user may continue without fixing the errors. SPIN can represent constraints at various severity levels (see also http://spinrdf.org/spin.html#spin-constraint-construct):

  • spin:Fatal: We can stop checking immediately, no way to continue
  • spin:Error: Something should really be fixed
  • spin:Warning: Just report it back to the user but don't block him
  • spin:Info: Just to print something out, e.g. for debugging.

Here is an example in SPIN, using the CONSTRUCT notation to produce a constraint violation - additional properties could be attached to each report, including pointers at the triple that is causing the issue. Not shown here, SPIN even has the ability to point at an INSERT/DELETE update query to fix a violation.

   kennedys:Person
      spin:constraint
         [ a       sp:Construct ;
           sp:text """
               CONSTRUCT {
                   _:violation a spin:ConstraintViolation ;
                        spin:violationRoot ?this ;
                        spin:violationPath kennedys:spouse ;
                        spin:violationValue ?spouse ;
                        spin:violationLevel spin:Warning ;
                        rdfs:label "Same-sex marriage not permitted (in this model)"
               }
               WHERE {
                   ?this kennedys:spouse ?spouse .
                   ?this kennedys:gender ?gender .
                   ?spouse kennedys:gender ?spouseGender .
                   FILTER (?gender = ?spouseGender) .
               }"""
         ] .

S4: Issue repository

Created by: Eric Prud'hommeaux

An LDP Container <http://PendingIssues> accepts an IssueShape with a status of "assigned" or "unassigned". The LDP Container is an interface to a service storing data in a conventional relational database. The shapes are "closed" in that the system rejects documents with any triples for which it has no storage. The shapes validation process (initiated by the receiving system or a sender checking) rejects any document with "extraneous" triples.

Any node in the graph may serve multiple roles, e.g. the same node may include properties for a SubmittingUser and for an AssignedEmployee.

Later the issue gets resolved and is available at <http://OldIssues> without acquiring new type arcs. The constraints for <http://PendingIssues> are different from those for Issues at <http://OldIssues>

(PFPS: A story is needed here!) (ericP: propose that this story is now sufficient.)

ShEx

An LDP Container <http://PendingIssues> accepts an IssueShape with a status of "assigned" or "unassigned".

 <http://mumble/Issue1> ex:status ex:assigned . # not resolved

Later the issue gets resolved and is available at <http://OldIssues> without acquiring new type arcs. The constraints for Issues at <http://PendingIssues>

 <PendingIssuesShape> { ex:status (ex:unassigned ex:assigned) }

are different from those for Issues at <http://OldIssues>

 <OldIssuesShape> { ex:status (ex:resolved) }

OWL constraints (Stardog ICV)

 issue & status="assigned" <= [constraints for assigned issues]
 issue & status="resolved" <= [constraints for resolved issues]

SPIN

In SPIN, such scenarios can be expressed by injecting pre-conditions into the constraint, e.g.

   ex:Issue
       spin:constraint [
           sp:text """
               # This constraint applies with status "assigned" only
               ASK {
                   ?this ex:status "assigned" .
                   ... the actual tests
               }"""
       ]

S5: Closed-world recognition (EPIM ReportingHub)

Created by: Holger Knublauch

EPIM Project - petroleum operators on the Norwegian continental shelf need to produce environmental reports of what chemicals were dumped into the sea and what gases were released to the air. There is a need for access rules on which operators can see which data from which oil and gas fields, and for complex constraints to run during import of XML files. SPIN was used to represent and evaluate those constraints.

This is an example of very complex constraints that require many features from SPARQL to represent model-specific scenarios, including the comparison of incoming values against a controlled fact base, transformations from literal values to URIs, string operations, date comparisons etc. User-defined SPIN functions were used to make those complex queries maintainable.

Details: EPIM ReportingHub

S6: Closed-world recognition, e.g. for partial ontology import

Was: Importing all of an ontology is not always a good practice. When an ontology is imported it is often the case that many concepts and properties will be irrelevant to the needs at hand. In addition, transitive imports can lead to increased "Ontology Glut". An increasingly popular practice is to not do any imports but to explicitly declare the use of non-imported resources with rdfs:isDefinedBy to provide provenance pointing to the authoritative definition of the resource. Alternatively, some way to constrain imports to avoid ontology glut might be useful.

SPIN currently uses owl:imports to include other graphs. If no owl:imports statement is present, then the engine will not execute constraints stored in the remote schema. It is perfectly fine to have local copies of classes and properties defined elsewhere, without requiring the full contract. This is a common scenario in controlled environments, not the full Web.

(PFPS: I don't see a story for constraints here. What constraint mechanism is involved in partial ontology import?)

S7: Different shapes at different times, or different access at the same time.

Created by: Eric Prud'hommeaux

<need more here>; For me it sounds like tying shapes to metainformation (i.e. annotating), so certain constraints are only valid e.g. in certain time periods/intervals. Definitely a nice to have but most likely not formalizable in a "light-weight" manner. If shapes are not explicitly bound to owl/rdfs it is easier to decouple them.

This story feels similar to S4, which is also about context-sensitive constraints.

Dimitris: This is how RDFUnit handles validation execution, with manual & automatic annotations, see related thread

(PFPS: So far this is just S4.)

S8: Properties that can change as they pass through the workflow; different shapes for different functions.

(HK: This looks identical to S4)

(PFPS: There is also no story here, so this non-story should be removed.)

S9: Contract time intervals

Created by: Dean Allemang

The OMG time ontology was adopted by FIBO. An end date *exists* but may not be specified. Some contracts (bonds) have an end date.

pfps: Having a date that exists but might not be specified does not appear to be a constraint. The OWL constraints below require that all contracts have exactly one time interval provided. For bonds, this time interval has to have an end date.

The larger requirement here is restriction refinement in subclasses

OWL constraints (Stardog ICV)

RDFS ontology:

 ex:Bond rdfs:subClassOf ex:Contract .
 ex:valid rdfs:domain ex:Contract .
 ex:valid rdfs:range ex:TimeInterval .
 ex:endTime rdfs:domain ex:TimeInterval .
 ex:endTime rdfs:range xsd:date .

Constraints:

ex:Contract <= =1 ex:valid ex:TimeInterval
ex:Bond <= all ex:valid ( exists ex:endTime xsd:date )

(HK: Looks easy to represent in SPIN but I do not understand the syntax above, so I cannot provide an example at this stage)
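
A hedged SPIN-style sketch of the two constraints as read in the comment above; this block is not from the original page, and the exactly-one check is split into a missing-value check and a duplicate-value check:

    ex:Contract
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if there is no ex:valid value that is an ex:TimeInterval
                ASK {
                    FILTER NOT EXISTS { ?this ex:valid ?i . ?i a ex:TimeInterval }
                }"""
        ] ;
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if there is more than one ex:valid ex:TimeInterval
                ASK {
                    ?this ex:valid ?i1 . ?i1 a ex:TimeInterval .
                    ?this ex:valid ?i2 . ?i2 a ex:TimeInterval .
                    FILTER (?i1 != ?i2)
                }"""
        ] .

    ex:Bond
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if some ex:valid value lacks an xsd:date ex:endTime
                ASK {
                    ?this ex:valid ?i .
                    FILTER NOT EXISTS {
                        ?i ex:endTime ?end .
                        FILTER (datatype(?end) = xsd:date)
                    }
                }"""
        ] .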

S10: card >= 0

Created by: Dean Allemang

Mention a property in a card>= 0 restriction, just to indicate an expectation that it will (or might) be there without requiring that it be there

OWL constraints (Stardog ICV)

Ontology:

ex:name rdfs:domain ex:Person .
ex:Person rdf:type rdfs:Class .

Constraint:

ex:Person <= >=0 name

SPIN

In SPIN with the OSLC ontology, this could look like the following (note that oslc:property is a sub-property of spin:constraint):

   ex:Person
       oslc:property [
           a oslc:Property ; 
           oslc:propertyDefinition ex:name ;
           oslc:occurs oslc:One-or-more
       ]

ShExC

ShExC uses regex chars '?', '*', '+' to indicate cardinality (the '.' means we don't care about the object type):

 my:PersonShape { ex:name . + }

S11: Model-Driven UI constraints

There is a need for constraints to provide model-driven validation of permissible values in user interfaces. A number of solutions and applications have been deployed which use SPIN to check constraints on permissible values in user interfaces. This overcomes the software debt that comes from using JavaScript that can readily become out-of-sync with the underlying models.

The major requirement here is a declarative model of

  • which properties are relevant for a given class/instance
  • what is the value type of those properties
  • what is the valid cardinality (min/maxCount)
  • what is the interval of valid literal values (min/maxValue)
  • any other metadata typically needed to build forms with input widgets

A meta-requirement here is to be able to make use of the information above without having to run something like SPARQL queries, i.e. the model should be sufficiently high level so that all kinds of tools can use that information. However, at the same time there are many advanced constraints that need to be validated (either on server or client) before a form can be submitted. These constraints are not necessarily "structural" information, but rather executable code that returns error messages.
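
As a hedged illustration, not from the original page, such a declarative description could reuse the OSLC-style property declarations shown in S10; ex:age is a hypothetical property, and ex:minValue/ex:maxValue stand in for whatever min/maxValue vocabulary the model would define:

    ex:Person
        oslc:property [
            a oslc:Property ;
            oslc:propertyDefinition ex:age ;
            oslc:valueType xsd:integer ;
            oslc:occurs oslc:Zero-or-one ;
            ex:minValue 0 ;      # hypothetical: interval of valid literal values
            ex:maxValue 150
        ] .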

Details about an existing implementation in TopBraid: Ontology-Driven Forms.

S12: App Interoperability

Created by: Sandro Hawke

For example, cimba.co acts as a decentralized Twitter, operating over LDP. For another app to interoperate, it needs to know what data shapes cimba reads and writes. This is currently documented with diagrams and SPARQL templates. The SPARQL is fairly complex and hard to read, and it seems like another language might make it easier to write interoperable programs.

(HK: This story requires more details so that we can create examples)

S13: Specification and validation of metadata templates for immunological experiments

Created by: HIPC Consortium

Contributed by: Michel Dumontier

Systems Biology is playing an increasingly important role in unraveling the complexity of human immune responses. A key aspect of this approach involves the analysis and integration of data from a multiplicity of high-throughput immune profiling methods to understand (and eventually predict) the immunological response to infection and vaccination under diverse conditions. To this end, the Human Immunology Project Consortium (HIPC) was established by the National Institute of Allergy and Infectious Diseases (NIAID) of the US National Institutes of Health (NIH). This consortium generates a wide variety of phenotypic and molecular data from well-characterized patient cohorts, including genome-wide expression profiling, high-dimensional flow cytometry and serum cytokine concentrations. The adoption and adherence to data standards is critical to enable data integration across HIPC centers, and facilitate data re-use by the wider scientific community.

In collaboration with ImmPort, we have developed a set of spreadsheet-based templates to capture the metadata associated with experimental results such as Flow Cytometry results and Multiplex Bead Array Assay (MBAA) results. These templates contain metadata elements that are either required or optional, but importantly, restrict the value of each field to specific datatypes (e.g. string, integer, decimal, date) that may be further restricted by length or by a regular expression pattern, and limited to specific categorical values or to terminology trees/class expressions of a target ontology, especially those drawn from existing ontologies such as Cell Ontology (CL) and Protein Ontology (PO). Once filled out, these spreadsheets are programmatically validated. The values are then stored in a database and are used to power web applications and application programming interfaces.

Given the rapid change in the kinds of experiments performed and the evolving requirements concerning relevant metadata, it is crucial that a language to define these metadata constraints enable us to define different sets of metadata fields and values sets in a modular manner. In addition to HIPC, there are other immunology consortia that might involve different requirements as to how data templates should be defined according to specific needs. It should be relatively straightforward to substitute one set of shape expressions for another. It is also important that the shapes themselves are versioned and the results of validation record the version of the shape expression. It should be possible to validate data using any set of developed shapes.

Ideally, the shapes language should be readable by computers in order to automatically generate template forms with restriction to specified values. Moreover, libraries and tools to construct and validate templates and their instance data should be readily available.
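
A hedged sketch, not part of the HIPC story itself, of what one such template field description might look like using the OSLC-style vocabulary from the earlier stories; the ex: names are placeholders and cl:CellType stands for a Cell Ontology class:

    ex:FlowCytometryResult
        oslc:property [
            a oslc:Property ;
            oslc:propertyDefinition ex:cellPopulation ;
            oslc:occurs oslc:Exactly-one ;
            oslc:range cl:CellType           # value restricted to a target ontology class
        ] ;
        oslc:property [
            a oslc:Property ;
            oslc:propertyDefinition ex:sampleId ;
            oslc:occurs oslc:Exactly-one ;
            oslc:valueType xsd:string ;
            ex:pattern "^[A-Z]{2}[0-9]{6}$"  # hypothetical regular-expression restriction
        ] .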

S14: Object Reconciliation

Created by: David Martin

As an aid in data integration activities, it would be nice if shapes could flexibly state conditions by which to check that the identity of objects has been correctly recorded; that is, check conditions under which two objects in a KB should explicitly represent the same real-world thing. For example (movies domain), I'd like to say:

if source1.movie.title is highly similar (by some widely adopted measure, or some measure that I can plug in to a tool) to source2.film.title AND source1.movie.release-date.year is identical to source2.film.initial-release, then it should be stated that they are the same movie

OR

if source1.movie.title is identical to source2.film.title AND source1.movie.release-date.year is close (say, < 2 years difference) to source2.film.initial-release then it should be stated that they are the same movie

OR

if source1.movie.directors has the same set of values as source2.film.directed-by AND source1.movie.title is highly similar to source2.film.title then it should be stated that they are the same movie

OR ....

(HK: This story sounds more like an inferencing problem than constraint checking. CONSTRUCT { ?this owl:sameAs ?other } WHERE { ... pattern } which can be expressed using spin:rule. Fuzzy string matching like "title highly similar to another title" may require some SPARQL extension if it cannot be expressed using regex).

(DM: Good point, Holger, and I generally agree, And I'm not wedded to this particular story. But isn't the boundary between inferencing and constraint checking inevitably very blurry? I mean, the essence of this example is meant to be: if there's an object X with property P1, and an object Y with property P2, and the value of P1 is related to the value of P2 in the following way: ... then *there must be* a sameAs relation between X and Y. As with many constraints, the intent here is to check completeness. That's constraint-like, right?)
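
A hedged sketch along the lines HK suggests, not from the original page: a spin:rule-style CONSTRUCT for the "identical titles, release years within two years" rule. The source1:/source2: property names are hypothetical, and the "highly similar title" variants are omitted because fuzzy matching would need a SPARQL extension function:

  CONSTRUCT {
      ?movie owl:sameAs ?film .
  }
  WHERE {
      ?movie  source1:title ?title ;
              source1:releaseYear ?y1 .
      ?film   source2:title ?title ;
              source2:initialRelease ?y2 .
      # same title, and release years no more than 2 years apart
      FILTER (abs(?y1 - ?y2) < 2)
  }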

S15: Validation of Dataset Descriptions

Created by: Michel Dumontier

Access to consistent, high-quality metadata is critical to finding, understanding, exchanging, and reusing scientific data. The W3C Health Care and Life Sciences Interest Group (HCLSIG) has developed consensus among participating stakeholders on key metadata elements and their value sets for the description of HCLS datasets. This specification [1], written as a W3C note, meets key functional requirements, reuses existing vocabularies, and is expressed using the Resource Description Framework (RDF). It provides guidance for minimal data description, versioning, provenance, and statistics. We would like to use RDF Shapes to specify these constraints and validate the correctness of HCLS dataset descriptions.

The specification defines a three-component model for summary-, versioning-, and distribution-level descriptions. Each component has access to a specific set of metadata elements, and these are specified as MUST, SHOULD, MAY, and MUST NOT. As such there are different conformance criteria for each level. Metadata values are either unconstrained rdfs:Literals, constrained rdfs:Literals, URIs with a specified URI pattern, instances of a specified URI-identified type, or a disjunction of URI-specified types.

keywords: context-sensitive constraints, cardinality constraints,

[1] http://tinyurl.com/hcls-dataset-description

Cardinalities and ranges are covered by all existing proposals, so I guess the interesting bit here is how to represent that certain constraints only apply in certain contexts ("levels: summary, version, distribution").

SPIN

In SPIN this could be represented in several ways, but the easiest might be to put multiple class definitions into different named graphs, e.g. have a different named graph for summary level than version level. It is unclear how the notion of "level" can be sufficiently generalized. We could also introduce a meta-property spin:context that can be attached to any spin:constraint to define pre-conditions that need to be met before the constraint is evaluated. This context could also be a SPARQL expression, e.g. to call a SPIN function that looks at trigger triples in the current query graph.
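
As a hedged illustration of the named-graph idea above, not taken from the HCLS note: a MUST-level element such as "a summary-level description must have a title" could live in a level-specific graph (ex:summaryLevelShapes is a hypothetical graph name):

    # contents of the named graph ex:summaryLevelShapes
    dctypes:Dataset
        spin:constraint [
            a sp:Ask ;
            sp:text "ASK { FILTER NOT EXISTS { ?this dct:title ?title } }" ;
            rdfs:label "Summary level: dct:title is a MUST element" ;
        ] .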

S16: Constraints and controlled reasoning. We need both! (and macro- or rule mechanisms)

Created by: Simon Steyskal and Axel Polleres

A use case we were facing recently, and have discussed in [1], revolves around the integration of distributed configurations (i.e. object-oriented models) with RDFS and SPARQL. By using Semantic Web technologies for this task we:

  1. aimed to provide a convenient way to perform certain tasks on the global view, such as:
    1. querying it (and thus all underlying local schemas)
    2. performing constraint checks (i.e. checking integrity constraints)
    3. performing reasoning or consistency checks.
  2. wanted to leverage the use of SWT in configuration management.

[1] https://ai.wu.ac.at/~polleres/publications/sche-etal-2014ConfigWS.pdf

Assuming UNA and CWA

For this particular use-case we had to assume both Unique Name Assumption (UNA) and Closed World Assumption (CWA) for our ontologies, since the models (i.e. configurations) they were derived from were generated by product configurators which impose both UNA and CWA.

Since neither RDFS nor OWL imposes UNA/CWA, we had to come up with some workarounds, which were basically:

UNA 2.0
All URIs are treated as different, unless explicitly stated otherwise by owl:sameAs (UNA 2.0 because in general, if two URIs are different and the ontology they are contained in is assumed to obey the UNA, then they cannot be connected via owl:sameAs).
CWA
We assumed knowledge of every existing individual of local configurations and of directly connected individuals from other local configurations; thus the absence of a certain individual in the local configuration means that it does not exist.

SPARQL and UNA

As mentioned earlier, we used SPARQL to perform query tasks on the global schema as well as to check simple integrity constraints by translating e.g. cardinality restrictions into ASK queries.

One major problem which arose from our workaround to impose UNA was that SPARQL is unaware of the special semantics of owl:sameAs. This matters especially for counting aggregates, where one usually wants to count the number of real-world objects and not the number of URIs referring to them.

As an example we defined two SPARQL queries which should count the number of subnets of a certain system [p.5 Figure 8,1]:

Listing 1: Query without special treatment of sameAs:

SELECT (COUNT(DISTINCT ?subnet) AS ?numberofsubnets)
WHERE {
 ?subnet a ontoSys:Subnet .
}
# result: numberofsubnets = 3 

Listing 2: Query with special sameAs treatment (chooses the lexicographically first element as the representative of the equivalence class):

SELECT (COUNT(DISTINCT ?first) AS ?numberofsubnets)
WHERE {
  ?subnet a ontoSys:Subnet .
  # subquery: keep only the lexicographically first member of each owl:sameAs equivalence class
  {
    SELECT ?subnet ?first
    WHERE {
      ?subnet ((owl:sameAs|^owl:sameAs)*) ?first .
      OPTIONAL {
        ?notfirst ((owl:sameAs|^owl:sameAs)*) ?first .
        FILTER (STR(?notfirst) < STR(?first))
      }
      FILTER (!BOUND(?notfirst))
    }
  }
}
# result: numberofsubnets = 1

Obviously Listing 2 is far uglier than Listing 1, especially due to the path expressions that are necessary to traverse potential owl:sameAs chains. Other approaches, such as replacing those chains with pivot identifiers in a pre-processing step, are not feasible since we actually want to keep the different identifiers separate in the data for particular use cases.

Some thoughts...

  1. A macro- or rule mechanism for certain paths or parts reusable in constraints might help to keep constraint expressions tight and clean.
  2. We have to consider both constraints AND controlled reasoning!
  3. We also note that, with regard to controlled reasoning (specifying the specific inference rules that should be considered), this may relate to the (postponed) SPARQL feature http://www.w3.org/2009/sparql/wiki/Feature:ParameterizedInference


SPIN

SPIN provides a mechanism to declare new "magic properties" (http://spinrdf.org/spin.html#spin-magic) that can encapsulate complex logic into a more maintainable syntax. One such magic property could represent the owl:sameAs logic you mention above. However, magic properties are not part of the SPARQL standard, so we would need to officially extend SPARQL for this to be workable.

(SS: Those magic properties seem to fit the request of a "macro- or rule mechanism for certain paths" quite nicely. Thanks for the hint!)
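
To illustrate why such a mechanism would help, here is a sketch rather than working code: ex:sameAsRepresentative is a hypothetical magic property whose body would encapsulate the owl:sameAs path logic of Listing 2, letting the counting query shrink to:

  SELECT (COUNT(DISTINCT ?first) AS ?numberofsubnets)
  WHERE {
    ?subnet a ontoSys:Subnet .
    # hypothetical magic property binding ?first to the canonical member
    # of ?subnet's owl:sameAs equivalence class
    ?subnet ex:sameAsRepresentative ?first .
  }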

S17: Specify subsets of data

Created by: Dean Allemang (for Dave McComb)

Have a lightweight way to refer to a part of a data set, based on the shapes. This could be used for entitlements as well ("you can see AML/KYC shape for this class", "You can only see the identification shape for this class")

(HK: Unclear how this relates to the constraint checking story. Making certain triples invisible depending on user roles looks like a hard problem that needs to be solved on a specific RDF Graph implementation in a programming language. That implementation may use any number of mechanisms, e.g. pre-compute all visible triples into another internal graph by copying them over using CONSTRUCT. For that to work, it may query any helper objects such as shapes to learn which data is visible. Requires more details to be worked out.)

S18: Scope of Export

Created by: David Martin

Starting from a given KB object (individual), I want to export a bunch of related stuff. Use shapes to specify the paths / conditions by which the stuff to be exported can be selected

(HK: Is this identical to S17? Needs more details).

(DM: Yes, I agree. This can be viewed as a special case of S17 – the case where one or more objects are given, which can be used as starting-points for determining the desired subset of a KB. (And in fact, the examples given in S17 already apply to this special case, except I was thinking of instances whereas those examples refer to classes.) So yes, S17 and S18 should be merged. Also, I grant this is not very constraint-like. But it is a very common use case, in my experience, whose solution would have excellent practical value.)

S19: Query Builder

Created by: Nick Crossley

Various tools are contributing data to a triple store. A Query Builder wants to know the permitted or likely shapes of the data over which the generated queries must run, so that the end user can be presented with a nice interface prompting for likely predicates and values. Since the data is dynamic, this is not necessarily the same as the shape that could be reverse engineered from the existing data. The Query Builder and the data-producing tools are not provided by the same team - the Query Builder team has very limited control over the data being produced. The source of the data might not provide the necessary shape information, so we need a way for the Query Builder team (or a third party) to be able to provide the shape data independently. See also Ontology-Driven Forms and S11.

(PFPS: Why can't an ontology be used to provide this information? What makes constraints/shapes better than an ontology?)

S20: Creation Shapes

Created by: Nick Crossley

A client creating a new resource by posting to a Linked Data Platform Container [2] wants to know the acceptable properties and their values, including which ones are mandatory and which optional. Note that this creation shape is not necessarily the same as the shape of the resource post-creation - the server may transform some values, add new properties, etc. [2] http://www.w3.org/TR/ldp/#ldpc

See the ongoing discussion at http://lists.w3.org/Archives/Public/public-data-shapes-wg/2014Nov/0160.html with hints at a solution based on named graphs. Other solutions with stand-alone shapes have been proposed as well as an option to select constraints based on decorations (annotations)

(PFPS: This looks very similar to Story 19.)

(PFPS: Why are constraints/shapes better here than an ontology? Is it the difference between what is expected to be provided and what ends up being inferred? If so, what gains come from being able to make this distinction?) (kc: The difference I see is that these are closed world requirements, and your ontology may be intended for open world use. So this is the difference between validation and inferencing, CWA and OWA, NUNA and UNA. Many of the CWA's in library data are not available or useful in the OW. I can provide examples. I'm not assuming that people create closed world RDF or OWL ontologies, since to me that is a contradiction.)

S21: SKOS Constraints

Created by: Holger Knublauch

The well-known SKOS vocabulary defines constraints that are outside of the expressivity of current ontology languages. They can be expressed using SPARQL built-ins, e.g. via SPIN. Examples include

  • make sure that a resource has at most one preferred label for a given language (see the sketch below)
  • preferred labels and alternative labels must be disjoint

Details: SKOS Constraints

(DCMI doubles down on this one; we have exactly these constraints, specifically on SKOS.)
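
A hedged sketch, not from the original page, of the first constraint listed above (at most one skos:prefLabel per language), attached to skos:Concept for simplicity and written in the SPIN ASK style used elsewhere in this document:

    skos:Concept
        spin:constraint [
            a sp:Ask ;
            sp:text """
                # Violated if two distinct preferred labels share the same language tag
                ASK {
                    ?this skos:prefLabel ?label1 .
                    ?this skos:prefLabel ?label2 .
                    FILTER (?label1 != ?label2 && lang(?label1) = lang(?label2))
                }"""
        ] .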

S22: RDF Data Cube Constraints

Created by: Holger Knublauch

The Data Cube Vocabulary provides a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. While the bulk of the vocabulary is defined as an RDF Schema, it also includes integrity constraints:

http://www.w3.org/TR/vocab-data-cube/#wf-rules

Each integrity constraint is expressed as narrative prose and, where possible, a SPARQL ASK query or query template. If the ASK query is applied to an RDF graph then it will return true if that graph contains one or more Data Cube instances which violate the corresponding constraint.

Using SPARQL queries to express the integrity constraints does not imply that integrity checking must be performed this way. Implementations are free to use alternative query formulations or alternative implementation techniques to perform equivalent checks.

A discussion thread can be found here: http://lists.w3.org/Archives/Public/public-rdf-shapes/2014Aug/0055.html


Example: "Every qb:DataStructureDefinition must include at least one declared measure"

   ASK {
       ?dsd a qb:DataStructureDefinition .
       FILTER NOT EXISTS { ?dsd qb:component [qb:componentProperty [a qb:MeasureProperty]] }
   }

In SPIN:

   qb:DataStructureDefinition
       spin:constraint [
           a sp:Ask ;
           sp:text "ASK { FILTER NOT EXISTS { ?this qb:component [qb:componentProperty [a qb:MeasureProperty]] } }"
       ]

S23: schema.org Constraints

Created by: Holger Knublauch

Developers at Google have created a validation tool for the well-known schema.org vocabulary for use in Google Search, Google Now and Gmail. They have found that what may seem like a potentially infinite number of possible constraints can be represented quite succinctly using existing standards like the SPARQL query language and serialized as RDF.

http://www.w3.org/2001/sw/wiki/images/0/00/SimpleApplication-SpecificConstraintsforRDFModels.pdf

Example: Boarding passes will only be shown in Google Now for flights which occur at a future date:

Solution from the Google Paper (JSON-LD), replacing boardingTime with departureTime:

   {
       "@context": {...},
       "@id": "schema:FlightReservation",
       "constraints": [
           {
               "context": "schema:reservationFor",
               "constraint": "ASK WHERE {?s schema:departureTime ?t. FILTER(?t > NOW())}",
               "severity": "warning",
               "message": "A future date is required to show a boarding pass.",
           }
       ]
   }

In SPIN this would look similar (in Turtle, using syntactic sugar supported from TopBraid 4.6 onwards so that ASK can be used instead of CONSTRUCT):

   schema:FlightReservation
       spin:constraint [
           a sp:Ask ;
           sp:text "ASK { ?this schema:reservationFor/schema:departureTime ?t . FILTER (?t <= NOW()) }" ;
           spin:violationPath schema:reservationFor ;
           spin:violationLevel spin:Warning ;
           rdfs:label "A future date is required to show a boarding pass." ;
       ] .

Other example constraints for schema.org:

  • On schema:Person: Children cannot contain cycles, Children must be born after the parent, deathDate must be after birthDate
  • On schema:GeoCoordinates: longitude must be between -180 and 180, latitude between -90 and 90
  • On various: email address must match a certain regular expression
  • On schema:priceCurrency, currenciesAccepted: Currency code must be from a given controlled vocabulary
  • On schema:children, colleagues, follows, knows, parents, relatedTo, siblings, spouse, subEvents, superEvents: Irreflexivity
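
As a sketch, the birthDate/deathDate ordering from the list above could be expressed in the same ASK style as the boarding-pass example (true indicates a violation; this assumes xsd:date values and flags the clearly invalid case where the death date precedes the birth date):

    ASK {
        ?person a schema:Person ;
            schema:birthDate ?birth ;
            schema:deathDate ?death .
        FILTER (?death < ?birth)
    }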

S24: Open Content Model

Created by: Arthur Ryman

See Open Content Model Example for a detailed example.

Suppose there is a need to integrate similar information from multiple applications and that the application owners have gotten together and defined an RDF representation for this information. However, since the applications have some differences, the application owners can only agree on those data items that are common to all applications. The defined RDF representation includes the common data items, and allows the presence of other undefined data items in order to accommodate differences among the applications. In this situation, the RDF representation is said to have an open content model. In fact, one of the attractive features of RDF technology is that it readily enables open content models.

For example, the OSLC Change Management (CM) specification specifies a very minimal representation for Change Requests (e.g. bug reports). A large software development organization may use several different change management tools, e.g. Bugzilla, Jira, and ClearQuest, each with their own proprietary resource format. The OSLC CM specification provides a way to process change management information in a uniform way, independently of the tool that hosts it. However, there may also be interesting differences in the type of information hosted by each tool. OSLC therefore specifies an open content model which allows implementations to extend the base representation with additional content. This content is represented as additional RDF properties on the resources. Furthermore, it is very common for change management tools to partition their resources into defined projects which restrict who can access the resources and which define custom attributes on the resources. Here the term custom attribute refers to an attribute that is not defined out-of-the-box in the tool. The tool administrators customize the tool by defining custom attributes, typically on a per-project basis. For example, one project might add a customer reference number while another might add a boolean flag indicating if there is an impact to the online documentation. These custom attributes also appear as additional RDF properties of the resources.

OSLC specifications typically define one or more RDF types. For example, the RDF type for change requests is oslc_cm:ChangeRequest where the prefix oslc_cm is <http://open-services.net/ns/cm#>. The RDF representation of an OSLC change request contains a triple that defines its type as oslc_cm:ChangeRequest, triples that define RDF properties as described in the OSLC CM specification, and additional triples that correspond to tool-specific or project-specific custom attributes. Note that the addition of custom attributes does not require the definition of a new RDF type. Furthermore the RDF properties used to represent custom attributes may come from any RDF vocabulary. In fact, tool administrators are encouraged to reuse existing RDF properties rather than define synonyms.

Since the shape of a resource may depend on the tool that hosts it, or the project that hosts it within a tool, but the RDF type of the resource may not depend on the tool or project, there is in general no way to navigate to the shape given only its RDF type. The OSLC Resource Shapes specification provides two mechanisms for navigating to the appropriate shape. First, the RDF property oslc:resourceShape where oslc: is <http://open-services.net/ns/core#> may be used to link a tool or project description to a shape resource. Second, the RDF property oslc:instanceShape may be used to link a resource to its shape.

PFPS: I'm having a hard time determining just what is supposed to come from this story. Taken literally, it appears to be stating a need for a particular RDF property, by name, but that's not something that should be coming from these stories. A more natural need here is the ability to get to different constraints or shapes depending on something other than the type of resources. If so, then this story is subsumed by S4.

Arthur: @PFPS yes, S4 is relevant. However it lacks details about how shapes are associated with REST APIs. I am going to describe how OSLC achieves this. The W3C shape specification must not preclude this, or better still, it should define a way to achieve it. I am creating a detailed example, Open Content Model Example.

kcoyle: Arthur's case sounds like cases that we have in the library/cultural heritage world. We can't rely always on rdf:type because different applications may take very different views of the same data. One application may consider title + author + subject to be conceptually "of class work". Another application may consider title + author + subject + language + musicalKey to be a work. These are a:Work and b:Work. However, adhering to the ontology definitions means not being able to operate over combinations of this data. Therefore, the graphs defined by classes, as intended in RDF, may not be the best entry points for this data. Instead, graphs will need to be derived "opportunistically" in order to allow communities with different views to share data. This may mean violating each other's "data integrity" in order to share; and using profiles rather than RDF/OWL ontologies to "read" the data. [Arthur, let me know if I'm completely missing your point here.]

Arthur: @kcoyle, you get my point. In an environment of diverse stakeholders there is always a tension between conformity, which has the benefit of promoting interoperability, and differentiation, which enables satisfying local requirements or competing on the basis of value-added features. Some central body defines a standard to the extent required for interoperability, but allows for local customization. However, I disagree with you on the point of violating each other's data integrity. That would defeat interoperability. To be concrete, suppose you need to collect data from multiple sources and query it. The data from each source should still conform to the standard so that queries, aggregation, etc. are possible and meaningful. e.g. If the standard says that the value of a length property must be metres, you better not give it in yards. Concerning rdf:type, I like the Linked Data viewpoint which says that to get information about a thing, you dereference its URI. Therefore it does not seem aligned with Linked Data to :import a definition of a type. The authoritative definition of a type should be obtained from its creator host via dereferencing its type URI. Uses of the type should be consistent with its authoritative definition. Therefore, in order to be widely usable, the definition must not bring in a lot of baggage. The same goes for properties. Applications should be able to compose types and properties into information resources, and the constraints on those resources should be expressible through an orthogonal mechanism such as a shape.

S25: Primary Keys with URI Patterns

Created by: Holger Knublauch

It is very common to have a single property that uniquely identifies instances of a given class. For example, when you import legacy data from a spreadsheet, it should be possible to automatically produce URIs based on a given primary key column. The proposed solution here is to define a standard vocabulary to represent the primary key and a suitable URI pattern. This information can then be used both for constraint checking of existing instances, and to construct new (valid) instances. One requirement here is advanced string processing, including the ability to turn a partial URI and a literal value into a new URI.
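
A sketch of the constraint-checking side in SPARQL, assuming a hypothetical class ex:Product whose primary key is ex:productID and whose URI pattern is http://example.org/product/{productID} (true indicates a violation); the same CONCAT/IRI machinery could be used to construct URIs for new instances imported from a spreadsheet:

    ASK {
        ?product a ex:Product ;
            ex:productID ?key .
        FILTER (STR(?product) !=
                CONCAT("http://example.org/product/", ENCODE_FOR_URI(STR(?key))))
    }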

Details: Primary Keys with URI Pattern

S26: rdf:Lists and ordered data

Created by: Axel Polleres Modified by: Karen Coyle

This is meant as a “stresstest” rather than as a practical use case: Can we express validation of rdf:Lists in our framework?

Libraries have a number of resources that are issued in ordered series. Any library may own or have access to some parts of the series, either sequential or with broken sequences. The list may be very long, and it is often necessary to display the items in order. The order may be cleanly numerical, or not. Another ordered-list use case is that of authors on academic journal articles. For reasons of attribution (and promotion!), the order of authors in article publishing can be significant, and it is not a computable order (e.g. alphabetical by name). There are other cases as well, but essentially there will be a need for ordered lists for some data. Validation could be: the list must have a beginning and an end; there can be/cannot be gaps in the list.

(kcoyle: Aside: I have great trepidation at tackling lists because of the complexity of the use cases in my community. I'd like feedback on this issue.)

Details: rdf:List Stresstest

PFPS: I like this as a stress test, but it's not a story. Perhaps someone can turn it into a story, but otherwise it should be moved elsewhere (maybe the end of this document).

HK: A variation of this is very much a real story. We often have the requirement to formalize that a given rdf:List should only contain values of certain types. It's a bit like Java generics, where you can write List<Person> to parameterize a generic List class. This is currently missing from the RDF syntax, but could be represented as an additional constraint on a property that has rdf:List as its value type.
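
A sketch of such a check in SPARQL, using a property path to walk the list; ex:authorList and ex:Person are hypothetical names, and true indicates a violation (some member of the list is not a Person):

    ASK {
        ?work ex:authorList ?list .
        ?list rdf:rest*/rdf:first ?member .
        FILTER NOT EXISTS { ?member a ex:Person }
    }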

S27: Relationships between values of multiple properties

Created by: Holger Knublauch

It is quite a common pattern to have relationships between the values of multiple properties. Typical examples are "Start date must be before end date" and "All children must be born after their parents". This information can be used to validate user input on forms and incoming data in web services.
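
A sketch of the first example as a SPARQL ASK check, with hypothetical property names ex:startDate and ex:endDate (true indicates a violation):

    ASK {
        ?event ex:startDate ?start ;
            ex:endDate ?end .
        FILTER (?start >= ?end)
    }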

Story: (kc) Cultural heritage data is created in a distributed way, so when data is gathered together in a single aggregation, quite a bit of checking must be done. One of the key aspects of CH data is the identification of persons and subjects, in particular relating them to historical contexts. For persons, a key context is their own birth and death dates; for events, there is often a date range representing a beginning and end of the event. In addition, there are cultural heritage objects that exist over a span of time (serial publications, for example). In each of these cases, it is desirable to validate the relationship of the values of properties that have temporal or other ordered characteristics.

Details: Constraining the order of different properties

[PFPS: It would be nice to have a story here.]

[HK: Why is this not a story - the use cases in schema.org are obvious and real].

[PFPS: I suppose that this could be turned into a story, but the motivational part is missing from this document.]

[HK: I have clarified that this is essentially about input validation; not sure what else to do here.]

S28: Self-Describing Linked Data Resources

Created by: Holger Knublauch

This is probably the default requirement from a Linked Data perspective: Given a resource URI, tell me all you know about it. The standard procedure is to look up the URI to retrieve the triples for this URI. The next step in RDF/OWL is to look for rdf:type triples and then follow those URIs to look up the class definitions. In OWL, those class definitions often carry owl:Restrictions. In SPIN, those class definitions would carry spin:constraints.

DCMI story: For some properties there is a requirement that the value IRI resolve to a resource that is a skos:Concept. The value is not limited to a particular concept scheme (skos:ConceptScheme).
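
A sketch of the DCMI requirement as a SPARQL ASK check, using a hypothetical property ex:subject; it assumes the description obtained by dereferencing the value IRI has already been merged into the graph being validated, which is exactly the retrieval step this story is about (true indicates a violation):

    ASK {
        ?record ex:subject ?value .
        FILTER (!isIRI(?value) || NOT EXISTS { ?value a skos:Concept })
    }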

[PFPS: This may be something that should be done by a constraint system, but there doesn't appear to be any constraint or shape story here.]

[HK: The shape story here is that the linked data produced by the server would not only return the class definition, but also the properties and further constraints of that class. This information can then be used in many ways, constraint checking among them. The point of my story here is that this architecture should be linked data friendly, i.e. have transparent mechanisms to retrieve missing information before constraint checking can happen.]

[PFPS: So this is not about the constraints, but about how constraints are accessed?]

[HK: Yes, about how constraints are associated with the starting point (Resource) and then accessed for execution.]

S29: Describing interoperable, hypermedia-driven Web APIs (with Hydra)

Created by: Holger Knublauch

Hydra http://www.hydra-cg.com/ is a lightweight vocabulary to create hypermedia-driven Web APIs. By specifying a number of concepts commonly used in Web APIs it enables the creation of generic API clients. The Hydra core vocabulary can be used to define classes and "supported properties" which carry additional metadata such as whether the property is required and whether it is read-only. This feels very similar to the OSLC Resource Shapes story and uses similar constructs. It is also possible to express the supported properties as a SPIN constraint check, as implemented here: http://topbraid.org/spin/spinhydra

[PFPS: This appears to be very similar to S11. Only one of them should survive.]

[HK: I am sure this story and S11 will produce similar requirements. My understanding of the Stories step was to ground the requirements on real use cases, so here is one. Even if it produces the same requirements, it is helpful to have it written up.]

S30: PROV Constraints

Created by: Holger Knublauch

The PROV Family of Documents http://www.w3.org/TR/prov-overview/ defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web. One of these documents is a library of Constraints http://www.w3.org/TR/2013/REC-prov-constraints-20130430/ which defines valid PROV instances. The actual validation process is quite complex and requires a normalization step that can be compared to rules. Various implementations of this validation process exist, including a set of SPARQL INSERT/SELECT queries sequenced by a Python script (https://github.com/pgroth/prov-check/blob/master/provcheck/provconstraints.py), an implementation in Java (https://provenance.ecs.soton.ac.uk/validator/view/validator.html) and in Prolog (https://github.com/jamescheney/prov-constraints). Stardog also defines an "archetype" for PROV, which seems to be implemented in SPARQL using their ICV engine (http://docs.stardog.com/admin/#sd-Archetypes).

S31: LDP: POST content to Container of a certain shape

Similar to S29

Created by: Steve Speicher

Some simple LDP server implementations may be based on lightweight app server technology and only deal with JSON(-LD) and Turtle representations for their LDP RDF Sources (LDP-RS), on top of an existing application, say Bugzilla. As a client implementer, I may have a simple JavaScript application that consumes and produces JSON-LD. I want a way to programmatically provide the end user with a simple form for creating new resources, and also a way to potentially auto-prefill this form based on data from the current context.

LDP defines some behavior for when a POST to an ldp:Container fails, by outlining expected status codes and additional hints that can be found either in the body of the response to the HTTP POST or in a response header (such as a Link relation of "http://www.w3.org/ns/ldp#constrainedBy"). A client can proactively request headers (instead of attempting the POST and having it fail) by performing an HTTP HEAD or OPTIONS request on the container URL and inspecting the link relation for "constrainedBy". Typical constraints are: a) not necessarily based on type; b) sometimes limited to the action of creation and may not apply to other states of the resource.

The current gap is that whatever is at the end of the "constrainedBy" link could be anything: HTML, OSLC Resource Shapes, SPIN. The LDP WG discussed the need for something a bit more formalized and deferred making any recommendation, looking to apply these requirements to the Data Shapes work. Once that work matures and meets the requirements, LDP could then recommend it.

PFPS: This appears to be similar to S11 and S29. However, this does talk about particular surface forms of the RDF graph, which may go beyond what constraints or shapes are supposed to do.

S32: Non-SPARQL based solution to express constraints between different properties

Created by: Anamitra Bhattacharyya

Consider the case of a client that consumes RDF resources, interfaces with an LDP container, and needs to work in a disconnected mode (the client being a worker's mobile device in a work zone with no connectivity). The client needs to allow workers to create entries locally on the device to mark the completion of different stages of the work. These entries will be synched up with the LDP container at a later time, when the device regains connectivity. Before then, while the client is in disconnected mode, the client software needs to perform a range of validations on the user's entries to reduce the probability of an invalid entry.

In addition to the basic data type/required/cardinality "stand alone" validations, the client needs to validate constraints between different properties:

  1. Start time must be less than end time.
  2. If end time is not specified, the status of the "work" should be "In Progress".
  3. If status is "Complete", end time is required.

The client side does not have access to any triple store or LDP container, so these validations need to be expressed in a higher-level language that makes it simpler for clients to implement them.


PFPS: I'm having a very hard time trying to figure out why clients working in a disconnected mode need a higher-level language. I'm also having a very hard time trying to figure out why the need for a higher-level language is tied to constraints between different properties.

S33: Structural validation for queriability

Created by: Eric Prud'hommeaux

Patient data (all data) is frequently full of structural errors. Statistical queries over malformed data lead to misinterpretation and inaccurate conclusions. Shapes can be used to sequester well-formed data for simpler analysis.

Consider a schema where a medical procedure should have no more than one outcome. Accidental double entry occurs when e.g. a clinician and her assistant both enter outcomes into the database:

 _:Bob :hadIntervention [
     :performedProcedure [ a bridg:PerformedProcedure ;
                           :definedBy [ :coding term:MarrowTransplant ; :location terms:Manubrium ] ];
     :assessmentTest     [ a bridg:PerformedObservation ;
                           :definedBy [ :coding term:TumorMarkerTest ; :evaluator <LabX> ] ;
                           :result    [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ],
                                      [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ]
                         ]
 ] .

The obvious SPARQL query on this will improperly weight this as two positive outcomes:

 SELECT ?location ?result (COUNT(*) AS ?count)
 WHERE {
   ?who :hadIntervention [
       :performedProcedure [ :definedBy [ :coding term:MarrowTransplant ; :location ?location ] ];
       :assessmentTest     [ :definedBy [ :coding term:TumorMarkerTest ] ;
                             :result    [ :coding ?result ] ]
                         ]
 } GROUP BY ?result ?location

(This is a slight simplification for the sake of readability. In practice, an auxiliary hierarchy identifies multiple codes as positive outcomes, e.g. term:ImprovedToNormal and term2:ClinicalCure, but the effect is the same as described here.)

Shapes can be used to select the subset of the data which will not introduce erroneous results.

ShExC schema

 my:Well-formedPatient {
     :hadIntervention {
         :performedProcedure { :definedBy { :coding IRI , :location IRI } } ,
         :assessmentTest     { :definedBy { :coding IRI , :evaluator IRI} ,
                               :result    { :coding IRI } }
     }
 }

Resource Shapes

 ex:Well-formedPatient a rs:ResourceShape ;
     rs:property [
         rs:occurs rs:Exactly-one ;
         rs:propertyDefinition :hadIntervention ;
         rs:valueShape [ a rs:ResourceShape ;
             rs:property [
                 rs:occurs rs:Exactly-one ;
                 rs:propertyDefinition :performedProcedure ;
                 rs:valueShape [ a rs:ResourceShape ;
                     rs:property [
                         rs:occurs rs:Exactly-one ;
                         rs:propertyDefinition :definedBy ;
                         rs:valueShape [ a rs:ResourceShape ;
                             rs:property [
                                 rs:propertyDefinition :coding ;
                                 rs:valueType shex:IRI ;
                                 rs:occurs rs:Exactly-one ;
                             ] ;
                             rs:property [
                                 rs:propertyDefinition :location ;
                                 rs:valueType shex:IRI ;
                                 rs:occurs rs:Exactly-one ;
                             ]
                         ] ;
                     ]
                 ] ;
             ] ;
             rs:property [
                 rs:propertyDefinition :assessmentTest ;
                 rs:valueShape [ a rs:ResourceShape ;
                     rs:property [
                         rs:occurs rs:Exactly-one ;
                         rs:propertyDefinition :definedBy ;
                         rs:valueShape [ a rs:ResourceShape ;
                             rs:property [
                                 rs:propertyDefinition :coding ;
                                 rs:valueType shex:IRI ;
                                 rs:occurs rs:Exactly-one ;
                             ] ;
                             rs:property [
                                 rs:propertyDefinition :evaluator ;
                                 rs:valueType shex:IRI ;
                                 rs:occurs rs:Exactly-one ;
                             ]
                         ] ;
                     ] ;
                     rs:property [
                         rs:occurs rs:Exactly-one ;
                         rs:propertyDefinition :result ;
                         rs:valueShape [ a rs:ResourceShape ;
                             rs:property [
                                 rs:occurs rs:Exactly-one ;
                                 rs:propertyDefinition :coding ;
                                 rs:valueType shex:IRI ;
                             ]
                         ] ;
                     ]
                 ] ;
             ]
         ] ;
     ] .

(HK: Slightly reformatted; dropped rs:name, which isn't specified in the ShExC example either.)

SPIN

(HK: a SPIN syntax could look almost exactly like the Resource Shapes example above, because RS can be expressed in a SPIN-compliant way).

S34: Large-scale dataset validation

Created by: Dimitris Kontokostas

A publisher has a very large RDF database (millions to billions of triples) and wants to define multiple shapes for the data that will be checked at regular intervals. To make this process effective, 1) validation must run within a reasonable time span, and 2) it must be possible to determine just what violations were found, i.e. a bare TRUE/FALSE result is inadequate.
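
For example, the Data Cube check from S22 can be recast as a SELECT query so that the offending resources are reported rather than a bare boolean (a sketch; prefix declarations omitted as in the other examples):

    SELECT ?dsd
    WHERE {
        ?dsd a qb:DataStructureDefinition .
        FILTER NOT EXISTS { ?dsd qb:component [ qb:componentProperty [ a qb:MeasureProperty ] ] }
    }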

S35: Describe disconnected graphs

Created by: Arthur Ryman

In general, the RDF representation of an information resource may be a disconnected graph in the sense that the set of nodes in the graph may be partitioned into two disjoint subsets A and B such that there is no undirected path that starts in A and ends in B. The shape language must be able to describe such graphs. For example, consider the following JSON-LD representation of the Access Context List resource specified in OSLC Indexable Linked Data Provider Specification V2.0:

{
  "@context": {
    "acc": "http://open-services.net/ns/core/acc#",
    "id": "@id",
    "type": "@type",
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description"
  },
  "@graph": [{
     "id": "https://a.example.com/acclist",
     "type": "acc:AccessContextList"
    }, {
     "id": "https://a.example.com/acclist#alpha",
     "type": "acc:AccessContext",
     "title": "Alpha",
     "description": "Resources for Alpha project"
    }, {
     "id": "https://a.example.com/acclist#beta",
     "type": "acc:AccessContext",
     "title": "Beta",
     "description": "Resources for Beta project"
  }]
}
 

There is no path from the acc:AccessContextList node to either of the acc:AccessContext nodes. There is an implicit containment relation of acc:AccessContext nodes in the acc:AccessContextList by virtue of these nodes being in the same information resource. However, the designers of this representation were attempting to eliminate clutter and appeal to Javascript developers, so they did not define explicit containment triples.

S36: Support use of inverse properties

Created by: Arthur Ryman

In some cases the best RDF representation of a property-value pair may reuse a pre-existing property in which the described resource is the object and the property value is the subject. The reuse of properties is a best practice for enabling data interoperability. The fact that a pre-existing property might have the opposite direction should not be used as a justification for the creation of a new inverse property. In fact, the existence of both inverse and direct properties makes writing efficient queries more difficult since both the inverse and the direct property must be included in the query.

For example, suppose we are describing test cases and want to express the relations between test cases and the requirements that they validate. Further suppose that there is a pre-existing vocabulary for requirements that defines the property ex:isValidatedBy which asserts that the subject is validated by the object. In this case there is no need to define the inverse property ex:validates. Instead the representation of test case resources should use ex:isValidatedBy with the test case as the object and the requirement as the subject.
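
A sketch of the query complication mentioned above: if both ex:isValidatedBy and a redundant inverse ex:validates were in use, finding the requirements validated by a given test case (here the hypothetical ex:testCase1) would require covering both directions:

    SELECT ?requirement
    WHERE {
        { ?requirement ex:isValidatedBy ex:testCase1 . }
        UNION
        { ex:testCase1 ex:validates ?requirement . }
    }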

This situation cannot be described by the current OSLC Shapes specification because that specification has a directional bias. OSLC Shapes describe properties of a given subject node, so inverse properties cannot be used. The OSLC Shape submission proposes a possible solution. See http://www.w3.org/Submission/shapes/#inverse-properties

[HK: Thanks for this story - this is common indeed. You propose a flag "isInverse" on oslc:Property but I think this isn't the best solution as the facets for an inverse property are different from those in the forward direction (e.g. they can only be object properties so all datatype facets don't apply). Instead, I would introduce a new system property :inverseProperty in addition to :property.]

Arthur: @HK, I agree that :inverseProperty is better than adding :isInverse to :property. Since :isInverse would be an extension to the OSLC shape spec, there is a danger that some clients would ignore it and silently do the wrong thing. It is better to have them complain that :property is missing.

S37: Defining allowed/required values

Created by: Karen Coyle

The cultural heritage community has a large number of lists that control values for particular properties. These are similar to the DCMItypes, but some are quite extensive (>200 types of roles for Agents in relation to resources). There is also a concept of "authorities" which control the identities of people, places, subjects, organizations and even resources themselves. Many of these lists are centralized in major agencies (Library of Congress, Getty Art & Architecture Archive, National Library of Medicine, and national libraries throughout the world). Not all have been defined in RDF or RDF/SKOS, but those that have can be identified by their IRI domain name and pattern. Validation tools need to restrict or check usage according to the rules of the agency creating and sharing the data. Some patterns of needed validation are:

1) must be an IRI (not a literal)

2) must be an IRI matching this pattern (e.g. http://id.loc.gov/authorities/names/)

3) must be an IRI matching one of >1 patterns

4) must be a (any) literal

5) must be one of these literals ("red" "blue" "green")

6) must be a typed literal of this type (e.g. XML dataType)

7) literal must have a language code

Some of these are conditional: for resources of type:A, property:P has allowed values a,b,c,f.
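
As an illustrative sketch, pattern 2 above could be checked with a SPARQL ASK query, using a hypothetical property ex:creator and the id.loc.gov names authority as the required pattern (true indicates a violation):

    ASK {
        ?resource ex:creator ?value .
        FILTER (!isIRI(?value) ||
                !STRSTARTS(STR(?value), "http://id.loc.gov/authorities/names/"))
    }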