SHACL Use Cases and Requirements

Abstract

To foster the development of Shapes Constraint Language (SHACL), this document includes a set of use cases and requirements that motivate a simple language and semantics for formulating structural constraints on RDF graphs. All use cases provide realistic examples describing how people may use structural constraints to validate RDF instance data. Note, that this document avoids the use of any specific vocabulary that might be introduced by the SHACL specification.

3. Use Cases

3.1 UC1: Model validation

There is a general need to validate that the instance data matches the models that have been defined in RDFS or OWL. The primary validation requirement is to ensure that the appropriate information is given for each property (or class) in the model. As examples, one could require that each property must have a domain, or that classes must be explicitly stated in the instance data. Input to this case is the RDF representation of an RDFS (or OWL) ontology.

Summary: Requires the ability to check whether certain information is given/available for a property or class.

Related Requirements: R6.2

3.2 UC2: Enforcing cardinality

For a tool that will build a list of personal names for named entity resolution to work correctly, every person must have one or more names specified, each of which is a string. Constraints can be used to verify that a particular set of data has at least one such name for each person.

Summary: Requires the ability to check the cardinality of properties as well as the type of its values.

Related Requirements: R6.2 and R8

3.3 UC3: Nuanced error conditions

Validation itself will result in a yes/no decision. However, there is a range of responses that any application may wish to act on, or that it may want to echo back to the user as a result of the validation process. There are the obvious results of "keep/reject" but oftentimes there will be a range of error or alert responses. There needs to be a way to associate an error level or code with the output from validation. Some applications will have a number of responses that inform users of ways they could improve their data, while still accepting all but the truly unusable data. Other applications could analyze data using a nuanced granding system.

Summary: Requires the ability to return responses appropriate to the condition, not just "pass/fail."

Related Requirements: R5.1, R5.9.1, R5.9.2, R5.9.3, R5.10, R10, R10.1, and R10.3

3.4 UC4: Shape variations within a process or workflow

The same shape can have different values and different requirements at different points in a process or workflow. Any node in the graph may serve multiple roles, that is, the same node may include properties for a SubmittingUser and for an AssignedEmployee, and these will be relevant at different points in the process. As an example, an LDP Container (e.g PendingIssues) accepts an IssueShape with a status of "assigned" or "unassigned". The LDP Container is an interface to a service storing data in a conventional relational database. Later, the issue gets resolved and is available as OldIssues without acquiring new type arcs. The constraints for PendingIssues are different from those for Issues at OldIssues, even though the instance data occupies a single graph.

Summary: Requires the ability to associate more than one shape to the same graph or node.

Related Requirements: no suitable requirements approved yet.

3.5 UC5: Complex constraints

Data applications may have a number of complex constraints that must interoperate. For example, there can be a wide variety of access rules defining what privileges for viewing and updating data. These can be applied to accounts or to applications or functions. There can be additional complex constraints on different import or export data. Incoming data, which itself can be complex, can be subjected to a large number of validation actions, some of which are dependent on output from prior application steps.

Design of validation must make these complex constraints appropriately efficient in application, as well as fostering a manageable maintenance environment for the validation technology.

Summary: Requires the expressibility of complex constraints that include e.g. value transformations, string operations, date comparisons, etc.

Related Requirements: R6, R6.3, R6.5, R6.6, R6.7, and R8

3.6 UC8: Checking RDF node type

It is often necessary or desirable to check whether certain property values RDF nodes are of a specific node type (IRI, BlankNode or Literal and all combinations thereof). One example is the need to state that a given property shall only have IRIs but not BlankNodes.

There are examples of this functionality in the VOID namespace, (void:dataDump and void:exampleResource), and in SPARQL (isIRI, isBlank, isLiteral).

Summary: Requires the possibility to constrain the value type of a property. E.g. check whether it is an IRI, a literal, a blank node, or some combination of those.

Related Requirements: R5.5

3.7 UC9: Contract time intervals

An ontology may state that instances of a class have a value for a property. Subclasses may be associated with a constraint that requires that there is a provided value for the property. For example, in the OMG time ontology adopted by FIBO every contract has to have an end date. A shape (set of constraints) may require that bonds (a subclass of contracts) have specified end dates without requiring that all contracts have specified end dates.

needs revisionSummary: Validation must allow for (momentarily) unspecified values. For example, an end date may be assumed but is not specified at this time.

Related Requirements: no suitable requirements approved yet.

3.8 UC10: Cardinality >= 0

There is a class in FIBO called IncorporatedCompany, which is a subclass of a bunch of restrictions. Many of them are of the form:

Example

fibo-be-oac-cpty:hasControllingInterestParty min 0 fibo-be-oac-cctl:VotingShareholder

i.e., a qualified cardinality of min 0.

What exactly does this mean? (logically, it is meaningless. Right?) I have an email in to some other FIBO ontologists, but here are some things I think it should mean:

If we build a form for an IncorporatedCompany, there should be a field in that form for hasControllingInterestParty. The field should be pre-populated (e.g., with a drop-down) with known VotingShareholders. We won't draw any inferences about the things here (as we would have done, if we had said min=1 or more)
If we receive a payload describing an IncorporatedCompany, and it has values for hasControllingInterestParty, then at least one of them should be known to be a VotingShareholder. I'm not sure what the appropriate behavior is if this doesn't hold. If we wanted a hard error, we should have said min 1.

Summary: Requires the possibility to indicate an expectation that a certain property will (or might) be there without requiring that it be there.

Related Requirements: R5.1, R5.2, R5.3, R5.4, and R8

3.9 UC11: Model-Driven UI constraints

There is a need to have constraints that provide model-driven validation of permissible values in user interfaces. The major requirement here is a declarative model of:

which properties are relevant for a given class/instance?
what is the value type of those properties?
what is the valid cardinality (min/maxCount)?
what is the interval of valid literal values (min/maxValue)?
any other metadata typically needed to build forms with input widgets.

It must be possible to perform validation of this type on instance data without being required to make use of a specific mechanism, such as SPARQL queries. Instead, the model should be of a sufficiently high level that it is not dependent on a single tool or method. However, at the same time there are many advanced constraints that need to be validated (either on server or client) before a form can be submitted. These constraints are not necessarily "structural" information, but rather executable code that returns error messages.

Summary: Requires the ability to declare and constrain permitted values for properties, as well as their cardinalities, in an abstract and "high-level" fashion.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R5.10, R8, R14.1, R14.2, and R14.3

3.10 UC12: App interoperability

There is one application (eg Cimba) which stores application state in RDF. It currently queries and modifies that state using HTTP GET and PUT operations on RDF Sources, but we have another version being developed that uses SPARQL to query and modify the data. The question is, how do we communicate the shape of the data this application reads and writes to other developers who want to make compatible applications? We want to say: as long as your data is of this form, Cimba will read it properly. We also want to say: Cimba may write data of any of these forms, so to be interoperable, your application will need to read and correctly process all of them.

Summary: Requires the final "shape syntax" to be light-weight and less verbose than SPARQL.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R11.5, and R11.7

3.11 UC13: Specification and validation of metadata templates

Data gathering functions, especially those that are consortial or rely on aggregation of data from multiple sources, need to be able to easily create templates to represent metadata. Eas of templating is particularly important in rapidly changing fields, such as medicine. For this reason, it is crucial that a language be developed that can allow easy templating of metadata and constraints. The templates must allow users to define different sets of metadata elements and their requirements. Templates should be modular and re-usable.

These templates will contain metadata elements that are either required or optional, and that define the value of the field to specific datatypes (e.g. string, integer, decimal, date). Values may be restricted by length or to a regular expression pattern; they may limited to specific categorical values or terminology trees/class expressions of a target ontology.

Ideally, the shapes language should be readable by computers in order to automatically generate template forms with restriction to specified values. Moreover, libraries and tools to construct and validate templates and their instance data should be readily available.

Summary: Requires the possibility to define shapes for a specific node in a modular manner, i.e. defining different sets of metadata fields and value sets.

Requires the availability of version information of shapes thus, the results of validation shall record the version of the triggered shape expression.

Related Requirements: R5.1, R5.2, R5.3, and R5.4

3.12 UC14: Quality Assurance for object reconciliation

In data integration activities, tools such as Silk or Limes may be used to discover entity co-references. Entity co-references are pairs of different identifiers, often in different datasets, that refer to the same entity. Detected co-references are often recorded as owl:sameAs triples. This may be a step in an object reconciliation pipeline.

It would be nice if shapes could flexibly state conditions by which to check that identity of objects has been correctly recorded; that is, check conditions under which a same-as link should be present between two identifiers, or conversely, check conditions for misidentified same-as links.

If source1.movie.title is highly similar (by some widely adopted string similarity function, perhaps plugged in through an extension interface) to source2.film.title and source1.movie.release-date.year is identical to source2.film.initial-release, then a owl:sameAs triple should be present
If source1.movie.title is identical to source2.film.title and source1.movie.release-date.year is within two years of source2.film.initial-release, then a owl:sameAs triple should be present
If source1.movie.directors has the same set of values as source2.film.directed-by AND source1.movie.title is highly similar to source2.film.title, then a owl:sameAs triple should be present

The intent here is not that the validation process should produce the expected owl:sameAs triples. We assume that some other tool or process has already produced these triples. The purpose of these validation rules is to perform quality assurance, or sanity checks, on the output of these other tools or processes. Thus, the quality or completeness of the generated linkset could be assessed.

We note however that object reconciliation tools could be driven by constraints like those given above. So potentially, an object reconciliation tool and a validator could use the same input constraints. Thus, this story straddles the boundaries between constraint checking and inference.

Summary: Requires the

Related Requirements: no suitable requirements approved yet.

3.13 UC15: Validation of variant dataset descriptions

Vocabulary and data re-use are desirable features of an RDF application. Metadata for a community or function may be expressed as levels of description that re-use existing vocabularies in a way that is appropriate to different contexts. For some data it may be possible to define a subset that satisfies a minimum description. In other cases, data may be re-used in a variety of configurations. Each of these contexts can have different validation constraints.

For example, in a data environment that has a 3 component model for summary, versioning, and distribution-level descriptions, each component has access to a specific set of metadata elements and these are specified as MUST, SHOULD, MAY, and MUST NOT. As such there are different conformance criteria for each level. Metadata values are either unrestrained rdfs:Literals, constrained rdfs:Literals, URIs with a specified URI pattern, or instances of a specified URI-identified type, or a disjunction of URI-specified types.

Summary: Requires the functionality to restrict application of constraints to certain contexts.

Requires expressibility of cardinality constraints and property value restrictions.

Related Requirements: R5.1

3.14 UC16: Constraints and controlled reasoning

A use-case we were facing recently, revolved around the integration of distributed configurations (i.e. object-oriented models) with RDFS and SPARQL. In this particular use-case we had to assume both Unique Name Assumption (UNA) and Closed World Assumption (CWA) for our ontologies, since the models (i.e. configurations) from which those ontologies were derived were generated by product configurators that impose both UNA and CWA. Since neither RDFS or OWL impose UNA/CWA we had to come up with some workarounds which were basically:

UNA 2.0: all URIs are treated as different, unless explicitly stated otherwise by owl:sameAs (UNA 2.0 because in general, if two URIs are different and the ontology they are contained in is assumed to obey the UNA then they cannot be connected via owl:sameAs).
CWA: we assumed to know every existing individual of local configurations and directly connected individuals from other local configurations, thus an absence of a certain individual in the local configuration means that it does not exist.

SPARQL was used to perform query tasks on the global schema as well as to check simple integrity constraints by translating e.g. cardinality restrictions into ASK queries.

One major problem which arose based on our workaround to impose UNA was, that SPARQL is unaware of the special semantics of owl:sameAs. Which means that especially if one wants to use counting aggregates, one usually wants to count the number of real-objects and not the number of URIs referring to it. As an example we defined two SPARQL queries which should count the number of subnets of a certain system:

Summary: Requires support of unique name assumption, such that each unique IRI is assumed to represent a unique entity.

Requires possibility to encapsulate verbose constraint definitions into macros, thus allowing their reuse in other shapes as well as increase readability of shape expressions.

Related Requirements: R7 and R7.1

3.15 UC17: Specifing subsets of data

The medical community has an interest in the notion of "archetypes" that are expressed as abstract constraints on a reference model. The reference model describes the largest set of possible possible instances of a given collection of data and the archetypes then constrain this set of instances by restricting cardinality, types, value ranges, etc.. One way to implement archetype models would be through RDF and SHACL, where the reference model would be viewed as the "constraints" -- the set of rules that are used to validate incoming data and to document data set validity.

The archetypes, however, would serve the additional purpose of defining "instance subsets". The archetypes identify filters/queries that would allow a user to return the a set of shapes that met certain criteria such as abnormal values, co-occurence, etc. They could also act as filters, funneling incoming instances to secondary processes where necessary.

It should be noted that the primary representation for archetypes in the medical community will probably not be SHACL -- they will be using Archetype Definition Language (ADL) (or the UML equivalent, AML) and/or profiles, with SHACL being a translation.

Note

This version was heavily edited and still needs to be approved.

3.16 UC19: Query Builder

Various tools are contributing data to a triple store. A Query Builder wants to know the permitted or likely shapes of the data over which the generated queries must run, so that the end user can be presented with a nice interface prompting for likely predicates and values. Since the data is dynamic, this is not necessarily the same as the shape that could be reverse engineered from the existing data. The Query Builder and the data-producing tools are not provided by the same team - the Query Builder team has very limited control over the data being produced. The source of the data might not provide the necessary shape information, so we need a way for the Query Builder team (or a third party) to be able to provide the shape data independently. See also Ontology-Driven Forms.

Summary: Requires the possibility to provide shape definitions independently of instance data.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R8, R11.5, R11.7, R14.1, R14.2, and R14.3

3.17 UC20: Creation Shapes

A client creating a new resource by posting to a Linked Data Platform Container wants to know the acceptable properties and their values, including which ones are mandatory and which optional. Note that this creation shape is not necessarily the same as the shape of the resource post-creation - the server may transform some values, add new properties, etc.

Summary: Requires the ability to decide which shape definitions should be valid/triggered for a certain node (in case those shape definitions are mutually exclusive).

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R8, and R14.1

3.18 UC21: SKOS constraints

The well-known SKOS vocabulary defines constraints that are outside of the expressivity of current ontology languages, such as:

make sure that a resource has at most one preferred label for a given language
preferred labels and alternative labels must be disjoint

The constraint language must include the capability to define these constraints, and in particular these constraints should be provided as easily re-usable modules.

Summary: Requires the possibility to define complex constraints including ones on property/value pairs.

Related Requirements: R6, R6.4, R6.6, R7, and R7.3

3.19 UC22: RDF Data Cube constraints

The Data Cube Vocabulary provides a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. While the bulk of the vocabulary is defined as an RDF Schema, it also includes integrity constraints.

Each integrity constraint is expressed as narrative prose and, where possible, a SPARQL ASK query or query template. If the ASK query is applied to an RDF graph then it will return true if that graph contains one or more Data Cube instances which violate the corresponding constraint.

Using SPARQL queries to express the integrity constraints does not imply that integrity checking must be performed this way. Implementations are free to use alternative query formulations or alternative implementation techniques to perform equivalent checks.

Summary: Requires support of RDF Data Cube Integrity Constraints

Related Requirements: R6, R6.2, and R6.6

3.20 UC23: schema.org constraints

Developers at Google have created a validation tool for the well-known schema.org vocabulary for use in Google Search, Google Now and Gmail. They have found that what may seem like a potentially infinite number of possible constraints can be represented quite succinctly using existing standards and serialized as RDF. Some examples of schema.org constraints are:

On schema:Person: Children cannot contain cycles, Children must be born after the parent, deathDate must be after birthDate
On schema:GeoCoordinates: longitude must be between -180 and 180, latitude between -90 and 90
On various: email address must match a certain regular expression
On schema:priceCurrency, currenciesAccepted: Currency code must be from a given controlled vocabulary
On schema:children, colleagues, follows, knows, parents, relatedTo, siblings, spouse, subEvents, superEvents: Irreflexitity

It must be possible to encode schema.org constraints in SHACL.

Summary: Requires support of schema.org constraints.

Related Requirements: R6, R6.2, R6.3, R6.6, and R6.8

3.21 UC24: Open Content Model

Consider a situation in which there is a need to integrate similar information from multiple applications and that the application owners have agreed on an RDF representation for this information. However, because the applications have some differences, the application owners can only agree on those data items that are common to all applications. The defined RDF representation will include the common data items, and will allow the presence of other undefined data items in order to accommodate differences among the applications. In this situation, the RDF representation is said to have an open content model.

Since the shape of a resource may depend on the tool that hosts it, or the project that hosts it within a tool, but the RDF type of the resource may not depend on the tool or project, there is in general no way to navigate to the shape given only its RDF type. The OSLC Resource Shapes specification provides two mechanisms for navigating to the appropriate shape. First, the RDF property oslc:resourceShape where oslc: is <http://open-services.net/ns/core# > may be used to link a tool or project description to a shape resource. Second, the RDF property oslc:instanceShape may be used to link a resource to its shape.

See Open Content Model Example for a detailed example.

Summary: The constraint language must support an open content model that can operate on designated data elements within a larger set of undefined elements that are ignored by the application.

Requires the possibility to address a resource graph based on criteria unrelated to its rdf:type. This can be a general context, or a specific application function.

Related Requirements: R8

3.22 UC25: Primary Keys with URI patterns

It is very common to have a single property that uniquely identifies instances of a given class. For example, when you import legacy data from a spreadsheet, it should be possible to automatically produce URIs based on a given primary key column. The proposed solution here is to define a standard vocabulary to represent the primary key and a suitable URI pattern. This information can then be used both for constraint checking of existing instances, and to construct new (valid) instances. One requirement here is advanced string processing, including the ability to turn a partial URI and a literal value into a new URI.

Details: Primary Keys with URI Pattern

Summary: Requires The ability to create IRIs from non-IRI identifiers.

Related Requirements: R6 and R8

3.23 UC26: rdf:Lists and ordered data

Can we express validating rdf:Lists a in our framework? This is more than just a stresstest but a variation of this can be used to check whether all members of a list have certain characteristics.

Libraries have a number of resources that are issued in ordered series. Any library may own or have access to some parts of the series, either sequential or with broken sequences. The list may be very long, and it is often necessary to display the list of items in order. The order can be nicely numerical, or not. Another ordered list use case is that of authors on academic journal articles. For reasons of attribution (and promotion!), the order of authors in article publishing can be significant. This is not a computable order (e.g. alphabetical by name). There are probably other cases, but essentially there will definitely be a need to have ordered lists for some data. Validation could be: a) the list must have a beginning and end; b) there can be/cannot be gaps in the list.

Details: rdf:List Stresstest

Summary: Requires the possibility to define ordered and unordered lists of properties, including attributes like begin_element, end_element, etc.

Related Requirements: R6, R6.7, and R6.8

3.24 UC27: Relationships between values of multiple properties

Cultural heritage (CH) data is generally created in a distributed way, so when data is gathered together in a single aggregation, quite a bit of checking must be done. One of the key aspects of CH data is the identification of persons and subjects, in particular relating them to historical contexts. For persons, a key context is their own birth and death dates; for events, there is often a date range representing a beginning and end of the event. In addition, there are cultural heritage objects that exist over a span of time (serial publications, for example). In each of these cases, it is desirable to validate the relationship of the values of properties that have temporal or other ordered characteristics.

Details: Relationships between values of different properties

Summary: Requires ability to perform comparisons on the values in selected sets of properties. For example, to compare the value of properties representing birth date and death date to validate that birthdate precedes death date. Similar tests may be needed within workflows, for example to check that step one is completed before step two.

Related Requirements: R6, R6.6, R6.7, R7.3, and R8

3.25 UC28: Self-Describing Linked Data resources

In Linked Data related information is accessed by URI dereferencing. The information that is accessible this way may represent facts about a particular resource, but also typing information for the resource. The types can themselves be used in a similar way to find the ontology describing the resource. It should be possible to use these same mechanisms to find constraints on the information provided about the resource.

For example, the ontology could include constraints or could point to another document that includes constraints. Or the first document accessed might include constraints or point to another document that includes constraints.

DCMI story: For some properties there is a requirement that the value IRI resolve to a resource that is a skos:Concept. The resource value is not limited to a particular skos:Concept scheme.

Summary: SHACL must be able to define validation for information received from a dereferencing of the value IRI, e.g. that the value is a member of a skos:ConceptScheme.

Related Requirements: R7, R7.1, R7.3, and R8

3.26 UC29: Describing interoperable, hypermedia-driven Web APIs (with Hydra)

Hydra is a lightweight vocabulary to create hypermedia-driven Web APIs. By specifying a number of concepts commonly used in Web APIs it enables the creation of generic API clients. The Hydra core vocabulary can be used to define classes and "supported properties" which carry additional metadata such as whether the property is required and whether it is read-only. The constraints vocabulary should support the constraints commonly used in API clients.

Summary: Requires the possibility to define a set of routines or concepts that will fulfil commonly required validation tasks, with perhaps some selectable options.

Related Requirements: R5.1, R5.9.1, R5.9.2, R5.9.3, and R8

3.27 UC30: PROV constraints

The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web. One of these documents is a library of Constraints which defines valid PROV instances. The actual validation process is quite complex and requires a normalization step that can be compared to rules. Various implementations of this validation process exist, including a set of SPARQL INSERT/SELECT queries sequenced by a Python script, an implementation in Java and in Prolog. Stardog also defines an "archetype" for PROV, which seems to be implemented in SPARQL using their ICV engine.

Summary: Requires support of PROV Constraints.

Requires a mechanism to define rules within shape definitions.

Related Requirements: R6

3.28 UC31: LDP: POST content to container of a certain shape

Some simple LDP server implementations may be based on lightweight app server technology and only deal with JSON(-LD) and Turtle representations for their LDP RDF Sources (LDP-RS) on top of an existing application, say Bugzilla. As a client implementer, I may have a simple JavaScript application that consumes and produces JSON-LD. I want to have a way to programmatically provide the end-user with a simple form to create new resources and also a way to potential auto-prefill this form based on data from current context.

LDP defines some behavior when a POST fails to a ldp:Container, by outlining expected status codes and additional hints that could be found in either the response body of the HTTP POST request or a response header (such as: Link relation of "http://www.w3.org/ns/ldp#constrainedBy". A client can proactively request headers (instead of trying the POST and it fails) by performing an HTTP HEAD or OPTIONS request on the container URL and inspecting the link relation for "constrainedBy". Typical constraints are: a) not necessarily based on type b) sometimes limited to the action of creation and may not apply to other states of the resource.

Current gap is whatever is at the end of the "constrainedBy" link, could be anything: HTML, OSLC Resource Shapes, SPIN. The LDP WG discussed a need to have something a bit more formalized and deferred making any recommendation looking to apply these requirements unto the Data Shapes work.Once it matures, and meets the requirements, LDP could provide a recommendation for it then.

Summary: This use case covers similar topics as discussed in UC11.

Related Requirements: no suitable requirements approved yet.

3.29 UC32: Non-SPARQL based solution to express constraints between different properties

In this case there are potentially clients consuming RDF resources, interfacing with an LDP container, that need to work asynchronously (the client being a Workers mobile device where the work zone has no connectivity). The client needs to allow workers to create entries locally in the offline application to mark completion of different stages of the work. These entries will be synched with the LDP container at a later time, when the device has network connectivity. Prior to that, when the client is in disconnected mode, the client software needs to perform a range of validations on the users entries to reduce the probabilty of an invalid entry.

In addition to the basic data type/required/cardinality "stand alone" validations, the client needs to validate constraints between different properties:

start time less than end time
if end time is not specified, the status of the "work" should be "In Progress"
if status is "Complete" end time is required.

The client side does not have access to any triple store/LDP container. If these validations can be expressed in a higher level language which makes it simpler for clients to implement them constraint systems will be useful in more places.

Summary: Expresses the requirement to be able to define constraints over more than one property. E.g., value of property start time must be less than value of property end time.

Those interdependencies between properties of the same RDF node should be expressible in a higher level language.

Related Requirements: R7, R7.4, R11.5, and R11.7

3.30 UC33: Structural validation for queriability

Data frequently has structural errors. Consider a schema where a medical procedure should have no more than one outcome. Accidental double entry occurs when e.g. a clinician and her assistant both enter outcomes into the database. Statistical queries over malformed data such as this leads to misinterpretation and inaccurate conclusions. Shapes can be used to sequester well-formed data for simpler analysis.

Example Data

_:Bob :hadIntervention [
	:performedProcedure [ 	a bridg:PerformedProcedure ;
				:definedBy [ :coding term:MarrowTransplant ; :location terms:Manubrium ] ];
	:assessmentTest     [   a bridg:PerformedObservation ;
				:definedBy [ :coding term:TumorMarkerTest ; :evaluator <LabX> ] ;
				:result    [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ],
					   [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ]
				]
] .

The obvious SPARQL query on this will improperly weight this as two positive outcomes:

Example Query

SELECT ?location ?result (COUNT(*) AS ?count)
WHERE {
	?who :hadIntervention [
		:performedProcedure [ :definedBy [ :coding term:MarrowTransplant ; :location ?location ] ];
		:assessmentTest     [ :definedBy [ :coding term:TumorMarkerTest ] ;
					  :result    [ :coding ?result ] ]
	]
} GROUP BY ?result ?location

(This is a slight simplification for the sake of readability. In practice, an auxiliary hierarchy identifies multiple codes as positive outcomes, e.g. term:ImprovedToNormal and term2:ClinicalCure, but the effect is the same as described here.) Being able to select subsets of data related to an RDF node and thus, define a well-formed/cleansed representation of that node (which is represented as shape), allows to improve the quality of data as well as its queriability.

Summary: Requires the ability to perform structural validation over RDF data.

Related Requirements: R7.4

3.31 UC34: Large-scale dataset validation

A publisher has a very large RDF Database (in terms of millions or billions of triples) and wants to define multiple shapes for the data that will be checked at regular intervals. To make this process effective the validation must be able to run within a reasonable time-span and the validation engine must be flexible enough to provide different levels of the violation result details. The different levels can range from specific nodes that are violating a shape facet, the success or fail of a shape facet or aggregated violations per shape facet, possibly along with an error prevalence.

Applying a shape in a large database can return thousands or millions of violations and it is not efficient to look at all erroneous RDF nodes one by one. In addition, many times all violations for a specific facet can be attributed to a specific mapping or source code function. An expected workflow in this case is that the maintainer runs a validation asking aggregated violations per shape facet along with a sample of (i.e. 10) specific nodes. Having the higher overview along with the sample data the maintainer can choose the order she will address the errors.

Summary: Basically a repetition of UC3 with additional requirements regarding the validation performance.

Related Requirements: R10

3.32 UC35: Describe disconnected graphs

This use case reflects how information resources are created (e.g. via HTTP POST) or modified (e.g. via HTTP PUT). In these situations, the body of the HTTP request has an RDF content type (RDF/XML, Turtle, JSON-LD, etc.). The server typically needs to verify that the body of the request satisfies some application-specific constraints. Many proposed solutions have an implicit assumption that all RDF graphs have a distinguished root node which is the subject of triples that define either literal properties or links to other subjects, which may in turn have literal properties or links to further subjects. The implication is that all the nodes of interest are connected to the root node. However, an RDF graph may not be connected to other graphs acted on by the same application, and in fact disconnected RDF graphs do appear in real-world Linked Data specifications. The RDF representation of an information resource may be a disconnected graph in the sense that the set of nodes in the graph may be partitioned into two disjoint subsets A and B such that there is no undirected path that starts in A and ends in B.

The example can be taken from a specification related to access control. A conformant access control service must host an access control list resource that supports HTTP GET requests. The response to an HTTP GET request have a response body whose content type is application/ld+json, i.e. JSON-LD. An example is given below. In this example, there is a distinguished root node, i.e. the node of type acc:AccessContextList, but it is not connected to the other nodes of interest, i.e. the nodes of type acc:AccessContext.

SHACL must be able to describe such graphs. However, this user use case does not propose that SHACL must be able to distinguish between connected and disconnected graphs.

Summary: States the requirement, that constraints over RDF graphs must be describable for both disconnected and connected graphs.

Related Requirements: R6.7, R9, and R12.1

3.33 UC36: Support use of inverse properties

In some cases the best RDF representation of a property-value pair may reuse a pre-existing property in which the described resource is the object and the property value is the subject. The reuse of properties is a best practice for enabling data interoperability. The fact that a pre-existing property might have the opposite direction should not be used as a justification for the creation of a new inverse property. In fact, the existence of both inverse and direct properties makes writing efficient queries more difficult since both the inverse and the direct property must be included in the query.

For example, suppose we are describing test cases and want to express the relations between test cases and the requirements that they validate. Further suppose that there is a pre-existing vocabulary for requirements that defines the property ex:isValidatedBy which asserts that the subject is validated by the object. In this case there is no need to define the inverse property ex:validates. Instead the representation of test case resources should use ex:isValidatedBy with the test case as the object and the requirement as the subject.

This situation cannot be described by the current OSLC Shapes specification because that specification has a directional bias. OSLC Shapes describe properties of a given subject node, so inverse properties cannot be used. The OSLC Shape submission proposes a possible solution. See http://www.w3.org/Submission/shapes/#inverse-properties.

Summary: For sake of simplicity, a potential constraint language shall allow the usage of properties in their inverse direction if applicable. I.e. allowing the reuse of already defined properties (in an inverse manner) in a shape, even if the node the respective shape is describing only occurs in the object position.

Related Requirements: R5.1 and R5.11

3.34 UC37: Defining allowed/required values

The cultural heritage community has a large number of lists that control values for particular properties. These are similar to the DCMItypes, but some are quite extensive (>200 types of roles for Agents in relation to resources). There is also a concept of "authorities" which control the identities of people, places, subjects, organizations and even resources themselves. Many of these lists are centralized in major agencies (Library of Congress, Getty Art & Architecture Archive, National Library of Medicine, and national libraries throughout the world). Not all have been defined in RDF or RDF/SKOS, but those that have can be identified by their IRI domain name and pattern. Validation tools need to restrict or check usage according to the rules of the agency creating and sharing the data. Some patterns of needed validation are:

must be an IRI (not a literal)
must be an IRI matching this pattern
must be an IRI matching one of >1 patterns
must be a (any) literal
must be one of these literals ("red" "blue" "green")
must be a typed literal of this type (e.g. XML dataType)
literal must have a language code

Some of these are conditional: for resources of type:A, property:P has allowed values a,b,c,f.

Summary: Requires the ability to constrain the potential values of properties of a shape.

Related Requirements: R10

3.35 UC38: Describing and validating Linked Data Portals(proposed)

A small company is specialized in the development of linked data portals. The contents of those portals are usually from statistical data that comes from Excel sheets and can easily be mapped to RDF Data Cube observations.

The company needs a way to describe the model of the RDF graphs that need to be generated from the Excel sheets which will also be published as an SPARQL endpoint. Notice that those linked data portals could contain observations which will usually be instances of qb:Observation but can contain different properties.

Some constraints could be, for example, that any observation has only one floating point value, or that any observation refers to one geographical area, one year, one indicator and one dataset. That those datasets refer to organizations and those organizations have one rdfs:label property in English, another in French, and another in Spanish, etc.

In this context, the company is looking for a solution that can be easily understood by the team of developers which are familiar work with OO programming languages, relational databases, XML technologies and some basic RDF knowledge, but they are not familiar with other semantic web technologies like SPARQL, OWL, etc.

The company also wants some solution that can be published and understood by external semantic web developers so they can easily know how to query the SPARQL endpoint.

There is also a need that the solution can be machine processable, so the contents of the linked data portal can automatically be validated.

Finally, the company would like to compare the schemas employed by the different linked data portals so they can check which are the differences between the RDF nodes that appear in those portals and they can even create new applications on top of the data aggregated by those portals.

The company would also like to promote third party companies to be able to reuse the data available in those data portals so there could be third-party applications on top of them which could, for example, visualize or compare the different observations, create faceted browsers, search engines, etc. To that end, those third party companies need some way to query the schemas available in those partals and build those applications from those schemas.

Summary: Requires the ability to constrain the potential values of properties of a shape.

Related Requirements: TBD

3.36 UC39: Arbitrary cardinality

Some clinical data require specific cardinality constraints, e.g.

zero or one (optional) birth date.
zero or more lab tests.
one active patient marker.
one or more emergency contact.
two biological parents.

Which makes it necessary to be able to define arbitrary cardinality constraints, i.e. not be limited to a number of predefined values.

Summary: Requires the ability to define arbitrary cardinality constraints.

Related Requirements: R5.2

3.37 UC40: Describing inline content versus references (proposed)

Suppose an RDF graph contains a triple (S, P, O) where O is the URI of a resource. Sometimes O is itself the subject of other triples, i.e. these other triples are inline in the graph, and sometime it is not, i.e. O is only a reference to another graph.

For example, consider an RDF graph that describes a table of data. The graph contains nodes that correspond to the table, its rows, and its cells, i.e. the contents of the table is inlined in the graph.

In contrast, suppose an RDF graph describes a failed test case and it contains a triple that links the failed test case to a bug report, but no further information about the bug report is contained in the graph. The URI of the bug report is a reference to another graph.

In the case of a reference, it is up to the application to associate a graph with a URI. In Linked Data this association is done by sending an HTTP GET request for the URI. In a SPARQL RDF dataset, the association may be done by using the URI as the name of a named graph in the dataset. The specific mechanism used to associate a referenced URI with an RDF graph should be outside the scope of this working group.

The OSLC Resource Shape submission describes this situation using the property oslc:representation which has the allowed values oslc:Inline, oslc:Reference, and oslc:Either.

Note

new use case that still needs to be approved.

3.38 UC41: Validating schema.org instances against model and metamodel

This use case focuses on the validation of schema.org instances against the constraints expressed in the schema.org model and metamodel. (The related user story, UC23, focuses on domain-specific constraints attached to specific schema.org classes and properties, and not on the model and metamodel.)

A processor for our validation language should be able to accept a schema.org instance as well as the schema.org model, expressed in an RDF syntax, as inputs (perhaps as separate named graphs), and validate the instance against the model.

domainIncludes/rangeIncludes: In schema.org, properties can be associated with multiple types via the “domainIncludes” and “rangeIncludes” properties. The semantics is that the domain/range consist of the union of these types (rather than the intersection, as with the “domain” and “range” properties in RDFS). Validation requires that the subject and object of a triple can be compared against a set of types given in the model graph, and a validation error would be raised if the subject/object is not an instance of one of these types, or of one of their subtypes.
Datatypes and plain literals: In schema.org, properties may be associated with datatypes, but literals in instance data are always plain (string) literals. In other words, a property may be typed as a date property, but the date would be given as a plain literal, not as a xsd:date typed literal. Examples of named datatypes in schema.org include: ISO 8601 dates and datetimes; xsd:time; boolean “True” and “False”; integers. For validation, the language should be able to make use of annotations on the properties. For example, if we have { :thing schema:date "value" }, it should be possible to write a validation rule that depends on a “rangeIncludes” annotation on the schema:date property. As each named datatype is used many times throughout the model, it would also be good if the regular expression (or similar mechanism) for the datatype wouldn't have to be repeated for each property that uses the datatype, but could be referred to by reference, or by rule.
Conformance levels: Processing of schema.org by the major search engines tends to be quite permissive. For example, often, where an “Organization” instance is expected according to the model, a “Text” literal with the organization's name is sufficient. This could be treated as a warning/notice. Also, some literal properties contain markup recommendations such as for the “price” property: Putting “USD” into a separate currency property is preferred over sticking “$” into the numeric price literal. Again, values like “$99” could be treated as warnings.

3.39 UC42: Constraining RDF Graphs to provide better mapping to JSON (proposed)

In client-side application development and in integrating between RDF-based systems and JSON-based APIs, the problem of mapping between the RDF data model and the JSON data model recurs.

In the unconstrained RDF data model, there are too many variations to map arbitrary RDF graphs cleanly to JSON. By selecting an RDF vocabulary that covers the desired JSON structure, and using Shapes to express constraints over the vocabulary, the mapping could be made sensible and predictable. In other words, Shapes could be used to constrain RDF graphs in a way that gives them a well-defined isomorphic mapping to some JSON model. As a side effect, we also get better UI for these constrained RDF graphs.

This rises a number of requirements:

MaxCardinality 1: Properties with a maximum cardinality of 1 can be mapped easily to keys in JSON objects.
Support for RDF lists: Ordered lists are a standard feature of JSON. A clean mapping requires the ability to declare that the value of a property must be an rdf:List, and the ability to place constraints on the members of the list (e.g., be of a certain class or have a certain shape). This has UI advantages too. Knowing that an RDF property is an ordered multi-valued property calls for specific UI widgets. (Partially already covered in UC26)
Maximum one string literal per language: For i18n-capable applications, “one string per language” is an important kind of value. In RDF, this shows up simply as a multi-valued string property. An example is skos:prefLabel. In JSON, the natural representation of this is an object with language codes as keys and the string literals as values. This pattern has special support in JSON-LD for example ("language maps"). Again, if we can declare this constraint on a property, we can use better UI widgets and better API access.

Note

new use case that still needs to be approved.

4. Requirements

This section lists the requirements arising from the use-cases catalogued in this document. Specific requirements that have been de-prioritized or rejected have been left in the document for completeness, but are shown as struck out.

R1: Higher-Level Language

Constraints/shapes shall be specifiable in a higher-level language with 1. definitional capabilities, such as macro rolling up and naming, and 2. control infrastructure for, e.g., recursion.

Motivation: Dublin Core Requirement 103

R2: Concise Language

Constraints/shapes shall be specifiable in a concise language.

Motivation: Dublin Core Requirement 184

R3: Addressability

Collections of constraints/shapes may be addressable and discoverable. Individual constraints/shapes may be addressable and discoverable.

Motivation: Dublin Core Requirement 147 and Dublin Core Requirement 148

R4: Annotations

Constraints/shapes may incorporate extra information that does not affect validation. It shall be possible to search for constraints/shapes with particular extra information.

Motivation: Dublin Core Requirement 208

R5.1: Association of Class with Shape

There must be an "easy" way of associating a shape with a class, meaning that nodes in a graph that are instances of that class must conform with that shape

Motivation: UC3, UC10, UC11, UC12, UC13, UC15, UC19, UC20, UC29, and UC36

R5.2: Property Min/Max Cardinality

The stated values for a property may be limited by minimum/maximum cardinality, with typical patterns being [0..1], [1..1], [0..*] and [1..*].

Motivation: UC10, UC11, UC13, UC19, UC20, and UC39

R5.3: Property Datatype

The values of a property may be limited to be an RDF Literal with a stated datatype, such as xsd:string or xsd:date.

Motivation: UC10, UC11, UC13, UC19, and UC20

R5.4: Property Type

The values of a property may be limited by their type, e.g., all children have to be of type person.

Motivation: UC10, UC11, UC13, UC19, and UC20

R5.5: Property's RDF Node Type (e.g. only IRIs are allowed)

The values of a property on instances of a class may be limited by their RDF node type, e.g. IRI, BlankNode, Literal, or BlankNodeOrIRI (for completeness we may want to support all 7 combinations including Node as parent).

Motivation: UC8

R5.9.1: Datatype Property Facets: min/max values

Similar to xsd:minInclusive/maxExclusive

Motivation: UC3, UC11, UC12, UC13, UC19, UC20, and UC29

R5.9.2: Datatype Property Facets: regular expression patterns

Pattern matching against regular expressions (xsd:pattern).

Motivation: UC3, UC11, UC12, UC13, UC19, UC20, and UC29

R5.9.3: Datatype Property Facets: string length

todo

Motivation: UC3, UC11, UC12, UC13, UC19, UC20, and UC29

R5.10: Property Value Enumerations

Shapes will provide exhaustive enumerations of the valid values (literals and IRIs).

Motivation: UC3, UC11, and UC37

R5.11: Properties Used in Inverse Direction

Shapes can have constraints where the tested node is the object of a triple.

Motivation: UC36

R6: Complex Constraints

The language should allow users to implement constraints that check complex conditions, with an expressivity as covered by the following sub-requirements (e.g. basic graph patterns, string and mathematical operations and comparison of multiple values).

Motivation: UC5, UC21, UC22, UC23, UC26, UC27, and UC30

R6.2: Expressivity: Non-Existence of Patterns

Many constraints require that a certain pattern does not exist in the graph.

Motivation: UC1, UC2, UC22, and UC23

R6.3: Expressivity: String Operations

Some constraints require building new strings out of other strings, and building new URIs out of other values.

Motivation: UC5 and UC23

R6.4: Expressivity: Language Tags

Some constraints require comparing language tags of RDF literals, e.g. to check that no language is used more than once per property. Also to produce multi-lingual error messages.

Motivation: UC21

R6.5: Expressivity: Mathematical Operations

Some constraints require mathematical calculations and comparisons, e.g. area = width * height.

Motivation: UC5

R6.6: Expressivity: Literal Value Comparison

Some constraints require operators such as <, <=, != etc, either against constants or other values that are dynamically retrieved at query time. Includes date/time comparison and functions such as NOW().

Motivation: UC5, UC21, UC22, UC23, and UC27

R6.7: Expressivity: Logical Operators

The language should make it possible to express the basic logical operators intersection, union and negation of conditions.

Motivation: UC5, UC26, and UC35

R6.8: Expressivity: Transitive Traversal of Properties

Some constraints need to be able to traverse a property transitively, such as parent-child or partOf relationships.

Motivation: UC16, UC23, and UC26

R7: Macro-Language Features

The language should enable the definition of macros as short cuts to recurring patterns, and to enable inexperienced users define rich constraints. Macros should be high-level terms that improve overall readability, separation of concerns and maintainability. This overlaps with the already approved "Higher-Level Language".

Motivation: UC5, UC16, UC21, UC27, UC28, and UC32

R7.1: Named Shapes

It should be possible to encapsulate a group of constraints (a Shape) into a named entity, so that the Shape can be reused in multiple places, also across the Web

Motivation: UC16 and UC28

R7.3: Constraint Macros

Some constraint patterns are recurring with only slight modifications. Example: SKOS constraints that multiple properties must be pairwise disjoint. The language should make it possible to encapsulate such recurring patterns in a parameterizable form. Examples include SPIN/LDOM Templates.

Motivation: UC21, UC27, and UC28

R7.4: Nested Constraint Macros

It should be possible to combine the high-level terms of the constraint language into larger expressions using nested constraints. Examples of this include ShEx, Resource Shapes' oslc:valueShape and owl:allValuesFrom.

Motivation: UC32 and UC33

R8: Specialization of Shapes

It should be possible to specialize/extend shapes so that the constraints defined for a more general (super) shape also apply to the specialized (sub) shape. Sub-shapes can only narrow down, i.e. further constrain.

Motivation: UC2, UC5, UC10, UC11, UC19, UC20, UC24, UC25, UC27, UC28, and UC29

R9: Global Constraints

It should be possible to specify constraint conditions that need to be checked "globally" for a whole graph, without referring to a specific set of resources or class. In programming languages such global entities are often called "static", but "global" is probably better known.

Motivation: UC35

R10: Vocabulary for Constraint Violations

Instead of just reporting yes/no, the language needs to be able to return more meaningful messages including severity levels, human-readable error descriptions and pointers at specific patterns in the graph.

Motivation: UC3, UC34, and (almost every other use case)

R10.1: Severity Levels

The language should allow the creation of error responses that can include severity levels as desired.

Motivation: UC3

R10.3: Constraint Violations should point at Specific Nodes

The language should make it possible for authors of constraint checks to produce pointers at specific nodes and graph fragments that caused the violation. Typical examples of such information includes the starting point (root node), a path from the root, and specific values that caused the problem.

Motivation: UC3

R11.5: Profiles

The language should include a notion of profiles, so that certain applications with limited features can only use certain elements of the overall language.

Motivation: UC11, UC19 and UC32

R11.7: Separation of structural from complex constraints

There shall be a core language or SHACL profile that excludes any support for constraints defined via embedded SPARQL queries or other complex lower-level expressions. This is so that lightweight applications can validate constraints without requiring a SPARQL processor or similar subsystem.

Motivation: UC11, UC19 and UC32

R11.8: Evaluating Constraints for a Single Node Only

It should be possible to validate constraints on a single node in a graph. This may be impossible to implement 100% correctly, because sometimes a change to a resource invalidates conditions in a very different place in the graph. However, the language could propose a framework that identifies those constraints that SHOULD be checked when a given node is evaluated, e.g. by following its rdf:type and the superclasses of that. This would include validating shacl:valueShape but not shacl:valueType.

Motivation: (Orthogonal to basically all use cases)

R12.1: Select Whole Graph

It should be possible to select all the RDF nodes in a graph for validation. This is similar to the Global Constraints (R9) requirement.

Motivation: UC35

R12.2: Selection by Type

It should be possible to have some mechanism to select the nodes that are instances of some class for validation.

Motivation: (Orthogonal to basically all use cases)

R12.3: Selection by Single Node

It should be possible to select a single RDF node for validation.

Motivation: (Orthogonal to basically all use cases)

R14.1: Property Default Value

It should be possible to provide a default value for a given property, e.g. so that input forms can be pre-populated. This requirement is not about using default values as "inferred" triples at run-time.

Motivation: UC11, UC19, and UC20

R14.2: Property Labels at Shape

It should be possible to provide human-readable labels of a property in the context of a shape, intended for human consumption such as documentation or UI, not just globally for the rdf:Property. Multiple languages should be supported.

Motivation: UC11 and UC19

R14.3: Property Comment in a Shape

It should be possible to provide human-readable descriptions of the role of a property in the context of a shape, not just globally using triples that have the rdf:Property as subject. Multiple languages should be supported.

Motivation: UC11 and UC19

Acknowledgements

We would like to acknowledge the contributions of user story authors: Dean Allemang, Anamitra Bhattacharyya, Karen Coyle, Nick Crossley, Michel Dumontier, Jose Emilio Labra Gayo, Sandro Hawke, Dimitris Kontokostas, Holger Knublauch, David Martin, Dave McComb, Peter F. Patel-Schneider, Axel Polleres, Eric Prud'hommeaux, Arthur Ryman, Steve Speicher, and Simon Steyskal.

Abstract

Status of This Document

Table of Contents

1. Scope and Motivation

2. Organization of this Document

3. Use Cases

3.1 UC1: Model validation

3.2 UC2: Enforcing cardinality

3.3 UC3: Nuanced error conditions

3.4 UC4: Shape variations within a process or workflow

3.5 UC5: Complex constraints

3.6 UC8: Checking RDF node type

3.7 UC9: Contract time intervals

3.8 UC10: Cardinality >= 0

3.9 UC11: Model-Driven UI constraints

3.10 UC12: App interoperability

3.11 UC13: Specification and validation of metadata templates

3.12 UC14: Quality Assurance for object reconciliation

3.13 UC15: Validation of variant dataset descriptions

3.14 UC16: Constraints and controlled reasoning

3.15 UC17: Specifing subsets of data

3.16 UC19: Query Builder

3.17 UC20: Creation Shapes

3.18 UC21: SKOS constraints

3.19 UC22: RDF Data Cube constraints

3.20 UC23: schema.org constraints

3.21 UC24: Open Content Model

3.22 UC25: Primary Keys with URI patterns

3.23 UC26: rdf:Lists and ordered data

3.24 UC27: Relationships between values of multiple properties

3.25 UC28: Self-Describing Linked Data resources

3.26 UC29: Describing interoperable, hypermedia-driven Web APIs (with Hydra)

3.27 UC30: PROV constraints

3.28 UC31: LDP: POST content to container of a certain shape

3.29 UC32: Non-SPARQL based solution to express constraints between different properties

3.30 UC33: Structural validation for queriability

3.31 UC34: Large-scale dataset validation

3.32 UC35: Describe disconnected graphs

3.33 UC36: Support use of inverse properties

3.34 UC37: Defining allowed/required values

3.35 UC38: Describing and validating Linked Data Portals(proposed)

3.36 UC39: Arbitrary cardinality

3.37 UC40: Describing inline content versus references (proposed)

3.38 UC41: Validating schema.org instances against model and metamodel

3.39 UC42: Constraining RDF Graphs to provide better mapping to JSON (proposed)

4. Requirements

Acknowledgements

A. References

A.1 Informative references