Abstract

To foster the development of the Shapes Constraint Language (SHACL), this document includes a set of use cases and requirements that motivate a simple language and semantics for formulating structural constraints on RDF graphs. All use cases provide realistic examples describing how people may use structural constraints to validate RDF instance data. Note that this document avoids the use of any specific vocabulary that might be introduced by the SHACL specification.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document was published by the RDF Data Shapes Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-rdf-shapes@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

1. Scope and Motivation

One motivation for SHACL is Application Integration, where different software components, potentially maintained by different organizations, need to function together smoothly. As an everyday example, imagine an international company with a dozen divisions, each providing a feed of their Human Resources data to authorized users. Different divisions might use different software to produce their feeds, and there might be many distinct applications which consume the data, ranging from an employee phone book to a hiring-compliance monitoring system.

While systems like this are built and maintained around the world today, their complexity often becomes a problem. Not only are the systems expensive and sometimes unpleasant to maintain, but changing data fields and adding new applications can grow to be practically impossible. An "RDF Data Shapes" standard would help manage the complexity, greatly reducing the cost and hassle, by separating components while still allowing them to work together.

Specifically, in this example, SHACL would allow:

In all cases, the semantics of the data are determined by RDF and the vocabularies specified by the shape, so if the shapes match, the systems can reasonably be expected to interoperate correctly.

While SHACL is expected to have immediate everyday utility, as illustrated above, it has even wider potential applicability across a range of scales. At the large end, SHACL might be used by loosely-knit communities, where data is provided by organizations which are not under any central authority, such as charities and researchers around the world concerned with quality-of-life measures. At the small end, SHACL might be used within a mobile application environment to provide interoperability among independent sensor modules and tools for analyzing and acting on sensor results. The common thread is that SHACL allows a loose coupling, where independently maintained elements of an overall system can reliably and comfortably interoperate.

2. Organization of this Document

This document is organized as follows:

3. Use Cases

3.1 UC1: Model validation

There is a general need to validate that the instance data matches the models that have been defined in RDFS or OWL. The primary validation requirement is to ensure that the appropriate information is given for each property (or class) in the model. As examples, one could require that each property must have a domain, or that classes must be explicitly stated in the instance data. Input to this case is the RDF representation of an RDFS (or OWL) ontology.
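
Such a check could be sketched as a SPARQL ASK query over the ontology graph; this illustrates the intended semantics only and is not a proposed SHACL syntax.

Example
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Returns true if some declared property lacks an rdfs:domain,
# i.e. the "every property must have a domain" check fails.
ASK {
  ?property a rdf:Property .
  FILTER NOT EXISTS { ?property rdfs:domain ?domain }
}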

Summary: Requires the ability to check whether certain information is given/available for a property or class.

Related Requirements: R6.2

3.2 UC2: Enforcing cardinality

For a tool that builds a list of personal names for named-entity resolution to work correctly, every person must have one or more names specified, each of which is a string. Constraints can be used to verify that a particular set of data has at least one such name for each person.
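
As a sketch (using a hypothetical ex: vocabulary rather than any proposed SHACL terms), the check can be phrased as a SPARQL ASK query that returns true when some person violates the constraint.

Example
PREFIX ex:  <http://example.com/ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# True if some person has no name that is an xsd:string.
ASK {
  ?person a ex:Person .
  FILTER NOT EXISTS {
    ?person ex:name ?name .
    FILTER (DATATYPE(?name) = xsd:string)
  }
}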

Summary: Requires the ability to check the cardinality of properties as well as the type of their values.

Related Requirements: R6.2 and R8

3.3 UC3: Nuanced error conditions

There is a range of responses that any application may wish to act on, or that it may want to echo back to the user as a result of a validation process. There are the obvious results of "keep/reject", but often there will be a range of error or alert responses. There needs to be a way to associate an error level or code with the output of validation. Some applications will have a number of responses that inform users of ways they could improve their data, while still accepting all but the truly unusable data. Other applications could analyze data using a nuanced grading system.
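
A sketch of what such a nuanced validation result could look like, expressed in Turtle with a hypothetical ex: results vocabulary (this document deliberately avoids committing to specific SHACL terms):

Example
@prefix ex: <http://example.com/ns#> .

# A warning-level result: the record is accepted but flagged.
[] a ex:ValidationResult ;
   ex:severity  ex:Warning ;
   ex:message   "Record accepted, but the postal address is incomplete."@en ;
   ex:focusNode <http://data.example.com/person/4> .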

Summary: Requires the ability to return more fine-grained validation results, not just "pass/fail."

Related Requirements: R5.1, R5.9.1, R5.9.2, R5.9.3, R5.10, R10, R10.1, R10.2, and R10.3

3.4 UC4: Shape variations within a process or workflow

The same shape can have different values and different requirements at different points in a process or workflow. Any node in the graph may serve multiple roles; that is, the same node may include properties for a SubmittingUser and for an AssignedEmployee, and these will be relevant at different points in the process. As an example, an LDP Container (e.g. PendingIssues) accepts an IssueShape with a status of "assigned" or "unassigned". The LDP Container is an interface to a service storing data in a conventional relational database. Later, the issue gets resolved and is available in OldIssues without acquiring new type arcs. The constraints for issues in PendingIssues are different from those for issues in OldIssues, even though the instance data occupies a single graph.

Summary: Requires the ability to specify which RDF nodes should be validated against specific Shapes, e.g. by using filtering and/or scoping mechanisms.

Related Requirements: R12.1, R12.2, and R12.3

3.5 UC5: Complex constraints

Data applications may have a number of complex constraints that must interoperate. For example, there can be a wide variety of access rules defining privileges for viewing and updating data. These can be applied to accounts or to applications and functions. Incoming data, which itself can be complex, can be subjected to a large number of validation actions, some of which are dependent on output from prior application steps.

The design of validation must make these complex constraints appropriately efficient to apply, while fostering a manageable maintenance environment for the validation technology.

Summary: Requires the constraint language to be designed in a way that it can be used efficiently in production environments dealing with numerous complex constraint definitions.

Related Requirements: R6, R6.3, R6.5, R6.6, R6.7, R7.2, and R8

3.6 UC8: Checking RDF node type

It is often necessary or desirable to check whether certain property values of RDF nodes are of a specific node type (IRI, BlankNode or Literal, and all combinations thereof). One example is the need to state that a given property shall have only IRIs, and no blank nodes, as its values.

There are examples of this functionality in the VoID vocabulary (void:dataDump and void:exampleResource) and in SPARQL (isIRI, isBlank, isLiteral).
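
A sketch of such a check as a SPARQL ASK query, assuming a hypothetical ex:homepage property whose values must be IRIs:

Example
PREFIX ex: <http://example.com/ns#>
# True if some value of ex:homepage is not an IRI
# (e.g. a blank node or a literal).
ASK {
  ?s ex:homepage ?value .
  FILTER (!isIRI(?value))
}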

Summary: Requires the possibility to specify the expected node type of a property, i.e. check whether it is an IRI, a literal, a blank node, or some combination of those.

Related Requirements: R5.5

3.7 UC9: Contract time intervals

An ontology may state that instances of a class have a value for a property, without requiring that the value actually be provided. Subclasses may be associated with a constraint that requires that a value for the property is provided. For example, in the OMG time ontology adopted by FIBO, every contract has to have an end date. A shape (set of constraints) may require that bonds (a subclass of contracts) have specified end dates, without requiring that all contracts have specified end dates.

Summary: Requires the possibility to inherit and extend Shapes of superclasses.

Related Requirements: R8

3.8 UC10: Cardinality >= 0

There is a class in FIBO called IncorporatedCompany, which is a subclass of a number of OWL restrictions. Many of them are of the form:

Example
fibo-be-oac-cpty:hasControllingInterestParty min 0 fibo-be-oac-cctl:VotingShareholder
i.e., a qualified cardinality of min 0. For example:

Summary: Requires the possibility to select focus nodes based on specific conditions. Requires the possibility to specify default values.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R8, and R12.3

3.9 UC11: Model-Driven UI constraints

There is a need to have constraints that provide model-driven validation of permissible values in user interfaces. The major requirement here is a declarative model of:

It must be possible to perform validation of this type on instance data without being required to make use of a specific mechanism, such as SPARQL queries. Instead, the model should be of a sufficiently high level that it is not dependent on a single tool or method. However, at the same time there are many advanced constraints that need to be validated (either on server or client) before a form can be submitted. These constraints are not necessarily "structural" information, but rather executable code that returns error messages.

Summary: Requires the ability to declare and constrain permitted values for properties, as well as their cardinalities, in an abstract and "high-level" fashion.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R5.10, R8, R14.1, R14.2, and R14.3

3.10 UC12: Application interoperability

There is one application (e.g. Cimba) which stores its application state in RDF. It currently queries and modifies that state using HTTP GET and PUT operations on RDF sources, whereas another version that is currently under development uses SPARQL to query and modify the data. The question is: how do we communicate the shape of the data this application reads and writes to other developers who want to make compatible applications? We want to say: as long as your data is of this form, Cimba will read it properly. We also want to say: Cimba may write data of any of these forms, so to be interoperable your application will need to read and correctly process all of them.

Summary: Requires the possibility to make shape definitions exchangeable and independently accessible from the data graph.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R11.5, and R11.7

3.11 UC13: Specification and validation of metadata templates

Data gathering functions, especially those that are consortial or rely on aggregation of data from multiple sources, need to be able to easily create templates to represent metadata. Ease of templating is particularly important in rapidly changing fields, such as medicine. For this reason, it is crucial that a language be developed that can allow easy templating of metadata and constraints. The templates must allow users to define different sets of metadata elements and their requirements. Templates should be modular and re-usable.

These templates will contain metadata elements that are either required or optional, and that restrict the value of the field to specific datatypes (e.g. string, integer, decimal, date). Values may be restricted by length or to a regular expression pattern; they may be limited to specific categorical values or terminology trees/class expressions of a target ontology.
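
Purely as an illustration, such a template could be written down in RDF along the following lines, using a hypothetical ex: template vocabulary:

Example
@prefix ex:  <http://example.com/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:PatientRecordTemplate a ex:MetadataTemplate ;
    ex:element [
        ex:field    ex:patientId ;
        ex:required true ;
        ex:datatype xsd:string ;
        ex:pattern  "^P[0-9]{6}$"
    ] ;
    ex:element [
        ex:field    ex:diagnosisCode ;
        ex:required false ;
        # Values drawn from a terminology tree of a target ontology.
        ex:valuesFrom <http://terminology.example.com/icd10>
    ] .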

Ideally, the shapes language should be readable by computers in order to automatically generate template forms with restriction to specified values. Moreover, libraries and tools to construct and validate templates and their instance data should be readily available.

Summary: Requires the possibility to define shapes for a specific node in a modular manner.

Requires the possibility to define custom constraint templates.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R7, R7.1, R7.2, R7.3, and R7.4

3.12 UC14: Quality Assurance for object reconciliation

In data integration activities, tools such as Silk or Limes may be used to discover entity co-references. Entity co-references are pairs of different identifiers, often in different datasets, that refer to the same entity. Detected co-references are often recorded as owl:sameAs triples. This may be a step in an object reconciliation pipeline.

It would be nice if shapes could flexibly state conditions by which to check that the identity of objects has been correctly recorded; that is, check conditions under which a same-as link should be present between two identifiers, or conversely, check conditions for misidentified same-as links.

The intent here is not that the validation process should produce the expected owl:sameAs triples. We assume that some other tool or process has already produced these triples. The purpose of these validation rules is to perform quality assurance, or sanity checks, on the output of these other tools or processes. Thus, the quality or completeness of the generated linkset could be assessed.

We note however that object reconciliation tools could be driven by constraints like those given above. So potentially, an object reconciliation tool and a validator could use the same input constraints. Thus, this story straddles the boundaries between constraint checking and inference.

Summary: Requires the possibility to appropriately apply filtering and scoping mechanisms to select focus nodes for validating constraints.

Related Requirements: R12.1, R12.2, and R12.3

3.13 UC15: Validation of variant dataset descriptions

Vocabulary and data re-use are desirable features of an RDF application. Metadata for a community or function may be expressed as levels of description that re-use existing vocabularies in a way that is appropriate to different contexts. For some data it may be possible to define a subset that satisfies a minimum description. In other cases, data may be re-used in a variety of configurations. Each of these contexts can have different validation constraints.

For example, in a data environment that has a three-component model for summary, versioning, and distribution-level descriptions, each component has access to a specific set of metadata elements, and these are specified as MUST, SHOULD, MAY, and MUST NOT. As such, there are different conformance criteria for each level. Metadata values are either unconstrained rdfs:Literals, constrained rdfs:Literals, URIs with a specified URI pattern, instances of a specified URI-identified type, or a disjunction of URI-specified types.

Summary: Requires the functionality to restrict application of constraints to certain contexts.

Related Requirements: R5.1

3.14 UC16: Constraints and controlled reasoning

A use case we faced recently revolved around the integration of distributed configurations (i.e. object-oriented models) with RDFS and SPARQL. In this particular use case we had to assume both the Unique Name Assumption (UNA) and the Closed World Assumption (CWA) for our ontologies, since the models (i.e. configurations) from which those ontologies were derived were generated by product configurators that impose both UNA and CWA. Since neither RDFS nor OWL imposes UNA/CWA, we had to come up with some workarounds, which were basically:

SPARQL was used to perform query tasks on the global schema as well as to check simple integrity constraints by translating e.g. cardinality restrictions into ASK queries.

One major problem that arose from our workaround to impose UNA was that SPARQL is unaware of the special semantics of owl:sameAs. This matters especially when using counting aggregates, where one usually wants to count the number of real-world objects and not the number of URIs referring to them. As an example, we defined two SPARQL queries which should count the number of subnets of a certain system.
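
The original queries are not reproduced here; the following sketch (with a hypothetical ex: vocabulary) illustrates the discrepancy. The first query counts subnet URIs directly, while the second collapses each owl:sameAs clique to a single canonical representative before counting.

Example
PREFIX ex:  <http://example.com/ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Counts URIs: two owl:sameAs-linked subnet URIs count twice.
SELECT (COUNT(DISTINCT ?subnet) AS ?n)
WHERE { ex:system1 ex:hasSubnet ?subnet }

# Counts real-world objects: each owl:sameAs clique is collapsed
# to its lexicographically smallest IRI before counting.
SELECT (COUNT(DISTINCT ?canon) AS ?n)
WHERE {
  ex:system1 ex:hasSubnet ?subnet .
  ?subnet (owl:sameAs|^owl:sameAs)* ?canon .
  FILTER NOT EXISTS {
    ?subnet (owl:sameAs|^owl:sameAs)* ?smaller .
    FILTER (STR(?smaller) < STR(?canon))
  }
}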

Summary: Requires the possibility to encapsulate verbose constraint definitions into constraint templates, thus allowing their reuse in other shapes as well as increase readability of shape definitions.

Related Requirements: R7, R7.1, R7.2, R7.3, and R7.4

3.15 UC17: Specifying subsets of data

The medical community has an interest in the notion of "archetypes" that are expressed as abstract constraints on a reference model. The reference model describes the largest set of possible instances of a given collection of data, and the archetypes then constrain this set of instances by restricting cardinality, types, value ranges, etc. One way to implement archetype models would be through RDF and SHACL, where the reference model would be viewed as the "constraints" -- the set of constraints that are used to validate incoming data and to document dataset validity.

The archetypes, however, would serve the additional purpose of defining "instance subsets". The archetypes identify filters/queries that would allow a user to return a set of shapes that meet certain criteria such as abnormal values, co-occurrence, etc. They could also act as filters, funneling incoming instances to secondary processes where necessary.

It should be noted that the primary representation for archetypes in the medical community will probably not be SHACL -- they will be using Archetype Definition Language (ADL) (or the UML equivalent, AML) and/or profiles, with SHACL being a translation.

Summary: Defines a use case where shape definitions could be used to partition a data set (i.e. one could query for individuals that conform to a specific shape).

Related Requirements: R12.1, R12.2, and R12.3

3.16 UC19: Query Builder

Various tools are contributing data to a triple store. A Query Builder wants to know the permitted or likely shapes of the data over which the generated queries must run, so that the end user can be presented with a nice interface prompting for likely predicates and values. Since the data is dynamic, this is not necessarily the same as the shape that could be reverse engineered from the existing data. The Query Builder and the data-producing tools are not provided by the same team - the Query Builder team has very limited control over the data being produced. The source of the data might not provide the necessary shape information, so we need a way for the Query Builder team (or a third party) to be able to provide the shape data independently. See also Ontology-Driven Forms.

Summary: Requires the possibility to provide shape definitions independently of instance data.

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R8, R11.5, R11.7, R14.1, R14.2, and R14.3

3.17 UC20: Creation Shapes

A client creating a new resource by posting to a Linked Data Platform Container wants to know the acceptable properties and their values, including which ones are mandatory and which optional. Note that this creation shape is not necessarily the same as the shape of the resource post-creation - the server may transform some values, add new properties, etc.

Summary: Requires the ability to decide which shape definitions should be valid/triggered for a certain node (in case those shape definitions are mutually exclusive).

Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R8, and R14.1

3.18 UC21: SKOS constraints

The well-known SKOS vocabulary defines constraints that are outside of the expressivity of current ontology languages, such as:

The constraint language must include the capability to define these constraints, and in particular these constraints should be provided as easily re-usable modules.

Summary: Requires the possibility to define complex constraints similar to those defined in the SKOS vocabulary.

Related Requirements: R6, R6.4, R6.6, R7, and R7.3

3.19 UC22: RDF Data Cube constraints

The Data Cube Vocabulary provides a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. While the bulk of the vocabulary is defined as an RDF Schema, it also includes integrity constraints.

Each integrity constraint is expressed as narrative prose and, where possible, a SPARQL ASK query or query template. If the ASK query is applied to an RDF graph then it will return true if that graph contains one or more Data Cube instances which violate the corresponding constraint.

Using SPARQL queries to express the integrity constraints does not imply that integrity checking must be performed this way. Implementations are free to use alternative query formulations or alternative implementation techniques to perform equivalent checks.
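
As an illustration, the vocabulary's first integrity constraint (every qb:Observation must have exactly one associated data set) is checked by an ASK query along the following lines, returning true when the constraint is violated:

Example
PREFIX qb: <http://purl.org/linked-data/cube#>
ASK {
  {
    # Observation with no data set.
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?dataset }
  } UNION {
    # Observation with more than one data set.
    ?obs a qb:Observation ;
         qb:dataSet ?dataset1, ?dataset2 .
    FILTER (?dataset1 != ?dataset2)
  }
}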

Summary: Requires support of RDF Data Cube integrity constraints.

Related Requirements: R6, R6.2, and R6.6

3.20 UC23: schema.org constraints

Developers at Google have created a validation tool for the well-known schema.org vocabulary for use in Google Search, Google Now and Gmail. They have discovered that what may seem like a potentially infinite number of possible constraints can be represented quite succinctly using existing standards and serialized as RDF. Some examples of schema.org constraints are:

Summary: Requires the possibility to represent schema.org constraints.

Related Requirements: R6, R6.2, R6.3, R6.6, and R6.8

3.21 UC24: Open Content Model

Consider a situation in which there is a need to integrate similar information from multiple applications and that the application owners have agreed on an RDF representation for this information. However, because the applications have some differences, the application owners can only agree on those data items that are common to all applications. The defined RDF representation will include the common data items, and will allow the presence of other undefined data items in order to accommodate differences among the applications. In this situation, the RDF representation is said to have an open content model.

Since the shape of a resource may depend on the tool that hosts it, or the project that hosts it within a tool, but the RDF type of the resource may not depend on the tool or project, there is in general no way to navigate to the shape given only its RDF type. The OSLC Resource Shapes specification provides two mechanisms for navigating to the appropriate shape. First, the RDF property oslc:resourceShape, where oslc: is <http://open-services.net/ns/core#>, may be used to link a tool or project description to a shape resource. Second, the RDF property oslc:instanceShape may be used to link a resource to its shape.
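
In Turtle, the two navigation mechanisms look roughly as follows (the resource IRIs are illustrative):

Example
@prefix oslc: <http://open-services.net/ns/core#> .

# A tool or project description links to the shape for its resources.
<https://tool.example.com/project/p1>
    oslc:resourceShape <https://tool.example.com/shapes/defect> .

# An individual resource links directly to its own shape.
<https://tool.example.com/defects/42>
    oslc:instanceShape <https://tool.example.com/shapes/defect> .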

See Open Content Model Example for a detailed example.

Summary: Requires the possibility to address a resource graph based on criteria unrelated to its rdf:type. This can be a general context, or a specific application function.

Related Requirements: R8

3.22 UC25: Primary Keys with URI patterns

It is very common to have a single property that uniquely identifies instances of a given class. For example, when you import legacy data from a spreadsheet, it should be possible to automatically produce URIs based on a given primary key column. The proposed solution here is to define a standard vocabulary to represent the primary key and a suitable URI pattern. This information can then be used both for constraint checking of existing instances, and to construct new (valid) instances. One requirement here is advanced string processing, including the ability to turn a partial URI and a literal value into a new URI.
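
A sketch of the string processing involved, assuming a hypothetical ex:employeeId primary-key property: the query computes the URI each instance ought to have, which can then be compared with the URI it actually uses.

Example
PREFIX ex: <http://example.com/ns#>
SELECT ?instance ?expected
WHERE {
  ?instance ex:employeeId ?key .
  # Build the expected URI from the pattern and the key value.
  BIND (IRI(CONCAT("http://example.com/employee/",
                   ENCODE_FOR_URI(STR(?key)))) AS ?expected)
  FILTER (?instance != ?expected)   # report mismatches only
}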

Details: Primary Keys with URI Pattern

Summary: Requires the ability to create IRIs from non-IRI identifiers.

Related Requirements: R6 and R8

3.23 UC26: rdf:Lists and ordered data

Libraries have a number of resources that are issued in ordered series. Any library may own or have access to some parts of the series, either sequential or with broken sequences. The list may be very long, and it is often necessary to display the list of items in order. The order can be nicely numerical, or not. Another ordered list use case is that of authors on academic journal articles. For reasons of attribution (and promotion!), the order of authors in article publishing can be significant. This is not a computable order (e.g. alphabetical by name). There are probably other cases, but essentially there will definitely be a need to have ordered lists for some data.

Validation could be:

Details: rdf:List Stresstest
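
One such check, sketched as a SPARQL ASK query with a hypothetical ex:authorList property, traverses the list with a property path and returns true if any member is not an IRI:

Example
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.com/ns#>
# Walk the rdf:List and flag members that are not IRIs.
ASK {
  ?article ex:authorList ?list .
  ?list rdf:rest* ?cell .
  ?cell rdf:first ?member .
  FILTER (!isIRI(?member))
}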

Summary: Requires the possibility to check whether all members of a list have certain characteristics.

Related Requirements: R6, R6.7, and R6.8

3.24 UC27: Relationships between values of multiple properties

Cultural heritage (CH) data is generally created in a distributed way, so when data is gathered together in a single aggregation, quite a bit of checking must be done. One of the key aspects of CH data is the identification of persons and subjects, in particular relating them to historical contexts. For persons, a key context is their own birth and death dates; for events, there is often a date range representing a beginning and end of the event. In addition, there are cultural heritage objects that exist over a span of time (serial publications, for example). In each of these cases, it is desirable to validate the relationship of the values of properties that have temporal or other ordered characteristics.
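
For instance, the birth/death case could be sketched as a SPARQL ASK query (hypothetical ex: vocabulary) that returns true when the expected ordering is violated:

Example
PREFIX ex: <http://example.com/ns#>
# Assumes xsd:date-typed values for both properties.
ASK {
  ?person ex:birthDate ?birth ;
          ex:deathDate ?death .
  FILTER (?death < ?birth)
}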

Details: Relationships between values of different properties

Summary: Requires the ability to perform comparisons on the values in selected sets of properties, for example, to compare the values of properties representing birth date and death date in order to validate that the birth date precedes the death date.

Related Requirements: R6, R6.6, R6.7, R7.3, and R8

3.25 UC28: Self-Describing Linked Data resources

In Linked Data, related information is accessed by URI dereferencing. The information that is accessible this way may represent facts about a particular resource, but also typing information for the resource. The types can themselves be used in a similar way to find the ontology describing the resource. It should be possible to use these same mechanisms to find constraints on the information provided about the resource.

For example, the ontology could include constraints or could point to another document that includes constraints. Or the first document accessed might include constraints or point to another document that includes constraints.

DCMI story: For some properties there is a requirement that the value IRI resolve to a resource that is a skos:Concept. The resource value is not limited to a particular skos:Concept scheme.

Summary: The constraint language must be able to validate information received from dereferencing the value IRI, e.g. check whether the value is a member of a skos:ConceptScheme.

Related Requirements: R7, R7.1, R7.2, R7.3, and R8

3.26 UC29: Describing interoperable, hypermedia-driven Web APIs (with Hydra)

Hydra is a lightweight vocabulary to create hypermedia-driven Web APIs. By specifying a number of concepts commonly used in Web APIs it enables the creation of generic API clients. The Hydra core vocabulary can be used to define classes and "supported properties" which carry additional metadata such as whether the property is required and whether it is read-only.

Summary: The constraint language should support constraints commonly used in API clients.

Related Requirements: R5.1, R5.9.1, R5.9.2, R5.9.3, and R8

3.27 UC30: PROV constraints

The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web. One of these documents is a library of constraints which defines valid PROV instances. The actual validation process is quite complex and requires a rule-like normalization step. Various implementations of this validation process exist, including a set of SPARQL INSERT/SELECT queries sequenced by a Python script, as well as an implementation in Java and in Prolog. Stardog also defines an "archetype" for PROV, which seems to be implemented in SPARQL using their ICV engine.

Summary: Requires the possibility to express constraints as defined in PROV's library of constraints.

Related Requirements: R6

3.28 UC31: LDP: POST content to container of a certain shape

Some simple LDP server implementations may be based on lightweight app server technology and only deal with JSON(-LD) and Turtle representations for their LDP RDF Sources (LDP-RS) on top of an existing application, say Bugzilla. As a client implementer, I may have a simple JavaScript application that consumes and produces JSON-LD. I want to have a way to programmatically provide the end user with a simple form to create new resources, and also a way to potentially auto-prefill this form based on data from the current context.

LDP defines some behavior when a POST to an ldp:Container fails, by outlining expected status codes and additional hints that could be found either in the response body of the HTTP POST request or in a response header (such as a Link relation of "http://www.w3.org/ns/ldp#constrainedBy"). A client can proactively request headers (instead of trying the POST and having it fail) by performing an HTTP HEAD or OPTIONS request on the container URL and inspecting the link relation for "constrainedBy".

Typical constraints are:

The current gap is whatever is at the end of the "constrainedBy" link, which could be anything: HTML, OSLC Resource Shapes, SPIN. The LDP WG discussed the need to have something a bit more formalized and deferred making any recommendation, looking to apply these requirements to the Data Shapes work. Once that work matures and meets the requirements, LDP could then provide a recommendation for it.

Summary: This use case covers topics similar to those discussed in UC11.

Related Requirements: no suitable requirements approved yet.

3.29 UC32: Non-SPARQL based solution to express constraints between different properties

Assume there are clients consuming RDF resources, interfacing with an LDP container, that need to work asynchronously (the client being a worker's mobile device in a work zone with no connectivity). The client needs to allow workers to create entries locally in the offline application to mark completion of different stages of the work. These entries will be synced with the LDP container once the device has network connectivity. Prior to that, while the client is offline, the client software needs to perform a range of validations on the user's entries to reduce the probability of an invalid entry.

In addition to the basic data type/required/cardinality "stand alone" validations, the client needs to validate constraints between different properties:

The client side does not have access to any triple store/LDP container. If these validations can be expressed in a higher level language, it would make it easier for clients to implement them.
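
Purely as a sketch of what such a declarative, higher-level inter-property constraint might look like in RDF (hypothetical ex: vocabulary, not proposed SHACL terms):

Example
@prefix ex: <http://example.com/ns#> .

# "The value of start_time must be less than the value of end_time."
ex:WorkEntryShape ex:propertyPairConstraint [
    ex:leftProperty  ex:startTime ;
    ex:operator      ex:lessThan ;
    ex:rightProperty ex:endTime
] .

A client library could evaluate such declarations directly on its local data, without any triple store or SPARQL processor.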

Summary: Expresses the requirement to be able to define constraints over more than one property, e.g. that the value of property start_time must be less than the value of property end_time.

Those interdependencies between properties of the same RDF node should be expressible in a higher-level language.

Related Requirements: R7, R7.4, R11.5, and R11.7

3.30 UC33: Structural validation for queriability

Data frequently has structural errors. Consider a schema where a medical procedure should have no more than one outcome. Accidental double entry occurs when, e.g., a clinician and her assistant both enter outcomes into the database. Statistical queries over malformed data such as this lead to misinterpretation and inaccurate conclusions. Shapes can be used to sequester well-formed data for simpler analysis.

Example Data
_:Bob :hadIntervention [
	:performedProcedure [ 	a bridg:PerformedProcedure ;
				:definedBy [ :coding term:MarrowTransplant ; :location terms:Manubrium ] ];
	:assessmentTest     [   a bridg:PerformedObservation ;
				:definedBy [ :coding term:TumorMarkerTest ; :evaluator <LabX> ] ;
				:result    [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ],
					   [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ]
				]
] .

The obvious SPARQL query on this will improperly weight this as two positive outcomes:

Example Query
SELECT ?location ?result (COUNT(*) AS ?count)
WHERE {
	?who :hadIntervention [
		:performedProcedure [ :definedBy [ :coding term:MarrowTransplant ; :location ?location ] ];
		:assessmentTest     [ :definedBy [ :coding term:TumorMarkerTest ] ;
					  :result    [ :coding ?result ] ]
	]
} GROUP BY ?result ?location
(This is a slight simplification for the sake of readability. In practice, an auxiliary hierarchy identifies multiple codes as positive outcomes, e.g. term:ImprovedToNormal and term2:ClinicalCure, but the effect is the same as described here.) Being able to select subsets of data related to an RDF node, and thus to define a well-formed, cleansed representation of that node (represented as a shape), makes it possible to improve both the quality of the data and its queriability.

Summary: Requires the ability to perform structural validation over RDF data.

Related Requirements: R7.4

3.31 UC34: Large-scale dataset validation

A publisher has a very large RDF database (on the order of millions or billions of triples) and wants to define multiple shapes for the data that will be checked at regular intervals. To make this process effective, the validation must be able to run within a reasonable time span, and the validation engine must be flexible enough to provide different levels of detail in the violation results. These levels can range from the specific nodes that violate a shape facet, to the success or failure of a shape facet, to aggregated violations per shape facet, possibly along with an error prevalence.

Applying a shape to a large database can return thousands or millions of violations, and it is not efficient to look at all erroneous RDF nodes one by one. In addition, many times all violations for a specific facet can be attributed to a specific mapping or source code function. An expected workflow in this case is that the maintainer runs a validation asking for aggregated violations per shape facet along with a sample of specific nodes (e.g. 10). Having the higher-level overview along with the sample data, the maintainer can choose the order in which she will address the errors.

Summary: Basically a repetition of UC3 with additional requirements regarding the validation performance.

Related Requirements: R10

3.32 UC35: Describe disconnected graphs

This use case reflects how information resources are created (e.g. via HTTP POST) or modified (e.g. via HTTP PUT). In these situations, the body of the HTTP request has an RDF content type (RDF/XML, Turtle, JSON-LD, etc.). The server typically needs to verify that the body of the request satisfies some application-specific constraints. Many proposed solutions have an implicit assumption that all RDF graphs have a distinguished root node which is the subject of triples that define either literal properties or links to other subjects, which may in turn have literal properties or links to further subjects. The implication is that all the nodes of interest are connected to the root node. However, an RDF graph need not be connected, and in fact disconnected RDF graphs do appear in real-world Linked Data specifications. The RDF representation of an information resource may be a disconnected graph in the sense that the set of nodes in the graph may be partitioned into two disjoint subsets A and B such that there is no undirected path that starts in A and ends in B.

The example can be taken from a specification related to access control. A conformant access control service must host an access control list resource that supports HTTP GET requests. The response to an HTTP GET request has a response body whose content type is application/ld+json, i.e. JSON-LD. An example is given below. In this example, there is a distinguished root node, i.e. the node of type acc:AccessContextList, but it is not connected to the other nodes of interest, i.e. the nodes of type acc:AccessContext.

Example
{
  "@context": {
    "acc": "http://open-services.net/ns/core/acc#",
    "id": "@id",
    "type": "@type",
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description"
  },
  "@graph": [{
     "id": "https://a.example.com/acclist",
     "type": "acc:AccessContextList"
    }, {
     "id": "https://a.example.com/acclist#alpha",
     "type": "acc:AccessContext",
     "title": "Alpha",
     "description": "Resources for Alpha project"
    }, {
     "id": "https://a.example.com/acclist#beta",
     "type": "acc:AccessContext",
     "title": "Beta",
     "description": "Resources for Beta project"
  }]
}

Summary: States the requirement that constraints over RDF graphs must be describable for both disconnected and connected graphs.

Related Requirements: R6.7, R9, and R12.1

3.33 UC36: Support use of inverse properties

In some cases the best RDF representation of a property-value pair may reuse a pre-existing property in which the described resource is the object and the property value is the subject. The reuse of properties is a best practice for enabling data interoperability. The fact that a pre-existing property might have the opposite direction should not be used as a justification for the creation of a new inverse property. In fact, the existence of both inverse and direct properties makes writing efficient queries more difficult since both the inverse and the direct property must be included in the query.

For example, suppose we are describing test cases and want to express the relations between test cases and the requirements that they validate. Further suppose that there is a pre-existing vocabulary for requirements that defines the property ex:isValidatedBy which asserts that the subject is validated by the object. In this case there is no need to define the inverse property ex:validates. Instead the representation of test case resources should use ex:isValidatedBy with the test case as the object and the requirement as the subject.
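
In data, this direction of use looks as follows (illustrative IRIs):

Example
@prefix ex: <http://example.com/ns#> .

# The test case being described appears as the object;
# no inverse property ex:validates is introduced.
ex:requirement17 ex:isValidatedBy ex:testCase42 .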

This situation cannot be described by the current OSLC Shapes specification, because OSLC Shapes describe properties of a given subject node, so inverse properties cannot be used. The OSLC Shapes submission, however, proposes a possible solution. See http://www.w3.org/Submission/shapes/#inverse-properties.

Summary: For the sake of simplicity, a potential constraint language shall allow the usage of properties in their inverse direction where applicable.

Related Requirements: R5.1 and R5.11

3.34 UC37: Defining allowed/required values

The cultural heritage community has a large number of lists that control values for particular properties. These are similar to the DCMI types, but some are quite extensive (>200 types of roles for agents in relation to resources). There is also the concept of "authorities" which control the identities of people, places, subjects, organizations, and even resources themselves. Many of these lists are centralized in major agencies (Library of Congress, Getty Art & Architecture Archive, National Library of Medicine, and national libraries throughout the world). Not all have been defined in RDF or RDF/SKOS, but those that have can be identified by their IRI domain name and pattern. Validation tools need to restrict or check usage according to the rules of the agency creating and sharing the data. Some patterns of needed validation are:

  1. must be an IRI (not a literal)
  2. must be an IRI matching this pattern
  3. must be an IRI matching one of >1 patterns
  4. must be a (any) literal
  5. must be one of these literals ("red" "blue" "green")
  6. must be a typed literal of this type (e.g. XML dataType)
  7. literal must have a language code
Some of these are conditional: for resources of type:A, property:P has allowed values a,b,c,f.
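
As a sketch of pattern 2, a SPARQL ASK check against an agency's IRI pattern (the property and pattern are illustrative):

Example
PREFIX dct: <http://purl.org/dc/terms/>
# True if some subject value is not an IRI from id.loc.gov.
ASK {
  ?resource dct:subject ?value .
  FILTER (!(isIRI(?value) &&
            regex(STR(?value), "^http://id\\.loc\\.gov/")))
}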

Summary: Requires the possibility to constrain property values using Shapes.

Related Requirements: R5.5 and R10

3.35 UC38: Describing and validating LDP

A small company specializing in the development of LDPs needs to describe the model of the RDF graphs that will be generated from Excel spreadsheets and will also be published as SPARQL endpoints. The LDPs could contain observations which are usually instances of type qb:Observation, but may contain different properties. The content of those portals is generally statistical data which is derived from Excel spreadsheets and can easily be mapped to RDF Data Cube observations.

Examples of constraints are:

In this context, the company is looking for a solution that can be easily understood by a team of developers who are familiar with OO programming languages, relational databases and XML technologies, and who have some basic RDF knowledge, but who are not familiar with other semantic web technologies like SPARQL, OWL, etc. The solution must be machine processable, so the contents of the LDPs can be automatically validated and reused, both internally and by third parties.

Finally, the company would like to compare the schemas employed by the different LDPs so they can evaluate the differences between RDF nodes that appear in those portals, and even be able to create new applications on top of the data aggregated by the portals.

Summary: Define RDF graphs to be generated from spreadsheet software and made available through an LDP.
Provide a comparison function for RDF graphs.

Related Requirements: TBD

3.36 UC39: Arbitrary cardinality

Some clinical data require specific cardinality constraints, e.g.

This makes it necessary to be able to define arbitrary cardinality constraints, i.e. not to be limited to a set of predefined values.

Summary: Requires the ability to define arbitrary cardinality constraints.

Related Requirements: R5.2

3.37 UC40: Describing inline content versus references

An IRI used as a value in a triple may be the subject of further triples that are provided inline, or it may need to be de-referenced to complete the graph. In some cases the IRI must be de-referenced to perform validation; in other cases, de-referencing isn't needed or is considered too costly for a low-value property.

Summary: The constraint language must make it possible to indicate IRIs that must be de-referenced.

Related Requirements: TBD

3.38 UC41: Validating schema.org instances against model and metamodel

Validation of schema.org instances must adhere to the definitions used in that vocabulary. A processor for our validation language should be able to accept a schema.org instance as well as the schema.org model, expressed in an RDF syntax, as inputs (perhaps as separate named graphs), and validate the instance against the model.

Summary: The constraint language should adhere to schema.org vocabulary practices to process schema.org data.

Related Requirements: TBD

3.39 UC42: Constraining RDF graphs to provide better mapping to JSON

In client-side application development, and in integration between RDF-based systems and JSON-based APIs, certain problems arise when mapping between the RDF data model and the JSON data model. In the unconstrained RDF data model, there are too many variations to map arbitrary RDF graphs cleanly to JSON. By selecting an RDF vocabulary that covers the desired JSON structure, and using Shapes to express constraints over the vocabulary, the mapping could be made sensible and predictable.

The requirements for this are:

Summary: Use Shapes to define JSON-compatible RDF, in particular a maxCardinality of "1", well-formed rdf:Lists, and a limit of one string literal per language tag.

Related Requirements: R5.2, R6.4, R6.12, R6.13

3.40 UC45: Linked Data Update via HTTP GET and PUT

As a client of a Linked Data application, I need to know the constraints on the data so I can update resources. The data is in an RDF format. I retrieve the data via HTTP GET, edit it, validate it, then modify the resource via HTTP PUT. I need to know how to validate the data before I send the HTTP PUT request.

For example, information about the constraints that the application enforces could be provided by linking the data to the shape via a triple in the data. If the data IRI is X and the shape IRI is Y then a link such as (X sh:hasShape Y) would work. Y could be a resource hosted anywhere on the web.
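
In Turtle, such a link might look like this (the sh: binding below is a placeholder, not an agreed namespace):

Example
@prefix sh: <http://example.com/ns/shapes#> .

<http://data.example.com/orders/42>
    sh:hasShape <http://shapes.example.com/OrderShape> .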

Summary: Linked Data users need to be able to access shape constraints together with the data so they can maintain the integrity of graphs that are updated.

Related Requirements: TBD

3.41 UC46: Software regression testing with SHACL

As an RDF software and data developer, I need to define constraints for the data I generate with my software. It is important to see which constraints succeed or fail, and to store the results in a database. When a previously successful test fails, it is generally an indication of a software regression.

I am not interested in storing detailed violation instances, as most of the time I work with sample or mock data that are subject to change and cannot be directly compared. What can instead be persistent are the actual constraints (shapes or shape facets), and I need a standardized way to store the status for each constraint as true/false, or with additional metadata (e.g. error count or prevalence), for a specific validation.

Summary: There is a need to store test results related to constraints on shapes for the purposes of software testing.

Related Requirements: R10, R10.1, R10.2, and R10.3

3.42 UC47: Clinical data constraints

Clinical information systems reuse general predicates for observations and relationships between observations. For example, a blood pressure is an observation with two constituent observations: systolic and diastolic. Likewise, an APGAR observation is a constellation of nine observations. Definition of these data elements requires repeated constraints on the same predicate, analogous to OWL qualified cardinality constraints.
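
A data sketch of the blood-pressure case (hypothetical ex: vocabulary): both constituents use the same predicate, so a plain cardinality constraint cannot distinguish them, whereas a qualified constraint could require exactly one constituent of each type.

Example
@prefix ex: <http://example.com/ns#> .

ex:bp1 a ex:BloodPressureObservation ;
    ex:hasComponent [ a ex:SystolicObservation ;  ex:valueMmHg 120 ] ,
                    [ a ex:DiastolicObservation ; ex:valueMmHg 80 ] .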

Summary: There is a need for qualified cardinality constraints on shapes.

Related Requirements: TBD

4. Requirements

This section lists the requirements arising from the use-cases catalogued in this document. Specific requirements that have been de-prioritized or rejected have been left in the document for completeness, but are shown as struck out.

4.1 SHACL Language Requirements

4.1.1 R1: Higher-Level Language

Constraints/shapes shall be specifiable in a higher-level language with 1. definitional capabilities, such as macro rolling up and naming, and 2. control infrastructure for, e.g., recursion.

Motivation: Dublin Core Requirement 103

4.1.2 R2: Concise Language

Constraints/shapes shall be specifiable in a concise language.

Motivation: Dublin Core Requirement 184

4.1.3 R3: Addressability

Collections of constraints/shapes may be addressable and discoverable. Individual constraints/shapes may be addressable and discoverable.

Motivation: Dublin Core Requirement 147 and Dublin Core Requirement 148

4.1.4 R4: Annotations

Constraints/shapes may incorporate extra information that does not affect validation. It shall be possible to search for constraints/shapes with particular extra information.

Motivation: Dublin Core Requirement 208

4.1.5 R7: Macro-Language Features

The language should enable the definition of macros as shortcuts to recurring patterns, and enable inexperienced users to define rich constraints. Macros should be high-level terms that improve overall readability, separation of concerns and maintainability. This overlaps with the already approved "Higher-Level Language" requirement (R1).

Motivation: UC5, UC16, UC21, UC27, UC28, and UC32

4.1.6 R7.1: Named Shapes

It should be possible to encapsulate a group of constraints (a Shape) into a named entity, so that the Shape can be reused in multiple places, including across the Web.

Motivation: UC16 and UC28

4.1.7 R7.2: Function and Property Macros

In order to support maintainable and readable constraints, it should be possible to encapsulate recurring patterns into named entities such as functions and dynamically computed properties. This requirement is orthogonal to almost every user story. It includes a vocabulary to share function definitions.

Motivation: UC5, UC16, and UC28

4.1.8 R7.3: Constraint Macros

Some constraint patterns recur with only slight modifications. Example: SKOS constraints that multiple properties must be pairwise disjoint. The language should make it possible to encapsulate such recurring patterns in a parameterizable form.

Motivation: UC21, UC27, and UC28

4.1.9 R7.4: Nested Constraint Macros

It should be possible to combine the high-level terms of the constraint language into larger expressions using nested constraints. Examples of this include ShEx, Resource Shapes' oslc:valueShape and owl:allValuesFrom.

Motivation: UC32 and UC33

4.1.10 R10: Vocabulary for Constraint Violations

Instead of just reporting yes/no, the language needs to be able to return more meaningful messages including severity levels, human-readable error descriptions and pointers at specific patterns in the graph.

Motivation: UC3, UC34, and (almost every other use case)

4.1.11 R10.1: Severity Levels

The language should allow the creation of error responses that can include severity levels as desired.

Motivation: UC3

4.1.12 R10.2: Human-readable Violation Messages

The language should make it possible for constraint checks to create human-readable violation messages that can be either created explicitly by the user or generated dynamically from the constraint definition. It should be possible to create such messages in multiple languages.

Motivation: UC3

4.1.13 R10.3: Constraint Violations should point at Specific Nodes

The language should make it possible for authors of constraint checks to produce pointers at specific nodes and graph fragments that caused the violation. Typical examples of such information include the starting point (root node), a path from the root, and specific values that caused the problem.

Motivation: UC3

4.1.14 R11.5: Profiles

The language should include a notion of profiles, so that certain applications with limited features can only use certain elements of the overall language.

Motivation: UC11, UC19 and UC32

4.1.15 R11.7: Separation of structural from complex constraints

There shall be a core language or SHACL profile that excludes any support for constraints defined via embedded SPARQL queries or other complex lower-level expressions. This is so that lightweight applications can validate constraints without requiring a SPARQL processor or similar subsystem.

Motivation: UC11, UC19 and UC32

4.2 Property Constraint Requirements

4.2.1 R5.2: Property Min/Max Cardinality

The stated values for a property may be limited by minimum/maximum cardinality, with typical patterns being [0..1], [1..1], [0..*] and [1..*].

Motivation: UC10, UC11, UC13, UC19, UC20, UC39, and UC42

4.2.2 R5.3: Property Datatype

The values of a property may be limited to be an RDF Literal with a stated datatype, such as xsd:string or xsd:date.

Motivation: UC10, UC11, UC13, UC19, and UC20

4.2.3 R5.4: Property Type

The values of a property may be limited by their type, e.g., all children have to be of type person.

Motivation: UC10, UC11, UC13, UC19, and UC20

4.2.4 R5.5: Property's RDF Node Type (e.g. only IRIs are allowed)

The values of a property on instances of a class may be limited by their RDF node type, e.g. IRI, BlankNode, Literal, or BlankNodeOrIRI (for completeness we may want to support all 7 combinations including Node as parent).

Motivation: UC8

4.2.5 R5.9.1: Datatype Property Facets: min/max values

Similar to xsd:minInclusive/maxExclusive.

Motivation: UC3, UC11, UC12, UC13, UC19, UC20, and UC29

4.2.6 R5.9.2: Datatype Property Facets: regular expression patterns

Pattern matching against regular expressions (xsd:pattern).

Motivation: UC3, UC11, UC12, UC13, UC19, UC20, and UC29

4.2.7 R5.9.3: Datatype Property Facets: string length

Constraining the length of a string.

Motivation: UC3, UC11, UC12, UC13, UC19, UC20, and UC29

4.2.8 R5.10: Property Value Enumerations

Shapes will provide exhaustive enumerations of the valid values (literals and IRIs).

Motivation: UC3, UC11, and UC37

4.2.9 R5.11: Properties Used in Inverse Direction

Shapes can have constraints where the tested node is the object of a triple.

Motivation: UC36

4.2.10 R14.1: Property Default Value

It should be possible to provide a default value for a given property, e.g. so that input forms can be pre-populated. This requirement is not about using default values as "inferred" triples at run-time.

Motivation: UC11, UC19, and UC20

4.3 Value Constraint Requirements

4.3.1 R6.3: Expressivity: String Operations

Some constraints require building new strings out of other strings, and building new URIs out of other values.

Motivation: UC5 and UC23

4.3.2 R6.4: Expressivity: Language Tags

Some constraints require comparing language tags of RDF literals, e.g. to check that no language is used more than once per property, and also to produce multi-lingual error messages.

Motivation: UC21, UC42

4.3.3 R6.5: Expressivity: Mathematical Operations

Some constraints require mathematical calculations and comparisons, e.g. area = width * height.

Motivation: UC5

4.3.4 R6.6: Expressivity: Literal Value Comparison

Some constraints require operators such as <, >=, !=, etc., either against constants or other values that are dynamically retrieved at query time. This includes date/time comparison and functions such as NOW().

Motivation: UC5, UC21, UC22, UC23, and UC27

4.4 Complex Constraint Requirements

4.4.1 R6: Complex Constraint Requirements

The language should allow users to implement constraints that check complex conditions, with an expressivity as covered by the following sub-requirements (e.g. basic graph patterns, string and mathematical operations and comparison of multiple values).

Motivation: UC5, UC21, UC22, UC23, UC26, UC27, and UC30

4.4.2 R6.2: Expressivity: Non-Existence of Patterns

Many constraints require that a certain pattern does not exist in the graph.

Motivation: UC1, UC2, UC22, and UC23

4.4.3 R6.7: Expressivity: Logical Operators

The language should make it possible to express the basic logical operators intersection, union and negation of conditions.

Motivation: UC5, UC26, and UC35

4.4.4 R6.8: Expressivity: Transitive Traversal of Properties

Some constraints need to be able to traverse a property transitively, such as parent-child or partOf relationships.

Motivation: UC16, UC23, and UC26

4.4.5 R6.12: Expressivity: Checking for well-formed rdf:Lists

There shall be a concise construct for expressing that a list must be well-formed.

Motivation: UC42

4.4.6 R6.13: Expressivity: Placing constraints on the values of rdf:Lists

There shall be a way of applying the constraints that we can express for normal properties (require a certain rdf:type, require a certain shape, require a certain datatype, require a certain node kind, etc.) to the members of rdf:Lists.

Motivation: UC42

4.5 Shape Constraint Requirements

4.5.1 R8: Specialization of Shapes

It should be possible to specialize/extend shapes so that the constraints defined for a more general (super) shape also apply to the specialized (sub) shape. Sub-shapes can only narrow down, i.e. further constrain.

Motivation: UC2, UC5, UC10, UC11, UC19, UC20, UC24, UC25, UC27, UC28, and UC29

4.5.2 R9: Global Constraints

It should be possible to specify constraint conditions that need to be checked "globally" for a whole graph, without referring to a specific set of resources or class. In programming languages such global entities are often called "static", but "global" is probably better known.

Motivation: UC35

4.5.3 R11.8: Evaluating Constraints for a Single Node Only

It should be possible to validate constraints on a single node in a graph. This may be impossible to implement 100% correctly, because sometimes a change to a resource invalidates conditions in a very different place in the graph. However, the language could propose a framework that identifies those constraints that SHOULD be checked when a given node is evaluated, e.g. by following its rdf:type and the superclasses of that. This would include validating shacl:valueShape but not shacl:valueType.

Motivation: (Orthogonal to basically all use cases)

4.5.4 R12.1: Select Whole Graph

It should be possible to select all the RDF nodes in a graph for validation. This is similar to the Global Constraints (R9) requirement.

Motivation: UC35

4.5.5 R12.2: Selection by Type

It should be possible to have some mechanism to select the nodes that are instances of some class for validation.

Motivation: (Orthogonal to basically all stories)

4.5.6 R12.3: Selection by Single Node

It should be possible to select a single RDF node for validation.

Motivation: (Orthogonal to basically all stories)

4.5.7 R5.1: Association of Class with Shape

There must be an "easy" way of associating a shape with a class, meaning that nodes in a graph that are instances of that class must conform to that shape.

Motivation: UC3, UC10, UC11, UC12, UC13, UC15, UC19, UC20, UC29, and UC36

4.5.8 R14.2: Property Labels at Shape

It should be possible to provide human-readable labels of a property in the context of a shape, intended for human consumption such as documentation or UI, not just globally for the rdf:Property. Multiple languages should be supported.

Motivation: UC11 and UC19

4.5.9 R14.3: Property Comment in a Shape

It should be possible to provide human-readable descriptions of the role of a property in the context of a shape, not just globally using triples that have the rdf:Property as subject. Multiple languages should be supported.

Motivation: UC11 and UC19

A. Acknowledgements

We would like to acknowledge the contributions of user story authors: Dean Allemang, Anamitra Bhattacharyya, Karen Coyle, Nick Crossley, Michel Dumontier, Jose Emilio Labra Gayo, Sandro Hawke, Dimitris Kontokostas, Holger Knublauch, David Martin, Dave McComb, Peter F. Patel-Schneider, Axel Polleres, Eric Prud'hommeaux, Arthur Ryman, Steve Speicher, and Simon Steyskal.
