Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
To foster the development of the Shapes Constraint Language (SHACL), this document includes a set of use cases and requirements that motivate a simple language and semantics for formulating structural constraints on RDF graphs. All use cases provide realistic examples describing how people may use structural constraints to validate RDF instance data. Note that this document avoids the use of any specific vocabulary that might be introduced by the SHACL specification.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the RDF Data Shapes Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-rdf-shapes@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 September 2015 W3C Process Document.
One motivation for SHACL is Application Integration, where different software components, potentially maintained by different organizations, need to function together smoothly. As an everyday example, imagine an international company with a dozen divisions, each providing a feed of their Human Resources data to authorized users. Different divisions might use different software to produce their feeds, and there might be many distinct applications which consume the data, ranging from an employee phone book to a hiring-compliance monitoring system.
While systems like this are built and maintained around the world today, their complexity often becomes a problem. Not only are the systems expensive and sometimes unpleasant to maintain, but changing data fields and adding new applications can grow to be practically impossible. An "RDF Data Shapes" standard would help manage the complexity, greatly reducing the cost and hassle, by separating components while still allowing them to work together.
Specifically, in this example, SHACL would allow:
In all cases, the semantics of the data are determined by RDF and the vocabularies specified by the shape, so if the shapes match, the systems can reasonably be expected to interoperate correctly.
While SHACL is expected to have immediate everyday utility, as illustrated above, it has even wider potential applicability, ranging in scale. At the large end, SHACL might be used by loosely-knit communities, where data is provided by organizations which are not under any central authority, such as charities and researchers around the world concerned with quality-of-life measures. At the small end, SHACL might be used within a mobile application environment to provide interoperability among independent sensor modules and tools for analyzing and acting on sensor results. The common thread is that SHACL allows a loose coupling, where independently maintained elements of an overall system can reliably and comfortably interoperate.
This document is organized as follows:
There is a general need to validate that the instance data matches the models that have been defined in RDFS or OWL. The primary validation requirement is to ensure that the appropriate information is given for each property (or class) in the model. As examples, one could require that each property must have a domain, or that classes must be explicitly stated in the instance data. Input to this case is the RDF representation of an RDFS (or OWL) ontology.
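For illustration, the first of these checks could be phrased as a SPARQL ASK query (SPARQL is used here purely as a sketch, not as the assumed constraint mechanism):

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Returns true if some declared property lacks an rdfs:domain.
ASK {
  ?p a rdf:Property .
  FILTER NOT EXISTS { ?p rdfs:domain ?d }
}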
Summary: Requires the ability to check whether certain information is given/available for a property or class.
Related Requirements: R6.2
For a tool that will build a list of personal names for named entity resolution to work correctly, every person must have one or more names specified, each of which is a string. Constraints can be used to verify that a particular set of data has at least one such name for each person.
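A minimal sketch of this check, assuming the FOAF vocabulary for persons and names:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
# Selects persons that have no name whose value is an xsd:string literal.
SELECT ?person
WHERE {
  ?person a foaf:Person .
  FILTER NOT EXISTS {
    ?person foaf:name ?name .
    FILTER (datatype(?name) = xsd:string)
  }
}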
Summary: Requires the ability to check the cardinality of properties as well as the types of their values.
There is a range of responses that any application may wish to act on, or that it may want to echo back to the user as a result of a validation process. There are the obvious results of "keep/reject" but often there will be a range of error or alert responses. There needs to be a way to associate an error level or code with the output of validation. Some applications will have a number of responses that inform users of ways they could improve their data, while still accepting all but the truly unusable data. Other applications could analyze data using a nuanced grading system.
Summary: Requires the ability to return more fine-grained validation results, not just "pass/fail."
Related Requirements: R5.1, R5.9.1, R5.9.2, R5.9.3, R5.10, R10, R10.1, R10.2, and R10.3
The same shape can have different values and different requirements at different points in a process or workflow. Any node in the graph may serve multiple roles, that is, the same node may include properties for a SubmittingUser and for an AssignedEmployee, and these will be relevant at different points in the process. As an example, an LDP Container (e.g. PendingIssues) accepts an IssueShape with a status of "assigned" or "unassigned". The LDP Container is an interface to a service storing data in a conventional relational database. Later, the issue gets resolved and is available as OldIssues without acquiring new type arcs. The constraints for PendingIssues are different from those for issues in OldIssues, even though the instance data occupies a single graph.
Summary: Requires the ability to specify which RDF nodes should be validated against specific Shapes, e.g. by using filtering and/or scoping mechanisms.
Data applications may have a number of complex constraints that must interoperate. For example, there can be a wide variety of access rules defining privileges for viewing and updating data. These can be applied to accounts or to applications and functions. Incoming data, which itself can be complex, can be subjected to a large number of validation actions, some of which are dependent on output from prior application steps.
The design of validation must make these complex constraints appropriately efficient to apply, while also fostering a manageable maintenance environment for the validation technology.
Summary: Requires the constraint language to be designed in a way that it can be used efficiently in production environments dealing with numerous complex constraint definitions.
Related Requirements: R6, R6.3, R6.5, R6.6, R6.7, R7.2, and R8
It is often necessary or desirable to check whether certain property values of RDF nodes are of a specific node type (IRI, BlankNode or Literal and all combinations thereof). One example is the need to state that a given property shall only have IRIs but no blank nodes as its value.
There are examples of this functionality in the VoID vocabulary (void:dataDump and void:exampleResource), and in SPARQL (isIRI, isBlank, isLiteral).
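For example, the requirement that a property take only IRI values (no blank nodes or literals) could be sketched as follows, with ex:property as a placeholder:

PREFIX ex: <http://example.org/ns#>
# Returns true if ex:property has a value that is not an IRI.
ASK {
  ?s ex:property ?o .
  FILTER (!isIRI(?o))
}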
Summary: Requires the possibility to specify the expected node type of a property, i.e. check whether it is an IRI, a literal, a blank node, or some combination of those.
Related Requirements: R5.5
An ontology may state that instances of a class have a value for a property. Subclasses may be associated with a constraint that requires that there is a provided value for the property. For example, in the OMG time ontology adopted by FIBO every contract has to have an end date. A shape (set of constraints) may require that bonds (a subclass of contracts) have specified end dates without requiring that all contracts have specified end dates.
Summary: Requires the possibility to inherit and extend Shapes of superclasses.
Related Requirements: R8
There is a class in FIBO called IncorporatedCompany, which is a subclass of a number of restrictions. Many of them are of the form:

fibo-be-oac-cpty:hasControllingInterestParty min 0 fibo-be-oac-cctl:VotingShareholder

Note the min 0: under OWL semantics such a restriction is trivially satisfied, but it is still intended to carry information for forms and for validation. For example: when building a form to describe an IncorporatedCompany, there should be a field in that form for hasControllingInterestParty. The field should be pre-populated (e.g., with a drop-down) with known VotingShareholders. We won't draw any inferences about the things here (as we would have done if we had said min=1 or more). And when validating data about an IncorporatedCompany that has values for hasControllingInterestParty, at least one of them should be known to be a VotingShareholder.
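The validation half of this story could be sketched as the following query; the namespace IRIs are placeholders, since only the prefixed names are given above:

# Namespace IRIs below are placeholders for the actual FIBO namespaces.
PREFIX fibo-be-oac-cpty: <http://example.org/fibo-be-oac-cpty#>
PREFIX fibo-be-oac-cctl: <http://example.org/fibo-be-oac-cctl#>
PREFIX fibo:             <http://example.org/fibo#>
# Companies that have controlling-interest parties, none of which is a known VotingShareholder.
SELECT DISTINCT ?company
WHERE {
  ?company a fibo:IncorporatedCompany ;
           fibo-be-oac-cpty:hasControllingInterestParty ?party .
  FILTER NOT EXISTS {
    ?company fibo-be-oac-cpty:hasControllingInterestParty ?vs .
    ?vs a fibo-be-oac-cctl:VotingShareholder .
  }
}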
Summary: Requires the possibility to select focus nodes based on specific conditions. Requires the possibility to specify default values.
There is a need to have constraints that provide model-driven validation of permissible values in user interfaces. The major requirement here is a declarative model of:
It must be possible to perform validation of this type on instance data without being required to make use of a specific mechanism, such as SPARQL queries. Instead, the model should be of a sufficiently high level that it is not dependent on a single tool or method. However, at the same time there are many advanced constraints that need to be validated (either on server or client) before a form can be submitted. These constraints are not necessarily "structural" information, but rather executable code that returns error messages.
Summary: Requires the ability to declare and constrain permitted values for properties, as well as their cardinalities, in an abstract and "high-level" fashion.
Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R5.10, R8, R14.1, R14.2, and R14.3
There is one application (e.g. Cimba) which stores its application state in RDF. It currently queries and modifies that state using HTTP GET and PUT operations on RDF sources, whereas another version that is currently under development uses SPARQL to query and modify the data. The question is, how do we communicate the shape of the data this application reads and writes to other developers who want to make compatible applications? We want to say: as long as your data is of this form, Cimba will read it properly. We also want to say: Cimba may write data of any of these forms, so to be interoperable your application will need to read and correctly process all of them.
Summary: Requires the possibility to make shape definitions exchangeable and independently accessible from the data graph.
Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R11.5, and R11.7
Data gathering functions, especially those that are consortial or rely on aggregation of data from multiple sources, need to be able to easily create templates to represent metadata. Ease of templating is particularly important in rapidly changing fields, such as medicine. For this reason, it is crucial that a language be developed that can allow easy templating of metadata and constraints. The templates must allow users to define different sets of metadata elements and their requirements. Templates should be modular and re-usable.
These templates will contain metadata elements that are either required or optional, and that restrict the value of the field to specific datatypes (e.g. string, integer, decimal, date). Values may be restricted by length or to a regular expression pattern; they may be limited to specific categorical values or to terminology trees/class expressions of a target ontology.
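As an illustration, the restrictions on a single hypothetical template element ex:collectionDate (required, typed as xsd:date, matching a pattern) might be checked like this:

PREFIX ex:  <http://example.org/ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Records missing ex:collectionDate, or whose value is not an xsd:date of the form YYYY-MM-DD.
SELECT ?record
WHERE {
  ?record a ex:Record .
  OPTIONAL { ?record ex:collectionDate ?date }
  FILTER ( !bound(?date)
        || datatype(?date) != xsd:date
        || !regex(str(?date), "^[0-9]{4}-[0-9]{2}-[0-9]{2}") )
}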
Ideally, the shapes language should be readable by computers in order to automatically generate template forms with restriction to specified values. Moreover, libraries and tools to construct and validate templates and their instance data should be readily available.
Summary: Requires the possibility to define shapes for a specific node in a modular manner.
Requires the possibility to define custom constraint templates.
Related Requirements: R5.1, R5.2, R5.3, R5.4, R7, R7.1, R7.2, R7.3, and R7.4
In data integration activities, tools such as Silk or Limes may be used to discover entity co-references. Entity co-references are pairs of different identifiers, often in different datasets, that refer to the same entity. Detected co-references are often recorded as owl:sameAs triples. This may be a step in an object reconciliation pipeline.
It would be nice if shapes could flexibly state conditions by which to check that the identity of objects has been correctly recorded; that is, check conditions under which a same-as link should be present between two identifiers, or conversely, check conditions for misidentified same-as links. For example:

If source1.movie.title is highly similar (by some widely adopted string similarity function, perhaps plugged in through an extension interface) to source2.film.title and source1.movie.release-date.year is identical to source2.film.initial-release, then an owl:sameAs triple should be present.

If source1.movie.title is identical to source2.film.title and source1.movie.release-date.year is within two years of source2.film.initial-release, then an owl:sameAs triple should be present.

If source1.movie.directors has the same set of values as source2.film.directed-by and source1.movie.title is highly similar to source2.film.title, then an owl:sameAs triple should be present.
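A simplified sketch of the quality-assurance direction, using hypothetical vocabularies for the two sources and a plain year comparison in place of a string-similarity function:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX m1:  <http://example.org/source1/movie#>
PREFIX m2:  <http://example.org/source2/film#>
# Reports owl:sameAs links whose release years (assumed here to be integer values)
# differ by more than two years, which the rules above would treat as suspicious.
SELECT ?a ?b
WHERE {
  ?a owl:sameAs ?b .
  ?a m1:release-year ?y1 .
  ?b m2:initial-release ?y2 .
  FILTER (abs(?y1 - ?y2) > 2)
}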
The intent here is not that the validation process should produce the expected owl:sameAs triples. We assume that some other tool or process has already produced these triples. The purpose of these validation rules is to perform quality assurance, or sanity checks, on the output of these other tools or processes. Thus, the quality or completeness of the generated linkset could be assessed.
We note however that object reconciliation tools could be driven by constraints like those given above. So potentially, an object reconciliation tool and a validator could use the same input constraints. Thus, this story straddles the boundaries between constraint checking and inference.
Summary: Requires the possibility to appropriately apply filtering and scoping mechanisms to select focus nodes for validating constraints.
Vocabulary and data re-use are desirable features of an RDF application. Metadata for a community or function may be expressed as levels of description that re-use existing vocabularies in a way that is appropriate to different contexts. For some data it may be possible to define a subset that satisfies a minimum description. In other cases, data may be re-used in a variety of configurations. Each of these contexts can have different validation constraints.
For example, in a data environment that has a 3-component model for summary, versioning, and distribution-level descriptions, each component has access to a specific set of metadata elements, and these are specified as MUST, SHOULD, MAY, and MUST NOT. As such there are different conformance criteria for each level. Metadata values are either unconstrained rdfs:Literals, constrained rdfs:Literals, URIs with a specified URI pattern, instances of a specified URI-identified type, or a disjunction of URI-specified types.
Summary: Requires the functionality to restrict application of constraints to certain contexts.
Related Requirements: R5.1
A use case we were facing recently revolved around the integration of distributed configurations (i.e. object-oriented models) with RDFS and SPARQL. In this particular use case we had to assume both the Unique Name Assumption (UNA) and the Closed World Assumption (CWA) for our ontologies, since the models (i.e. configurations) from which those ontologies were derived were generated by product configurators that impose both UNA and CWA. Since neither RDFS nor OWL imposes UNA/CWA, we had to come up with some workarounds, which were basically:
SPARQL was used to perform query tasks on the global schema as well as to check simple integrity constraints by translating e.g. cardinality restrictions into ASK queries.
One major problem that arose from our workaround to impose UNA was that SPARQL is unaware of the special semantics of owl:sameAs. This means that, especially when using counting aggregates, one usually wants to count the number of real-world objects and not the number of URIs referring to them. As an example, we defined two SPARQL queries that should count the number of subnets of a certain system:
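The original pair of queries is not reproduced here; the following illustrative pair (with hypothetical ex: names) shows the difference between counting URIs and counting real-world objects by collapsing each owl:sameAs cluster to a single representative:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex:  <http://example.org/ns#>

# Counts URIs: co-referring subnet identifiers are counted more than once.
SELECT (COUNT(DISTINCT ?subnet) AS ?n)
WHERE { ex:system1 ex:hasSubnet ?subnet }

# Counts objects: each owl:sameAs cluster is collapsed to one canonical representative.
SELECT (COUNT(DISTINCT ?rep) AS ?n)
WHERE {
  ex:system1 ex:hasSubnet ?subnet .
  { SELECT ?subnet (MIN(STR(?same)) AS ?rep)
    WHERE { ?subnet (owl:sameAs|^owl:sameAs)* ?same }
    GROUP BY ?subnet }
}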
Summary: Requires the possibility to encapsulate verbose constraint definitions into constraint templates, thus allowing their reuse in other shapes as well as increasing the readability of shape definitions.
The medical community has an interest in the notion of "archetypes" that are expressed as abstract constraints on a reference model. The reference model describes the largest set of possible instances of a given collection of data and the archetypes then constrain this set of instances by restricting cardinality, types, value ranges, etc. One way to implement archetype models would be through RDF and SHACL, where the reference model would be viewed as the "constraints" -- the set of constraints that are used to validate incoming data and to document dataset validity.
The archetypes, however, would serve the additional purpose of defining "instance subsets". The archetypes identify filters/queries that would allow a user to return a set of shapes that meet certain criteria such as abnormal values, co-occurrence, etc. They could also act as filters, funneling incoming instances to secondary processes where necessary.
It should be noted that the primary representation for archetypes in the medical community will probably not be SHACL -- they will be using Archetype Definition Language (ADL) (or the UML equivalent, AML) and/or profiles, with SHACL being a translation.
Summary: Defines a use case, where shape definitions could be used to partition a data set (i.e. one could query for individuals that are compliant to a specific shape).
Various tools are contributing data to a triple store. A Query Builder wants to know the permitted or likely shapes of the data over which the generated queries must run, so that the end user can be presented with a nice interface prompting for likely predicates and values. Since the data is dynamic, this is not necessarily the same as the shape that could be reverse engineered from the existing data. The Query Builder and the data-producing tools are not provided by the same team - the Query Builder team has very limited control over the data being produced. The source of the data might not provide the necessary shape information, so we need a way for the Query Builder team (or a third party) to be able to provide the shape data independently. See also Ontology-Driven Forms.
Summary: Requires the possibility to provide shape definitions independently of instance data.
Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3, R8, R11.5, R11.7, R14.1, R14.2, and R14.3
A client creating a new resource by posting to a Linked Data Platform Container wants to know the acceptable properties and their values, including which ones are mandatory and which optional. Note that this creation shape is not necessarily the same as the shape of the resource post-creation - the server may transform some values, add new properties, etc.
Summary: Requires the ability to decide which shape definitions should be valid/triggered for a certain node (in case those shape definitions are mutually exclusive).
Related Requirements: R5.1, R5.2, R5.3, R5.4, R5.9.1, R5.9.2, R5.9.3 R8, and R14.1
The well-known SKOS vocabulary defines constraints that are outside of the expressivity of current ontology languages, such as the requirements that skos:prefLabel, skos:altLabel and skos:hiddenLabel be pairwise disjoint properties, and that a concept have no more than one value of skos:prefLabel per language tag.
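The second of these, for instance, could be checked with a query along the following lines (illustrative only; SPARQL is not assumed to be the constraint mechanism):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Concepts that carry more than one skos:prefLabel with the same language tag.
SELECT DISTINCT ?concept
WHERE {
  ?concept skos:prefLabel ?l1, ?l2 .
  FILTER (?l1 != ?l2 && lang(?l1) = lang(?l2))
}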
Summary: Requires the possibility to define complex constraints similar to those defined in the SKOS vocabulary.
The Data Cube Vocabulary provides a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. While the bulk of the vocabulary is defined as an RDF Schema, it also includes integrity constraints.
Each integrity constraint is expressed as narrative prose and, where possible, a SPARQL ASK query or query template. If the ASK query is applied to an RDF graph then it will return true if that graph contains one or more Data Cube instances which violate the corresponding constraint.
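For example, the constraint that every qb:Observation has exactly one qb:dataSet could be rendered (paraphrasing the style of those queries, not quoting the specification) as:

PREFIX qb: <http://purl.org/linked-data/cube#>
# Returns true (a violation) if some observation has no dataset, or more than one.
ASK {
  {
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?ds }
  } UNION {
    ?obs a qb:Observation ;
         qb:dataSet ?ds1, ?ds2 .
    FILTER (?ds1 != ?ds2)
  }
}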
Using SPARQL queries to express the integrity constraints does not imply that integrity checking must be performed this way. Implementations are free to use alternative query formulations or alternative implementation techniques to perform equivalent checks.
Summary: Requires support of RDF Data Cube integrity constraints
Developers at Google have created a validation tool for the well-known schema.org vocabulary for use in Google Search, Google Now and Gmail. They have discovered that what may seem like a potentially infinite number of possible constraints can be represented quite succinctly using existing standards and serialized as RDF. Some examples of schema.org constraints are:
schema:Person: children cannot contain cycles, children must be born after the parent, deathDate must be after birthDate.
schema:GeoCoordinates: longitude must be between -180 and 180, latitude between -90 and 90.
various: email addresses must match a certain regular expression.
schema:priceCurrency, currenciesAccepted: the currency code must be from a given controlled vocabulary.
schema:children, colleagues, follows, knows, parents, relatedTo, siblings, spouse, subEvents, superEvents: irreflexivity.
Summary: Requires the possibility to represent schema.org constraints.
Consider a situation in which there is a need to integrate similar information from multiple applications and that the application owners have agreed on an RDF representation for this information. However, because the applications have some differences, the application owners can only agree on those data items that are common to all applications. The defined RDF representation will include the common data items, and will allow the presence of other undefined data items in order to accommodate differences among the applications. In this situation, the RDF representation is said to have an open content model.
Since the shape of a resource may depend on the tool that hosts it, or the project that hosts it within a tool, but the RDF type of the resource may not depend on the tool or project, there is in general no way to navigate to the shape given only its RDF type. The OSLC Resource Shapes specification provides two mechanisms for navigating to the appropriate shape. First, the RDF property oslc:resourceShape where oslc: is <http://open-services.net/ns/core#> may be used to link a tool or project description to a shape resource. Second, the RDF property oslc:instanceShape may be used to link a resource to its shape.
See Open Content Model Example for a detailed example.
Summary: Requires the possibility to address a resource graph based on criteria unrelated to its rdf:type. This can be a general context, or a specific application function.
Related Requirements: R8
It is very common to have a single property that uniquely identifies instances of a given class. For example, when you import legacy data from a spreadsheet, it should be possible to automatically produce URIs based on a given primary key column. The proposed solution here is to define a standard vocabulary to represent the primary key and a suitable URI pattern. This information can then be used both for constraint checking of existing instances, and to construct new (valid) instances. One requirement here is advanced string processing, including the ability to turn a partial URI and a literal value into a new URI.
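A sketch of the kind of string processing involved, using a hypothetical ex:employeeId primary key and URI pattern:

PREFIX ex: <http://example.org/ns#>
# Constructs a URI for each resource from a fixed namespace and its primary key value.
SELECT ?row ?newIri
WHERE {
  ?row ex:employeeId ?id .
  BIND (IRI(CONCAT("http://example.org/employee/", ENCODE_FOR_URI(STR(?id)))) AS ?newIri)
}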
Details: Primary Keys with URI Pattern
Summary: Requires the ability to create IRIs from non-IRI identifiers.
Libraries have a number of resources that are issued in ordered series. Any library may own or have access to some parts of the series, either sequential or with broken sequences. The list may be very long, and it is often necessary to display the list of items in order. The order can be nicely numerical, or not. Another ordered list use case is that of authors on academic journal articles. For reasons of attribution (and promotion!), the order of authors in article publishing can be significant. This is not a computable order (e.g. alphabetical by name). There are probably other cases, but essentially there will definitely be a need to have ordered lists for some data.
Validation could be:
Details: rdf:List Stresstest
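Checking that every member of such a list has a required characteristic can be sketched with a property path over rdf:rest and rdf:first; ex:authorList and foaf:Person below are illustrative names:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/ns#>
# Returns true if some member of an article's author list is not a foaf:Person.
ASK {
  ?article ex:authorList ?list .
  ?list rdf:rest*/rdf:first ?member .
  FILTER NOT EXISTS { ?member a foaf:Person }
}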
Summary: Requires the possibility to check whether all members of a list have certain characteristics.
Cultural heritage (CH) data is generally created in a distributed way, so when data is gathered together in a single aggregation, quite a bit of checking must be done. One of the key aspects of CH data is the identification of persons and subjects, in particular relating them to historical contexts. For persons, a key context is their own birth and death dates; for events, there is often a date range representing a beginning and end of the event. In addition, there are cultural heritage objects that exist over a span of time (serial publications, for example). In each of these cases, it is desirable to validate the relationship of the values of properties that have temporal or other ordered characteristics.
Details: Relationships between values of different properties
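A minimal sketch of the birth/death comparison, assuming hypothetical ex:birthDate and ex:deathDate properties with date values the processor can compare:

PREFIX ex: <http://example.org/ns#>
# Persons whose recorded death date precedes their birth date.
SELECT ?person ?birth ?death
WHERE {
  ?person ex:birthDate ?birth ;
          ex:deathDate ?death .
  FILTER (?death < ?birth)
}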
Summary: Requires the ability to perform comparisons on the values in selected sets of properties, e.g. to compare the values of properties representing birth date and death date in order to validate that the birth date precedes the death date.
In Linked Data, related information is accessed by URI dereferencing. The information that is accessible this way may represent facts about a particular resource, but also typing information for the resource. The types can themselves be used in a similar way to find the ontology describing the resource. It should be possible to use these same mechanisms to find constraints on the information provided about the resource.
For example, the ontology could include constraints or could point to another document that includes constraints. Or the first document accessed might include constraints or point to another document that includes constraints.
DCMI story: For some properties there is a requirement that the value IRI resolve to a resource that is a skos:Concept. The resource value is not limited to a particular skos:Concept scheme.
Summary: The constraint language must be able to validate information received from dereferencing the value IRI, e.g. check whether the value is a member of a skos:ConceptScheme.
Hydra is a lightweight vocabulary to create hypermedia-driven Web APIs. By specifying a number of concepts commonly used in Web APIs it enables the creation of generic API clients. The Hydra core vocabulary can be used to define classes and "supported properties" which carry additional metadata such as whether the property is required and whether it is read-only.
Summary: The constraint language should support constraints commonly used in API clients.
The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web. One of these documents is a library of constraints which defines valid PROV instances. The actual validation process is quite complex and requires a rule-like normalization step. Various implementations of this validation process exist, including a set of SPARQL INSERT/SELECT queries sequenced by a Python script, as well as an implementation in Java and in Prolog. Stardog also defines an "archetype" for PROV, which seems to be implemented in SPARQL using their ICV engine.
Summary: Requires the possibility to express constraints as defined in PROV's library of constraints.
Related Requirements: R6
Some simple LDP server implementations may be based on lightweight app server technology and only deal with JSON(-LD) and Turtle representations for their LDP RDF Sources (LDP-RS) on top of an existing application, say Bugzilla. As a client implementer, I may have a simple JavaScript application that consumes and produces JSON-LD. I want to have a way to programmatically provide the end-user with a simple form to create new resources and also a way to potentially auto-prefill this form based on data from the current context.
LDP defines some behavior for when a POST to an ldp:Container fails, by outlining expected status codes and additional hints that can be found either in the response body of the HTTP POST request or in a response header (such as a Link relation of "http://www.w3.org/ns/ldp#constrainedBy"). A client can proactively request the headers (instead of trying the POST and having it fail) by performing an HTTP HEAD or OPTIONS request on the container URL and inspecting the link relation for "constrainedBy".
Typical constraints are:
The current gap is whatever is at the end of the "constrainedBy" link, which could be anything: HTML, OSLC Resource Shapes, SPIN. The LDP WG discussed the need for something a bit more formalized and deferred making any recommendation, looking instead to apply these requirements to the Data Shapes work. Once that work matures and meets the requirements, LDP could then recommend it.
Summary: This use case covers similar topics as discussed in UC11.
Related Requirements: no suitable requirements approved yet.
Assume there are clients consuming RDF resources and interfacing with an LDP container that need to work asynchronously (e.g. the client is a worker's mobile device and the work zone has no connectivity). The client needs to allow workers to create entries locally in the offline application to mark completion of different stages of the work. These entries will again be synced with the LDP container once the device has network connectivity. Prior to that, when the client is offline, the client software needs to perform a range of validations on the user's entries to reduce the probability of an invalid entry.
In addition to the basic data type/required/cardinality "stand alone" validations, the client needs to validate constraints between different properties:
Summary: Expresses the requirement to be able to define constraints over more than one property. E.g., value of property start_time must be less than value of property end_time.
Those interdependencies between properties of the same RDF node should be expressible in a higher level language.
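For instance, the start/end constraint mentioned in the summary could be checked as follows (property names taken from the summary, placed in a hypothetical ex: namespace; time values are assumed to be comparable xsd:dateTime literals):

PREFIX ex: <http://example.org/ns#>
# Entries whose start time is not earlier than their end time.
SELECT ?entry
WHERE {
  ?entry ex:start_time ?start ;
         ex:end_time   ?end .
  FILTER (?start >= ?end)
}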
Data frequently has structural errors. Consider a schema where a medical procedure should have no more than one outcome. Accidental double entry occurs when e.g. a clinician and her assistant both enter outcomes into the database. Statistical queries over malformed data such as this lead to misinterpretation and inaccurate conclusions. Shapes can be used to sequester well-formed data for simpler analysis.
_:Bob :hadIntervention [
    :performedProcedure [
        a bridg:PerformedProcedure ;
        :definedBy [ :coding term:MarrowTransplant ; :location terms:Manubrium ]
    ] ;
    :assessmentTest [
        a bridg:PerformedObservation ;
        :definedBy [ :coding term:TumorMarkerTest ; :evaluator <LabX> ] ;
        :result [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ],
                [ :coding term:ImprovedToNormal ; :assessedBy clinic:doctor7 ]
    ]
] .
The obvious SPARQL query on this will improperly weight this as two positive outcomes:
SELECT ?location ?result (COUNT(*) AS ?count)
WHERE {
  ?who :hadIntervention [
    :performedProcedure [ :definedBy [ :coding term:MarrowTransplant ; :location ?location ] ] ;
    :assessmentTest [
      :definedBy [ :coding term:TumorMarkerTest ] ;
      :result [ :coding ?result ]
    ]
  ]
}
GROUP BY ?result ?location
(In the original data the two recorded results were term:ImprovedToNormal and term2:ClinicalCure, but the effect is the same as described here.)
Being able to select subsets of data related to an RDF node, and thus to define a well-formed/cleansed representation of that node (expressed as a shape), improves the quality of the data as well as its queryability.
Summary: Requires the ability to perform structural validation over RDF data.
Related Requirements: R7.4
A publisher has a very large RDF database (millions or billions of triples) and wants to define multiple shapes for the data that will be checked at regular intervals. To make this process effective, the validation must be able to run within a reasonable time span, and the validation engine must be flexible enough to provide different levels of detail in the violation results. The levels can range from the specific nodes that violate a shape facet, to the success or failure of a shape facet, to aggregated violations per shape facet, possibly along with an error prevalence.
Applying a shape in a large database can return thousands or millions of violations, and it is not efficient to look at all erroneous RDF nodes one by one. In addition, often all violations for a specific facet can be attributed to a specific mapping or source code function. An expected workflow in this case is that the maintainer runs a validation asking for aggregated violations per shape facet along with a small sample (e.g. 10) of specific nodes. Having the higher-level overview along with the sample data, the maintainer can choose the order in which she will address the errors.
Summary: Basically a repetition of UC3 with additional requirements regarding the validation performance.
Related Requirements: R10
This use case reflects how information resources are created (e.g. via HTTP POST) or modified (e.g. via HTTP PUT). In these situations, the body of the HTTP request has an RDF content type (RDF/XML, Turtle, JSON-LD, etc.). The server typically needs to verify that the body of the request satisfies some application-specific constraints. Many proposed solutions have an implicit assumption that all RDF graphs have a distinguished root node which is the subject of triples that define either literal properties or links to other subjects, which may in turn have literal properties or links to further subjects. The implication is that all the nodes of interest are connected to the root node. However, an RDF graph may not be connected to other graphs acted on by the same application, and in fact disconnected RDF graphs do appear in real-world Linked Data specifications. The RDF representation of an information resource may be a disconnected graph in the sense that the set of nodes in the graph may be partitioned into two disjoint subsets A and B such that there is no undirected path that starts in A and ends in B.
The example can be taken from a specification related to access control. A conformant access control service must host an access control list resource that supports HTTP GET requests. The response to an HTTP GET request must have a response body whose content type is application/ld+json, i.e. JSON-LD. An example is given below. In this example, there is a distinguished root node, i.e. the node of type acc:AccessContextList, but it is not connected to the other nodes of interest, i.e. the nodes of type acc:AccessContext.
{ "@context": { "acc": "http://open-services.net/ns/core/acc#", "id": "@id", "type": "@type", "title": "http://purl.org/dc/terms/title", "description": "http://purl.org/dc/terms/description" }, "@graph": [{ "id": "https://a.example.com/acclist", "type": "acc:AccessContextList" }, { "id": "https://a.example.com/acclist#alpha", "type": "acc:AccessContext", "title": "Alpha", "description": "Resources for Alpha project" }, { "id": "https://a.example.com/acclist#beta", "type": "acc:AccessContext", "title": "Beta", "description": "Resources for Beta project" }] }
Summary: States the requirement that constraints over RDF graphs must be expressible for both connected and disconnected graphs.
In some cases the best RDF representation of a property-value pair may reuse a pre-existing property in which the described resource is the object and the property value is the subject. The reuse of properties is a best practice for enabling data interoperability. The fact that a pre-existing property might have the opposite direction should not be used as a justification for the creation of a new inverse property. In fact, the existence of both inverse and direct properties makes writing efficient queries more difficult since both the inverse and the direct property must be included in the query.
For example, suppose we are describing test cases and want to express the relations between test cases and the requirements that they validate. Further suppose that there is a pre-existing vocabulary for requirements that defines the property ex:isValidatedBy, which asserts that the subject is validated by the object. In this case there is no need to define the inverse property ex:validates. Instead the representation of test case resources should use ex:isValidatedBy with the test case as the object and the requirement as the subject.
This situation cannot be described by the current OSLC Shapes specification because OSLC Shapes describe properties of a given subject node, so inverse properties cannot be used. The OSLC Shape submission however proposes a possible solution. See http://www.w3.org/Submission/shapes/#inverse-properties.
Summary: For the sake of simplicity, a potential constraint language shall allow the usage of properties in their inverse direction where applicable.
The cultural heritage community has a large number of lists that control values for particular properties. These are similar to the DCMI types, but some are quite extensive (>200 types of roles for agents in relation to resources). There is also the concept of "authorities" which control the identities of people, places, subjects, organizations, and even resources themselves. Many of these lists are centralized in major agencies (Library of Congress, Getty Art & Architecture Archive, National Library of Medicine, and national libraries throughout the world). Not all have been defined in RDF or RDF/SKOS, but those that have can be identified by their IRI domain name and pattern. Validation tools need to restrict or check usage according to the rules of the agency creating and sharing the data. Some patterns of needed validation are:
Summary: Requires the possibility to constrain property values using Shapes.
A small company specialized in the development of LDP needs to describe the model of the RDF graphs that will be generated from Excel spreadsheets and will also be published as SPARQL endpoints. The LDP could contain observations which are usually instances of type qb:Observation, but may contain different properties. The content of those portals is generally statistical data which is derived from Excel spreadsheets and can easily be mapped to RDF Data Cube observations.
Examples of constraints are:
In this context, the company is looking for a solution that can be easily understood by a team of developers who are familiar with OO programming languages, relational databases, XML technologies and some basic RDF knowledge, but who are not familiar with other semantic web technologies like SPARQL, OWL, etc. The solution must be machine processable, so the contents of the LDP can be automatically validated and reused, both internally and by third parties.
Finally, the company would like to compare the schemas employed by the different LDP so they can evaluate the differences between RDF nodes that appear in those portals and even be able to create new applications on top of the data aggregated by the portals.
Summary: Define RDF graphs to be generated from spreadsheet software and made available through an LDP.
Provide a comparison function for RDF graphs.
Related Requirements: TBD
Some clinical data require specific cardinality constraints, e.g.
Summary: Requires the ability to define arbitrary cardinality constraints.
Related Requirements: R5.2
IRIs as values in triples may be the subjects of triples that are inline or may need to be de-referenced to complete the graph. In some cases the URI must be de-referenced to perform validation; in other cases, de-referencing isn't needed or is considered too costly for a low-value property.
Summary: The constraint language must make it possible to indicate IRIs that must be de-referenced.
Related Requirements: TBD
Validation of schema.org instances must adhere to the definitions used in that vocabulary. A processor for our validation language should be able to accept a schema.org instance as well as the schema.org model, expressed in an RDF syntax, as inputs (perhaps as separate named graphs), and validate the instance against the model.
For example, given a triple such as { :thing schema:date "value" }, it should be possible to write a validation rule that depends on a "rangeIncludes" annotation on the schema:date property. As each named datatype is used many times throughout the model, it would also be good if the regular expression (or similar mechanism) for the datatype didn't have to be repeated for each property that uses the datatype, but could be referred to by reference, or by rule.
Summary: The constraint language should adhere to schema.org vocabulary practices to process schema.org data.
Related Requirements: TBD
In client-side application development and in integrating between RDF-based systems and JSON-based APIs, certain problems arise when mapping between the RDF data model and the JSON data model. In the unconstrained RDF data model, there are too many variations to map arbitrary RDF graphs cleanly to JSON. By selecting an RDF vocabulary that covers the desired JSON structure, and using Shapes to express constraints over the vocabulary, the mapping could be made sensible and predictable.
The requirements for this are:
Summary: Use Shapes to define JSON compatible RDF, in particular maxCardinality of "1", RDF lists function, and limit of one string literal per language tag.
As a client of a Linked Data application, I need to know the constraints on the data so I can update resources. The data is in an RDF format. I retrieve the data via HTTP GET, edit it, validate it, then modify the resource via HTTP PUT. I need to know how to validate the data before I send the HTTP PUT request.
For example, information about the constraints that the application enforces could be provided by linking the data to the shape via a triple in the data. If the data IRI is X and the shape IRI is Y then a link such as (X sh:hasShape Y) would work. Y could be a resource hosted anywhere on the web.
Summary: Linked Data users need to be able to access shape constraints together with the data so they can maintain the integrity of graphs that are updated.
Related Requirements: TBD
As an RDF software and data developer I need to define constraints for the data I generate with my software. It is important to see which constraints succeed or fail and to store the results in a database. When a previously successful test fails it is generally an indication of a software regression.
I am not interested in storing detailed violation instances, as most of the time I work with sample or mock data that are subject to change and are not directly comparable. What can instead be persistent are the actual constraints (shapes or shape facets), and I need a standardized way to store the status for each constraint as true/false or with additional metadata (e.g. error count or prevalence) for a specific validation.
Summary: There is a need to store test results related to constraints on shapes for the purposes of software testing.
Clinical information systems reuse general predicates for observations and relationships between observations. For example, a blood pressure is an observation with two constituent observations: systolic and diastolic. Likewise, an APGAR observation is a constellation of nine observations. Definition of these data elements requires repeated constraints on the same predicate, analogous to OWL qualified cardinality constraints.
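A sketch of such a qualified-cardinality check for the blood-pressure case, using hypothetical vocabulary names:

PREFIX ex: <http://example.org/ns#>
# Blood-pressure observations that do not have exactly one systolic constituent.
SELECT ?bp (COUNT(?c) AS ?systolicCount)
WHERE {
  ?bp a ex:BloodPressureObservation .
  OPTIONAL { ?bp ex:hasComponent ?c . ?c a ex:SystolicObservation }
}
GROUP BY ?bp
HAVING (COUNT(?c) != 1)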
Summary: There is a need for qualified cardinality constraints on shapes.
Related Requirements: TBD
This section lists the requirements arising from the use-cases catalogued in this document. Specific requirements that have been de-prioritized or rejected have been left in the document for completeness, but are shown as struck out.
Constraints/shapes shall be specifiable in a higher-level language with 1. definitional capabilities, such as macro rolling up and naming, and 2. control infrastructure for, e.g., recursion.
Motivation: Dublin Core Requirement 103
Constraints/shapes shall be specifiable in a concise language.
Motivation: Dublin Core Requirement 184
Collections of constraints/shapes may be addressable and discoverable. Individual constraints/shapes may be addressable and discoverable.
Motivation: Dublin Core Requirement 147 and Dublin Core Requirement 148
Constraints/shapes may incorporate extra information that does not affect validation. It shall be possible to search for constraints/shapes with particular extra information.
Motivation: Dublin Core Requirement 208
The language should enable the definition of macros as shortcuts to recurring patterns, and enable inexperienced users to define rich constraints. Macros should be high-level terms that improve overall readability, separation of concerns and maintainability. This overlaps with the already approved "Higher-Level Language".
It should be possible to encapsulate a group of constraints (a Shape) into a named entity, so that the Shape can be reused in multiple places, also across the Web.
In order to support maintainable and readable constraints, it should be possible to encapsulate recurring patterns into named entities such as functions and dynamically computed properties. This requirement is orthogonal to almost every user story. It includes a vocabulary to share function definitions.
Some constraint patterns are recurring with only slight modifications. Example: SKOS constraints that multiple properties must be pairwise disjoint. The language should make it possible to encapsulate such recurring patterns in a parameterizable form.
It should be possible to combine the high-level terms of the constraint language into larger expressions using nested constraints. Examples of this include ShEx, Resource Shapes' oslc:valueShape and owl:allValuesFrom.
Instead of just reporting yes/no, the language needs to be able to return more meaningful messages including severity levels, human-readable error descriptions and pointers at specific patterns in the graph.
The language should allow the creation of error responses that can include severity levels as desired.
Motivation: UC3
The language should make it possible for constraint checks to create human-readable violation messages that can be either created explicitly by the user or generated dynamically from constraint definition. It should be possible to create such messages in multiple languages.
Motivation: UC3
The language should make it possible for authors of constraint checks to produce pointers at specific nodes and graph fragments that caused the violation. Typical examples of such information includes the starting point (root node), a path from the root, and specific values that caused the problem.
Motivation: UC3
The language should include a notion of profiles, so that certain applications with limited features can only use certain elements of the overall language.
There shall be a core language or SHACL profile that excludes any support for constraints defined via embedded SPARQL queries or other complex lower-level expressions. This is so that lightweight applications can validate constraints without requiring a SPARQL processor or similar subsystem.
The stated values for a property may be limited by minimum/maximum cardinality, with typical patterns being [0..1], [1..1], [0..*] and [1..*].
The values of a property may be limited to be an RDF Literal with a stated datatype, such as xsd:string or xsd:date.
The values of a property may be limited by their type, e.g., all children have to be of type person.
The values of a property on instances of a class may be limited by their RDF node type, e.g. IRI, BlankNode, Literal, or BlankNodeOrIRI (for completeness we may want to support all 7 combinations including Node as parent).
Motivation: UC8
Similar to xsd:minInclusive/maxExclusive
Pattern matching against regular expressions (xsd:pattern).
Constraining the length of a string.
Shapes will provide exhaustive enumerations of the valid values (literals and IRIs).
Shapes can have constraints where the tested node is the object of a triple.
Motivation: UC36
It should be possible to provide a default value for a given property, e.g. so that input forms can be pre-populated. This requirement is not about using default values as "inferred" triples at run-time.
Some constraints require building new strings out of other strings, and building new URIs out of other values.
Some constraints require mathematical calculations and comparisons, e.g. area = width * height.
Motivation: UC5
Some constraints require operators such as <, >=, != etc, either against constants or other values that are dynamically retrieved at query time. Includes date/time comparison and functions such as NOW().
The language should allow users to implement constraints that check complex conditions, with an expressivity as covered by the following sub-requirements (e.g. basic graph patterns, string and mathematical operations and comparison of multiple values).
Many constraints require that a certain pattern does not exist in the graph.
The language should make it possible to express the basic logical operators intersection, union and negation of conditions.
Some constraints need to be able to traverse a property transitively, such as parent-child or partOf relationships.
There shall be a concise construct for expressing that a list must be well-formed.
Motivation: UC42
There shall be a way of applying the constraints that we can express for normal properties (require a certain rdf:type, require a certain shape, require a certain datatype, require a certain node kind, etc.) to the members of rdf:Lists.
Motivation: UC42
It should be possible to specialize/extend shapes so that the constraints defined for a more general (super) shape also apply to the specialized (sub) shape. Sub-shapes can only narrow down, i.e. further constrain.
Motivation: UC2, UC5, UC10, UC11, UC19, UC20, UC24, UC25, UC27, UC28, and UC29
It should be possible to specify constraint conditions that need to be checked "globally" for a whole graph, without referring to a specific set of resources or class. In programming languages such global entities are often called "static", but "global" is probably better known.
Motivation: UC35
It should be possible to validate constraints on a single node in a graph. This may be impossible to implement 100% correctly, because sometimes a change to a resource invalidates conditions in a very different place in the graph. However, the language could propose a framework that identifies those constraints that SHOULD be checked when a given node is evaluated, e.g. by following its rdf:type and the superclasses of that. This would include validating shacl:valueShape but not shacl:valueType.
Motivation: (Orthogonal to basically all use cases)
It should be possible to select all the RDF nodes in a graph for validation. This is similar to the Global Constraints (R9) requirement.
Motivation: UC35
It should be possible to have some mechanism to select the nodes that are instances of some class for validation.
Motivation: (Orthogonal to basically all stories)
It should be possible to select a single RDF node for validation.
Motivation: (Orthogonal to basically all stories)
There must be an "easy" way of associating a shape with a class, meaning that nodes in a graph that are instances of that class must conform with that shape
Motivation: UC3, UC10, UC11, UC12, UC13, UC15, UC19, UC20, UC29, and UC36
It should be possible to provide human-readable labels of a property in the context of a shape, intended for human consumption such as documentation or UI, not just globally for the rdf:Property. Multiple languages should be supported.
It should be possible to provide human-readable descriptions of the role of a property in the context of a shape, not just globally using triples that have the rdf:Property as subject. Multiple languages should be supported.
We would like to acknowledge the contributions of user story authors: Dean Allemang, Anamitra Bhattacharyya, Karen Coyle, Nick Crossley, Michel Dumontier, Jose Emilio Labra Gayo, Sandro Hawke, Dimitris Kontokostas, Holger Knublauch, David Martin, Dave McComb, Peter F. Patel-Schneider, Axel Polleres, Eric Prud'hommeaux, Arthur Ryman, Steve Speicher, and Simon Steyskal.