This document describes ShEx 1.0. For the product of the ShEx Community Group please see the ShEx 2.0 primer.
Shape Expressions Primer is a general introduction to the Shape Expressions language. The concepts in this document are linked to the normative definitions in Shape Expressions Definition.
This document was developed in response to the W3C RDF Validation Workshop. The contributions are not associated with any W3C Working Group or any Recommendations-track work.
Most data expression languages have an associated constraints language.
For instance, SQL has DDL, a language with expressions like CREATE TABLE "Issue" ("state" ENUM("unassigned", "assigned"), "reportedBy" (FOREIGN KEY "reportedBy" REFERENCES "User"("ID")...)
Likewise, XML has W3C XML Schema and RelaxNG to define data structure.
Shape Expressions is intended to perform the same function for RDF graphs.
Shape Expressions can be used to validate documents, communicate expected graph patterns for interfaces, and generate user interface forms and interface code.
The syntax and semantics of Shape Expressions are designed to be familiar to users of regular expressions.
The conspicuous difference is that regular expressions correlate an ordered pattern of atomic characters and logical operators against an ordered sequence of characters, whereas Shape Expressions correlate an ordered pattern of pairs of predicate and object classes (called NameClass and ValueClass) and logical operators against an unordered set of arcs in a graph.
The logical operators in Shape Expressions (grouping, conjunction, disjunction and cardinality constraints) are defined to match their counterparts in regular expressions and grammar languages like BNF as closely as possible.
The examples in this document can be used in an online demo. Links are indicated with a demoref class. Most of the document will focus on an annotated issue tracking example. An accompanying Examples document lists the pre-built examples and describes the demo user interface.
Most RDF languages have adopted the SPARQL conventions of BASE and PREFIX declarations.
The behavior of these is detailed in Turtle section 2.4 IRIs and briefly described here.
The PREFIX directive associates a short string with a long URI called a namespace.
These are used when writing URIs with the form prefix:localName.
The URI denoted by this form is the concatenation of the namespace associated with the prefix and the part to the right of the first ':' ("localName" in the above example).
Our example uses BASE and PREFIX directives:
BASE <http://base.example/#>
PREFIX ex: <http://ex.example/#>
PREFIX foaf: <http://foaf.example/#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
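The prefix expansion described above can be sketched in Python. This is an illustration only: the `PREFIXES` table and the `expand` helper are hypothetical names, not part of ShEx or of any library, and the sketch does not handle BASE resolution.

```python
# Hypothetical prefix table mirroring the PREFIX directives of the example.
PREFIXES = {
    "ex": "http://ex.example/#",
    "foaf": "http://foaf.example/#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Concatenate the namespace bound to the prefix with the local name."""
    # Split on the first ':' only, as the text above describes.
    prefix, local_name = prefixed_name.split(":", 1)
    return prefixes[prefix] + local_name


print(expand("ex:state"))      # http://ex.example/#state
print(expand("xsd:dateTime"))  # http://www.w3.org/2001/XMLSchema#dateTime
```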
Some grammar languages provide some starting point for validating documents or generating forms.
In Shape Expressions, the starting point is specified by the start keyword.
start = <IssueShape>
This directive says to start with the <IssueShape>, which is really <http://base.example/#IssueShape> because of the BASE directive above.
It is not necessary to identify a particular node in the graph for validation operations.
Nor is it necessary to provide a starting point for all operations.
For instance, generating a sequence of forms obviously needs to start somewhere, but some documents can be validated by optimistically testing each shape expression against each node in the graph.
This exhaustive search is more expensive and raises the possibility that a document validates in a way that the author of the document did not intend.
This document treats the more constrained scenario with a starting point in both the graph and the schema.
A shape expression is a labeled pattern for a set of RDF Triples with a common subject. Syntactically, it's a pairing of a label, which is an IRI or a blank node, and a rule inside a pair of "{" "}". Typically, this rule is a conjunction of constraints separated by ',':
<IssueShape> {
    ex:state (ex:unassigned ex:assigned),
    ex:reportedBy @<UserShape>,
    ex:reportedOn xsd:dateTime,
    ( ex:reproducedBy @<EmployeeShape>,
      ex:reproducedOn xsd:dateTime )?,
    ex:related @<IssueShape>*
}
(Links in the above example detail the interpretation of the above constraints.)
ex:state
The first constraint above, ex:state (ex:unassigned ex:assigned), specifies that the ex:state attribute must be one of the values ex:unassigned or ex:assigned.
The first part, ex:state, is called the NameClass, and identifies a class of RDF predicates.
The second part, (ex:unassigned ex:assigned), is called the ValueClass, and identifies a class of RDF objects.
Together, they form an ArcRule.
When used for validation, these combine to say that for some node in a graph to conform to an <IssueShape>, it must have exactly one ex:state with a value of ex:unassigned or ex:assigned.
When used for interface definition, this constraint could produce a form input with a selection for state of either "unassigned" or "assigned".
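As a rough illustration of the validation reading of this ArcRule, the Python sketch below checks the ex:state constraint. It assumes, purely for the example, that a graph is a set of (subject, predicate, object) tuples of IRI strings; `matches_state_rule` and the issue IRIs are hypothetical and not part of ShEx.

```python
EX = "http://ex.example/#"

# Value set from the rule: (ex:unassigned ex:assigned)
ALLOWED_STATES = {EX + "unassigned", EX + "assigned"}


def matches_state_rule(graph, node):
    """True if node has exactly one ex:state arc whose object is in the value set."""
    objects = [o for (s, p, o) in graph if s == node and p == EX + "state"]
    return len(objects) == 1 and objects[0] in ALLOWED_STATES


graph = {
    ("http://issues.example/1", EX + "state", EX + "unassigned"),
    ("http://issues.example/2", EX + "state", EX + "closed"),
}
print(matches_state_rule(graph, "http://issues.example/1"))  # True
print(matches_state_rule(graph, "http://issues.example/2"))  # False
```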
ex:reportedBy
The second ArcRule (constraint) above, ex:reportedBy @<UserShape>, asserts that the object of the ex:reportedBy property conforms to another labeled shape expression called <UserShape>.
This is a ValueReference to a shape expression described below.
As with ex:state above, the cardinality is exactly one.
ex:reportedOn
The third ArcRule above, ex:reportedOn xsd:dateTime, asserts that the object of the ex:reportedOn property is of type xsd:dateTime.
ShEx supports the same set of W3C XML Schema datatypes as does SPARQL.
Unlike SPARQL, ShEx validates the lexical representation of these datatypes, so the object of the ex:reportedOn property is tested against the XML Schema definition for dateTime.
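To give a feel for what testing a lexical representation involves, here is a deliberately simplified Python sketch. The regular expression only approximates the XML Schema dateTime lexical form (it does not reject out-of-range months, days or hours), and `looks_like_datetime` is a hypothetical helper, not something defined by ShEx.

```python
import re

# Simplified approximation of the xsd:dateTime lexical form:
# date, 'T', time, optional fractional seconds, optional timezone.
DATETIME_RE = re.compile(
    r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?"
)


def looks_like_datetime(lexical_form):
    """True if the whole string has the rough shape of an xsd:dateTime."""
    return DATETIME_RE.fullmatch(lexical_form) is not None


print(looks_like_datetime("2013-01-23T10:18:00"))  # True
print(looks_like_datetime("2013-01-23"))           # False (a date, not a dateTime)
print(looks_like_datetime("yesterday"))            # False
```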
(rule1, rule2)?
The fourth constraint above, ( ex:reproducedBy @<EmployeeShape>, ex:reproducedOn xsd:dateTime )?, is a GroupRule.
The '?' says that the cardinality is 0 or 1.
Together they assert that there may be both an ex:reproducedBy and an ex:reproducedOn, or neither.
The enclosed rules are ArcRules, similar to the ex:reportedBy and ex:reportedOn ArcRules above.
The <EmployeeShape> shape expression is described below.
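The both-or-neither behavior of this optional GroupRule can be sketched as follows. The sketch assumes a graph represented as a set of (subject, predicate, object) tuples and only counts arcs; it does not check that the objects conform to <EmployeeShape> or xsd:dateTime. The function name and node labels are hypothetical.

```python
EX = "http://ex.example/#"


def matches_optional_group(graph, node):
    """True if node has both ex:reproducedBy and ex:reproducedOn (once each) or neither."""
    by = sum(1 for (s, p, o) in graph if s == node and p == EX + "reproducedBy")
    on = sum(1 for (s, p, o) in graph if s == node and p == EX + "reproducedOn")
    # The group may match 0 times (no arcs) or 1 time (one arc per enclosed rule).
    return (by, on) in {(0, 0), (1, 1)}


graph = {
    ("issue1", EX + "reproducedBy", "employee7"),
    ("issue1", EX + "reproducedOn", "2013-01-23T11:00:00"),
    ("issue2", EX + "reproducedBy", "employee7"),
}
print(matches_optional_group(graph, "issue1"))  # True  (both present)
print(matches_optional_group(graph, "issue2"))  # False (only one of the pair)
print(matches_optional_group(graph, "issue3"))  # True  (neither present)
```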
ex:related @<IssueShape>*
The last constraint above, ex:related @<IssueShape>*, is an example of a cyclic rule: the object of ex:related must itself conform to <IssueShape>.
The '*' says that there may be any number of related issues, including 0.
<UserShape>
The following UserShape is referenced by the IssueShape above. Where the above labeled shape expression describes the data related to issues in some issue tracker, this captures the information about the users of that system.
<UserShape> {
    ( foaf:name xsd:string
    | foaf:givenName xsd:string+, foaf:familyName xsd:string ),
    foaf:mbox IRI
}
rule1 | rule2
The first constraint above, foaf:name xsd:string | foaf:givenName xsd:string+, foaf:familyName xsd:string, is an OrRule.
The rule on the right side of the '|' is another conjunction.
Together they assert that a user of the system has either a foaf:name or at least one foaf:givenName (+ means a cardinality of one or more) and exactly one foaf:familyName.
RDF 1.1 established that the datatype of a plain literal is xsd:string, so either [] foaf:name "Bob Smith" . or [] foaf:name "Bob Smith"^^xsd:string . would match the first disjoint.
The second disjoint is a conjunction of foaf:givenName xsd:string+, foaf:familyName xsd:string.
The foaf:givenName property has a cardinality of one or more, so the following graph would match: [] foaf:givenName "Robert", "Bob", "Bobby", "Robbie" ; foaf:familyName "Smith" .
A disjunction of n disjoints requires that the data match exactly one of the disjoints.
A graph like [] foaf:name "Bob Smith" ; foaf:givenName "Bob" ; foaf:familyName "Smith" . would be invalid because two disjoints match at once.
This implies that interfaces generated from an OrRule offer a choice; in this example, the user supplies either a full name or one or more given names and a family name.
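The exactly-one-disjoint requirement can be illustrated with a small Python sketch. It assumes a graph held as a set of (subject, predicate, object) tuples, only counts arcs per predicate, and ignores datatypes and any leftover triples; `matches_name_rule` and the node labels are hypothetical.

```python
FOAF = "http://foaf.example/#"


def count(graph, node, predicate):
    """Number of arcs leaving node with the given predicate."""
    return sum(1 for (s, p, o) in graph if s == node and p == predicate)


def matches_name_rule(graph, node):
    """True if exactly one disjoint of the name OrRule matches."""
    first = count(graph, node, FOAF + "name") == 1
    second = (count(graph, node, FOAF + "givenName") >= 1
              and count(graph, node, FOAF + "familyName") == 1)
    # Exactly one of the two disjoints may match, not both and not neither.
    return first != second


graph = {
    ("u1", FOAF + "name", "Bob Smith"),
    ("u2", FOAF + "givenName", "Robert"),
    ("u2", FOAF + "givenName", "Bob"),
    ("u2", FOAF + "familyName", "Smith"),
    ("u3", FOAF + "name", "Bob Smith"),
    ("u3", FOAF + "givenName", "Bob"),
    ("u3", FOAF + "familyName", "Smith"),
}
print(matches_name_rule(graph, "u1"))  # True  (first disjoint only)
print(matches_name_rule(graph, "u2"))  # True  (second disjoint only)
print(matches_name_rule(graph, "u3"))  # False (both disjoints match)
```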
foaf:mbox IRI
The last constraint above, foaf:mbox IRI, uses a special type called IRI.
Recall that ex:reportedOn and ex:reproducedOn specified that the object was a literal of type xsd:dateTime.
A type of IRI means that the object is an IRI instead of a literal.
<EmployeeShape>
The EmployeeShape below is referenced by the IssueShape above and included here for completeness. It does not introduce any new features of the language.
<EmployeeShape> {
    foaf:givenName xsd:string+,
    foaf:familyName xsd:string,
    foaf:phone IRI+,
    foaf:mbox IRI
}
The <IssueShape> example also includes a GroupRule with a cardinality of 0 or 1:
( ex:reproducedBy @<EmployeeShape>, ex:reproducedOn xsd:dateTime )?
As explained above, this requires that the data have either both of those triples or neither.
Cardinality constraints may appear on ArcRules and GroupRules.
They may be expressed as one of (?, +, *) or as one or two integers in {}s.
If there is only one number in {}s, the minimum cardinality is that number and the maximum is unconstrained.
An employee record which permitted from one to three given names would look like
foaf:givenName xsd:string{1,3}
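One way to picture these markers is as a (minimum, maximum) pair. The Python sketch below covers the ?, +, * and {m,n} forms, plus the unmarked default of exactly one; the single-integer form is left out on purpose. `bounds` and `satisfies` are hypothetical helper names.

```python
import re


def bounds(marker):
    """Translate a cardinality marker into (minimum, maximum); None means unbounded."""
    simple = {"": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}
    if marker in simple:
        return simple[marker]
    match = re.fullmatch(r"\{(\d+),(\d+)\}", marker)
    if match is None:
        raise ValueError("unsupported cardinality marker: " + marker)
    return int(match.group(1)), int(match.group(2))


def satisfies(count, marker):
    """True if a rule matched `count` times is allowed by the marker."""
    minimum, maximum = bounds(marker)
    return count >= minimum and (maximum is None or count <= maximum)


print(bounds("{1,3}"))        # (1, 3)
print(satisfies(2, "{1,3}"))  # True
print(satisfies(4, "{1,3}"))  # False
print(satisfies(0, "+"))      # False
```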
It is frequently useful to reuse or extend a shape.
For instance, if both the <UserShape> and <EmployeeShape> permitted the same alternatives for specifying a name and email address, these could be factored into a separate shape called a <PersonShape>:
<PersonShape> {
    ( foaf:name xsd:string
    | foaf:givenName xsd:string+, foaf:familyName xsd:string ),
    foaf:mbox IRI
}
<UserShape> {
    & <PersonShape>
}
<EmployeeShape> {
    & <PersonShape>,
    foaf:phone IRI+
}
In this example, the <UserShape> provides no additional constraints beyond those of the included <PersonShape>.
We may have several derivatives of <PersonShape>, any of which could provide an <IssueShape>'s ex:reportedBy value.
We can signify this by changing <IssueShape> to have ex:reportedBy @<PersonShape> and defining sub-shapes of <PersonShape>.
This is done with an inclusion directive before the shape definition.
We may not want the base <PersonShape> to satisfy any ValueReferences directly, instead requiring only derivatives of <PersonShape>.
This is accomplished by labeling <PersonShape> VIRTUAL per the hierarchy example:
<IssueShape> {
    ex:state (ex:unassigned ex:assigned),
    ex:reportedBy @<PersonShape>
    # ...
}
VIRTUAL <PersonShape> {
    ( foaf:name xsd:string
    | foaf:givenName xsd:string+, foaf:familyName xsd:string ),
    foaf:mbox IRI
}
<UserShape> & <PersonShape> {
    # additional User properties
}
<EmployeeShape> & <PersonShape> {
    foaf:phone IRI+
    # additional Employee properties
}
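The effect of VIRTUAL on ValueReferences can be sketched in Python as a lookup of which shapes may satisfy a reference. The `PARENTS` and `VIRTUAL` tables below are a hypothetical encoding of the hierarchy example, and `candidates` is an illustrative helper rather than anything defined by ShEx.

```python
# Hypothetical encoding of the hierarchy example: shape -> shape it includes.
PARENTS = {"UserShape": "PersonShape", "EmployeeShape": "PersonShape"}
VIRTUAL = {"PersonShape"}


def candidates(referenced):
    """Shapes that may satisfy a ValueReference to `referenced`."""
    result = set()
    for shape in {referenced} | set(PARENTS):
        # Walk up the inclusion chain until we reach `referenced` or run out.
        current = shape
        while current is not None and current != referenced:
            current = PARENTS.get(current)
        # Virtual shapes never satisfy a reference directly.
        if current == referenced and shape not in VIRTUAL:
            result.add(shape)
    return sorted(result)


print(candidates("PersonShape"))  # ['EmployeeShape', 'UserShape']
print(candidates("UserShape"))    # ['UserShape']
```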
Should there be a "greedy" directive to accept only the variant which touches the most triples?
The <IssueShape> example above includes both ex:reportedOn and ex:reproducedOn dateTimes.
It would be reasonable in the interest of data quality to ensure that the ex:reproducedOn dateTime, if present, were temporally after the ex:reportedOn dateTime.
While ShEx itself has no built-in functionality for comparing dateTimes, specific extensions may offer that functionality.
The example below (failed semantic action validation) includes semantic actions to test date order in either JavaScript or SPARQL:
ex:reportedOn xsd:dateTime
    %js{ report = _.o; return true; %},
(ex:reproducedBy @<EmployeeShape>,
 ex:reproducedOn xsd:dateTime
    %js{ return _.o.lex > report.lex; %}
    %sparql{ ?s ex:reportedOn ?rpt . FILTER (?o > ?rpt) %}
)
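For readers less familiar with the semantic action syntax, the same date-order test can be written as a standalone Python sketch. Unlike the JavaScript action, which compares lexical forms as strings, this version parses the values first; `reproduced_after_reported` is a hypothetical helper, and it assumes both values are ISO-style dateTimes written without a timezone.

```python
from datetime import datetime


def reproduced_after_reported(reported_on, reproduced_on):
    """True if the ex:reproducedOn value is strictly later than ex:reportedOn."""
    # Assumption: both lexical forms look like 2013-01-23T10:18:00 (no timezone).
    reported = datetime.fromisoformat(reported_on)
    reproduced = datetime.fromisoformat(reproduced_on)
    return reproduced > reported


print(reproduced_after_reported("2013-01-23T10:18:00", "2013-01-23T11:00:00"))  # True
print(reproduced_after_reported("2013-01-23T10:18:00", "2013-01-22T09:00:00"))  # False
```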
Semantic actions may also be used to generate schema-specific parsers or tools. Below is an excerpt of a tool which uses a DOM tree to translate the Issue example into an XML document:
<IssueShape> {
    ex:state (ex:unassigned ex:assigned)
        %js{
            doc = _.createDocument('http://ex.example/xml', 'Issue', undefined);
            issue = doc.documentElement;
            issue.setAttribute('id', _.s.lex.substr(17));
            state = doc.createElementNS('http://ex.example/xml', 'state');
            state.textContent = _.o.lex.substr(17);
            issue.appendChild(state);
        %},
    ex:reportedBy @<UserShape>,
    …
}
%js{ console.log(new XMLSerializer().serializeToString(doc)); %}
This example relies on a particular invocation for semantic actions, but illustrates the power in the extensibility mechanism.
The IssueShape tutorial above is oriented towards a particular use case where the schema uses a very explicit set of predicates and accepts no others. Shape Expressions is also useful for describing interfaces or graph patterns where any predicates are allowed except those in controlled namespaces. For example, some systems like Annotea reserved the assertion of dc:creator arcs for the system to maintain provenance information. The language summary below includes language features to describe such an interface.
feature | example | description |
---|---|---|
Matching a Predicate to a NameClass | ||
NameTerm | ex:state | The predicate of any matching triple is the same as the NameTerm IRI. |
NameStem | ex:~ | The predicate of any matching triple starts with the IRI. |
NameAny | . - rdf:type - ex:~ | A matching triple has any predicate except those NameTerms or NameStems excluded by the '-' operator. |
Matching an Object to a ValueClass | ||
ValueType | xsd:dateTime | The object of any matching triple is a literal whose datatype is the ValueType IRI. |
ValueSet | (ex:unassigned ex:assigned) | The object of any matching triple is one of the terms listed in the ValueSet. |
ValueStem | ex:~ | The object of any matching triple starts with the IRI. |
ValueAny | . - rdf:type - ex:~ | A matching triple has any object except those terms or stems excluded by the '-' operator. |
ValueReference | @<UserShape> | The object of a matching triple is an IRI or blank node and that node is the subject of triples matching the referenced shape expression. |
Rule Types | ||
ArcRule | foaf:givenName xsd:string+ | A matching triple matches the NameClass and the ValueClass. Cardinality constraints apply. |
AndRule | foaf:givenName xsd:string, foaf:familyName xsd:string | Each conjoint matches the input graph. |
OrRule | foaf:givenName xsd:string \| foaf:name xsd:string | Exactly one disjoint matches the input graph. |
GroupRule | (ex:reproducedBy @<EmployeeShape>, ex:reproducedOn xsd:dateTime) | A matching triple matches the enclosed rule (here an AndRule). Cardinality constraints apply. |
Cardinality | ||
? | foaf:givenName xsd:string? | rule must match 0 or 1 times. |
+ | foaf:givenName xsd:string+ | rule must match 1 or more times. |
* | foaf:givenName xsd:string* | rule must match 0 or more times. |
{m} | foaf:givenName xsd:string{3} | rule must match m times. |
{m,n} | foaf:givenName xsd:string{3,5} | rule must match at least m times and no more than n times. |
Cardinality constraints may appear after an ArcRule. A '?' may also appear after a GroupRule to indicate that it is optional. Any AndRule nested immediately inside the GroupRule must have every rule match or no rule match. | ||
Rule Inclusions | ||
&RuleName | & <PersonShape> | Include the referenced rule in place of the include directive. |
Rule Inclusions may appear before a shape definition or inside of a definition. Before a shape definition, they signify the inclusion of the referenced rule ("included rule") at the beginning of the one being defined, as well as asserting that ValueReferences to the included rule accept the defined shape as well. | ||
Semantic Actions | ||
%lang{ code %} | %js{ return _.o.lex > report.lex; %} %sparql{ ?s ex:reportedOn ?rpt . FILTER (?o > ?rpt) %} | Invoke semantic actions when a rule is satisfied. |
Semantic Actions may appear after an ArcRule, a GroupRule or a named Shape Expression. When used with validation, they are invoked only on valid pairs of a triple and a rule. Their use for interface validation is currently undefined. |
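The NameStem and NameAny rows of the summary can be illustrated with a short Python sketch that works on expanded predicate IRIs. `matches_stem` and `matches_any` are hypothetical helpers, and the namespaces are those of the example PREFIX directives.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://ex.example/#"
FOAF = "http://foaf.example/#"


def matches_stem(predicate, stem):
    """NameStem (e.g. ex:~): the predicate IRI starts with the stem IRI."""
    return predicate.startswith(stem)


def matches_any(predicate, excluded_terms=(), excluded_stems=()):
    """NameAny: any predicate except the excluded terms and stems."""
    if predicate in excluded_terms:
        return False
    return not any(matches_stem(predicate, stem) for stem in excluded_stems)


# Corresponds to the NameAny example: . - rdf:type - ex:~
def open_rule(predicate):
    return matches_any(predicate, excluded_terms={RDF_TYPE}, excluded_stems={EX})


print(open_rule(FOAF + "name"))  # True
print(open_rule(RDF_TYPE))       # False (excluded term)
print(open_rule(EX + "state"))   # False (excluded stem)
```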