Shape Expressions 1.0 Primer

Shape Expressions Primer is a general introduction to the Shape Expressions language. The concepts in this document are linked to the normative definitions in Shape Expressions Definition.

Status of this Document

This document was developed in response to the W3C RDF Validation Workshop. The contributions are not associated with any W3C Working Group or any Recommendations-track work.

Introduction

Most data expression languages have an associated constraints language. For instance, SQL has DDL, a language with expressions like

CREATE TABLE "Issue" ("state" ENUM("unassigned", "assigned"), "reportedBy" (FOREIGN KEY "reportedBy" REFERENCES "User"("ID")...)

Likewise, XML has W3C XML Schema and RelaxNG to define data structure. Shape Expressions is intended to perform the same function for RDF graphs.

Shape Expressions can be used to validate documents, communicate expected graph patterns for interfaces, and generate user interface forms and interface code. The syntax and semantics of Shape Expressions are designed to be familiar to users of regular expressions. The conspicuous difference is that regular expressions correlate an ordered pattern of atomic characters and logical operators against an ordered sequence of characters, whereas Shape Expressions correlate an ordered pattern of pairs of predicate and object classes (called NameClass and ValueClass) and logical operators against an unordered set of arcs in a graph. The logical operators in Shape Expressions (grouping, conjunction, disjunction and cardinality constraints) are defined to correspond as closely as possible to their counterparts in regular expressions and grammar languages like BNF.

Namespace Prefixes

Most RDF languages have adopted the SPARQL conventions of BASE and PREFIX declarations. The behavior of these is detailed in Turtle section 2.4, IRIs, but is briefly described here.

The PREFIX directive associates a short string (the prefix) with a long URI called a namespace. Prefixes are used when writing URIs in the form prefix:localName. The URI denoted by this form is the concatenation of the namespace associated with the prefix and the part to the right of the first ':' ("localName" in the above example).
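The expansion rule can be illustrated with a small Python sketch. The function name and the namespace IRI bound to `ex` are assumptions made for this example only; they are not defined by Shape Expressions or Turtle.

```python
def expand_prefixed_name(name: str, prefixes: dict[str, str]) -> str:
    """Expand prefix:localName using a prefix-to-namespace map."""
    # Split on the first ':' only; the local name may itself contain ':'.
    prefix, sep, local_name = name.partition(":")
    if not sep:
        raise ValueError(f"not a prefixed name: {name!r}")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    # The denoted URI is the namespace followed by the local name.
    return prefixes[prefix] + local_name


# Hypothetical namespace IRI, chosen only for illustration.
PREFIXES = {"ex": "http://ex.example/#"}

print(expand_prefixed_name("ex:state", PREFIXES))  # http://ex.example/#state
```

An undeclared prefix is reported as an error in this sketch rather than being passed through unchanged.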

Start Rule and Pointed Graph

Some grammar languages provide some starting point for validating documents or generating forms. In Shape Expressions, the starting point is specified by the start keyword.

This directive says to start with the <IssueShape>, which is really <http://base.example/#IssueShape> because of the BASE directive above.

It is not necessary to identify a particular node in the graph for validation operations. Nor is it necessary to provide a starting point for all operations. For instance, generating a sequence of forms obviously needs to start somewhere, but some documents can be validated by optimistically testing each shape expression against each node in the graph. This exhaustive search is more expensive and raises the possibility that a document validates in a way that the author of the document did not intend. This document treats the more constrained scenario with a starting point in both the graph and the schema.

Labeled Shape Expression

A shape expression is a labeled pattern for a set of RDF Triples with a common subject. Syntactically, it's a pairing of a label, which is an IRI or a blank node, and a rule inside a pair of "{" "}". Typically, this rule is a conjunction of constraints separated by ',':

<IssueShape> {
    ex:state (ex:unassigned ex:assigned),
    ex:reportedBy @<UserShape>,
    ex:reportedOn xsd:dateTime,
    ( ex:reproducedBy @<EmployeeShape>,
      ex:reproducedOn xsd:dateTime      )?,
    ex:related @<IssueShape>*
}

example issue `ex:state`

The first constraint above, ex:state (ex:unassigned ex:assigned), specifies that the ex:state attribute must be one of the values ex:unassigned or ex:assigned. The first part, ex:state, is called the NameClass, and identifies a class of RDF predicates. The second part, (ex:unassigned ex:assigned), is called the ValueClass, and identifies a class of RDF objects. Together, they form an ArcRule.

When used for validation, these combine to say that for some node in a graph to conform to an <IssueShape>, it must have exactly one ex:state with a value of ex:unassigned or ex:assigned. When used for interface definition, this constraint could produce an input in a form with a selection for state of either "unassigned" or "assigned".
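The validation reading of this ArcRule can be sketched in Python. This is an illustration under simplifying assumptions, not an implementation of the language: the graph is a set of (subject, predicate, object) tuples written as prefixed names, and the function name and the node names ex:issue1, ex:issue2 and ex:issue3 are hypothetical.

```python
def matches_state_rule(graph, node):
    """True if `node` has exactly one ex:state arc whose object is in the ValueSet."""
    value_set = {"ex:unassigned", "ex:assigned"}
    # Collect the objects of every ex:state arc leaving `node`.
    objects = [o for (s, p, o) in graph if s == node and p == "ex:state"]
    # Default cardinality is exactly one, and that one value must be in the set.
    return len(objects) == 1 and objects[0] in value_set


# Hypothetical example data.
graph = {
    ("ex:issue1", "ex:state", "ex:unassigned"),
    ("ex:issue2", "ex:state", "ex:closed"),
}

print(matches_state_rule(graph, "ex:issue1"))  # True
print(matches_state_rule(graph, "ex:issue2"))  # False: value not in the set
print(matches_state_rule(graph, "ex:issue3"))  # False: no ex:state arc
```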

example issue `ex:reportedBy`

The second ArcRule (constraint) above, ex:reportedBy @<UserShape>, asserts that the object of the ex:reportedBy property conforms to another labeled shape expression called <UserShape>. This is a ValueReference to a shape expression described below. As with ex:state above, the cardinality is exactly one.

example issue `ex:reportedOn`

The third ArcRule above, ex:reportedOn xsd:dateTime, asserts that the object of the ex:reportedOn property is of type xsd:dateTime. ShEx supports the same set of W3C XML Schema datatypes as does SPARQL. Unlike SPARQL, ShEx validates the lexical representation of these datatypes, so the object of the ex:reportedOn property is tested against the XML Schema definition for dateTime.
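To give a feel for what a lexical check involves, the following Python sketch tests a string against an approximation of the xsd:dateTime lexical form. It is not the XML Schema definition itself: the regular expression and the function name are assumptions made here, and the pattern checks field shapes and basic ranges only (it would, for instance, accept February 30).

```python
import re

# Approximate xsd:dateTime pattern: date, 'T', time, optional fraction and zone.
_DATETIME = re.compile(
    r"-?\d{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
    r"T([01]\d|2[0-3]):[0-5]\d:[0-5]\d(\.\d+)?"
    r"(Z|[+-]([01]\d|2[0-3]):[0-5]\d)?"
)


def looks_like_xsd_datetime(lexical: str) -> bool:
    """True if the whole string fits the approximate dateTime pattern."""
    return _DATETIME.fullmatch(lexical) is not None


print(looks_like_xsd_datetime("2013-01-23T10:18:00"))  # True
print(looks_like_xsd_datetime("2013-01-23"))           # False: no time part
print(looks_like_xsd_datetime("2013-13-23T10:18:00"))  # False: month 13
```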

example issue `(rule1, rule2)?`

The fourth constraint above, ( ex:reproducedBy @<EmployeeShape>, ex:reproducedOn xsd:dateTime )?, is a GroupRule. The '?' says that the cardinality is 0 or 1. Together they assert that there may be both an ex:reproducedBy and an ex:reproducedOn, or neither. The enclosed rules are ArcRules, similar to the ex:reportedBy and ex:reportedOn ArcRules above. The <EmployeeShape> shape expression is described below.
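The "both or neither" reading of an optional group can be sketched in Python. This sketch only counts arcs per predicate and deliberately ignores the ValueClass checks; the function name and the example nodes are hypothetical.

```python
def optional_group_ok(graph, node, predicates):
    """True if `node` has exactly one arc for every predicate, or none for any."""
    counts = [
        sum(1 for (s, p, _o) in graph if s == node and p == predicate)
        for predicate in predicates
    ]
    all_present = all(c == 1 for c in counts)
    all_absent = all(c == 0 for c in counts)
    return all_present or all_absent


GROUP = ["ex:reproducedBy", "ex:reproducedOn"]

# Hypothetical example data.
graph = {
    ("ex:issue1", "ex:reproducedBy", "ex:emp1"),
    ("ex:issue1", "ex:reproducedOn", "2013-01-23T11:00:00"),
    ("ex:issue2", "ex:reproducedBy", "ex:emp1"),
}

print(optional_group_ok(graph, "ex:issue1", GROUP))  # True: both present
print(optional_group_ok(graph, "ex:issue2", GROUP))  # False: only one present
print(optional_group_ok(graph, "ex:issue3", GROUP))  # True: neither present
```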

example issue `ex:related @<IssueShape>`

The last constraint above, ex:related @<IssueShape>*, is an example of a cyclic rule: the <IssueShape> refers to itself. The '*' says that there may be any number of related issues, including 0.
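Because issues may refer to each other, a naive recursive check could loop forever on a graph where two issues are related to one another. One common way to avoid this, sketched below in Python, is to remember which nodes are already being checked and to treat a revisit as provisionally conforming. This strategy is an assumption of the sketch, not something this primer prescribes, and only the ex:state and ex:related constraints are checked.

```python
STATES = {"ex:unassigned", "ex:assigned"}


def conforms_to_issue_shape(graph, node, visiting=None):
    """Check ex:state and, recursively, ex:related @<IssueShape>* for `node`."""
    visiting = set() if visiting is None else visiting
    if node in visiting:
        # Already under test further up the call chain: do not recurse again.
        return True
    visiting.add(node)
    states = [o for (s, p, o) in graph if s == node and p == "ex:state"]
    if len(states) != 1 or states[0] not in STATES:
        return False
    # '*' allows zero or more related issues; each must itself conform.
    related = [o for (s, p, o) in graph if s == node and p == "ex:related"]
    return all(conforms_to_issue_shape(graph, r, visiting) for r in related)


# Hypothetical data: issue1 and issue2 are related to each other (a cycle).
graph = {
    ("ex:issue1", "ex:state", "ex:assigned"),
    ("ex:issue1", "ex:related", "ex:issue2"),
    ("ex:issue2", "ex:state", "ex:unassigned"),
    ("ex:issue2", "ex:related", "ex:issue1"),
    ("ex:issue3", "ex:state", "ex:assigned"),
    ("ex:issue3", "ex:related", "ex:issue4"),  # issue4 has no ex:state
}

print(conforms_to_issue_shape(graph, "ex:issue1"))  # True
print(conforms_to_issue_shape(graph, "ex:issue3"))  # False
```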

example `<UserShape>`

The following UserShape is referenced by the IssueShape above. Where the above labeled shape expression describes the data related to issues in some issue tracker, this captures the information about the users of that system.

<UserShape> {
    (foaf:name xsd:string
     | foaf:givenName xsd:string+,
       foaf:familyName xsd:string),
    foaf:mbox IRI
}

example user `(rule1 | rule2)`

The first constraint above, foaf:name xsd:string | foaf:givenName xsd:string+, foaf:familyName xsd:string, is an OrRule. The rule on the right side of the '|' is another conjunction. Together they assert that a user of the system has either a foaf:name or at least one foaf:givenName (+ means a cardinality of one or more) and exactly one foaf:familyName.

RDF 1.1 established that the datatype of a plain literal is xsd:string so either [] foaf:name "Bob Smith" . or [] foaf:name "Bob Smith"^^xsd:string . would match the first disjoint.

The second disjoint is a conjunction of foaf:givenName xsd:string+, foaf:familyName xsd:string. The foaf:givenName property has a cardinality of one or more, so the following graph would match: [] foaf:givenName "Robert", "Bob", "Bobby", "Robbie" ; foaf:familyName "Smith".

A disjunction of n disjoints requires that the data match exactly one of the disjoints. A graph like [] foaf:name "Bob Smith" ; foaf:givenName "Bob" ; foaf:familyName "Smith" would be invalid because two disjoints match at once. This implies that interfaces generated from an OrRule offer a choice to, in this example, supply either a full name or one or more given names and a family name.
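The "exactly one disjoint" requirement behaves like an exclusive or. The Python sketch below applies it to the name constraint of <UserShape>; it counts arcs per predicate, ignores datatypes, and does not consider leftover triples, all of which are simplifications assumed for this example. The function and node names are hypothetical.

```python
def count_arcs(graph, node, predicate):
    """Number of arcs leaving `node` with the given predicate."""
    return sum(1 for (s, p, _o) in graph if s == node and p == predicate)


def name_or_rule_ok(graph, node):
    """True if exactly one of the two name alternatives matches."""
    # First disjoint: exactly one foaf:name.
    full_name = count_arcs(graph, node, "foaf:name") == 1
    # Second disjoint: one or more foaf:givenName and exactly one foaf:familyName.
    split_name = (
        count_arcs(graph, node, "foaf:givenName") >= 1
        and count_arcs(graph, node, "foaf:familyName") == 1
    )
    return full_name != split_name  # exclusive or


# Hypothetical example data.
graph = {
    ("_:a", "foaf:name", "Bob Smith"),
    ("_:b", "foaf:givenName", "Robert"),
    ("_:b", "foaf:givenName", "Bob"),
    ("_:b", "foaf:familyName", "Smith"),
    ("_:c", "foaf:name", "Bob Smith"),
    ("_:c", "foaf:givenName", "Bob"),
    ("_:c", "foaf:familyName", "Smith"),
}

print(name_or_rule_ok(graph, "_:a"))  # True: full name only
print(name_or_rule_ok(graph, "_:b"))  # True: given names and family name only
print(name_or_rule_ok(graph, "_:c"))  # False: both disjoints match
```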

example user `foaf:mbox IRI`

The last constraint above, foaf:mbox IRI, uses a special type called IRI. Recall that ex:reportedOn and ex:reproducedOn specified that the object was a literal of type xsd:dateTime. A type of IRI means that the object is an IRI instead of a literal.

example `<EmployeeShape>`

The EmployeeShape below is referenced by the IssueShape above and included here for completeness. It does not introduce any new features of the language.

<EmployeeShape> {
    foaf:givenName xsd:string+,
    foaf:familyName xsd:string,
    foaf:phone IRI+,
    foaf:mbox IRI
}

Inheritance and Inclusion

It is frequently useful to reuse or extend a shape. For instance, if both the <UserShape> and <EmployeeShape> permitted the same alternatives for specifying a name and email address, these could be factored into a separate shape called a <PersonShape>:

<PersonShape> {
    ( foaf:name xsd:string
      | foaf:givenName xsd:string+,
        foaf:familyName xsd:string
    ),
    foaf:mbox IRI
}

<UserShape> {
    & <PersonShape>
}

<EmployeeShape> {
    & <PersonShape>,
    foaf:phone IRI+
}

In this example, the <UserShape> provides no additional constraints beyond those of the included <PersonShape>.

We may have several derivatives of <PersonShape>, any of which could provide an <IssueShape>'s ex:reportedBy value. We can signify this by changing <IssueShape> to have ex:reportedBy @<PersonShape> and defining sub-shapes of <PersonShape>. This is done with an inclusion directive before the shape definition. We may not want the base <PersonShape> to satisfy any ValueReferences directly, instead requiring only derivatives of <PersonShape>. This is accomplished by labeling <PersonShape> VIRTUAL per the hierarchy example:

<IssueShape> {
    ex:state (ex:unassigned ex:assigned),
    ex:reportedBy @<PersonShape>
    # ...
}

VIRTUAL <PersonShape> {
    ( foaf:name xsd:string
      | foaf:givenName xsd:string+,
        foaf:familyName xsd:string
    ),
    foaf:mbox IRI
}

<UserShape> & <PersonShape> {
    # additional User properties
}

<EmployeeShape> & <PersonShape> {
    foaf:phone IRI+
    # additional Employee properties
}
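The effect of this hierarchy on ValueReferences can be sketched in Python: a reference to a shape accepts that shape and everything derived from it, except shapes marked VIRTUAL. The dictionaries and function names below are a hypothetical encoding of the schema above, assumed for illustration only.

```python
# Each derived shape maps to the shapes it includes before its definition.
PARENTS = {"UserShape": ["PersonShape"], "EmployeeShape": ["PersonShape"]}
VIRTUAL = {"PersonShape"}


def derives_from(shape, target, parents):
    """True if `shape` is `target` or includes it, directly or transitively."""
    stack, seen = [shape], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue  # guard against accidental inclusion cycles
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def accepted_shapes(target, parents, virtual):
    """Shapes that may satisfy a ValueReference to `target`."""
    candidates = set(parents) | {target}
    return {
        s for s in candidates
        if s not in virtual and derives_from(s, target, parents)
    }


print(sorted(accepted_shapes("PersonShape", PARENTS, VIRTUAL)))
# ['EmployeeShape', 'UserShape']
```

Without the VIRTUAL marking, the same call would also return PersonShape itself.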

Should there be a "greedy" directive to accept only the variant which touches the most triples?

Language Summary

The IssueShape tutorial above is oriented towards a particular use case where the schema will use a very explicit set of predicates and accept no others. Shape Expressions is also useful for describing interfaces or graph patterns where any predicates are allowed except those in controlled namespaces. For example, some systems like Annotea reserved the assertion of dc:creator arcs for the system to maintain provenance information. The language summary below includes language features to describe such an interface.

feature	example	description
		Matching a Predicate to a NameClass
NameTerm	ex:state	The predicate of any matching triple is the same as the NameTerm IRI.
NameStem	ex:~	The predicate of any matching triple starts with the IRI.
NameAny	. - rdf:type - ex:~	A matching triple has any predicate except those NameTerms or NameStems excluded by the '-' operator.
		Matching an Object to a ValueClass
ValueType	xsd:dateTime	The datatype of the object of any matching triple is the same as the ValueType IRI.
ValueSet	(ex:unassigned ex:assigned)	The object of any matching triple is one of the values listed in the ValueSet.
ValueStem	ex:~	The object of any matching triple starts with the IRI.
ValueAny	. - rdf:type - ex:~	A matching triple has any object except those terms or stems excluded by the '-' operator.
ValueReference	@<UserShape>	The object of a matching triple is an IRI or blank node, and that node is the subject of triples matching the referenced shape expression.
		Rule Types
ArcRule	foaf:givenName xsd:string+	A matching triple matches the NameClass and the ValueClass. Cardinality constraints apply.
AndRule	foaf:givenName xsd:string, foaf:familyName xsd:string	Each conjoint matches the input graph.
OrRule	foaf:givenName xsd:string | foaf:name xsd:string	Exactly one disjoint matches the input graph.
GroupRule	(ex:reproducedBy @<EmployeeShape>, ex:reproducedOn xsd:dateTime)	A matching triple matches the enclosed rule (here an AndRule). Cardinality constraints apply.
		Cardinality
?	foaf:givenName xsd:string?	rule must match 0 or 1 times.
+	foaf:givenName xsd:string+	rule must match 1 or more times.
*	foaf:givenName xsd:string*	rule must match 0 or more times.
{m}	foaf:givenName xsd:string{3}	rule must match m times.
{m,n}	foaf:givenName xsd:string{3,5}	rule must match at least m times and no more than n times.
	Cardinality constraints may appear after an ArcRule. A '?' may also appear after a GroupRule to indicate that it is optional. Any AndRule nested immediately inside the GroupRule must have every rule match or no rule match.
		Rule Inclusions
&`RuleName`	& <PersonShape>	Include the referenced rule in place of the include directive.
	Rule Inclusions may appear before a shape definition or inside of a definition. Before a shape definition, they signify the inclusion of the referenced rule ("included rule") at the beginning of the one being defined, as well as asserting that ValueReferences to the included rule accept the defined shape as well.
		Semantic Actions
%lang{ code %}	%js{ return _.o.lex > report.lex; %} %sparql{ ?s ex:reportedOn ?rpt . FILTER (?o > ?rpt) %}	Invoke semantic actions when a rule is satisfied.
	Semantic Actions may appear after an ArcRule, a GroupRule or a named Shape Expression. When used with validation, they are invoked only on valid pairs of a triple and a rule. Their use for interface validation is currently undefined.
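Two parts of the summary, the cardinality markers and the NameAny exclusions, lend themselves to a short Python sketch. The function names are assumptions of this example, predicates are compared as full IRI strings, and the namespace used for ex: is hypothetical. A missing marker is treated as "exactly one", following the tutorial above.

```python
import math
import re


def parse_cardinality(marker: str) -> tuple[int, float]:
    """Map a cardinality marker to (min, max); max is math.inf if unbounded."""
    fixed = {"": (1, 1), "?": (0, 1), "+": (1, math.inf), "*": (0, math.inf)}
    if marker in fixed:
        return fixed[marker]
    match = re.fullmatch(r"\{(\d+)(?:,(\d+))?\}", marker)
    if match is None:
        raise ValueError(f"unrecognized cardinality: {marker!r}")
    low = int(match.group(1))
    # {m} means exactly m; {m,n} means between m and n inclusive.
    high = int(match.group(2)) if match.group(2) is not None else low
    return (low, high)


def matches_name_any(predicate: str, excluded_terms, excluded_stems) -> bool:
    """NameAny: any predicate except excluded terms and excluded stems."""
    if predicate in excluded_terms:
        return False
    return not any(predicate.startswith(stem) for stem in excluded_stems)


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://ex.example/#"  # hypothetical namespace for ex:

print(parse_cardinality("{3,5}"))  # (3, 5)
print(parse_cardinality("+"))      # (1, inf)
# ". - rdf:type - ex:~" excludes rdf:type and everything in the ex: namespace.
print(matches_name_any("http://purl.org/dc/elements/1.1/creator", {RDF_TYPE}, {EX}))  # True
print(matches_name_any(EX + "state", {RDF_TYPE}, {EX}))  # False
```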
