W3C Member Submission

Shape Expressions 1.0 Primer

W3C Member Submission 2 June 2014

This version:
http://www.w3.org/submissions/2014/SUBM-shex-primer-20140602/
Latest published version:
http://www.w3.org/submissions/shex-primer/
Previous version:
Editor:
Eric Prud'hommeaux,

Abstract

Shape Expressions associate RDF graphs with labeled patterns called "shapes". Shapes can be used for validation, documentation and transformation of RDF data.

Shape Expressions Primer is a general introduction to the Shape Expressions language. The concepts in this document are linked to the normative definitions in Shape Expressions Definition.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document describes the Shape Expressions language developed as a community effort. It is being submitted to W3C so that it can inform the development of a future RDF Data Shape specification.

By publishing this document, W3C acknowledges that the Submitting Members have made a formal Submission request to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. A W3C Team Comment has been published in conjunction with this Member Submission. Publication of acknowledged Member Submissions at the W3C site is one of the benefits of W3C Membership. Please consult the requirements associated with Member Submissions of section 3.3 of the W3C Patent Policy. Please consult the complete list of acknowledged W3C Member Submissions.

Introduction

Most data expression languages have an associated constraints language. For instance, SQL has DDL, a language with expressions like CREATE TABLE "Issue" ("state" ENUM("unassigned", "assigned"), "reportedBy" (FOREIGN KEY "reportedBy" REFERENCES "User"("ID")...). Likewise, XML has W3C XML Schema and RelaxNG to define data structure. Shape Expressions is intended to perform the same function for RDF graphs.

Shape Expressions can be used to validate documents, communicate expected graph patterns for interfaces, and generate user interface forms and interface code. The syntax and semantics of Shape Expressions are designed to be familiar to users of regular expressions. The conspicuous difference is that regular expressions correlate an ordered pattern of atomic characters and logical operators against an ordered sequence of characters, whereas Shape Expressions correlate an ordered pattern of pairs of predicate and object classes (called NameClass and ValueClass) and logical operators against an unordered set of arcs in a graph. The logical operators in Shape Expressions (grouping, conjunction, disjunction and cardinality constraints) are defined to behave as closely as possible to their counterparts in regular expressions and grammar languages like BNF.

The examples in this document can be used in an online demo. Links to the demo are indicated with a demoref class. Most of the document will focus on an annotated issue tracking example. An accompanying Examples document lists the pre-built examples and describes the demo user interface.

Namespace Prefixes

Most RDF languages have adopted the SPARQL conventions of BASE and PREFIX declarations. The behavior of these is detailed in Turtle section 2.4, IRIs, and is briefly described here.

The PREFIX directive associates a short string with a long URI called a namespace. These are used when writing URIs with the form prefix:localName. The URI denoted by this is the concatenation of the namespace associated with the prefix and the part to the right of the first ':' ("localName", in the above example).

Our example uses BASE and PREFIX directives:

BASE <http://base.example/#>
PREFIX ex: <http://ex.example/#>
PREFIX foaf: <http://foaf.example/#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
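
The expansion rule described above can be sketched in Python. This is only an illustration under stated assumptions: the PREFIXES table and the expand function are hypothetical names, not part of any ShEx tool, and the sketch ignores escaping and the resolution of relative IRIs against BASE.

```python
# Hypothetical prefix table mirroring the PREFIX directives of the example.
PREFIXES = {
    "ex": "http://ex.example/#",
    "foaf": "http://foaf.example/#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name: str) -> str:
    """Concatenate the namespace bound to the prefix with the local name."""
    # Split on the first ':' only; everything to its right is the local name.
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


print(expand("ex:state"))      # http://ex.example/#state
print(expand("xsd:dateTime"))  # http://www.w3.org/2001/XMLSchema#dateTime
```

An unknown prefix raises a KeyError here; a real parser would report a syntax error instead.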

Start Rule and Pointed Graph

Some grammar languages provide a starting point for validating documents or generating forms. In Shape Expressions, the starting point is specified by the start keyword.

start = <IssueShape>

This directive says to start with the <IssueShape>, which is really <http://base.example/#IssueShape> because of the BASE directive above.

It is not necessary to identify a particular node in the graph for validation operations. Nor is it necessary to provide a starting point for all operations. For instance, generating a sequence of forms obviously needs to start somewhere, but some documents can be validated by optimistically testing each shape expression against each node in the graph. This exhaustive search is more expensive and raises the possibility that a document validates in a way that the author of the document did not intend. This document treats the more constrained scenario with a starting point in both the graph and the schema.

Labeled Shape Expression (defn)

A shape expression is a labeled pattern for a set of RDF Triples with a common subject. Syntactically, it's a pairing of a label, which is an IRI or a blank node, and a rule inside a pair of "{" "}". Typically, this rule is a conjunction of constraints separated by ',':

<IssueShape> {
    ex:state (ex:unassigned ex:assigned),
    ex:reportedBy @<UserShape>,
    ex:reportedOn xsd:dateTime,
    ( ex:reproducedBy @<EmployeeShape>,
      ex:reproducedOn xsd:dateTime      )?,
    ex:related @<IssueShape>*
}

The rules in the above example have links to sections describing their interpretation.

example issue ex:state

The first constraint above, ex:state (ex:unassigned ex:assigned), specifies that the ex:state attribute must be one of the values ex:unassigned or ex:assigned. The first part, ex:state, is called the NameClass, and identifies a class of RDF predicates. The second part, (ex:unassigned ex:assigned), is called the ValueClass, and identifies a class of RDF objects. Together, they form an ArcRule. In this ArcRule, the NameClass is a pfIRI (defn) and the ValueClass is a ValueSet (defn).

When used for validation, these combine to say that for some node in a graph to conform to an <IssueShape>, it must have exactly one ex:state with a value of ex:unassigned or ex:assigned. When used for interface definition, this constraint could produce an input in a form with a selection for state of either "unassigned" or "assigned", e.g.

state:

unassigned
assigned
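
As a rough illustration of the validation reading of this ArcRule, the following Python sketch checks a single node against the ex:state constraint. The graph representation (a set of subject, predicate, object string triples) and the function name are assumptions made for this example, and the sketch simply counts every ex:state arc on the node.

```python
# Assumed model: a graph is a set of (subject, predicate, object) string triples.
EX = "http://ex.example/#"

# The ValueSet (ex:unassigned ex:assigned) as expanded IRIs.
ALLOWED_STATES = {EX + "unassigned", EX + "assigned"}


def state_conforms(graph, node):
    """True if node has exactly one ex:state arc whose object is in the ValueSet."""
    values = [o for (s, p, o) in graph if s == node and p == EX + "state"]
    return len(values) == 1 and values[0] in ALLOWED_STATES


graph = {
    ("http://base.example/#issue1", EX + "state", EX + "unassigned"),
    ("http://base.example/#issue2", EX + "state", EX + "closed"),
}
print(state_conforms(graph, "http://base.example/#issue1"))  # True
print(state_conforms(graph, "http://base.example/#issue2"))  # False
```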

example issue ex:reportedBy

The second ArcRule (constraint) above, ex:reportedBy @<UserShape>, asserts that the object of the ex:reportedBy property conforms to another labeled shape expression called <UserShape>. This is a ValueReference (defn) to a shape expression described below. As with ex:state above, the cardinality is exactly one.

example issue ex:reportedOn

The third ArcRule above, ex:reportedOn xsd:dateTime, asserts that the object of the ex:reportedOn property is of type xsd:dateTime. ShEx supports the same set of W3C XML Schema datatypes as does SPARQL. Unlike SPARQL, ShEx validates the lexical representation of these datatypes, so the object of the ex:reportedOn property is tested against the XML Schema definition for dateTime.
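
To give a feel for what testing a lexical representation involves, here is a simplified Python sketch for xsd:dateTime. The pattern is an approximation written for this example, not the normative XML Schema definition: it does not check the number of days in each month and it rejects the special 24:00:00 time.

```python
import re

# Simplified lexical form: [-]YYYY-MM-DDThh:mm:ss[.fraction][Z|(+|-)hh:mm]
# Assumption: this approximates, but does not replace, the XML Schema rules.
DATETIME = re.compile(
    r"-?[0-9]{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"
    r"T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?"
    r"(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"
)


def is_datetime_lexical(lexical: str) -> bool:
    """True if the whole string matches the simplified dateTime pattern."""
    return DATETIME.fullmatch(lexical) is not None


print(is_datetime_lexical("2013-01-23T10:18:00"))  # True
print(is_datetime_lexical("2013-01-23"))           # False
```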

example issue (rule1, rule2)?

The fourth constraint above, ( ex:reproducedBy @<EmployeeShape>, ex:reproducedOn xsd:dateTime )?, is a GroupRule (defn). The '?' says that the cardinality is 0 or 1. Together they assert that there may be both an ex:reproducedBy and an ex:reproducedOn, or neither. The enclosed rules are ArcRules, similar to the ex:reportedBy and ex:reportedOn ArcRules above. The <EmployeeShape> shape expression is described below.

The last constraint above, ex:related @<IssueShape>*, is an example of a cyclic rule. The '*' says that there may be any number of related issues, including 0.
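
The "both or neither" reading of the optional group can be sketched in Python as below. The triple-set graph model and the function names are assumptions for illustration, and the sketch only counts arcs; it does not check that the objects conform to <EmployeeShape> or xsd:dateTime.

```python
# Assumed model: a graph is a set of (subject, predicate, object) string triples.
EX = "http://ex.example/#"


def count_arcs(graph, node, predicate):
    """Number of arcs leaving node with the given predicate."""
    return sum(1 for (s, p, _) in graph if s == node and p == predicate)


def reproduced_group_ok(graph, node):
    """The group matches 0 or 1 times: both arcs appear once, or neither appears."""
    by = count_arcs(graph, node, EX + "reproducedBy")
    on = count_arcs(graph, node, EX + "reproducedOn")
    return (by, on) in {(0, 0), (1, 1)}


issue = "http://base.example/#issue1"
complete = {
    (issue, EX + "reproducedBy", "http://base.example/#emp1"),
    (issue, EX + "reproducedOn", "2013-01-23T11:00:00"),
}
partial = {(issue, EX + "reproducedBy", "http://base.example/#emp1")}
print(reproduced_group_ok(complete, issue))  # True
print(reproduced_group_ok(partial, issue))   # False
print(reproduced_group_ok(set(), issue))     # True
```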


example <UserShape>

The following UserShape is referenced by the IssueShape above. Where the above labeled shape expression describes the data related to issues in some issue tracker, this captures the information about the users of that system.

<UserShape> {
    (foaf:name xsd:string
     | foaf:givenName xsd:string+,
       foaf:familyName xsd:string),
    foaf:mbox rdf:Resource
}

example user (rule1 | rule2)

The first constraint above, foaf:name xsd:string | foaf:givenName xsd:string+, foaf:familyName xsd:string, is an XorRule. The rule on the right side of the '|' is another conjunction. Together they assert that a user of the system has either a foaf:name or at least one foaf:givenName (+ means a cardinality of one or more) and exactly one foaf:familyName.

RDF 1.1 established that the datatype of a plain literal is xsd:string so either [] foaf:name "Bob Smith" . or [] foaf:name "Bob Smith"^^xsd:string . would match the first disjoint.

The second disjoint is a conjoint of foaf:givenName xsd:string+, foaf:familyName xsd:string. The foaf:givenName property has a cardinality of one or more so the following graph would match: [] foaf:givenName "Robert", "Bob", "Bobby", "Robbie" ; foaf:familyName "Smith".

A disjunction of n disjoints requires that the data match exactly one of the disjoints. A graph like [] foaf:name "Bob Smith" ; foaf:givenName "Bob" ; foaf:familyName "Smith" would be invalid because two disjoints match at once. This implies that interfaces generated from an XorRule offer a choice: in this example, to supply either a full name or one or more given names and a family name.
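
The exactly-one requirement can be sketched in Python as follows. The graph model (a set of string triples) and the function names are assumptions for this example; the sketch applies only the exactly-one test over the two disjoints and skips datatype checks.

```python
# Assumed model: a graph is a set of (subject, predicate, object) string triples.
FOAF = "http://foaf.example/#"


def values(graph, node, predicate):
    """Objects of all arcs leaving node with the given predicate."""
    return [o for (s, p, o) in graph if s == node and p == predicate]


def name_rule_ok(graph, node):
    """True if exactly one of the two disjoints matches the node."""
    names = values(graph, node, FOAF + "name")
    given = values(graph, node, FOAF + "givenName")
    family = values(graph, node, FOAF + "familyName")
    first = len(names) == 1                        # foaf:name xsd:string
    second = len(given) >= 1 and len(family) == 1  # givenName+, familyName
    return first != second                         # exclusive or


bob = "_:bob"
full = {(bob, FOAF + "name", "Bob Smith")}
parts = {(bob, FOAF + "givenName", "Bob"), (bob, FOAF + "familyName", "Smith")}
print(name_rule_ok(full, bob))          # True
print(name_rule_ok(parts, bob))         # True
print(name_rule_ok(full | parts, bob))  # False
```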

example user foaf:mbox rdf:Resource

The last constraint above, foaf:mbox rdf:Resource, uses a special type called rdf:Resource. Recall that ex:reportedOn and ex:reproducedOn specified that the object was a literal of type xsd:dateTime. A type of rdf:Resource means that the object is an IRI instead of a literal.


example <EmployeeShape>

The EmployeeShape below is referenced by the IssueShape above and included here for completeness. It does not introduce any new features of the language.

<EmployeeShape> {
    foaf:givenName xsd:string+,
    foaf:familyName xsd:string,
    foaf:phone rdf:Resource+,
    foaf:mbox rdf:Resource
}

Cardinality

The <IssueShape> example also includes a GroupRule with a cardinality of 0 or 1:

    ( ex:reproducedBy @<EmployeeShape>,
      ex:reproducedOn xsd:dateTime      )?

As explained above, this requires that the data have either both of those triples or neither.

Cardinality constraints may appear on ArcRules and GroupRules. They may be expressed as one of (?, +, *) or as one or two integers in {}s. If there is only one number in {}s, the minimum cardinality is that number and the maximum is unconstrained. An employee record which permitted from one to three given names would look like

    foaf:givenName xsd:string{1,3}
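
One way to picture cardinality markers is as a (minimum, maximum) range that the number of matches must fall in. The Python sketch below makes that mapping explicit; the function names are hypothetical, and only the ?, +, * and two-integer {m,n} forms are handled, the single-integer form being left out.

```python
import math
import re


def cardinality_range(marker: str):
    """Return (minimum, maximum) for a marker; '' stands for exactly one."""
    fixed = {"": (1, 1), "?": (0, 1), "+": (1, math.inf), "*": (0, math.inf)}
    if marker in fixed:
        return fixed[marker]
    match = re.fullmatch(r"\{([0-9]+),([0-9]+)\}", marker)
    if match is None:
        raise ValueError(f"unsupported cardinality marker: {marker!r}")
    return int(match.group(1)), int(match.group(2))


def count_ok(count: int, marker: str) -> bool:
    """True if the number of matches satisfies the cardinality marker."""
    low, high = cardinality_range(marker)
    return low <= count <= high


print(count_ok(2, "{1,3}"))  # True
print(count_ok(4, "{1,3}"))  # False
print(count_ok(0, "?"))      # True
```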

Inheritance and Inclusion

It is frequently useful to reuse or extend a shape. For instance, if both the <UserShape> and <EmployeeShape> permitted the same alternatives for specifying a name and email address, these could be factored into a separate shape called a <PersonShape>:

<PersonShape> {
    ( foaf:name xsd:string
      | foaf:givenName xsd:string+,
        foaf:familyName xsd:string
    ),
    foaf:mbox rdf:Resource
}

<UserShape> {
    & <PersonShape>
}

<EmployeeShape> {
    & <PersonShape>,
    foaf:phone rdf:Resource+
}

In this example, the <UserShape> provides no additional constraints beyond those of the included <PersonShape>.

We may have several derivatives of <PersonShape>, any of which could provide an <IssueShape>'s ex:reportedBy value. We can signify this by changing <IssueShape> to have ex:reportedBy @<PersonShape> and defining sub-shapes of <PersonShape>. This is done with an inclusion directive before the shape definition. We may not want the base <PersonShape> to satisfy any ValueReferences directly, instead accepting only derivatives of <PersonShape>. This is accomplished by labeling <PersonShape> VIRTUAL per the hierarchy example:

<IssueShape> {
    ex:state (ex:unassigned ex:assigned),
    ex:reportedBy @<PersonShape>
    # ...
}

VIRTUAL <PersonShape> {
    ( foaf:name xsd:string
      | foaf:givenName xsd:string+,
        foaf:familyName xsd:string
    ),
    foaf:mbox rdf:Resource
}

<UserShape> & <PersonShape> {
    # additional User properties
}

<EmployeeShape> & <PersonShape> {
    foaf:phone rdf:Resource+
    # additional Employee properties
}

Should there be a "greedy" directive to accept only the variant which touches the most triples? An alternative is to say that parent classes are "closed" in the sense that no other properties may appear on a subject matched by that shape.
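
The effect of VIRTUAL and of inclusion before a shape definition on ValueReferences can be pictured with a small Python sketch. The SHAPES registry and both functions are hypothetical constructs for this example; they model only which labels may satisfy a reference, not the matching of triples.

```python
# Hypothetical registry: label -> (is_virtual, labels of included parent shapes).
SHAPES = {
    "PersonShape": (True, []),
    "UserShape": (False, ["PersonShape"]),
    "EmployeeShape": (False, ["PersonShape"]),
}


def derives_from(label, ancestor):
    """True if label is ancestor or includes it, directly or transitively."""
    if label == ancestor:
        return True
    _, parents = SHAPES[label]
    return any(derives_from(parent, ancestor) for parent in parents)


def candidates(reference):
    """Non-virtual shapes that may satisfy a ValueReference to `reference`."""
    return sorted(
        label
        for label, (is_virtual, _) in SHAPES.items()
        if not is_virtual and derives_from(label, reference)
    )


print(candidates("PersonShape"))  # ['EmployeeShape', 'UserShape']
print(candidates("UserShape"))    # ['UserShape']
```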

Semantic Actions

The <IssueShape> example above includes both ex:reportedOn and ex:reproducedOn dateTimes. It would be reasonable in the interest of data quality to ensure that the ex:reproducedOn dateTime, if present, is later than the ex:reportedOn dateTime. While ShEx itself has no built-in functionality for comparing dateTimes, specific extensions may offer that functionality. The example below (failed semantic action validation) includes semantic actions to test date order in either JavaScript or SPARQL:

    ex:reportedOn xsd:dateTime
        %js{ report = _.o; return true; %},
    (ex:reproducedBy @<EmployeeShape>,
     ex:reproducedOn xsd:dateTime
        %js{ return _.o.lex > report.lex; %}
        %sparql{ ?s ex:reportedOn ?rpt . FILTER (?o > ?rpt) %}
    )
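
For comparison, the same date-order test can be written outside the schema. The Python sketch below is not a ShEx semantic action and the function name is hypothetical; it parses both values rather than comparing their lexical forms as the %js action does, and it assumes timestamps without a time zone.

```python
from datetime import datetime


def reproduced_after_reported(reported_on: str, reproduced_on: str) -> bool:
    """True if reproduced_on is strictly later than reported_on.

    Assumption: both arguments are ISO 8601 strings without a time zone,
    such as '2013-01-23T10:18:00'.
    """
    reported = datetime.fromisoformat(reported_on)
    reproduced = datetime.fromisoformat(reproduced_on)
    return reproduced > reported


print(reproduced_after_reported("2013-01-23T10:18:00", "2013-01-23T11:00:00"))  # True
print(reproduced_after_reported("2013-01-23T10:18:00", "2013-01-22T09:00:00"))  # False
```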

Semantic actions may also be used to generate schema-specific parsers or tools. Below is an excerpt of a tool that uses a DOM tree to translate the Issue example into an XML document:

<IssueShape> {
    ex:state (ex:unassigned ex:assigned)
        %js{ doc = _.createDocument('http://ex.example/xml', 'Issue', undefined);
             issue = doc.documentElement;
             issue.setAttribute('id', _.s.lex.substr(17));
             state = doc.createElementNS('http://ex.example/xml', 'state');
             state.textContent = _.o.lex.substr(17);
             issue.appendChild(state); %},
    ex:reportedBy @<UserShape>,
    …
} %js{ console.log(new XMLSerializer().serializeToString(doc)); %}

This example relies on a particular invocation for semantic actions, but illustrates the power of the extensibility mechanism.

Language Summary

The IssueShape tutorial above is oriented towards a particular use case where the schema will use a very explicit set of predicates and accept no others. Shape Expressions is also useful for describing interfaces or graph patterns where any predicates are allowed except those in controlled namespaces. For example, some systems like Annotea reserved the assertion of dc:creator arcs for the system to maintain provenance information. The language summary below includes language features to describe such an interface.

feature         example                               description

Matching a Predicate to a NameClass
pfIRI           ex:state                              The predicate of any matching triple is the same as the pfIRI IRI.
pfIRI           ex:~                                  The predicate of any matching triple starts with the IRI.
pfWild          . - rdf:type - ex:~                   A matching triple has any predicate except those terms or stems excluded by the '-' operator.

Matching an Object to a ValueClass
ValueType       xsd:dateTime                          The object of any matching triple has the type identified by the ValueType IRI.
ValueSet        (ex:unassigned ex:assigned)           The object of any matching triple is one of the terms listed in the ValueSet.
ValueStem       ex:~                                  The object of any matching triple starts with the IRI.
ValueWild       . - rdf:type - ex:~                   A matching triple has any object except those terms or stems excluded by the '-' operator.
ValueReference  @<UserShape>                          The object of a matching triple is an IRI or blank node, and that node is the subject of triples matching the referenced shape expression.

Rule Types
ArcRule         foaf:givenName xsd:string+            A matching triple matches the pfIRI and the ValueTerm. Cardinality constraints apply.
AndRule         foaf:givenName xsd:string,            Each conjoint matches the input graph.
                foaf:familyName xsd:string
XorRule         foaf:givenName xsd:string             Exactly one disjoint matches the input graph.
                | foaf:name xsd:string
GroupRule       (ex:reproducedBy @<EmployeeShape>,    A matching triple matches the enclosed rule (here an AndRule). Cardinality constraints apply.
                 ex:reproducedOn xsd:dateTime)

Cardinality
?               foaf:givenName xsd:string?            rule must match 0 or 1 times.
+               foaf:givenName xsd:string+            rule must match 1 or more times.
*               foaf:givenName xsd:string*            rule must match 0 or more times.
{m}             foaf:givenName xsd:string{3}          rule must match m times.
{m,n}           foaf:givenName xsd:string{3,5}        rule must match at least m times and no more than n times.

Cardinality constraints may appear after an ArcRule. A '?' may also appear after a GroupRule to indicate that it is optional. Any AndRule nested immediately inside the GroupRule must have every rule match or no rule match.

Rule Inclusions
&RuleName       & <PersonShape>                       Include the referenced rule in place of the include directive.

Rule Inclusions may appear before a shape definition or inside of a definition. Before a shape definition, they signify the inclusion of the referenced rule ("included rule") at the beginning of the one being defined, as well as asserting that ValueReferences to the included rule accept the defined shape as well.

Semantic Actions
%lang{ code %}  %js{ return _.o.lex > report.lex; %}  Invoke semantic actions when a rule is satisfied.
                %sparql{ ?s ex:reportedOn ?rpt . FILTER (?o > ?rpt) %}

Semantic Actions may appear after an ArcRule, a GroupRule or a named Shape Expression. When used with validation, they are invoked only on valid pairs of a triple and a rule. Their use for interface validation is currently undefined.
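
The wildcard entries in the summary (pfWild and ValueWild) combine "anything" with exclusions of exact terms and of stems. A minimal Python sketch of that test, using hypothetical function names and plain string IRIs, might look like this:

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://ex.example/#"


def matches_wildcard(iri, excluded_terms=(), excluded_stems=()):
    """True unless iri equals an excluded term or starts with an excluded stem."""
    if iri in excluded_terms:
        return False
    return not any(iri.startswith(stem) for stem in excluded_stems)


def example_wildcard(iri):
    """Corresponds to the summary's example: . - rdf:type - ex:~"""
    return matches_wildcard(iri, excluded_terms={RDF + "type"}, excluded_stems={EX})


print(example_wildcard("http://foaf.example/#name"))  # True
print(example_wildcard(RDF + "type"))                 # False
print(example_wildcard(EX + "state"))                 # False
```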