On RDF Validation, Stardog ICV, and Assorted Remarks

Kendall Clark & Evren Sirin, Clark & Parsia LLC

30 June 2013

We begin with a brief overview of the existing RDF validation landscape. Then we discuss Stardog's Integrity Constraint Validation in more detail. We conclude with some general considerations about what an RDF validation spec should and shouldn't do.

Existing Systems: SPIN, IBM Resource Shapes, Stardog ICV

There are three primary systems to consider: SPIN, IBM's Resource Shapes, and Stardog ICV. Each of these systems can be used to validate RDF (i.e., Linked Data). SPIN works with TopBraid's toolchain; our ICV works with Stardog, our RDF database, and with any other system that can evaluate SPARQL queries; and IBM's Resource Shapes works (or will work) with parts of IBM's Rational suite of OSLC tools.

The differences between these three RDF validation tools are largely superficial. The most obvious one, from a user's point of view, is surface syntax: the syntax used to write the constraints.

  1. IBM Resource Shapes is a grammar approach: users write RDF triples using the Resource Shapes grammar or vocabulary to define constraints against RDF data to be executed by systems that support Resource Shapes.
  2. SPIN uses SPARQL for its syntax (plus some tool support). Users write SPARQL queries to define constraints against RDF data to be executed by systems that support SPIN.
  3. Stardog ICV is a polyglot approach. Users write SPARQL queries, OWL axioms, or SWRL rules (or a mix of all three) to define constraints against RDF data to be executed by Stardog. Stardog translates these constraints, whatever their source syntax, into equivalent SPARQL queries that any system able to evaluate SPARQL can execute.

There are other, older systems that do RDF validation but they are (at this point) of primarily historical significance.

Stardog ICV

Stardog can provide ICV services even for other RDF databases that don't support ICV natively by converting users' constraints into SPARQL queries (normal, ordinary SPARQL queries) that other RDF databases may evaluate.

Why High-level Languages?

Using high-level languages to represent RDF validation constraints is largely about concision and abstraction. Constraint languages should be syntactically expressive and (to the greatest degree feasible, technically) independent of or abstracted from graph details. Consider some examples from the Stardog ICV documentation which use Manchester OWL syntax to great effect:

Class: Employee
    SubClassOf: works_on some Project or 
                supervises some (Employee and works_on some Project) or
                manages some Department

Translating this constraint into natural language:

Each employee either works on at least one project, supervises at least one employee that works 
on at least one project, or manages at least one department.
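To make the triples-level reading of this constraint concrete, here is a minimal sketch in Python that checks it directly against a set of triples. It is an illustration under stated assumptions, not how Stardog ICV evaluates constraints: the graph is a plain set of (subject, predicate, object) tuples, and the helper names (objects, is_a, employee_violations) and the sample individuals are hypothetical.

```python
# Triples are plain (subject, predicate, object) tuples; all names are illustrative.
TYPE = "rdf:type"


def objects(graph, s, p):
    """Return every o such that (s, p, o) is in the graph."""
    return {o for (s2, p2, o) in graph if s2 == s and p2 == p}


def is_a(graph, node, cls):
    """True if the graph explicitly types node as cls."""
    return (node, TYPE, cls) in graph


def works_on_project(graph, node):
    """True if node works on at least one thing typed as a Project."""
    return any(is_a(graph, o, "Project") for o in objects(graph, node, "works_on"))


def employee_violations(graph):
    """Employees that satisfy none of the three disjuncts of the constraint."""
    violations = set()
    for (s, p, o) in graph:
        if p != TYPE or o != "Employee":
            continue
        supervises_worker = any(
            is_a(graph, e, "Employee") and works_on_project(graph, e)
            for e in objects(graph, s, "supervises")
        )
        manages_department = any(
            is_a(graph, d, "Department") for d in objects(graph, s, "manages")
        )
        if not (works_on_project(graph, s) or supervises_worker or manages_department):
            violations.add(s)
    return violations


graph = {
    ("alice", TYPE, "Employee"), ("alice", "works_on", "p1"), ("p1", TYPE, "Project"),
    ("bob", TYPE, "Employee"), ("bob", "supervises", "alice"),
    ("carol", TYPE, "Employee"), ("carol", "manages", "sales"), ("sales", TYPE, "Department"),
    ("dave", TYPE, "Employee"),
}
print(sorted(employee_violations(graph)))  # ['dave']
```

Even for this small constraint the hand-written check needs three helpers and a nested existential test, which is the kind of detail the four-line OWL version hides.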

Because of the complexity of the RDF triples-level representation, a low-level syntax is necessarily messy with respect to those triples; that is, it fails to shield users from that messiness.

Consider some further examples:

The manager of a department must work for that department.

In a high-level syntax like OWL's Manchester syntax, this becomes:

manages subPropertyOf worksFor

It's hard to imagine anything simpler. But in all fairness the equivalent SPARQL isn't difficult, particularly for people who already know SPARQL, whom we anticipate to be the primary users of an RDF validation technology.

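# Assumes the default prefix ":" is bound to the vocabulary's namespace.
# Each result is a violation: a manager who does not work for the department.
SELECT * WHERE {
   ?x :manages ?y .
   FILTER NOT EXISTS {
      ?x :worksFor ?y .
   }
}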

The OWL version is obviously much shorter than the SPARQL version. We expect an equivalent SPARQL encoding to be more verbose in general, but in this case it is still easy to read and understand.

Let's look at a more complex example.

If a project is funded by only internal funding sources then it should be approved by 
the internal budget office.

In OWL as interpreted by Stardog ICV that becomes

Project and fundedBy only InternalFundingSource subClassOf (approvedBy value InternalBudgetOffice)

And the same constraint in SPARQL:

SELECT * WHERE {
   ?x a :Project .
   FILTER NOT EXISTS {
      ?x :fundedBy ?y .
      FILTER NOT EXISTS {
         ?y a :InternalFundingSource .
      }
   }
   FILTER NOT EXISTS {
      ?x :approvedBy :InternalBudgetOffice .
   }
}
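The nested negation is a universal quantifier in disguise: "there is no funding source that is not internal" means "every funding source is internal". A minimal sketch in Python makes that reading explicit. It assumes the graph is a plain set of (subject, predicate, object) tuples; the function name project_violations and the sample data are hypothetical and are not part of Stardog ICV.

```python
# Triples are plain (subject, predicate, object) tuples; all names are illustrative.
TYPE = "rdf:type"


def project_violations(graph):
    """Projects funded only by internal sources that lack budget-office approval."""
    violations = set()
    projects = {s for (s, p, o) in graph if p == TYPE and o == "Project"}
    for x in projects:
        funders = {o for (s, p, o) in graph if s == x and p == "fundedBy"}
        # The two nested FILTER NOT EXISTS clauses collapse into one all():
        # no funder exists that is not typed as an InternalFundingSource.
        only_internal = all((y, TYPE, "InternalFundingSource") in graph for y in funders)
        approved = (x, "approvedBy", "InternalBudgetOffice") in graph
        if only_internal and not approved:
            violations.add(x)
    return violations


graph = {
    ("p1", TYPE, "Project"), ("p1", "fundedBy", "f1"),
    ("p1", "approvedBy", "InternalBudgetOffice"),
    ("p2", TYPE, "Project"), ("p2", "fundedBy", "f1"),
    ("p3", TYPE, "Project"), ("p3", "fundedBy", "grant"),
    ("p4", TYPE, "Project"),
    ("f1", TYPE, "InternalFundingSource"),
}
print(sorted(project_violations(graph)))  # ['p2', 'p4']
```

Note that p4, which has no funding sources at all, is reported: "only internal funding sources" holds vacuously for it, exactly as it does in the SPARQL encoding.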

Now the SPARQL version is harder to understand as we have nested negations whereas the OWL version is very close to the natural language rendering. Admittedly some of the terms here are a bit artificial, but not in a way that makes much difference in the different encodings.

Reasoning

We think RDF validation makes the most sense conceptually when it is orthogonal to reasoning: constraints may be applied either to an explicit RDF graph or to an RDF graph under the semantics of a SPARQL 1.1 entailment regime. That's how Stardog ICV works: an explicit triple or triples may violate (or satisfy) one or more constraints; likewise, an inferred (that is, implicit) triple or triples may violate (or satisfy) one or more constraints.

But note, too, that this issue is orthogonal to which syntax or syntaxes are used to represent the constraints themselves. Stardog ICV works with SPARQL 1.1 entailment regimes whether the constraints themselves are in OWL, SWRL, or SPARQL.
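A small sketch in Python shows why the choice of graph matters. The same constraint check is run twice, once over the explicit triples and once over the triples after a tiny slice of RDFS inference (rdfs:subPropertyOf only). Everything here is an assumption made for illustration: the schema mapping, the memberOf property, and the function names are hypothetical, and the materialisation loop stands in for a real entailment regime.

```python
def infer_subproperties(graph, sub_property_of):
    """Materialise rdfs:subPropertyOf entailments until nothing new is added."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(inferred):
            for sup in sub_property_of.get(p, ()):
                if (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


def manager_violations(graph):
    """Pairs (x, y) where x manages y but does not work for y."""
    return {
        (s, o) for (s, p, o) in graph
        if p == "manages" and (s, "worksFor", o) not in graph
    }


explicit = {("alice", "manages", "sales"), ("alice", "memberOf", "sales")}
schema = {"memberOf": ["worksFor"]}  # memberOf rdfs:subPropertyOf worksFor

print(manager_violations(explicit))                               # {('alice', 'sales')}
print(manager_violations(infer_subproperties(explicit, schema)))  # set()
```

The constraint check itself is identical in both calls; only the graph it is evaluated against changes.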

Why Polyglot?

We think a polyglot approach makes the most sense because

  1. internally in Stardog ICV every constraint gets turned into a SPARQL query, so the surface syntax is strictly for usability
  2. there are more things in heaven and earth than are dreamt of in our use case and requirements documents; which is to say that it's not easy to predict usability results and several surface syntaxes have obvious appeal in different cases
  3. this issue comes down to what is the "unit of exchange": the constraints themselves or the resulting SPARQL queries. Both? Neither?

Of course in a W3C spec one of the cost drivers would be multiple surface syntaxes. While we don't require multiple syntaxes, we do very much support the idea that multiple syntaxes be permitted (even if not specified) in the sense that the resulting SPARQL translation is the canonical representation of constraints from the point of view of execution and exchange.
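To illustrate what treating the SPARQL translation as the unit of exchange could look like, here is a toy translator in Python for a single constraint form. It is a sketch only, not Stardog's translator: the textual input format, the function names, and the use of ":"-prefixed names are assumptions made for the example.

```python
def parse_constraint(text):
    """Parse '<sub> subPropertyOf <sup>'; any other form is rejected."""
    parts = text.split()
    if len(parts) != 3 or parts[1] != "subPropertyOf":
        raise ValueError(f"unsupported constraint: {text!r}")
    return parts[0], parts[2]


def subproperty_constraint_to_sparql(sub, sup):
    """Render the constraint as a SPARQL query whose results are the violations."""
    return (
        "SELECT * WHERE {\n"
        f"   ?x {sub} ?y .\n"
        "   FILTER NOT EXISTS {\n"
        f"      ?x {sup} ?y .\n"
        "   }\n"
        "}"
    )


query = subproperty_constraint_to_sparql(
    *parse_constraint(":manages subPropertyOf :worksFor")
)
print(query)
```

The surface syntax disappears after translation: a system that receives only the resulting query can evaluate the constraint without knowing which syntax it was written in.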

ICV History

This section is intended to establish the maturity of ICV as an approach to RDF validation. As such, you may skip it with no great loss.

A few words about Stardog ICV's history. We described the idea in a research proposal to NIST, which they funded, in early 2008. That was the culmination of about 18 months of behind-the-scenes conversations in the OWL research community about how to do RDF validation. At that early stage, we were already focused on how to re-use OWL syntax to provide a high-level constraint language, an approach we eventually generalized to SPARQL and SWRL syntaxes, too.

The earliest published (peer reviewed, no less) description of this work from us came at OWLED 2008: "Opening, Closing Worlds: On Integrity Constraints".

We delivered the first prototype to NIST in early 2009; that prototype was based on the SPARQL query engine in Pellet. So the ICV work that's in Stardog now is based on work that was done before Stardog development even started. Sometimes research to market is a series of long lines between vague dots.

We released the first version of ICV integrated with Stardog in 2011 and have been working on extending it since then, including the ability to explain ICV results automatically. That explanation work is ongoing today as we're working on automated repair plans for ICV violations. That means RDF validation in Stardog ICV isn't merely a system that tells a user that data is wrong in some way, but tells users why it's wrong and what they can do to repair it.

What Matters and What Doesn't

Things we care about with respect to a future standard in this area:

From the user's point of view, you can

  1. write constraints in SPARQL using SPIN
  2. write constraints in an RDF vocabulary using Resource Shapes
  3. write constraints in OWL, SWRL, or SPARQL using Stardog ICV

Some constraints are easier to write in one syntax than in the others. There isn't any particular reason to force users to use one and only one syntax for writing all constraints since the only reasonable basis of interoperability is SPARQL queries. The expressivity of RDF validation should be precisely the expressivity of SPARQL query evaluation against RDF data (including, optionally, as SPARQL 1.1 does, entailment regimes). No more, no less. By and large, it will be RDF databases that provide RDF validation services and the lingua franca of RDF databases is SPARQL: not nested for-loops in Jena, or Sesame SAILs, or OWL axioms, or SWRL rules, or RDF vocabularies. An RDF validation spec should use SPARQL (SPARQL 1.1) queries as the basis of interoperability and exchange and as many surface syntaxes as the market cares to support.

Depending on how the market turns and how the W3C takes up these matters in a future standardization effort, Stardog will add support for the standard. In fact Stardog is very likely to support any constraint syntax that can be efficiently translated into legal, valid SPARQL because life is too short to obsess about syntax.

An RDF validation spec should not focus on

  1. how constraints are evaluated because we already have a spec for that (SPARQL 1.1)
  2. non-structural, non-logical graph minutiae; the danger here is the same as a winter invasion of Russia: getting hopelessly bogged down without a chance to win. (Canonical example: constraints on lexical details of edge or node labels.) In other words, there are some meta-syntactic things that people want to constrain. We think that should be handled out of band.