RDF Validation Workshop Report

Practical Assurances for Quality RDF Data

10-11 September 2013, Cambridge, MA, USA


Host

W3C/MIT

Executive Summary

The RDF Validation Workshop drew 27 participants from industry, government and academia. There was consensus on the need for

  1. Declarative definition of the structure of a graph for validation and description.
  2. Extensibility to address specialized use cases.
  3. A mechanism to associate descriptions with data.

The presentations covered many topics, ranging from use cases and user experiences on various projects to specific approaches and technologies for validating RDF data against a set of integrity constraints and for communicating those constraints so that a client can discover what a service expects.

The participants agreed that there is currently no standard addressing these needs and the W3C should do something about it. The majority of participants agreed that the W3C should create a Working Group to develop a Recommendation that provides a declarative way of expressing a set of basic integrity constraints governing an RDF graph with an extension mechanism to handle more complex scenarios.

IBM indicated that they would submit a specification to W3C to be used as input to this work.

A public mailing list public-rdf-shapes@w3.org (archive) was created to provide a forum to develop a charter for the WG to be proposed. (Participants often used the term "shape" to refer to the set of constraints governing an RDF resource.)

Workshop Overview

Format

17 submissions were received and presented at the workshop, either in person or remotely, some at length and some briefly. The workshop was attended by 18 registered participants and 9 guests, including 2 Team members (see their introductions).

The presentations were organized in three general themes:

  • Use Cases and Requirements
  • User Experience
  • Tools and Technologies

Most of the first day and the morning of the second day were allocated to presentations and Q&A, while the rest of the first day and the afternoon of the second day were dedicated to discussions.

Workshop Day 1

After a round of introductions, Eric Prud'hommeaux started the day (minutes) with a short presentation on the state of the art and the intent of the workshop. There was general agreement on the problem space.

The presentation sessions started with Mark Harrison (minutes) who presented the needs and challenges associated with validating RDF data produced by millions of providers related to products and services in the supply chain.

The second presentation, by Harold Solbrig of the Mayo Clinic (minutes), began by describing experience with the PTXT system as used by Intermountain Health Care in the 1990s. The system was based on collections of statements akin to RDF triples, and one of the major issues was what one could and could not expect to find in any given collection. Solbrig then presented a proposal for an RDF "Store": an RDF graph with an identity, declared invariants, an update function, and a set of transition rules. Requirements for this graph included:

  1. Standard syntax and semantics for Invariants
  2. Ability to publish and discover rules in registries
  3. Ability to describe transition rules as preconditions and postconditions
  4. A way of handling invariant and rule evolution (future)

Solbrig then outlined a set of additional requirements that, while not absolutely needed, should be considered by the working group.

Miguel Esteban-Gutiérrez (minutes) brought new light to the problem of RDF validation by describing the various aspects that should be considered to fully address the needs of enterprise application integration. Miguel explained that in this context validation goes beyond traditional structure and data-range checks. The validation process needs to take into account several factors related to the data source, the procedure, and the context, including temporal and operational aspects. Miguel suggested that only a customizable validation process could accommodate the particularities of each scenario.

Paul Davidson then kicked off a series of short presentations with a video recording (minutes) in which he described the challenges faced by the many organizations related to the UK government in trying to build a coherent ecosystem of data. Paul explained that ontologies are useful but not sufficient because they do not tell you which properties to use. He suggested defining something like Linked Data Profiles: a way to describe the shape of the data, specifying the classes and properties to use, so that data can be exposed in a consistent manner by the various organizations. Paul called for a simple solution that is usable by all, not just by experts.

Roger Menday presented some background on the Linked Data Platform (LDP) and a vocabulary, f:parameterSet, used at Fujitsu to enable robots to fill in forms (minutes).

Antoine Isaac presented on the requirements on validating RDF in the context of the Europeana project and from the point of view of a vocabulary owner (minutes).

David Booth shared his thoughts on RDF validation in the context of Healthcare Data (minutes).

Dave Reynolds reported on his experience with RDF validation in the context of data.gov.uk (minutes). Dave described the need for an interoperable way of specifying the shape of the data that publishers must produce and consumers can expect, one that is declarative, "accessible to mortals", and usable to automatically generate forms. He described his experience validating RDF Data Cubes and reported that while SPARQL's expressivity was found sufficient, some queries were difficult to write and inefficient to run. He suggested that a useful compromise would be a Data Structure Definition that both tools and humans can understand, complemented by SPARQL, which is not as easy to inspect and understand.
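
For illustration, a Data Structure Definition of the kind Dave referred to is itself just RDF: a small Turtle document that both tools and humans can read. The ex: dimension and measure names below are invented for the example, not taken from data.gov.uk:

  @prefix qb: <http://purl.org/linked-data/cube#> .
  @prefix ex: <http://example.org/> .

  # Declares that each observation in the cube carries two dimensions
  # and one measure; a validator can check observations against this.
  ex:populationDSD a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:refArea ] ,
                 [ qb:dimension ex:refPeriod ] ,
                 [ qb:measure   ex:population ] .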

The morning session ended with a discussion (minutes) focused on the types of requirements the participants thought a solution in this space ought to address. This discussion highlighted the need for declarative as well as procedural capabilities, and the need not only to validate RDF but also to describe what data a service expects.

The afternoon started with Guoqian Jiang (minutes). Dr. Jiang described a number of overlapping initiatives to enable secondary use of clinical data, and the normalization and validation steps needed to harmonize the content for query and analysis. He then presented an architecture that included RDF- and SPIN-based models, templates, and constraints that enabled the merger of clinical data from disparate sources into an RDF-based triple store. He described a general need for a mechanism that would allow UML/OCL and XML/XML Schema to be merged into a collection of SWRL/SPARQL rules built on an RDF back end.
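
As a hedged sketch of the SPIN-based style mentioned above (the clinical class and property are hypothetical), a constraint is a SPARQL query attached to a class; instances for which the ASK body holds are reported as violations:

  @prefix spin: <http://spinrdf.org/spin#> .
  @prefix sp:   <http://spinrdf.org/sp#> .
  @prefix ex:   <http://example.org/clinical#> .

  # ?this is bound to each instance of ex:LabResult in turn;
  # a true result flags that instance as violating the constraint.
  ex:LabResult spin:constraint [
    a sp:Ask ;
    sp:text """
      ASK WHERE {
        ?this ex:value ?v .
        FILTER (?v < 0)
      }"""
  ] .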

Shawn Simister continued (minutes) with an explanation of Google's need to communicate to users what type of RDF data their service will recognize. He described their current system, in which app developers add paths to a database of tests.

Tim Cole ended the series of presentations for the day (minutes) with a report on the W3C Open Annotation Community Group's experience in using SPARQL to validate their RDF graphs. Tim presented the Open Annotation data model and ontology, as well as their approach to asserting conformance based on a combination of a precondition and query that must pass. Their experience revealed a need for the ability to specify different levels of severity (e.g., warning vs error) as well as for extensibility to provide for an evolution of the ontology.
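
A conformance check in this style can be a plain SPARQL query over the graph. The following sketch flags annotations that lack a target, so the data conforms when the query returns false; the Open Annotation terms are real, but the specific constraint is illustrative rather than taken from the group's test suite:

  PREFIX oa: <http://www.w3.org/ns/oa#>

  # Precondition: the node is an oa:Annotation.
  # Failure condition: it has no oa:hasTarget.
  ASK WHERE {
    ?ann a oa:Annotation .
    FILTER NOT EXISTS { ?ann oa:hasTarget ?target }
  }

Tagging each such query with a severity level (warning or error) would address the reporting need Tim identified.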

The rest of the first day was spent capturing, on a shared online pad, a first set of requirements the participants deemed important. After a free-for-all period, the requirements were sorted and consolidated, although no real attempt was made to prioritize them.

After the meeting closed, many of the participants joined the Semantic Web Gathering that was taking place next door, where Eric Prud'hommeaux presented his work on Shape Expressions, an RDF grammar/validation language with a syntax similar to the RELAX NG compact syntax. Some in the audience pushed back on the need for RDF validation at all.
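
To give a flavor of the approach, a shape reads much like a RELAX NG pattern over triples. The schema below is invented for illustration and written in the compact syntax roughly as it later stabilized; the exact syntax was still evolving at the time of the workshop:

  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

  # A node matching <UserShape> needs exactly one foaf:name that is an
  # xsd:string and at least one foaf:mbox whose value is an IRI.
  <UserShape> {
    foaf:name xsd:string ;
    foaf:mbox IRI+
  }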

Workshop Day 2

Day 2 started with a presentation from Jose Emilio Labra Gayo (minutes) on the experience of using SPARQL to validate data using the RDF Data Cube vocabulary in the WEBINDEX project. Raw and computed data were transformed from Excel spreadsheets into an RDF Datastore and used for data visualization. Transformed data was annotated with metadata about its origin, applied formula, MD5 checksum of the original source, etc. Validation used CONSTRUCT rather than ASK queries to allow the general "CONSTRUCT (error message) WHERE (fail condition)" approach. Validation could involve complex statistical computations such as average, mean, etc., which depended on functions that may not be universally available. Using the WebIndex as the use case, SPARQL was proposed as an implementation language (with caveats about complexity/expressivity) accompanied by a declarative RDF profile as the intermediate language between OWL and SPARQL.
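
The pattern is easy to sketch (the error vocabulary below is hypothetical): the WHERE clause encodes the failure condition, and the CONSTRUCT template builds a self-describing error report, so a passing dataset yields an empty result graph:

  PREFIX qb: <http://purl.org/linked-data/cube#>
  PREFIX ex: <http://example.org/errors#>

  # One ex:ValidationError resource is built per failing observation.
  CONSTRUCT {
    [] a ex:ValidationError ;
       ex:message "Observation is missing a qb:dataSet link" ;
       ex:offendingNode ?obs .
  }
  WHERE {
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?ds }
  }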

Then Evren Sirin (minutes) identified the different types of integrity constraints and validation one needs to consider with regard to RDF data, and the differences between validation and reasoning. Evren presented the solution adopted in the Stardog RDF database, which is based on reusing the OWL syntax with a different semantics designed for validation rather than reasoning. Evren ended with a list of things to consider when trying to solve the problem at hand. He suggested providing a simple syntax for the most common use cases, with SPARQL as a fallback mechanism for more complex tasks.
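
A minimal sketch of that idea, assuming illustrative ex: names: the axiom below is ordinary OWL, but read with validation semantics it demands that every product carry an explicit GTIN value, rather than merely licensing the inference that an unknown one exists:

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <http://example.org/> .

  # Under reasoning semantics this axiom never fails; under closed-world
  # validation semantics it rejects any ex:Product with no stated ex:gtin.
  ex:Product rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ex:gtin ;
    owl:minCardinality "1"^^xsd:nonNegativeInteger
  ] .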

Martin Stolpe (minutes) presented a model for a different aspect of validation: validation of incoming data based on the content of an existing store. Examples included constraints disallowing new superclasses, restricting foaf:knows relationships to only be between new people, write-protecting sections of the database, etc. Stolpe presented a formal model for the Boundz vocabulary and an example use case applied to the BBC music dataset.

Arthur Ryman (minutes) explained IBM's use of Linked Data as a way to integrate applications and the need in this context to not only validate data but also to communicate the structure and constraints governing the data applications can send and receive. Arthur presented OSLC Resource Shape: an RDF vocabulary which provides a fairly limited but simple declarative solution to this problem.
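
The vocabulary itself gives a feel for that simplicity: a shape is a set of property descriptions with names, definitions, cardinalities, and value types. The property terms below are from OSLC Core; the ex: resource is a made-up example:

  @prefix oslc: <http://open-services.net/ns/core#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <http://example.org/> .

  # Every ex:BugReport must carry exactly one string-valued ex:title.
  ex:BugReportShape a oslc:ResourceShape ;
    oslc:describes ex:BugReport ;
    oslc:property [
      a oslc:Property ;
      oslc:name "title" ;
      oslc:propertyDefinition ex:title ;
      oslc:occurs oslc:Exactly-one ;
      oslc:valueType xsd:string
    ] .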

Then Thomas Baker (minutes) presented Dublin Core's Application Profiles, which are designed to produce editors and validators for use by non-expert users. Application Profiles offer expressivity similar to that of Resource Shapes and are used to communicate interface definitions.

Finally, Noah Mendelsohn (minutes) concluded the presentation section of the workshop with a talk on lessons learned from the development of the XML Schema language. He described the decisions facing a group working on validation/description, including document boundaries, expressivity, and reporting (e.g., the PSVI, the post-schema-validation infoset).

The rest of the day was spent further discussing requirements and use cases (minutes), which were again captured on a shared online pad, and discussing what the participants thought the W3C ought to do next. 13 participants expressed interest in participating in a Working Group.

Results

The workshop made it clear that there is a gap in the current standards offering and that the industry needs a standard way of dealing with validation of RDF data. The term "validation" does not, however, effectively capture the full scope of the problem. In addition to being able to validate data, the workshop revealed the need to communicate the constraints against which data is to be validated in a way that is both easy for human beings to understand and discoverable by programs. For this reason, while SPARQL plays a prominent role in how people tackle the validation problem today, participants agreed that SPARQL does not constitute a complete solution. Constraint checking can be performed quite effectively using SPARQL, but SPARQL queries cannot easily be inspected and understood, either by human beings or by machines, to uncover the constraints that are to be respected. The term "shape" emerged as a popular label for these constraints.

Scope of Future Work

The participants agreed that the W3C should launch an activity to develop a human and machine-readable description of the "shape" of the RDF graphs that a service produces or consumes. This description should be usable for validation, form-generation, as well as human-readable documentation. The participants further agreed that the solution must provide a declarative way of describing simple integrity constraints along with an extension mechanism that allows using technologies such as SPARQL to specify more complex constraints.

While some participants, affiliated with organizations that are not W3C members, expressed a preference for a Community Group, the majority of participants supported the idea of having this activity take place within a W3C Working Group which would produce a Recommendation.

13 participants indicated an intent to participate in such a WG if one were to be created.

A set of use cases and requirements was identified.

Recommended Next Steps

Parties interested in submitting material for consideration are encouraged to formally submit their material through the W3C Member submission process to clear up any possible IP encumbrance. IBM indicated that they intend to work on such a submission based on OSLC Resource Shape.

A WG charter will be developed to be put before the W3C management and membership.

The public mailing list public-rdf-shapes@w3.org will serve as a forum for discussion of a draft charter.

Acknowledgements

The RDF Validation workshop was hosted by W3C/MIT which provided facilities, refreshments and meals for both days of the workshop. The organizers and W3C are grateful for the help of Amy van der Hiel and Eric Prud'hommeaux who made all this possible by organizing the workshop logistics.

The workshop was co-chaired by Arnaud Le Hors and Harold Solbrig, assisted by Eric Prud'hommeaux. Thanks to all the participants for their dedication and for volunteering their time to further RDF and Linked Data.