The provenance data model (third working draft)

The Provenance Working Group began its activities with a charter naming some 17 concepts relevant to provenance, such as resource, process execution, use, derivation, and version.

For the first three months, leading up to our first face-to-face meeting, we debated definitions for these concepts. Importantly for the social cohesion of the group, we developed a common vocabulary that members could use to communicate.

Following the first face-to-face meeting, editors were tasked with producing a concrete document, against which the group could formally raise issues and make concrete proposals. In October, this document was released as a first public working draft. We were aware of its limitations, but it served an important purpose: it set the direction and scope of the model we were proposing to standardize.

Since then, the group has worked hard at rationalising the concepts of the PROV data model. Key highlights include:

  • introduction of the notion of responsibility, which may be assigned to agents for the activities they participated in
  • a better characterisation of derivation, which represents, for example, the transformation of a raw data set into linked data
  • the ability of the model to track how collections of data evolve
  • a relation which expresses that two different descriptions relate in some way to the same thing in the world
  • definition of a set of constraints, which allow humans and reasoners to determine whether a set of provenance assertions makes sense
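To give a feel for what such a constraint check might look like, here is a minimal sketch in Python. It is not taken from the draft: the class names, the field names and the two rules checked (an activity must not end before it starts, and a use must fall within the activity's interval) are illustrative assumptions chosen for this post, not the model's actual constraint set.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Activity:
    # Hypothetical record for an activity; field names are assumptions.
    id: str
    started: datetime
    ended: datetime


@dataclass
class Usage:
    # Hypothetical record stating that an activity used an entity at a time.
    activity: Activity
    entity_id: str
    time: datetime


def check_assertions(activities: List[Activity], usages: List[Usage]) -> List[str]:
    """Return a list of human-readable problems; an empty list means
    the assertions pass the two illustrative rules."""
    problems = []
    for a in activities:
        # Illustrative rule 1: an activity cannot end before it starts.
        if a.started > a.ended:
            problems.append(f"{a.id}: starts after it ends")
    for u in usages:
        a = u.activity
        # Illustrative rule 2: a use must happen while the activity runs.
        if not (a.started <= u.time <= a.ended):
            problems.append(f"{u.entity_id}: used outside {a.id}")
    return problems


# Made-up example data: converting a raw data set between 09:00 and 10:00.
convert = Activity("convert", datetime(2012, 1, 1, 9), datetime(2012, 1, 1, 10))
during = Usage(convert, "raw-data", datetime(2012, 1, 1, 9, 30))
after = Usage(convert, "raw-data", datetime(2012, 1, 1, 11))

consistent = check_assertions([convert], [during])
inconsistent = check_assertions([convert], [after])
```

A reasoner would apply many more rules than these two, but the shape is the same: take a set of assertions and report which of them cannot all hold together.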

The third working draft includes these changes. We feel that the data model has reached some level of stability and that, from now on, any release should be synchronised with the PROV ontology definition and the PROV primer.

At our second face-to-face meeting, we debated intensively what the identifiers of the model denote. A challenge one faces with provenance (as with any form of metadata) is that it may no longer be valid once its subject changes. To make provenance assertions robust, a partial state of the subject has to be characterised in terms of time and attributes, and provenance expressed for that partial state.
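The idea of a partial state can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class name `CharacterisedEntity`, its fields and the `covers` method are hypothetical and do not come from the data model; they simply show a subject pinned down by a time interval and a set of fixed attribute values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping


@dataclass
class CharacterisedEntity:
    # Hypothetical structure: a subject plus the partial state
    # (attributes and time interval) that provenance is expressed for.
    subject: str
    attributes: Mapping[str, str]
    valid_from: datetime
    valid_until: datetime

    def covers(self, observed: Mapping[str, str], at: datetime) -> bool:
        """True if the subject, as observed at time `at`, is still in
        the partial state that this characterisation describes."""
        in_interval = self.valid_from <= at <= self.valid_until
        # Only the attributes fixed by the characterisation must match;
        # other observed attributes are free to vary.
        same_attributes = all(
            observed.get(key) == value for key, value in self.attributes.items()
        )
        return in_interval and same_attributes


# Made-up example: a report characterised by its version during one month.
report_v3 = CharacterisedEntity(
    subject="http://example.org/report",
    attributes={"version": "3"},
    valid_from=datetime(2012, 2, 1),
    valid_until=datetime(2012, 2, 29),
)

still_valid = report_v3.covers({"version": "3", "format": "html"}, datetime(2012, 2, 10))
changed = report_v3.covers({"version": "4"}, datetime(2012, 2, 10))
too_late = report_v3.covers({"version": "3"}, datetime(2012, 3, 5))
```

Provenance attached to `report_v3` would then be trusted only for observations that `covers` accepts; once the version changes or the interval has passed, it no longer applies.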

However, much current practice simply identifies the subject of provenance with a URI, saying nothing about the state of the identified resource. The prov-wg has therefore decided to present the data model in a way that supports this common usage. A separate document will propose an upgrade path: to produce a more robust form of provenance, extra assertions can make explicit the extent to which provenance assertions retain their interpretation when their subjects change.

Work on the fourth working draft has already begun; once it is complete, I will blog about it again.