From Provenance WG Wiki
Revision as of 12:10, 17 October 2011 by Kbelhajj (Talk | contribs)




The purpose of this document is to initiate a debate about what is meant by inter-operability in the context of the PROV-WG. The debate should help the PROV-WG to determine what makes implementations "conformant to the standard".

Suggestions, alternative views or approaches, are welcome!

The context: the PROV-WG Charter

The focus of the PROV-WG charter is on interchange of provenance, as indicated below.

As indicated in the Incubator Group's report, however, many provenance models exist with significantly different expressivity, fundamentally different assumptions about the system they are embedded in, and radically different performance impact. The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today.

A pragmatic approach is to consider a core provenance language with extension mechanisms that allow any provenance model to be translated into such a lingua franca and exchanged between systems. Heterogeneous systems can then export their provenance into such a core language, and applications that need to make sense of provenance in heterogeneous systems can then import it and reason over it.


PROV-DM is the data model for provenance specified by the PROV-WG, and PROV-ASN, the Provenance Abstract Syntax Notation, is a language to express instances of that data model. PROV-ASN is an abstract syntax, whose goals are to:

  • allow serializations of PROV-DM instances in a technology independent manner, which makes them more readable for human consumption;
  • facilitate the mapping of PROV-DM to concrete syntax;
  • provide the basis for a formal semantics.

Technological mappings

The charter identifies two forms of serialization, one leveraging the Semantic Web technology stack, and the second XML. For the first one, PROV-WG has begun work to map the data model to the OWL 2 Web Ontology Language. Work on the XML serialization has not started yet. Some groups have also indicated that a JSON serialization is desirable.

Here is a blog post about the merits of JSON over XML. --khalid

Forms of Inter-Operability

The IEEE Glossary defines interoperability as the ability of two or more systems or components to exchange information and to use the information that has been exchanged. In other words, for systems to inter-operate, they need to be capable of meaningful exchange of information, which requires both syntactic and semantic understanding of information.

To initiate the debate, below, we list various forms of inter-operability. It is not claimed that this list is complete, nor that the WG will subscribe to any of these notions of inter-operability.

1. Lossless bidirectional conversion between representations

Let PROV-X and PROV-Y be two concrete syntaxes of PROV-DM. The following properties are desirable:

  1. Any serialization x in PROV-X MUST be convertible to a serialization y in PROV-Y (conventionally written y=toY(x) )
  2. Any serialization y in PROV-Y MUST be convertible to a serialization x in PROV-X (conventionally written x=toX(y) )
  3. Conversions are lossless:
    1. For any serialization x in PROV-X, toX(toY(x)) =x x, where =x denotes equivalence of representations in syntax PROV-X
    2. For any serialization y in PROV-Y, toY(toX(y)) =y y, where =y denotes equivalence of representations in syntax PROV-Y

The PROV-WG has to ensure that all normative serializations are compatible with PROV-DM. Hence, lossless bidirectional conversion with PROV-ASN is also desirable.

Note that PROV-X/PROV-Y can be any implementation of PROV-DM that imports or exports provenance, including a proprietary solution.
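The round-trip properties can be turned into an executable check. The following is a minimal sketch, not a proposal for actual syntaxes: PROV-X is assumed to be a line-based text format and PROV-Y a JSON array, both invented for illustration, and equivalence (=x, =y) is assumed to mean "denotes the same set of statements".

```python
import json

def parse_x(x):
    """Read hypothetical PROV-X: one 'subject relation object' statement per line."""
    return frozenset(tuple(line.split()) for line in x.splitlines() if line.strip())

def parse_y(y):
    """Read hypothetical PROV-Y: a JSON array of {"s", "r", "o"} objects."""
    return frozenset((d["s"], d["r"], d["o"]) for d in json.loads(y))

def to_y(x):
    """Convert a PROV-X serialization to PROV-Y."""
    return json.dumps([{"s": s, "r": r, "o": o} for s, r, o in sorted(parse_x(x))])

def to_x(y):
    """Convert a PROV-Y serialization to PROV-X."""
    return "\n".join(" ".join(t) for t in sorted(parse_y(y)))

def equiv_x(a, b):
    """=x : two PROV-X texts are equivalent if they denote the same statements."""
    return parse_x(a) == parse_x(b)

def equiv_y(a, b):
    """=y : two PROV-Y texts are equivalent if they denote the same statements."""
    return parse_y(a) == parse_y(b)

# Blank lines and extra whitespace are allowed in the toy PROV-X syntax.
x = "e2 wasDerivedFrom e1\n\n  e3 wasDerivedFrom e2  "
y = to_y(x)

assert equiv_x(to_x(to_y(x)), x)   # property 3.1
assert equiv_y(to_y(to_x(y)), y)   # property 3.2
```

Note that the round trip does not return the original string character for character (whitespace is normalised), which is why the properties are stated in terms of equivalence rather than identity.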

2. Inference Preservation

Let PROV-X and PROV-Y be two technological mappings of the data model, with inference capabilities, where |-x and |-y denote entailment in PROV-X and PROV-Y respectively.

For any sets of assertions X and X' in PROV-X, X |-x X' if and only if toY(X) |-y toY(X')

For any sets of assertions Y and Y' in PROV-Y, Y |-y Y' if and only if toX(Y) |-x toX(Y')
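On finite examples, inference preservation can be tested directly. The sketch below makes several assumptions that are not part of PROV-DM: the only inference rule is transitivity of a single dependency relation, PROV-X stores (entity, source) pairs, and the hypothetical PROV-Y stores the same information as (source, entity) pairs.

```python
def closure(pairs):
    """Transitive closure of a set of (a, b) pairs."""
    result = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(result):
            for c, d in list(result):
                if b == c and (a, d) not in result:
                    result.add((a, d))
                    changed = True
    return result

def to_y(X):
    """PROV-X (entity, source) pairs -> hypothetical PROV-Y (source, entity) pairs."""
    return {(b, a) for a, b in X}

def to_x(Y):
    """Inverse of to_y."""
    return {(b, a) for a, b in Y}

def entails_x(X, X_prime):
    """X |-x X' : every assertion in X' follows from X by transitivity."""
    return set(X_prime) <= closure(X)

def entails_y(Y, Y_prime):
    """Y |-y Y' : the same toy rule, applied to the PROV-Y representation."""
    return set(Y_prime) <= closure(Y)

X = {("e3", "e2"), ("e2", "e1")}
candidates = [
    {("e3", "e1")},                 # follows by transitivity
    {("e1", "e3")},                 # does not follow
    {("e3", "e2"), ("e3", "e1")},   # asserted plus inferred
]

# X |-x X'  if and only if  toY(X) |-y toY(X')
for X_prime in candidates:
    assert entails_x(X, X_prime) == entails_y(to_y(X), to_y(X_prime))
```

A check of this kind can only refute preservation on the examples tried; it does not prove it for all sets of assertions.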

3. End-to-End Multi-System Inter-Operability

Let us consider three different systems with technologies S1, S2 and S3, exchanging information in a manner conformant to the PAQ document.

 Given a data result produced by S3, can all the relevant provenance generated 
 from S1, S2, S3 be accumulated in a single repository?

Interoperability queries:

  1. Can all the entity expressions of a given type that the result dependedOn be retrieved? Can the resources they denote be identified?
  2. Can all agents involved be returned?
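The scenario and the two queries can be sketched as follows. This is only an illustration under stated assumptions: each system is assumed to have already exported its provenance as (subject, relation, object) triples, and the relation names `type` and `attributedTo`, as well as all identifiers, are hypothetical.

```python
from collections import deque

# Provenance exported by three hypothetical systems.
s1 = {("d1", "type", "Dataset"), ("d1", "attributedTo", "alice")}
s2 = {("d2", "type", "Table"), ("d2", "dependedOn", "d1"),
      ("d2", "attributedTo", "bob")}
s3 = {("result", "type", "Report"), ("result", "dependedOn", "d2"),
      ("result", "attributedTo", "carol")}

# Accumulate everything in a single repository.
repository = s1 | s2 | s3

def dependencies(repo, start):
    """All entities that `start` depended on, directly or transitively."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, r, o in repo:
            if s == node and r == "dependedOn" and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

def of_type(repo, entities, type_name):
    """Query 1: restrict a set of entities to those with the given type."""
    return {e for e in entities if (e, "type", type_name) in repo}

def agents(repo, entities):
    """Query 2: all agents associated with any of the given entities."""
    return {o for s, r, o in repo if r == "attributedTo" and s in entities}

deps = dependencies(repository, "result")
datasets = of_type(repository, deps, "Dataset")
involved = agents(repository, deps | {"result"})
```

The queries only give complete answers if the three systems use the same identifiers for shared entities (here, `d1` and `d2`); otherwise the dependency chain breaks at the system boundary.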

4. Reuse of URLs for terminology

To be interoperable, serializations should use the same set of URLs to define terms. Thus, when looking up a URL in any serialization, one is led to the same definition. This means that, no matter which serialization is used, a developer can be sure that the terminology has the same meaning. Can we do this? Does this prevent us from doing things correctly in RDF?

5. Validator

A validator could check that a PROV ProvenanceContainer in one of the supported formats (say PROV-ASN, PROV-O) is:

  a) syntactically valid
  b) semantically valid

Testing a) depends on the format.

For PROV-O the syntactic check can be a multi-step process:

  1. Check that the document is valid RDF in a supported serialisation (RDF/XML, Turtle, TriG?)
  2. Check that referenced ontologies and namespaces (if any, in addition to prov, rdf, rdfs, owl) can be imported
  3. Run OWL 2 inferencing to ensure the document is not inconsistent according to PROV and third-party ontologies (an inconsistent example is :e1 prov:wasDerivedFrom :e1, since prov:wasDerivedFrom is irreflexive)
  4. Check that at least one statement with a prov:predicate or prov:Class exists
  5. Give warnings if inferencing adds prov:predicates or prov:Class memberships beyond the explicitly stated assertions

If all of these steps succeed, the document is a syntactically valid PROV-O document.

For semantic validity one can check that the provenance assertions of an account are valid according to the constraints in PROV-DM. Some of these are easier to check than others, but a validator should at least be able to find impossibilities such as:

  •  :e1 prov:wasDerivedFrom :e2 .
     :e2 prov:wasDerivedFrom :e1 .
  •  :pe1 prov:wasScheduledAfter :pe2 .
     :pe2 prov:wasScheduledAfter :pe1 .

From this it might be difficult to say that a provenance account is semantically valid, but it should in many cases be possible to say that it is invalid.
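Impossibilities of this kind are cycles in a relation that should be acyclic, so they can be found with a depth-first search. The sketch below assumes the account has already been parsed into (subject, predicate, object) triples; the function name `find_cycle` is illustrative and not part of any PROV tool.

```python
def find_cycle(statements, relation):
    """Return a list of nodes forming a cycle under `relation`, or None."""
    graph = {}
    for s, r, o in statements:
        if r == relation:
            graph.setdefault(s, []).append(o)

    done = set()  # nodes fully explored and known not to lead to a cycle

    def visit(node, path):
        if node in path:
            # We came back to a node on the current path: report the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        for nxt in graph.get(node, []):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        done.add(node)
        return None

    for start in list(graph):
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

account = [
    (":e1", "prov:wasDerivedFrom", ":e2"),
    (":e2", "prov:wasDerivedFrom", ":e1"),
    (":pe1", "prov:wasScheduledAfter", ":pe2"),
]

derivation_cycle = find_cycle(account, "prov:wasDerivedFrom")
scheduling_cycle = find_cycle(account, "prov:wasScheduledAfter")
```

Here `derivation_cycle` is a list of nodes that starts and ends at the same entity, while `scheduling_cycle` is `None` because the single scheduling statement is consistent. A result of `None` shows only that this particular impossibility is absent, not that the account is semantically valid.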

6. PROV-DM as an Interchange Format Between Existing Systems

(by khalid)

Existing systems may internally use provenance models other than PROV-DM. For example, they may use OPM, Provenir, PML, etc. To promote interoperability between such systems, it is not realistic to expect their providers to rebuild their systems to use PROV-DM as a core provenance model. Instead, PROV-DM, or more specifically a serialization thereof, is likely to be used as an interchange format that allows existing systems to communicate and, hopefully, interoperate.

To verify that two or more systems can interoperate using PROV-DM, we can check that the provenance information communicated between the systems conforms syntactically to PROV-DM. This check requires little effort. For example, if an XML serialization of PROV-DM is adopted, provenance information interchanged between the systems can be syntactically checked using XML Schema validators.

Note, however, that the above test is by no means sufficient to state that the systems in question can interoperate. Indeed, a property that needs to hold to reach interoperability is the ability of the systems to use (and produce) provenance information in line with the semantics of PROV-DM. Specifying generic tests for verifying that this property holds is much more difficult. In what follows, we list examples of tests that can be performed to (partially) verify interoperability. To do so, we distinguish between two kinds of systems: provenance producers and provenance consumers.

Provenance producers are systems that produce provenance information. A typical example is a scientific workflow system. To verify that a workflow system produces provenance information according to the semantics of PROV-DM, we can perform the following test. Consider a workflow wf and some input data in. Let wf_prov_dm be the provenance of the results output by wf when it is enacted, with in as input, by a workflow system that is known to use PROV-DM correctly. To check that a given workflow system WF-X produces provenance information that conforms to PROV-DM, we enact wf using WF-X with the same input in, and compare the provenance information it produces, wf_prov_WF-X, to wf_prov_dm. Suppose, for example, that WF-X uses OPM as its core provenance model. Before comparing wf_prov_WF-X with wf_prov_dm, we need to translate wf_prov_WF-X to a PROV-DM serialization. The two provenance traces can then be compared to check, e.g., that they contain the same entities and relationships. They can also be compared to check that the inferences that can be made under wf_prov_dm can also be made under wf_prov_WF-X, and vice versa.
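The comparison step of the producer test could look like the following sketch. It assumes both traces are already available as (subject, relation, object) triples in a PROV-DM serialization and that the two systems use the same identifiers; the translation from another model such as OPM is deliberately left out. For simplicity every subject or object is treated as an entity, and the sample identifiers are invented.

```python
def entities(trace):
    """Every identifier that appears as a subject or object in the trace."""
    return {s for s, _, _ in trace} | {o for _, _, o in trace}

def compare_traces(reference, candidate):
    """Report what the candidate trace lacks or adds relative to the reference."""
    reference, candidate = set(reference), set(candidate)
    return {
        "missing_entities": entities(reference) - entities(candidate),
        "extra_entities": entities(candidate) - entities(reference),
        "missing_relationships": reference - candidate,
        "extra_relationships": candidate - reference,
    }

# Reference trace from a system known to use PROV-DM correctly.
wf_prov_dm = {("out", "wasGeneratedBy", "step1"), ("step1", "used", "in")}
# Trace from the system under test (WF-X), already translated to PROV-DM.
wf_prov_wfx = {("out", "wasGeneratedBy", "step1")}

report = compare_traces(wf_prov_dm, wf_prov_wfx)
conformant = not any(report.values())
```

In this example WF-X failed to record that `step1` used the input, so the report lists one missing entity and one missing relationship and `conformant` is `False`.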

Provenance consumers are systems that use provenance information as input to the functionalities they provide. To check that such systems correctly use provenance information, we need to examine the functionalities they provide. This task can be difficult, as it may involve some detective work that examines both the specification and the implementation of such functionalities. Certain tests can, however, be performed without doing so. To illustrate this, consider a system that provides querying capabilities over a repository that stores PROV-DM compatible provenance information. To check that such a system correctly uses PROV-DM, we can issue queries for which the answers are known and compare the results obtained using the system in question to the expected results. As well as checking that the entities and relationships are being correctly used, such queries may involve checking that the inferences made by the system are correct.
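A consumer test of this kind amounts to a table of queries with known answers. The sketch below is a hypothetical harness: `run_query_tests`, the two stand-in consumers, and the query "which entities did X depend on?" are all assumptions made for illustration.

```python
def run_query_tests(answer, cases):
    """Run each (query, expected) case through `answer` and collect mismatches."""
    failures = []
    for query, expected in cases:
        actual = set(answer(query))
        if actual != set(expected):
            failures.append((query, set(expected), actual))
    return failures

# Toy repository of (entity, source) dependency pairs.
REPO = {("e3", "e2"), ("e2", "e1")}

def transitive_consumer(entity):
    """Stand-in consumer that follows dependencies transitively."""
    found, frontier = set(), {entity}
    while frontier:
        frontier = {o for s, o in REPO if s in frontier} - found
        found |= frontier
    return found

def shallow_consumer(entity):
    """Stand-in consumer that only returns direct dependencies."""
    return {o for s, o in REPO if s == entity}

# Queries whose answers are known in advance, including inferred dependencies.
cases = [("e3", {"e1", "e2"}), ("e2", {"e1"}), ("e1", set())]

transitive_failures = run_query_tests(transitive_consumer, cases)
shallow_failures = run_query_tests(shallow_consumer, cases)
```

The transitive consumer passes every case, while the shallow one fails the query for `e3` because it does not make the expected inference.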