TavernaProvenance

From Provenance WG Wiki
Revision as of 09:11, 3 October 2011 by Ssoiland

Representing Taverna's workflow provenance

by Stian Soiland-Reyes


I am one of the developers of the Taverna workflow system. Taverna captures the provenance of workflow runs, which can be accessed within the software, but also exported to OPM and the internal RDF format Janus.

As an experiment I put together a quick plugin for Taverna which can export Taverna provenance according to the W3C Provenance ontology.

To run:

  • git clone https://github.com/stain/taverna-prov.git
  • Import into Eclipse with m2eclipse
  • Run TavernaWorkbenchWithExamplePlugin.main()
  • Run example/concatsha1.t2flow
  • From the Run perspective, click "Save all" and then "Save provenance"

The plugin has not yet been fully prepared for installation into a standard Taverna installation, but when it has, the plugin site is:

Implementation

Example workflow

The example workflow is quite straightforward:

Figure: Concatsha1.png (diagram of the example workflow)

In the example run, you can observe:

  • A workflow input "input" set to "Some input data goes here"
  • A string constant returning "some string"
  • "Concatenate_two_strings" consuming these and returning "Some input data goes heresome string"
  • "sha1" consuming this string and returning "2180ba804a22055b1e9e326eb93ea05d594e5db5 -"
  • Two workflow outputs "sha1" and "combined" returning these last two values
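
The data flow of this run can be sketched in a few lines of Python. This is only an illustration of what the listed steps compute, not how Taverna executes the workflow. The function names mirror the processor names above; everything else is an assumption. In particular, the trailing " -" in the reported value suggests the real processor wraps a command line tool, and the exact bytes it hashed (for instance a trailing newline) are not known, so the digest computed here may differ from the one in the run.

```python
import hashlib


def concatenate_two_strings(first: str, second: str) -> str:
    """Stand-in for the "Concatenate_two_strings" processor."""
    return first + second


def sha1(text: str) -> str:
    """Stand-in for the "sha1" processor: hex SHA-1 of the UTF-8 bytes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Values taken from the example run described above.
workflow_input = "Some input data goes here"
string_constant = "some string"

combined = concatenate_two_strings(workflow_input, string_constant)
digest = sha1(combined)

# The two workflow outputs, keyed by output port name.
outputs = {"combined": combined, "sha1": digest}
print(outputs["combined"])
print(outputs["sha1"])
```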

Example run outputs:

Issues

  • It is unclear why used and wasControlledBy are not subproperties of hadParticipant in the ontology, although the model defines them as such.
  • Can't link to or describe Time without additional ontology
  • Can't link to or describe Location without additional ontology
  • Can't link to or describe Revision without additional ontology
  • Can't link to or describe Role easily (why is the role an entity, while Time/Location/Revision is not?)
  • Can't link a ProcessExecution to a definition or specification (except as 'uses' or 'controls' - but that feels odd)
  • It is not possible to see which input or output port an entity was generated or used with. This is because there is no link to the Role, and because the relations wasGeneratedBy and used are direct and don't allow any additional data. Note that although one can state "entity wasGeneratedAt Time", and similarly for Role, if you state "entity wasUsedAt Time" or "entity hadRole Role" you don't know in which process execution the entity was used or had the given role.
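
The last issue can be made concrete with a small Python sketch. It compares plain binary statements (using the property names from the text) with a hypothetical three-way Usage record that ties process execution, entity and role together. The identifiers exec:P1, exec:P2, data:X and the role names are invented for illustration, and the Usage record is not part of the ontology draft; it only shows what kind of information the direct relations cannot carry.

```python
from typing import NamedTuple, Set

# Binary statements only: the same entity is used by two process
# executions, once in each role. All identifiers are hypothetical.
triples = {
    ("exec:P1", "used", "data:X"),
    ("exec:P2", "used", "data:X"),
    ("data:X", "hadRole", "role:string1"),
    ("data:X", "hadRole", "role:string2"),
}


def roles_from_triples(process: str, entity: str) -> Set[str]:
    """Best effort with binary statements: every role the entity ever had."""
    if (process, "used", entity) not in triples:
        return set()
    return {o for (s, p, o) in triples if s == entity and p == "hadRole"}


class Usage(NamedTuple):
    """Hypothetical qualified relation: who used what, in which role."""
    process: str
    entity: str
    role: str


usages = [
    Usage("exec:P1", "data:X", "role:string1"),
    Usage("exec:P2", "data:X", "role:string2"),
]


def roles_from_usages(process: str, entity: str) -> Set[str]:
    """With a qualified relation the role is tied to one process execution."""
    return {u.role for u in usages if u.process == process and u.entity == entity}


print(sorted(roles_from_triples("exec:P1", "data:X")))
# prints ['role:string1', 'role:string2'], which is ambiguous
print(sorted(roles_from_usages("exec:P1", "data:X")))
# prints ['role:string1']
```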

Observations

  • zip-prov-abstract-ideal.txt compared to zip-prov-abstract.txt shows the information that should ideally be captured in the RDF and be described by the ontology
  • The ontology can fairly well describe a workflow execution, but misses some essential relations and properties
  • Without inverse properties I am forced to use the beans in one direction. With inverse properties and reasoning enabled, Elmo would let me define them in any order. On the other hand, the single direction helps keep the thinking about "going further into the past", although my raw data is provided in the opposite direction, as a log of what happened from the beginning.
    • My natural transcription would be:
    • There was a process execution, with start and end times and a recipe (the workflow definition)
    • The process execution is part of a larger workflow execution (controlled by the workflow process)
    • The process used entities A, B, C in these roles (input ports)
    • The process generated X, Y, Z in these roles (output ports)
  • By analysing the workflow definition, the plugin could also have provided "preceded" or "wasInformedBy" links to "upstream" process executions
  • Although ProvenanceContainer is an Entity (which allowed me to include the meta-provenance) - it is not linked to any of the other relations. I assume that the document should describe itself as the ProvenanceContainer - and if you had two ProvenanceContainers talking about the same set of events (with shared URIs) you would need two resources (or named graphs). Sesame however struggled to define RDF resources with the URI "" to refer to the document itself.
  • Execution of the workflow can be seen as one large process which uses the workflow inputs and generates the workflow outputs. This can be viewed as "controlled by" the agent "Taverna"
  • The execution of each processor in the workflow (like "Concatenate_two_strings" and "sha1") is also a process execution. These can be viewed as "controlled by" the workflow process execution, but this no longer shows the composition. (A processor might also be controlling another processor through scheduledAfter if there is a "Run after" link in the workflow definition.)
  • Taverna does not generate new data identifiers for the same value passed through several processors, so "output of concatenate" is the same as "input to sha1", but also the same as "output from workflow". This means there are two wasGeneratedBy statements in the graph. I am not sure if this makes sense or not, but the ontology does luckily allow multiple wasGeneratedBy. (If the workflow is viewed as a black box, then it's true that the value was "generated by" it. You open it up, and see that the process controlled another process which generated the same value. This could potentially go even deeper; for instance, you could say that the "sha1" processor controlled the process execution of the command line tool "/usr/bin/sha1sum".)
  • Data values themselves (the strings in this case) are not embedded in the provenance graph, and are not dereferenceable from the generated URIs. It is however the intention to embed small strings using something like Content in RDF, and to refer to larger values with relative URI references to files stored alongside the provenance graph in the output folder. I somehow doubt the provenance ontology should provide anything for this purpose; rather, the entities can just be web resources and/or have their own properties.
  • The plugin should include Taverna-specific properties, such as which resources are workflows, processors, etc.
  • The plugin can in theory handle nested workflows, but I did not test this
  • Iterations over lists are implicit in Taverna and could probably be modelled as a process execution with extracted/inserted elements with derivedFrom relationships. Otherwise you might see values appearing out of thin air because they were contained in a list or are a list of existing items. The plugin should also use Collection structures from the model.
  • No details of the actual services invoked are included, such as the command line tool "sha1sum". This could be provided through recipe links to the Taverna workflow definition (which can be expressed in RDF using SCUFL2).
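
Two of the observations above, the "natural transcription" of a process execution and the idea of deriving "wasInformedBy" links from the workflow definition, can be sketched together in Python. This is a minimal sketch under stated assumptions, not the plugin's implementation: the ProcessExecution record, the informed_by helper, the port names, the data identifiers and the placeholder timestamps are all hypothetical, and the workflow definition is reduced to a list of (source processor, sink processor) data links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ProcessExecution:
    """One log entry, recorded in the order the run actually happened."""
    processor: str                  # processor in the workflow definition (the recipe)
    started: str                    # placeholder timestamps, not real run data
    ended: str
    part_of: Optional[str] = None   # enclosing workflow execution
    used: Dict[str, str] = field(default_factory=dict)       # input port -> entity id
    generated: Dict[str, str] = field(default_factory=dict)  # output port -> entity id


def informed_by(
    executions: List[ProcessExecution],
    datalinks: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Turn data links from the workflow definition into upstream statements."""
    ran = {e.processor for e in executions}
    return [
        (sink, "wasInformedBy", source)
        for source, sink in datalinks
        if source in ran and sink in ran
    ]


executions = [
    ProcessExecution(
        "Concatenate_two_strings", "t1", "t2", part_of="workflow-run",
        used={"string1": "data:input", "string2": "data:constant"},
        generated={"output": "data:combined"},
    ),
    ProcessExecution(
        "sha1", "t3", "t4", part_of="workflow-run",
        used={"in": "data:combined"},
        generated={"out": "data:digest"},
    ),
]

# Hypothetical data link extracted from the workflow definition.
datalinks = [("Concatenate_two_strings", "sha1")]

for statement in informed_by(executions, datalinks):
    print(statement)
# prints ('sha1', 'wasInformedBy', 'Concatenate_two_strings')
```

Keeping the port name as the dictionary key is what preserves the role information that the direct used and wasGeneratedBy relations cannot express.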