TavernaProvenance

Representing Taverna's workflow provenance

by Stian Soiland-Reyes

Note: A more general version of this scientific workflow example has been hand-coded and included in the PROV ontology as http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceFormalModel.html#modeling-an-example-scientific-workflow-scenario

I am one of the developers of the Taverna workflow system. Taverna captures the provenance of workflow runs. This provenance can be browsed within the software, and can also be exported to OPM and to Janus, the internal RDF format.

As an experiment I put together a quick plugin for Taverna which can export Taverna provenance according to the W3C Provenance ontology.

To run:

git clone https://github.com/stain/taverna-prov.git
Import into Eclipse with m2eclipse
Run TavernaWorkbenchWithExamplePlugin.main()
Run example/concatsha1.t2flow
From Run perspective, click "Save all" and then "Save provenance"

The plugin has not yet been fully prepared for installation into a standard Taverna installation, but when it has, the plugin site is:

http://stain.github.com/taverna-prov/

Implementation

Plugin uses Taverna's ProvenanceAccess API to retrieve the run data after the workflow has finished.
OpenRDF Elmo parses http://dvcs.w3.org/hg/prov/raw-file/default/ontology/ProvenanceOntology.owl and generates corresponding Elmo JavaBeans to be used with a Sesame store.
Plugin generates URIs for process executions and data items based on the internal identifiers in the provenance database.
Plugin creates provenance beans and adds them to the Elmo store.
Plugin exports the W3C provenance graph as an RDF/XML file.
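
The URI-minting and export steps above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's Java/Elmo code: the base URIs, the `PROV` namespace placeholder and all function names are assumptions, and the sketch writes N-Triples rather than RDF/XML to stay dependency-free.

```python
from urllib.parse import quote

# Hypothetical base URIs; the real plugin's URI scheme may differ.
RUN_BASE = "http://example.org/taverna/run/"
DATA_BASE = "http://example.org/taverna/data/"
# Placeholder only, NOT the actual namespace of the provenance ontology.
PROV = "http://example.org/prov#"


def process_uri(run_id, process_id):
    """Mint a URI for a process execution from internal identifiers."""
    return f"{RUN_BASE}{quote(run_id, safe='')}/process/{quote(process_id, safe='')}"


def data_uri(data_id):
    """Mint a URI for a data item from its internal identifier."""
    return f"{DATA_BASE}{quote(data_id, safe='')}"


def triple(subject, predicate, obj):
    """Serialise one statement between three URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def export_ntriples(run_id, records):
    """records: iterable of (process_id, used_data_ids, generated_data_ids)."""
    lines = []
    for process_id, used_ids, generated_ids in records:
        proc = process_uri(run_id, process_id)
        for d in used_ids:
            lines.append(triple(proc, PROV + "used", data_uri(d)))
        for d in generated_ids:
            lines.append(triple(data_uri(d), PROV + "wasGeneratedBy", proc))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up identifiers standing in for rows of the provenance database.
    print(export_ntriples("run1", [("sha1", ["d3"], ["d4"])]))
```

Percent-encoding the internal identifiers keeps the minted URIs valid even if an identifier contains spaces or slashes.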

Example workflow

The example workflow is quite straightforward:

[Figure: diagram of the example workflow; image not available]

In the example run, you can observe:

A workflow input "input" set to "Some input data goes here"
A string constant returning "some string"
"Concatenate_two_strings" consuming these and returning "Some input data goes heresome string"
"sha1" consuming this string and returning "2180ba804a22055b1e9e326eb93ea05d594e5db5 -"
Two workflow outputs "sha1" and "combined" returning these last two values
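
For readers without Taverna at hand, the data flow of this run can be approximated in a few lines of Python. This is a sketch under assumptions: the function names simply mirror the processor names, and the real "sha1" step appears to wrap a command line tool (the trailing " -" in its output suggests input read from standard input), so the exact bytes it hashed, for example a trailing newline, are not known and the digest computed here is not guaranteed to match the value listed above.

```python
import hashlib

workflow_input = "Some input data goes here"   # workflow input "input"
string_constant = "some string"                # value of the string constant


def concatenate_two_strings(a, b):
    """Stand-in for the "Concatenate_two_strings" processor."""
    return a + b


def sha1(text):
    """Stand-in for the "sha1" processor: hex SHA-1 digest of the UTF-8 bytes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


combined = concatenate_two_strings(workflow_input, string_constant)
digest = sha1(combined)

# The two workflow outputs, keyed by output port name.
outputs = {"combined": combined, "sha1": digest}

if __name__ == "__main__":
    for port, value in outputs.items():
        print(port, "=", value)
```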

Example run outputs:

Exported provenance in RDF/XML
Turtle format
manually transcribed to the abstract provenance syntax
manually edited to an ideal abstract provenance syntax
with Taverna's existing OPM export

Issues

It is unclear why used and wasControlledBy are not subproperties of hadParticipant in the ontology, although the model defines them as such.
Can't link to or describe Time without additional ontology
Can't link to or describe Location without additional ontology
Can't link to or describe Revision without additional ontology
Can't link to or describe Role easily (why is the Role an entity, while Time/Location/Revision are not?)
Can't link a ProcessExecution to a definition or specification (except as 'uses' or 'controls' - but that feels odd)
It is not possible to see which input or output port an entity was generated by or used with. This is because there is no link to the Role, and because the relations wasGeneratedBy and used are direct and cannot carry any additional data. Note that although one can state entity wasGeneratedAt Time, and similarly for Role, statements such as entity wasUsedAt Time or entity hadRole Role do not tell you in which process execution the entity was used or had that role.
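
The last issue can be illustrated with a small Python sketch. It is not part of the plugin or the ontology: the `Usage` record is a hypothetical n-ary form of the used relation, and the port name "string1" and the identifier "data1" are invented for the example. With only direct statements the roles attach to the entity and cannot be tied to a particular process execution; a record that carries the role alongside process and entity removes the ambiguity.

```python
from dataclasses import dataclass
from typing import Optional

# Direct (binary) statements: the same data item is used by two process
# executions and has two roles, but nothing says which role belongs where.
direct = {
    ("workflow", "used", "data1"),
    ("Concatenate_two_strings", "used", "data1"),
    ("data1", "hadRole", "input"),
    ("data1", "hadRole", "string1"),
}


@dataclass(frozen=True)
class Usage:
    """Hypothetical n-ary 'used' relation carrying its own role and time."""
    process: str
    entity: str
    role: str
    time: Optional[str] = None


usages = [
    Usage("workflow", "data1", "input"),
    Usage("Concatenate_two_strings", "data1", "string1"),
]


def roles_from_direct(triples, entity):
    """All roles an entity has; the owning process execution is unknown."""
    return sorted(o for s, p, o in triples if s == entity and p == "hadRole")


def roles_from_qualified(records, process, entity):
    """Roles an entity had within one specific process execution."""
    return [u.role for u in records if u.process == process and u.entity == entity]


if __name__ == "__main__":
    print(roles_from_direct(direct, "data1"))
    print(roles_from_qualified(usages, "Concatenate_two_strings", "data1"))
```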

Observations

zip-prov-abstract-ideal.txt compared to zip-prov-abstract.txt shows the information that should ideally be captured in the RDF and be described by the ontology
The ontology can fairly well describe a workflow execution, but misses some essential relations and properties
Without inverse properties I am forced to use the beans in one direction. With inverse properties and reasoning enabled, Elmo would let me define them in any order. On the other hand, the restriction helps keep the thinking oriented towards "going further into the past", although my raw data is provided in the opposite direction, as a log of what happened from the beginning.
- My natural transcription would be:
- There was a process execution, with a start, an end and a recipe (workflow definition)
- The process execution is part of a larger workflow execution (controlled by the workflow process)
- The process used entities A, B, C in these roles (input ports)
- The process generated X, Y, Z in these roles (output ports)
By analysing the workflow definition, the plugin could also have provided "preceded" or "wasInformedBy" links to "upstream" process executions.
Although ProvenanceContainer is an Entity (which allowed me to include the meta-provenance) - it is not linked to any of the other relations. I assume that the document should describe itself as the ProvenanceContainer - and if you had two ProvenanceContainers talking about the same set of events (with shared URIs) you would need two resources (or named graphs). Sesame however struggled to define RDF resources with the URI "" to refer to the document itself.
Execution of the workflow can be seen as one large process which uses the workflow inputs and generates the workflow outputs. This can be viewed as "controlled by" the agent "Taverna"
Executions of each processor in the workflow (like "Concatenate_two_strings" and "sha1") are also process executions. These can be viewed as "controlled by" the workflow process execution, but this no longer shows the composition. (A processor might also be controlling another processor through scheduledAfter if there is a "Run after" link in the workflow definition.)
Taverna does not generate new data identifiers for the same value passed through several processors, so "output of concatenate" is the same as "input to sha1", but also the same as "output from workflow". This means there are two wasGeneratedBy statements for that value in the graph. I am not sure if this makes sense or not, but the ontology does luckily allow multiple wasGeneratedBy. (If the workflow is viewed as a black box, then it is true that the value was 'generated by' that. You open it up, and see that the process controlled another process which generated the same value. This could potentially go even deeper, for instance you could say that the "sha1" processor controlled the process execution of the command line tool "/usr/bin/sha1sum".)
Data values themselves (the strings in this case) are not embedded in the provenance graph, and are not dereferenceable from the generated URIs. It is however the intention to embed small strings using something like Content in RDF, and to refer to larger values with relative URI references to files stored alongside the provenance graph in the output folder. I somehow doubt the provenance ontology should provide anything for this purpose; rather, the entities can just be web resources and/or have their own properties.
The plugin should include Taverna-specific properties, such as what is a workflow, a processor, etc.
The plugin can in theory handle nested workflows, but I did not test this.
Iterations over lists are implicit in Taverna and could probably be modelled as a process execution with extracted/inserted elements with derivedFrom relationships. Otherwise you might see values appearing out of thin air because they were contained in a list or are a list of existing items. The export should also use the Collection structures from the model.
No details of the actual services invoked are included, like the command line tool "sha1sum". This could be provided through recipe links to the Taverna workflow definition (which can be expressed in RDF using SCUFL2).
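
Several of the observations above (the forward-running log, the shared data identifiers with more than one wasGeneratedBy, and the possible links to upstream process executions) can be tried out with a small Python sketch of the example run. Everything here is an assumption made for illustration: the data identifiers d1 to d4, the processor port names and the `ProcessExecution` record are invented, and the upstream links are derived from data flow in the log rather than from the workflow definition as suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessExecution:
    name: str
    controlled_by: str
    used: dict = field(default_factory=dict)       # port name -> data id
    generated: dict = field(default_factory=dict)  # port name -> data id


# Forward log of the example run, in the order things happened.
log = [
    ProcessExecution("workflow", "Taverna",
                     used={"input": "d1"},
                     generated={"combined": "d3", "sha1": "d4"}),
    ProcessExecution("String_constant", "workflow", generated={"value": "d2"}),
    ProcessExecution("Concatenate_two_strings", "workflow",
                     used={"string1": "d1", "string2": "d2"},
                     generated={"output": "d3"}),
    ProcessExecution("sha1", "workflow", used={"in": "d3"}, generated={"out": "d4"}),
]


def statements(entries):
    """Turn the forward log into backward-looking direct statements."""
    out = []
    for pe in entries:
        out.append((pe.name, "wasControlledBy", pe.controlled_by))
        out.extend((pe.name, "used", d) for d in pe.used.values())
        out.extend((d, "wasGeneratedBy", pe.name) for d in pe.generated.values())
    return out


def generators(entries, data_id):
    """Every process execution that claims to have generated a data item."""
    return [pe.name for pe in entries if data_id in pe.generated.values()]


def informed_by(entries):
    """(downstream, upstream) pairs among executions sharing a controller."""
    pairs = set()
    for down in entries:
        for up in entries:
            same_level = up.controlled_by == down.controlled_by
            shared = set(down.used.values()) & set(up.generated.values())
            if up is not down and same_level and shared:
                pairs.add((down.name, up.name))
    return sorted(pairs)


if __name__ == "__main__":
    print(generators(log, "d3"))
    print(informed_by(log))
```

Restricting the upstream links to executions with the same controller keeps the enclosing workflow execution, which also "generated" d3 and d4, from appearing as an upstream step of its own processors.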

Retrieved from "https://www.w3.org/2011/prov/wiki/index.php?title=TavernaProvenance&oldid=3381"
