Use Case Result Differences

From XG Provenance Wiki
Revision as of 15:55, 5 January 2010 by Jcheney

Owner

Simon Miles

Provenance Dimensions

  • Primary: Commonality (Use)
  • Secondary: Process (Content), Scale (Management), Imperfections/Debugging (Use)

Background and Current Practice

This use case is taken from the following paper: The Requirements of Using Provenance in e-Science Experiments, and comes from interviews with Klaus-Peter Zauner. Please see the paper for more background details.

A bioinformatics experiment, encoded as a workflow, uses a number of services, some externally provided and some written by the biologist, that analyse data drawn from publicly accessible databases. When a potentially interesting result is found, the biologist re-runs parts of the workflow with different configuration parameters to try to determine why that result was produced.

Goal

Determine or disambiguate why two processes produced different results.

In order to do this, we use the provenance of the results to examine the processes producing them.

Use Case Scenario

A user, B, downloads data from source D and performs a process using that data as input. Later, B downloads data from D again using the same query and performs the same process on it. B compares the two process outputs and notices a difference. B then determines whether the difference was caused by a change to the process or its configuration, by a difference in the downloaded data, or by both.

In the original bioinformatics use case, the data was that of a human chromosome, D was GenBank, and the process was the experiment itself (largely written as Tcl scripts). However, this scenario applies wherever differences in outcome (where equivalence was expected) may be due to differences in parts of the process producing the outcome or in the inputs to that process.
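The comparison B performs in this scenario can be sketched in Python as follows. This is only an illustrative sketch, not the approach used in the original experiment: it assumes each run recorded content digests of its input data and process script alongside its configuration, and the names `RunRecord`, `record_run`, and `explain_difference` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def digest(content: bytes) -> str:
    """Return a SHA-256 digest used to compare content without storing it."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical minimal provenance record for one run of the process."""
    input_digest: str    # digest of the data downloaded from D
    process_digest: str  # digest of the script that implements the process
    config: tuple        # sorted (name, value) pairs of configuration parameters
    output_digest: str   # digest of the process output


def record_run(data: bytes, script: bytes, config: dict, output: bytes) -> RunRecord:
    """Build a record for one run (assumes configuration names are strings)."""
    return RunRecord(
        input_digest=digest(data),
        process_digest=digest(script),
        config=tuple(sorted(config.items())),
        output_digest=digest(output),
    )


def explain_difference(a: RunRecord, b: RunRecord) -> list:
    """List the recorded factors that differ between two runs.

    Returns an empty list when the outputs are identical, and
    ["unrecorded factor"] when the outputs differ but nothing recorded does.
    """
    if a.output_digest == b.output_digest:
        return []
    causes = []
    if a.input_digest != b.input_digest:
        causes.append("input data")
    if a.process_digest != b.process_digest:
        causes.append("process")
    if a.config != b.config:
        causes.append("configuration")
    return causes or ["unrecorded factor"]


# Two runs with the same script and configuration but different downloaded data.
first = record_run(b"ACGT", b"script v1", {"window": 10}, b"result 1")
second = record_run(b"ACGA", b"script v1", {"window": 10}, b"result 2")
print(explain_difference(first, second))
```

Storing digests rather than the data itself keeps each record small, at the cost of only showing that an input changed, not how it changed.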

Problems and Limitations

Without a record of the salient differences between workflow runs, i.e. the provenance of each outcome, it becomes hard or impossible to determine why the outcomes differ. Where the experiments involve large amounts of complex data, as was the case in the bioinformatics experiment, manual records in a lab book are not a feasible way to provide this information.

Technical Challenges:

  • Representing the full record of what occurred in a process
  • Extracting the above record
  • Representing the provenance of data which is derived from large and complex data
  • Determining the differences between two complex provenance records
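As a rough illustration of the last challenge, the sketch below compares two provenance records represented as nested dictionaries and reports every path at which they differ. The representation, the field names in the example, and the `diff_records` function are assumptions made for illustration; the sketch does not address the scale or representation challenges listed above.

```python
MISSING = "<absent>"  # placeholder reported when a field exists in only one record


def diff_records(a, b, path=""):
    """Return (path, value_in_a, value_in_b) for each point where a and b differ.

    Nested dictionaries are compared key by key (keys are assumed to be
    strings); any other values are compared with ==.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}/{key}"
            if key not in a:
                diffs.append((child, MISSING, b[key]))
            elif key not in b:
                diffs.append((child, a[key], MISSING))
            else:
                diffs.extend(diff_records(a[key], b[key], child))
        return diffs
    return [] if a == b else [(path, a, b)]


# Hypothetical records for two runs of the same workflow.
run_a = {
    "input": {"source": "GenBank", "retrieved": "day 1"},
    "process": {"script": "v1", "config": {"window": 10}},
}
run_b = {
    "input": {"source": "GenBank", "retrieved": "day 2"},
    "process": {"script": "v1", "config": {"window": 20}},
}

for where, left, right in diff_records(run_a, run_b):
    print(where, left, right)
```

Reporting full paths lets the user see at a glance whether a difference lies in the inputs or in the process and its configuration.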

Existing Work

An approach to addressing this use case is discussed in the following paper: Recording and Using Provenance in a Protein Compressibility Experiment.