Use Case Provenance for IQ

From XG Provenance Wiki

Owner/Curator

Paolo Missier

Provenance Dimensions

  • Primary: Trust (Use)
  • Secondary: Attribution (Content)

Background and Current Practice

Assessing the quality of information for specific application domains, notably in eScience, is predominantly an information consumer task, aimed at establishing whether a piece of information is fit for use in the context of an application. Unfortunately, a quantitative assessment based on well-defined quality metrics is not always possible; indeed, for many common but complex types of scientific data, no agreed-upon metrics are available. One reason is that many variables (indicators) contribute to the accuracy, precision, and ultimately the reliability and trustworthiness of a data product that emerges from a complex experimental pipeline, and analytical models that explain overall quality in terms of those variables are difficult to develop.

A promising alternative to analytical models is to learn (heuristic) models that correlate indicator variables, obtained from large collections of datasets, with their user-perceived quality levels. Such a machine learning approach relies on a large number of examples and the corresponding user decisions, i.e., accept vs. reject, to establish correlations between the state of the indicator variables and the outcome.
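As a minimal sketch of this idea, the following Python fragment induces a decision stump (a single-threshold rule) from labelled examples. The decision stump is only a stand-in for whatever learner is actually used, and the indicator names `hit_ratio` and `mass_coverage`, as well as all the values, are hypothetical.

```python
from typing import Dict, List, Tuple

# One labelled example: indicator values plus the user's accept/reject decision.
Example = Tuple[Dict[str, float], bool]
# A model: (indicator name, threshold, accept when value >= threshold?)
Stump = Tuple[str, float, bool]


def train_stump(examples: List[Example]) -> Stump:
    """Return the single-indicator rule with the fewest training errors."""
    best: Stump = ("", 0.0, True)
    best_errors = len(examples) + 1
    for name in sorted(examples[0][0]):
        for threshold in sorted({ind[name] for ind, _ in examples}):
            for accept_if_above in (True, False):
                # Count examples where the rule disagrees with the user.
                errors = sum(
                    ((ind[name] >= threshold) == accept_if_above) != accepted
                    for ind, accepted in examples
                )
                if errors < best_errors:
                    best_errors = errors
                    best = (name, threshold, accept_if_above)
    return best


def predict(model: Stump, indicators: Dict[str, float]) -> bool:
    """Apply the learned rule to the indicators of a new dataset."""
    name, threshold, accept_if_above = model
    return (indicators[name] >= threshold) == accept_if_above


# Hypothetical labelled examples (values are illustrative only).
examples: List[Example] = [
    ({"hit_ratio": 0.9, "mass_coverage": 0.40}, True),
    ({"hit_ratio": 0.8, "mass_coverage": 0.10}, True),
    ({"hit_ratio": 0.3, "mass_coverage": 0.35}, False),
    ({"hit_ratio": 0.2, "mass_coverage": 0.15}, False),
]
model = train_stump(examples)
```

On this toy data the learned rule accepts a dataset when `hit_ratio` is at least 0.8. With realistic data a richer learner would normally be preferable, but the inputs (indicators plus user decisions) stay the same.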

Information quality assessment is rarely performed at all on the output of complex eScience processes. We have described some of the information quality problems that arise in a specific area of the life sciences, namely qualitative proteomics, in a recent survey [1]; a partial, non-inductive but practical approach to the problem is described in a recent demo [2].


This scenario illustrates a case where such a correlation between indicator variables and the perceived quality of a piece of information is computed. This is relevant to process provenance in that the indicator variables represent statements about the history (for example, derivation and attribution) of information.

Use Case Scenario

Suppose a scientist runs a workflow to identify some of the proteins that are manifested on a 2D gel. A number of technologies are routinely used for this purpose. Protein identification by mass spectrometry, for example, relies on a "mass fingerprint" that describes the protein peptides found on the gel, and works by matching this fingerprint against a large database of pre-computed fingerprints for known proteins. In this example, a web service may be used to match a fingerprint against a database. Throughout the experimental pipeline, a number of experimental problems may contribute to a poor outcome. These may include some environmental parameters in the lab, details of the wet lab portion of the experiment, as well as the parameters used for the match, the type and version of the database, and more. Each of these details can potentially be captured as the experiment is performed, and later used as a source of quality indicators.
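One way to picture the captured details is as a record attached to each identification run. The sketch below is illustrative only: the class `IdentificationRun`, all of its field names, the derived `hit_ratio` indicator, and the sample values are assumptions, not part of any existing tool or of the scenario itself.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class IdentificationRun:
    """Hypothetical record of one protein identification run."""
    gel_id: str
    lab_temperature_c: float   # environmental parameter in the lab
    database_name: str         # fingerprint database that was queried
    database_version: str
    mass_tolerance_ppm: float  # parameter passed to the matching service
    matched_peptides: int
    submitted_peptides: int


def quality_indicators(run: IdentificationRun) -> Dict[str, Union[str, float, int]]:
    """Flatten a run record into indicator variables for later analysis."""
    indicators = asdict(run)
    indicators.pop("gel_id")  # an identifier, not an indicator
    # Derived indicator (an assumption): fraction of peptides that matched.
    indicators["hit_ratio"] = (
        run.matched_peptides / run.submitted_peptides
        if run.submitted_peptides
        else 0.0
    )
    return indicators


# Illustrative values only.
run = IdentificationRun(
    gel_id="gel-001",
    lab_temperature_c=21.5,
    database_name="example-db",
    database_version="2009.1",
    mass_tolerance_ppm=50.0,
    matched_peptides=12,
    submitted_peptides=16,
)
indicators = quality_indicators(run)
```

Capturing the record at execution time, rather than reconstructing it afterwards, is what makes parameters such as the database version available as indicators later on.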

The user's goal in this case is twofold: (a) to manually label a set of experiments with a personal assessment of the quality of the outcome, and (b) to use such labelling, in combination with the available quality indicators, captured as described above, in order to induce quality models to be made available to the community and later applied to further, unlabelled experiments.
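The two steps can be sketched as a join between the user's labels and the captured indicators: labelled runs become training examples, and the remaining runs are kept for later classification. The function name `split_by_label`, the run identifiers, and the single `hit_ratio` indicator are all hypothetical.

```python
from typing import Dict, List, Tuple

Indicators = Dict[str, float]


def split_by_label(
    indicators_by_run: Dict[str, Indicators],
    labels: Dict[str, bool],
) -> Tuple[List[Tuple[Indicators, bool]], Dict[str, Indicators]]:
    """Separate runs into labelled training examples and unlabelled runs."""
    # Step (a): pair each manually labelled run with its indicators.
    training = [
        (indicators_by_run[run_id], labels[run_id])
        for run_id in sorted(labels)
        if run_id in indicators_by_run
    ]
    # Step (b) input: runs with no label, to be classified by the model.
    unlabelled = {
        run_id: ind
        for run_id, ind in indicators_by_run.items()
        if run_id not in labels
    }
    return training, unlabelled


# Illustrative values only; True means "accept", False means "reject".
indicators_by_run = {
    "run-1": {"hit_ratio": 0.9},
    "run-2": {"hit_ratio": 0.2},
    "run-3": {"hit_ratio": 0.6},
}
labels = {"run-1": True, "run-2": False}
training, unlabelled = split_by_label(indicators_by_run, labels)
```

Here `training` would feed whichever learner induces the quality model, and `unlabelled` holds the experiments to which the published model is later applied.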

Problems and Limitations

There are two types of problems. Firstly, there are those that are common to all inductive methods: the method is inherently heuristic, in that it induces a model that is subject to errors (misclassifications), and the minimum number and variety of labelled examples needed to generate a useful model varies widely, as it depends upon the amount of correlation that can be established.
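The misclassification risk can at least be estimated from the labelled examples themselves. The sketch below uses leave-one-out evaluation with a 1-nearest-neighbour classifier; both are stand-ins chosen for brevity rather than the method used in the cited work, and the indicator vectors and labels are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

# (indicator vector, user decision: True = accept, False = reject)
Example = Tuple[Sequence[float], bool]


def nearest_label(train: List[Example], point: Sequence[float]) -> bool:
    """Predict the label of the closest training example (1-NN)."""
    return min(train, key=lambda ex: math.dist(ex[0], point))[1]


def leave_one_out_error(examples: List[Example]) -> float:
    """Fraction of examples misclassified when each is held out in turn."""
    errors = 0
    for i, (point, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if nearest_label(rest, point) != label:
            errors += 1
    return errors / len(examples)


# Illustrative values only; the last example sits close to the accepted ones.
examples: List[Example] = [
    ((0.9, 0.4), True),
    ((0.8, 0.3), True),
    ((0.2, 0.1), False),
    ((0.3, 0.2), False),
    ((0.6, 0.3), False),
]
error_rate = leave_one_out_error(examples)
```

On this toy data only the last example is misclassified, giving an estimated error rate of one in five. With so few examples the estimate is itself unreliable, which illustrates why the number and variety of labelled examples matters.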

Secondly, and more to the point in the context of provenance, the types of indicator variables that can be captured during the experiment can be very heterogeneous, i.e., specific to the type of data and the type of experiment, making it difficult to generalize to a commonly useful provenance model. It should be clear that a generic provenance model, used for example to describe causal relationships across data elements consumed and produced by a workflow, is not sufficient. Domain-specific annotations on such a causal graph are also needed in order to use provenance as a valuable source of quality indicators.
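To make the distinction concrete, the sketch below models a generic causal graph whose nodes carry free-form, domain-specific annotations, and collects those annotations along the derivation history of a data product. The classes `Node` and `ProvenanceGraph`, the node identifiers, and the annotation keys are all hypothetical and do not follow any particular provenance standard.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Node:
    node_id: str
    kind: str  # "data" or "process"
    # Domain-specific annotations; the generic model does not constrain them.
    annotations: Dict[str, object] = field(default_factory=dict)


@dataclass
class ProvenanceGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Maps each node to the nodes it was directly derived from.
    derived_from: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, node: Node, causes: Iterable[str] = ()) -> None:
        self.nodes[node.node_id] = node
        self.derived_from[node.node_id] = list(causes)

    def indicators_for(self, node_id: str) -> Dict[str, object]:
        """Collect annotations from a node and all of its causal ancestors."""
        collected: Dict[str, object] = {}
        stack, seen = [node_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            for key, value in self.nodes[current].annotations.items():
                collected[f"{current}.{key}"] = value
            stack.extend(self.derived_from[current])
        return collected


# Illustrative graph: gel spot -> fingerprint match -> protein list.
graph = ProvenanceGraph()
graph.add(Node("gel_spot", "data", {"lab_temperature_c": 21.5}))
graph.add(
    Node("fingerprint_match", "process",
         {"database_version": "2009.1", "mass_tolerance_ppm": 50}),
    causes=["gel_spot"],
)
graph.add(Node("protein_list", "data"), causes=["fingerprint_match"])
indicators = graph.indicators_for("protein_list")
```

Without the `annotations` field the graph would still record that the protein list derives from the match, but it would offer nothing to use as a quality indicator; the annotations are where the domain-specific heterogeneity lives.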

Existing Work

  1. D. Stead, N. Paton, P. Missier, S. Embury, C. Hedeler, B. Jin, A. Brown, and A. Preece, "Information Quality in Proteomics," Briefings in Bioinformatics, vol. 9, 2008, pp. 174-188.
  2. P. Missier, S.M. Embury, R.M. Greenwood, A.D. Preece, and B. Jin, "Managing information quality in e-science: the Qurator workbench," SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, New York, NY, USA: ACM, 2007, pp. 1150-1152.