Use Case: Provenance in Biomedicine

From XG Provenance Wiki


Satya Sahoo, Brent Weatherly, Amit Sheth

Provenance Dimensions

  • Primary: Process (Content), Agent (Content), Justification for Decisions (Content)
  • Secondary: Attribution (Content), Query (Management), Scale (Management), Trust (Use), Commonality (Use)

Background and Current Practice

This use case is drawn from the biomedicine domain. The research objective is to identify vaccine, diagnostic, and chemotherapeutic targets in the human pathogen Trypanosoma cruzi (T. cruzi) [1]. Parasite researchers generate data using different experiment protocols, such as expression profiling, proteome analysis, and the creation of new pathogen strains through gene knockout. These experiment datasets are also combined with data from external sources, such as biology databases (NCBI Entrez Gene, TriTrypDB) and biomedical literature (PubMed), which differ in their curation methods and quality.

The biologists issue queries over the integrated datasets and interpret the results according to source details, including the curation method used in external databases, the confidence measures associated with experiment materials and methods, and the research personnel or institution involved.


The broader aim is to compare, integrate, and process large volumes of biomedical data from different experiment processes (using heterogeneous materials, equipment, protocols, and parameters) and from external sources, including databases and published literature.

In this use case, the goal is to capture and store domain-specific provenance to support both biological and administrative queries, including:

1) Enable researchers in the parasite research community to infer the phenotype of related organisms from the work done on T. cruzi.

2) Enable project managers or principal investigators to track progress and/or view the successful creation of pathogen strains.

3) Allow new researchers, such as visiting faculty or post-docs, to learn the lab-specific methods by studying existing results and the associated experiment protocols.

Use Case Scenario

Biologists interpret two result sets, A and B, according to the source of the data. In A, the experiment data is combined with data from a database of human-curated information. In B, the experiment data is combined with data from a database holding the results of a prediction algorithm. Result set A has a higher confidence value and is used in further analysis.
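The scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numeric confidence weights, the `ResultSet` fields, and the dataset and database names are hypothetical, since the use case says only that the human-curated source is trusted more than the predicted one.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical confidence weights per curation method. The use case states
# only that human-curated data ranks above algorithmic predictions.
CURATION_CONFIDENCE = {"human_curated": 0.9, "predicted": 0.5}


@dataclass(frozen=True)
class ResultSet:
    name: str
    experiment: str        # identifier of the lab experiment dataset (made up)
    external_source: str   # name of the external database (made up)
    curation_method: str   # key into CURATION_CONFIDENCE


def confidence(result: ResultSet) -> float:
    """Look up the confidence implied by the source's curation method."""
    return CURATION_CONFIDENCE[result.curation_method]


def select_for_analysis(results: list[ResultSet]) -> ResultSet:
    """Pick the result set whose source provenance gives the highest confidence."""
    return max(results, key=confidence)


result_a = ResultSet("A", "exp-1", "curated-db", "human_curated")
result_b = ResultSet("B", "exp-1", "prediction-db", "predicted")
chosen = select_for_analysis([result_a, result_b])
print(chosen.name)
```

Keeping the curation method as an explicit field on each result set is what lets the selection be explained afterwards: the choice of A can be traced to its source rather than to the data values themselves.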

A set of queries in the context of this use case:

Query 1: List all groups using target region plasmid X.

Query 2: Find the name of the researcher who created strain Y of the T. cruzi parasite.

Query 3: Which gene was used to create cloned sample Z?
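As a minimal sketch of how such queries might run over stored provenance, the example below keeps provenance as subject-predicate-object triples in memory. The predicate names (`uses`, `created_by`, `has_name`, `cloned_from`) and all identifiers are assumptions made for illustration; they are not taken from the ontology used in the cited work.

```python
# Hypothetical provenance records as (subject, predicate, object) triples.
TRIPLES = [
    ("group:alpha", "uses", "plasmid:X"),
    ("group:beta", "uses", "plasmid:X"),
    ("group:gamma", "uses", "plasmid:W"),
    ("strain:Y", "created_by", "researcher:r1"),
    ("researcher:r1", "has_name", "Example Researcher"),
    ("sample:Z", "cloned_from", "gene:g1"),
]


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject linked to `obj` by `predicate`."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def groups_using_plasmid(plasmid):
    """Query 1: groups that use the given target region plasmid."""
    return sorted(subjects("uses", plasmid))


def creator_names_of_strain(strain):
    """Query 2: names of the researchers who created the given strain."""
    names = []
    for researcher in objects(strain, "created_by"):
        names.extend(objects(researcher, "has_name"))
    return names


def genes_for_sample(sample):
    """Query 3: genes used to create the given cloned sample."""
    return objects(sample, "cloned_from")


print(groups_using_plasmid("plasmid:X"))
print(creator_names_of_strain("strain:Y"))
print(genes_for_sample("sample:Z"))
```

Query 2 needs two hops (strain to researcher, researcher to name), which is typical of provenance queries: the answer is reached by following a path through the recorded process rather than by reading a single attribute.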

Problems and Limitations

Provenance information in the life sciences is generally difficult to capture and represent, yet the provenance of biomedical data is essential for accurately understanding the significance of results that integrate information from multiple sources.

Technical Challenges:

  • Domain-specific provenance in the life sciences (for example, the instruments and sample types used in an experiment) must be captured and modeled
  • The scale of provenance information in the life sciences is extremely large, so effective storage mechanisms are needed
  • A dedicated query mechanism is required that takes provenance query and data characteristics into account (for example, an efficient pattern matching algorithm to compare the provenance of biomedical data)
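To make the idea of comparing the provenance of two data items concrete, here is a deliberately naive sketch that scores the overlap of two provenance descriptions with the Jaccard index. This is a stand-in chosen for brevity, not the efficient pattern matching algorithm the challenge calls for, and the attribute names and values are hypothetical.

```python
def provenance_overlap(a: set, b: set) -> float:
    """Jaccard index of two sets of (attribute, value) provenance facts."""
    if not a and not b:
        return 1.0  # two empty descriptions are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical provenance of two datasets: same instrument and protocol,
# different sample type.
prov_1 = {
    ("instrument", "instrument-1"),
    ("sample_type", "sample-type-1"),
    ("protocol", "proteome analysis"),
}
prov_2 = {
    ("instrument", "instrument-1"),
    ("sample_type", "sample-type-2"),
    ("protocol", "proteome analysis"),
}

score = provenance_overlap(prov_1, prov_2)
print(score)
```

A flat set comparison like this ignores the structure of the experiment process. Comparing full provenance graphs would require matching nodes and edges, which is where the scale concerns listed above become significant.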

Existing Work

[1] Semantic Provenance for eScience: ‘Meaningful’ Metadata to Manage the Deluge of Scientific Data

[2] Ontology-driven Provenance Management in eScience: an Application in Parasite Research