Domain Specific Provenance 1

From XG Provenance Wiki


Answering domain-specific queries by end users


Paolo Missier, Jun Zhao, Marco Roos, M. Scott Marshall

Provenance Dimensions

  • Primary
    • Use: Understanding

Background and Current Practice Scenario

Provenance metadata has the potential to help users better understand data products and the processes that produced them; an example is given in the scenario section below. Answering the questions that scientists have about their data products requires the ability to store, and later query, domain-specific information. Other use cases in this collection articulate this need in various ways. The need for domain-specific metadata has long been clear to the information retrieval community. More recently, the Semantic Web community has addressed this need by developing domain ontologies for a broad variety of application domains, languages for expressing such ontologies in a standard way, and conventions for sharing and exchanging them within and across communities.


The goal of this use case is to illustrate the role of semantics and domain-specific metadata in answering a variety of users' questions regarding data products that have been created through a known and documented process.

Use Case Scenario

Paul, a bioinformatician, uses a workflow to match an input list of his genes to gene identifiers from both the UniProt and Entrez Gene databases. Using these genes, he then searches the KEGG Pathway database for the encoded proteins and the protein pathways associated with them. Paul would like to be able to find out:

  1. all the genes that participate in some pathway p;
  2. all the pathways derived from UniProt genes;
  3. how a particular data product (such as a pathway p) was derived from other specific data products (say a collection of genes).

(More queries of a similar nature can be devised if needed.)

The question is what kind of domain-specific metadata is needed, or potentially useful, for a system that is aware of Paul's workflows to answer his questions.

Problems and Limitations

Provenance captured during workflow execution, and more generally, provenance that describes the users' interaction with a number of databases, is a natural source of metadata that should be leveraged to answer the users' questions. Specifically, we need:

  1. the "raw" provenance trail for the workflow execution, as well as its structure. OPM provenance graphs can be used for this purpose;
  2. annotations of the nodes in the provenance graph, providing semantic descriptions of the services, data products, and parameters that appear in the scenario.
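
The two ingredients listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the edge list stands in for an OPM-style derivation graph, the annotation dictionary stands in for the semantic layer, and every identifier, annotation key, and function name is hypothetical. Gene "participation" in a pathway is approximated here by lineage, which a real domain ontology might model differently.

```python
# Structural layer: derivation edges as (effect, cause) pairs.
# All identifiers are made-up placeholders, not real database IDs.
DERIVED_FROM = [
    ("uniprot:G1", "input:geneA"),
    ("entrez:G2", "input:geneA"),
    ("kegg:p1", "uniprot:G1"),
    ("kegg:p2", "entrez:G2"),
]

# Annotation layer: domain-specific descriptions attached to graph nodes.
ANNOTATIONS = {
    "input:geneA": {"type": "Gene", "source": "user"},
    "uniprot:G1": {"type": "Gene", "source": "UniProt"},
    "entrez:G2": {"type": "Gene", "source": "Entrez Gene"},
    "kegg:p1": {"type": "Pathway", "source": "KEGG"},
    "kegg:p2": {"type": "Pathway", "source": "KEGG"},
}


def parents_of(edges):
    """Index the edge list as effect -> set of direct causes."""
    index = {}
    for effect, cause in edges:
        index.setdefault(effect, set()).add(cause)
    return index


def ancestors(node, edges=DERIVED_FROM):
    """Every node that `node` was directly or indirectly derived from."""
    index = parents_of(edges)
    seen, stack = set(), [node]
    while stack:
        for parent in index.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_in_pathway(pathway):
    """Query 1: genes in the lineage of a pathway."""
    return sorted(n for n in ancestors(pathway)
                  if ANNOTATIONS[n]["type"] == "Gene")


def pathways_from_source(source):
    """Query 2: pathways derived from genes of a given database."""
    return sorted(
        n for n, a in ANNOTATIONS.items()
        if a["type"] == "Pathway"
        and any(ANNOTATIONS[m]["type"] == "Gene"
                and ANNOTATIONS[m]["source"] == source
                for m in ancestors(n))
    )


def derivation_paths(product, sources, edges=DERIVED_FROM):
    """Query 3: derivation chains from `product` back to any node in `sources`.

    Assumes the graph is acyclic.
    """
    index = parents_of(edges)
    paths = []

    def walk(node, path):
        if node in sources:
            paths.append(path)
        for parent in sorted(index.get(node, ())):
            walk(parent, path + [parent])

    walk(product, [product])
    return paths


print(genes_in_pathway("kegg:p1"))
print(pathways_from_source("UniProt"))
print(derivation_paths("kegg:p1", {"input:geneA"}))
```

Note that queries 1 and 2 cannot be answered from the structural layer alone: the edges say what was derived from what, while the annotations say which nodes are genes or pathways and which database they came from. Using the database identifiers themselves as node keys is one simple way to preserve them.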

The technical challenges include understanding how best to associate annotations with the structural lineage layer, and how to preserve the intrinsic identifiers of data products, for example the UniProt gene IDs associated with the UniProt genes.

Existing work

We are aware of an early prototype in which domain-specific provenance is added to OPM and the resulting semantics-augmented OPM is represented in RDF. It is described in a paper presented at the SWPM'09 workshop (ISWC'09): SWPM'09 paper
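
To give a rough idea of what an RDF-style representation of annotated provenance could look like, the sketch below stores statements as (subject, predicate, object) triples in plain Python and answers one query by pattern matching. It is not the vocabulary or implementation of the prototype mentioned above: the `ex:` and `opm:` prefixes, the property names, and the helper functions are all assumptions made for illustration, and a real system would more likely use an RDF store queried with SPARQL.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# Prefixes, property names, and resources are illustrative placeholders.
TRIPLES = {
    ("ex:pathway1", "rdf:type", "ex:Pathway"),
    ("ex:pathway1", "opm:wasDerivedFrom", "ex:gene1"),
    ("ex:gene1", "rdf:type", "ex:Gene"),
    ("ex:gene1", "ex:fromDatabase", "ex:UniProt"),
    ("ex:pathway2", "rdf:type", "ex:Pathway"),
    ("ex:pathway2", "opm:wasDerivedFrom", "ex:gene2"),
    ("ex:gene2", "rdf:type", "ex:Gene"),
    ("ex:gene2", "ex:fromDatabase", "ex:EntrezGene"),
}


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def pathways_from(database):
    """Pathways directly derived from a gene taken from `database`."""
    genes = {s for s, _, _ in match(p="ex:fromDatabase", o=database)}
    return sorted(
        s for s, _, o in match(p="opm:wasDerivedFrom")
        if o in genes and match(s, "rdf:type", "ex:Pathway")
    )


print(pathways_from("ex:UniProt"))
```

Here the structural statement (`opm:wasDerivedFrom`) and the domain annotations (`rdf:type`, `ex:fromDatabase`) live in the same set of triples, so a single query can combine both layers.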

Semantic extensions to OPM have also been recently proposed in this paper, presented at the 2009 All Hands Meeting, Oxford, UK.

Additionally, [KDG+08] describes reasoning about semantic properties of datasets in the workflow as part of provenance records. [GGR+09] describes how this is done for the case of data collections.