HCLSIG BioRDF Subgroup/MicroarrayProvenanceUseCase

This document was created to share a provenance-related use case with the W3C Provenance Incubator Group.

Owner

Background

This use case comes from the Microarray Use case of the BioRDF Task force of the W3C HCLS Interest Group (http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/QueryFederation2) and is based on the initial experiment and documentation by SatyaSahoo.

Only a limited number of patient samples can be studied in one microarray experiment. Hence, it is very important to merge together data from different studies in order to increase the sample size and improve the results from statistical analysis of the microarray data. Provenance information about each microarray experiment will facilitate the integration of data while retaining the ownership information. Details of the experiment conditions can be used to infer the dataset quality and to improve the reproducibility of experiments.

Several standards have been developed by the microarray community. There are two MIAME (Minimum Information about A Microarray Experiment) [1] compliant standard data formats developed by the MGED community [2], namely MAGE-ML (MicroArray Gene Expression Markup Language) [3] and MAGE-TAB [4]. These standards have been used to represent microarray experiments, and documents in these formats normally contain rich provenance-related information about the experiments, the experiment protocol and the data protocol. Some provenance-related information can be extracted from experiments documented in these standard formats, including:

organization performing the experiments;
domain-specific information about the data sources used in the experiments, including the organism of each data source, its cell type, ethical origin, disease state, etc.
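
As a rough illustration of this kind of extraction, the sketch below reads a small tab-delimited fragment laid out like a MAGE-TAB IDF file (one tag per row, values in the following cells) and keeps only the provenance-related rows. The sample text, the chosen tag list and the function name extract_provenance are assumptions made for this example; real files should be checked against the MAGE-TAB specification.

```python
import csv
import io

# Hypothetical IDF-style fragment: the tag is in the first cell of each row,
# and its values follow in the remaining tab-separated cells.
SAMPLE_IDF = (
    "Investigation Title\tExample brain expression study\n"
    "Person Last Name\tDoe\tRoe\n"
    "Person Affiliation\tExample University\tExample Institute\n"
    "Protocol Name\tP-EXAMPLE-1\n"
)

# Assumed subset of tags that carry provenance-related information.
PROVENANCE_TAGS = {"Investigation Title", "Person Affiliation", "Protocol Name"}


def extract_provenance(idf_text):
    """Return {tag: [values]} for the provenance-related tags in idf_text."""
    provenance = {}
    for row in csv.reader(io.StringIO(idf_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        tag = row[0].strip()
        values = [cell.strip() for cell in row[1:] if cell.strip()]
        if tag in PROVENANCE_TAGS:
            provenance[tag] = values
    return provenance


idf_provenance = extract_provenance(SAMPLE_IDF)
print(idf_provenance)
```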

Goal

Enable collection, representation, and use of provenance information for ownership and quality-aware integration of microarray experiment datasets.

Use Case Scenario

A bioinformatician, B, wants to find related microarray experiments in order to assemble a bigger study. B wants to find experiments that study gene expression in a specific region of the human brain. Once B finds all relevant microarray experiments, B selects a subset of these experiments based on B's knowledge about the organizations that performed the experiments or about where the experiments have been published. B wants to retrieve all the gene expression data produced by the selected experiments, so that B can perform statistical analysis on this larger dataset. B wants the provenance of each dataset to remain associated with the data so that ownership can be retained.
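
The steps of this scenario can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the Experiment record, its fields, the accession strings, the organization names and the expression values are all hypothetical, and a real system would query a repository rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    accession: str      # hypothetical experiment identifier
    organization: str   # organization that performed the experiment
    organism: str
    brain_region: str
    expression: dict    # gene symbol -> expression value (made-up numbers)


def find_relevant(experiments, organism, brain_region):
    """Step 1: find experiments on the given organism and brain region."""
    return [e for e in experiments
            if e.organism == organism and e.brain_region == brain_region]


def select_trusted(experiments, trusted_organizations):
    """Step 2: keep experiments performed by organizations B trusts."""
    return [e for e in experiments if e.organization in trusted_organizations]


def merge_with_provenance(experiments):
    """Step 3: merge the data, tagging every value with its source."""
    merged = []
    for e in experiments:
        for gene, value in e.expression.items():
            merged.append({
                "gene": gene,
                "value": value,
                "source_experiment": e.accession,
                "source_organization": e.organization,
            })
    return merged


# Hypothetical catalog of experiments.
CATALOG = [
    Experiment("EXP-1", "Lab A", "Homo sapiens", "hippocampus", {"GENE1": 2.5, "GENE2": 0.7}),
    Experiment("EXP-2", "Lab B", "Homo sapiens", "hippocampus", {"GENE1": 3.1}),
    Experiment("EXP-3", "Lab A", "Mus musculus", "hippocampus", {"GENE1": 1.9}),
    Experiment("EXP-4", "Lab C", "Homo sapiens", "cerebellum", {"GENE2": 0.4}),
    Experiment("EXP-5", "Lab D", "Homo sapiens", "hippocampus", {"GENE2": 1.2}),
]

relevant = find_relevant(CATALOG, "Homo sapiens", "hippocampus")
selected = select_trusted(relevant, {"Lab A", "Lab B"})
merged_rows = merge_with_provenance(selected)
for row in merged_rows:
    print(row)
```

Because every merged row carries its source experiment and organization, ownership is still recoverable after the datasets have been combined.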

Challenges and Potential Solutions

We need a common provenance model that can represent complex domain-specific provenance information about these experiments. We need to be able to automatically process microarray experiments documented in different standards and extract provenance data from these documents. Finally, we need a well-defined query mechanism to automatically analyze the provenance information and help in integrating suitable microarray datasets.

A potential solution to these challenges should re-use the large number of existing biomedical domain ontologies (listed at the National Center for Biomedical Ontology) wherever possible, and should also leverage or extend existing Semantic Web technologies.

We need provenance information to be portable, so that when data are moved out of a system their provenance information can be moved along with them.
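
One simple way to picture portability is a bundle in which the data and their provenance are serialized together, so that neither can be exported without the other. The sketch below uses JSON purely as an assumed carrier format; the bundle layout, the key names and the functions export_bundle and import_bundle are hypothetical and not part of any standard mentioned here.

```python
import json


def export_bundle(data_rows, provenance):
    """Serialize data and provenance into one self-describing text bundle."""
    return json.dumps({"provenance": provenance, "data": data_rows}, sort_keys=True)


def import_bundle(text):
    """Load a bundle and refuse it if either part is missing."""
    bundle = json.loads(text)
    if "provenance" not in bundle or "data" not in bundle:
        raise ValueError("bundle must contain both 'data' and 'provenance'")
    return bundle["data"], bundle["provenance"]


# Hypothetical data and provenance, for illustration only.
rows = [{"gene": "GENE1", "value": 2.5}]
prov = {"organization": "Lab A", "protocol": "P-EXAMPLE-1"}

exported = export_bundle(rows, prov)
restored_rows, restored_prov = import_bundle(exported)
print(restored_rows, restored_prov)
```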

References

[1] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 29(4):365-71, 2001.

[2] http://www.mged.org/

[3] Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, Brazma A. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3(9):RESEARCH0046, 2002.

[4] Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Farne A, Holloway E, Irizarry RA, Liu J, Maier DS, Miller M, Petersen K, Quackenbush J, Sherlock G, Stoeckert CJ Jr, White J, Whetzel PL, Wymore F, Parkinson H, Sarkans U, Ball CA, Brazma A. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7:489, 2006.