HCLS/ISWC2007/BOF/Hunter


The BioMANTA Project – Implementing A Scale-Out Architecture for Expediting the Semantic Integration and Processing of Biomolecular Data

Andrew Newman¹, Melissa Davis¹, Imran Khan¹, Yuan Fang Li¹, Chris Bouton² and Jane Hunter¹

¹ University of Queensland, St Lucia, Brisbane, QLD 4072, Australia; ² Pfizer Research Technology Center, Cambridge, USA

BioMANTA is a collaborative project between Pfizer Research and the University of Queensland that aims to facilitate in silico drug discovery and development by identifying candidate therapeutic targets through the analysis of integrated datasets that relate molecular interactions and biochemical pathways to physiological effects such as compound toxicology and gene-disease associations. Protein-protein interaction and biomolecular pathway data hold tremendous potential for the drug discovery and development process. However, these data are currently distributed across a range of large, disparate databases, and integration, manipulation, processing and analysis of the data are required in order to yield potential new discoveries. A number of projects have investigated the Semantic Web as a solution to this problem, using RDF and OWL to integrate, represent and analyse protein interaction data. However, existing Semantic Web data stores have had difficulty scaling to large biological datasets because they rely largely upon traditional database techniques. So while ontologies and other Semantic Web technologies such as RDF can provide the ability to integrate, reason over and process data sets, the magnitude of the processing and the size of the data sets have traditionally prevented a speedy, efficient end-to-end solution. Inferencing is typically limited either to basic operations across large amounts of data or to richer inferencing over small amounts of data; within the BioMANTA project we require rich, complex inferencing over large amounts of data.

In order to process data at this scale quickly, a parallel data-processing technique known as MapReduce [4] is becoming increasingly popular. It provides a common way to solve general processing problems and is closely aligned with the way data is acquired from experiments or simulations [3]. In a MapReduce system, a map function takes input key/value pairs and transforms them into output key/value pairs; a reduce function then takes the set of values associated with each unique key and produces output values (a minimal sketch of this model follows the list below). MapReduce libraries are available for most popular languages, including Java, JavaScript, C++, Perl, C#, Python, Ruby and Scala. The advantages of this architecture are numerous [4, 5], including:

  • A programming model that is abstract, simple, highly parallel, restricted, powerful, easy to maintain, and easy to learn;
  • An ability to efficiently leverage low-end commodity hardware;
  • Easy deployment on hundreds to thousands of nodes on internal or external hosting services; and
  • Robustness, including the ability to handle data corruption and the loss of individual nodes.
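
To make the model concrete, below is a minimal, self-contained Java sketch of the map/reduce contract just described, using the customary word-count example (all names in it are illustrative and not BioMANTA code): map() turns each input record into key/value pairs, a shuffle step groups the values by key, and reduce() folds each group into an output value.

    import java.util.AbstractMap;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapReduceSketch {

        // map: transform one input record into (key, value) pairs.
        // Here the record is a line of text and each word maps to (word, 1).
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> pairs =
                    new ArrayList<Map.Entry<String, Integer>>();
            for (String word : line.split("\\s+")) {
                if (word.length() > 0) {
                    pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
                }
            }
            return pairs;
        }

        // reduce: fold all values that share a key into one output value.
        static int reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return sum;
        }

        public static void main(String[] args) {
            String[] input = { "to be or not to be" };

            // "Shuffle": group the mapped values by key, as the framework
            // would do between the map and reduce phases.
            Map<String, List<Integer>> groups =
                    new TreeMap<String, List<Integer>>();
            for (String record : input) {
                for (Map.Entry<String, Integer> pair : map(record)) {
                    List<Integer> values = groups.get(pair.getKey());
                    if (values == null) {
                        values = new ArrayList<Integer>();
                        groups.put(pair.getKey(), values);
                    }
                    values.add(pair.getValue());
                }
            }
            for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
                System.out.println(group.getKey() + "\t"
                        + reduce(group.getKey(), group.getValue()));
            }
        }
    }

In a real framework such as Hadoop, the shuffle performed in main() above is carried out by the runtime across the cluster rather than in memory, which is what allows the same two user-supplied functions to scale to thousands of nodes.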

Our hypothesis is that Semantic Web applications can benefit from combining a scale-out architecture with MapReduce data processing in order to speed up querying, inferencing and processing over large RDF triple stores of scientific and biomedical data. In this presentation we describe an alternative, distributed approach to the semantic querying and inferencing of biological data, based on the Hadoop “scale-out” architecture. This novel approach distributes the RDF triples, analysis and semantic inferencing across a computing cluster in order to improve scalability and performance, generate cost benefits, and reduce implementation and deployment difficulties; a sketch of how triples might be partitioned across such a cluster follows the list below. We evaluate and apply this work in the context of the BioMANTA project, which aims to use Semantic Web technologies to integrate, model and analyse large sets of protein-protein interaction and pathway data. This presentation will describe:

  • The underlying OWL ontology that we have developed;
  • The datasets that we have mapped to the ontology and represented as RDF molecules;
  • The “scale-out” architecture based on Hadoop and MapReduce that we have implemented;
  • Examples of the kinds of SPARQL queries and reasoning across large, disparate biomedical datasets that we aim to support; and
  • A summary of the results to date and future work plans.
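
To illustrate the triple-distribution idea referenced above, here is a hypothetical sketch against Hadoop's early (org.apache.hadoop.mapred) API: the map phase keys each N-Triples statement by its subject so that, after the shuffle, all triples describing the same resource arrive at a single reducer, roughly approximating how triples can be grouped into RDF molecules. The class names and the choice of N-Triples input are assumptions made for illustration only; this is not BioMANTA's actual implementation.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TripleGrouping {

        // Map: parse one N-Triples line and emit (subject, full triple).
        public static class SubjectMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, Text> output,
                            Reporter reporter) throws IOException {
                String triple = line.toString().trim();
                int firstSpace = triple.indexOf(' ');
                if (firstSpace <= 0) {
                    return; // skip blank or malformed lines
                }
                String subject = triple.substring(0, firstSpace);
                output.collect(new Text(subject), new Text(triple));
            }
        }

        // Reduce: all triples sharing a subject arrive together, so they
        // can be emitted (or reasoned over) as a single group.
        public static class SubjectReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text subject, Iterator<Text> triples,
                               OutputCollector<Text, Text> output,
                               Reporter reporter) throws IOException {
                StringBuilder group = new StringBuilder();
                while (triples.hasNext()) {
                    group.append(triples.next().toString()).append('\n');
                }
                output.collect(subject, new Text(group.toString()));
            }
        }
    }

Because the shuffle co-locates everything stated about a given subject, simple per-resource processing and inferencing can, in principle, run locally on each reducer without cross-node communication.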