See also HCLS/Banff2007Demo
The date of the demonstration is Thursday, 10 May 2007.
The objective of this demonstration is to show the value of Semantic Web technology to health care and the life sciences by highlighting its benefits.
Detailed Use Case Sketch
- Detailed Use Case Sketch from the March 5th Face to Face
- Latest Version of ../DemoScript
Benefits of the Semantic Web Illustrated in the Demonstration
Many web sites disseminate their data without providing machine-readable semantics. For example, flat file formats are widely used for data dissemination, but they make machine querying, inferencing, and reasoning difficult if not impossible. This demonstration implements the AD/PD use case and illustrates how to address these problems by providing the data sources for the use case in RDF or OWL, the standard languages of the Semantic Web. Existing RDF databases (e.g., Oracle [oracle spatial] and Sesame [sesame]) and OWL reasoners (e.g., Racer [racer] and Pellet [pellet]) can be used to query and reason about data in this demonstration. The benefits of the Semantic Web illustrated in the demonstration include knowledge representation, knowledge integration, and knowledge discovery.
The Semantic Web languages OWL and RDF enable knowledge representation by means of a knowledge base, i.e., a knowledge repository - not just a database. A knowledge base consists of two parts: ontology and individuals, i.e., instances of classes in the ontology. There are two fundamental differences between a knowledge base and a database: formal definition of semantics (i.e., the "meaning" of the elements and their relationships) and automated reasoning. Mathematical rigor for the OWL formal definition of semantics and automated reasoning comes from Description Logic [dl handbook].
Formal Definition of Semantics
OWL has a formal definition for the semantics of an OWL knowledge base, i.e., given a knowledge base, the associated semantics are primarily provided explicitly within the knowledge base itself. Neither of the commonly used information resource technologies, XML (including its associated technologies, e.g., XPath, XQuery, and XML databases) and relational databases (RDB, including the associated language SQL [sql]), has both a formal definition of semantics and automated reasoning.
XML is a grammar writing system with no defined relationship between a given schema and its semantic meaning. An XML schema is simply a grammar. Any semantics represented by that schema and its associated documents are specified external to those representations, e.g., in documentation.
RDB has a formal definition of semantics, but no automated reasoning capability. Section 4.3 and chapter 16 of [dl handbook] compare in some detail the similarities and differences between the formal definition of semantics in RDB and the formal definition of semantics in Description Logic.
Given that a knowledge base is represented in OWL, it becomes amenable to automated reasoning for the purpose of validating and augmenting the knowledge represented. There are three reasoning tasks that can be automated (see section 2.2 of [dl handbook]):
- Satisfiability: ensures that every defined class can be non-empty.
- Subsumption: determines class hierarchy.
- Consistency: identifies class membership for individuals.
These tasks help validate that there are no contradictions in class definitions and class membership.
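The three tasks can be illustrated with a minimal Python sketch. It is not a Description Logic reasoner: it handles only told subclass axioms and disjointness axioms, whereas reasoners such as Racer and Pellet handle far richer constructs. All class names and axioms below are hypothetical and chosen only for illustration.

```python
# Toy ontology (hypothetical): told subclass axioms and disjointness axioms.
SUBCLASS_OF = {
    "CA1PyramidalNeuron": {"PyramidalNeuron"},
    "PyramidalNeuron": {"Neuron"},
    "Neuron": {"Cell"},
    "Peptide": {"Molecule"},
    # Deliberately contradictory class: declared both a Neuron and a Peptide.
    "NeuronPeptide": {"Neuron", "Peptide"},
}
DISJOINT = [("Cell", "Molecule")]


def ancestors(cls):
    """Return cls together with every class reachable via subclass axioms."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def is_satisfiable(cls):
    """A class is unsatisfiable here if it falls under two disjoint classes."""
    supers = ancestors(cls)
    return not any(a in supers and b in supers for a, b in DISJOINT)


def subsumes(general, specific):
    """True if every instance of `specific` must also be in `general`."""
    return general in ancestors(specific)


def classify_individual(asserted_types):
    """Return every class an individual belongs to, given its asserted types."""
    inferred = set()
    for cls in asserted_types:
        inferred |= ancestors(cls)
    return inferred
```

In this sketch, `is_satisfiable("NeuronPeptide")` is false because the class inherits from both `Cell` and `Molecule`, which are declared disjoint; this is the kind of contradiction that satisfiability checking is meant to surface.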
RDB has the capability to define constraints on data within tables in the database. However, there is no capability for automatically checking for contradictions within the set of constraints. With OWL, satisfiability helps ensure that there are no contradictions among classes (see [nistir] for a simple example). In addition, if a query is modeled as a class definition, satisfiability ensures that it is possible for that query to return results. RDBs have no automated tools to check that a given query does not contradict constraints on data within tables. A query that contradicts constraints on data within tables will never return anything.
These reasoning tasks also augment the knowledge in the knowledge base. Subsumption computes class hierarchy for those classes whose position in the hierarchy was not explicitly specified, and for those individuals whose class membership was not fully specified, consistency places each individual in the classes where it is a member. Not only do these reasoning tasks augment knowledge in the knowledge base, they also help ensure a knowledge base's validity. Following the subsumption and consistency tasks, erroneous and unintended class definitions can usually be identified.
For a knowledge base represented in OWL DL, these reasoning tasks are always decidable and fully automated. Not all knowledge is expressible in OWL (see section 5.4 of [horrocks]). Furthermore, as a result of the open world semantics of OWL, some knowledge/information constructs are difficult or impossible to represent in OWL (see section 188.8.131.52 of [dl handbook] and [alan]).
Currently, Semantic Web reasoning tools are only capable of fully applying these three reasoning tasks to knowledge bases which can reside in memory. Such knowledge bases (which can be as small as tens of thousands of triples) are not sufficient for most applications. Most applications require at least hundreds of thousands of triples. For those applications, there are several tools available that enable query inferencing on the full knowledge base, e.g., [sesame], [oracle spatial]. Query inferencing makes use of role properties, e.g., symmetry, transitivity, to reduce the complexity of queries and in some cases, to reduce the size of the knowledge base. Full reasoning can usually be applied to the ontology itself (perhaps, with some sample of individuals) so that some level of validation and augmentation of the knowledge base can be automatically applied.
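As a rough illustration of query inferencing, the Python sketch below expands a small set of asserted pairs using transitivity and symmetry, so that a single lookup replaces a chain of joins. It does not reflect how Sesame or Oracle implement inferencing; the triples and the `interacts_with` property are hypothetical.

```python
def transitive_closure(pairs):
    """Return the given (x, y) pairs plus every (x, z) implied by chaining."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def symmetric_closure(pairs):
    """Return the given (x, y) pairs plus the reversed pair (y, x) for each."""
    return set(pairs) | {(b, a) for a, b in pairs}


def parts_of(whole, pairs):
    """List everything related to `whole` by the given part_of pairs."""
    return sorted(part for part, w in pairs if w == whole)


# Hypothetical asserted facts; only direct relationships are stored.
PART_OF = {
    ("CA1", "Hippocampus"),
    ("Hippocampus", "Temporal lobe"),
    ("Temporal lobe", "Brain"),
}
INTERACTS_WITH = {("ProteinA", "ProteinB")}

asserted_only = parts_of("Brain", PART_OF)
with_inference = parts_of("Brain", transitive_closure(PART_OF))
```

Here `asserted_only` contains just the direct part, while `with_inference` also contains the indirect parts. Without the transitive property, the same answer would require the query itself to walk the hierarchy level by level.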
Semantic Web technology enables the integration of knowledge bases with each other, and with datasets represented using non-semantic web technologies. Integrating such resources requires that the semantics represented within them is harmonized.
OWL/RDF enables resources to be linked semantically, whereas Web pages, if they are linked at all, are not: HTML links are simply references with no semantic meaning. Separate neuroscience databases may contain different concepts, e.g., neurons such as CA1 pyramidal, peptides such as beta amyloid, drugs, and antibodies, that can potentially be linked together in a meaningful way. For example, beta amyloid "is_neurotoxic_to" CA1 pyramidal, and some drug or antibody "reduces" such neurotoxicity. Semantic Web technology allows these links to be created between different resources so that data aggregation and integration can be more easily implemented.
Integrating datasets using XML is difficult and not consistent with the very dynamic nature of some applications. See [xiaoshu] for details.
With RDB, integrating datasets can be accomplished by creating queries which access tables in different schemas located in disparate databases. The queries become the concepts which accomplish the integration. Integration may also require new tables and constraints that relate concepts and terms. As is the case with XML, this process is problematic, and as previously described, neither formal definition of semantics nor automated reasoning is available.
With Semantic Web technology, the ontology is the method used to harmonize semantics associated with different resources. The process can be generally described as:
- Create an annotation layer on top of existing resources using standard vocabulary and ontologies.
- Harmonize concepts in existing resources by providing semantic links between concepts.
- Provide data in the existing resources in OWL or RDF format.
For some examples, see [eric], [susie], [kei paper], [scott], and chapter 12 in [kei book].
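The harmonization steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two sources, their local identifiers, the `ex:` shared vocabulary, and the `located_in` property are all hypothetical, and plain tuples stand in for RDF triples.

```python
# Two hypothetical sources that name the same neuron with different identifiers.
NEURON_DB = [
    ("ndb:ca1_pyr", "located_in", "ndb:hippocampus"),
]
PEPTIDE_DB = [
    ("pep:abeta", "is_neurotoxic_to", "pep:CA1_pyramidal_cell"),
]

# Semantic links from local identifiers to a shared (hypothetical) vocabulary.
SHARED_TERMS = {
    "ndb:ca1_pyr": "ex:CA1PyramidalNeuron",
    "pep:CA1_pyramidal_cell": "ex:CA1PyramidalNeuron",
}


def harmonize(triples, mapping):
    """Rewrite subjects and objects to shared terms where a link exists."""
    return [(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples]


def merge(*sources):
    """Combine several triple lists into one graph (a set of triples)."""
    graph = set()
    for triples in sources:
        graph.update(triples)
    return graph


def neurotoxic_by_region(graph):
    """Join: which agents are neurotoxic to a neuron located in which region?"""
    results = set()
    for agent, p1, neuron in graph:
        if p1 != "is_neurotoxic_to":
            continue
        for subject, p2, region in graph:
            if p2 == "located_in" and subject == neuron:
                results.add((agent, region))
    return results


raw_graph = merge(NEURON_DB, PEPTIDE_DB)
linked_graph = merge(harmonize(NEURON_DB, SHARED_TERMS),
                     harmonize(PEPTIDE_DB, SHARED_TERMS))
```

The join over `raw_graph` finds nothing because the two sources never use the same identifier for the neuron; over `linked_graph` it connects the peptide to the brain region through the shared term.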
Many web resources, including neuroscience resources, lack a standard description, making resource discovery difficult. Keyword-based search engines like Google are limited in terms of specificity and sensitivity. For example, discovering which microarray databases contain gene expression data in both hippocampus and striatum is difficult and time consuming using keyword-based search engines. Each of the individual microarray database web sites (assuming that the user knows about these databases) must be searched, using each site's custom search interface, to discover if such gene expression datasets exist.
OWL/RDF enables each of the microarray data sources to be annotated using a standard vocabulary or ontology that includes things like domain type (e.g., neuroscience), organism (e.g., human, rat, and/or mouse), and brain region. When hierarchical relationships (e.g., specific brain regions vs. general brain regions) have been defined, inference may be used to guide knowledge discovery. For example, if the user searches based on a specific brain region such as Neostriatum, inference can be performed to allow more general regions (e.g., Basal Ganglia) to be included in order to make the search broader. Such inferencing is possible because the semantics of OWL/RDF are formally defined.
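A minimal Python sketch of this kind of hierarchy-guided search follows. The dataset names and annotations are hypothetical, and the region hierarchy is reduced to a single "broader region" link per region for brevity.

```python
# Hypothetical hierarchy: each region points to its more general region.
BROADER = {
    "Neostriatum": "Basal Ganglia",
    "Basal Ganglia": "Brain",
    "Hippocampus": "Brain",
}

# Hypothetical annotations of microarray data sources.
DATASETS = {
    "dataset_A": {"domain": "neuroscience", "organism": "mouse", "region": "Neostriatum"},
    "dataset_B": {"domain": "neuroscience", "organism": "rat", "region": "Basal Ganglia"},
    "dataset_C": {"domain": "neuroscience", "organism": "human", "region": "Hippocampus"},
}


def broader_regions(region, levels):
    """Return the region plus up to `levels` of more general regions."""
    regions = {region}
    current = region
    for _ in range(levels):
        current = BROADER.get(current)
        if current is None:
            break
        regions.add(current)
    return regions


def search(region, levels=0):
    """Find datasets annotated with the region or one of its broader regions."""
    wanted = broader_regions(region, levels)
    return sorted(name for name, ann in DATASETS.items() if ann["region"] in wanted)
```

Calling `search("Neostriatum")` matches only the dataset annotated with that exact region, while `search("Neostriatum", levels=1)` also includes the dataset annotated with Basal Ganglia.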
How These Benefits are Demonstrated
- Query 2: Genes, CA1 Pyramidal Neurons and signal transduction processes from the Banff Presentation illustrates the sensitivity, precision, and efficiency of queries to a Semantic Web knowledge base that integrates several existing datasets (see Knowledge Discovery and Knowledge Integration). This query replaces the several queries that would be needed to query each dataset in order, i.e., developing a query for a dataset, reducing the number of responses to obtain the information desired, and forming a query for the next dataset in order.
- Query 2: Genes, CA1 Pyramidal Neurons and signal transduction processes from the Banff Presentation also illustrates query inferencing (see Automated Reasoning). This query takes advantage of the transitivity of http://purl.org/obo/owl/obo#part_of in the demonstration's ontology and thus is simpler than the equivalent query in an RDB.
- Query 1: Research statements about therapeutic targets from the HCLS DEMO USE CASE SCRIPT can be modeled using the defined class:
swan:ResearchStatement and (swan:citesConcept some swan:TherapeuticTarget)
Satisfiability, one of the reasoning tasks, shows that this query is not contradictory with the rest of the ontology and, consequently, that it is possible for this query to return results. Moreover, subsumption, another reasoning task, identifies this class in the ontology's class hierarchy as a subclass of swan:ResearchStatement, augmenting the knowledge in the knowledge base (see Automated Reasoning).
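To make the defined class for Query 1 concrete, the Python sketch below evaluates the same condition directly over a handful of hypothetical individuals. It is a closed-world check over asserted data rather than OWL reasoning, and the individual names are invented for illustration.

```python
# Hypothetical individuals and their asserted classes.
TYPES = {
    "statement1": {"ResearchStatement"},
    "statement2": {"ResearchStatement"},
    "target1": {"TherapeuticTarget"},
    "gene1": {"Gene"},
}

# Hypothetical citesConcept assertions.
CITES_CONCEPT = {
    "statement1": {"target1"},
    "statement2": {"gene1"},
}


def in_defined_class(individual):
    """ResearchStatement and (citesConcept some TherapeuticTarget)."""
    is_statement = "ResearchStatement" in TYPES.get(individual, set())
    cites_target = any(
        "TherapeuticTarget" in TYPES.get(concept, set())
        for concept in CITES_CONCEPT.get(individual, ())
    )
    return is_statement and cites_target


members = sorted(ind for ind in TYPES if in_defined_class(ind))
```

Every member found this way is necessarily a `ResearchStatement`, which mirrors the subsumption result described above: the defined class sits below `swan:ResearchStatement` in the hierarchy.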
See HCLSIG Demo QueryScratch for useful queries on the way to the demo.
[oracle spatial] Oracle Spatial Resource Description Framework (RDF), 10g Release 2 (10.2), B19307-03. http://download-east.oracle.com/otndocs/tech/semantic_web/pdf/rdfrm.pdf
[dl handbook] Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. The Description Logic Handbook: Theory, Implementation, and Application (Cambridge University Press, 2003).
[sql] Database Language – SQL, ISO 9075, 1992
[nistir] John Barkley, Using Semantic Web Methods to Improve Information Resource Quality, NIST Internal Report 7354, September 2006. http://www.itl.nist.gov/div897/staff/barkley/consistency-validation-OWLvXML-RDB-9-29-06.pdf
[horrocks] Horrocks, I., Patel-Schneider, P. F., van Harmelen, F. From SHIQ and RDF to OWL: The making of a web ontology language. J. of Web Semantics, 1(1):7-26, 2003. http://www.cs.man.ac.uk/~horrocks/Publications/download/2003/HoPH03a.pdf
[alan] Alan Ruttenberg, Jonathan A. Rees, Joanne S. Luciano, Experience Using OWL DL for the Exchange of Biological Pathway Information, OWL: Experiences and Direction Workshop, Galway Ireland, November 2005. http://www.mindswap.org/2005/OWLWorkshop/sub37.pdf
[eric] Biodash: Eric K. Neumann and Dennis Quan, A Semantic Web Dashboard for Drug Development, Pacific Symposium on Biocomputing 11:176-187(2006). http://helix-web.stanford.edu/psb06/neumann.pdf
[susie] Stephens, Susie; LaVigna, David; DiLascio, Mike; Luciano, Joanne. Aggregation of Bioinformatics Data Using Semantic Web Technology. In: Journal of Web Semantics, (4)3, 2006. http://www.websemanticsjournal.org/ps/pub/showDoc.Fulltext/document.pdf?lang=en&doc=2006-15&format=pdf&compression=
[kei paper] Hugo Y.K. Lam, Luis Marenco, Tim Clark, Yong Gao, June Kinoshita, Gordon Shepherd, Perry Miller, Elizabeth Wu, Gwen Wong, Nian Liu, Chiquito Crasto, Thomas Morse, Susie Stephens, and Kei-Hoi Cheung, Semantic Web Meets e-Neuroscience: An RDF Use Case, 43rd Annual Technical Meeting Society of Engineering Science (SES2006). http://www.oracle.com/technology/industries/life_sciences/press/semantic_web_meets_eneuroscience.pdf
[scott] M. Scott Marshall, Lennart Post, Marco Roos, Timo M. Breit, Using semantic web tools to integrate experimental measurement data on our own terms, International Workshop on Knowledge Systems in Bioinformatics (KSinBIT'06), Montpellier, France, 2006. http://integrativebioinformatics.nl/docs/MarshallKSinBIT.pdf
[kei book] Baker, Christopher J.O.; Cheung, Kei-Hoi (Eds.), Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences (Springer 2007)
[xiaoshu] Xiaoshu Wang, Robert Gorlitsky & Jonas S Almeida, From XML to RDF: how semantic web technologies will change the design of 'omic' standards, Nature Biotechnology 23, 1099 - 1103 (2005). http://www.nature.com/nbt/journal/v23/n9/pdf/nbt1139.pdf
We are limiting the scope of this demo to only those SW technologies, data and resources that can be pulled together and highlighted in the next few months.
The top-level scenario is that two researchers, one working on AD and the other on PD, meet and start discussing their research. It occurs to them that there are overlaps in the questions they are asking and that one's data may be useful to the other. So one agrees to make his data available to the other.
What needs to happen: specific queries need to be articulated, and the relevant databases and tools need to be identified.
(add your name)
- Joanne Luciano
- Bill Bug
- Kei Cheung
- Gwen Wong
- Elizabeth Wu
- June Kinoshita
- Tim Clark
- Susie Stephens
- Donald Doherty
- Ray Hookway
Use case context
Problem statement for this use case
Information Resources Used
RDF/OWL Representation Schemes for Various Data and Knowledge Sources
- Proposed RDF representation for Entrez Gene and Homologene by Ray and Vipul
- Proposed RDF representation of Entrez Gene and Homologene by Olivier and Satya Sahoo
- Proposed RDF Representation of NeuronDB data by Kei Cheung and Matthias Samwald
- Proposed RDF Representation of PubMed annotations by Jonathan Rees
- Proposed RDF Representation of BAMS
- Proposed RDF Representation of Allen Brain Atlas
- Proposed RDF Representation of SWAN by the SWAN Group
- Proposed RDF Representation of Homologene
- Proposed OWL Representation of GO
- Proposed OWL Representation of Mammalian Phenotype
- Proposed OWL Representation of PDSP KI
- Proposed SKOS/OWL Representation of MeSH
- Proposed RDF Representation of AlzGene
- Proposed RDF Representation of PubChem
- Proposed RDF Representation of Antibodies
- Proposed RDF Representation of Reactome
BAMS (Brain Architecture Management System)
- Part names: http://hissa.nist.gov/jb/biordf-demo/bams_part_names.txt
- Cell names: http://hissa.nist.gov/jb/biordf-demo/bams_cell_names.txt
- Molecule names: http://hissa.nist.gov/jb/biordf-demo/bams_molecule_names.txt
- 4-23-07 knowledge base of swanson-98.xml (5.4 MB): http://svn.neurocommons.org/svn/trunk/convert/bams/bams-from-swanson-98-4-23-07.owl.txt
PubChem CompoundsDB and therapeutic agents
PubChem Resource and candidate therapeutics agents: http://esw.w3.org/topic/HCLSIG/DEMO/ChemicalAgents
Here are links to other projects or pieces of projects that highlight the use of semantic web in HC and the LS.
Task supports and dependencies
- Generate Use Case Script (the two researchers meeting over coffee) that incorporates AD and PD
- Identify data sets to use in this specific use case
- Link the data (in RDF)
- User Interface
- Demo Script -
Tools and Services
- We are using OpenLink Virtuoso as our RDF store. Note that they have an open source version and have been good at supplying technical support.
- Hewlett-Packard has allowed us the use of two HP ProLiant DL360 servers.
- prefuse.org has UI elements that AlanRuttenberg has code to use.
- Timeline (e.g. E.Coli parts - use firefox)