Helping Drug Discovery Through Hypothesis-Based Knowledge Bases
Case Study Rhetorical Article Model SWASD Task
November 27, 2009, Anita de Waard
Pharmaceutical researchers need an integrated view of all of their data to make effective decisions about which drug targets and compounds to pursue. Companies want to minimize costly late-stage attrition by identifying and eliminating, as early as possible, drugs that lack a desirable safety profile or sufficient efficacy. The need for effective data integration has grown as the cost of drug discovery and development has soared to over $1 billion.
The integration of data, however, has proven to be far from straightforward. Data commonly originates in different departments that use varying terminologies. Further, the data itself is highly heterogeneous: it includes electronic patient records, chemical structures, biological sequences, images, biological pathways, and scientific papers. Many companies have attempted to create data warehouses that contain all of this data, but many have found that this approach lacks the flexibility required within a scientific discipline. Consequently, many pharmaceutical companies are exploring the use of the Semantic Web for data integration.
Increasingly, pharmacologists are also using semantically based computational-linguistic technologies to access the vast quantities of data produced by internal and external sources. However, despite a flowering of data integration, text mining and fact extraction systems, these systems currently extract only textual occurrences of terms of interest, and not the knowledge context or ‘epistemic’ (truth-value) status of these terms as they are mentioned in scientific claims. What is needed is a new level of markup, which we are calling ‘Epistemic Markup’, that gives the user access to the knowledge claims, linked to experimental evidence, that form the argumentational backbone of the article.
Pharmacology is¹ “the science of drug action on biological systems. In its entirety, it embraces knowledge of the sources, chemical properties, biological effects and therapeutic uses of drugs. It is a science that is basic not only to medicine, but also to pharmacy, nursing, dentistry and veterinary medicine. Pharmacological studies range from those that determine the effects of chemical agents upon subcellular mechanisms, to those that deal with the potential hazards of pesticides and herbicides, to those that focus on the treatment and prevention of major diseases by drug therapy. Pharmacologists are also involved in molecular modeling of drugs, and the use of drugs as tools to dissect aspects of cell function.”
Pharmacologists are involved in the molecular modeling of drugs, and the use of drugs as probes to better understand all aspects of cell function. There are two major divisions of pharmacological research: one is “academic”, the other “industrial”:
- The goal of academic pharmacologists is to understand the way in which physiological systems work. They achieve this by studying the effects of drugs on these systems, and by conducting basic scientific inquiry into specific cellular and molecular systems. Most basic research in pharmacology provides fundamental scientific information that crosses disciplines. Data about the molecular characteristics of drug targets (the biological molecules that interact with the drugs) is increasing rapidly.
- The goal of industrial pharmacologists is to design and identify therapeutic drugs for (mostly) human diseases. Their work tends to focus on conducting feasibility studies for new avenues of drug discovery and devising assay systems for screening compounds. However, in each company there will be a cohort of researchers whose work patterns are almost exactly the same as those of academic pharmacologists (without the teaching).
3. Use cases
3.1 Use Case #1: Hypothesis-Based Data Integration
Task: A scientist working on (drug) target optimization tries to understand the relationship between a gene and the expressed protein. She would like to know everything that is known about a given gene, and more particularly the effect of an engineered section of the gene on the subsequently expressed protein.
Today: The scientist searches by gene name or GenBank number in the corporate biology information repository. She then searches GenBank, SwissProt and PubMed for information on this particular gene. In each source she needs to read the information, find the passages describing the protein expression and associated mechanisms of action, and copy the relevant information to a report. This process requires tedious, manual information collection. Moreover, after all this work, it is still unclear whether the information found is up to date or has been superseded by new information; it is also unclear whether the experimental evidence is still considered valid.
Future: A future workflow should allow the scientist to obtain all information on the expression of the gene and sections of the gene by browsing through a Hypothesis-based knowledge base. Specifically, this means that:
- a) The user performs a query on a given gene
- b) The tool extracts a series of statements pertaining to findings on that gene, that contain:
  - A list of hypotheses concerning this gene
  - A direct link to experimental evidence supporting each hypothesis
  - A network of relationships linking a specific hypothesis to previous or succeeding hypotheses.
Alternatively, heat-map tools can be developed that allow the user to list and compare hypotheses and pick the one that ‘holds up’ best, given the experimental evidence and the user’s question.
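To make the shape of such a knowledge base more concrete, the following is a minimal sketch in Python, not a description of any existing system. All class names, field names, gene labels and document identifiers are hypothetical placeholders, and the "support score" (the fraction of linked experiments that support a hypothesis) is only one assumed way of deciding which hypothesis ‘holds up’ best.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    source: str     # hypothetical identifier of a paper or internal report
    supports: bool  # True if the experiment supports the hypothesis


@dataclass
class Hypothesis:
    hid: str
    gene: str
    statement: str
    evidence: List[Evidence] = field(default_factory=list)
    succeeded_by: List[str] = field(default_factory=list)  # ids of later hypotheses


class HypothesisKB:
    """In-memory store of hypotheses, indexed by hypothesis id."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Hypothesis] = {}

    def add(self, hypothesis: Hypothesis) -> None:
        self._by_id[hypothesis.hid] = hypothesis

    def query_gene(self, gene: str) -> List[Hypothesis]:
        # Step a): the user queries on a given gene.
        return [h for h in self._by_id.values() if h.gene == gene]

    def successors(self, hid: str) -> List[Hypothesis]:
        # Follow the network of relationships to succeeding hypotheses.
        return [self._by_id[s] for s in self._by_id[hid].succeeded_by
                if s in self._by_id]


def support_score(h: Hypothesis) -> float:
    # Assumed metric: share of linked experiments that support the claim.
    if not h.evidence:
        return 0.0
    return sum(e.supports for e in h.evidence) / len(h.evidence)


def rank_hypotheses(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    # Best-supported hypotheses first.
    return sorted(hypotheses, key=support_score, reverse=True)


# Placeholder data, invented purely for illustration.
kb = HypothesisKB()
kb.add(Hypothesis("H1", "GENE_X", "Region A of GENE_X is required for expression.",
                  evidence=[Evidence("doc-1", True), Evidence("doc-2", False)],
                  succeeded_by=["H2"]))
kb.add(Hypothesis("H2", "GENE_X", "Region A of GENE_X modulates expression level.",
                  evidence=[Evidence("doc-3", True)]))
kb.add(Hypothesis("H3", "GENE_Y", "Unrelated claim about GENE_Y."))

ranked = rank_hypotheses(kb.query_gene("GENE_X"))
for h in ranked:
    print(h.hid, round(support_score(h), 2), h.statement)
```

A heat-map view could be built on the same data by plotting one row per hypothesis and one column per evidence source; a real system would also need provenance and date information to tell whether a hypothesis has been superseded.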
3.2 Use Case #2: Improving Existing Text Mining Systems
Task: A member of the Allergy & Respiratory therapeutic area wishes to search for new opportunities to utilize the progress made by other teams.
Today: The user opens a Web page that is centered on the internal portfolio across all therapeutic areas, but cross-referenced with public and private information associating the projects with data relevant to Allergy & Respiratory diseases. The data that populates this table is drawn from a variety of internal and external systems. Key “facts” are exported from these systems into RDF. Using this tool, the Allergy & Respiratory scientist is able to ask the question “for targets being investigated by other Therapeutic areas, which have any information that suggest they may also be useful for me?” The user has access to aggregated data garnered from large-scale text-mining experiments run over the biomedical literature. However, since the text mining systems cannot distinguish between a casual or non-essential mention of a drug or therapeutic area and the proposal of a truly new therapy, the amount of data that he needs to weed through to find anything new is overwhelming. To find out whether there is a real connection between a given target and the therapeutic area of interest, and to identify whether a clinical effect was a side effect or a therapeutic effect, he still has to read through vast quantities of literature.
Future: Once texts are augmented with Epistemic Markup, it becomes clear which statements are key research questions that have been experimentally verified, and which are simple mentions of an entity or disease that are not experimentally tested in the research. Since the epistemic markup is easy to add and trivial to query, internal and external documents can be connected and browsed with ease, and relations between current and past hypotheses visualized directly. Since hypotheses that are not of interest can be excluded easily, the number of data sources under scrutiny at any given time is drastically reduced, and precious research time recovered.
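As a rough illustration of how epistemic status could cut down the volume of text-mining hits, the sketch below filters statements by a status label. It assumes a hypothetical two-value vocabulary ("verified_claim" and "mention"); this document does not define the actual markup vocabulary, and the targets, document identifiers and sentences are invented placeholders.

```python
from typing import List, NamedTuple, Tuple


class Statement(NamedTuple):
    doc_id: str
    target: str
    area: str     # therapeutic area the statement refers to
    status: str   # hypothetical labels: "verified_claim" or "mention"
    text: str


def relevant(statements: List[Statement], area: str,
             keep: Tuple[str, ...] = ("verified_claim",)) -> List[Statement]:
    # Keep statements about `area` whose epistemic status is in `keep`.
    return [s for s in statements if s.area == area and s.status in keep]


# Placeholder text-mining output, invented purely for illustration.
hits = [
    Statement("doc-1", "TARGET_A", "Allergy & Respiratory", "verified_claim",
              "Inhibiting TARGET_A reduced airway inflammation in the assay."),
    Statement("doc-2", "TARGET_A", "Allergy & Respiratory", "mention",
              "TARGET_A has also been discussed in the context of asthma."),
    Statement("doc-3", "TARGET_B", "Oncology", "verified_claim",
              "TARGET_B knockdown slowed tumour growth."),
    Statement("doc-4", "TARGET_C", "Allergy & Respiratory", "mention",
              "Patients with rhinitis were excluded from the TARGET_C study."),
]

shortlist = relevant(hits, "Allergy & Respiratory")
for s in shortlist:
    print(s.doc_id, s.target, s.text)
```

With these placeholder inputs, three hits mention the therapeutic area but only one carries an experimentally verified claim, which is the kind of reduction the use case aims for.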
Acknowledgements and Reference:
This use case has borrowed heavily from the significantly superior knowledge of the field of Pharmacology of my Elsevier colleagues Stephanie Diment and Pieder Caduff, as displayed in several internal reports written from 1998 to 2003. The introductory paragraphs have been borrowed, in true modular spirit, from the Semantic Web In Use (SWIU) Case Study ‘Prioritization of Biological Targets for Drug Discovery’, written by Susie Stephens: http://www.w3.org/2001/sw/sweo/public/UseCases/Lilly/. Use case #2 is taken from the SWIU Use Case ‘Case Study: Applying the Semantic Web to Internal Compound Repurposing’, written by Nigel Wilkinson and Lee Harland: http://www.w3.org/2001/sw/sweo/public/UseCases/Pfizer/.