From W3C Wiki

Goal: Add data to eric p's federated query

I'm still adding links to this and cleaning up the formatting (Wilbanks, 1 September 2004).


A protein target for angiogenesis has been found. Angiogenesis (angio'gen'esis), the growth of new blood vessels, is an important natural process in the body, both in health and in disease. Blocking it is a hot new approach to attacking cancer: if you can stop the growth of new blood vessels, the theory goes, you can essentially starve tumors to death. By the way, "new" in this context means the theory is about 35 years old and the first drugs are in clinical trials.

More info on angiogenesis.

Proof of Concept:

  • Based on the protein screened (human angiogenin) what are other interesting targets based on Uniprot annotations, KEGG pathways and GO classifications? What other types of semantic "lensing" would be interesting (i.e., what other facts do we want about the protein)?
  • What is the public information about known compounds that inhibit angiogenin specifically and angiogenesis more globally? What are their mechanisms of action, safety, toxicity and efficacy?
  • What drugs use those mechanisms of action (MoA) and who owns them? What are the other off-label uses? Could any of those have been inferred?
  • Can we correlate the bioactivity (how many pathways are activated; a second question is how to rank activity) with the substructures of the compounds (the public ones and the one that was tested)? If so, how do we measure and rank?
  • Can we accomplish this with a small set of queries using a semantic web approach? How long would it have taken otherwise?

Strawman Code:

Elements of the story...

  • The protein target has been screened for compounds that bind to it. The compound is the "key" and the protein is the "lock" in the classic small molecule drug metaphor.
  • Screen for compounds that inhibit the activity of human angiogenin, a protein with RNase activity that can induce angiogenesis. An in vitro assay for angiogenin RNase activity was developed using purified angiogenin and a small (approx. 5 nucleotide) RNA analogue that fluoresces when cleaved.

There is a microarray experiment profiling angiogenesis.

Scientists are primarily interested in the analysis of the data. The raw data matters from a trust/provenance perspective (was this experiment well-formed and well-executed, and can I therefore trust it?).

Most scientists use software to perform statistical cluster analysis (k-means, nearest neighbor), and some more advanced scientists run techniques like support vector machines and principal components analysis. The method isn't important for this proof of concept.
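For readers unfamiliar with the clustering step, here is a minimal k-means sketch in plain Python. It is only an illustration of the kind of analysis meant here: the gene names and expression values are made up, and the function name `kmeans` and its parameters are assumptions, not part of any tool mentioned on this page.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on equal-length lists of floats.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Made-up expression profiles (one list of values per gene).
profiles = {
    "gene_a": [0.1, 0.2, 0.1],
    "gene_b": [0.2, 0.1, 0.2],
    "gene_c": [2.9, 3.1, 3.0],
    "gene_d": [3.0, 3.0, 3.2],
}
names = list(profiles)
centroids, labels = kmeans([profiles[n] for n in names], k=2)

# Group gene names by cluster label for later annotation.
clusters = {}
for name, lab in zip(names, labels):
    clusters.setdefault(lab, []).append(name)
```

The resulting `clusters` mapping (label to gene names) is the kind of output whose representation matters for this proof of concept, whatever method produced it.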

An example: clusters for the above microarray experiment.

Representing the results of those analyses is what's important. I propose to use Yoshio's work on representing probability in RDF for this.

Scientists now need to deconvolute the statistical results: which clusters are actually being driven by biological activity, and which are merely artifacts of the math? The traditional approach is to take the gene names for the cluster and start searching the literature, but when you are dealing with thousands of genes (each of which has n synonyms!) it gets pretty unworkable.
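The synonym problem can be sketched as a lookup that collapses every alias onto one canonical symbol before any literature or database search. This is a rough illustration only: the table `SYNONYMS` is a tiny hand-written stand-in for a real synonym source, and the helper names `build_index` and `canonicalize` are hypothetical.

```python
# Hypothetical, hand-written synonym table: canonical symbol -> aliases.
# A real version would be loaded from a curated source, not typed in.
SYNONYMS = {
    "ANG": ["RNASE5", "angiogenin"],
    "VEGFA": ["VEGF", "vascular endothelial growth factor A"],
}

def build_index(synonyms):
    """Map every case-folded name (canonical or alias) to its canonical symbol."""
    index = {}
    for canonical, aliases in synonyms.items():
        for name in [canonical, *aliases]:
            index[name.casefold()] = canonical
    return index

def canonicalize(cluster, index):
    """Resolve a cluster's gene names.

    Returns (sorted canonical symbols, names that could not be resolved).
    """
    resolved, unknown = set(), []
    for name in cluster:
        key = name.strip().casefold()
        if key in index:
            resolved.add(index[key])
        else:
            unknown.append(name)
    return sorted(resolved), unknown

index = build_index(SYNONYMS)
resolved, unknown = canonicalize(["rnase5", "Angiogenin", "VEGF", "mystery1"], index)
```

Keeping the unresolved names in a separate list makes it visible how much of a cluster the synonym table fails to cover.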

There is public toxicity information on compounds targeting angiogenesis, though it's mainly free text.

TOXNET - search on angiogenesis:

  • 9 records in Hazardous Substances Databank
  • 635 hits in Toxline (literature)
  • Probably need to text mine and convert to RDF - for this experiment we'll do it by hand
  • Links in other compounds as well as side effect mechanisms (anti-targets)
  • There is a connection between the screening database (ChemBank) and this tox information - can do the join on the chemical names/aliases
  • Three categories for PoC: animal toxicity, pharmacokinetics, and chemical/physical structures
  • Toxnet also has synonym information that should be leveraged
  • Choosing three drugs: neomycin, thalidomide and celecoxib
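The join between the screening data and the tox information on chemical names/aliases might look roughly like the sketch below. The record layout and the ids (`screen-1`, `tox-1`, ...) are invented for illustration and do not reflect the actual ChemBank or TOXNET schemas; `normalize` and `join_on_aliases` are hypothetical helpers.

```python
def normalize(name):
    """Case-fold a chemical name and drop punctuation/whitespace before comparing."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())

# Invented records: each has an id and a list of names/aliases.
screening = [
    {"id": "screen-1", "names": ["Thalidomide"]},
    {"id": "screen-2", "names": ["celecoxib", "Celebrex"]},
    {"id": "screen-3", "names": ["Neomycin"]},
]
tox = [
    {"id": "tox-1", "names": ["THALIDOMIDE", "Thalomid"]},
    {"id": "tox-2", "names": ["Celebrex"]},
]

def join_on_aliases(left, right):
    """Return sorted (left_id, right_id) pairs that share at least one normalized name."""
    index = {}
    for rec in right:
        for name in rec["names"]:
            index.setdefault(normalize(name), set()).add(rec["id"])
    pairs = set()
    for rec in left:
        for name in rec["names"]:
            for right_id in index.get(normalize(name), ()):
                pairs.add((rec["id"], right_id))
    return sorted(pairs)

matches = join_on_aliases(screening, tox)
```

Here `screen-2` only matches through an alias, which is why the synonym information matters, and `screen-3` has no match because the invented tox data has no neomycin record.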

There is clinical trial information on each of those drugs as well:

  • Thalidomide Clinical Trials
  • Celecoxib Clinical Trials

Most of these trials are "combination" trials - it might be interesting to use "being tested in combination with" as a relationship and bring those chemical names into the query.
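As a rough sketch of the "being tested in combination with" idea, the relationship can be held as subject/predicate/object triples and used to widen the set of chemical names in a query. Everything here is a placeholder: the `ex:` prefix, the predicate name, and the partner compounds `ex:compoundA` and `ex:compoundB` are invented and say nothing about real trials.

```python
# Hypothetical predicate; no existing vocabulary is implied.
TESTED_WITH = "ex:testedInCombinationWith"

# Placeholder triples (subject, predicate, object).
triples = {
    ("ex:thalidomide", TESTED_WITH, "ex:compoundA"),
    ("ex:compoundB", TESTED_WITH, "ex:celecoxib"),
    ("ex:thalidomide", "ex:hasTrial", "ex:trial1"),
}

def combination_partners(drug, triples):
    """Treat the relationship as symmetric: match the drug as subject or object."""
    partners = set()
    for s, p, o in triples:
        if p != TESTED_WITH:
            continue
        if s == drug:
            partners.add(o)
        elif o == drug:
            partners.add(s)
    return sorted(partners)

def expand_query_terms(seeds, triples):
    """Add every combination partner of the seed drugs to the query terms."""
    terms = set(seeds)
    for drug in seeds:
        terms.update(combination_partners(drug, triples))
    return sorted(terms)

terms = expand_query_terms(["ex:thalidomide", "ex:celecoxib"], triples)
```

Treating the relationship as symmetric is a design choice in this sketch; a trial record may list drugs in either order, so the lookup checks both positions.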

There is public information about proteins in RDF:

  • Uniprot
  • Contains OWL version of the gene ontology

KEGG makes metabolic pathways available in XML:

  • KEGG ML download page
  • Use GRDDL to convert?
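GRDDL would do the conversion with a transformation (typically XSLT) attached to the XML. As a stand-in for that, the sketch below does the same pathway-XML-to-triples step directly in Python. The XML is a heavily simplified, made-up fragment that only borrows the `pathway`/`entry`/`relation` element names from KEGG's XML; the identifiers and the `ex:` predicates are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment; real KEGG pathway files carry more
# attributes and nested elements than shown here.
KGML = """<pathway name="path:example" title="Example pathway">
  <entry id="1" name="ex:geneA" type="gene"/>
  <entry id="2" name="ex:geneB" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel"/>
</pathway>"""

def kgml_to_triples(xml_text):
    """Turn entries and relations into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    pathway = root.get("name")
    names = {}    # entry id -> entry name, used to resolve relations
    triples = []
    for entry in root.iter("entry"):
        names[entry.get("id")] = entry.get("name")
        triples.append((pathway, "ex:hasEntry", entry.get("name")))
    for rel in root.iter("relation"):
        triples.append((
            names[rel.get("entry1")],
            "ex:" + rel.get("type"),   # hypothetical predicate per relation type
            names[rel.get("entry2")],
        ))
    return triples

pathway_triples = kgml_to_triples(KGML)
```

Triples in this shape could then sit alongside the protein and toxicity data for querying.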