Difference between revisions of "HCLSIG/LODD/Data"

From W3C Wiki
(Added some more details on the data sets.)
Line 13: Line 13:
 
| [http://www4.wiwiss.fu-berlin.de/drugbank/ DrugBank]
 
| [http://www4.wiwiss.fu-berlin.de/drugbank/ DrugBank]
 
| Drugs
 
| Drugs
|  [http://www.drugbank.ca/ Drugbank.ca] provides drug (i.e., chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e., sequence, structure, and pathway) information
+
|  [http://www.drugbank.ca/ Drugbank.ca] provides drug (i.e., chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e., sequence, structure, and pathway) information ({{doi|10.1093/nar/gkj067}})
 
| 766,920 triples; 4,800 drugs, 2,500 protein sequences
 
| 766,920 triples; 4,800 drugs, 2,500 protein sequences
 
| updated regularly
 
| updated regularly
Line 70: Line 70:
 
| [http://www4.wiwiss.fu-berlin.de/sider/ SIDER]
 
| [http://www4.wiwiss.fu-berlin.de/sider/ SIDER]
 
| Diseases / Side Effects
 
| Diseases / Side Effects
|  [http://sideeffects.embl.de/ SIDER] contains information on marketed drugs and their adverse effects  
+
|  [http://sideeffects.embl.de/ SIDER] contains information on marketed drugs and their adverse effects ({{doi|10.1038/msb.2009.98}})
 
|  192,515 triples;  1,737 genes
 
|  192,515 triples;  1,737 genes
 
| updated 2009
 
| updated 2009
Line 78: Line 78:
 
| [http://www4.wiwiss.fu-berlin.de/stitch/ STITCH]
 
| [http://www4.wiwiss.fu-berlin.de/stitch/ STITCH]
 
| Chemicals / Proteins
 
| Chemicals / Proteins
|  [http://stitch.embl.de/ STITCH] contains information on chemicals, proteins, and their interactions  
+
|  [http://stitch.embl.de/ STITCH] contains information on chemicals, proteins, and their interactions ({{doi|10.1093/nar/gkm795}})
 
|  7,500,000 chemicals; 500,000 proteins; 370 organisms  
 
|  7,500,000 chemicals; 500,000 proteins; 370 organisms  
 
| updated July 2009
 
| updated July 2009
Line 94: Line 94:
 
| [http://www.ebi.ac.uk/chembl/ ChEMBL]
 
| [http://www.ebi.ac.uk/chembl/ ChEMBL]
 
| Chemical / Assays (Proteins, Organisms) / Papers
 
| Chemical / Assays (Proteins, Organisms) / Papers
|  ChEMBL contains information on trial drugs with information about activity against targets like but not limited to proteins. All is backed up by and linked to literature. Includes links to Bio2RDF for ChEBI and Uniprot.
+
[http://www.ebi.ac.uk/chembl/ ChEMBL]] contains information on trial drugs with information about activity against targets like but not limited to proteins. All is backed up by and linked to literature. Includes links to Bio2RDF for ChEBI and Uniprot. License: CC-BY-SA.
 
| ~24M triples
 
| ~24M triples
 
| Updated 2010-01
 
| Updated 2010-01

Revision as of 13:21, 2 December 2010

LODD-related datasets that the LODD group already made available as Linked Data

Name | Topic | Short Description | Size | Status / Activity | Example Instances | SPARQL Endpoint
DrugBank | Drugs | Drugbank.ca provides drug (i.e., chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e., sequence, structure, and pathway) information (doi:10.1093/nar/gkj067) | 766,920 triples; 4,800 drugs, 2,500 protein sequences | updated regularly | Varenicline via Marbles, via OpenLink Data Explorer | http://www4.wiwiss.fu-berlin.de/drugbank/sparql
LinkedCT | Clinical Trials | Linked data source of trials from ClinicalTrials.gov | 7 million triples, 62,000 trials | preview release | Influenza (Intervention), A Trial, AIDS (condition), A reference, A location | http://data.linkedct.org/sparql
DailyMed | Drugs | dailymed.nlm.nih.gov provides information about approved prescription drugs, including FDA-approved labels (package inserts) | 164,276 triples; 4,039 drugs | updated regularly | "Sterile Water (Irrigant)" via Marbles, via OpenLink Data Explorer | http://www4.wiwiss.fu-berlin.de/dailymed/sparql
DBpedia | Drugs / Diseases / Proteins | RDF data about 2.49 million things that has been extracted from Wikipedia | 218 million RDF triples; 2,300 drugs, 2,200 proteins | updated every 3 months | Aspirin, HIV | http://dbpedia.org/sparql
Diseasome | Diseases / Genes | Diseasome describes characteristics of disorders and disease genes linked by known disorder–gene associations | 91,182 triples; 2,600 genes | updated 2006 | Alzheimer's via Marbles, via OpenLink Data Explorer | http://www4.wiwiss.fu-berlin.de/diseasome/sparql
RDF-TCM | Genes / Diseases / Medicine / Ingredients | Traditional Chinese medicine, gene and disease association dataset and a linkset mapping TCM gene symbols to Entrez Gene IDs created by Neurocommons | 117,643 | updated August 2009 (stable) | Ginkgo biloba | http://hcls.deri.org/sparql; graph name: http://hcls.deri.org/resource/graph/tcm
RxNorm | Drugs | A linked version of the NLM's RxNorm database that connects prescription drugs, ingredients, and NDCs through the RXCUI, a concept unique identifier. RxNorm is a product developed by NIH's National Library of Medicine. It currently interlinks 12 different drug vocabularies around a unique concept identifier. Due to licensing, only six of the drug vocabularies are made available as part of the LODD cloud: Medical Subject Headings, Metathesaurus FDA National Drug Code Directory, Metathesaurus FDA Structured Product Labels, National Drug File, RxNorm Vocabulary, and Veterans Health Administration National Drug File. Links are provided connecting RxNorm to DrugBank and to the UMLS. | over 7.7 million triples; 165,806 RXCUI (Concept Unique Identifiers) unique drugs and ingredients; 332,754 RXAUI (Atomic Unique Identifiers) sourced terms | Based on 3/2010 RxNorm release; last updated 5/2010 | Singulair from the Metathesaurus FDA Structured Product Labels | http://link.informatics.stonybrook.edu/sparql/
SIDER | Diseases / Side Effects | SIDER contains information on marketed drugs and their adverse effects (doi:10.1038/msb.2009.98) | 192,515 triples; 1,737 genes | updated 2009 | Confusion via Marbles | http://www4.wiwiss.fu-berlin.de/sider/sparql
STITCH | Chemicals / Proteins | STITCH contains information on chemicals, proteins, and their interactions (doi:10.1093/nar/gkm795) | 7,500,000 chemicals; 500,000 proteins; 370 organisms | updated July 2009 | Lactose via Marbles | http://www4.wiwiss.fu-berlin.de/stitch/sparql
Medicare | Medicare Formulary | xxx | xxx | xxx | xxx | http://www4.wiwiss.fu-berlin.de/medicare/sparql
ChEMBL | Chemical / Assays (Proteins, Organisms) / Papers | ChEMBL contains information on trial drugs, including their activity against targets such as (but not limited to) proteins. All data is backed up by and linked to the literature. Includes links to Bio2RDF for ChEBI and UniProt. License: CC-BY-SA. | ~24M triples | Updated 2010-01 | An IC50 activity | http://rdf.farmbio.uu.se/chembl/sparql
WHO Global Health Observatory | Infectious Diseases / Demography / Socioeconomic Conditions / Environmental Factors | Data and statistics for infectious diseases at country, regional, and global levels | 354300 | Updated 2010-09 | xxx | http://aksw.org/Projects/GHO2SCOVO?v=wmb
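
The endpoints in the SPARQL Endpoint column can in principle be queried over HTTP using the standard SPARQL Protocol. The following is a minimal Python sketch of how such a request URL might be built. The helper name `build_sparql_get_url` and the sample query are illustrative assumptions, not part of any LODD tooling, and the sketch makes no claim that the listed endpoints are still online. The `query` and `default-graph-uri` parameter names come from the SPARQL Protocol; the latter is shown because the RDF-TCM row names a specific graph.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and graph URLs copied from the table above; availability is not guaranteed.
DRUGBANK_ENDPOINT = "http://www4.wiwiss.fu-berlin.de/drugbank/sparql"
TCM_ENDPOINT = "http://hcls.deri.org/sparql"
TCM_GRAPH = "http://hcls.deri.org/resource/graph/tcm"

# Generic query that assumes nothing about a dataset's vocabulary.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_get_url(endpoint, query, default_graph=None):
    """Hypothetical helper: return a SPARQL Protocol GET URL for `query`."""
    params = [("query", query)]
    if default_graph is not None:
        # Restrict the query to one named graph, as needed for RDF-TCM.
        params.append(("default-graph-uri", default_graph))
    # Append with '&' if the endpoint URL already carries a query string.
    separator = "&" if urlsplit(endpoint).query else "?"
    return endpoint + separator + urlencode(params)


drugbank_url = build_sparql_get_url(DRUGBANK_ENDPOINT, SAMPLE_QUERY)
tcm_url = build_sparql_get_url(TCM_ENDPOINT, SAMPLE_QUERY, default_graph=TCM_GRAPH)
print(drugbank_url)
print(tcm_url)
```

The resulting URL can be fetched with any HTTP client; requesting a result format such as SPARQL JSON is typically done through an `Accept` header, which is left out here to keep the sketch free of network access.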

lodd-datasets_2009-08-06.png

This figure shows the incorporation of LinkedCT, DailyMed, DrugBank, Diseasome, RDF-TCM, and SIDER into the Linked Data cloud. These data sets are represented in dark gray, while light gray represents other Linked Data from the life sciences, and white indicates interlinked datasets covering geographic, person-related and conceptual data. More on the interlinking methodology and statistics can be found on the Interlinking page.

The LODD datasets have been crawled by the SWSE Semantic Web search engine and can be accessed via a faceted browsing interface at [1] (Example query: Varenicline).

Most of the LODD datasets have also been integrated into the SPARQL endpoint of the HCLS Knowledge Base, see the wiki page of the HCLS KB for further information.

Bio2RDF Data Sets

The Bio2RDF project has published 40 biology-, gene- and medical-related datasets (2.3 billion triples in total). The datasets are available via SPARQL endpoints and as Linked Data. The recommended way to access them is the Bio2RDF Java Servlet; the databases can optionally be downloaded for efficient local use. Running your own instance of the OpenLink Virtuoso AMI for EC2 is also an option. Basic URI resolution on such an instance does not require the Java Servlet, but for advanced queries you should still download the servlet and configure it to query your EC2 SPARQL endpoint.

Chem2bio2RDF

Data Sets for the LODD Task

To complement the drug-related Web of Data built by the LODD effort, the following data sets could/should also be published as Linked Data.

The LODD effort is currently gathering more information about relevant datasets. See also Evaluation of LODD Data Sets for current evaluation results.

Alternative Herbal Medicine use case

Identifier-Based Linkage Points

  • InChIs
  • PubChem Compound ID (CID)
  • PubChem NSC
  • Chemical Abstracts Service ID (CAS)
  • New Drug Application (NDA)
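
As a rough illustration of how such identifiers can serve as linkage points, the Python sketch below pairs records from two datasets whenever they share a value for one of the identifier keys. It is an assumption-laden sketch, not the LODD interlinking method: the function `link_records`, the key names, and the `http://example.org/` URIs are all hypothetical, and the two sample records only use the publicly known PubChem CIDs and CAS numbers for aspirin and caffeine.

```python
from collections import defaultdict

# Hypothetical field names for three of the linkage points listed above.
LINKAGE_KEYS = ("inchi", "pubchem_cid", "cas")


def link_records(left, right, keys=LINKAGE_KEYS):
    """Return sorted (left_uri, right_uri, key) triples for records sharing an identifier."""
    # Index the right-hand dataset by (identifier type, identifier value).
    index = defaultdict(list)
    for record in right:
        for key in keys:
            value = record.get(key)
            if value:
                index[(key, value)].append(record["uri"])

    links = set()
    for record in left:
        for key in keys:
            value = record.get(key)
            if value:
                for target_uri in index.get((key, value), []):
                    links.add((record["uri"], target_uri, key))
    return sorted(links)


# Illustrative records with placeholder URIs (not taken from any LODD dataset).
dataset_a = [
    {"uri": "http://example.org/a/aspirin", "pubchem_cid": "2244", "cas": "50-78-2"},
]
dataset_b = [
    {"uri": "http://example.org/b/drug1", "cas": "50-78-2"},
    {"uri": "http://example.org/b/drug2", "pubchem_cid": "2519"},
]

links = link_records(dataset_a, dataset_b)
print(links)
```

Only the record sharing the CAS number is linked; the second record in `dataset_b` carries a different CID and is left unmatched. Exact string matching is a deliberate simplification, since real identifiers often need normalization before comparison.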

Data Set Attributes

  • Licensing
  • Data Format
  • Identifiers