HCLSIG/LODD/Interlinking
Interlinking Methodology in the LODD project
Many identifiers commonly used in the life sciences can be used to make links between data sets explicit. Links generated from shared identifiers include the connections from LinkedCT to Bio2RDF's PubMed and from DrugBank to DBpedia. Bio2RDF already provides the connections between bioinformatics and cheminformatics data sources, which allows us to interlink our drug-related data sets with that work.
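As a rough illustration of identifier-based linking, the sketch below joins two small record sets on a shared identifier and emits one link per match as an N-Triples line. The resource URIs, the identifier values, and the choice of owl:sameAs as the link predicate are assumptions made for the example; they are not taken from the LODD data sets.

```python
# Hypothetical records: each maps a resource URI to a shared identifier.
# URIs and identifier values are invented for illustration only.
source = {
    "http://example.org/source/drug1": "ID-1",
    "http://example.org/source/drug2": "ID-2",
}
target = {
    "http://example.org/target/a": "ID-2",
    "http://example.org/target/b": "ID-3",
}

# Assumed link predicate; the project may use other predicates.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def links_by_shared_id(source, target, predicate=OWL_SAME_AS):
    """Return N-Triples lines linking resources that carry the same identifier."""
    # Index the target side by identifier so each source lookup is a dict hit.
    by_id = {}
    for uri, ident in target.items():
        by_id.setdefault(ident, []).append(uri)

    triples = []
    for uri, ident in source.items():
        for match in by_id.get(ident, []):
            triples.append(f"<{uri}> <{predicate}> <{match}> .")
    return triples


links = links_by_shared_id(source, target)
```

Only the pair that shares "ID-2" produces a link; resources without a counterpart are skipped.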
In cases where no shared identifiers exist, state-of-the-art string and semantic matching techniques were applied for link discovery. Approximate string matching was employed to interlink LinkedCT and Diseasome, where for instance "Alzheimer's disease" in LinkedCT was matched with "Alzheimer_disease" in Diseasome. Semantic matching is especially useful in matching clinical terms as many drugs and diseases have multiple names. Drugs tend to have generic names and brand names, for example, "Varenicline" has the synonym "Varenicline Tartrate" and the brand names "Champix" and "Chantix".
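The two kinds of matching described above can be sketched in a few lines. This is a minimal illustration, not the matching code used in the project: the normalization rule, the 0.9 threshold, and the use of Python's difflib as the string similarity measure are all assumptions. The synonym table contains only the Varenicline names mentioned above.

```python
import re
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase a label and turn punctuation and underscores into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def similarity(a, b):
    """Approximate string similarity in [0, 1] on the normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Assumed cut-off; a real deployment would tune this per data set.
THRESHOLD = 0.9

# Synonym and brand names mapped to a canonical drug name (from the text above).
SYNONYMS = {
    "varenicline tartrate": "varenicline",
    "champix": "varenicline",
    "chantix": "varenicline",
}


def canonical(name):
    """Map a name to its canonical form if a synonym is known."""
    key = normalize(name)
    return SYNONYMS.get(key, key)


def is_match(a, b, threshold=THRESHOLD):
    """Match on known synonyms first, then fall back to approximate matching."""
    if canonical(a) == canonical(b):
        return True
    return similarity(a, b) >= threshold
```

With these assumptions, `is_match("Alzheimer's disease", "Alzheimer_disease")` succeeds through approximate matching, while `is_match("Champix", "Chantix")` succeeds through the synonym table.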
Semantic link discovery in this project is performed using the following novel link discovery tools:
- LinQuer [1] is a tool for semantic link discovery over relational data. The LinQuer framework is built around LinQL, a declarative language for specifying linkage requirements in a wide variety of applications. The framework rewrites LinQL queries into standard SQL queries that can be run over existing relational data sources. LinQuer is particularly useful because most of our data is published using tools that operate over relational data sources (such as D2R Server). It supports semantic link discovery based on state-of-the-art string and semantic matching techniques and their combinations.
- Silk [2] discovers links between data sources. It provides a declarative language for specifying the link types and conditions. The implemented similarity metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy. Each metric evaluates to a similarity value between 0 and 1 (higher values indicating greater similarity). Metric results can be weighted and combined into an overall similarity value.
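To make the idea of weighting metric results concrete, the sketch below combines several per-metric similarity values, each between 0 and 1, into one overall value using a weighted average. The weighted average, the metric names, and the weights are assumptions chosen for illustration; this is not Silk's implementation or its specification language.

```python
def weighted_similarity(scores, weights):
    """Combine per-metric similarities in [0, 1] into one weighted average."""
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"similarity for {name!r} must lie in [0, 1]")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical metric results for one candidate pair of resources.
scores = {"label": 0.9, "taxonomy": 0.5}
# Hypothetical weights: the label comparison counts three times as much.
weights = {"label": 3, "taxonomy": 1}

overall = weighted_similarity(scores, weights)
```

Because the result is an average of values in [0, 1], the overall similarity also stays in that range and can be compared against a single acceptance threshold.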
More on the interlinking methodology and statistics will be made available soon.
[1] Hassanzadeh, O., Xin, R., Miller, R. J., Lim, L., Kementsietsidis, A., and Wang, M.: Linkage Query Writer. To appear in: Proceedings of the 35th International Conference on Very Large Data Bases (VLDB 2009), Demonstrations Track.
[2] Volz, J., Bizer, C., Gaedke, M., and Kobilarov, G.: Silk – A Link Discovery Framework for the Web of Data. In: Linked Data on the Web workshop at WWW2009, 2009.
Interlinking
The figure below shows the data sets that have been published so far and their interlinking paths.
Number of outgoing Links
Data Set |
DailyMed |
DrugBank |
DailyMed |
LinkedCT |
RDF-TCM |
SIDER |
Linkage types
Source Data Set | Target Data Set | Number of Links |
DailyMed | LinkedCT | 27,685 |
DailyMed | LinkedCT | 44 |
DailyMed | DBpedia | 49 |
DailyMed | DBpedia | 2,504 |
DailyMed | Diseasome | 6,124 |
DailyMed | DrugBank | 1,593 |
DailyMed | RDF-TCM | 21 |
Diseasome | DBpedia | 1,300 |
Diseasome | DBpedia | 643 |
Diseasome | GeneID | 688 |
Diseasome | HGNC | 688 |
Diseasome | OMIM | 2,929 |
Diseasome | Symbol | 9,743 |
Diseasome | LinkedCT | 372 |
Diseasome | DailyMed | 6,124 |
Diseasome | DrugBank | 8,202 |
Diseasome | RDF-TCM | 313 |
Diseasome | RDF-TCM | 63 |
DrugBank | ChEBI | 736 |
DrugBank | PDB | 3,379 |
DrugBank | CAS | 2,240 |
DrugBank | Pfam | 19,082 |
DrugBank | UniProt | 4,660 |
DrugBank | HGNC | 1,675 |
DrugBank | GeneID | |
DrugBank | Symbol | 1,533 |
DrugBank | LinkedCT | 12,127 |
DrugBank | DBpedia | 187 |
DrugBank | DBpedia | 1,522 |
DrugBank | Diseasome | 8,202 |
DrugBank | DailyMed | 1,593 |
DrugBank | KEGG | 913 |
DrugBank | KEGG Compound | 1,331 |
DrugBank | RDF-TCM | 384 |
DrugBank | RDF-TCM | 1 |
DrugBank | PubMed | 96 |
LinkedCT | DailyMed | 27,685 |
LinkedCT | DrugBank | 12,127 |
LinkedCT | Diseasome | 372 |
LinkedCT | Geonames | 129,177 |
LinkedCT | DBpedia | 8,848 |
LinkedCT | Yago | |
LinkedCT | PubMed | 42,219 |
LinkedCT | RDF-TCM | 141 |
RDF-TCM | DBpedia | 649 |
RDF-TCM | DBpedia | 496 |
RDF-TCM | DBpedia | 255 |
RDF-TCM | SIDER | 171 |
RDF-TCM | Diseasome | 313 |
RDF-TCM | Diseasome | 63 |
RDF-TCM | DrugBank | 1 |
RDF-TCM | DrugBank | 384 |
RDF-TCM | EntrezGene | 944 |
RDF-TCM | DailyMed | 21 |
RDF-TCM | LinkedCT | 141 |
SIDER | RDF-TCM | 171 |
SIDER | DrugBank | 1,140 |
SIDER | DailyMed | 1,986 |
SIDER | Diseasome | 238 |
SIDER | DBpedia | 1,392 |
SIDER | DBpedia | 735 |
SIDER | STITCH | 14,894 |
STITCH | DBpedia | 123 |