HCLSIG/LODD/Interlinking

Interlinking Methodology in the LODD project

Many commonly used identifiers in the life sciences can be used to make links between data sets explicit. Links generated from shared identifiers include the connections from LinkedCT to Bio2RDF's PubMed and from DrugBank to DBpedia. Bio2RDF already provides the connections between bioinformatics and cheminformatics data sources, which allows us to interlink our drug-related data sets with that work.
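Linking on a shared identifier amounts to a join on that identifier. The following is a minimal sketch of the idea, not the code used in the project; the function name, the example.org URIs, and the identifier values are all placeholders chosen for illustration.

```python
from collections import defaultdict


def link_by_shared_id(source, target):
    """Pair source and target URIs whose records carry the same identifier.

    Both arguments map an entity URI to the identifier that entity carries.
    """
    # Index the target side by identifier so each source lookup is cheap.
    target_by_id = defaultdict(list)
    for uri, identifier in target.items():
        target_by_id[identifier].append(uri)
    return [
        (uri, match)
        for uri, identifier in source.items()
        for match in target_by_id.get(identifier, [])
    ]


# Placeholder URIs and identifiers, not real LinkedCT or Bio2RDF data.
trials = {
    "http://example.org/trial/A": "pmid:111",
    "http://example.org/trial/B": "pmid:222",
}
articles = {
    "http://example.org/article/X": "pmid:111",
    "http://example.org/article/Y": "pmid:333",
}
links = link_by_shared_id(trials, articles)
```

Here only trial A and article X share an identifier, so a single link is produced; entities without a counterpart are simply left unlinked.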

In cases where no shared identifiers exist, state-of-the-art string and semantic matching techniques were applied for link discovery. Approximate string matching was employed to interlink LinkedCT and Diseasome, where, for instance, "Alzheimer's disease" in LinkedCT was matched with "Alzheimer_disease" in Diseasome. Semantic matching is especially useful for matching clinical terms, as many drugs and diseases have multiple names. Drugs tend to have both generic names and brand names; for example, "Varenicline" has the synonym "Varenicline Tartrate" and the brand names "Champix" and "Chantix".
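The string-matching case can be illustrated with a short sketch. It assumes a simple normalization step followed by the similarity ratio from Python's standard `difflib` module; the project's actual matching functions are not described here, so the normalization rules, the function names, and the 0.9 threshold are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase, drop possessive 's, and turn underscores/punctuation into spaces."""
    text = label.lower().replace("_", " ")
    text = re.sub(r"'s\b", "", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_links(source_labels, target_labels, threshold=0.9):
    """Return (source, target, score) triples whose score meets the threshold."""
    links = []
    for s in source_labels:
        for t in target_labels:
            score = similarity(s, t)
            if score >= threshold:
                links.append((s, t, score))
    return links


# The label pair from the text normalizes to the same string on both sides.
matches = find_links(["Alzheimer's disease"], ["Alzheimer_disease", "Asthma"])
```

Comparing every source label with every target label is quadratic, which is acceptable for a sketch but would need blocking or indexing for data sets of the size listed on this page.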

Semantic link discovery in this project is performed using the following novel link discovery tools:

LinQuer [1] is a tool for semantic link discovery over relational data. The LinQuer framework is built around LinQL, a declarative language for specifying linkage requirements in a wide variety of applications. The framework rewrites LinQL queries into standard SQL queries that can be run over existing relational data sources. LinQuer is particularly useful because most of our data is published using tools that operate over relational data sources (such as D2R Server). It supports semantic link discovery based on state-of-the-art string and semantic matching techniques and their combinations.
Silk [2] discovers links between data sources. It provides a declarative language for specifying the link types and conditions. The implemented similarity metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy. Each metric evaluates to a similarity value between 0 and 1 (higher values indicating greater similarity). Metric results can be weighted and combined into an overall similarity value.
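To make the idea of weighting metric results concrete, here is a minimal sketch of one possible aggregation, a weighted average. It is not Silk's specification language or its implementation; the function name, the metric names, and the values are hypothetical.

```python
def weighted_similarity(scores, weights):
    """Combine per-metric similarity values (each in [0, 1]) into a weighted average."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must name the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"metric {name!r} is outside [0, 1]: {value}")
    return sum(scores[m] * weights[m] for m in scores) / total_weight


# Hypothetical metric names and values, for illustration only.
scores = {"label_string": 0.8, "taxonomic_distance": 0.5}
weights = {"label_string": 3.0, "taxonomic_distance": 1.0}
overall = weighted_similarity(scores, weights)
```

Because each input lies between 0 and 1 and the weights are normalized by their sum, the combined value also stays between 0 and 1 and can be compared against an acceptance threshold.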

More on the interlinking methodology and statistics will be made available soon.

[1] Hassanzadeh, O., Xin, R., Miller, R. J., Lim, L., Kementsietsidis, A., and Wang, M.: Linkage Query Writer. To appear in: Proceedings of the 35th International Conference on Very Large Data Bases (VLDB 2009), Demonstrations Track.

[2] Volz, J., Bizer, C., Gaedke, M., and Kobilarov, G.: Silk – A Link Discovery Framework for the Web of Data. In: Linked Data on the Web workshop at WWW2009, 2009.

Interlinking

The figure below shows the data sets that have been published so far and the interlinking paths between them.

Number of outgoing Links

Data Set

Linkage types

Source Data Source	Target Data Source	Number of Links
DailyMed	LinkedCT	27,685
DailyMed	LinkedCT	44
DailyMed	DBpedia	49
DailyMed	DBpedia	2,504
DailyMed	Diseasome	6,124
DailyMed	DrugBank	1,593
DailyMed	RDF-TCM	21
Diseasome	DBpedia	1,300
Diseasome	DBpedia	643
Diseasome	GeneID	688
Diseasome	HGNC	688
Diseasome	OMIM	2,929
Diseasome	Symbol	9,743
Diseasome	LinkedCT	372
Diseasome	DailyMed	6,124
Diseasome	DrugBank	8,202
Diseasome	RDF-TCM	313
Diseasome	RDF-TCM	63
DrugBank	ChEBI	736
DrugBank	PDB	3,379
DrugBank	CAS	2,240
DrugBank	Pfam	19,082
DrugBank	UniProt	4,660
DrugBank	HGNC	1,675
DrugBank	GeneID
DrugBank	Symbol	1,533
DrugBank	LinkedCT	12,127
DrugBank	DBpedia	187
DrugBank	DBpedia	1,522
DrugBank	Diseasome	8,202
DrugBank	DailyMed	1,593
DrugBank	KEGG	913
DrugBank	KEGG Compound	1,331
DrugBank	RDF-TCM	384
DrugBank	RDF-TCM	1
DrugBank	PubMed	96
LinkedCT	DailyMed	27,685
LinkedCT	DrugBank	12,127
LinkedCT	Diseasome	372
LinkedCT	Geonames	129,177
LinkedCT	DBpedia	8,848
LinkedCT	Yago
LinkedCT	PubMed	42,219
LinkedCT	RDF-TCM	141
RDF-TCM	DBpedia	649
RDF-TCM	DBpedia	496
RDF-TCM	DBpedia	255
RDF-TCM	Sider	171
RDF-TCM	Diseasome	313
RDF-TCM	Diseasome	63
RDF-TCM	DrugBank	1
RDF-TCM	DrugBank	384
RDF-TCM	EntrezGene	944
RDF-TCM	DailyMed	21
RDF-TCM	LinkedCT	141
Sider	RDF-TCM	171
Sider	DrugBank	1,140
Sider	DailyMed	1,986
Sider	Diseasome	238
Sider	DBpedia	1,392
Sider	DBpedia	735
Sider	STITCH	14,894
STITCH	DBpedia	123
