From W3C Wiki

SWEO Community Project: Linking Open Data on the Semantic Web

Equivalence Mining and Matching Frameworks

This page collects software tools and papers about techniques that can be used to auto-generate links between data items in different data sources.

The page is part of the community project SweoIG/TaskForces/CommunityProjects/LinkingOpenData

An example of an equivalence link is <[1]> owl:sameAs <[2]>, which claims that a DBpedia data item identifier (e.g., <[3]>) and a Geonames data item identifier (e.g., <[4]>) refer to the same data item. This is also known as "co-reference".
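Because owl:sameAs is symmetric and transitive, a set of such links partitions identifiers into co-reference sets. The following is a minimal sketch of that grouping using a union-find structure. The function name and the example.org/example.net/example.com URIs are hypothetical placeholders, not identifiers from DBpedia or Geonames.

```python
from collections import defaultdict


def coreference_sets(same_as_links):
    """Group URIs connected by owl:sameAs links into co-reference sets.

    same_as_links: iterable of (uri_a, uri_b) pairs, each read as
    "uri_a owl:sameAs uri_b".
    """
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root.
        parent.setdefault(uri, uri)
        # Walk up to the root, shortening the path along the way.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in same_as_links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical URIs, for illustration only.
    links = [
        ("http://example.org/a/Berlin", "http://example.net/b/1001"),
        ("http://example.net/b/1001", "http://example.com/c/berlin"),
        ("http://example.org/a/Paris", "http://example.net/b/1002"),
    ]
    for group in coreference_sets(links):
        print(group)
```

Each printed list is one co-reference set; any member could be chosen as the canonical identifier for that set.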

A simple alternative that avoids the need for equivalence mining is to use commonly accepted identifiers within URIs. For example, the RDF book mashup uses ISBN numbers in its URIs. This allows other data sources about books to set links to the data items of the book mashup using a simple URI pattern that includes the ISBN number.
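The identifier-based approach can be sketched as follows. This is an illustration under stated assumptions: the URI pattern `http://example.org/books/{isbn}` and the local book URI are hypothetical, since this page does not give the actual pattern used by the RDF book mashup. The sketch only normalizes the ISBN and does not validate its checksum.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical URI pattern; substitute the pattern of the target data source.
BOOK_URI_PATTERN = "http://example.org/books/{isbn}"


def normalize_isbn(raw):
    """Remove hyphens and spaces and uppercase a trailing 'x' check character."""
    isbn = raw.replace("-", "").replace(" ", "").upper()
    if len(isbn) not in (10, 13):
        raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
    return isbn


def book_uri(raw_isbn):
    """Build the target URI for a book from its ISBN alone."""
    return BOOK_URI_PATTERN.format(isbn=normalize_isbn(raw_isbn))


def same_as_triple(local_uri, raw_isbn):
    """Return one N-Triples line linking a local book URI to the target URI."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{book_uri(raw_isbn)}> ."


if __name__ == "__main__":
    # Hypothetical local URI and ISBN, for illustration only.
    print(same_as_triple("http://example.net/reviews/book/42", "0-262-01210-3"))
```

No similarity computation is needed here: two sources that know the same ISBN arrive at the same target URI independently.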

Software Tools

  • Silk - A Link Discovery Framework for the Web of Data. The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk can be run on a single machine or on a Hadoop cluster (for instance, on Amazon EC2).
  • LIMES - Link Discovery Framework for Metric Spaces. LIMES implements time-efficient and lossless approaches for large-scale link discovery based on the characteristics of metric spaces.
  • LDIF - Linked Data Integration Framework. LDIF translates heterogeneous Linked Data from the Web into a clean local target representation while keeping track of data provenance. The framework contains an identity resolution module for replacing URI aliases with a single target URI.
  • DSNotify - Detecting and Fixing Broken Links in Linked Data Sets
  • TopBraid Composer (ontology editor made by TopQuadrant) has a wizard for linking ontology instances to corresponding DBpedia concepts. See for details.
  • SERIMI - Automatic link generation tool that does not require prior knowledge of the data, domain or schema of the datasets being matched.
  • SemMF - SemMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties.
  • Yves' Equivalence Miner, together with an experience report about the problems he ran into while interlinking Jamendo and Musicbrainz.
  • MOAT: Meaning Of A Tag - Framework for manually interlinking tags with Semantic Web URIs (such as URIs from DBpedia, Geonames, or any other knowledge base)
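As a rough illustration of what automatic link discovery involves, the sketch below compares the labels of data items from two sources and proposes a link when the best match exceeds a threshold. It is not the API or the algorithm of any tool listed above. The function names, the threshold of 0.9, and the example URIs and labels are assumptions, and the string measure is simply Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, threshold=0.9):
    """Propose equivalence links between two sources.

    source, target: dicts mapping URI -> label.
    Returns a list of (source_uri, target_uri, score) for the best-scoring
    target of each source item, if its score reaches the threshold.
    """
    links = []
    for s_uri, s_label in source.items():
        best = max(
            ((similarity(s_label, t_label), t_uri)
             for t_uri, t_label in target.items()),
            default=None,
        )
        if best is not None and best[0] >= threshold:
            links.append((s_uri, best[1], best[0]))
    return links


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    source = {
        "http://example.org/a/Berlin": "Berlin",
        "http://example.org/a/Tokyo": "Tokyo",
    }
    target = {
        "http://example.net/b/1001": "berlin",
        "http://example.net/b/1002": "Bern",
    }
    for s_uri, t_uri, score in discover_links(source, target):
        print(f"<{s_uri}> owl:sameAs <{t_uri}>  (score {score:.2f})")
```

This sketch compares every source item with every target item, so its cost grows with the product of the two dataset sizes. Avoiding that exhaustive comparison on large datasets is one of the problems that dedicated link discovery frameworks address.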


People Interested in the Area

  • Stefano Mazzocchi (work plan)
  • Felix Van de Maele
  • Chris Bizer (I want to set links from the DBpedia dataset to other datasets. Already done: Geonames; planned: Musicbrainz, US Census data. If you have other datasets that fit to be linked to DBpedia, please let me know.)
  • Tom Heath (I'm primarily interested right now in very lightweight, low-cost heuristics/hacks to link up things, places, and reviews; the RDF Bookmashup ISBN approach is the kind of place I'm looking to start)
  • Yves Raimond
  • Hugh Glaser (Doing a lot of this stuff between big people, projects and publications sources.)
  • Oktie Hassanzadeh and Mariano Consens (Currently developing a tool for finding links between different data sources using state-of-the-art similarity join techniques)
  • François Scharffe - Working on automating the mining process by using ontology alignments
  • Thomas Schandl - Working on LASSO project, which is about interlinking instances from local knowledge bases to LOD sources

Papers and Web Resources on the Topic

This stuff has been done over and over in the database community, where it is often called duplicate recognition or record linkage. So if somebody knows good overview papers about the area, please add them to this page so that people don't have to reinvent the wheel.

There was a workshop on Ontology Matching at ISWC 2006. The approaches proposed there might also be useful for equivalence mining at the data item/instance level.