SWEO Community Project: Linking Open Data on the Semantic Web
Equivalence Mining and Matching Frameworks
This page collects software tools and papers about techniques that can be used to auto-generate links between data items in different data sources.
The page is part of the community project SweoIG/TaskForces/CommunityProjects/LinkingOpenData
An example of an equivalence link is <[1]> owl:sameAs <[2]>, claiming that a DBpedia data item identifier (e.g., <[3]>) and a Geonames data item identifier (e.g., <[4]>) refer to the same data item. This is also known as "co-reference".
A simple alternative that avoids the need for equivalence mining is to use commonly accepted identifiers within URIs. For example, the RDF Book Mashup uses ISBN numbers in its URIs. This allows other data sources about books to set links to the data items of the Book Mashup using a simple URI pattern that includes the ISBN number.
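The ISBN approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two URI patterns are hypothetical placeholders (they are not the actual patterns of the RDF Book Mashup or of any other dataset), and the choice of owl:sameAs as the link predicate is an assumption as well.

```python
import re

# Hypothetical URI patterns, used for illustration only.
SOURCE_PATTERN = "http://example.org/books/{isbn}"
TARGET_PATTERN = "http://example.com/bookmashup/books/{isbn}"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize_isbn(raw: str) -> str:
    """Remove hyphens and whitespace and upper-case the ISBN-10 check character."""
    isbn = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{9}[\dX]|\d{13}", isbn):
        raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
    return isbn


def same_as_triple(raw_isbn: str) -> str:
    """Build one N-Triples line linking both URIs minted from the same ISBN."""
    isbn = normalize_isbn(raw_isbn)
    subject = SOURCE_PATTERN.format(isbn=isbn)
    obj = TARGET_PATTERN.format(isbn=isbn)
    return f"<{subject}> <{OWL_SAME_AS}> <{obj}> ."


if __name__ == "__main__":
    print(same_as_triple("0-471-48696-X"))
```

Because both sides derive the URI from the same normalized identifier, no similarity computation is needed; the link is only as reliable as the ISBN values in the two sources.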
Software Tools
- Silk - A Link Discovery Framework for the Web of Data. The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk can be run on a single machine or on a Hadoop cluster (for instance, on Amazon EC2).
- LIMES - Link Discovery Framework for Metric Spaces. LIMES implements time-efficient and lossless approaches for large-scale link discovery based on the characteristics of metric spaces.
- LDIF - Linked Data Integration Framework translates heterogeneous Linked Data from the Web into a clean local target representation while keeping track of data provenance. The framework contains an identity resolution module for replacing URI aliases with a single target URI.
- DSNotify - Detecting and Fixing Broken Links in Linked Data Sets
- TopBraid Composer (ontology editor made by TopQuadrant) has a wizard for linking ontology instances to corresponding DBpedia concepts. See for details.
- SERIMI - An automatic link generation tool that does not require prior knowledge of the data, domain or schema of the datasets being linked.
- SemMF - A flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties.
- Yves Raimond's Equivalence Miner, together with an experience report about the problems he ran into while interlinking Jamendo and Musicbrainz.
- MOAT (Meaning Of A Tag) - A framework for manually interlinking tags with Semantic Web URIs (such as URIs from DBpedia, Geonames … or any knowledge base)
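The identity resolution step described in the LDIF entry above, replacing URI aliases with a single target URI, can be illustrated with a small union-find sketch. This is not LDIF's implementation; the function name `canonicalize` and the rule of picking the lexicographically smallest URI as the target are assumptions made for the example.

```python
def canonicalize(same_as_pairs):
    """Map every URI to one representative URI of its owl:sameAs cluster.

    same_as_pairs: iterable of (uri_a, uri_b) tuples asserted to be aliases.
    The representative is the lexicographically smallest URI in the cluster
    (an arbitrary but deterministic choice made for this sketch).
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller URI as the root so it ends up as the target.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {uri: find(uri) for uri in list(parent)}


if __name__ == "__main__":
    links = [
        ("http://example.org/b", "http://example.org/a"),
        ("http://example.org/c", "http://example.org/b"),
    ]
    for alias, target in sorted(canonicalize(links).items()):
        print(alias, "->", target)
```

Treating owl:sameAs as transitive in this way merges whole clusters, so a single wrong link can collapse unrelated items; that is one reason the quality of the mined links matters.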
Benchmarks
- The Ontology Alignment Evaluation Initiative 2009 and 2010 competitions included an instance matching track. See Results 2010 and Results 2009
- The ISLab Instance Matching Benchmark provides a benchmark for instance matching and identity resolution tools.
People Interested in the Area
- Stefano Mazzocchi (work plan)
- Felix Van de Maele
- Chris Bizer (I want to set links from the dbpedia dataset to other datasets. Already done: geonames, planned: Musicbrainz, US Census data. If you have other datasets that fit to be linked to dbpedia, please let me know.)
- Tom Heath (I'm primarily interested right now in very lightweight, low-cost heuristics/hacks to link up things, places, and reviews; the RDF Bookmashup ISBN approach is the kind of place I'm looking to start)
- Yves Raimond
- Hugh Glaser (Doing a lot of this stuff between big people, projects and publications sources.)
- Oktie Hassanzadeh and Mariano Consens (Currently developing a tool for finding links between different data sources using state-of-the-art similarity join techniques)
- François Scharffe - Working on automating the mining process by using ontology alignments
- Thomas Schandl - Working on LASSO project, which is about interlinking instances from local knowledge bases to LOD sources
Papers and Web Resources on the Topic
- Wolger et al.: A Survey on Data Interlinking Methods, March 2011.
- Robert Isele, Anja Jentzsch, Christian Bizer: Silk Server - Adding missing Links while consuming Linked Data. 1st International Workshop on Consuming Linked Data (COLD 2010), Shanghai, November 2010.
- Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov: Discovering and Maintaining Links on the Web of Data. International Semantic Web Conference (ISWC2009), Westfields, USA, October 2009.
- Yves Raimond, Christopher Sutton and Mark Sandler: Automatic Interlinking of Music Datasets on the Semantic Web. LDOW 2008 Paper.
- Afraz Jaffri, Hugh Glaser and Ian Millard: URI Disambiguation in the Context of Linked Data. LDOW 2008 Paper.
- Andriy Nikolov, Victoria Uren, Enrico Motta and Anne de Roeck: Handling instance coreferencing in the KnoFuss architecture, 2008.
- A. Nikolov, V. Uren, E. Motta, A. de Roeck: KnoFuss: A comprehensive architecture for knowledge fusion. K-CAP 2007, Whistler, Canada, 2007.
- Christian Becker, Chris Bizer, Georgi Kobilarov: BBC interlinks with DBpedia, 2008
- Chris Bizer, Tom Heath: Auto-generated owl:sameAs links between the RDF Book Mashup and the DBLP database
- Stefano Mazzocchi: Rewiring Scenarios
- Yves Raimond: Linking open data: publishing and linking the Jamendo dataset
- Yves Raimond: Linking open data: interlinking the BBC John Peel sessions and the DBpedia datasets
- Alani, H., Dasmahapatra, S., Gibbins, N., Glaser, H., Harris, S., Kalfoglou, Y., O'Hara, K. and Shadbolt, N. Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web (2002).
- see also Section on Link Discovery in Linked Data book.
This kind of matching has been done over and over in the database community, where it is often called duplicate detection or record linkage. If you know good overview papers about the area, please add them to this page so that people don't have to reinvent the wheel.
- Wikipedia: Record Linkage
- Duplicate Record Detection: A Survey. by Elmagarmid et al. TKDE, 2007.
- Tutorial on Approximate Joins. VLDB, 2005
- Fellegi, Sunter: A Theory for Record Linkage. Journal of the American Statistical Association, 1969
- Hernandez: Real-world Data is Dirty: Data Cleansing and The Merge / Purge Problem. Data Mining and Knowledge Discovery, 1998
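To give a flavour of the record linkage techniques these references cover, here is a minimal similarity-matching sketch. It is not taken from any of the papers listed: the token-based Jaccard measure, the threshold of 0.5, the function names and the `ex:` identifiers are all assumptions chosen to keep the example short.

```python
def tokens(label):
    """Lower-case a label and split it into a set of whitespace-separated tokens."""
    return set(label.lower().split())


def jaccard(a, b):
    """Jaccard similarity of the token sets of two labels (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match(source, target, threshold=0.5):
    """For each source item, return its best-scoring target at or above the threshold.

    source, target: dicts mapping an identifier to a label.
    Returns a list of (source_id, target_id, score) tuples.
    """
    links = []
    for s_id, s_label in source.items():
        if not target:
            break
        t_id = max(target, key=lambda t: jaccard(s_label, target[t]))
        score = jaccard(s_label, target[t_id])
        if score >= threshold:
            links.append((s_id, t_id, score))
    return links


if __name__ == "__main__":
    source = {"ex:s1": "Berlin Germany", "ex:s2": "Paris"}
    target = {"ex:t1": "berlin germany", "ex:t2": "Paris France", "ex:t3": "London"}
    for link in match(source, target):
        print(link)
```

This compares every source item with every target item, which is quadratic in the number of items; practical record linkage systems typically reduce the candidate pairs first (for example by blocking on a shared key) before computing similarities.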
There was a workshop on Ontology Matching at ISWC 2006. The approaches proposed there might also be useful for equivalence mining on data item/instance level.