HCLSIG/LODD/Interlinking/Metadata/ProvenanceForDataInterlinking

This document was created to share a provenance-related use case with the W3C Provenance Incubator Group.

Owner

JunZhao and AnjaJentzsch

Background

This use case comes from the LODD Task force of the W3C HCLS Interest Group (http://esw.w3.org/topic/HCLSIG/LODD/Interlinking/Metadata) and is based on the discussions between Jun and Anja.

See related use case: http://www.w3.org/2005/Incubator/prov/wiki/Use_Case_Simple_Trustworthiness_Assessment.

Given the number of data items to be interlinked on the Web of Data, an automatic way to create these interlinks on a large scale is highly desirable. However, without information about which tools were used to generate the links, when the links were generated, and what mapping scores were used to filter the mapping results, users have less confidence in the automatic interlinking results. There is therefore a need for a vocabulary that provides metadata about links created by automatic interlinking discovery tools between any two or more data sets on the Web.

Goal

Enable linked data publishers and application developers to select and link to data links that have a high confidence value.

Use Case Scenario

A linked data application developer D wants to build an application that combines information from data sources A and B. Both A and B publish information about drugs, but each dataset uses different URIs to identify the same drug. The different drug URIs were mapped using the software tool Silk when B was published by its data provider. In the application, D wants to include drug-related information from both A and B because they complement each other. D only wants to present drugs from A that are linked to those from B with a confidence value higher than 97%.
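The filtering step in this scenario can be sketched as follows. This is only an illustration, assuming the links are already available in memory with their confidence expressed as a fraction between 0 and 1. The Link type, the select_confident_links function, and the example.org URIs are hypothetical and not part of Silk or of datasets A and B.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One interlink between a drug in A and a drug in B (hypothetical type)."""
    source: str        # drug URI in dataset A
    target: str        # drug URI in dataset B
    confidence: float  # similarity score as a fraction in [0, 1]


def select_confident_links(links, threshold=0.97):
    """Return the links whose confidence is strictly higher than threshold."""
    return [link for link in links if link.confidence > threshold]


# Made-up example links; the URIs are placeholders.
links = [
    Link("http://example.org/A/drug/1", "http://example.org/B/drug/x", 0.99),
    Link("http://example.org/A/drug/2", "http://example.org/B/drug/y", 0.97),
    Link("http://example.org/A/drug/3", "http://example.org/B/drug/z", 0.80),
]

confident = select_confident_links(links)
for link in confident:
    print(link.source, "->", link.target, link.confidence)
```

Note that the comparison is strict, so a link with a confidence of exactly 97% is excluded, which matches the "higher than 97%" requirement of D.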

Challenges and Potential Solutions

The links between A and B can be published as a separate dataset in order to simplify the maintenance of updates. The provenance of these links should include:

  • linkage tool used to create the link
  • linkage tool version
  • link confidence as a number (percentage) (in Silk this would be the generated similarity value)
  • rule set used to create the link (at least a link, not necessarily a version number)
  • the person or organization that created the links
  • previous versions of links
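As a rough illustration of the list above, the provenance of a single link could be held in a record like the one below. This is a sketch rather than a proposed vocabulary: the LinkProvenance class, its field names, and all example values (including the version string and the example.org URIs) are assumptions made for this example only.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LinkProvenance:
    """Provenance of one interlink (hypothetical record, not a standard)."""
    tool: str                               # linkage tool used to create the link
    tool_version: str                       # linkage tool version
    confidence: float                       # link confidence as a percentage
    rule_set: str                           # link (URI) to the rule set used
    creator: str                            # person or organization that created the link
    previous_version: Optional[str] = None  # URI of the previous version, if any


# Placeholder values only; none of them describe a real link set.
record = LinkProvenance(
    tool="Silk",
    tool_version="0.0-example",
    confidence=98.5,
    rule_set="http://example.org/rules/drug-mapping",
    creator="http://example.org/people/data-provider-of-B",
)

print(asdict(record))
```

Keeping the rule set and the previous version as URIs follows the requirement that at least a link is recorded, not necessarily a version number.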

Existing work has been reviewed at http://esw.w3.org/topic/HCLSIG/LODD/Interlinking/Metadata. However, no best practice has yet been agreed on for representing the provenance of an RDF triple.