Use Case Semantic Disambiguation of Data Provider Identity

From XG Provenance Wiki

1 Name
2 Owner
3 Provenance Dimensions
4 Background and Current Practice
5 Goal
6 Use Case Scenario
7 Problems and Limitations
8 References

Name

Semantic disambiguation of data provider identity

Owner

Aleksey Chayka

Provenance Dimensions

Primary: Content: Attribution (verifying attribution)

Secondary: Content: Evolution and versioning (republishing); Use: Interoperability, Understanding (Presentation)

Background and Current Practice

The IETF specification Uniform Resource Identifiers (URI): Generic Syntax (RFC 2396) defines a resource as “anything that has identity”. To date, the good practice of associating exactly one URI with each resource on the Web is not supported by any large-scale web infrastructure. There is therefore no easy, standard way to prevent the creation of URI aliases; as a consequence, a new URI tends to be minted for the same resource whenever a statement is made about it in a different location on the Web. Correctly identifying a source that may be referred to by many different URIs thus becomes a problem.

There are currently two major approaches that could help to solve the problem. The first is the Linking Open Data Initiative [2], whose goal is to “connect related data that wasn’t previously linked”. The main approach pursued by the initiative is to establish owl:sameAs statements between resources in RDF. While the community has made a huge effort to link a significant amount of data, the approach depends on specialized, data-source-dependent heuristics to establish the owl:sameAs statements between resources, and it requires the statements to be stored somewhere, along with the data. This raises several concerns. First, in most Web scenarios ordinary web users are unlikely to make the effort to create owl:sameAs statements for their data. Second, an error in a single identity statement might have far-reaching ramifications across the entire Web of Data, because owl:sameAs is transitive and one wrong link merges everything connected to either side. Finally, reasoning over massive numbers of owl:sameAs statements in distributed ontologies is computationally complex and highly expensive, which may lead to the conclusion that such linked data is more suitable for browsing than for reasoning or querying.
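The second concern can be illustrated with a minimal sketch. It groups URIs into the equivalence classes implied by owl:sameAs statements using a union-find structure, and shows how one erroneous statement merges two unrelated clusters. The example URIs and the function name `same_as_clusters` are hypothetical and chosen only for illustration; this is not how any particular Linked Data tool works.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_clusters(triples):
    """Group URIs into the equivalence classes implied by owl:sameAs triples."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        # Walk up to the representative, halving the path on the way.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            parent[find(subject)] = find(obj)

    clusters = defaultdict(set)
    for uri in parent:
        clusters[find(uri)].add(uri)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical URIs minted by three different sites.
triples = [
    ("http://a.example/Bush", OWL_SAME_AS, "http://b.example/GeorgeWBush"),
    ("http://b.example/GeorgeWBush", OWL_SAME_AS, "http://c.example/president43"),
    ("http://a.example/Spain", OWL_SAME_AS, "http://b.example/SpainCountry"),
]
correct = same_as_clusters(triples)

# A single wrong identity statement collapses both clusters into one.
wrong_link = ("http://c.example/president43", OWL_SAME_AS, "http://a.example/Spain")
polluted = same_as_clusters(triples + [wrong_link])

print(len(correct), len(polluted))
```

With the three correct statements the URIs fall into two clusters; after adding the wrong link, every URI is considered identical to every other, which is the kind of ramification described above.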

The second approach is presented by Jaffri et al. [3]. In their work resulting from the ReSIST project, the authors conclude that the proliferation of identifiers and the resulting coreference issues should be addressed at the infrastructure level. As a solution, they propose what they call a Consistent Reference Service. However, their point that a URI may change its “meaning” depending on the context in which it is used is philosophically disputable: the fact that several entities might be named in the same way (“Spain” the football team and “Spain” the country) must not lead to the conclusion that they can be considered the same. Furthermore, their implementation of “coreference bundles”, which establish identity between entities, is in fact very similar to a collection of owl:sameAs statements, as described for the first approach.

Goal

The goal is to disambiguate the identities of information sources, regardless of how other users identify those sources. Once a source has a proper identity, that identity can be used for reasoning about the source.

Use Case Scenario

Bob wants to cite the 43rd President of the United States in an e-mail to a group of people. He wants to refer to the President in such a way that everybody recognizes the correct person. Millions of people may call him “43rd”, which uniquely identifies the person for each of them. Billions of people may identify him as “George W. Bush”; for the rest, the best identifier would probably be “the President of the United States George W. Bush”. Bob’s goal is to let other people find (identify) the source of information that he wants to cite. Note that even if a source is misidentified, its wrong ID can still be used within a group of people who share conventions for identifying that source [1].
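The scenario can be sketched as a lookup from human-chosen labels to one canonical identifier, with an optional group-specific table for labels that only make sense under a shared convention. Everything here is an assumption made for illustration: the identifier `urn:example:person:george-w-bush`, the group name `bobs-mailing-list`, and the function `resolve` do not come from the use case or from any existing system.

```python
# Hypothetical canonical identifier; not a real URI scheme in use.
CANONICAL = "urn:example:person:george-w-bush"

# Labels assumed to be understood by (almost) everyone.
GLOBAL_ALIASES = {
    "george w. bush": CANONICAL,
    "the president of the united states george w. bush": CANONICAL,
}

# Labels that identify the person only inside a group sharing a convention.
GROUP_ALIASES = {
    "bobs-mailing-list": {"43rd": CANONICAL},
}


def resolve(label, group=None):
    """Return the canonical identifier for a label, or None if unknown."""
    key = " ".join(label.lower().split())  # normalize case and whitespace
    if group is not None:
        hit = GROUP_ALIASES.get(group, {}).get(key)
        if hit is not None:
            return hit
    return GLOBAL_ALIASES.get(key)


print(resolve("George W. Bush"))
print(resolve("43rd"))
print(resolve("43rd", group="bobs-mailing-list"))
```

Outside the group, “43rd” resolves to nothing; inside it, the same label leads to the same identifier as the full name.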

Problems and Limitations

Setting aside the human perception of a source ID, the problem is to unify the way a source is identified so that it can be queried and reasoned about consistently. For web data, identifying the source itself (by means of a proper domain name and local URI) is not an issue. But when a source works with RDF graphs containing statements that refer to other sources, the question of semantic disambiguation of types and of the sources themselves arises. If alternative IDs or some facts about the source are available, the ambiguity can be resolved by a system such as the Entity Name System [1]. The provenance of a source will also help to disambiguate the source itself (i.e. differentiate a data originator from a mediator).
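As a rough sketch of disambiguation by known facts, the following code picks among entities that share a label by counting how many supplied facts each candidate matches, and declines to answer when the best score is tied. It is only loosely inspired by the idea of an entity lookup service and does not reproduce the Entity Name System's actual interface or matching algorithm; the `Entity` type, the `disambiguate` function, and the `urn:example:` identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    identifier: str
    labels: frozenset   # lower-cased names the entity is known by
    facts: frozenset    # (property, value) pairs known about the entity


def disambiguate(label, known_facts, entities):
    """Return the identifier of the best-matching entity, or None if ambiguous."""
    candidates = [e for e in entities if label.lower() in e.labels]
    # Score each candidate by the number of supplied facts it matches.
    scored = sorted(
        ((len(e.facts & known_facts), e.identifier) for e in candidates),
        reverse=True,
    )
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # tie: the label alone does not identify one entity
    return scored[0][1]


entities = [
    Entity("urn:example:spain-country", frozenset({"spain"}),
           frozenset({("type", "country")})),
    Entity("urn:example:spain-football-team", frozenset({"spain"}),
           frozenset({("type", "football team")})),
]

print(disambiguate("Spain", frozenset(), entities))
print(disambiguate("Spain", frozenset({("type", "country")}), entities))
```

The label “Spain” on its own is ambiguous, but one additional fact about the type is enough to select a single identifier. Provenance facts, such as whether a source originated or merely relayed the data, could be supplied in the same way.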