Use Case Ignoring Unreliable Data


Olaf Hartig, Chris Bizer


Paolo Missier

Provenance Dimensions

  • Primary:
    • Content: Process (Data Access + Data Creation)
  • Secondary:
    • Content: Attribution (Verifying attribution), Evolution and versioning (Republishing)
    • Management: Publication, Access
    • Use: Trust (Information Quality)

Background and Current Practice

There can be many different reasons for deciding to rely on data or to ignore it. This use case focuses on the requirement that data can be verified to be unmodified. Data that is created by one party and provided by others is vulnerable to manipulation; so is data transferred over an insecure channel. However, the creator as well as the publisher may provide a digital signature for the data so that any attempted manipulation can be detected.
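As an illustration of how tampering can be detected, the following minimal sketch attaches a keyed tag to a data item and checks it later. It is a stand-in only: Python's standard library offers no public-key signatures, so the sketch substitutes an HMAC, which requires a shared secret key and therefore, unlike a true digital signature, cannot be verified by arbitrary third parties. The function names, the key, and the sample record are hypothetical.

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Return a hex tag over data (HMAC stand-in for a creator's signature)."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def is_unmodified(data: bytes, tag: str, key: bytes) -> bool:
    """Return True only if data still matches the tag it was published with."""
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(sign(data, key), tag)


# Hypothetical key and source data item, for illustration only
key = b"hypothetical-shared-key"
original = b"population,2009,82000000"
tag = sign(original, key)

tampered = b"population,2009,92000000"
print(is_unmodified(original, tag, key))
print(is_unmodified(tampered, tag, key))
```

The first check succeeds and the second fails, because changing any byte of the data changes the expected tag.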


The goal is to enable a user who consumes data created from multiple source data items to ignore those data items for which the integrity of the source data involved cannot be guaranteed.

Use Case Scenario

Bob publishes a statistical dataset that he created by combining data provided by many different sources. Bob accesses these sources using various channels, some of which are insecure. Alice, a friend of Bob, realizes that Bob's dataset is a valuable source for her studies. However, she considers as reliable only those statistical records that are based on source data guaranteed to be unmodified.

Statistical data is only an example domain for this scenario; it can be replaced by any other domain in which data is harvested from multiple sources and combined afterwards. The scenario itself does not depend on a specific technology. A possible instance of the scenario is a Linked Data application that aggregates data from multiple sources that publish statistical data according to the Linked Data principles.

Problems and Limitations

To ignore data items for which the integrity of the source data involved cannot be guaranteed, Alice requires provenance information about all the data items she considers using. This information must include the source data items used to create each data item, as well as information about how Bob retrieved these source data items. The latter should include information about the corresponding transmission channel and, if the retrieved data was signed, the result of Bob's attempts to verify the digital signatures. The technical challenges are:

  • Bob has to make available provenance-related metadata about the pieces of his dataset. This is a provenance management issue (dimensions: Publication, Access).
  • The metadata must include the aforementioned information. This is a provenance content issue (dimensions: Attribution, Process).
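The provenance metadata described above, and the filtering Alice wants to perform with it, can be sketched as follows. This is a minimal sketch under stated assumptions, not a proposed vocabulary: all class, field, and function names are hypothetical, and the acceptance policy (a verified signature is sufficient; unsigned data is accepted only over a secure channel) is just one policy Alice might choose.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Channel(Enum):
    SECURE = "secure"
    INSECURE = "insecure"


@dataclass
class SourceAccess:
    """How Bob retrieved one source data item (hypothetical field names)."""
    source_id: str
    channel: Channel
    # None: the data was not signed; True/False: result of Bob's verification
    signature_valid: Optional[bool] = None


@dataclass
class Record:
    """One statistical record together with its provenance metadata."""
    record_id: str
    sources: List[SourceAccess] = field(default_factory=list)


def source_is_trustworthy(access: SourceAccess) -> bool:
    # Assumed policy: for signed data, the verification result decides
    if access.signature_valid is not None:
        return access.signature_valid
    # Assumed policy: unsigned data is accepted only over a secure channel
    return access.channel is Channel.SECURE


def is_reliable(record: Record) -> bool:
    # A record without any provenance information cannot be verified
    return bool(record.sources) and all(
        source_is_trustworthy(a) for a in record.sources
    )


def reliable_records(records: List[Record]) -> List[Record]:
    return [r for r in records if is_reliable(r)]


records = [
    Record("r1", [SourceAccess("s1", Channel.SECURE),
                  SourceAccess("s2", Channel.INSECURE, signature_valid=True)]),
    Record("r2", [SourceAccess("s3", Channel.INSECURE)]),
    Record("r3", [SourceAccess("s4", Channel.SECURE, signature_valid=False)]),
    Record("r4", []),
]
print([r.record_id for r in reliable_records(records)])
```

With this sample data only `r1` is kept: `r2` relies on unsigned data retrieved over an insecure channel, `r3` on data whose signature failed verification, and `r4` carries no provenance information at all.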

Existing Work

See the main page of the WIQA framework on information quality (IQ) in Linked Data.

Last modified on 4 January 2010, at 08:06