Use Case Linked Data Timeliness

From XG Provenance Wiki

Owner

Olaf Hartig

(Curator: Paolo Missier)

Provenance Dimensions

  • Primary:
    • Use: Trust (Information Quality)
  • Secondary:
    • Content: Evolution and versioning -> Republishing, Process (Data Creation, Data Access)
    • Management: Publication, Access
    • Use: Interoperability

Background and Current Practice Scenario

Timeliness refers to the property of a piece of data to be "recent enough" to still be useful for a specific application. A typical example is stock ticks, which may become obsolete very quickly for the purpose of real-time trading, but stay timely for a long time for time series analysis on historical data.

Estimating or measuring the age of data, and thereby determining its timeliness relative to a certain use, has been a long-standing problem of interest in the data quality community. Assessing timeliness relies on knowledge of the creation date/time of a piece of data. This metadata may be made available in various ways, which are typically data- and application-specific. In particular, when the creation time of a piece of data is not available, surrogates such as the time of last access can sometimes be used. A surrogate, however, yields only an approximate timeliness assessment.
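The dependence of timeliness on the application, and the weaker guarantee given by a surrogate, can be sketched as follows. This is only an illustration: the function names, the thresholds, and the dates are assumptions made for the example and are not part of the use case. The sketch relies on one observation: data that was accessed at a given time must have existed at that time, so a last-access time gives a lower bound on the age of the data, not the age itself.

```python
from datetime import datetime, timedelta, timezone


def is_timely(created: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if data created at `created` is recent enough at `now`."""
    return now - created <= max_age


def may_be_timely(last_access: datetime, now: datetime, max_age: timedelta) -> bool:
    """Necessary but not sufficient check when only a surrogate is known.

    The data already existed at `last_access`, so its true age is at
    least `now - last_access`; it may be arbitrarily older.
    """
    return now - last_access <= max_age


# Hypothetical values, chosen only for illustration.
now = datetime(2010, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
tick_created = now - timedelta(minutes=5)

REAL_TIME_TRADING = timedelta(seconds=1)      # assumed threshold
HISTORICAL_ANALYSIS = timedelta(days=3650)    # assumed threshold

print(is_timely(tick_created, now, REAL_TIME_TRADING))    # False
print(is_timely(tick_created, now, HISTORICAL_ANALYSIS))  # True
```

A `False` result from `may_be_timely` rules the data out, whereas a `True` result only means that the data cannot be ruled out on the basis of the surrogate.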

Goal

Thus, the goal is to enable users of Web data to make informed decisions on data fitness for purpose, based (in part) on its age, and therefore on its timeliness. This use case explores the use of provenance as a novel way to address the timeliness assessment problem, in the context of Web data.

Use Case Scenario

Alice uses an application that provides her with a particular stream of data. To fix ideas, we consider data that contains the latest traffic news. To achieve a more complete and more balanced view, and to guarantee the latest information, the application takes into account data from multiple data sources on the Web.

In turn, Bob and Carol publish local traffic data on the Web, as two separate data sources. However, they both take the data from the same data source, X, which holds traffic data about the whole country. This data source changes frequently.

During its execution, Alice's application compares two data items, B and C, that carry different traffic information for the same road. Alice needs to choose one of the two, and the criterion she uses is data timeliness, which here is directly related to the creation date of the items.

These items are published by Bob and Carol, respectively, using X as their common data source, and both come with some form of provenance associated with them. In Bob's case, the provenance includes the creation date/time of the version of the data item provided by X that B is based upon. This is precisely the metadata that Alice needs to determine B's timeliness. Carol's application, on the other hand, publishes C using some version of the data provided by X but is unaware of the creation date of that version. Instead, Carol includes the date of her last access to X as a surrogate.

In practice, B and C carry two types of time-related provenance metadata: stronger for B and weaker for C. Alice uses this metadata to make her decision, for instance to ignore B because its date is older than C's. She remains aware, however, that this choice may be wrong, for example when C has been updated recently from a version of the data provided by X that is nonetheless older than the version used to update B.
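Alice's decision, and the uncertainty attached to it, can be sketched as follows. The class names, the `choose` function, and the timestamps are hypothetical and serve only to illustrate the scenario. The sketch prefers the item with the later timestamp and reports the choice as certain only when that timestamp is a true creation time, since a last-access time merely bounds the creation time from above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Tuple


class TimeKind(Enum):
    SOURCE_CREATED = "creation time of the source version"      # stronger
    LAST_ACCESS = "time the publisher last accessed the source"  # weaker


@dataclass(frozen=True)
class Item:
    name: str
    time: datetime
    kind: TimeKind


def choose(a: Item, b: Item) -> Tuple[Item, bool]:
    """Pick the item with the later timestamp.

    Returns (chosen, certain). For both kinds of metadata, the underlying
    source version was created no later than `time`. The choice is therefore
    guaranteed to be at least as recent as the other item only if the chosen
    timestamp is an exact creation time.
    """
    chosen = a if a.time >= b.time else b
    certain = chosen.kind is TimeKind.SOURCE_CREATED
    return chosen, certain


# Hypothetical timestamps for the two items of the scenario.
B = Item("B", datetime(2010, 6, 1, 10, 0, tzinfo=timezone.utc), TimeKind.SOURCE_CREATED)
C = Item("C", datetime(2010, 6, 1, 10, 30, tzinfo=timezone.utc), TimeKind.LAST_ACCESS)

chosen, certain = choose(B, C)
print(chosen.name, certain)  # C False
```

With these values the sketch selects C, as Alice does in the scenario, but flags the choice as uncertain: the version of X behind C may in fact predate the one behind B.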

Problems and Limitations

Here are the main technical challenges in this use case, followed by a brief description of a specific setting in which we argue that provenance can be used effectively in combination with Linked Data principles.

  • Both Bob and Carol, the providers, must associate several pieces of provenance metadata with the data they publish. This must include the originating source (X), and should include the creation date of the version of the data from X that the published data is based upon, or at least the last access date. Without these, timeliness assessment is inaccurate at best, or impossible.
    • This is a provenance content and management issue.
  • The difference in semantics between the two dates, described above, must be made explicit so that Alice is aware of the potential errors. Without this, provenance is insufficient for Alice to reach a correct decision.
    • This is a provenance content and management issue.
  • Alice must be able to understand the representation of provenance for both B and C. Ideally, this is represented in a uniform way, although it has been generated independently by two different providers. Without this, Alice's code will be "hard-wired" to Bob's and Carol's provenance formats, which makes it hard to reuse and to extend.
    • This is a provenance use issue.

We propose to cast the use case in a setting where data is published according to the Linked Data principles. In this case, Alice's application has access to the data through a Web interface, in the form of RDF graphs. This setting, which is becoming increasingly common in practice, partially addresses the technical challenges above by providing a uniform way of addressing and accessing the data. By itself it does not, however, solve the problem of provenance semantics and interoperability.
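As a rough illustration of this setting, the sketch below models provenance as plain subject-predicate-object triples and reads it with one generic lookup, independent of who published it. The `http://example.org/` namespace, the item URIs, and the properties `sourceVersionCreated` and `sourceLastAccessed` are hypothetical names invented for this example; only `source` from the Dublin Core terms namespace is an existing property. A real application would use an RDF library and an agreed provenance vocabulary rather than Python tuples.

```python
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms namespace
EX = "http://example.org/"          # hypothetical namespace for this sketch

# Provenance of B and C as triples; the two date properties are assumed
# names that keep the two date semantics distinct.
triples = {
    (EX + "B", DCT + "source", EX + "X"),
    (EX + "B", EX + "sourceVersionCreated", "2010-06-01T10:00:00Z"),
    (EX + "C", DCT + "source", EX + "X"),
    (EX + "C", EX + "sourceLastAccessed", "2010-06-01T10:30:00Z"),
}


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def time_provenance(item):
    """Return (property, value) pairs of time-related provenance for an item."""
    found = []
    for prop in (EX + "sourceVersionCreated", EX + "sourceLastAccessed"):
        for value in objects(item, prop):
            found.append((prop, value))
    return found


for item in (EX + "B", EX + "C"):
    print(item, objects(item, DCT + "source"), time_provenance(item))
```

The same two functions serve both providers, which is the uniform access that the setting offers. Distinguishing the two date properties, and agreeing on what they mean, remains the open semantics and interoperability problem noted above.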

Existing Work

[Hartig and Zhao SWPM09] describe an approach to develop a timeliness assessment method for Web data.