Use Case IQ Assessment for Linked Data

From XG Provenance Wiki
Revision as of 08:06, 4 January 2010 by Pgroth (Talk | contribs)


Owner

Olaf Hartig, Jun Zhao, and Chris Bizer

Curator

Paolo Missier

Background

Information quality (IQ) is a multidimensional concept comprising criteria such as accuracy, believability, completeness, and timeliness (more comprehensive lists of criteria can be found in, e.g., [1] and [2]). IQ assessment is the process of assigning numerical values, called IQ scores, to certain IQ criteria. IQ assessment is commonly considered a complex problem: the methods that can be applied are diverse and depend on the specific criterion as well as on the use case ([1] and [2] outline various methods).
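As a minimal illustration of the scoring idea, the sketch below combines per-criterion IQ scores into one overall score via a weighted average. The criteria, weights, and score values are hypothetical; real assessment methods are criterion- and use-case-specific, as noted above.

```python
# Minimal sketch of IQ scoring: each criterion yields a score in [0, 1],
# and an overall score is their weighted average. The criteria, weights,
# and score values below are illustrative assumptions, not a standard method.

def aggregate_iq_score(scores, weights):
    """Combine per-criterion IQ scores (0..1) into one overall score."""
    total_weight = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / total_weight

scores = {"accuracy": 0.9, "completeness": 0.6, "timeliness": 0.75}
weights = {"accuracy": 2.0, "completeness": 1.0, "timeliness": 1.0}

overall = aggregate_iq_score(scores, weights)
print(overall)
```

How the per-criterion scores are obtained in the first place is exactly the hard part that the assessment methods in [1] and [2] address.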

Due to the openness of the Web, much of the Linked Data on the Web is derived from other data by replication, querying, modification, or merging. Often, little is known about who created data on the Web and how. As a consequence, poor data quality can quickly propagate through the Web of Data. Unless an approach for evaluating the quality of data is established, the Web of Data would soon be widely contaminated, and applications built upon it would lose their value.

Goal

Information quality assessment for Linked Data

Use Case Scenario

With the rapid growth of Linked Data on the Web, more and more applications emerge that make use of this data. These applications can be expected to consume Linked Data from a large number of different sources on the Web. Due to the openness of the Web, they have to take the IQ of the consumed data into account. Hence, these applications have to apply IQ assessment methods to assess certain IQ criteria such as timeliness, accuracy, and believability. The applied methods may vary depending on the importance and criticality of the application. For many applications, fairly simple methods may suffice. These methods can be based on the provenance of the assessed data.

Problems and Limitations

To apply provenance-based IQ assessment methods, Linked Data consuming applications require provenance-related metadata. Hence, data publishers have to be enabled and encouraged to provide this metadata. For very simple assessments, information about the provider, the creator, and the creation time may be enough. Further information that might be useful includes: what source data was used for creation; where, how, and when the data (or the source data) was retrieved from the Web; who is responsible for accessed services; and how the data was created. However, the provenance information that is required depends on the assessment method applied by the users and is, therefore, difficult to predetermine. To get an idea of the diversity of possible assessment methods, take a look at Use Case Linked Data Timeliness, Use Case Simple Trustworthiness Assessment, and Use Case Ignoring Unreliable Data. Note that in certain assessment scenarios provenance information alone would not be sufficient; in these cases additional information is required, such as other metadata or an analysis of the data content.
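The simplest case mentioned above, using the creation time from provenance metadata, can be sketched as a timeliness check. The metadata layout (a dictionary with a "created" timestamp) and the age threshold are assumptions made for illustration only.

```python
# Illustrative sketch of a simple provenance-based timeliness assessment:
# a data item is considered timely if its recorded creation time is recent
# enough. The metadata layout and threshold are hypothetical.

from datetime import datetime, timedelta, timezone

def is_timely(metadata, max_age_days=30, now=None):
    """Accept a data item only if its creation time is recent enough."""
    now = now or datetime.now(timezone.utc)
    created = datetime.fromisoformat(metadata["created"])
    return now - created <= timedelta(days=max_age_days)

item = {
    "creator": "http://example.org/people/alice",   # assumed metadata fields
    "created": "2009-12-20T10:00:00+00:00",
}
print(is_timely(item, max_age_days=30,
                now=datetime(2010, 1, 4, tzinfo=timezone.utc)))
```

Other criteria, such as believability, would instead inspect fields like the provider or creator, which is why the required provenance information cannot be fixed in advance.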

Existing Work

Sig.Ma is a user interface for the Sindice Semantic Web Search Engine which allows users to filter information based on provenance (data source).

Many Linked Data browsers and search engines display basic provenance information (the URL from which an RDF triple has been retrieved) next to the actual data. Examples: Disco, Marbles, VisiNav

WIQA - Information Quality Assessment Framework is a set of software components that empowers information consumers to apply a wide range of different information quality assessment policies to filter information from the Web. WIQA includes an RDF data browser. In order to help users understand the filtering decisions, the browser can create explanations [3] of why displayed information fulfils a selected policy.

The Provenance Vocabulary provides classes and properties to describe the provenance of data from the Web. Hence, this vocabulary enables providers of Web data to publish provenance-related metadata about their data. The vocabulary is based on a model for Web data provenance as presented in [5]. Based on the Provenance Vocabulary, different Linked Data publishing tools have been extended with metadata components that automatically provide provenance information.
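To illustrate what such published provenance metadata amounts to, the sketch below builds plain (subject, predicate, object) triples for a data item. For brevity it uses well-known Dublin Core terms rather than the Provenance Vocabulary's own terms, and all URIs are made-up examples.

```python
# Hypothetical sketch of provenance-related metadata for a published data
# item, as plain (subject, predicate, object) triples. Dublin Core terms
# stand in for the Provenance Vocabulary here; the URIs are invented.

DCTERMS = "http://purl.org/dc/terms/"

def provenance_triples(data_uri, creator_uri, created, source_uri):
    """Return provenance-related metadata triples for a published data item."""
    return [
        (data_uri, DCTERMS + "creator", creator_uri),
        (data_uri, DCTERMS + "created", created),
        (data_uri, DCTERMS + "source", source_uri),
    ]

triples = provenance_triples(
    "http://example.org/dataset/item1",
    "http://example.org/people/alice",
    "2010-01-04T08:06:00Z",
    "http://example.org/source/raw-data",
)
for s, p, o in triples:
    print(s, p, o)
```

A publishing tool with a metadata component would emit such triples automatically alongside the actual data, so consumers need not rely on publishers adding them by hand.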

The tRDF4Jena library extends the Jena RDF framework with classes to represent, determine, and manage trust values that express the trustworthiness of RDF statements and RDF graphs. Furthermore, tRDF4Jena contains a query engine for tSPARQL [4], a trust-aware extension of the SPARQL query language.
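The underlying idea, attaching a trust value to each statement and filtering queries by a trust threshold, can be sketched as follows. This is a conceptual stand-in in plain Python, not the actual tRDF4Jena API or tSPARQL syntax, and the trust values and threshold are invented for the example.

```python
# Conceptual sketch in the spirit of tRDF4Jena / tSPARQL: each RDF
# statement carries a trust value, and a query keeps only statements whose
# trust meets a threshold. Data model and values are illustrative only.

def filter_by_trust(statements, threshold=0.5):
    """Keep only triples whose associated trust value meets the threshold."""
    return [triple for triple, trust in statements if trust >= threshold]

statements = [
    (("ex:berlin", "ex:population", "3431700"), 0.9),  # trusted source
    (("ex:berlin", "ex:population", "5000000"), 0.2),  # dubious source
]
print(filter_by_trust(statements, threshold=0.5))
```

In tSPARQL this kind of constraint is expressed inside the query itself, so trust-aware filtering composes with ordinary graph pattern matching.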

References

[1] Felix Naumann: Quality-Driven Query Answering for Integrated Information Systems. Springer Berlin / Heidelberg, 2002.

[2] Christian Bizer: Quality-Driven Information Filtering in the Context of Web-Based Information Systems. Thesis, Freie Universität Berlin, 2007.

[3] Tim Berners-Lee: Cleaning Up the User Interface, Section: The "Oh, yeah?"-Button, 1997.

[4] Olaf Hartig: Querying Trust in RDF Data with tSPARQL. In Proceedings of the 6th European Semantic Web Conference (ESWC), Heraklion, Greece, June 2009.

[5] Olaf Hartig: Provenance Information in the Web of Data. In Proceedings of the Linked Data on the Web (LDOW) Workshop at WWW, Madrid, Spain, April 2009.