Best Practices Pragmatic Provenance

From Government Linked Data (GLD) Working Group Wiki

Best Practices: Pragmatic Provenance for Government LOD

Status

  • Dec 2011 - Initial revisions by John Erickson (RPI)

Overview

Provide best-practice recommendations for stakeholders on how to document the provenance of their linked government data and how to interpret that data, so that consumers know what they are looking at. (suggested by Hadley Beeman)

Background

In 1997 Tim Berners-Lee called for pervasive provenance on the Web:

At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons.

W3C GLD therefore seeks to recommend practices that enable government providers to create the metadata necessary to answer their users' "oh yeah?" questions about the linked data they publish. Our recommendations may include processes as well as the application of specific vocabularies/ontologies.

What do we mean by "Provenance?"

The W3C's Provenance Incubator Group (2010) provides this simple definition of provenance:

Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

More recently, the W3C Provenance WG (PROV-WG) defined "provenance" for its own work as follows:

The provenance of digital objects represents their origins. The PROV Data Model (PROV-DM) is a proposed standard to represent provenance records, which contain assertions about the entities and activities involved in producing and delivering or otherwise influencing a given object. By knowing the provenance of an object, we can make determinations about how to use it. Provenance records can be used for many purposes, such as understanding how data was collected so it can be meaningfully used, determining ownership and rights over an object, making judgments about information to determine whether to trust it, verifying that the process and steps used to obtain a result complies with given requirements, and reproducing how something was generated... As a standard for provenance, PROV-DM accommodates all those different uses of provenance. Different people may have different perspectives on provenance, and as a result different types of information might be captured in provenance records.

What do we mean by "Pragmatic Provenance?"

The W3C Government Linked Data WG accepts the PROV WG's definition of provenance, but recognizes that PROV-DM is a powerful tool whose full expressiveness may be more than many publishers need at first. W3C GLD WG therefore seeks to provide best practice recommendations that are useful to government data stakeholders, make sense for GLD use cases, and are easily adopted by practitioners.

W3C GLD could recommend a simple provenance scoring system for GLD, analogous to Tim Berners-Lee's 5-star scheme for linked open data. Such a system might include:

  • One star: Basic W3C DCAT metadata at the catalog and dataset level
  • Two stars: DCAT enhanced with more complete Dublin Core and other metadata
  • Three stars: As above, plus basic provenance metadata "within" the datasets
  • More stars: More rigorous use of PROV-DM
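
To make the proposed tiers concrete, the following is a minimal sketch of how a checker might assign stars to one dataset description. Everything specific in it is an assumption made for illustration and not a GLD recommendation: the function name `provenance_stars`, the particular Dublin Core terms required for the second star, the terms counted as "basic" provenance, and the cap at four stars. Only the DCAT, Dublin Core and PROV namespace IRIs are taken from the published vocabularies.

```python
# Hypothetical scoring sketch; every threshold below is an assumption.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
PROV = "http://www.w3.org/ns/prov#"

# Assumed Dublin Core coverage needed for the second star.
RICHER_DC_TERMS = {DCT + "publisher", DCT + "issued",
                   DCT + "modified", DCT + "license"}
# Assumed "basic" provenance terms used inside a dataset (third star).
BASIC_PROVENANCE = {DCT + "source", DCT + "creator", DCT + "provenance"}
# Assumed markers of more rigorous PROV usage (fourth star).
PROV_TERMS = {PROV + "wasGeneratedBy", PROV + "wasDerivedFrom",
              PROV + "wasAttributedTo"}


def provenance_stars(dataset_types, dataset_predicates, record_predicates):
    """Return 0 to 4 stars for one dataset description.

    dataset_types: rdf:type IRIs asserted for the dataset
    dataset_predicates: predicate IRIs describing the dataset itself
    record_predicates: predicate IRIs used on records inside the dataset
    """
    dataset_predicates = set(dataset_predicates)
    record_predicates = set(record_predicates)
    if DCAT + "Dataset" not in set(dataset_types):
        return 0  # not described with DCAT at all
    if not RICHER_DC_TERMS <= dataset_predicates:
        return 1  # basic DCAT only
    if not record_predicates & (BASIC_PROVENANCE | PROV_TERMS):
        return 2  # richer metadata, but nothing inside the dataset
    if not record_predicates & PROV_TERMS:
        return 3  # basic provenance inside the dataset
    return 4      # PROV terms used inside the dataset


example = provenance_stars(
    [DCAT + "Dataset"],
    RICHER_DC_TERMS | {DCT + "title"},
    [DCT + "source"],
)
print(example)
```

The tiers are cumulative in this sketch: a dataset cannot earn a later star without meeting the earlier conditions, which mirrors the "Above, but with..." wording of the list.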

Use cases for provenance in GLD

Provide use cases here...

  • Specifying catalog- and dataset-level provenance
  • Specifying provenance within datasets
    • Preserving and encoding pre-existing provenance data
    • Generating provenance when processing data (e.g. during the Linked Data creation process)
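
As an illustration of the last use case, generating provenance while data is being processed, the sketch below emits a small provenance record for one conversion run as N-Triples lines. It is a sketch under stated assumptions: the property names come from the PROV-O vocabulary that the PROV WG standardized after this page was drafted, the helper `conversion_provenance` is hypothetical, and all `example.org` IRIs are placeholders.

```python
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def conversion_provenance(source, output, activity, agent, started, ended):
    """Return N-Triples lines describing one conversion run (hypothetical helper)."""
    def stamp(moment):
        # Typed literal, e.g. "2011-12-01T09:00:00+00:00"^^xsd:dateTime
        return '"' + moment.isoformat() + '"^^' + iri(XSD_DATETIME)

    triples = [
        (iri(source), iri(RDF_TYPE), iri(PROV + "Entity")),
        (iri(output), iri(RDF_TYPE), iri(PROV + "Entity")),
        (iri(activity), iri(RDF_TYPE), iri(PROV + "Activity")),
        (iri(activity), iri(PROV + "used"), iri(source)),
        (iri(activity), iri(PROV + "wasAssociatedWith"), iri(agent)),
        (iri(activity), iri(PROV + "startedAtTime"), stamp(started)),
        (iri(activity), iri(PROV + "endedAtTime"), stamp(ended)),
        (iri(output), iri(PROV + "wasGeneratedBy"), iri(activity)),
        (iri(output), iri(PROV + "wasDerivedFrom"), iri(source)),
    ]
    return [" ".join(triple) + " ." for triple in triples]


# Placeholder IRIs: a raw CSV converted to Linked Data by one run of a converter.
lines = conversion_provenance(
    source="http://example.org/raw/budget.csv",
    output="http://example.org/data/budget.ttl",
    activity="http://example.org/run/1",
    agent="http://example.org/agent/converter",
    started=datetime(2011, 12, 1, 9, 0, tzinfo=timezone.utc),
    ended=datetime(2011, 12, 1, 9, 5, tzinfo=timezone.utc),
)
print("\n".join(lines))
```

Recording the source entity, the activity and the responsible agent at conversion time is the cheapest point to capture them; pre-existing provenance from the source could be attached to the same source IRI without changing this structure.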

Possible organization of use cases (Adapted from Trust and Linked Data):

  • Simple "Oh Yeah?" scenario
    • User retrieves a dataset, then clicks an “Oh yeah?” button, and the site returns a provenance record
  • Licensing scenario
    • User retrieves a dataset, then wants to check whether they have permission to use it
  • Referral scenario
    • Site answers queries about provenance by pointing to another site’s provenance facilities
  • Repeated queries scenario
    • Service repeatedly queries a site and wants provenance for all the answers
    • This is similar to the PROV WG example, in which a user follows a provenance record, asking follow-up questions based on previous answers
  • Versioning scenario
    • User retrieves a dataset, then wants to see its provenance, but the dataset (and with it the provenance) has since been updated on the original site
  • Dynamic scenario
    • User retrieves a resource that is dynamically created
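
The simple, licensing, referral and versioning scenarios above can be sketched with a small in-memory stand-in for a site's provenance facility. This is only an illustration: the class `ProvenanceRegistry`, its method names, the shape of the returned dictionaries and the integer version numbers are all assumptions, and the dynamic scenario is not covered.

```python
class ProvenanceRegistry:
    """In-memory stand-in for a site's provenance facility (hypothetical)."""

    def __init__(self):
        self._records = {}    # dataset IRI -> {version number: provenance record}
        self._referrals = {}  # dataset IRI -> IRI of another site's facility

    def publish(self, dataset, version, record):
        """Store the provenance record for one version of a dataset."""
        self._records.setdefault(dataset, {})[version] = record

    def refer(self, dataset, facility):
        """Delegate provenance questions about a dataset to another site."""
        self._referrals[dataset] = facility

    def oh_yeah(self, dataset, version=None):
        """Answer "how do I know I can trust this?" for one dataset."""
        if dataset in self._referrals:
            return {"status": "referral", "see": self._referrals[dataset]}
        versions = self._records.get(dataset)
        if not versions:
            return {"status": "unknown"}
        if version is None:
            version = max(versions)  # default to the latest version
        if version not in versions:
            return {"status": "unknown"}
        return {"status": "ok", "version": version, "record": versions[version]}


# Placeholder IRIs and record contents, for illustration only.
registry = ProvenanceRegistry()
budget = "http://example.org/data/budget"
registry.publish(budget, 1, {"source": "http://example.org/raw/budget-2010.csv",
                             "license": "http://example.org/licenses/open"})
registry.publish(budget, 2, {"source": "http://example.org/raw/budget-2011.csv",
                             "license": "http://example.org/licenses/open"})
registry.refer("http://example.org/data/census",
               "http://stats.example.org/provenance")

latest = registry.oh_yeah(budget)               # simple scenario
earlier = registry.oh_yeah(budget, version=1)   # versioning scenario
license_iri = latest["record"]["license"]       # licensing scenario
referred = registry.oh_yeah("http://example.org/data/census")  # referral scenario
```

Keeping one record per version lets a user who retrieved an earlier copy still ask about the copy they actually hold, even after the site has updated the dataset. A service in the repeated queries scenario would simply call `oh_yeah` once per answer it receives.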

"Worked" examples of provenance in GLD

Provide examples here...

Background or Related Work