

From Provenance WG Wiki

Provenance Example

At the May 05 2011 telecon, we agreed on an example-based process to discuss charter concepts. This document presents an example with which we can illustrate all concepts. This page replaces a deprecated version that was too "technology specific".

Data Journalism Example


An online newspaper publishes a story making use of RDF data (GovData) provided through a government portal in England. The government portal not only makes the data available but also publishes how the data was generated. The newspaper publishes a story with an incidence map and associated chart based on GovData along with a photo, supplied by a freelancer, illustrating an impacted group. The byline of the story names both the author of the story and the creator of the chart. To be transparent, the newspaper publishes a document describing the provenance of the chart, including where it got the data from and what tools and assumptions it used to create the chart. This document also links to the creator of the chart. Importantly, because the GovData is in the public domain, the newspaper retains the copyright to its chart but does not own the underlying data.

A blogger looking at the chart spots what he thinks is an error. Having retrieved the provenance, he is able to trace the error back not to the newspaper's processing but to an error in how the government translated the data into RDF. However, that error had already been spotted and fixed by the government portal. The blogger is able to publish a new chart that is correct and gives it an open license. Thus, when searching for information on the story, a user can find which figure is based on newer data.

Processing steps

  • government (gov) converts data (d1) to RDF (f1) at time (t1)
  • government (gov) generates provenance information (prov) regarding RDF (f1)
  • government (gov) publishes RDF data (f1) along with its provenance (prov) on a portal with a license (li1); the RDF data is now available as a Web resource (r1)
  • analyst (alice) downloads a Turtle serialization (lcp1) of the resource (r1) from the government portal
  • analyst (alice) generates a chart (c1) from the turtle (lcp1) using some software (tools1) with statistical assumptions (stats1)
  • newspaper (news) obtains image (img1) from freelancer, Carlos.
  • newspaper (news) publishes the incidence map (map1), chart (c1) and the image (img1) within a document (art1) written by (joe) using license (li2)
  • government (gov) publishes an update (d2) of data (d1) as a new Web resource (r2)
  • blogger (bob) downloads a Turtle serialization (lcp2) of the resource (r2) from the government portal and determines that it is a different version of the same data
  • blogger (bob) generates new chart (c2) based on the data (lcp2) using some software (tools2) with statistical assumptions (stats2)
  • blogger (bob) publishes the chart (c2) under an open license (li3).
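
The steps above can be written down as plain records, which makes it easier to see which agent produced which item and from what. The following is only a minimal sketch in Python, not a proposed model: the `Step` class, its field names, the action labels, and the helper functions are all hypothetical. Splitting the government's update into an "update" step and a "publish" step is also an assumption made for illustration, as is leaving out bob's final publishing step because it produces no new identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One processing step (hypothetical structure, for illustration only)."""
    agent: str
    action: str
    inputs: tuple = ()
    outputs: tuple = ()


# Identifiers (d1, f1, r1, ...) are taken from the list of processing steps.
STEPS = [
    Step("gov", "convert", inputs=("d1",), outputs=("f1",)),
    Step("gov", "describe", inputs=("f1",), outputs=("prov",)),
    Step("gov", "publish", inputs=("f1", "prov", "li1"), outputs=("r1",)),
    Step("alice", "download", inputs=("r1",), outputs=("lcp1",)),
    Step("alice", "chart", inputs=("lcp1", "tools1", "stats1"), outputs=("c1",)),
    # img1 is supplied by the freelancer; the newspaper only obtains it.
    Step("news", "obtain", outputs=("img1",)),
    # art1 is written by joe and published by the newspaper under li2.
    Step("news", "publish", inputs=("map1", "c1", "img1", "li2"), outputs=("art1",)),
    # Assumption: the update is modelled as deriving d2 from d1, then publishing d2.
    Step("gov", "update", inputs=("d1",), outputs=("d2",)),
    Step("gov", "publish", inputs=("d2",), outputs=("r2",)),
    Step("bob", "download", inputs=("r2",), outputs=("lcp2",)),
    Step("bob", "chart", inputs=("lcp2", "tools2", "stats2"), outputs=("c2",)),
]


def step_producing(entity):
    """Return the first step that lists `entity` among its outputs, or None."""
    for step in STEPS:
        if entity in step.outputs:
            return step
    return None


def direct_inputs(entity):
    """Return the inputs of the step that produced `entity` (empty if unknown)."""
    step = step_producing(entity)
    return step.inputs if step else ()
```

With these records, `step_producing("c1").agent` identifies alice as the producer of the first chart, and `direct_inputs("c2")` lists the data, software and assumptions bob used.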

Provenance Questions

We list here questions whose answers draw on provenance information.

  • Is chart c2 dependent on original data set d1?
  • Does chart c1 constitute derivative work of d1?
  • Is license li2 compatible with the terms of li1?
  • What software tools were used to produce charts c1 and c2?
  • Which chart uses the most up-to-date data?
  • Who authored the chart included in article art1?
  • ... more queries to be phrased
  • What was the role of the blogger (Bob) in the creation of the chart?
  • Is the blogger related to the creation of other charts? (is he an expert?)
  • What are the main differences between both charts?
  • Does the creation process step of c2 have any annotations that justify it? (Why does a second chart exist with the "same" data?)
  • Who is the rights holder for c2? Can I reuse it in another article?
  • Is the process followed to generate c1 similar to the one used to generate c2?
  • What are the references used for creating art1? (images, videos, charts, etc.)
  • Do d1 or f1 refer to a specific location?
  • Have d1 or f1 been created in the same location that they are referring to?
  • What has been corrected in c2 and what caused the error?
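
Questions such as "Is chart c2 dependent on original data set d1?" amount to following derivation links backwards from one item to another. The sketch below illustrates this in Python under stated assumptions: the `DERIVED_FROM` table and the function names are hypothetical, only data items are tracked (tools, licenses and statistical assumptions are left out), and the link from d2 to d1 assumes that an "update" counts as a derivation. Whether a dependency also makes something a derivative work in the legal sense is a separate question that this sketch does not address.

```python
# Direct "was derived from" links read off the processing steps.
DERIVED_FROM = {
    "f1": {"d1"},
    "r1": {"f1"},
    "lcp1": {"r1"},
    "c1": {"lcp1"},
    "art1": {"map1", "c1", "img1"},
    "d2": {"d1"},  # assumption: the update d2 is treated as derived from d1
    "r2": {"d2"},
    "lcp2": {"r2"},
    "c2": {"lcp2"},
}


def ancestors(entity, edges=DERIVED_FROM):
    """Return every item reachable from `entity` by following derivation links."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for parent in edges.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depends_on(entity, source, edges=DERIVED_FROM):
    """True if `source` is a direct or indirect ancestor of `entity`."""
    return source in ancestors(entity, edges)


# Variant of the table without the assumed d2 -> d1 link.
NO_UPDATE_LINK = {k: v for k, v in DERIVED_FROM.items() if k != "d2"}
```

Under the stated assumption, `depends_on("c2", "d1")` holds; with `NO_UPDATE_LINK` passed as `edges` it does not, which shows that the answer to the first question depends on how the update from d1 to d2 is modelled.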

Example Variants

In this section, some variants of the example are discussed.


  • Why is there a new version d2 of the data? If there was only something wrong with the RDF generation process, r2 is still based on d1, right? --Stian
  • Are d2 and d1 examples of invariant views over the data d? --Stian
  • The example should include a realisation of "IVP of" by having some mutable things and different perspectives on them. --Stian