Interview: Paul Groth and Luc Moreau on Provenance

Paul Groth (VU University Amsterdam) and Luc Moreau (University of Southampton) co-chair the W3C Provenance Working Group. The group has just published 12 documents to support the widespread publication and use of provenance information for Web documents, data, and resources. I spoke with them to find out what their group’s work, called PROV, will enable.

IJ: What are the main use cases that your Working Group was trying to address?

Paul: Though we studied a number of use cases, I would say there are three main scenarios. The first is attribution. People frequently quote or copy/paste on the Web. Remixing is good, and content creators often want due credit. The second is about aggregation and integration. Provenance information helps people judge the quality of information. Lately I’ve been talking about “fair trade” data. One example: TechCrunch ran an article in which they said that Google was going to buy a particular company. It turns out the information had incorrectly found its way into a PR database. The third scenario is compliance, an important enterprise use case. You have a contract with someone and want to be able to prove that you performed according to specification.

Luc: Provenance information is an enabler of services that add value to data. For example, in the past I have spoken about creating a search engine for provenance. The search engine would give you provenance information for anything: products, information, whatever. So you might find yourself in a store with your mobile phone, scan a bar code, and retrieve the provenance. People would build services on top of the provenance information, for example ratings for products that don’t rely on child labor.

IJ: Right, those sorts of services foster trust over the network.

Paul: Provenance is a foundation for trust judgments, but it is not all that one would need. We think PROV does provide that foundation.

IJ: What are some examples that people have built?

Luc: As part of our progression to Recommendation we catalogued 66 applications. Some are interesting academic examples, others very practical. For me, one that stands out is NASA’s use of PROV to provenance-enable the National Climate Assessment report. They are currently working on a prototype and I believe they will launch by the end of the year. NASA is also about to launch a satellite mission where data and processing are accompanied by provenance information.

Paul: One of the reasons NASA is interested in PROV is that they have data coming from multiple different systems and they need a common standard for provenance information that works across those systems.

IJ: How interoperable is PROV today?

Luc: There were at least 10 related vocabularies out there, some more popular than others. PROV has brought those communities plus others to the table to agree on a common core.

IJ: And how is the market responding to the new interop?

Paul: People wanted to know what to use, and now they are moving toward PROV. They want to support the standard because it lets them move on to other topics like capturing and analyzing provenance information.

Luc: Most of the people working on older provenance models are either moving to PROV or extending PROV with features they like but that are not part of the new standard.

IJ: What challenges do people face when trying to assemble provenance data?

Luc: Sometimes you have to reconstruct provenance information because it wasn’t recorded at the right time. This can be very tedious.

IJ: Can you indicate the reliability of your provenance data using PROV?

Paul: You can, through annotations, but we did not standardize a single weighting system for this. There will be a variety of systems for describing reliability.

IJ: What’s the “hello world” example for PROV?

Paul: I wrote up some examples in a blog post in 2011. For example, in these two statements I say that a blog post was attributed to Paul (who is a person, not, say, a company):

  ex:post prov:wasAttributedTo ex:Paul.
  ex:Paul a foaf:Person.
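
As a rough sketch of what those two statements expand to, the Python below resolves the prefixed names into full IRIs and prints them as N-Triples. The prov:, foaf:, and rdf: namespace IRIs are the published ones. The ex: namespace (http://example.org/) and the helper names expand and to_ntriples are assumptions made for illustration and do not come from Paul's blog post.

```python
# Namespace IRIs. "ex" is a placeholder namespace assumed for this sketch.
PREFIXES = {
    "ex": "http://example.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(name):
    """Turn a prefixed name such as 'prov:wasAttributedTo' into a full IRI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriples(triples):
    """Serialize (subject, predicate, object) prefixed names as N-Triples lines."""
    return [
        "<%s> <%s> <%s> ." % tuple(expand(term) for term in triple)
        for triple in triples
    ]


# The two statements from the example. Turtle's "a" is shorthand for rdf:type.
TRIPLES = [
    ("ex:post", "prov:wasAttributedTo", "ex:Paul"),
    ("ex:Paul", "rdf:type", "foaf:Person"),
]

if __name__ == "__main__":
    for line in to_ntriples(TRIPLES):
        print(line)
```

Spelling out the full IRIs makes it visible that the attribution statement uses a PROV term while the typing statement reuses FOAF.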

IJ: Those are RDF triples. How heavily does PROV rely on using RDF?

Luc: The PROV model is independent of any particular data system. Provenance information is inherently cross-system: your data flows from system to system, and so you need independence from all those systems. You can serialize PROV data with RDF+OWL, or in XML, and we’re working on other serializations as well so that developers can use their favorite data serialization.
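
To illustrate that independence, here is a minimal Python sketch that keeps one attribution as a neutral record and renders it in two notations: a Turtle statement and a PROV-N expression. The Attribution class and the two rendering functions are hypothetical names invented for this sketch, and the output omits the prefix declarations that a complete Turtle or PROV-N document would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """A serialization-neutral record: an entity attributed to an agent."""
    entity: str  # prefixed name, e.g. "ex:post"
    agent: str   # prefixed name, e.g. "ex:Paul"


def to_turtle(attribution):
    """Render the record as a single Turtle statement (prefixes not declared)."""
    return f"{attribution.entity} prov:wasAttributedTo {attribution.agent} ."


def to_provn(attribution):
    """Render the same record as a PROV-N expression (prefixes not declared)."""
    return f"wasAttributedTo({attribution.entity}, {attribution.agent})"


if __name__ == "__main__":
    record = Attribution(entity="ex:post", agent="ex:Paul")
    print(to_turtle(record))
    print(to_provn(record))
```

The record itself says nothing about RDF or any other syntax. Each serialization is a separate view over the same statement.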

Paul: RDF does give you super easy data integration.

Luc: The British Gazette is another example of an organization that will be using PROV, as well as other Semantic Web technology.

IJ: What do you think should happen next in the provenance space?

Paul: PROV provides a foundation for a vibrant provenance community. I anticipate a lot more implementation of PROV and an explosion in this space as there is a real demand for ways to be transparent. I hope to see a lot more tools emerge to support this.

Paul: There is also vocabulary work going on at W3C that builds on PROV (such as the “organization” ontology).

Luc: At the beginning Paul thought we would have more requirements for specific types of properties like revision and derivation. I think there are additional vocabularies that communities will want, and they will work on specialization of PROV for their needs. At some point those communities may decide to standardize on those vocabularies.

Luc: Paul and I are writing a book on PROV… look for that as well!

IJ: Thank you both for your time!