This wiki has been archived and is now read-only.

State of the Art Report

From XG Provenance Wiki

This document is a report on the state of the art of provenance research and practice.

About this document

Source: W3C Provenance Incubator Group

Authors: Paul Groth, James Cheney, Simon Miles, James Myers, Yolanda Gil

Release Date: October 20, 2010.

Description: This document is a report on the state of the art of provenance research and practice in the view of the W3C Provenance Incubator Group. The group analyzed several scenarios for provenance on the Web highlighting user and technical requirements for each scenario and the relevant current approaches and research. As a result, the group identified a number of technology gaps that need to be addressed to accomplish the original scenarios.



The document is organized around three scenarios described in a Requirements Report released earlier.

We carried out a detailed analysis for each of them following guidelines that included: 1) a detailed analysis of user and technical requirements, 2) an overview of current approaches and research relevant to the scenario, and 3) a discussion of technology gaps.

This document summarizes the findings, but more details can be found in the group's in-depth analysis of the News Aggregator scenario, in-depth analysis of the Disease Outbreak scenario, and in-depth analysis of the Business Contract scenario.

We also took into account the results of a number of prior activities of the group.

Analysis of News Aggregator Scenario

The group carried out a detailed in-depth analysis of the News Aggregator scenario. The provenance issues highlighted in that scenario include:

  • checking licenses when reusing content
  • verifying that a document/resource complies with licensing policies of pieces it reused
  • integrating unstructured content, i.e. documents and media (in contrast with integrating structured data)
  • content aggregation: aggregating RSS feeds, or product information, or news (in the case of the scenario)
  • checking authority
  • recency of information
  • verification of original sources
  • conveying to an end user the derivation of a source of information
  • versioning and evolution of unstructured content, whether documents or other media (e.g. images)
  • tracking use/reuse of content
  • scalable provenance management

Relevant State of the Art for the News Aggregator Scenario

This section describes the research conducted in relation to, and the technology available to fulfill, the above described requirements with specific regard to the News Aggregator Scenario.

Here is a summary of related work brought up in the original use cases.

Existing Solutions Used Today for News Aggregation

This section summarizes research, with referenced papers, that considers issues similar to the scenario-specific user requirements above or is applied in a similar domain.

The News Aggregator scenario envisions a system that can automatically tell where a piece of content (i.e. an object) on the Web comes from and who is responsible for that content, even after the content has been aggregated.

Aggregation Today

Content aggregation is widely used on the web. Examples of content aggregation for news include sites such as The Huffington Post, Digg, and Google News. Personal aggregation is facilitated by feed technologies (RSS, Atom) and their associated readers (e.g. Google Reader). Newer aggregators like Flipboard provide a merged view of content, thus hiding some of the provenance of the information to increase visual appeal. This is similar to what is envisioned in the News Aggregator scenario.

Tracking Content

A number of systems have looked at tracking content, in particular, quotes across the web. Memetracker (Leskovec2009), for example, provides a system and algorithms for tracking distinctive phrases through the blogosphere. Memetracker is able to reconstruct the news cycle by tracking both news outlets and blogs (Leskovec2009). (Gomez-Rodriguez2010) expands on this work to track how information is propagated through the network and thus which blogs and media have the greatest influence. Similarly, (Cha2010) studies influence in the microblogging network Twitter, including the difference between the influence of content and the influence of users. (Lerman2010) studies how the social networks of both Twitter and Digg impact the propagation of information through these networks.

Most of these systems rely on crawls of the web that are produced uniquely for each application. However, there are tools and services for producing such crawls. For example, BlogTracker is a tool for periodically crawling and then analyzing blogs (Agarwal2009). On a commercial scale, Spinn3r (http://www.spinn3r.com/) provides realtime access to a large crawl of both blog and microblog data.

Need for Explicit Provenance

It is important to note that these systems deduce the provenance of an object (e.g. a piece of text) after the fact from crawled data. However, determining provenance after the fact is often difficult. For example, during the Iranian Green Revolution protests it was extremely difficult to determine the actual origins of tweets about what was happening on the ground (Brinkley2009). Because of the difficulty of determining where tweets came from, in late 2009 Twitter launched its own service to explicitly capture the notion of retweeting (Williams2009). (Lynch2001) provides an analysis of the difficulty in determining provenance for accountability and trust from crawled documents. Another widely used example of explicit tracking of the origin of web data is trackbacks [1] and pingbacks [2] for blogs. These are used to notify a blog when another blog has linked to it. We note that these systems are isolated to specific technology platforms and do not encompass the whole of the web.
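To make the pingback mechanism concrete: the Pingback specification defines a single XML-RPC method, pingback.ping(sourceURI, targetURI). The sketch below uses Python's standard library to build the request body a client would POST to the target blog's pingback endpoint; the URLs are hypothetical placeholders and no request is actually sent.

```python
import xmlrpc.client

# The Pingback specification defines one XML-RPC method:
# pingback.ping(sourceURI, targetURI). Both URLs below are
# hypothetical placeholders.
source = "http://example.org/blog/my-post"      # page containing the link
target = "http://example.net/blog/linked-post"  # page being linked to

# Build the XML-RPC request body a client would POST to the
# pingback endpoint advertised by the target site.
payload = xmlrpc.client.dumps((source, target), methodname="pingback.ping")
print(payload)
```

The payload is ordinary XML, which is one reason pingbacks could be automated where trackbacks required manual effort.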


A crucial reason for tracking provenance in the News Aggregator Scenario is the ability to determine if an image (or other content) can be reused. In the area of music, Creative Commons hosts the site ccMixter [CreativeCommons2010], which allows musicians to track the licenses of different music clips across sites. In a similar vein, the Google Books Rights Registry [Wikipedia2010] tracks and maintains the licensing rights for books especially with respect to online publishing of those books. Fundamental technology related to this is the digital representation of licenses. An example of this is the representation of Creative Commons licenses as RDFa [3].
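The simplest marker of a Creative Commons license in RDFa-annotated markup is a link carrying rel="license". The sketch below shows how an aggregator might discover such license links using only Python's standard library; the sample page markup is illustrative.

```python
from html.parser import HTMLParser

# A minimal sketch of discovering machine-readable license links.
# Creative Commons license notices include a link with rel="license";
# the sample markup below is invented for the example.
class LicenseFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("rel") == "license":
            self.licenses.append(a.get("href"))

page = ('<p>Photo by A. Author, licensed under '
        '<a rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>.</p>')

finder = LicenseFinder()
finder.feed(page)
print(finder.licenses)  # license URIs found in the page
```

A full RDFa processor would also recover the license's structured properties (permissions, requirements), but even this shallow pattern lets an aggregator check reuse terms automatically.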

In summary, there are a number of systems on the web today that resemble BlogAgg; however, the technologies that underpin these systems, in particular with respect to provenance, are divergent.

Current Provenance Research Relevant to the News Aggregator Scenario

This is only a partial overview of the related work in provenance having to do with the News Aggregator Scenario. The group tagged over 30 papers as relevant to this scenario. We note that much of the work, while provenance related, is not considered provenance research per se and is thus discussed in the prior section about current approaches to solving the scenario. We discuss the work in terms of the provenance dimensions. While the scenario touches on many of the dimensions, not all have specific research addressing the issues in the context of social media.



A key part of the News Aggregator Scenario is to be able to point to a piece of media to determine its provenance. This is typically done through the use of URLs. However, systems also need to be able to point to parts of media, for example a scene in a video clip or part of a movie. The Multimedia Semantics Incubator Group has a detailed discussion of describing multimedia and thus how to identify its constituent parts (Hausenblas 2007).


Attribution is a key part of the scenario. It's important to know who or what asserted a statement. Dublin Core is a widely used vocabulary for expressing attribution information. Deeply tied to attribution in social web scenarios is the notion of identity. For an overview of identity technologies, research, and implications for social media, we refer readers to the Social Web Incubator Group's final report and its section on identity (Halpin, Tuffield 2010). Another crucial aspect is that attribution may not be explicitly recorded or may be falsely claimed. Lynch 2001 discusses this problem in depth and Juola 2006 suggests a mechanism for inferring attribution based on language.
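As a minimal sketch of Dublin Core attribution, the following builds an RDF/XML description using the real dc:creator, dc:source, and dc:date elements from the Dublin Core element set; the resource URL, names, and date are hypothetical.

```python
import xml.etree.ElementTree as ET

# Real namespaces for Dublin Core elements and RDF syntax.
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("dc", DC)
ET.register_namespace("rdf", RDF)

# Describe a (hypothetical) photo with attribution metadata.
root = ET.Element("{%s}RDF" % RDF)
desc = ET.SubElement(root, "{%s}Description" % RDF,
                     {"{%s}about" % RDF: "http://example.org/photo/42"})
ET.SubElement(desc, "{%s}creator" % DC).text = "A. Photographer"
ET.SubElement(desc, "{%s}source" % DC).text = "http://example.org/original"
ET.SubElement(desc, "{%s}date" % DC).text = "2010-10-20"

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Because the vocabulary is so widely deployed, even this small amount of structure lets aggregators harvest "who made this, and from what" uniformly across sites.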


Most of the work in capturing the process of how a data item was produced has been done in the context of workflow systems and databases. This is discussed in more depth in the other scenarios. Most of the work with respect to provenance in social media reconstructs the provenance after the fact from large crawls (see Lerman 2010). Groth 2010 presents a model designed for representing processes in web-based mash-ups. Buneman et al. discuss a model for provenance of manually modified data such as web pages.

Versioning and Evolution

A particular aspect of social media is tracking changes in web documents. Wikipedia edit histories have been widely studied. A good starting point is Lieberman's paper on locating users through their edit histories.



Publication of provenance on the web has largely been discussed with respect to the semantic web and linked data. Hartig and Zhao discuss one approach to publishing provenance on the web of data, identifying important points around how information access should be published as part of provenance. Reid et al. discuss how provenance in a research environment can be published and linked to social media. Bao et al. discuss tracking changes in the MediaWiki platform and publishing those changes using semantic web technologies.



By far the largest area of work with respect to the News Aggregator Scenario is with respect to trust. Artz and Gil provide a comprehensive survey of trust and the semantic web. IWTrust uses provenance to improve users' trust in the results of a question-answering system that works over web documents. In a similar vein, the Trellis system derives trust ratings of web content using the judgements of many different users. Bizer discusses in detail information accountability with respect to web based information systems and discusses implementations with respect to the semantic web.

Gap Analysis for the News Aggregator Scenario

Given the analysis of the state of the art above, what are the technology gaps to achieve what the scenario proposes?

First, existing provenance solutions only address a small portion of the scenario and are not interlinked or accessible among one another. For each step within the News Aggregator scenario, there are existing technologies or relevant research that could solve that step. For example, one can properly insert licensing information into a photo using a Creative Commons license and the Extensible Metadata Platform. One can track the origin of tweets either through retweets or using extraction technologies within Twitter. However, the problem is that across multiple sites there is no common format and API to access and understand provenance information, whether it is explicitly indicated or implicitly determined. To inquire about retweets or about trackbacks one needs to use different APIs and understand different formats. Furthermore, there is no (widely deployed) mechanism to point to provenance information on another site. For example, once a tweet is traced back to the boundary of Twitter, there is no way to follow where that tweet originally came from.

Second, system developers rarely include provenance management or publish provenance records. Systems largely do not document the software by which changes were made to data and what those pieces of software did to the data. However, there are existing technologies that allow this to be done, for example to document the transformations of images. There are also general provenance models that would allow this to be expressed, but they are not currently widely deployed. There are no widely accepted architectural solutions to managing the scale of provenance records, as they may be significantly larger than the base information itself and may also evolve over time.

Third, while many sites provide for identity and there are several widely deployed standards for identity (e.g. OpenID), there are no existing mechanisms for tying identity to objects or provenance traces. This is a fundamental open problem in the web, and affects provenance solutions in that provenance records must be attached to the object(s) they describe.

Finally, although there have been proposals for how to use provenance to make trust judgments on open information sources, there are no broadly accepted methodologies to automatically derive trust from provenance records. Another issue that has been largely unaddressed is the incompleteness of provenance records and the potential for errors and inconsistencies in a widely distributed and open setting such as the web.

Analysis of Disease Outbreak Scenario

The group carried out a detailed in-depth analysis of the Disease Outbreak scenario, which is summarized here.

This scenario covers aspects of provenance involving scientific data and public policy, at a number of levels of detail ranging from working scientists and data specialists who need to understand the detailed provenance of the data they work with, to high-level summaries and justifications used by non-expert decision makers.

In this scenario, there are several distinctive uses of provenance:

  • data integration: combining structured and unstructured data, data from different sources, linked data
  • archiving: understanding how data sources evolve over time through provenance and versioning information
  • justification: summarizing provenance records and other supporting evidence for high-level decision making
  • reuse: using data or analytic products published by others in a new context
  • repeatability: using provenance to rerun prior analyses with new data

Relevant State of the Art for the Disease Outbreak Scenario

Here we describe the research conducted in relation to, and the technology available to fulfill, the above described requirements with specific regard to the Disease Outbreak Scenario.

Existing Solutions Used Today for Disease Outbreak Scenario

Data provenance

Today, most scientific or biomedical data sharing is labor-intensive, and any provenance records are either produced by hand (i.e. by scientists filling in data entry forms on submission to a data archive or repository), produced by ad hoc applications developed specifically for a given kind of data, or not produced at all.

Within curated biological databases (including large examples such as UniProt, Ensembl, or the Gene Ontology, with tens or hundreds of curators, as well as many smaller databases), provenance is often recorded manually in the form of human-readable change logs. Some scientists are starting to use wikis to build community databases.

There has been a stream of research on provenance in databases but so far little of it has been transferred into practice, in part because most such proposals are difficult to implement, and often involve making changes to the core database system and query language behavior. Such techniques have not yet matured to the point where they are supported by existing database systems or vendors, which limits their accessibility to non-researchers.

Workflow provenance

On the other hand, a number of workflow management systems (e.g. Taverna, Pegasus) and Semantic Web systems (Inference Web) have been developed and are in active use by communities of scientists. Many of these systems implement some form of provenance tracking internally, and have begun to standardize on one of a few common representations for the provenance data, such as OPM or PML. Moreover, research on storing and querying such data effectively (e.g. ZOOM) can more easily be transferred to practice since it relies only on standard systems.

Provenance as justification for public policy

Justification information, describing how the data has been processed from raw observations to processed results that support scientific conclusions, is legally required in some scientific settings. Some researchers in social sciences are developing techniques directly addressing these problems through computer systems, but most such justification information is created and maintained by user effort.

Current Provenance Research Relevant to the Disease Outbreak Scenario

This section is a partial overview of representative work in this area. The bibliography contains over 50 papers tagged as relevant to this scenario. We focus on surveys and key papers that introduced new ideas or concluded a line of research, not incremental steps that, while also important, are of less interest to readers outside the provenance research community.



There appear to be two basic approaches to describing the objects to which provenance records refer: a "coarse-grained" approach which assigns atomic identifiers (e.g. URLs) to data objects (with no containment relationships) and a "fine-grained" approach in which objects may be at multiple levels in a containment hierarchy, and identifiers explicitly reflect this. The Open Provenance Model (Moreau et al. 2010) and most of the provenance vocabularies studied in the Provenance Vocabulary Mappings report adopt the coarse-grained model, while much work on provenance in databases (Buneman 2006) (and increasingly some in workflows and other settings) adopts the fine-grained model.

The difference between coarse- and fine-grained identification of objects is not really a difference in kind, but one of emphasis: coarse-grained object identifiers can of course be used to simulate fine-grained collections; the issue is really whether the identifiers have a standard, transparent structure, or are ad hoc. Thus, the two models can probably be reconciled.
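The contrast between the two identification styles, and the simulation of one by the other, can be sketched in a few lines. The URL and the path scheme below are illustrative inventions, not a standard.

```python
# Coarse-grained: one opaque identifier for a whole dataset.
coarse_id = "http://example.org/dataset/outbreak-2010"

# Fine-grained: the identifier exposes a transparent containment
# path inside the resource (here: table / row / column), so a
# provenance record can refer to one value. The "#a/b/c" scheme
# is an illustrative choice, not a standard.
def fine_id(base, *path):
    """Append a containment path to a coarse identifier."""
    return base + "#" + "/".join(path)

cell = fine_id(coarse_id, "cases", "row-117", "onset_date")
print(cell)
```

This is exactly the sense in which the two models can be reconciled: a coarse-grained system can simulate fine-grained identification by minting one structured identifier per component, provided the path structure is agreed and transparent.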


Attribution is important for making data citable and helping ensure that creators or cleaners of data receive credit to offset the extra effort required to publish data. URLs do not necessarily provide adequate attribution since they are not considered stable citations. Initiatives such as PURL and DOI have addressed this to some extent, but still only allow for references to whole Web resources, and the targets of URLs may change over time. There are some proposals for fine-grained citation of data in databases (Buneman 2006) and for versioning (Memento, Van de Sompel et al. 2010) that can begin to address this (see Versioning/Evolution below). Bleiholder and Naumann discuss the issue of attribution in the context of data fusion or data integration, where it is also important to be able to track data to its source in order to understand errors or anomalies.


Much of the focus of work on provenance in workflows and distributed systems has been recording processing steps, the precise parameters used, and any other metadata considered important for ensuring repeatability of electronic "experiments". There is a great deal of work in this area, particularly in scientific workflows, high-performance computing and eScience (Simmhan et al. 2005, Sahoo et al. 2008, Bose et al. 2005), including many early systems that initially adopted ad hoc formats and strategies for recording provenance. Scientific workflow systems such as Taverna (Oinn et al. 2004) and Wings/Pegasus (Kim et al. 2008) showed how the use of a workflow structure could provide a basis to document the provenance of new data products. The Open Provenance Model, Proof Markup Language, and other provenance vocabularies represent progress towards a uniform representation for this kind of process provenance. However, although these standards specify a common data format that can be used to represent a wide variety of kinds of processes, they can still represent the same process in many different ways. Thus, further work needs to be done on top of these efforts (possibly by domain experts) to ensure that provenance produced by different systems is compatible and coherent. Another important, and ill-understood, problem is how to relate the provenance information that is recorded with the "actual" behavior of a system; that is, how to ensure that such provenance is "correct" in some sense besides the trivial fact that it is what the implementation happened to record.
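As a toy illustration of this style of process provenance, the sketch below represents artifacts and processes as graph nodes connected by OPM-like "used" and "wasGeneratedBy" edges (the node names are invented, and this is not OPM's actual serialization), then walks the edges backwards to recover the artifacts a result was derived from.

```python
# An OPM-style provenance graph: (subject, relation, object) edges.
# Node names are invented for the example.
edges = [
    ("process:align",    "used",           "artifact:raw_reads"),
    ("artifact:aligned", "wasGeneratedBy", "process:align"),
    ("process:report",   "used",           "artifact:aligned"),
    ("artifact:report",  "wasGeneratedBy", "process:report"),
]

def derived_from(artifact, edges):
    """Walk generation and use edges back to all upstream artifacts."""
    sources = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        for s, rel, o in edges:
            if rel == "wasGeneratedBy" and s == node:
                stack.append(o)      # follow to the generating process
            elif rel == "used" and s == node:
                stack.append(o)      # follow to an input artifact
                sources.add(o)
    return {n for n in sources if n.startswith("artifact:")}

print(sorted(derived_from("artifact:report", edges)))
```

The "many representations of the same process" problem mentioned above shows up even here: the same pipeline could equally be recorded as one coarse process node or as many fine ones, and the two graphs would not match without agreed conventions.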


Provenance is relevant to versioning and evolution of Web resources and data, in several ways. First, the version history of a resource itself can be considered part of its provenance (often an important part). Second, provenance records are often about data that is static in the short term during which results are produced, but changes in the long term. Third, descriptions of the change history of an object can be used to reconcile differences between replicated data or propagate updates among databases that share data on the basis of conditional trust relationships.

Version history provenance is seldom recorded and there is no standard way of doing so. For example, curated databases in bioinformatics record provenance showing who has made changes to each part of the data (often by manual effort); Buneman et al. 2006 study techniques for automating this kind of provenance tracking. Efforts such as the Internet Archive to take "snapshots" of the Web for the benefit of posterity have attempted to provide a partial, coarse-grained historic record for the Web. The Memento project (Van de Sompel et al. 2010) is an effort to extend HTTP with a standard way to access past versions of a Web resource. Neither provides detailed provenance explaining how the different versions are related, but a versioning infrastructure such as Memento could add value to provenance by making it possible to retrieve contemporaneous versions of past data referenced within provenance records. The ORCHESTRA system (Green et al. 2007) provides update exchange among relational databases based on schema mappings and provenance, which can be used to undo updates if trust policies change.
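Memento's datetime negotiation is a small extension to HTTP: the client sends an Accept-Datetime header naming the time of interest, and a Memento-aware server redirects to the archived version closest to that time. The sketch below only constructs such a request with Python's standard library; the URL is a placeholder and no request is actually sent.

```python
import urllib.request

# Memento-style datetime negotiation: ask for the resource as it
# existed at a given moment via the Accept-Datetime header (an
# RFC 1123 date). The URL is a hypothetical placeholder; the
# request object is built but never sent.
req = urllib.request.Request("http://example.org/resource")
req.add_header("Accept-Datetime", "Thu, 15 Apr 2010 12:00:00 GMT")

# urllib normalizes header names to capitalized form internally.
print(req.get_header("Accept-datetime"))
```

For provenance, the value of such an infrastructure is that a provenance record stamped with a time can be dereferenced against the version of the data that actually existed then, not the current one.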


Many forms of provenance describe processes that can be characterized as logical inferences (e.g., database queries and queries over RDF data or ontologies can all be understood as deduction in various flavors of first-order or description logic). Thus, work on provenance has often focused on describing the entailment relationships linking assumptions to conclusions. Some early models of provenance, such as Cui et al. (2000)'s "lineage" and Buneman et al. 2001's "why-provenance", were indirectly based on the idea of a "witness", or a (small) subset of the facts in the database that suffice to guarantee the existence of a result fact. Green et al. (2007) generalize these ideas to a model that also allows distinguishing between different "proofs", in the form of algebraic expressions. Work on knowledge provenance and Proof Markup Language appears to be based on essentially the same idea, but makes the link between provenance and deductive proofs more explicit ([Pinheiro da Silva et al. 2003]), and permits greater reflection on the structure of such proofs.
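The "witness" idea can be shown on a toy relation. For a simple selection-projection query, each source tuple that yields an answer is, on its own, a witness for that answer; the table contents below are invented for the example.

```python
# A toy source relation: (id, city, disease). Data is made up.
reports = [
    (1, "Lyon",  "flu"),
    (2, "Lyon",  "flu"),
    (3, "Paris", "flu"),
]

# Query: which cities report flu? For this query, each matching
# source tuple is by itself a witness (a subset of the database
# sufficient to guarantee the answer).
witnesses = {}
for t in reports:
    if t[2] == "flu":
        witnesses.setdefault(t[1], []).append(t)

print(witnesses["Lyon"])   # two independent witnesses for "Lyon"
print(witnesses["Paris"])  # one witness for "Paris"
```

The distinction Green et al. draw is visible here: why-provenance records only the witness sets, whereas their algebraic "proofs" would additionally distinguish that "Lyon" is derivable in two different ways.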


An important goal for provenance in this scenario is to show the connection between evidence, processes used to reach decisions, and conclusions, first as a quality-control measure prior to publication or policy changes (Wong et al.), and second for review after the fact. Achieving this requires all of the previous components, and also requires developing an understanding of what it means for provenance to "justify" or "prove" a conclusion. In some settings, such as deduction over Web data (Pinheiro da Silva et al. 2008) it may be enough to provide the raw data and lightweight descriptions of processes used in enough detail that it can be rerun; in others, it might be required to document that the conclusions were tested in a number of different ways, or to identify all possible parts of the input on which a given part of the output "depends" (Cheney et al. 2007).


Publication, Access, and Dissemination

Hartig 2009 specifically discusses publication, accessibility and dissemination of Web data provenance, based on standard technologies such as RDF, and outlines some open questions, particularly the issues of ambiguity of provenance vocabularies and the absence of provenance information for most data on the Web. This suggests that standards for documenting the kind of provenance information exchanged in this scenario need to be developed and evaluated carefully to ensure they are unambiguous and easy to use, or that provenance can be generated and exchanged automatically.


Work on provenance in database and workflow systems has addressed issues of scale. Buneman et al. 2006 investigated querying and compaction for provenance records in curated databases. Chapman et al. 2008 investigate compression for workflow provenance records, which can dwarf the size of the underlying data. Heinis and Alonso 2008 study the problem of performing graph queries efficiently over typical workflow provenance graphs.



An important goal of provenance records in the disease outbreak scenario is to promote understanding, both between peers or specialists with detailed knowledge of different areas (e.g. epidemiologist Alice and biochemist Bob), and between nonspecialist decision-makers and experts. There is also a great deal of work on visualizing provenance as graphs (e.g. Cheung and Hunter 2006, among many others). Some of this work has focused on providing high-level "user views" of workflow provenance graphs, allowing users to select which parts of the graph they are interested in (Biton et al. 2008). Machine learning and inference techniques may be useful for automatically identifying important or influential steps in provenance graphs (Gomez-Rodriguez et al. 2010).


Interoperability issues involving provenance have mostly been addressed through the development of common data models for provenance records. These range from fields concerning authorship in metadata standards such as Dublin Core, to vocabularies for complex graph data structures to represent provenance of processes, such as PML (Pinheiro da Silva et al. 2006) or OPM (Moreau et al. 2010). These proposals typically describe (the provenance of) static data, whereas provenance management in this scenario also requires relating dynamic data, both over the short term (when changes may be very frequent) and long term (when data may be dormant and perhaps subject to degradation or preservation failure). This appears likely to raise a number of interoperability issues, including some (such as standard citation and versioning infrastructure) that are somewhat independent of provenance and may be of general interest.


Obviously, once we have provenance records in a common format, they can be compared using a variety of techniques (e.g. string, tree or graph variants of differencing algorithms). However, it is not clear how well existing off-the-shelf algorithms meet the needs of scientific users. Miles et al. 2006 and Bao et al. 2009 present initial steps in understanding the requirements and practical issues in comparing provenance.


In the Disease Outbreak Scenario, provenance promotes accountability by making explicit the authors, sources, and computational steps, inference steps, and choices made to reach a conclusion. (Edwards et al. 2009) gives an overview of the social science research requirements for accountability and policy assessment, together with a discussion of how automatic provenance recording and management can better support accountability.


Issues of trust are discussed in Stead et al., Missier et al. (SIGMOD 2007), and Aldeco-Perez and Moreau (VCS 2008). Artz and Gil (J. Web Sem. 2007) also give a survey of trust in computer science and the semantic web. Trust is widely described as a motivation for recording provenance. Many approaches model trust as Boolean or probability annotations that are propagated in various ways, for example using Boolean conjunction or disjunction or probabilistic operations to propagate trust annotations on raw data to trust annotations on results.
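The annotation-propagation view can be sketched in a few lines. Here trust scores are treated as independent probabilities; the combination rules and the scores themselves are illustrative choices, not a prescribed method from the literature cited above.

```python
# Propagating trust annotations along a derivation, treating
# scores as independent probabilities. Rules and numbers are
# illustrative assumptions.
def trust_and(*scores):
    """All inputs required (conjunction): multiply probabilities."""
    p = 1.0
    for s in scores:
        p *= s
    return p

def trust_or(a, b):
    """Either input suffices (disjunction): inclusion-exclusion."""
    return a + b - a * b

# A result derived by joining two sources...
joined = trust_and(0.9, 0.8)        # ~0.72: weaker than either input
# ...versus one corroborated independently by two sources.
corroborated = trust_or(0.9, 0.8)   # ~0.98: stronger than either input

print(round(joined, 2), round(corroborated, 2))
```

The point of grounding such rules in provenance is that the structure of the derivation (join versus corroboration) determines which combination rule applies to each step.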


Provenance records, like all electronic data, are subject to loss, degradation, or corruption. This means that the provenance records may be misleading or incomplete. If provenance is used to make important decisions, then attackers can influence these decisions by falsifying provenance. Some work on security for provenance addresses this using standard cryptographic signing techniques (Hasan et al. 2009). In the database community, some recent work has investigated the problem of inferring provenance for data for which the source has been lost (Zhang and Jagadish 2010).


Provenance records can be used to debug complex processes or clean dirty data. Differencing techniques (Bao et al. 2009), for example, can help clarify whether an anomalous result is due to errors in the source data, incorrect processing steps, or transient hardware or software failure. Techniques based on entailment analysis or dependency-tracking (Cheney et al. 2007) can also address similar issues, by highlighting data that contributed (or excluding data that did not contribute) to a selected part of the result. These techniques are closely related to debugging techniques such as program slicing and to security techniques such as information flow analysis.
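A minimal sketch of the dependency-tracking idea: each derived value carries the set of source identifiers that influenced it, so an anomalous result can be traced back to its inputs. The data sources and the pipeline step are invented for the example.

```python
# Dependency tracking for debugging: values carry the set of
# source identifiers they depend on. Sources and the pipeline
# are hypothetical.
class Tracked:
    def __init__(self, value, sources):
        self.value = value
        self.sources = set(sources)

def add(a, b):
    # The output depends on every source of either input.
    return Tracked(a.value + b.value, a.sources | b.sources)

lab = Tracked(12, {"lab-feed"})
er = Tracked(30, {"er-feed"})
manual = Tracked(5, {"manual-entry"})

total = add(add(lab, er), manual)
print(total.value)            # 47
print(sorted(total.sources))  # every input the total depends on
```

If the total looks wrong, the dependency set immediately narrows the search to three feeds, and excluding, say, "manual-entry" from a recomputation tests whether hand-entered data caused the anomaly; this is the same mechanism that program slicing and information flow analysis exploit.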

Gap Analysis for the Disease Outbreak Scenario

This scenario is data-centric, and there is currently a large gap between ideas that have been explored in research and techniques that have been adopted in practice, particularly for provenance in databases. This is an important problem that already imposes a high cost on curators and users, because provenance needs to be added by hand and then interpreted by human effort, rather than being created, maintained, and queried automatically. However, there are several major obstacles to automation, including the heterogeneity of systems that need to communicate with each other to maintain provenance, and the difficulty of implementing provenance-tracking efficiently within classical relational databases. Thus, further research is needed to validate practical techniques before this gap can be addressed.

In the workflow provenance (e.g. Taverna, Wings/Pegasus, ZOOM, etc.) and Semantic Web systems (Inference Web) area, provenance techniques are closer to maturity, in part because the technical problems are less daunting: the information is coarser-grained, typically describing larger computation steps rather than individual data items, and focusing on computations from immutable raw data rather than versioning and data evolution. There is already some consensus on graph-based data models for exchanging provenance information, and this technology gap can probably be addressed by a focused standardization effort.

Guidance to users about how to publish provenance at different granularities is also very important, for example whether to publish the provenance of an individual object or of a collection of objects. Users need to know how to use different existing provenance vocabularies to express these different types of provenance and what the consequences will be, for example how people will use this information and what information is needed to make it useful.

Analysis of Business Contract Scenario

The group conducted a detailed in-depth analysis of the Business Contract scenario. The provenance issues highlighted in this scenario include:

  • Checking whether past actions comply with stated obligations
  • Understanding how one product is derived from another
  • Filtering the information revealed in provenance by privacy and ownership concerns
  • Discovering where two sources have a hidden mutual dependency
  • Resolving apparent inconsistencies in multiple accounts of the same event
  • Verifying that those who performed actions had the expertise or authority to do so

Relevant State of the Art for the Business Contract Scenario

Here we describe the research conducted in relation to, and the technology available to fulfil, the requirements described above, with specific regard to the Business Contract scenario.

Existing Solutions Used Today for Business Contract Provenance

The Business Contract scenario envisages mechanisms to explain the details of procedures, in this case instances of design and production, such that they can be compared with the normative statements made in contracts, specifications or other legal documents. It further assumes the ability to filter this evidence on the basis of confidentiality. While we are not aware of any system providing exactly the functionality described, there are many which address particular aspects.

Tracking Design

It is critical to track which products a design decision or production action ultimately affects, so that it is possible both to show that what was done in producing an individual product fulfilled obligations and to determine which set of products a decision affected. As an example of the need for the latter, Toyota recently had to recall millions of vehicles due to a particular manufacturing issue [BBC10Toyota]. Without knowing the connection between manufacturing actions and products, many vehicles unaffected by the problem may be recalled, costing the company a great deal more than necessary.
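The kind of impact query described above can be sketched as a reachability computation over derivation edges. The following is a minimal, hypothetical illustration (all product and action names are invented): given "X was derived from Y" records, it finds the products that trace back to a given manufacturing action.

```python
# Hypothetical sketch: given provenance edges "effect was derived from cause",
# find every product ultimately affected by a given manufacturing action.
from collections import defaultdict

def affected_products(edges, action, products):
    """edges: iterable of (effect, cause) pairs; returns the subset of
    `products` reachable from `action` via derivation."""
    children = defaultdict(set)
    for effect, cause in edges:
        children[cause].add(effect)
    seen, stack = set(), [action]
    while stack:
        node = stack.pop()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return {p for p in products if p in seen}

# Invented example: only vehicles derived from the faulty setting are recalled.
edges = [("batch-1", "press-setting-A"), ("car-001", "batch-1"),
         ("car-002", "batch-1"), ("car-003", "batch-2")]
print(affected_products(edges, "press-setting-A",
                        {"car-001", "car-002", "car-003"}))
```

With complete derivation records, the recall can be scoped to exactly the affected products rather than the whole production run.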

A key feature of the VisTrails workflow system [Scheidegger08Vistrails] is its ability to capture not only the provenance of data produced by executing workflows, but also changes to the design of the workflow itself. The users can then employ this information to replay all or part of their workflow, revert decisions, etc.

Computer-Aided Design

Computer-aided design (CAD) systems can include features to capture what is occurring as a design is created. Aside from storing a history of changes to a design, some systems allow the rationale behind design choices to be captured [Bracewell08Rationale], which may be an essential part of the record in explaining how contractual obligations were aimed to be met (particularly where a contract states that some factor must be 'taken account of').

The provenance of a design goes beyond the changes made through a single CAD system, both because one design may be based on another previously developed for a different customer and because the design is just one part of a larger manufacturing and retail process. The interconnection between parts of a process can be eased by common formats for describing designs and by standardised archiving mechanisms used at every stage of the process [Patel08CAD].

Electronic Notebooks

Using the web for business processes allows multiple remotely deployed and regularly updated tools and services to be integrated through a single portal. One beneficial effect of this is that the different activities involved in a complex design process can be tracked together, so providing a single coherent record of how something was produced. These systems are sometimes called electronic notebooks [Reimer04Notebook].

Current Provenance Research Relevant to the Business Contract Scenario

Here we discuss how the technical requirements drawn from the Business Contract scenario are addressed by existing technology and research, focusing on surveys and on papers that introduce key ideas or conclude lines of research, rather than attempting comprehensive coverage of all related work. We proceed by discussing each dimension in turn.



In the scenario, BWF has many customers over time (and re-uses earlier designs in satisfying later customers). To ensure that adequate proof of contract fulfilment can be provided if needed, BWF and a customer may agree up front on the form this proof, i.e. the provenance data, would take. It is infeasible for such a format to be reinvented for each customer, so it is preferable to use a commonly accepted provenance model. Attempts at a shared exchange language for provenance (of the form required in the scenario) include the Open Provenance Model [Moreau10OPM] and the Proof Markup Language [Pinheiro06Proof].
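To make the idea of a shared exchange model concrete, the sketch below shows an OPM-style provenance record as plain (subject, relation, object) triples, using the OPM edge names `used`, `wasGeneratedBy`, `wasDerivedFrom` and `wasControlledBy`. The entity names (design-v1, review-1, etc.) are invented for illustration; real OPM serialisations carry considerably more structure.

```python
# Illustrative sketch only: a minimal OPM-style provenance record as
# (subject, relation, object) triples. Entity names are invented; the
# relation names follow the Open Provenance Model's edge vocabulary.
record = [
    ("review-1",  "used",            "design-v1"),
    ("design-v2", "wasGeneratedBy",  "review-1"),
    ("design-v2", "wasDerivedFrom",  "design-v1"),
    ("review-1",  "wasControlledBy", "engineer-smith"),
]

def relations(record, relation):
    """All (subject, object) pairs connected by `relation`."""
    return [(s, o) for s, rel, o in record if rel == relation]

print(relations(record, "wasDerivedFrom"))
```

Because both parties interpret the same small set of relations, a customer (or a third party) can query the producer's record without a per-contract format being negotiated.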



Nagappan and Vouk [Nagappan08Privacy] directly consider the issues of preserving an audit trail, suitable for later providing as evidence, and access control to ensure confidential provenance information is not exposed. This approach is applied within the context of a database accessed via a web query interface. More recently, Davidson et al. [Davidson07Workflow] have considered how to characterise and measure the privacy of information revealed within provenance data (in the form of an executed workflow).


A notable point in the scenario is that BWF do not know in advance that the request to prove contract compliance will be made, or what particular part of the design process will be queried, e.g. the competence of a particular engineer. In general, it is impossible to know exactly what information about current events, and in what detail, will be required by those later querying the provenance. On the other hand, it is also impossible to keep records of everything which occurs in full detail. This problem was noted by Bose and Frew [Bose05Survey], and attempts have been made to give engineers guidance in tackling it by Chapman and Jagadish [Chapman07Issues] and by Miles et al. [Miles10TOSEM].



Analogous to the scenario, Miles et al. [Miles07Validation] conducted research into checking the compliance of a past procedure with given constraints, where the procedure was a bioinformatics experiment. Ontology-based reasoning was used to allow the constraints (which concerned debugging and checking the use of licensed material) to be expressed at a coarser granularity than the description of the process in the provenance.
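The granularity gap can be illustrated with a small sketch (not the cited authors' method): each fine-grained recorded step is mapped up to a coarser class, and a constraint stated at the coarse level is then checked against the trace. All step and class names here are invented.

```python
# Hedged sketch in the spirit of coarse-grained compliance checking:
# recorded steps are mapped to coarser classes, and the constraint is
# expressed over classes, not individual steps. Names are illustrative.
step_class = {
    "blast-search":       "Analysis",
    "license-check-tool": "LicenceVerification",
    "publish-results":    "Publication",
}

def complies(steps, required_before, target):
    """Constraint: some step of class `required_before` must occur
    before any step of class `target` in the recorded trace."""
    seen_required = False
    for step in steps:
        cls = step_class.get(step)
        if cls == required_before:
            seen_required = True
        elif cls == target and not seen_required:
            return False
    return True

trace = ["blast-search", "license-check-tool", "publish-results"]
print(complies(trace, "LicenceVerification", "Publication"))
```

The point of the mapping is that the contract need only speak of "licence verification before publication", regardless of which concrete tools appear in the provenance.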

The question "Were two experts' assessments truly independent, or based on the same ultimate source?" is one tackled, in general form, by other existing work. As the sources from which an assessment is derived may not all be apparent from the assessment itself, especially where information is second- or third-hand, the provenance of the assessments can help determine whether there is commonality. Analogous research has been conducted by Townend et al. [Townend05Fault] in the area of debugging systems which provide multiple independent checks for fault tolerance in service-based applications.
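In provenance terms, the independence question reduces to asking whether two items' derivation ancestries intersect. The sketch below (with invented assessment and source names) computes each assessment's transitive sources and reports any overlap.

```python
# Hedged sketch: two assessments are suspect if their provenance shares
# an ultimate source. `derived_from` maps each item to its direct
# sources; all names are invented for illustration.
def sources(item, derived_from):
    """Transitive closure of an item's sources."""
    result, stack = set(), [item]
    while stack:
        for src in derived_from.get(stack.pop(), ()):
            if src not in result:
                result.add(src)
                stack.append(src)
    return result

def shared_sources(a, b, derived_from):
    return sources(a, derived_from) & sources(b, derived_from)

derived_from = {
    "assessment-A": ["report-1"],
    "assessment-B": ["report-2"],
    "report-1": ["lab-data"],
    "report-2": ["lab-data"],
}
# Both assessments ultimately rest on the same lab data:
print(shared_sources("assessment-A", "assessment-B", derived_from))
```

A non-empty result does not prove the assessments are wrong, only that they are not independent corroboration, which is exactly the distinction the scenario needs to draw.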


A question is raised in the scenario about the quality of the information given, and a dispute occurs over the semantics of time information. There exists work by Hartig and Zhou [Hartig09Quality] on using provenance of web data to assess its quality, in particular its timeliness. Given the reliance on correct engineering of a secure website for Customers Inc., deciding contract compliance in their or BWF's favour may be a question of the risk caused by BWF's engineering approach. Gandhi and Lee [Ghandi08Regulatory] propose a methodology for providing a chain of evidence to produce insights for risk assessment.


Also using provenance to determine contract compliance, Groth et al. [Groth09Trust] used OPM along with an electronic contract representation to determine whether a new contract proposal should be accepted based on its similarity with past contracts and the success of those contracts (as determined from the provenance).


With regards to design of software, such as BWF's websites, the provenance of a manufactured product is closely related to the traceability of the design. Jahnke et al. [Jahnke02History] proposed automatically capturing the design process by recording the changes in the models, so allowing a designer to relate a later design to an earlier one from which it was derived. They extended this idea to view re-design as a workflow-like process [Jahnke02Iterations], so tying traceability even more closely to many approaches to provenance.

Gap Analysis for the Business Contract Scenario

Given the analysis of the state of the art above, what are the gaps in technology which need to be addressed to achieve what the scenario proposes?

Overall, there is a gap in practice: provenance technologies are simply not being used for the described purpose. Even with encouragement, the existing state of the art does not make it a simple task to achieve the scenario, because of a lack of standards, guidelines and tools. Specifically, we can consider the gaps with regards to content, management and use, as modelled above.

First, there is no established way to express what needs to be expressed regarding the provenance of an engineered product, in such a way that this can be used by both the developer and the customer (and any independent third party). Each needs to first identify and retrieve the provenance of some product, then determine from the provenance whether one item is a later version of another, whether implementation activity was triggered by a successful quality check, etc. Moreover, even if a commonly interpretable form for such provenance information were agreed, there is no established way to then augment the data so that the quality of the process it describes can be verified, e.g. how to make it comparable against contractual obligations, or to ensure it cannot be tampered with prior to providing such proof. While provenance models, digital signatures, deontic formalisms, versioning schemes and so on provide an adequate basis for solutions, there is a substantial gap in establishing how these should be applied in practice. Without common expectations on how provenance is represented, a producer cannot realistically document its activities in a way which all of its customers can use.
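One of the ingredients mentioned above, tamper evidence, can be sketched with a keyed digest over a canonicalised provenance record. This is a minimal illustration, assuming a shared secret between the two parties; a real deployment would use public-key signatures so that independent third parties can also verify, and the record and key here are invented.

```python
# Sketch only: protecting a provenance record against tampering with an
# HMAC over its canonical JSON form. Key and record are illustrative; a
# real system would use public-key signatures for third-party verification.
import hmac, hashlib, json

key = b"shared-secret"  # illustrative only, never hard-code real keys
record = {"artifact": "design-v2", "wasGeneratedBy": "review-1"}

# Canonicalise (sorted keys) so both parties digest identical bytes.
payload = json.dumps(record, sort_keys=True).encode()
tag = hmac.new(key, payload, hashlib.sha256).hexdigest()

# Later, a verifier holding the same key recomputes and compares the tag.
check = hmac.new(key, json.dumps(record, sort_keys=True).encode(),
                 hashlib.sha256).hexdigest()
print(hmac.compare_digest(tag, check))
```

Any modification to the record after the tag was issued changes the recomputed digest, so the verifier can detect that the proffered "proof" is not what was originally recorded.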

Assuming that common representations are established, there is a gap in what the technology provides for storing and accessing that information. This goes beyond simply using existing databases due to the form of information conveyed in provenance. For example, if an independent expert documents, in some store local to themselves, that they checked a website design, this must be connected to the rest of the website's manufacturing process to provide a full account. While web technologies could be harnessed for the purpose, there is no established approach to interlinking across parts of provenance data. Moreover, the particular characteristics of provenance data may affect how other data storage requirements must be met: how to scale up to the quantities of interlinked information we can expect with automatic provenance recording, how to limit access to only non-confidential information, etc. Finally, the provenance, once recorded, has to be accessible, and there are no existing standards for exposing provenance data on the web.

With regards to the use of provenance, the scenario requires various sophisticated functions, and it is not obvious how existing technologies would realise them. For example, the various parties need to be able to understand the engineering processes at different levels of granularity, to resolve apparent conflicts in what appears to be expressed in the provenance data, to acquire indications and evidence of which parts of the record can be relied on, to determine whether the provenance shows that the engineering process conformed to a contract, to check whether two supposedly independent assessments rely on the same, possibly faulty, source, and so on. To encourage mass adoption of provenance technologies, it must be apparent, through guidelines and standardised approaches, how to achieve these kinds of functions.


There are many proposed approaches and technology solutions that are relevant to provenance, as we have illustrated with three very different scenarios. Despite this large and growing body of work, there are several major technology gaps in realizing the requirements of these scenarios. Organized by our major provenance dimensions, they are:

  • With respect to content:
    • No mechanism to refer to the identity/derivation of an information object. In the NA scenario objects are changed and republished as they are disseminated and rewritten in the blogosphere and the Twittersphere. In the DO scenario, data is published and reused on the web.
    • No guidance on what level of granularity should be used in describing provenance of complex objects. In the DO scenario, the data published may contain many records composed of complex objects and provenance could be associated at any level of aggregation and granularity.
    • No common standard for exposing and expressing provenance information that captures processes as well as the other content dimensions
    • No guidance on publishing provenance updates
    • No standard techniques for versioning and expressing provenance relationships linking versions of data
    • No standard formats which can characterise whether provenance is of adequate quality for proof, e.g. signatures
    • No pre-specified way to ensure provenance content can be compared against norms and expectations (e.g. contracts, required processes)
  • With respect to management:
    • No well-defined standard for linking provenance between sites.
    • No guidance for how existing standards can be put together to provide provenance (e.g. linking to identity).
    • No guidance for how application developers should go about exposing provenance in their web systems.
    • No proven approaches to manage the scale of the provenance records to be recorded and processed.
    • No standard mechanisms to find and access provenance information for each item that needs to be checked.
    • No well-defined means of ensuring only essential non-confidential provenance is released when querying. In the BC scenario, only portions of the provenance records should be available to the customer.
  • With respect to use:
    • No clear understanding of how to relate provenance at different levels of abstraction, or automatically extract high-level summaries of provenance from detailed records.
    • No general solutions to understand provenance published on the Web by another party
    • No standard representations that support integration of provenance across different sources
    • No standard representations that support comparison of provenance. In the DO scenario, we would like to compare the provenance of different scientific analysis results. In the BC scenario, we wish to compare provenance with business process specifications
    • No broadly applicable approaches for dealing with imperfections in provenance.
    • No broadly applicable methodology for making trust judgments based on provenance when there is varying information quality.
    • No standard methods for validating whether provenance is of adequate quality for proof
    • No mechanism to query whether provenance data shows that laws, regulations or contracts have been complied with
    • No standard mechanism to assess whether two supposedly independent assessments rely on the same, possibly faulty, source
    • No means to resolve conflicts in (possibly inferred) provenance data