Requirements Clean

From XG Provenance Wiki

About this document

Last updated April 7, 2010. Comments closed March 30, 2010 (see discussion tab). Final version for public dissemination at User_Requirements.


This document provides an accessible discussion of the requirements for provenance on the Web developed by the W3C Provenance Incubator Group. The document focuses on user requirements: those requirements that users have on software systems with respect to provenance. These requirements are organized into a number of dimensions focusing broadly on three aspects of provenance: the content that needs to be represented and maintained in provenance records, the management of provenance records, and the uses of provenance information. To illustrate these requirements, we use three scenarios that were synthesized from a large set of use cases collected by the group.

Introduction

The provenance of information is crucial to making determinations about whether information is trusted, how to integrate diverse information sources, and how to give credit to originators when reusing information. Broadly construed, provenance encompasses the initial sources of information used as well as any entity and process involved in producing a result. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. People make trust judgments based on provenance that may or may not be explicitly offered to them. Reasoners on the Semantic Web will need explicit representations of provenance information in order to make trust judgments about the information they use. With the arrival of massive amounts of Semantic Web data (e.g., via the Linked Open Data community), information about the provenance of that data becomes an important factor in developing new Semantic Web applications. Therefore, a crucial enabler of Semantic Web deployment is the explicit representation of provenance that is accessible to machines, not just to humans.

The W3C Provenance Group was formed in September 2009 as a W3C Incubator Activity. Its charter was to provide a state-of-the-art understanding and to develop a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization. At the time of writing, the group has been in existence for six months, half of its expected duration. The group has produced a number of documents, including a report of key dimensions of provenance, more than thirty use cases spanning many areas and contexts that illustrate these key dimensions, and a broad set of user requirements and technical requirements derived from those use cases. The group's reports are publicly available on the Provenance Group's web site.

The present document is intended for a broad audience and summarizes the group's findings regarding requirements for provenance motivated by use cases collected by the group. We refer the reader to the Provenance Group's web site for details and documentation.

We begin by describing three broad scenarios that illustrate the need for provenance:

  1. News Aggregator Scenario: a news aggregator site that assembles news items from a variety of sources (such as news sites, blogs, and tweets), where provenance records can help with verification, credit, and licensing
  2. Disease Outbreak Scenario: a data integration and analysis activity for studying the spread of a disease, involving public policy and scientific research, where provenance records support combining data from very diverse sources, justification of claims and analytic results, and documentation of data analysis processes for validation and reuse
  3. Business Contract Scenario: an engineering contract scenario, where the compliance of a deliverable with the original contract and the analysis of its design process can be done through provenance records

These three scenarios were motivated by the original use cases collected by the Provenance Group. They were designed to cover different subsets of requirements and address different communities of interest.

After describing the three scenarios, we discuss the requirements for provenance using the scenarios to illustrate and clarify. We group the requirements for provenance into three major areas of concern: content, management, and use. Content refers to the type of information that provenance records need to contain. Management refers to the mechanisms that make provenance available and accessible in an open system like the Web. Uses of provenance refer to the purposes and usage of provenance information.

Motivating Scenarios: The Need for Provenance

News Aggregator Scenario

Many web users would like to have mechanisms to automatically determine whether a web document or resource can be used, based on the original source of the content, the licensing information associated with the resource, and any usage restrictions on that content. Furthermore, in cases of mashed-up content it would be useful to ascertain automatically whether or not to trust it by examining the processes that created, processed, and delivered it. To illustrate these issues, we present the following scenario of a fictitious website, BlogAgg, that aggregates news information and opinion from the Web.

BlogAgg aims to provide rich real-time news to its readers automatically. It does this by aggregating information from a wide variety of sources, including microblogging websites, news websites, publicly available blogs, and other opinion. It is imperative for BlogAgg to present only credible and trusted content on its site in order to satisfy its customers, attract new business, and avoid legal issues. Importantly, it wants to ensure that the news it aggregates is correctly attributed to the right person so that they may receive credit. Additionally, it wants to present the most attractive content it can find, including images and video. However, unlike other aggregators, it wants to track many different web sites to try to find the most up-to-date news and information.

Unfortunately for BlogAgg, the source of the information is often not apparent from the data that it aggregates from the web. As a result, it must employ teams of people to check that selected content is both high-quality and can be used legally. The site would like this quality control process to be handled automatically.

For example, one day BlogAgg discovers that #panda is a trending topic on Twitter. It finds that a tweet, "#panda being moved from Chicago Zoo to Florida! Stop it from sweating http://bit.ly/ssurl", is being retweeted across many different microblogging sites. BlogAgg wants to find the correct originator of the microblog who first got the word out. It would like to check whether that originator is a trustworthy source and verify the news story. It would also like to credit the original source on its site, and in these credits it would like to include the email address of the originator if allowed by that person.

Following the tiny-url, BlogAgg discovers a site protesting the move of the panda. BlogAgg wants to determine what organization is responsible for the site so that its name can run next to the snippet of text that BlogAgg runs. In determining the snippet of text to use, BlogAgg needs to determine whether the text originated at the panda protest site or was quoted from another site. Additionally, the site contains an image of a panda that appears as if it is sweating. BlogAgg would like to automatically use a thumbnail version of this image on its site; therefore, it needs to determine whether the image license allows this or whether that license has expired and is no longer in force. Furthermore, BlogAgg would like to determine whether the image was modified, by whom, and whether the underlying image can be reused (i.e., whether the license of the image is actually correct). Additionally, it wants to find out whether any modifications were just touch-ups or were significant. Using the various information about the content it has retrieved, BlogAgg creates an aggregated post. For the post, it provides a visual seal showing how much the site trusts the information. By clicking on the seal, the user can inspect how the post was constructed and from what sources, along with a description of how the trust rating was derived from those sources.
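The trust seal described above implies aggregating per-source trust scores into a single rating while keeping the per-source breakdown inspectable. The following minimal sketch illustrates one way this could work; the function name, the weighted-mean scheme, and all example URIs and scores are hypothetical, not part of the scenario:

```python
# Hypothetical sketch: derive an aggregate trust rating for an aggregated
# post from per-source scores, keeping the breakdown so the "seal" can be
# inspected, as the scenario requires.

def aggregate_trust(sources):
    """sources: list of (uri, trust_score in [0, 1], weight) tuples."""
    total_weight = sum(w for _, _, w in sources)
    if total_weight == 0:
        return 0.0, []
    rating = sum(score * w for _, score, w in sources) / total_weight
    breakdown = [(uri, score) for uri, score, _ in sources]
    return rating, breakdown

rating, breakdown = aggregate_trust([
    ("https://twitter.com/panda_fan/status/1", 0.9, 2.0),  # verified originator
    ("http://panda-protest.example.org", 0.5, 1.0),        # unknown organization
])
print(round(rating, 2))  # → 0.77
```

Any real aggregator would derive the scores and weights themselves from provenance records (attribution, modification history, licensing); this sketch only shows the final combination step.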

Note that BlogAgg would want to follow this same sort of process for thousands to hundreds of thousands of sites a day, so it would want to automate its aggregation process as much as possible. It is important for BlogAgg to be able to detect when this aggregation process does not work, in particular when it cannot determine the origins of the content it uses. Such cases need to be flagged and reported to BlogAgg's operators.




Covered User Requirements:

  • C-Attr-UR 1: Determine who contributed to a document
  • C-Attr-UR 2: To be able to follow a process which will ensure correct management of provenance of data creators, including understanding of which content was created by which person
  • C-Vers-UR 1: Determine how content changes (i.e. its version) across the web and who is responsible for those changes
  • C-Vers-UR 1b: Determine and record how content has changed
  • C-Vers-UR 2b: Determine and record when content was changed
  • C-Vers-UR 5.1: Permit anonymization of the records before they are viewed or extracted
  • C-Vers-UR 6b: Use across different platforms
  • C-JUST-UR4: As much as possible of the above processes should be automated, to reduce effort and ensure compliance with regulations.
  • C-Entail-UR 3: Be aware of the sources that contributed to the materialized view, at both coarse and fine granularity levels. More specifically, we would like to know the database, table, or tuple thereof that was used in the computation of the view.
  • C-Entail-UR 8: Identify the date and time of the derivation
  • M-Pub-UR 1: Publish provenance information associated with data on the Web
  • M-Pub-UR2: Publish provenance in a way that makes it easy to access and query
  • M-Pub-UR3: Choose a representation format to publish provenance information
  • M-Pub-UR4: Users need to identify who published the provenance information
  • M-Scale-UR 1: Allow provenance tracking for large scale data such as blogs posts and other Web content
  • U-Under-UR 3: Be able to provide Subject Matter Experts (SMEs) with explanations of the rationale behind a certain outcome
  • U-Under-UR4: Enable users to approach the provenance graph at different levels of detail.
  • U-Inter-UR2: (Chained provenance) Enable users/systems to trace back the origin of an "entity" whose ancestors have been produced/generated by different systems.
  • U-Inter-UR3: (Record) Enable users/systems to express the provenance of an "entity" and make it persistent.
  • U-Inter-UR 5: Facilitate data sharing through interoperability of the associated provenance information
  • U-Tru-UR 1: Enable users to assess the trustworthiness of Web data.
  • U-Tru-UR 2: It should be possible for users to assess trust on Web data based on its attribution metadata
  • U-Tru-UR 3: Allow users to interpret the evaluation of the trustworthiness of Web data
  • U-Imper-UR 2: Allow users to access summarized provenance
  • U-Debug-UR 1: Allow users to detect where there is a single point of failure (source of potentially faulty information) somewhere in the process by which we came to have some derived information

Disease Outbreak Scenario

Many uses of the web involve the combination of data from diverse sources. Data can only be meaningfully reused if the collection processes are exposed to users. This enables the assessment of the context in which the data was created, its quality and validity, and the appropriate conditions for use. Often data is reused across domains of interest that end up mutually influencing each other. This scenario focuses on the reuse of data across disciplines in both anticipated and unanticipated ways.

Alice is an epidemiologist studying the spread of a new disease called owl flu (a fictitious disease made up for this example), with support from a government grant. Many studies relevant to public policy are funded by the government, with the expectation that the conclusions will provide guidance for public policy decision-makers. These decisions need to be justified by a cost-benefit analysis. In practice, this means that results of studies not only need to be scientifically valid, but the source data, intermediate steps and conclusions all need to be available for other scientists or non-experts to evaluate. In the United Kingdom, for example, there are published guidelines that social scientists like Alice are required to follow in reporting their results, called “The Green Book: Appraisal and Evaluation in Central Government”.

Alice’s study involves collecting reports from hospitals and other health services as well as recruiting participants, distributing and collecting surveys, and performing telephone interviews to understand how the disease spreads. The data collected through this process are initially recorded on paper and then transcribed into an electronic form. The paper records are confidential but need to be retained for a set period of time.

Alice will also use data from public sources (Data.gov, Data.gov.uk), or repositories such as the UK Data Archive (data-archive.ac.uk). Currently, a number of e-Social Science archives (such as NeISS, e-Stat and NESSTAR) are being developed for keeping and adding value to such data sets, which Alice may also use. Alice may also seek out "natural experiment" data sources that were originally gathered for some other purpose (such as customer sales records, Tweets on a given day, or free geospatial data).

Once the data are collected and transcribed, Alice processes and interprets the results and writes a report summarizing the conclusions. Processing the data is labor-intensive and may involve a combination of many tools, ranging from spreadsheets and generic statistics packages (such as SPSS or Stata) to analysis packages that cater specifically to Alice's research area. Some of the data sources may also provide querying or analysis processing that Alice uses at the time of obtaining the data.

Alice may find many challenges in integrating the data from different sources, since units or the semantics of fields are not always documented. When different data sources use different representations, Alice may need to recode or integrate this data by hand or by guessing an appropriate conversion function; these subjective choices may need to be revisited later. To prepare the final report and make it accessible to other scientists and non-experts, Alice may also make use of advanced visualization tools, for example to plot results on maps.

The conclusions of the report may then be incorporated into policy briefing documents used by civil servants or experts on behalf of the government to identify possible policy decisions that will be made by (non-expert) decision makers, to help avoid or respond to future outbreaks of owl flu. This process may involve considering hypotheticals or reevaluating the primary data using different methodologies than those applied originally by Alice. The report and its linked supporting data may also be provided online for reuse by others or archived permanently in order to permit comparing the predictions made by the study with the actual effects of decisions.

Bob is a biologist working to develop diagnostic and chemotherapeutic targets in the human pathogen responsible for owl flu. Bob’s experimental data may be combined with data produced in Alice’s epidemiological study, and Bob’s level of trust in this data will be influenced by the detail and consistency of the provenance information accompanying Alice’s data. Bob generates data using different experiment protocols such as expression profiling, proteome analysis, and creation of new strains of pathogens through gene knockout. These experiment datasets are also combined with data from external sources such as biology databases (NCBI Entrez Gene/GEO, TriTrypDB, EBI's ArrayExpress) and information from biomedical literature (PubMed) that have different curation methods and quality associated with them. Biologists need to judge the quality, timeliness and relevance of these sources, particularly when data needs to be integrated or combined.

These sources are used to perform "in silico" experiments via scientific workflows or other programming techniques. These experiments typically involve running several computational steps, for example running machine learning algorithms to cluster gene expression patterns or to classify patients based on genotype and phenotype information. The results need to meet a high standard of evidence so that the biologists can publish them in peer-reviewed journals. Therefore, justification information documenting the process used to derive the results is required. This information can be used to validate the results (by ensuring that certain common errors did not occur), to track down obvious errors or understand surprising correct results, to understand how different results arising from similar processes and data were obtained, and to infer heuristics for data quality.

As more data on owl flu outbreaks and treatments become available over time, Alice's epidemiological studies and Bob's research on the behavior of the owl flu virus will need to be updated by repeating the same analytical processes, incorporating the new data.



Covered User Requirements:

  • C-Attr-UR 1: Determine who contributed to a document (disambiguation).
  • C-Attr-UR 2: To be able to follow a process which will ensure correct management of provenance of data creators, including understanding of which content was created by which person (versions).
  • C-Proc-UR 1: Be able to determine the influence of an agent on a given (sub)process.
  • C-Proc-UR 2: It should be possible to reason on the outcome of a given process, assuming changes in their preconditions.
  • C-Proc-UR 3: A process should be reproducible from its provenance graph.
  • C-Proc-UR 5: It should be possible to analyze the provenance of a process at different levels of granularity.
  • C-Vers-UR 1: Determine how content changes (i.e. its version) across the web and who is responsible for those changes
  • C-JUST-UR1: The end results of engineering processes or scientific studies need to be justified by linking to source and intermediate data.
  • C-JUST-UR2: The justification should facilitate informed discussion and decisions about the results
  • C-JUST-UR3: The justification should be preserved so that the actual long-term behavior of a product, or effects of a policy can be compared with predictions
  • C-JUST-UR4: As much as possible of the above processes should be automated, to reduce effort and ensure compliance with regulations.
  • C-Entail-UR 1: Decide the trustworthiness of a reasoning process or a materialized view.
  • C-Entail-UR 6: Identify agents (e.g., humans and software components) responsible for conclusion derivation
  • C-Entail-UR 7: Identify the transformation pattern used to derive conclusions
  • C-Entail-UR 8: Identify the date and time of the derivation
  • C-Entail-UR 9: Identify any input information directly and indirectly used to derive conclusions
  • M-Acc-UR 2: Given a set of provenance information, the user must be able to determine the source and authority of the provenance author
  • M-Acc-UR 4: Provide a way for stable provenance information to survive deidentification processes without endangering privacy.
  • M-Diss-UR 1: Verify that data, disseminated to some entity for processing, was processed for a purpose which was valid under some generally applied rules of validity, or as stated by the entity upon requesting the data.
  • M-Diss-UR 2: Verify that data, disseminated to some entity for processing, was processed only by that entity
  • M-Diss-UR 4: Check which uses of some data are affected by a change in that data, including those of remote, independent users who copied the data long before the change.
  • U-Under-UR 3: Be able to provide Subject Matter Experts (SMEs) with explanations of the rationale behind a certain outcome (domain)
  • U-Under-UR4: Enable users to approach the provenance graph at different levels of detail, e.g. enabling understanding for users of different levels of expertise, ranging from novice to expert (granularity).
  • U-Inter-UR 1 Enable users/systems to merge metadata about a same "entity" according to its attribution/provenance
  • U-Inter-UR2: (Chained provenance) Enable users/systems to trace back the origin of an "entity" whose ancestors have been produced/generated by different systems.
  • U-Inter-UR 5: Facilitate data sharing through interoperability of the associated provenance information
  • U-Comp-UR 1: Enable users to determine the similarities and differences between past processes or events
  • U-Tru-UR 1: Enable users to assess the trustworthiness of Web data.
  • U-Imper-UR 1: Allow users to access provenance information even if it cannot be directly observed.
  • U-Debug-UR 1: Allow users to detect where there is a single point of failure (source of potentially faulty information) somewhere in the process by which we came to have some derived information

Business Contract Scenario

In scientific collaborations and in business, individual entities often enter into some form of contract to provide specific services and/or to follow certain procedures as part of the overall effort. Proof that work was performed in conformance with the expectations of the project leadership (as expressed in the contract) is often required in order to receive payment and/or to settle disputes. Such proof must, for example, document work that was performed on specific items (samples, artifacts), provide a variety of evidence that would preclude various types of fraud, allow a combination of evidence from multiple witnesses, and be robust to providing partial information to protect privacy or trade secrets. To illustrate such demands of proof, and the other requirements which stem from having such information, we consider the following use case.

Bob's Website Factory (BWF) is a fictitious company that creates websites which include secured functionality, such as for payments. Customers choose a template website structure, enter specifications according to a questionnaire, and upload company graphics, and BWF creates an attractive and distinct website. The final stage of purchasing a website is for the customer to agree to a contract setting out the responsibilities of the customer and BWF. The final contract document, including BWF's digital signature, is automatically sent by email to both parties.

BWF has agreed a contract with a customer, Customers Inc., for the supply of a website. Customers Inc. are not happy with the product they receive, and assert that contractual requirements on quality were not met. Specifically, BWF finds it must defend itself against claims that the work to create a site to the specifications was not performed, or was performed by someone without adequate training, and that the security of payments on the site is faulty because proper quality control and testing procedures were not followed. Finally, Customers Inc. claim that records were tampered with to remove evidence of the latter improper practices.

BWF wish to defend themselves by providing proof that the contract was executed as agreed. However, they have concerns about what information to include in such a proof. Many websites are not designed from scratch but are based on an existing design, adapted in response to the customer's requests or problems. Also, sometimes parts of the designs of multiple different sites, created for other customers, are combined to make a new website. The need to protect both its own intellectual property and confidential information regarding other customers means that BWF wishes to reveal only the information needed to defend against Customers Inc.'s claims.

There are many kinds of entities relevant to describing BWF's development processes, from source code text in programs to images. To provide adequate proof, BWF may need to include documentation on all of these different objects, making it possible to follow the development of a final design through multiple stages. The contract number of a site is not enough to unambiguously identify that artefact, as designs move through multiple versions: the proof requires showing that a site was not developed from a version of a design which did not meet requirements or which used security software known to be faulty.

With regard to showing that there was adequate quality control and testing of the newly designed or redesigned sites, BWF needs to demonstrate that a design was approved in independent checks by at least two experts. In particular, the records should show that the experts were truly independent in their assessments, i.e. they did not both base their judgement on one, possibly erroneous, summarised report of the design.

Finally, Customers Inc. claim that there is a discrepancy in the records, suggesting tampering by BWF to hide their incompetence: the development division apparently claimed to have received instructions from the experts checking the design that it was OK before the experts themselves claim to have supplied such a report. BWF suspects, and wishes to check, that this is due to a difference in semantics between the reported times, e.g. in one case the time regards the receipt of the report by the developers, and in the other it regards the receipt by the experts of the developers' acknowledgement of the report. These reports should be shared in a format which both parties understand.



Covered User Requirements:

  • C-Attr-UR 2: To be able to follow a process which will ensure correct management of provenance of data creators, including understanding of which content was created by which person
  • C-Proc-UR 1: Be able to determine the influence of an agent on a given (sub)process
  • C-Proc-UR 4: It should be possible to compare processes with each other, group them into categories or clusters, and retrieve them on the basis of their provenance graphs.
  • C-Vers-UR 1b: Determine and record how content has changed (tweets, IQ assessment for linked data, timeliness)
  • C-Vers-UR 2b: Determine and record when content was changed (tweets, IQ assessment for linked data, timeliness)
  • C-Vers-UR 3b: Determine and record who and/or what is responsible for the changes (tweets, IQ assessment for linked data, timeliness)
  • C-Vers-UR 5b: View and extract some or all of the provenance records as required (tweets, IQ assessment for linked data, timeliness)
  • C-JUST-UR1: The end results of engineering processes or scientific studies need to be justified by linking to source and intermediate data.
  • C-JUST-UR2: The justification should facilitate informed discussion and decisions about the results
  • C-JUST-UR3: The justification should be preserved so that the actual long-term behavior of a product, or effects of a policy can be compared with predictions
  • C-Entail-UR 1: Decide the trustworthiness of a reasoning process or a materialized view
  • C-Entail-UR 3: Be aware of the sources that contributed to the materialized view, at both coarse and fine granularity levels.
  • C-Entail-UR 6: Identify agents (e.g., humans and software components) responsible for conclusion derivation
  • C-Entail-UR 9: Identify any input information directly and indirectly used to derive conclusions
  • M-Pub-UR3: Choose a representation format to publish provenance information
  • M-Acc-UR 1: Given an entity, the user must be able to query a single source or federation of sources for provenance information directly applicable to that entity
  • M-Acc-UR 3: Given a set of provenance information, the user must be able to query a single source or federation of sources to find partial matches of that provenance information in other sets of provenance information, usually to corroborate the history of any of the entities in the set
  • M-Acc-UR 4: Provide a way for stable provenance information to survive deidentification processes without endangering privacy
  • M-Diss-UR 5: Prove or disprove that a resource produced by one user is actually derived from a design produced by another user
  • U-Acct-UR 1: Allow users to verify that the work performed meets the contract decided upon earlier
  • U-Imper-UR 2: Allow users to access summarized provenance
  • U-Debug-UR 1: Allow users to detect where there is a single point of failure (source of potentially faulty information) somewhere in the process by which we came to have some derived information

Requirements for Provenance on the Web

We organize the requirements for provenance into three related categories: content, management, and use. We will refer to the three scenarios described above: the News Aggregator Scenario, the Disease Outbreak Scenario, and the Business Contract Scenario.

Content

Content refers to the types of information that would need to be represented in a provenance record: that is, the structures and attributes that would need to be defined in order to capture the kinds of provenance information that we envision.

We first need to establish what artifact is the object of any statements about provenance, and be able to refer to it. This object can be a variety of things. On the Web, it will be a web resource: essentially anything that can be identified with a URI, such as web documents, datasets, assertions, or services. In the News Aggregator Scenario, the object of provenance was the final news item posted by the news aggregator. It can also be a set or collection of items, as we may wish to associate provenance information with a group of items, such as the pages of the website sold in the Business Contract Scenario. Conversely, sometimes provenance information needs to refer to a particular portion or aspect of an artifact. A challenge to keep in mind is being able to keep track of provenance during an object's lifetime. For example, objects may be organized into collections, then subgroups selected, then portions of some objects modified, and so on.
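The identification problem above, in which a collection, one of its members, or a portion of an artifact can each be the subject of provenance statements, can be sketched with a simple URI-keyed structure. All class names, field names, and example URIs below are illustrative assumptions, not a proposed standard:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Artifact:
    """Anything identifiable by a URI: a document, dataset, assertion, or service."""
    uri: str
    # A collection can itself be the subject of provenance statements...
    members: List[str] = field(default_factory=list)  # URIs of member artifacts
    # ...and so can a portion of an artifact (e.g. one page of a website).
    part_of: Optional[str] = None                     # URI of the enclosing artifact

# The website sold in the Business Contract Scenario, as a collection of pages:
site = Artifact("http://bwf.example.com/site/42",
                members=["http://bwf.example.com/site/42/page1",
                         "http://bwf.example.com/site/42/page2"])
page = Artifact("http://bwf.example.com/site/42/page1", part_of=site.uri)
```

Because both `site` and `page` have URIs, provenance statements can attach to either level of granularity, which is exactly what the lifetime-tracking challenge above requires.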

One important aspect of content is Attribution. Attribution refers to the sources (typically any web resource that has an associated URI, such as documents, web sites, or data) or entities (people, organizations, and other identifiable groups) that contributed to creating the artifact in question. In the News Aggregator Scenario, sources would be the URIs for a blog or for a news item at a website, while entities would include Twitter.com and the person who created the original microblog. Typically an artifact would be associated with several sources and entities. One challenge in this respect is to represent which of them has responsibility for the artifact at hand, meaning which of the many entities associated with its creation actually endorses or stands by it. For example, in the News Aggregator Scenario the image of the panda may have been modified before being posted on the protest site, but some entity must originally have vouched for having taken the actual picture of a genuinely sweating panda. Similarly, in the Business Contract Scenario, it is a critical part of the argument to distinguish the experts' checking of a design's validity from the developers' creation of a site from that design. In addition, the provenance representation of attribution should also enable us to see the true origin of any statement of attribution to an entity. It is important to represent whether the statement was recorded by that entity and can be verified (for example, with a digital signature), or whether the attribution statement was reconstructed and then asserted by a third party.
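The distinction drawn above, between attribution recorded (and verifiable) by the attributed entity itself and attribution reconstructed by a third party, can be sketched as follows. The field names, identifiers, and the placeholder signature value are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttributionStatement:
    artifact_uri: str    # the resource the statement is about
    attributed_to: str   # the source or entity credited with the artifact
    asserted_by: str     # who recorded this attribution statement
    signature: Optional[str] = None  # e.g. a digital signature, if verifiable

    @property
    def self_asserted(self) -> bool:
        # True when the credited entity also recorded the statement itself
        return self.asserted_by == self.attributed_to

    @property
    def verifiable(self) -> bool:
        return self.signature is not None

# Attribution recorded and signed by the originator of the tweet:
first_hand = AttributionStatement(
    "http://bit.ly/ssurl", "twitter:panda_fan", "twitter:panda_fan",
    signature="sig:placeholder")
# Attribution reconstructed and asserted by the aggregator:
reconstructed = AttributionStatement(
    "http://bit.ly/ssurl", "twitter:panda_fan", "blogagg.example.com")
```

A consumer such as BlogAgg could then weight `first_hand` statements more heavily than `reconstructed` ones when deriving its trust seal.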

Another important aspect of provenance content is Process. This refers to the activities (or steps) that were carried out to generate the artifact at hand. These activities may range from the execution of a computer program that we can explicitly point to, to a physical act that we can only refer to, to an action performed by a person that can only be partially represented. An example of such an activity in the Disease Outbreak Scenario is the execution of a machine learning algorithm to create a classification for a clinical trial dataset. In the News Aggregator Scenario, examples of activities are each of the retweets (each representing a person's act of taking an assertion and including it in a new tweet) and each of the edits to the panda image made by different people. Activities are typically related to one another temporally or causally, and the provenance representation should capture such relationships. Provenance information may need to refer to descriptions of the activities, so that it becomes possible to support reasoning about processes involved in provenance and support descriptive queries about them. For example, in the Disease Outbreak Scenario, the machine learning algorithm used may be described as a Bayes algorithm or a decision tree algorithm, and a user may query whether the classifier is human-readable or not (a decision tree is, while a Bayes classifier is not). Processes can be represented at a very abstract level, focusing only on important aspects, or at a very fine-grained level that includes minute details. In this respect, an important consideration is to represent enough process details to allow reproducibility, that is, the ability to re-create the artifact by re-enacting (re-executing) the process as described by its provenance record.
For example, in the Disease Outbreak Scenario a scientist may want to re-run the same machine learning algorithm to include more recent data and check whether they obtain the same classification or a different one. For this they would need to have access to the code and perhaps the parameter settings, while it would not be important to know what particular machine was used or how much memory was originally allocated to run the algorithm. Process provenance information needs to be connected with attribution information, so that it is possible to represent how sources or entities were involved in what activities and what their role was. For example, in the Disease Outbreak Scenario the clinical dataset was a source that played the role of being input to the machine learning algorithm, while the scientist was associated with that same activity as the entity that set its parameters and submitted it for execution. An important aspect of representing process provenance is resource access. This includes matters such as the access time, the server accessed, and the party responsible for the server. For example, in the Disease Outbreak Scenario the dataset of patients grows larger over time, so the results of the analysis process could change drastically depending on when it was accessed. In the News Aggregator Scenario, any of the tweets and posts may have been corrected after the access occurred, and if so it would be important to know at what time the content of any resource was accessed.
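The connections described above, between activities, the sources and entities involved in them with their roles, access times, and the parameters needed for re-enactment, can be sketched as a simple record structure. The class and field names are hypothetical, chosen only to mirror the requirements in the text:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Involvement:
    """Links a source or entity to an activity, together with its role."""
    participant: str  # URI of the source or entity
    role: str         # e.g. "input", "executor", "parameter-setter"

@dataclass
class Activity:
    """One step in the process that produced an artifact."""
    name: str
    parameters: dict = field(default_factory=dict)   # enough detail to re-enact
    accessed_at: Optional[datetime] = None           # when resources were read
    involvements: list = field(default_factory=list)

# Hypothetical record for the analysis step in the Disease Outbreak Scenario.
training = Activity(
    name="train-classifier",
    parameters={"algorithm": "decision-tree", "max_depth": 5},
    accessed_at=datetime(2010, 3, 1, tzinfo=timezone.utc),
    involvements=[
        Involvement("http://example.org/data/clinical-trials", "input"),
        Involvement("http://example.org/people/scientist", "parameter-setter"),
    ],
)

def rerun_parameters(activity: Activity) -> dict:
    """The subset of the record a scientist needs to re-enact the step."""
    return dict(activity.parameters)
```

Note that machine details (hardware, memory allocation) are deliberately absent: the record keeps only what reproducibility requires.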

Evolution and versioning must be treated with special status in a provenance representation. As an artifact evolves over time, its provenance should be augmented in specific ways that reflect the changes made over prior versions and what entities and processes were associated with those changes. When one has full control over an artifact and its provenance records this may be a simple matter of housekeeping, but it is a challenge in open distributed environments such as the Web. Consider the representation of provenance when republishing, for example by retweeting, reblogging, or repackaging a document. It should be possible to represent when a set of changes warrants designating a new version of the object in question. In some cases, it may be desirable to specify the reasons for some changes, for example summarizing a document to facilitate readability or compressing a dataset to reduce its size. In addition, it is an open question whether each new version published should publish not only the artifact but also its full provenance as available in the original source. In the Business Contract Scenario, when the design of a website developed for one customer has aspects drawn from designs of other customers, publishing the full provenance may breach confidentiality: information might be revealed about past customers. Another open question is how to associate provenance information with each of the updates to an artifact. One must consider whether the entire provenance trail should be associated with each update, or whether each change should only represent the delta difference and refer to prior versions of the artifact for more extensive provenance.
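The delta option mentioned at the end of the paragraph can be sketched as follows: each new version records only its own change and a pointer to the prior version, and the full trail is reconstructed on demand by walking the chain. Names and the change descriptions are hypothetical:

```python
from typing import Optional

class Version:
    """A version whose provenance stores only its delta plus a back-pointer."""
    def __init__(self, label: str, delta: str, prior: Optional["Version"] = None):
        self.label = label
        self.delta = delta    # what changed, e.g. "summarized for readability"
        self.prior = prior    # earlier version, which holds its own delta, etc.

def full_trail(version: Version) -> list:
    """Reconstruct the entire provenance trail by walking back-pointers."""
    trail = []
    v = version
    while v is not None:
        trail.append((v.label, v.delta))
        v = v.prior
    return list(reversed(trail))

# Hypothetical chain: an original tweet, a retweet, then a repackaged item.
v1 = Version("tweet", "created")
v2 = Version("retweet", "republished verbatim", prior=v1)
v3 = Version("news-item", "summarized for readability", prior=v2)
```

The trade-off the text raises is visible here: each update stays small, but reconstructing the full trail requires that all prior versions remain accessible.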

A particular kind of provenance information is justifications of decisions. The purpose of a justification is to allow those decisions to be discussed and understood. In the Business Contract Scenario, the critical requirement is for a company to justify the processes it undertook (or did not undertake), the involvement of particular people, the intermediate data (designs) used, and so on, leading to a decision to provide some website to a customer under a contractual agreement. This decision may then be discussed in a court. To allow justifications to be made, the justifying party must be prepared, i.e., they must adequately store the provenance data that can be used to make their case, and set up procedures, ideally automatic, to gather that information. Another example of the importance of using provenance to justify decisions is in the Disease Outbreak Scenario, where it is important to capture the arguments for and against the conclusions of the owl flu report as well as the evidence behind particular hypotheses in the report.

Inference may be required to derive information from the original provenance records. Some provenance information may be directly asserted by the relevant sources of some data or actors in a process, while other information may be derived from that which was asserted. In general, one fact may entail another, and this is important in the case of provenance data, which inherently describes the past, for which the majority of facts cannot now be known. To take an example from the Business Contract Scenario, the complainants in the case derive from the (ambiguous) records held by the company that a check on the quality of security-related code was performed only after development based on that code had started. This derivation should not be taken by the court as plain fact, but understood as the result of a particular derivation process, based on a set of assumptions, and made by a particular (biased) source.

Management

Provenance management refers to the mechanisms that make provenance available and accessible in a system.

An important issue is the publication of provenance. Provenance information must be made available on the web. Related issues include how provenance is exposed, discovered, and distributed. A provenance representation language must be chosen and made available so others can refer to it in interpreting the provenance. The publisher of provenance information should be associated with provenance records.

Once provenance is available, it must be accessible. That is, it must be possible to find it by specifying the artifact of interest. For example, in the News Aggregator Scenario the aggregator must be able to point to a tweet and query about its provenance. In some cases, it must be possible to determine what the authoritative source of provenance is for some class of entities. For example, in the Business Contract Scenario, the website development company will argue that their own records are the authoritative sources regarding their own internal processes. Query formulation and execution mechanisms must be defined for the provenance representation.
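A toy sketch of this accessibility requirement: given an artifact's URI, retrieve its provenance records and pick out those from a publisher claiming authority. The in-memory store and record fields are hypothetical stand-ins for a real query service over published provenance:

```python
# Toy in-memory store keyed by artifact URI; a real system would expose
# this through a query service over published provenance records.
provenance_store = {
    "http://example.org/tweets/123": [
        {"publisher": "http://twitter.com", "authoritative": True,
         "statement": "posted by the original reporter"},
        {"publisher": "http://example.org/aggregator", "authoritative": False,
         "statement": "retrieved and republished"},
    ],
}

def provenance_of(artifact_uri: str) -> list:
    """All provenance records known for an artifact."""
    return provenance_store.get(artifact_uri, [])

def authoritative_records(artifact_uri: str) -> list:
    """Only the records whose publisher claims authority for this artifact."""
    return [r for r in provenance_of(artifact_uri) if r["authoritative"]]
```

This mirrors the two lookups the scenarios call for: the aggregator pointing to a tweet to ask for its provenance, and a party identifying which source of provenance is authoritative.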

In realistic settings, provenance information will have to be subject to dissemination control. Provenance information may be associated with access policies about what aspects of provenance can be made available given a requestor's credentials. Provenance may have associated use policies about how an artifact can be used given its origins. This may include licensing information stated by the artifact's creators regarding what rights of use are possible for the artifact. For example, in the News Aggregator Scenario the aggregator needs to be able to determine what license the panda image actually has. Finally, provenance information may be withheld from access for privacy protection. In the Business Contract Scenario, the website developer has a need to filter what provenance information is revealed about a design due to its entanglement with their own intellectual property and confidential information about other customers.

The scale of provenance information is a major concern, as the size of the provenance records may far exceed the scale of the artifacts themselves. Despite the presence of large amounts of provenance, efficient access to provenance records must be possible. Tradeoffs must be made between the granularity of the provenance records kept and the actual amount of detail needed by users of provenance. For instance, in the Disease Outbreak Scenario, the complete provenance of the result of an analysis may include a huge portion of the biomedical literature, distributed across many publication repositories.

Use

We need to take into account requirements for provenance based on the use of any provenance information that we have recorded. The same provenance records may need to accommodate a variety of uses as well as diverse users/consumers.

An important consideration is how to make provenance information understandable to its users/consumers. Just because the information they need is recorded in the provenance representation does not mean that they will be able to use it for their purposes. An important challenge that we face is to allow for multiple levels of abstraction in the provenance records of an artifact, as well as multiple perspectives or views over such provenance. In the Disease Outbreak Scenario a scientist may want to start with a high-level description that includes only the machine learning algorithms used but not any details of the data format conversion operations that were carried out, with the latter shown upon request. In addition, appropriate presentation and visualization of provenance information is an important consideration, as users will likely prefer something other than a set of provenance statements. For example, in the Disease Outbreak Scenario a scientist may prefer to look at a causal diagram of the processes used to generate a result, while in the News Aggregator Scenario a simple logo signifying approved provenance may be more appropriate. Transitions between different presentations of provenance need to be defined so that users understand how the different views are related. To achieve understandability, it is important to be able to combine general provenance information with domain-specific information. This is the case in the Business Contract Scenario, where general information about how one item derives from another (a website from a design) needs to be combined with information specific to the case in hand, namely the specifications unique to the customer. Different presentations may be needed for end users with different levels of expertise in the subject matter.

Because provenance information may be obtained from heterogeneous systems and different representations, interoperability is an important requirement. A query may be issued to retrieve information from provenance records created by different systems, which then need to be integrated. At a finer grain, the provenance of a given artifact may be specified by multiple systems and need to be combined. It could be that each system contributed a type of provenance information, or that different systems contributed the provenance of sources used to create the artifact. In the News Aggregator Scenario, each news site may use a different representation of provenance, and the aggregator would need to integrate them to provide a coherent provenance picture to the user. Users may also want to know which sources contributed which provenance information, so they can make decisions in case of conflicts. If the different systems use different vocabularies, shared standards or mappings would be needed to facilitate interoperability and usability.
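The mapping approach mentioned above can be sketched minimally. The two site vocabularies and the shared terms (`attributedTo`, `generatedAt`) are hypothetical, but the pattern, translating each system's terms into a shared vocabulary while recording which system contributed the record, is what the integration requirement calls for:

```python
# Hypothetical term mappings from two systems' vocabularies into a shared one.
MAPPINGS = {
    "siteA": {"author": "attributedTo", "created": "generatedAt"},
    "siteB": {"byline": "attributedTo", "pubdate": "generatedAt"},
}

def normalize(record: dict, system: str) -> dict:
    """Translate one system's provenance record into the shared vocabulary,
    keeping track of which system contributed it (useful for conflicts)."""
    mapping = MAPPINGS[system]
    normalized = {mapping.get(k, k): v for k, v in record.items()}
    normalized["contributedBy"] = system
    return normalized

# Two sites describe the same news item in their own vocabularies.
a = normalize({"author": "alice", "created": "2010-03-01"}, "siteA")
b = normalize({"byline": "alice", "pubdate": "2010-03-02"}, "siteB")
```

After normalization the aggregator can compare the two records term by term, and the retained `contributedBy` field lets a user decide which source to prefer when the values conflict (as the dates do here).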

Another important use of provenance is for comparison of artifacts based on their origins. Two artifacts may seem very different while their provenance may indicate significant commonalities. Conversely, two artifacts may seem alike, and their provenance may reveal important differences. For example in the Disease Outbreak Scenario two scientists may want to compare results from their experiments by comparing the process provenance information as well as the entities (e.g., reagents) that they each used.

Provenance data can be used for accountability, such as in the Business Contract Scenario, where the website engineering company uses provenance data to account for its actions. Specifically, accountability may mean allowing users to verify that work performed meets a contract decided upon earlier, determining the license that a composite object has due to the licenses of its components, or checking that the account of the past suggested by a provenance record complies with regulations about what should have occurred. Accountability requires that users can rely on the provenance record and authenticate its sources. This could be achieved with signatures, or possibly by combining multiple independent accounts that collectively give enough weight to the claim that the correct thing happened, as occurs with the two experts' accounts in the engineering scenario.

A very important use of provenance is trust. Provenance information is used to make trust judgments on a given entity. For example, in the News Aggregator Scenario, the aggregator site makes decisions about whether to include a news item based on its provenance. Users can make similar decisions based on provenance as well, for example by defining filters to eliminate news items based on specific properties of their provenance information. Trust is often based on attribution information, by checking the reputation of the entities involved in the provenance, perhaps based on past reliability ratings, known authorities, or third-party recommendations. Similarly, measures of the relative information quality can be used to choose among competing evidence from diverse sources. For example, in the Disease Outbreak Scenario, researchers will need to assess the quality of the data they are using, particularly if they are obtained from open public sources that do not follow formal creation or curation processes. Finally, users should be able to access and understand how trust assessments are derived from provenance.
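The filtering use mentioned above can be sketched as follows. The reputation table, threshold, and item fields are hypothetical; the point is only that a trust decision is computed from a property of the item's provenance (here, the reputation of the attributed source) rather than from the item's content:

```python
# Hypothetical reputation scores for originating entities, e.g. derived from
# past reliability ratings or third-party recommendations.
REPUTATION = {
    "http://example.org/wire-service": 0.9,
    "http://example.org/anonymous-blog": 0.2,
}

def trusted(item: dict, threshold: float = 0.5) -> bool:
    """Accept an item only if its attributed source clears the threshold.
    Unknown sources default to zero reputation, i.e. they are rejected."""
    return REPUTATION.get(item["attributedTo"], 0.0) >= threshold

# Two candidate news items with provenance-derived attribution.
items = [
    {"title": "flood warning", "attributedTo": "http://example.org/wire-service"},
    {"title": "panda protest", "attributedTo": "http://example.org/anonymous-blog"},
]
accepted = [i["title"] for i in items if trusted(i)]
```

The final requirement in the paragraph, that users can understand how trust assessments are derived, is served here by the filter being an explicit, inspectable rule rather than an opaque score.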

Using provenance information may imply handling imperfections. Provenance information may be incomplete in that some information may be missing, or incorrect if there are errors. Provenance information may also be provided with some uncertainty or be of a probabilistic nature. These imperfections may be caused by problems with the recording of the provenance information. But they may also arise because the user does not have access to the complete and accurate provenance records even if they exist. This is the case when provenance is summarized or compressed, in which case many details may be abstracted away or missing. It is also the case if the user does not have the right permissions to access some aspects of the provenance records, or if some aspect of the records needs to be withheld for privacy reasons. Finally, provenance may be subject to deception and be partially or entirely fraudulent. Several of the latter cases can be seen in the Business Contract Scenario: the engineering company hides some records due to confidentiality, leading to gaps in the account, while the complainant claims that this and the apparent time discrepancies suggest that the records have been fraudulently altered to support the engineering company's case.

Another use of provenance is debugging. Users may want to detect failure symptoms in the provenance records and diagnose problems in the process that generated an artifact, whether conducted in a software system or by people. An example is shown in the Business Contract Scenario, where the company tries to determine whether it has made an error due to two apparently independent evaluations being dependent on the same faulty source information. Without a record of the provenance of those evaluations, it would be impossible to determine whether such a 'bug' in the process had indeed occurred.

Concluding Remarks

This document focuses on user requirements for provenance, motivated by an extensive set of use cases contributed by members of the Provenance Incubator Group that cover a wide variety of perspectives on the topic of provenance.

This document will be followed by a state-of-the-art report on provenance, which will place existing work on provenance in the context of the user requirements described here.