What Is Provenance
Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. The provenance of information is crucial to making determinations about whether information is trusted, how to integrate diverse information sources, and how to give credit to originators when reusing information. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. People make trust judgments based on provenance that may or may not be explicitly offered to them. Reasoners in the Semantic Web will need explicit representations of provenance information in order to make trust judgments about the information they use. With the arrival of massive amounts of Semantic Web data (e.g., via the Linked Open Data community), information about the origin of that data, i.e., provenance, becomes an important factor in developing new Semantic Web applications. Therefore, a crucial enabler of Semantic Web deployment is the explicit representation of provenance information that is accessible to machines, not just to humans.
Provenance is concerned with a very broad range of sources and uses. Business applications may exploit provenance in trusting a product as they consider the manufacturing processes involved. The provenance of a cultural artifact in terms of its origins and prior ownerships is crucial to determine its authenticity. In a scientific context, data is integrated depending on the collection and pre-processing methods used, and the validity of an experimental result is determined based on how each analysis step was carried out. Throughout this diversity, there are many common threads underpinning the representation, capture, and use of provenance that need to be better understood to enable a new generation of Semantic Web applications that take provenance and trust into account.
There are many pockets of research and development that have studied relevant aspects of provenance. The Semantic Web and agents communities have developed algorithms for reasoning about unknown information sources in a distributed network. Logic reasoners can produce justifications of how an answer was derived, and explanations that help find and fix errors in ontologies. The information retrieval and argumentation communities have investigated how to amalgamate alternative views and sources of contradictory and complementary information, taking into account their origins. The database and distributed systems communities have looked into the issue of provenance in their respective areas. Provenance has also been studied for workflow systems in e-Science to represent the processes that generate new scientific results. Licensing standards bodies take into account the attribution of information as it is reused in new contexts. However, these results are not well known in the Semantic Web community, nor are they necessarily expressed in terms that could facilitate their adoption. Moreover, it is unclear whether this existing body of work could address all the needs for provenance management in the Semantic Web without a better understanding of what those needs are.
A Working Definition of Provenance
Provenance is a very broad topic that has many meanings in different contexts. As a group, we developed a working definition of provenance on the web:
Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.
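As a rough illustration of this definition, the sketch below models a provenance record as a small data structure. It is only one possible reading of the definition, not a format proposed by the group: the class name `ProvenanceRecord`, its field names, the helper `provenance_depth`, and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record describing what produced or influenced a resource."""
    resource: str                                        # the resource being described
    entities: List[str] = field(default_factory=list)    # entities involved
    processes: List[str] = field(default_factory=list)   # processes involved
    # A provenance assertion is itself a record, so it may carry its own provenance.
    record_provenance: Optional["ProvenanceRecord"] = None


def provenance_depth(record: Optional[ProvenanceRecord]) -> int:
    """Count how many nested levels of provenance are attached to a record."""
    depth = 0
    while record is not None:
        depth += 1
        record = record.record_provenance
    return depth


# Example values are invented for illustration only.
report = ProvenanceRecord(
    resource="report.pdf",
    entities=["alice", "raw-data.csv"],
    processes=["statistical analysis", "publication"],
    record_provenance=ProvenanceRecord(
        resource="provenance of report.pdf",
        entities=["bob"],
        processes=["manual curation"],
    ),
)
levels = provenance_depth(report)
```

Here the outer record describes the resource, and the nested record describes who asserted that provenance and how, which reflects the last sentence of the definition.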
Provenance, Metadata, and Trust
Provenance is often conflated with metadata and trust. These terms are related, but they are not the same.
Provenance and Metadata
Metadata is used to represent properties of objects. Many of those properties have to do with provenance, so the two are often equated. How does metadata relate to provenance?
Descriptive metadata only becomes part of provenance when one also specifies its relationship to how an object was derived. For example, a file can have a metadata property that states its size, which is not considered provenance information since it does not relate to how the file was created. The same file can have metadata regarding its creation date, which would be considered provenance-relevant metadata. So even though a lot of metadata has to do with provenance, the two terms are not equivalent. In summary, provenance is often represented as metadata, but not all metadata is necessarily provenance.
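The file example can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the set `PROVENANCE_KEYS`, the function `split_metadata`, and the property names and values are all hypothetical, and deciding which properties count as provenance-relevant would in practice depend on the application.

```python
from typing import Any, Dict, Tuple

# Hypothetical property names assumed to relate to how an object came to be.
PROVENANCE_KEYS = {"created", "creator", "derived_from", "generated_by"}


def split_metadata(metadata: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate provenance-relevant properties from purely descriptive ones."""
    provenance = {k: v for k, v in metadata.items() if k in PROVENANCE_KEYS}
    descriptive = {k: v for k, v in metadata.items() if k not in PROVENANCE_KEYS}
    return provenance, descriptive


# Invented example: file size is descriptive, creation date and creator are not.
file_metadata = {"size_bytes": 2048, "created": "2010-06-01", "creator": "alice"}
provenance, descriptive = split_metadata(file_metadata)
```

In this sketch the creation date and creator end up in `provenance`, while the file size stays in `descriptive`, mirroring the distinction drawn above.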
Provenance and Trust
Trust is a term with many definitions and uses, but in many cases establishing trust in an object or an entity involves analyzing its origins and authenticity. How does trust relate to provenance?
Trust is often equated with provenance, and it is indeed related but it is not the same. Trust is derived from provenance information, and typically is a subjective judgment that depends on context and use. We focus on how to represent, manage, and use provenance information, but not on detailed approaches to how trust may be derived from provenance. In essence, provenance is a platform for trust algorithms and approaches on the web.
Authentication is often conflated with provenance because it leads to establishing trust. However, current authentication mechanisms, such as digital signatures and access control, address the verification of an identity or access to a resource. Provenance information may be used for authentication purposes; for example, the creator of a document may provide a signature that can be verified by a third party.
Other Definitions of Provenance
Provenance is too broad a term to have one universal definition. As with other terms such as "process", "accountability", "causality", or "identity", we can argue about its meaning forever (philosophers have indeed debated concepts such as identity and causality for thousands of years without converging). Our goal was to have a working definition that could reflect how the group views provenance.
There are many definitions in the literature that emphasize different views of provenance. Other views on provenance include: 1) Provenance as Process, 2) Provenance as a Directed Acyclic Graph, 3) Why-Provenance, 4) Where-Provenance, 5) How-Provenance, 6) Provenance as Annotations, 7) Event-oriented view. See Chapter 3 of the survey Foundations of Provenance on the Web for a compendium of different views of provenance. See also other surveys.
Other definitions discussed by the group include:
- Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact --Yolanda (from overview slides)
- Provenance is a description of how things came to be, and how they came to be in the state they are in today. Statements about provenance can themselves be considered to have provenance. --Jim M
- Conceptually, the provenance of a piece of data is the process that led to that piece of data. Concretely, provenance is represented by asserted documentation. --pgroth
- (attempting a synthesis) Provenance is documentation of the set of artifacts, processes, and agents that have caused an artifact to be, and of the contexts of these entities. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility, and assertions of provenance can themselves become important records with their own provenance. -- Jim M
- (suggested language to contextualize a general definition - Jim M) While the core definition is broadly applicable, the nature of the resources and which processes are of interest vary by domain and by use case. For example:
  - On the web, provenance would include information about the creation and publication of web resources as well as information about access of those resources, and activities related to their discussion, linking, and reuse.
  - In art history, provenance may include information about an artifact's creation (who created it, when, where, why, and how) as well as descriptive metadata that can be correlated with time (e.g. chemical composition that could bound when a work could have been created) or with context (e.g. analysis of brush strokes that could link a painting with other works of the same artist).
  - In scientific research, provenance may include the set of physical and computational processes applied to a sample that would allow repetition of an experiment as well as descriptive information about a sample (e.g. its chemical composition) and the experimental protocol that would allow reproduction of the work (i.e. with a different sample, different software implementing the same algorithms, etc.).
  - In business, provenance may include information about financial and legal processes (e.g. in contracts) as well as the electronic (e.g. online ordering) and physical (e.g. shipping) processes that have occurred.
To Learn More about Provenance
The group has a presentation on the motivation and activities of the Provenance Incubator Group. It explains the importance of provenance and includes use case scenarios and a brief summary of the state of the art.
The group has also created several reports on provenance requirements and the state of the art.
A number of other resources created by the group are also useful for learning more about provenance:
- a compilation of surveys of existing research on provenance already available in the literature
- an overview of relevant technologies and standards
- a series of presentations on current work on provenance
- a bibliography collection compiled by the group of prior work on provenance
- bibliography tags: the group's tags to annotate bibliography entries in that collection