Provenance and Web Architecture
From Provenance WG Wiki
There are no standard mechanisms to publish and access provenance on the Web. It is not sufficient to have an agreement on a language to represent provenance, there must also be an agreement on how the provenance can be found for any given resource.
This page (part of the Access and Query Task Force) discusses possibilities to integrate provenance into the Web architecture. It focuses on how provenance information about Web resources could be exposed as part of the HTTP based message exchange with which these resources can be accessed. More detailed discussion can be found at the Provenance and Web Architecture report.
According to the Web architecture, an HTTP server serves representations of Web resources. These resources might be static documents (i.e. files on the server), but, they can also be dynamically generated documents. Each resources has a URL. When an HTTP client does an HTTP GET request on that URL, the server responds with a representation of the resource referred to by the URL. This representation can be understood as a specific (negotiable) serialization that represents a specific state of the resource. Negotiation can be done on three dimensions: media type, encoding (i.e. charset), and language.
We aim to extend this access pattern in a way that enables clients to access provenance information about the retrieved representation and the represented Web resource.
Subject of Provenance Information
The provenance information could be about the Web resource. This option might be problematic because Web resources may change over time. Nonetheless, some provenance statements always hold, irrespective of the state of a Web resource. Note, OPM in its current form cannot be used to represent this kind of provenance because OPM focuses on "immutable pieces of state."
The provenance information could be about the representation of the Web resource. Each representation served by an HTTP server may have a unique provenance; this holds in particular for representations that are created on the fly. However, representations of the same state of a Web resource have at least some provenance information in common.
The provenance information could be about the state of the Web resource. While this option ignores the creation of the representation that the HTTP server actually serves, it might be more feasible for passing provenance by reference (see below) because it may avoid establishing a provenance record for each representation served.
We distinguish three patterns to pass provenance via the HTTP based message exchange: by value, by reference, mixed.
Passing by Value
The idea of this provenance passing pattern is to add all (known) provenance information about the served representation directly to the HTTP response. This information could be embedded in the HTTP response header or in the representation itself as discussed later.
- The provenance information is always in sync with the retrieved representation.
- Once the provenance information has been added to the response it can be forgotten by the server (i.e. no need to store it).
- Provenance may be much bigger than the representation itself, causing a lot of overhead.
The integration of a mechanism for provenance negotiation (see below) may address the disadvantages of this pattern.
Passing by Reference
The idea of this provenance passing pattern is to understand the provenance record as another Web resource and to add the URI of the corresponding record to the HTTP response. This reference could also be embedded in the HTTP response header or in the representation itself.
Supporting this provenance passing pattern requires a server to mint a new URI for each provenance record. Furthermore, the look-up of these provenance records has to be enabled. In response to such a look-up the server could either reconstruct the provenance information on the fly or it could access a provenance store to which it added provenance records, generated at the time when the original response has been sent. Reconstructing the provenance records might be problematic because the reconstructed record may provide a different account than what actually happened, and may vary depending on the reconstruction methods used.
- Very small overhead in the original response.
- Puts a burden on the server to maintain and keep provenance for all delivered representations (or for all states of the Web resources). This might be a big issue for resources that change frequently.
Passing Partially by Value and by Reference
The idea of this provenance passing pattern is to pass some provenance information by value while providing additional references. The references could be separate from the embedded provenance information and refer i) to the complete or ii) to a more detailed provenance record. Alternatively, it could be included in the embedded provenance information via URIs that identify common pieces of provenance (e.g. an agent, a common source artifact, etc), assuming that a look-up of these URIs yields additional information.
As mentioned before, the provenance information or references could be embedded at the HTTP level or in the representation itself. Both of these provenance embedding patterns can be used for each of the provenance passing patterns discussed above.
Embedded at the HTTP Level
It might be possible to pass provenance (by value or by reference) in the header of an HTTP response. This would require the use of an appropriate header field. For large provenance records passed by value, this option might not be feasible due to the limit on header size. Furthermore, it is not clear how provenance, passed by value, can be provided via a header field.
Alternatively, provenance could be embedded at the HTTP level via a multipart MIME message.
Embedded in the Representations
Instead of embedding provenance at the HTTP level, it can be embedded (by value or by reference) in the representations itself. This provenance embedding pattern is only possible for representations serialized using a media type with metadata capabilities. The actual approach as to how provenance is embedded in the representation would depend on the media type.
For representations of RDF graphs, serialized in RDF/XML, Turtle, N3, etc., it is not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. what should be the subject of provenance statements). Once this problem is solved, provenance passed by value can be represented by additional RDF triples using a suitable provenance vocabulary. To pass provenance by reference an appropriate RDF property has to be established (dct:provenance might be an option).
For representations of Web pages, serialized in (X)HTML, provenance passed by reference could be embedded using the link element. This option requires the registration of a suitable link type (e.g. "provenance") that has to be used for the rel attribute. Passing provenance by value could be done using RDFa; however, as with the RDF graphs, it is also not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. with the actual HTML serialization that embeds the provenance description).
Different clients / users may have different needs (e.g. provenance described at different levels of detail or no provenance at all). Hence, the aforementioned options do not have to be considered as exclusive, "one size has to fit all" approaches. Instead, it should be possible for clients to negotiate the kind of response they want to receive. This would require the introduction of another dimension of content negotiation. Such an extension of HTTP might be particularly relevant for provenance passed by value.