Provenance and Web Architecture
From XG Provenance Wiki
This document discusses possibilities to integrate provenance into the Web architecture. For this discussion we assume the existence of a vocabulary to describe provenance.
Some background for this topic:
- A presentation discussing possible approaches to situate provenance on the Web architecture
- A discussion with the W3C Team on Provenance and Web architecture
Integrating Provenance in the HTTP Request/Response Access Pattern
This section discusses possibilities to expose provenance information about Web resources as part of the HTTP based message exchange with which these resources can be accessed.
According to the Web architecture, an HTTP server serves representations of Web resources. These resources might be static documents (i.e. files on the server), but they can also be dynamically generated documents. Each resource has a URL. When an HTTP client issues an HTTP GET request on that URL, the server responds with a representation of the resource referred to by the URL. This representation can be understood as a specific (negotiable) serialization that represents a specific state of the resource. Negotiation can be done on three dimensions: media type, encoding (i.e. charset), and language.
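As a minimal sketch of the three negotiation dimensions, the following builds (but does not send) a GET request whose headers state the client's preferences for media type, charset, and language. The URL and the preference values are assumptions chosen for illustration only.

```python
from urllib.request import Request


def build_negotiated_request(url: str) -> Request:
    """Build a GET request that negotiates media type, charset, and language."""
    return Request(
        url,
        headers={
            # Media type: prefer RDF/XML, accept HTML with a lower weight.
            "Accept": "application/rdf+xml, text/html;q=0.9",
            # Encoding (charset) of the representation.
            "Accept-Charset": "utf-8",
            # Natural language of the representation.
            "Accept-Language": "en, de;q=0.5",
        },
        method="GET",
    )


# Hypothetical resource URL; nothing is sent over the network here.
request = build_negotiated_request("http://example.org/resource")
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

The server is free to pick any representation that matches these preferences, which is why two clients requesting the same URL may receive different serializations of the same resource state.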
We aim to extend this access pattern in a way that enables clients to access provenance information about the retrieved representation and the represented Web resource.
Subject of Provenance Information
The provenance information could be about the Web resource. This option might be problematic because Web resources may change over time. Nonetheless, some provenance statements always hold, irrespective of the state of a Web resource. Note that the Open Provenance Model (OPM) in its current form cannot be used to represent this kind of provenance because OPM focuses on "immutable pieces of state."
The provenance information could be about the representation of the Web resource. Each representation served by an HTTP server may have a unique provenance; this holds in particular for representations that are created on the fly. However, representations of the same state of a Web resource have at least some provenance information in common.
The provenance information could be about the state of the Web resource. While this option ignores the creation of the representation that the HTTP server actually serves, it might be more feasible for passing provenance by reference (see below) because it may avoid establishing a provenance record for each representation served.
We distinguish three patterns to pass provenance via the HTTP based message exchange: by value, by reference, mixed.
Passing by Value
The idea of this provenance passing pattern is to add all (known) provenance information about the served representation directly to the HTTP response. This information could be embedded in the HTTP response header or in the representation itself as discussed later.
- Advantage: The provenance information is always in sync with the retrieved representation.
- Advantage: Once the provenance information has been added to the response, the server can forget it (i.e. there is no need to store it).
- Disadvantage: Provenance may be much bigger than the representation itself, causing a lot of overhead.
The integration of a mechanism for provenance negotiation (see below) may address the disadvantages of this pattern.
Passing by Reference
The idea of this provenance passing pattern is to understand the provenance record as another Web resource and to add the URI of the corresponding record to the HTTP response. This reference could also be embedded in the HTTP response header or in the representation itself.
Supporting this provenance passing pattern requires a server to mint a new URI for each provenance record. Furthermore, the look-up of these provenance records has to be enabled. In response to such a look-up, the server could either reconstruct the provenance information on the fly or access a provenance store to which it added provenance records generated at the time the original response was sent. Reconstructing provenance records might be problematic because the reconstructed record may give an account that differs from what actually happened.
- Advantage: Very small overhead in the original response.
- Disadvantage: Puts a burden on the server to maintain and keep provenance for all delivered representations (or for all states of the Web resources). This might be a big issue for resources that change frequently.
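To illustrate passing by reference in the response header, the sketch below shows how a client might extract the URI of a provenance record from a Link header. The relation name "provenance" is hypothetical (no such link type is assumed to be registered), the URIs are made up, and the parser is deliberately simplified: it does not handle commas or semicolons inside quoted parameter values.

```python
import re
from typing import Optional

# Hypothetical link relation; this document only proposes that one be registered.
PROVENANCE_REL = "provenance"

# Matches one link-value: <target> followed by zero or more ";name=value" parameters.
_LINK_VALUE = re.compile(r"<([^>]*)>\s*((?:;\s*[^;,]+)*)")


def find_provenance_uri(link_header: str) -> Optional[str]:
    """Return the target of the first link whose rel contains PROVENANCE_REL."""
    for match in _LINK_VALUE.finditer(link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            relations = value.strip().strip('"').lower().split()
            if name.strip().lower() == "rel" and PROVENANCE_REL in relations:
                return target
    return None


# Example header value as a server following this pattern might send it.
header = (
    '<http://example.org/style.css>; rel="stylesheet", '
    '<http://example.org/provenance/doc-42>; rel="provenance"'
)
print(find_provenance_uri(header))  # prints http://example.org/provenance/doc-42
```

The client would then dereference the returned URI with a second GET request, which is where the server-side cost of keeping or reconstructing the record arises.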
Passing Partially by Value and by Reference
The idea of this provenance passing pattern is to pass some provenance information by value while providing additional references. The references could be separate from the embedded provenance information and refer i) to the complete or ii) to a more detailed provenance record. Alternatively, they could be included in the embedded provenance information via URIs that identify common pieces of provenance (e.g. an agent, a common source artifact, etc.), assuming that a look-up of these URIs yields additional information.
As mentioned before, the provenance information or references could be embedded at the HTTP level or in the representation itself. Both of these provenance embedding patterns can be used for each of the provenance passing patterns discussed before.
Embedded at the HTTP Level
It might be possible to pass provenance (by value or by reference) in the header of an HTTP response. This would require the use of an appropriate header field. For large provenance records passed by value, this option might not be feasible due to a limit on header size. Furthermore, it is not clear how provenance, passed by value, can be provided via a header field.
Alternatively, provenance could be embedded at the HTTP level via a multipart MIME message.
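A rough sketch of the multipart option, using Python's standard email package to assemble a body with two parts: the representation and a provenance description passed by value. The choice of multipart/mixed, the Turtle serialization, and the use of dct:creator as a stand-in provenance statement are all assumptions for illustration; this document does not prescribe any of them.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_multipart_body(html: str, provenance_turtle: str) -> MIMEMultipart:
    """Bundle a representation and its provenance into one multipart message."""
    message = MIMEMultipart("mixed")
    # Part 1: the representation the client asked for.
    message.attach(MIMEText(html, "html", "utf-8"))
    # Part 2: provenance passed by value (serialization chosen for illustration).
    message.attach(MIMEText(provenance_turtle, "turtle", "utf-8"))
    return message


# Illustrative content; the resource and agent URIs are made up.
page = "<html><body><p>Hello</p></body></html>"
provenance = (
    "<http://example.org/doc> "
    "<http://purl.org/dc/terms/creator> "
    "<http://example.org/people/alice> .\n"
)

body = build_multipart_body(page, provenance)
print(body.get_content_type())
for part in body.get_payload():
    print(part.get_content_type())
```

A client that does not understand the second part can still use the first, but it has to be able to process multipart responses at all, which is one practical cost of this option.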
Embedded in the Representations
Instead of embedding provenance at the HTTP level, it can be embedded (by value or by reference) in the representation itself. This provenance embedding pattern is only possible for representations serialized using a media type with metadata capabilities. How exactly provenance is embedded in the representation depends on the media type.
For representations of RDF graphs, serialized in RDF/XML, Turtle, N3, etc., it is not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. what should be the subject of provenance statements). Once this problem is solved, provenance passed by value can be represented by additional RDF triples using a suitable provenance vocabulary. To pass provenance by reference an appropriate RDF property has to be established (dct:provenance might be an option).
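For the by-reference case, a minimal sketch of the extra statement a server might add to a served graph is shown below, written as one N-Triples line. The property dct:provenance is the candidate named above; the subject URI is an assumption, since the question of what the subject should be is left open here, and both URIs in the example are made up.

```python
# Dublin Core Terms property suggested above as a possible by-reference link.
DCT_PROVENANCE = "http://purl.org/dc/terms/provenance"


def provenance_reference_triple(subject_uri: str, record_uri: str) -> str:
    """Return one N-Triples statement linking a subject to its provenance record."""
    return f"<{subject_uri}> <{DCT_PROVENANCE}> <{record_uri}> ."


# Assumption: the retrieved document URI is used as the subject.
triple = provenance_reference_triple(
    "http://example.org/data/graph-1",
    "http://example.org/provenance/graph-1",
)
print(triple)
```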
For representations of Web pages, serialized in (X)HTML, provenance passed by reference could be embedded using the link element. This option requires the registration of a suitable link type (e.g. "provenance") that has to be used for the rel attribute. Passing provenance by value could be done using RDFa; however, as with the RDF graphs, it is also not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. with the actual HTML serialization that embeds the provenance description).
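The link element option can be sketched from the client side as follows: a small parser, built on Python's standard html.parser module, collects the href of every link element whose rel attribute contains "provenance". That relation name is hypothetical, as no such link type is assumed to be registered, and the page content is made up.

```python
from html.parser import HTMLParser
from typing import List


class ProvenanceLinkFinder(HTMLParser):
    """Collect href values of <link> elements with rel="provenance"."""

    def __init__(self) -> None:
        super().__init__()
        self.provenance_uris: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated link types.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "provenance" in rel_values and href:
            self.provenance_uris.append(href)


PAGE = """<html><head>
<title>Example</title>
<link rel="stylesheet" href="/style.css">
<link rel="provenance" href="http://example.org/provenance/doc-42">
</head><body><p>Hello</p></body></html>"""

finder = ProvenanceLinkFinder()
finder.feed(PAGE)
print(finder.provenance_uris)  # prints ['http://example.org/provenance/doc-42']
```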
Different clients / users may have different needs (e.g. provenance described at different levels of detail or no provenance at all). Hence, the aforementioned options do not have to be considered as exclusive, "one size has to fit all" approaches. Instead, it should be possible for clients to negotiate the kind of response they want to receive. This would require the introduction of another dimension of content negotiation. Such an extension of HTTP might be particularly relevant for provenance passed by value.
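Purely as a thought experiment, such a negotiation dimension could look like the sketch below, where a server inspects a request header to decide how to pass provenance. The header name Accept-Provenance and its three values are invented for this example and are not part of HTTP or of any proposal referenced here.

```python
from typing import Mapping

# Invented for illustration; no such HTTP header exists.
HYPOTHETICAL_HEADER = "Accept-Provenance"
PROVENANCE_MODES = ("none", "reference", "value")


def choose_provenance_mode(
    request_headers: Mapping[str, str], default: str = "reference"
) -> str:
    """Pick how to pass provenance, falling back to a default when unspecified."""
    wanted = HYPOTHETICAL_HEADER.lower()
    for name, value in request_headers.items():
        # Header field names are case-insensitive.
        if name.lower() == wanted:
            mode = value.strip().lower()
            return mode if mode in PROVENANCE_MODES else default
    return default


print(choose_provenance_mode({"accept-provenance": "value"}))  # prints value
print(choose_provenance_mode({"Accept": "text/html"}))  # prints reference
```

Defaulting to passing by reference keeps responses small for clients that express no preference, while clients that want the full record in the response can ask for it explicitly.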
Integrating Provenance in the SOAP based Request/Response Access Pattern
Integrating Provenance Services
- Problem: The client / user has a representation of a Web resource without any indication of provenance and it wants to obtain provenance information about the representation (or about the represented Web resource).
- Provenance services may help
- Provenance services could provide a REST based API where clients request provenance information via HTTP GET
- Provenance services could provide a SPARQL interface where clients can query for the relevant pieces of provenance only
- A provenance service could either be provided by the namespace owner of the URI that identifies the resource or it could be provided by a third party.
- Each provenance service provides an account of the provenance
- Multiple such services may provide multiple accounts
- Provenance may be large
- Using HTTP scheme based URIs that are grounded in the Domain Name System (DNS) may be an issue for representing and managing provenance in the long term because the owners of domains can change.
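The SPARQL option listed above can be sketched as a client building the GET URL for a query against a provenance service. The endpoint and resource URIs are hypothetical, and the query simply asks for every statement about the resource, since no provenance vocabulary is fixed here; passing the query text in a "query" parameter follows the SPARQL protocol.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sparql_request_url(endpoint: str, resource_uri: str) -> str:
    """Build a GET URL that asks a provenance service about one resource."""
    query = (
        f"SELECT ?property ?value WHERE {{ <{resource_uri}> ?property ?value }}"
    )
    # The SPARQL protocol carries the query text in the "query" parameter.
    return f"{endpoint}?{urlencode({'query': query})}"


# Hypothetical third-party provenance service and resource.
url = build_sparql_request_url(
    "http://provenance.example.net/sparql", "http://example.org/doc"
)
print(url)
# Decoding the URL again shows the query the service would receive.
print(parse_qs(urlsplit(url).query)["query"][0])
```

A more selective query would restrict ?property to terms of the chosen provenance vocabulary, so that only the relevant pieces of a potentially large provenance record are transferred.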