Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is part of a set of specifications produced by the W3C provenance working group aiming to define interoperable interchange of provenance information in heterogeneous environments such as the Web. It describes the use of existing web mechanisms for discovery and retrieval of provenance information. This document was published by the Provenance Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to public-prov-comments@w3.org (subscribe, archives). All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The Provenance Data Model [PROV-DM] and Provenance Ontology [PROV-O] specifications define how to represent provenance information in the World Wide Web.
This note describes how existing web mechanisms may be used to locate, retrieve and query provenance information.
In defining the specification below, we make use of the following concepts.
Fundamentally, provenance information is about resources. In general, resources may vary over time and context. E.g., a resource describing the weather in London changes from day-to-day, or one listing restaurants near you will vary depending on your location. Provenance information, to be useful, must be persistent and not itself dependent on context. Yet we may still want to make provenance assertions about dynamic or context-dependent web resources (e.g. the weather forecast for London on a particular day may have been derived from a particular set of Meteorological Office data).
Provenance descriptions of dynamic and context-dependent resources are possible through the notion of entities. An entity is simply a web resource that is a contextualized view or instance of an original web resource. For example, a W3C specification typically undergoes several public revisions before it is finalized. A URI that refers to the "current" revision might be thought of as denoting the specification through its lifetime. Separate URIs for each individual revision would then be entity-URIs, denoting the specification at a particular stage in its development. Using these, we can make provenance assertions that a particular revision was published on a particular date, and was last modified by a particular editor. Entity-URIs may use any URI scheme, and are not required to be dereferenceable.
Requests for provenance about a resource may return provenance information that uses one or more entity-URIs to refer to versions of that resource. Some given provenance information may use multiple entity-URIs if there are assertions referring to the same underlying resource in different contexts. For example, provenance information describing a W3C document might include information about all revisions of the document using statements that use the different entity-URIs of the various revisions.
In summary, a key notion within the concepts outlined above is that provenance information may not be universally applicable to a resource, but may be expressed with respect to that resource in a restricted context (e.g. at a particular time). This restricted view is called an entity, and an entity-URI is used to refer to it within provenance information.
Provenance information describes relationships between entities, activities and agents. As such, any given provenance information may contain information about several entities. Within some provenance information, the entities thus described are identified by their Entity-URIs.
When interpreting provenance information, it is important to be aware that statements about several entities may be present, and to be accordingly selective when using the information provided. (In some exceptional cases, it may be that the provenance information returned does not contain any information relating to a specific associated entity.)
Web applications may access provenance information in the same way as any web resource, by dereferencing its URI. Typically, this will be by performing an HTTP GET operation. Thus, any provenance information may be associated with a provenance-URI, and may be accessed by dereferencing that URI using normal web mechanisms.
Provenance assertions are about pre-determined activities involving entities; as such, they are not dynamic. Thus, provenance information returned at a given provenance-URI may commonly be static. But the availability of provenance information about a resource may vary (e.g. if there is insufficient storage to keep it indefinitely, or new information becomes available at a later date), so the provenance information returned at a given URI may change, provided that such change does not contradict any previously retrieved information.
How much or how little provenance information is returned in response to a retrieval request is a matter for the provenance provider application. At a minimum, for as long as provenance information about an entity remains available, sufficient information should be returned to enable a client application to walk the provenance graph per section 6. Incremental Provenance Retrieval.
When publishing provenance as a web resource, the provenance-URI should be discoverable using one or more of the mechanisms described in section 3. Locating provenance information.
If there is no URI for some particular provenance information, then alternative mechanisms may be needed. Possible mechanisms are suggested in section 4. Provenance services and section 5. Querying provenance information.
When provenance information is a resource that can be accessed using normal web retrieval, one needs to know a provenance-URI to dereference. If this is known in advance, there is nothing more to specify. If a provenance-URI is not known then a mechanism to discover one must be based on information that is available to the would-be accessor.
Provenance information may be provided by several parties other than the provider of the original resource, each using different provenance-URIs, and each with different concerns. (It is possible that these different parties may provide contradictory provenance information.)
Once provenance information is retrieved, one also needs to know how to locate the view of that resource within that provenance information. This view is an entity and is identified by an entity-URI.
We start by considering mechanisms for the resource provider to indicate a provenance-URI along with an entity-URI. (Mechanisms that can be independent of the resource provision are discussed in section 4. Provenance services). Three mechanisms are described here:
The mechanisms specified for use with HTTP and HTML are similar to those proposed by POWDER [POWDER-DR] (sections 4.1.1 and 4.1.3).
For a document accessible using HTTP, provenance information may be indicated using an HTTP Link header field, as defined by Web Linking (RFC 5988) [LINK-REL]. The Link header field is included in the HTTP response to a GET or HEAD operation (other HTTP operations are not excluded, but are not considered here).
A provenance link relation type for referencing provenance information is registered according to the template in section 7. IANA considerations, and may be used as shown:
Link: <provenance-URI>; rel="provenance"; anchor="entity-URI"
When used in conjunction with an HTTP success response code (2xx), this HTTP header field indicates that provenance-URI is the URI of some provenance information associated with the requested resource and that the associated entity is identified as entity-URI. (See also section 1.3 Interpreting provenance information.)
If no anchor parameter is provided then the entity-URI is assumed to be the URI of the resource.
At this time, the meaning of these links returned with other HTTP response codes is not defined: future revisions of this specification may define interpretations for these.
An HTTP response may include multiple provenance link header fields, indicating a number of different provenance resources that are known to the responding server, each providing provenance information about the accessed resource.
The presence of a provenance link in an HTTP response does not preclude the possibility that other publishers may offer provenance information about the same resource. In such cases, discovery of the additional provenance information must use other means (e.g. see section 4. Provenance services).
Provenance resources indicated in this way are not guaranteed to be authoritative. Trust in the linked provenance data must be determined separately from trust in the original resource, just as in the web at large, it is a user's responsibility to determine an appropriate level of trust in any other linked resource; e.g. based on the domain that serves it, or an associated digital signature. (See also section 8. Security considerations.)
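As an illustration of the header-based mechanism above, the following Python sketch extracts provenance-URI and entity-URI pairs from a Link header value. It assumes the RFC 5988 form in which each link target is enclosed in angle brackets and each parameter value is quoted; the function name `parse_provenance_links` is hypothetical, and a production client would want a complete RFC 5988 parser rather than these regular expressions.

```python
import re


def parse_provenance_links(link_header, resource_uri):
    """Return (provenance_uri, entity_uri) pairs found in one Link header value.

    Simplified sketch: assumes quoted parameter values and that no quoted
    value contains the sequence ', <'.
    """
    results = []
    # A header value may carry several link-values separated by commas;
    # split only on commas that are followed by the next "<target>".
    for link_value in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", link_value)
        if not match:
            continue
        target, params_text = match.groups()
        params = {
            name.lower(): value
            for name, value in re.findall(
                r';\s*([\w-]+)\s*=\s*"([^"]*)"', params_text
            )
        }
        # rel may hold several space-separated relation types.
        if "provenance" in params.get("rel", "").split():
            # With no anchor, the entity-URI defaults to the resource URI.
            results.append((target, params.get("anchor", resource_uri)))
    return results
```

The second argument supplies the URI of the requested resource, which is used as the entity-URI whenever a link carries no anchor.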
This is a new proposal. It needs to be checked as to whether it is useful. GK/PG to review nature of provenance-service-URI.
The document provider may indicate that provenance information about the document is provided by a provenance service. This is done through the use of a provenance-service link relation type following the same pattern as above:
Link: <provenance-service-URI>; anchor="entity-URI"; rel="provenance-service"
The provenance-service link identifies the service-URI. Dereferencing this URI yields a service description that provides further information to enable a client to determine a provenance-URI or retrieve provenance information for an entity; see section 4. Provenance services for more details.
There may be multiple provenance-service link header fields, and these may appear in the same document as provenance links (though, in simple cases, we anticipate that provenance and provenance-service link relations will not be used together).
For a document presented as HTML or XHTML, without regard for how it has been obtained, provenance information may be associated with a resource by adding a <link> element to the HTML <head> section.
Two new link relation types for referencing provenance information are registered according to the template in section 7. IANA considerations, and may be used as shown:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <link rel="provenance" href="provenance-URI">
    <link rel="anchor" href="entity-URI">
    <title>Welcome to example.com</title>
  </head>
  <body>
    ...
  </body>
</html>
The provenance-URI given by the provenance link element identifies the provenance-URI for the document.
The entity-URI given by the anchor link element specifies an identifier for the presented document view, which may be used within the provenance information when referring to this document.
An HTML document header may include multiple "provenance" link elements, indicating a number of different provenance resources that are known to the creator of the document, each of which may provide provenance information about the document.
Likewise, the header may include multiple "anchor" link elements indicating that, e.g., different revisions of the document can be identified in the provenance information using the different entity-URIs.
If no "anchor" link element is provided then the entity-URI is assumed to be the URI of the document. It is recommended that this convention be used only when the document is static.
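The link elements described above can be read with a few lines of Python. This is a minimal sketch using the standard library `html.parser` module; the class and function names are hypothetical, and for brevity it does not check that the link elements appear inside the head section.

```python
from html.parser import HTMLParser


class ProvenanceLinkParser(HTMLParser):
    """Collects href values of "provenance" and "anchor" link elements."""

    def __init__(self):
        super().__init__()
        self.provenance_uris = []
        self.entity_uris = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel may hold several space-separated relation types.
        rels = (attributes.get("rel") or "").lower().split()
        if "provenance" in rels:
            self.provenance_uris.append(href)
        if "anchor" in rels:
            self.entity_uris.append(href)


def find_provenance_links(html_text, document_uri):
    """Return (provenance_uris, entity_uris) for an HTML document."""
    parser = ProvenanceLinkParser()
    parser.feed(html_text)
    # With no "anchor" link element, the entity-URI is the document URI.
    entity_uris = parser.entity_uris or [document_uri]
    return parser.provenance_uris, entity_uris
```

Both lists may contain several entries, reflecting the fact that a document header may carry multiple "provenance" and multiple "anchor" link elements.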
The document creator may specify that the provenance information about the document is provided by a provenance service. This is done through the use of a third link relation type following the same pattern as above:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <link rel="provenance-service" href="service-URI">
    <link rel="anchor" href="entity-URI">
    <title>Welcome to example.com</title>
  </head>
  <body>
    ...
  </body>
</html>
The provenance-service link element identifies the service-URI. Dereferencing this URI yields a service description that provides further information to enable a client to access provenance information for an entity; see section 4. Provenance services for more details.
There may be multiple provenance-service link elements, and these may appear in the same document as anchor and provenance link elements (though, in simple cases, we anticipate that provenance and provenance-service link relations would not be used together).
If a resource is represented as RDF (in any of its recognized syntaxes, including RDFa), it may contain references to its own provenance using additional RDF statements.
For this purpose a new RDF property, prov:hasProvenance, is defined as a relation between two resources, where the object of the property is a resource that provides provenance information about the subject resource. Multiple prov:hasProvenance assertions may be made about a subject resource.
Another new RDF property, prov:hasAnchor, is defined to allow the RDF content to specify one or more entity-URIs of the RDF document for the purpose of provenance information (similar to the use of the "anchor" link relation in HTML).
@@TODO: document namespace. Check naming style. Use provenance model namespace? Define as part of model?
@@TODO: example, when vocabulary issues are settled.
We have so far decided not to try and define a common mechanism for arbitrary data, because it's not clear to us what the correct choice would be. Is this a reasonable position, or is there a real need for a generic solution for provenance discovery for arbitrary, non-web-accessible data objects?
If a resource is represented using a data format other than HTML or RDF, and no URI for the resource is known, provenance discovery becomes trickier to achieve. This specification does not define a specific mechanism for such arbitrary resources, but this section discusses some of the options that might be considered.
For formats which have provision for including metadata within the file (e.g. JPEG images, PDF documents, etc.), use the format-specific metadata to include an entity-URI, provenance-URI and/or service-URI. Format-specific metadata provision might also be used to include provenance information directly in the resource.
Use a generic packaging format that can combine an arbitrary data file with a separate metadata file in a known format, such as RDF. At this time, it is not clear what format that should be, but some possible candidates are:
Fix references in above text.
This section describes a REST API [REST-APIs] for a provenance service with facilities for discovery and/or retrieval of provenance information, which can be implemented independently of the original resource delivery channels (e.g. by a third party service).
All service implementations must respond with a service description (section 4.2.1 Service description) when the service URI is dereferenced. Service implementations may provide either discovery, retrieval or both of these services, indicated by presence of the corresponding service URI templates in the service description. Which of these services to provide is a choice for individual service implementations.
On the Web, the normal mechanism for retrieving information is to associate it with a URI, and dereference the URI using normal retrieval mechanisms. This approach is enabled using the provenance discovery service mechanism: given the URI of some resource for which provenance information is required, the service returns one or more URIs from which provenance information may be obtained. This approach may be preferred when the provenance service cannot specify the form of URIs used for identifying provenance information, or when there may be more than one source of provenance information known to the provenance service.
The provenance retrieval service returns provenance information directly. This mechanism may be preferred when the provenance information is not already presented directly to the web, or is stored in a database with a complex query protocol, or when the provenance service can control the URI from which provenance information is served and avoid the intermediate step of URI discovery.
This section describes general procedures for using the provenance service API. Later sections describe the resources presented by the API, and their representation using JSON. Section B. Provenance service format examples gives examples of alternative representations. Normal HTTP content negotiation mechanisms may be used to retrieve representations using formats convenient for the client application.
To use the provenance service to retrieve a list of provenance-URIs for a resource, starting with the service URI (service-URI) and the URI of the resource or entity (entity-URI):
1. Dereference service-URI to obtain a representation of the service description.
2. Use entity-URI for template variable uri in the provenance locations template to form provenance-locations-URI.
3. Dereference provenance-locations-URI to obtain a provenance locations resource in one of the formats described below.
Any or all of the URIs in the returned provenance locations may be used to retrieve provenance information, per section 2. Accessing provenance information.
To use the provenance service to directly retrieve provenance information for a resource, starting with the service URI (service-URI) and the URI of the resource or entity (entity-URI):
1. Dereference service-URI to obtain a representation of the service description.
2. Use entity-URI for template variable uri in the provenance content template to form provenance-URI.
3. Dereference provenance-URI to obtain provenance information.
A provenance service description describes the provenance discovery and retrieval service and, in particular, provides URI templates [URI-template] for URIs to access provenance locations resources and/or provenance information. Dereferencing the service URI returns a representation of this service description. The service description may contain additional metadata about the service beyond that described here: API clients are expected to ignore any metadata elements they do not understand.
This example shows a provenance service description using JSON format [RFC4627], which is presented as MIME content-type application/json.
Other examples may be seen in section B. Provenance service format examples.
{
  "provenance_service_uri": "http://example.org/provenance_service/",
  "provenance_locations_template": "http://example.org/provenance_service/locations/?uri={uri}",
  "provenance_content_template": "http://example.org/provenance_service/provenance/?uri={uri}"
}
Is there any point in including the provenance service URI here? It has been included for consistency with RDF representations, but is functionally redundant.
A provenance locations resource enumerates one or more provenance-URIs identifying provenance information associated with a given resource.
The examples below and in section B. Provenance service format examples are for a given resource URI http://example.org/qdata/. Using the service description example above, the URI of its provenance locations resource would be http://example.org/provenance_service/locations/?uri=http%3A%2F%2Fexample.org%2Fqdata%2F.
This example uses JSON format [RFC4627], presented as MIME content type application/json.
Other examples may be seen in section B. Provenance service format examples.
{
  "uri": "http://example.org/qdata/",
  "provenance": [
    "http://source1.example.org/provenance/qdata/",
    "http://source2.example.org/prov/qdata/",
    "http://source3.example.com/prov?id=qdata"
  ]
}
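The discovery procedure described in this section might be sketched in Python as follows. The JSON member names are those used in the examples above; the function name and the `fetch_json` parameter are hypothetical. `fetch_json` stands for any caller-supplied function that dereferences a URI and returns the decoded JSON document, so the sketch itself performs no network access.

```python
from urllib.parse import quote


def discover_provenance_uris(service_uri, entity_uri, fetch_json):
    """Return the provenance-URIs that a provenance service lists for entity_uri.

    fetch_json is a caller-supplied (hypothetical) function that dereferences
    a URI and returns the decoded JSON document as a dict.
    """
    # Dereference the service URI to obtain the service description.
    description = fetch_json(service_uri)
    template = description.get("provenance_locations_template")
    if template is None:
        # The service does not offer the discovery facility.
        return []
    # Fill in the template variable, %-escaping reserved characters.
    locations_uri = template.replace("{uri}", quote(entity_uri, safe=""))
    # Dereference the provenance locations resource.
    locations = fetch_json(locations_uri)
    return list(locations.get("provenance", []))
```

Any metadata elements in the service description other than the locations template are simply ignored, in line with the expectation that clients skip elements they do not understand.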
The template might use ?uri={+uri} rather than just ?uri={uri}, and thereby avoid %-escaping the : and / characters in the given URI, but this could cause difficulties for URIs containing query parameters and/or fragment identifiers. In this case, the client application would need to ensure that any such characters were %-escaped before being passed into a URI-template expansion processor.
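The difference between the two template forms can be seen in a small Python sketch. It handles only the single variable `uri` in its `{uri}` and `{+uri}` forms and approximates the URI template expansion rules with `urllib.parse.quote`; the function name is hypothetical, and a full URI template processor would cover many more cases.

```python
from urllib.parse import quote

# Characters that reserved expansion ({+uri}) leaves unescaped; "%" is
# included so that existing %-escapes pass through (a simplification).
RESERVED = ":/?#[]@!$&'()*+,;="


def expand_uri_variable(template, uri):
    """Expand {uri} or {+uri} in a template (single-variable sketch only)."""
    if "{+uri}" in template:
        # Reserved expansion: reserved characters are passed through as-is.
        return template.replace("{+uri}", quote(uri, safe=RESERVED + "%"))
    # Simple expansion: everything except unreserved characters is escaped.
    return template.replace("{uri}", quote(uri, safe=""))
```

For the entity URI http://example.org/q?a=1&b=2#frag, the `{+uri}` form leaves `?`, `&` and `#` untouched, so the entity's own query and fragment would be read as part of the service request URI. The `{uri}` form escapes them and avoids that ambiguity.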
Provenance information about a resource or resources may be returned in any format. It is recommended that the format be one defined by the Provenance Model specification [PROV-DM].
Assuming a given resource URI http://example.org/qdata/, and using the service description example above, the provenance URI would be http://example.org/provenance_service/provenance/?uri=http%3A%2F%2Fexample.org%2Fqdata%2F.
This specification does not define any specific mechanism for discovering provenance services. Applications may use any appropriate mechanism, including but not limited to: prior configuration, search engines, service registries, etc.
Simply identifying and retrieving provenance information as a web resource may not always meet the requirements of a particular application or service, e.g.:
A provenance query service provides an alternative way to access provenance information and/or provenance-URIs. An application will need a provenance query service URI, and some relevant information about the entity whose provenance is to be accessed.
The details of a provenance query service are an implementation choice, but for interoperability between different providers and users we recommend use of SPARQL [RDF-SPARQL-PROTOCOL] [RDF-SPARQL-QUERY]. The query service URI would then be the URI of a SPARQL endpoint (or, to use the SPARQL specification language, a SPARQL protocol service). The following subsections provide examples for what are considered to be some plausible common scenarios for using SPARQL, and are not intended to cover all possibilities.
If the requester has an entity-URI, a simple SPARQL query may be used to return the corresponding provenance-URI. E.g., if the original resource has an entity-URI http://example.org/resource, the query might look like this:
PREFIX prov: <@@TBD>
SELECT ?provenance_uri WHERE
{
<http://example.org/resource> prov:hasProvenance ?provenance_uri
}
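A client might assemble and submit such a query as sketched below in Python. The sketch only builds the request URI for a SPARQL protocol query sent with HTTP GET, where the query text travels in the `query` parameter. The function names are hypothetical, and because the provenance namespace is not yet settled it is passed in as a parameter rather than fixed here.

```python
from urllib.parse import urlencode


def build_provenance_query(entity_uri, prov_namespace):
    """Return SPARQL text asking for the provenance-URIs of entity_uri."""
    # Reject characters that cannot appear inside a SPARQL IRI reference,
    # so the entity URI cannot break out of the <...> delimiters.
    if any(ch in entity_uri for ch in '<>"{}|^`\\') or any(
        ch.isspace() for ch in entity_uri
    ):
        raise ValueError("entity_uri is not usable as a SPARQL IRI reference")
    return (
        f"PREFIX prov: <{prov_namespace}>\n"
        "SELECT ?provenance_uri WHERE {\n"
        f"  <{entity_uri}> prov:hasProvenance ?provenance_uri\n"
        "}"
    )


def build_query_url(endpoint_uri, query):
    """Return the GET request URI for a query against a SPARQL endpoint."""
    separator = "&" if "?" in endpoint_uri else "?"
    return endpoint_uri + separator + urlencode({"query": query})
```

The resulting URI can be dereferenced with any HTTP client; the endpoint URI itself has to be known to the application in advance, as noted above.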
@@TODO: specific provenance namespace and property to be determined by the model or ontology specification?
If the requester has identifying information that is not the URI of the original resource, then they will need to construct a more elaborate query to locate an entity description and obtain its provenance-URI(s). The nature of identifying information that can be used in this way will depend upon the third party service used, further definition of which is out of scope for this specification. For example, a query for a document identified by a DOI, say 1234.5678, using the PRISM vocabulary [PRISM] recommended by FaBio [FABIO], might look like this:
PREFIX prov: <@@TBD>
PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/>
SELECT ?provenance_uri WHERE
{
  [ prism:doi "1234.5678" ] prov:hasProvenance ?provenance_uri
}
@@TODO: specific provenance namespace and property to be determined by the model specification?
This scenario retrieves provenance information directly given the URI of a resource or entity, and may be useful where the provenance information has not been assigned a specific URI, or when the calling application is interested only in specific elements of provenance information.
If the original resource has an entity-URI http://example.org/resource, a SPARQL query for provenance information might look like this:
PREFIX prov: <@@TBD>
CONSTRUCT
{
  <http://example.org/resource> ?p ?v
}
WHERE
{
  <http://example.org/resource> ?p ?v
}
This query essentially extracts all properties and values available from the query service that are directly about the specified entity, and returns them as an RDF graph. This may be fine if the service contains only provenance information about the indicated resource, or if the non-provenance information is also of interest. A more complex query using specific provenance vocabulary terms may be needed to selectively retrieve just provenance information when other kinds of information are also available.
@@TODO: specific provenance namespace and property to be determined by the model specification? The above query pattern assumes provenance information is included in direct properties about the entity. When an RDF provenance vocabulary is fully formulated, this may well turn out to not be the case. A better example would be one that retrieves specific provenance information when the vocabulary terms have been defined.
Provenance information may be large. While this specification does not define how to implement scalable provenance systems, it does allow for publishers to make available provenance in an incremental fashion. We now discuss two possibilities for incremental provenance retrieval.
Publishers are not required to publish all the provenance information associated with a given entity at a particular provenance-URI. The amount of provenance information exposed is application dependent. However, it is possible to incrementally retrieve (i.e. walk the provenance graph) by progressively looking up provenance information using HTTP. The pattern is as follows:
1. For a given entity (entity-uri-1), retrieve its associated provenance-uri-1 using the HTTP Link header (section 3.1 Resource accessed by HTTP).
2. Dereference provenance-uri-1.
3. On encountering a reference to another entity (entity-uri-2) with no provided provenance information, find its provenance-URI and continue from Step 1. (Note: an HTTP HEAD operation may be used to obtain the Link headers without retrieving the entity content.)
To reduce the overhead of multiple HTTP requests, a provenance information publisher may link entities to their associated provenance information using the prov:hasProvenance predicate. Thus, the same pattern above applies, except instead of having to retrieve a new Link header field, one can immediately dereference the entity's associated provenance.
The same approach can be adopted when using the provenance service API (section 4. Provenance services). However, instead of performing an HTTP HEAD or GET against a resource one queries the provenance service using the given entity-uri.
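The incremental pattern can be sketched in Python as a simple graph walk. Both callables are hypothetical and supplied by the caller: `find_provenance_uri` stands for whichever lookup is in use (a Link header from a HEAD request, a prov:hasProvenance statement, or a provenance service), and `fetch_provenance` dereferences a provenance-URI and reports the entities mentioned in the result. The `max_entities` limit is an added safeguard, not something this specification defines.

```python
def walk_provenance(start_entity_uri, find_provenance_uri, fetch_provenance,
                    max_entities=100):
    """Collect provenance statements reachable from start_entity_uri.

    find_provenance_uri(entity_uri) returns a provenance-URI or None.
    fetch_provenance(provenance_uri) returns a pair
    (statements, referenced_entity_uris).
    """
    pending = [start_entity_uri]
    visited = set()
    collected = []
    while pending and len(visited) < max_entities:
        entity_uri = pending.pop()
        if entity_uri in visited:
            continue
        visited.add(entity_uri)
        # Locate the provenance-URI for this entity.
        provenance_uri = find_provenance_uri(entity_uri)
        if provenance_uri is None:
            continue  # no provenance known for this entity
        # Dereference it and keep what was returned.
        statements, referenced = fetch_provenance(provenance_uri)
        collected.extend(statements)
        # Queue any referenced entities not yet examined.
        pending.extend(uri for uri in referenced if uri not in visited)
    return collected
```

Tracking visited entities keeps the walk from looping when provenance information refers back to an entity that has already been examined.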
Provenance information may be made available using a SPARQL endpoint (section 5. Querying provenance information) [RDF-SPARQL-PROTOCOL] [RDF-SPARQL-QUERY]. Using SPARQL queries, provenance can be selectively retrieved using combinations of filters and/or path queries.
This document requests registration of new link relations, per section 6.2.1 of RFC 5988.
@@TODO The following templates should be completed and submitted to link-relations@ietf.org:
provenance
The name "anchor" has been used for the link relation name, despite the corresponding URI being described as an entity-URI. This terminology has been chosen to align with usage in the description of the HTTP Link header field, per RFC 5988.
anchor
provenance-service
Provenance is central to establishing trust in data. If provenance information is corrupted, it may lead agents (human or software) to draw inappropriate and possibly harmful conclusions. Therefore, care is needed to ensure that the integrity of provenance data is maintained.
When using HTTP to access provenance information, or to determine a provenance URI, secure HTTP (https) should be used.
When retrieving a provenance URI from a document, steps should be taken to ensure the document itself is an accurate copy of the original whose author is being trusted (e.g. signature checking, or verifying its checksum against an author-provided secure web service).
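As one illustration of the checksum option, the Python sketch below compares a retrieved document against a digest published by its author. The choice of SHA-256 and the function name are assumptions made for the example; this specification does not prescribe a digest algorithm or a way of obtaining the published value.

```python
import hashlib
import hmac


def matches_published_digest(document_bytes, published_sha256_hex):
    """Return True if the document's SHA-256 digest equals the published one."""
    actual = hashlib.sha256(document_bytes).hexdigest()
    expected = published_sha256_hex.strip().lower()
    # compare_digest avoids leaking the position of the first mismatch.
    return hmac.compare_digest(actual, expected)
```

The check is only as trustworthy as the channel over which the published digest was obtained, which is why the text above calls for a secure service.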
@@TODO ... privacy, access control to provenance (note to self: discussed in Edinburgh linked data provenance workshop). In particular, note that the fact that a resource is openly accessible does not mean that its provenance information should also be.
The editors acknowledge the contribution and review from members of the provenance working group.
Many thanks to Robin Berjon for making our lives so much easier with his cool ReSpec tool.
In section 4. Provenance services, the provenance service description was represented as a JSON-formatted document. As noted, HTTP content negotiation may be enabled to retrieve the document in alternative formats. This appendix provides examples of service description documents represented using the RDF Turtle and RDF/XML syntaxes, and plain XML.
This example uses the RDF Turtle format [TURTLE], presented as MIME content-type text/turtle.
@prefix provds: <@@TBD@@#> .
<http://example.org/provenance_service/> a provds:Service_description ;
    provds:provenance_locations_template "http://example.org/provenance_service/locations/?uri={uri}" ;
    provds:provenance_content_template "http://example.org/provenance_service/provenance/?uri={uri}" .
The provenance URI templates are encoded in RDF as plain string literals, not as resource URIs.
Finalize URIs in the above example.
This is essentially the same as the Turtle example above, but encoded in RDF/XML [RDF-SYNTAX-GRAMMAR], and presented as MIME content-type application/rdf+xml.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:provds="@@TBD@@#">
  <provds:Service_description rdf:about="http://example.org/provenance_service/">
    <provds:provenance_locations_template>http://example.org/provenance_service/locations/?uri={uri}</provds:provenance_locations_template>
    <provds:provenance_content_template>http://example.org/provenance_service/provenance/?uri={uri}</provds:provenance_content_template>
  </provds:Service_description>
</rdf:RDF>
Finalize URIs in the above example.
@@TODO: provide example and schema
This example uses the RDF Turtle format [TURTLE], presented as MIME content type text/turtle.
@prefix prov: <@@TBD@@#> .
<http://example.org/qdata/> a prov:Entity ;
    prov:hasProvenance <http://source1.example.org/provenance/qdata/> ;
    prov:hasProvenance <http://source2.example.org/prov/qdata/> ;
    prov:hasProvenance <http://source3.example.com/prov?id=qdata> .
NOTE: The namespace URI used here for the provenance properties is different from that used in the service description. I am anticipating that it will be defined as part of the provenance model. If it is not defined as part of the provenance model, then a property name should be allocated in the provenance discovery service namespace.
@@TODO: revise to conform with Provenance Model vocabulary; review URIs
This is essentially the same as the Turtle example above, but encoded in RDF/XML [RDF-SYNTAX-GRAMMAR], and presented with MIME content type application/rdf+xml.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:prov="@@TBD@@#">
  <prov:Entity rdf:about="http://example.org/qdata/">
    <prov:hasProvenance rdf:resource="http://source1.example.org/provenance/qdata/" />
    <prov:hasProvenance rdf:resource="http://source2.example.org/prov/qdata/" />
    <prov:hasProvenance rdf:resource="http://source3.example.com/prov?id=qdata" />
  </prov:Entity>
</rdf:RDF>
@@TODO: revise to conform with Provenance Model vocabulary
@@TODO: provide example and schema
No normative references.