Difference between revisions of "PROV-FAQ"

From Semantic Web Standards
Jump to: navigation, search
(Why doesn't PROV use FOAF?)
Line 81: Line 81:
  ex:flour2 prov:specializationOf ex:flour
  ex:flour2 prov:specializationOf ex:flour
=== Access and query - arbitrary data ===
=== Access and Query ===
==== Access and Query - arbitrary data ====
@@TODO: text lifted from what was section 3.4 of PROV-AQ
@@TODO: text lifted from what was section 3.4 of PROV-AQ
Line 96: Line 98:
=== Access and Query - alternatives ===
==== Access and Query - alternatives ====

Revision as of 12:09, 18 April 2013

Here are some answers to frequently asked questions about PROV and using it.

Relationship to other vocabularies

What is the relationship between Dublin Core and PROV?

The two specifications are complimentary. Dublin Core contains many provenance related terms many of which are more specific than those provided by PROV. For those terms, there is a mapping from Dublin Core Metadata Terms to PROV provided by a best practice document. If your system understands PROV, it will be able to understand the provenance terms of Dublin Core.

Why doesn't PROV use FOAF?

TODO: Tim Lebo

Modeling considerations

How can I define a sub activity?

PROV contains the notion of an activity. Some people have asked how to model sub-activities. We suggest using dcterms:hasPart

When should I use prov:Agent subtypes?

If you want to denote that some Person, Organization or Software Agent has responsibility for and activity or entity use the subtypes of prov:Agent. Otherwise, we suggest that use other ontologies like FOAF.

Can I infer Derivation from Usage and Generation?

No! Consider the trivial contrived example of an activity X that uses 3 input entities, a,b,c, and generates 2 output entities, one with a value of the sum of the first two (d = a+b), and a second with a value of the second two (e = b+c). You could express that:

used(X, a)
used(X, b)
used(X, c)
wasGeneratedBy(d, X)
wasGeneratedBy(e, X)

Given our description, we would also assert that:

wasDerivedFrom(d, a)
wasDerivedFrom(d, b)
wasDerivedFrom(e, b)
wasDerivedFrom(e, c)

But you could not infer from the Usages and Generations for activity X that:

wasDerivedFrom(d, c)


wasDerivedFrom(e, a)

Should I use wasInfluencedBy, wasInformedBy, or another relation?

The names of terms in the PROV data model and ontology are chosen as carefully as possible to distinctly represent the concepts intended. However, as a generic model intended to express provenance in many settings, it is impossible to find names that, read as colloquial English, match their use precisely in each setting. The terms wasInfluencedBy and wasInformedBy have been confused, due to the similar meanings of the words 'influence' and 'inform'.

First, in choosing a relation, please read the definitions rather than relying on the names, as these will help clarify their intended use. In general, we discourage the use of the wasInfluencedBy in expressing provenance unless very little information is known about what you would like to model in the provenance. wasInfluencedBy is a super-property of the relations for common use: used, wasGeneratedBy, wasAssociatedWith, wasInformedBy, etc. These properties relate entities, activities and agents in varying combinations. wasInfluencedBy is present more to help querying provenance data, at times where you do not care about which type of provenance relation is being expressed. wasInformedBy is a more specific relation connecting one activity to another implying "the exchange of some unspecified entity by two activities, one activity using some entity generated by the other", so that one activity was in some way dependent on the other.

If you find yourself using wasInfluencedBy in your provenance model, it is preferable to check what kinds of element (entities, activities, or agents) you are trying to relate and considering using the more specific relation. For example, if you wish to express the influence of one entity on another, consider using a wasDerivedFrom relation.

Examples of Provenance

Provenance of a Car

The prov-dm document gives the example of a car, moved from Boston to Cambridge (see example 5 CR document). For this car, we identify multiple entities exposing various facets of the thing: Joe's car, Joe's car in Boston, and Joe's car in Cambridge.

entity(joe-car-boston, [prov:location="boston"])
entity(joe-car-cambridge, [prov:location="cambridge"])
specializationOf(joe-car-cambridge, joe-car)
specialization(joe-car-boston, joe-car)

Joe-car-cambridge begins to exist when the car arrives is Cambridge, and Joe-car-boston ceases to exist (invalidation) once it leaves Boston. So joe-car-cambridge's generation time is defined as the time at which in arrives in Cambridge.

Provenance of Flour

If a change in a resource's state is something to be documented in the provenance, then that requires multiple entities.

If a change is to be documented in PROV, then multiple entities are used, e.g. the flour before and after baking. If it is not documented, then only one entity is required. There is no notion of a change which is "documented but not significant", because it is unclear what significance would be in general except for the decision to model/document it. As before, a general, mutable "flour" entity can exist that is connected to the flour before and after baking using prov:specializationOf. For example:

ex:baked prov:used ex:flour1
ex:flour2 prov:wasGeneratedBy ex:baked
ex:flour2 prov:wasDerivedFrom ex:flour1
ex:flour1 prov:specializationOf ex:flour
ex:flour2 prov:specializationOf ex:flour

Access and Query

Access and Query - arbitrary data

@@TODO: text lifted from what was section 3.4 of PROV-AQ

If a resource is represented using a data format other than HTML or RDF, and no URI for the resource is known, provenance discovery becomes trickier to achieve. This specification does not define a specific mechanism for such arbitrary resources, but this section discusses some of the options that might be considered.

For formats which have provision for including metadata within the file (e.g. JPEG images, PDF documents, etc.), use the format-specific metadata to include a <a class="internalDFN">target-URI</a>, <a class="internalDFN">provenance-URI</a> and/or <a class="internalDFN">service-URI</a>. Format-specific metadata provision might also be used to include <a class="internalDFN">provenance information</a> directly in the resource.

Use a generic packaging format that can combine an arbitrary data file with a separate metadata file in a known format such as RDF.

Access and Query - alternatives


The following text copied from PROV-AQ "Best Practices"

Using SPARQL for provenance queries

Simply identifying and retrieving provenance information as a resource on the Web may not always meet the requirements of a particular application or service, e.g.:

  • the resource for which provenance information is required is not identified by a known URI
  • the provenance information for a resource is not directly identified by a known URI
  • a requirement to access provenance information for a number of distinct but related resources in a single atomic operation
  • etc.

A provenance query service provides an alternative way to access provenance information and/or provenance-URIs. An application will need a URI for the provenance query service, and some relevant information about the resource whose provenance is to be accessed.

The details of a provenance query service is an implementation choice, but for interoperability between different providers and users we recommend use of SPARQL RDF-SPARQL-PROTOCOL RDF-SPARQL-QUERY. The query service URI would then be the URI of a <a href="http://www.w3.org/TR/rdf-sparql-protocol/#conformant-sparql-protocol-service" class="externalRef">SPARQL protocol service</a> (often referred to as a "SPARQL endpoint"). The following subsections provide examples for what are considered to be some plausible common scenarios for using SPARQL, and are not intended to cover all possibilities.

A SPARQL protocol service description may be published using the SPARQL 1.1 Service Description vocabulary SPARQL-SD.

The following subsections illustrate use cases for querying a SPARQL-based provenance query service.

Find a provenance-URI given a target-URI

If the requester has a <a class="internalDFN">target-URI</a>, a simple SPARQL query may be used to return the corresponding <a class="internalDFN">provenance-URI</a>. E.g., if the original resource has a target-URI http://example.org/resource:

  @prefix prov: <http://www.w3.org/ns/prov#>
  SELECT ?provenance_uri WHERE
    <http://example.org/resource> prov:hasProvenance ?provenance_uri

Find Provenance-URI given identifying information about a resource

If the requester has identifying information that is not the URI of the original resource, then they will need to construct a more elaborate query to locate a resource description and obtain its provenance-URI(s). The nature of identifying information that can be used in this way will depend upon the third party service used, further definition of which is out of scope for this specification. For example, a query for a document identified by a DOI, say 1234.5678, using the PRISM vocabulary PRISM might look like this:

  @prefix prov: <http://www.w3.org/ns/prov#>
  @prefix prism: <http://prismstandard.org/namespaces/basic/2.0/>
  SELECT ?provenance_uri WHERE
    [ prism:doi "1234.5678" ] prov:hasProvenance ?provenance_uri

Obtain provenance information directly given a target-URI

This scenario retrieves provenance information directly given the URI of a resource, and may be useful where the provenance information has not been assigned a specific URI, or when the calling application is interested only in specific elements of provenance information.

If the original resource has a URI http://example.org/resource, a SPARQL query for provenance information might look like this:

  @prefix prov: <http://www.w3.org/ns/prov#>
  SELECT ?generationStartTime WHERE {
      <http://example.org/resource> prov:wasGeneratedBy ?activity .
      ?activity prov:startedAtTime ?generationStartTime .

This query extracts a "generation start time" for an artifact by following links to the start time of the activity which generated it.

Incremental Provenance Retrieval

<a class="internalDFN">Provenance information</a> may be large. While this specification does not define how to implement scalable provenance systems, it does allow for publishers to make available provenance in an incremental fashion. We now discuss two possibilities for incremental provenance retrieval.

Via Web Retrieval

Publishers are not required to publish all the provenance information associated with a given resource at a particular <a class="internalDFN">provenance-URI</a>. The amount of provenance information exposed is application dependent. However, it is possible to incrementally retrieve (i.e. walk the provenance graph) by progressively looking up provenance information using HTTP. The pattern is as follows:

  1. For a given resource (resource-URI) retrieve it's associated provenance-URI-1 and its associated target-URI-1 using a returned HTTP Link: header field (<a href="#resource-accessed-by-http" class="sectionRef"></a>)
  2. Dereference provenance-URI-1
  3. Navigate the provenance information
  4. When reaching a dead-end during navigation, that is on encountering a reference to a resource (target-URI-2) with no provided provenance information, find its provenance-URI and continue from Step 2. (Note: an HTTP HEAD request for target-URI-2 may be used to obtain the Link: headers without retrieving the resource representation.)

To reduce the overhead of multiple HTTP requests, a provenance information publishers are encouraged to link entities to their associated provenance information using the prov:hasProvenance predicate. Thus, the same pattern above applies, except instead of having to retrieve a new Link header field, one can immediately access the resource's associated provenance.

The same approach can be adopted when using the <a class="internalDFN">provenance service</a> API (<a href="#provenance-services" class="sectionRef"></a>). However, instead of performing an HTTP HEAD or GET against a resource one queries the provenance service using the given <a class="internalDFN">target-URI</a>.

Via SPARQL Queries

Provenance information may be made available using a SPARQL endpoint (<a href="#querying-provenance-information" class="sectionRef"></a>) RDF-SPARQL-PROTOCOL RDF-SPARQL-QUERY. Using SPARQL queries, provenance can be selectively retrieved using combinations of filters and or path queries.

Validity and Bundles

The following example shows a document containing two bundles. This document is invalid because the first bundle ex:b1 contains a cycle of derivations, which is not allowed due to the PROV Constraint derivation-generation-generation-ordering (e1 is older than e2, but e2 is older than e1).

 prefix ex <http://example.org/>

 bundle ex:b1

 bundle ex:b2

In this second example, the document is valid, since the validity of each bundle is established independently of the other bundle. While there is a cycle of derivations, this cycle is not present in any single bundle.

 prefix ex <http://example.org/>

 bundle ex:b1

 bundle ex:b2

Finally, in the third example, the document is not syntactically correct since descriptions in bundle ex:b2 refer to prefixes ex1 and ex2, which are neither defined in bundle ex:b2 nor at the toplevel.

 prefix ex <http://example.org/>

 bundle ex:b1
   prefix ex1 <http://example.org/>


 bundle ex:b2


PROV-XML documents can include a reference to a GRDDL transformation, which -- when invoked with the document as input -- will produce a PROV-O representation of the PROV-XML. Include the xmlns:grddl and grddl:transformation attributes below, and point to any appropriate XSL stylesheet (the one shown is for demonstration purposes and is incomplete).


The WG could add a default XSL within the w3.org/ns... Adopters are encouraged to specify whichever @transformation they wish (including their own tailored to their flavor of prov).

How should I generate JAXB classes from the PROV-XML schemas?

JAXB, as of jaxb-ri-2.2.6, sometimes generates Java bindings with JAXBElement<T> types for storing elements within prov:Document and prov:BundleConstructor. These JAXBElement<T> may not be as user friendly to use. JAXB appears to do so in order to retain a round-trip marshalling/unmarshalling capability of elements--particularly when unbounded sequence of elements are used in the current prov-core.xsd.

For example, the prov:Document type:

<xs:element name="document" type="prov:Document" />
<xs:complexType name="Document">
    <xs:sequence maxOccurs="unbounded">
        <xs:group ref="prov:documentElements" minOccurs="0"/>
        <xs:element name="bundleContent" type="prov:BundleConstructor" minOccurs="0"/>
        <xs:any namespace="##other" processContents="lax" minOccurs="0" />

May yield the following Java bindings with JAXBElement<T> for storing the elements:

public class Document {
        @XmlElementRef(name = "hadPrimarySource", namespace = "http://www.w3.org/ns/prov#", type = JAXBElement.class, required = false),
        @XmlElementRef(name = "agent", namespace = "http://www.w3.org/ns/prov#", type = JAXBElement.class, required = false),
        @XmlElementRef(name = "activity", namespace = "http://www.w3.org/ns/prov#", type = JAXBElement.class, required = false),
        @XmlElementRef(name = "organization", namespace = "http://www.w3.org/ns/prov#", type = JAXBElement.class, required = false),
        @XmlElementRef(name = "softwareAgent", namespace = "http://www.w3.org/ns/prov#", type = JAXBElement.class, required = false),

JAXB's generation of JAXBElement<T> classes seems to be a wrapper approach to preserve sufficient information in the schema for round-trip marshaling & unmarshalling of values in XML instances. More specifically, it wraps the data with a QName and a nillable flag.

It appears that the a frequent cause of JAXB producing JAXBElement<T> is its attempt to preserve elements with both minOccurs=0 and nillable=true. JAXB needs to distinguish between the two cases where:

1. element missing, minOccurs=0, then jaxbElement==null

2. element present, xsi:nil=true, then jaxbElement.isNil()==true

It would not be possible to distinguish between these two states if the bindings were the raw types.

The repeating sequences are treated as a List<Object> of generic JAXBElements, where the JAXBElement's QName is used to distinguish elements with different names.

It would be possible to customize the JAXB bindings to ignore the full round-trip requirement. An example "bindings.xjb" customization file:

<jaxb:bindings version="2.1"
  <jaxb:bindings schemaLocation="prov-core.xsd"
    <jaxb:globalBindings generateElementProperty="false" />
$ xjc.sh -b bindings.xjb prov.xsd

However, it has been noted that the "generateElementProperty=false" customization option generates an alternate developer-friendly "but lossy binding".

A more viable solution is to remove usage of the abstractElement from the sequences and to leverage the use of prov:others.

TODO: link to to inline the modified JAXB-friendly schema.

Capitalization Conventions in PROV

Various conventions are adopted by various communities, and the group itself had its own set of requirements. Hence, the following conventions were adopted, as a compromise among those sometimes conflicting constraints.


1. In Figure 1 of prov-dm (http://www.w3.org/TR/prov-dm/#prov-core-structures-top) classes are capitalized.

2. Relations are expressed using a past tense verbal form (was derived from, was generated by, ...)

3. Figure 5 (http://www.w3.org/TR/prov-dm/#figure-component1) and subsequent figures show that most relations are reified (called Association Class in UML, and see Qualified Pattern in PROV-O), so in compliance with UML convention, binary association and corresponding association classes are capitalized.

4. Values that the prov:type attribute can take are classes, and therefore are capitalized prov:Bundle, prov:SoftwareAgent, etc

5. Definitions define terms informally, in English. Terms in definitions are not capitalized (unless they begin the sentence).


1. All expressions start with a predicate using camel case with the first letter as lower case.

wasDerivedFrom(ex:e2, ex:e1)

2. References to types are capitalized:

entity(ex:b1, [prov:type="prov:Bundle])  
agent(ex:ag1, [prov:type="prov:SoftwareAgent"])


1. Properties use camel case with the first letter as lower case.

2. Classes are capitalized

 ex:e2 prov:wasDerivedFrom ex:e1.
 ex:e2 a prov:Entity.


1. Elements use camel case with the first letter as lower case.

2. Classes in the XML schema are capitalized

 <prov:agent id="ex:ag">

How does PROV address the recommendations of the Provenance Incubator Group

Seven of the eight recommendations defined by the Provenance Incubator are directly addressed by PROV specification:

  • Core concepts (recommendation #1) addressed by Entity, Agent, Person, id, Activity in the PROV data model
  • Provenance publishing (recommendation #3) addressed by prov-aq
  • Provenance of provenance (recommendation #4) addressed by bundles in prov-dm
  • Reproducibility (recommendation #5) addressed by Activity, Usage, Generation, etc in the PROV data model
  • Versioning (recommendation #6) addressed by Derivation, Revision, Specialization, Alternate in the PROV data model
  • Procedure (recommendation #7) addressed by Activity and Plan in the PROV data model
  • Derivation (recommendation #8) addressed by Derivation in the PROV data model

Recommendation #2 was out of scope for standardization of activity of the working group, but it is believed that it can be supported by PROV extensibility points.