W3C

The PROV Data Model and Abstract Syntax Notation

W3C Working Draft 15 December 2011

This version:
http://www.w3.org/TR/2011/WD-prov-dm-20111215/
Latest published version:
http://www.w3.org/TR/prov-dm/
Latest editor's draft:
http://dvcs.w3.org/hg/prov/raw-file/default/model/ProvenanceModel.html
Previous version:
http://www.w3.org/TR/2011/WD-prov-dm-20111018/
Editors:
Luc Moreau, University of Southampton
Paolo Missier, Newcastle University
Contributors:
Khalid Belhajjame, University of Manchester
Stephen Cresswell, legislation.gov.uk
Yolanda Gil, Invited Expert
Ryan Golden, Oracle Corporation
Paul Groth, VU University of Amsterdam
Graham Klyne, University of Oxford
Jim McCusker, Rensselaer Polytechnic Institute
Simon Miles, Invited Expert
James Myers, Rensselaer Polytechnic Institute
Satya Sahoo, Case Western Reserve University

Abstract

PROV-DM is a data model for provenance for building representations of the entities, people and activities involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but with well-defined extensibility points allowing further domain-specific and application-specific extensions to be defined. It is accompanied by PROV-ASN, a technology-independent abstract syntax notation, which allows serializations of PROV-DM instances to be created for human consumption, which facilitates its mapping to concrete syntax, and which is used as the basis for a formal semantics.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is part of a set of specifications aiming to define the various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. This document defines the PROV-DM data model for provenance, accompanied with a notation to express instances of that data model for human consumption. Three other documents are: 1) a normative serialization of PROV-DM in RDF, specified by means of a mapping to the OWL2 Web Ontology Language; 2) the mechanisms for accessing and querying provenance; 3) a primer for the provenance data model.

This document was published by the Provenance Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.org (subscribe, archives). All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction

For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable: provenance can help those users to make trust judgments.

The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today. Instead, a pragmatic approach is to consider a core data model for provenance that allows domain and application specific representations of provenance to be translated into such a data model and exchanged between systems. Heterogeneous systems can then export their provenance into such a core data model, and applications that need to make sense of provenance in heterogeneous systems can then import it, process it, and reason over it.

Thus, the vision is that different provenance-aware systems natively adopt their own model for representing their provenance, but a core provenance data model can be readily adopted as a provenance interchange model across such systems.

A set of specifications defines the various aspects that are necessary to achieve this vision in an inter-operable way, the first of which is contained in this document:

The PROV-DM data model for provenance consists of a set of core concepts, and a few common relations, based on these core concepts. PROV-DM is a domain-agnostic model, but with well-defined extensibility points allowing further domain-specific and application-specific extensions to be defined.

This specification also introduces PROV-ASN, an abstract syntax that is primarily aimed at human consumption. PROV-ASN allows serializations of PROV-DM instances to be written in a technology independent manner, it facilitates its mapping to concrete syntax, and it is used as the basis for a formal semantics. This specification uses instances of provenance written in PROV-ASN to illustrate the data model.

1.1 Structure of this Document

In section 2, a set of preliminaries are introduced, including concepts that underpin PROV-DM and motivations for the PROV-ASN notation.

Section 3 provides an overview of PROV-DM listing its core types and their relations.

In section 4, PROV-DM is applied to a short scenario, encoded in PROV-ASN, and illustrated graphically.

Section 5 provides the normative definition of PROV-DM and the notation PROV-ASN.

Section 6 introduces common relations used in PROV-DM, including relations for data collections and common domain-independent relations.

Section 7 summarizes PROV-DM extensibility points.

Section 8 discusses how PROV-DM can be applied to the notion of resource.

1.2 PROV-DM Namespace

The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).

All the elements, relations, reserved names and attributes introduced in this specification belong to the PROV-DM namespace.

There is a desire to use a single namespace that all specs can share to refer to common provenance terms.

1.3 Conventions

The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].

2. Preliminaries

2.1 A Conceptualization of the World

2.1.1 Entity, Activity, Agent

This specification is based on a conceptualization of the world that is described in this section. In the world (whether real or not), there are things, which can be physical, digital, conceptual, or otherwise, and activities involving things.

When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc.

Hence, to accommodate different perspectives on things and their situation in the world as perceived by us, we introduce the idea of a characterized thing, which refers to a thing and its situation in the world, as characterized by someone. We then define an entity as an identifiable characterized thing. An entity fixes some aspects of a thing and its situation in the world, so that it becomes possible to express its provenance, and what causes these specific aspects to be as such. An alternative entity may fix other aspects, and its provenance may be different.

Different users may take different perspectives on a resource with a URL. These perspectives in this conceptualization of the world are referred to as entities. Three such entities may be expressed:
  • a report available at URL: fixes the nature of the thing, i.e. a document, and its location;
  • the version of the report available there today: fixes its version number, contents, and its date;
  • the report independent of where it is hosted and of its content over time: fixes the nature of the thing as a conceptual artifact.
The provenance of these three entities may differ, and may be along the following lines:
  • the provenance of a report available at URL may include: the act of publishing it and making it available at a given location, possibly under some license and access control;
  • the provenance of the version of the report available there today may include: the authorship of the specific content, and reference to imported content;
  • the provenance of the report independent of where it is hosted over time may include: the motivation for writing the report, the overall methodology for producing it, and the broad team involved in it.

We do not assume that any characterization is more important than any other, and in fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different entity that characterizes the report appropriately.

In the world, activities involve entities in multiple ways: they consume them, they process them, they transform them, they modify them, they change them, they relocate them, they use them, they generate them, they are controlled by them, etc.

An agent is a type of entity that takes an active role in an activity such that it can be assigned some degree of responsibility for the activity taking place. This definition intentionally stays away from using concepts such as enabling, causing, initiating, affecting, etc., because any entity also enables, causes, initiates, and affects activities in some way. So the notion of having some degree of responsibility is really what makes an agent.

Even software agents can be assigned some responsibility for the effects they have in the world. For example, if one is using a text editor and one's laptop crashes, then one would say that the text editor was responsible for crashing the laptop. If one invokes a service to buy a book, that service can be considered responsible for drawing funds from one's bank to make the purchase (the company that runs the service and the web site would also be responsible, but the point here is that we assign some measure of responsibility to software as well). So when someone models software as an agent for an activity in our model, they mean the agent has some responsibility for that activity.

In this specification, the qualifier 'identifiable' is implicit whenever a reference is made to an activity, agent, or an entity.

2.1.2 Time and Event

Time is critical in the context of provenance, since it can help corroborate provenance claims. For instance, if an entity is claimed to be obtained by transforming another, then the latter must have existed before the former. If it is not the case, then there is something wrong in such a provenance claim.

Although time is critical, we should also recognize that provenance can be used in many different contexts: in a single system, across the Web, or in spatial data management, to name a few. Hence, it is a design objective of PROV-DM to minimize the assumptions about time, so that PROV-DM can be used in varied contexts.

Furthermore, consider two activities that started at the same time instant. Just by referring to that instant, we cannot distinguish which activity start we refer to. This is particularly relevant if we try to explain that the start of these activities had different reasons. We need to be able to refer to the start of an activity as a first class concept, so that we can talk about it and about its relation with respect to other similar starts.

Hence, in our conceptualization of the world, an instantaneous event, or event for short, happens in the world and marks a change in the world, in its activities and in its entities. The term "event" is commonly used in process algebra with a similar meaning. For instance, in CSP [CSP], events represent communications or interactions; they are assumed to be atomic and instantaneous.

2.1.2.1 Types of Events

Four kinds of events underpin the PROV-DM data model. The activity start and activity end events demarcate the beginning and the end of activities, respectively. The entity generation and entity usage events demarcate the characterization interval for entities. More specifically:

An entity generation event is the event that marks the final instant of an entity's creation timespan, after which it becomes available for use.

An entity usage event is the event that marks the first instant of an entity's consumption timespan by an activity.

An activity start event is the event that marks the instant an activity starts.

An activity end event is the event that marks the instant an activity ends.

2.1.2.2 Event Ordering

To allow for minimalistic clock assumptions, like Lamport [CLOCK], PROV-DM relies on a notion of relative ordering of events, without using physical clocks. This specification assumes that a partial order exists between events.

Specifically, follows is a partial order between events, indicating that an event occurs after another. For symmetry, precedes is defined as the inverse of follows.

How such a partial order is realized in practice is beyond the scope of this specification. This specification only assumes that each event can be mapped to an instant in some form of timeline. The actual mapping is not in the scope of this specification. Likewise, whether this timeline is formed of a single global timeline or whether it consists of multiple Lamport-style clocks is also beyond the scope of this specification. It is anticipated that follows and precedes correspond to some ordering over this timeline.
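As an illustration of ordering events without physical clocks, the following minimal sketch stores asserted precedes pairs and derives the rest by transitive closure. The event identifiers and the asserted pairs are hypothetical, and treating the order as strict (no event precedes itself) is an assumption of this sketch rather than a statement of the specification.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the transitive closure of a set of (earlier, later) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow during the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


# Hypothetical assertions: each pair (x, y) states that x precedes y.
asserted = {("evt1", "evt2"), ("evt2", "evt3"), ("evt2", "evt4")}
order = transitive_closure(asserted)


def precedes(x, y):
    return (x, y) in order


def follows(x, y):
    # follows is the inverse of precedes.
    return precedes(y, x)


def comparable(x, y):
    # In a partial order, some pairs of events are simply unordered.
    return precedes(x, y) or precedes(y, x)
```

In this sketch evt3 and evt4 both follow evt2 but are not ordered with respect to each other, which is what distinguishes a partial order from a single global timeline.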

This specification introduces a set of "temporal interpretation" rules that allow event ordering constraints to be derived from provenance records. According to such temporal interpretation, provenance records must satisfy such constraints. We note that the actual verification of such temporal constraints is also outside the scope of this specification.

PROV-DM also allows for time observations to be inserted in specific provenance records, for each recognized event introduced in this specification. The presence of a time observation for a given event fixes the mapping of this event to the timeline. It can also help with the verification of associated temporal constraints (though, again, this verification is outside the scope of this specification).
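Although verification is out of scope, a sketch may help show how time observations could support it. The event names, the constraint list, and the function check_constraints below are hypothetical; the two timestamps are the start times given for activities a0 and a1 in the example of Section 4.2. A constraint is reported as undetermined when either event has no time observation.

```python
from datetime import datetime

# Time observations map some events to instants on a timeline.
observations = {
    "start_a0": datetime(2011, 11, 16, 16, 0, 0),
    "start_a1": datetime(2011, 11, 16, 16, 5, 0),
}

# Hypothetical ordering constraints: (earlier, later) means earlier precedes later.
constraints = [("start_a0", "start_a1"), ("start_a1", "start_a5")]


def check_constraints(constraints, observations):
    """Return True/False per constraint, or None when it cannot be decided."""
    results = {}
    for earlier, later in constraints:
        if earlier in observations and later in observations:
            results[(earlier, later)] = observations[earlier] <= observations[later]
        else:
            # At least one event has no time observation.
            results[(earlier, later)] = None
    return results


results = check_constraints(constraints, observations)
```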

2.2 PROV-ASN: The Provenance Abstract Syntax Notation

This specification defines PROV-DM, a data model for provenance, consisting of records describing how people, entities, and activities, were involved in producing, influencing, or delivering a piece of data or a thing in the world.

This specification also relies on a language, PROV-ASN, the Provenance Abstract Syntax Notation, to express instances of that data model. For each construct of PROV-DM, a corresponding ASN expression is introduced, by way of a production in the ASN grammar.

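PROV-ASN is an abstract syntax whose goals are to allow serializations of PROV-DM instances to be written in a technology-independent manner for human consumption, to facilitate the mapping of PROV-DM to concrete syntax, and to provide a basis for a formal semantics.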

This specification provides a grammar for PROV-ASN. Each record of the PROV-DM data model is explained in terms of the production of this grammar.

The formal semantics of PROV-DM is defined at [PROV-SEMANTICS] and its encoding in the OWL2 Web Ontology Language at [PROV-O].

2.3 Representation, Assertion, and Inference

PROV-DM is a provenance data model designed to express representations of the world.

A file at some point during its lifecycle, which includes multiple edits by multiple people, can be represented by its location in the file system, a creator, and content.

These representations are relative to an asserter, and in that sense constitute assertions stating properties of the world, as represented by an asserter. Different asserters will normally contribute different representations. This specification does not define a notion of consistency between different sets of assertions (whether by the same asserter or different asserters). The data model provides the means to associate attribution to assertions.

An alternative representation of the above file is a set of blocks in a hard disk.

The data model is designed to capture activities that happened in the past, as opposed to activities that may or will happen. However, this distinction is not formally enforced. Therefore, all PROV-DM assertions should be interpreted as a record of what has happened, as opposed to what may or will happen.

This specification does not prescribe the means by which an asserter arrives at assertions; for example, assertions can be composed on the basis of observations, reasoning, or any other means.

Sometimes, inferences about the world can be made from representations conformant to the PROV-DM data model. When this is the case, this specification defines such inferences, allowing new provenance records to be inferred from existing ones. Hence, representations of the world can result either from direct assertions by asserters or from application of inferences defined by this specification.

2.4 Grammar Notation

This specification includes a grammar for PROV-ASN expressed using the Extended Backus-Naur Form (EBNF) notation.

Each rule in the grammar defines one symbol, in the form:

E ::= expression

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
  • E: matches term satisfying rule for symbol E.
  • 'abc': matches the literal string inside the single quotes.
  • expression?: matches expression or nothing; optional expression.
  • expression+: matches one or more occurrences of expression.
  • expression*: matches zero or more occurrences of expression.
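The literal, optional, one-or-more, and zero-or-more forms have direct counterparts in regular expressions, which gives a quick way to experiment with them. The sketch below is only an illustration: the rule E is hypothetical and not part of PROV-ASN, and regular expressions cannot express recursive productions.

```python
import re


def lit(s):
    """Literal string, as written between single quotes in the grammar."""
    return re.escape(s)


def opt(e):
    """Optional expression: expression or nothing."""
    return f"(?:{e})?"


def plus(e):
    """One or more occurrences of expression."""
    return f"(?:{e})+"


def star(e):
    """Zero or more occurrences of expression."""
    return f"(?:{e})*"


# Hypothetical rule:  E ::= 'a' 'b'? 'c'+ 'd'*
E = re.compile(lit("a") + opt(lit("b")) + plus(lit("c")) + star(lit("d")))


def matches(text):
    return E.fullmatch(text) is not None
```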

3. PROV-DM: An Overview

The following ER diagram provides a high level overview of the structure of PROV-DM records. Examples of provenance assertions that conform to this schema are provided in the next section.

PROV-DM overview
Overview diagram does not represent the sub-relations -- proposal to use a UML notation instead of ER.

The model includes the following elements:

A set of attribute-value pairs can be associated with elements and relations of the PROV model in order to further characterize their nature. The attributes role and type are pre-defined. The wasComplementOf relationship is used to denote that two entities complement each other, in the sense that they each represent a partial, but mutually compatible, characterization of the same thing.

The set of relations presented here forms a core, which is further extended with additional relations, defined in Section Common Relations.

The model includes one further element: notes. These are also structured as sets of attribute-value pairs. Notes are used to provide additional, "free-form" information regarding any identifiable construct of the model, with no prescribed meaning. Notes are described in detail later in this specification.

Attributes and notes are the main extensibility points in the model: individual interest groups are expected to extend PROV-DM by introducing new attributes and notes as needed to address application-specific provenance modelling requirements.

4. Example

This section is non-normative.

There is a suggestion that a better example should be adopted for this document. Possibly, several shorter examples. This is ISSUE-132
To illustrate PROV-DM, this section presents an example encoded according to PROV-ASN. For more detailed explanations of how PROV-DM should be used, and for more examples, we refer the reader to the Provenance Primer [PROV-PRIMER].
Comments on section 3.2. This is ISSUE-71

4.1 A File Scenario

This scenario is concerned with the evolution of a crime statistics file (referred to as e0) stored on a shared file system and which journalists Alice, Bob, Charles, David, and Edith can share and edit. We consider various events in the evolution of file e0; events listed below follow each other, unless otherwise specified.

Event evt1: Alice creates (a0) an empty file in /shared/crime.txt. We denote this file e1.

Event evt2: Bob appends (a1) the following line to /shared/crime.txt:

There was a lot of crime in London last month.

We denote the revised file e2.

Event evt3: Charles emails (a2) the contents of /shared/crime.txt, as an attachment, which we refer to as e4. (We specifically refer to a copy of the file that is uploaded on the mail server.)

Event evt4: David edits (a3) file /shared/crime.txt as follows.

There was a lot of crime in London and New York last month.

We denote the revised file e3.

Event evt5: Edith emails (a4) the contents of /shared/crime.txt as an attachment, referred to as e5.

Event evt6: between events evt4 and evt5, someone (unspecified) runs a spell checker (a5) on the file /shared/crime.txt. The file after spell checking is referred to as e6.

4.2 Encoding using PROV-ASN

In this section, the example is encoded according to the provenance data model (specified in section PROV-DM: The Provenance Data Model) and expressed in PROV-ASN.

Entity Records (described in Section Entity). The file in its various forms and its copies are modelled as entity records, corresponding to multiple characterizations, as per scenario. The entity records are identified by e0, ..., e6.

entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
entity(e1, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="" ])
entity(e2, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="There was a lot of crime in London last month."])
entity(e3, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="There was a lot of crime in London and New York last month."])
entity(e4)
entity(e5)
entity(e6, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="There was a lot of crime in London and New York last month.", ex:spellchecked="yes"])

These entity records list attributes that have been given values during intervals delimited by events; such intervals are referred to as characterization intervals. The following table lists all entity identifiers and their corresponding characterization intervals. When the end of the characterization interval is not delimited by an event described in this scenario, it is marked by "...".

Entity    Characterization Interval
e0        evt1 - ...
e1        evt1 - evt2
e2        evt2 - evt4
e3        evt4 - ...
e4        evt3 - ...
e5        evt5 - ...
e6        evt6 - ...
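The characterization intervals listed above can be queried mechanically. The sketch below is a non-normative illustration: it uses the event order given in the scenario (evt6 falls between evt4 and evt5), and it assumes that an interval includes its starting event and excludes its ending event, which this specification does not state. The function name characterized_at is hypothetical.

```python
# Event order taken from the scenario: evt6 occurs between evt4 and evt5.
EVENT_ORDER = ["evt1", "evt2", "evt3", "evt4", "evt6", "evt5"]
POSITION = {evt: i for i, evt in enumerate(EVENT_ORDER)}

# (start, end) per entity; None stands for the open end marked "...".
INTERVALS = {
    "e0": ("evt1", None),
    "e1": ("evt1", "evt2"),
    "e2": ("evt2", "evt4"),
    "e3": ("evt4", None),
    "e4": ("evt3", None),
    "e5": ("evt5", None),
    "e6": ("evt6", None),
}


def characterized_at(event):
    """List the entities whose characterization interval contains the event."""
    at = POSITION[event]
    result = []
    for entity, (start, end) in INTERVALS.items():
        has_begun = POSITION[start] <= at
        # Assumption: the ending event itself is outside the interval.
        not_ended = end is None or at < POSITION[end]
        if has_begun and not_ended:
            result.append(entity)
    return result
```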

Activity Records (described in Section Activity) represent activities in the scenario.

activity(a0, create-file,          2011-11-16T16:00:00,)
activity(a1, add-crime-in-london,  2011-11-16T16:05:00,)
activity(a2, email,                2011-11-16T17:00:00,)
activity(a3, edit-London-New-York, 2011-11-17T09:00:00,)
activity(a4, email,                2011-11-17T09:30:00,)
activity(a5, spellcheck,,)

Generation Records (described in Section Generation) represent the event at which a file is created in a specific form. Attributes are used to describe the modalities according to which a given entity is generated by a given activity. The interpretation of attributes is application specific. Illustrations of such attributes for the scenario are: no attribute is provided for e0; e2 was generated by the editor's save function; e4 can be found on the smtp port, in the attachment section of the mail message; e6 was produced on the standard output of a5. Two identifiers g1 and g2 identify the generation records referenced in derivations introduced below.

wasGeneratedBy(e0, a0)
wasGeneratedBy(e1, a0, [ex:fct="create"])
wasGeneratedBy(e2, a1, [ex:fct="save"])     
wasGeneratedBy(e3, a3, [ex:fct="save"])     
wasGeneratedBy(g1, e4, a2, [ex:port="smtp", ex:section="attachment"])  
wasGeneratedBy(g2, e5, a4, [ex:port="smtp", ex:section="attachment"])    
wasGeneratedBy(e6, a5, [ex:file="stdout"])

Usage Records (described in Section Usage) represent the event by which a file is read by an activity. Likewise, attributes describe the modalities according to which the various entities are used by activities. Illustrations of such attributes are: e1 is used in the context of a1's load functionality; e2 is used by a2 in the context of its attach functionality; e3 is used on the standard input by a5. Two identifiers u1 and u2 identify the Usage records referenced in derivations introduced below.

used(a1,e1,[ex:fct="load"])
used(a3,e2,[ex:fct="load"])
used(u1,a2,e2,[ex:fct="attach"])
used(u2,a4,e3,[ex:fct="attach"])
used(a5,e3,[ex:file="stdin"])

Derivation Records (described in Section Derivation Relation) express that an entity is derived from another. The first two are expressed in their compact version, whereas the following two are expressed in their full version, including the activity underpinning the derivation, and associated usage (u1, u2) and generation (g1, g2) records.

wasDerivedFrom(e2,e1)
wasDerivedFrom(e3,e2)
wasDerivedFrom(e4,e2,a2,g1,u1)
wasDerivedFrom(e5,e3,a4,g2,u2)
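The four derivation records can be read as edges of a graph and walked to list the entities that lie upstream of a given entity. The sketch below is only a traversal of the asserted edges; it does not assume that PROV-DM treats derivation as transitive, which is a matter for the normative sections. The names DERIVED_FROM and upstream are hypothetical.

```python
# Edges taken from the derivation records above: derived entity -> source entities.
DERIVED_FROM = {
    "e2": ["e1"],
    "e3": ["e2"],
    "e4": ["e2"],
    "e5": ["e3"],
}


def upstream(entity):
    """Return the entities reachable by following derivation edges, nearest first."""
    seen = []
    stack = list(DERIVED_FROM.get(entity, []))
    while stack:
        source = stack.pop()
        if source not in seen:
            seen.append(source)
            stack.extend(DERIVED_FROM.get(source, []))
    return seen
```

For example, walking back from the second email attachment e5 reaches e3, then e2, then e1.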

wasComplementOf: (this relation is described in Section wasComplementOf). The crime statistics file (e0) has various contents over its existence (e1, e2, e3); the entity records identified by e1, e2, e3 complement e0 with an attribute content. Likewise, the one denoted by e6 complements the record denoted by e3 with an attribute spellchecked.

wasComplementOf(e1,e0)
wasComplementOf(e2,e0)
wasComplementOf(e3,e0)
wasComplementOf(e6,e3) 
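One informal way to read these records is that the complementing record agrees with the other record on every attribute they share and contributes at least one further attribute. The sketch below checks exactly that and nothing more; it is an assumption made for illustration, not the definition given in Section wasComplementOf, and the function names are hypothetical.

```python
# Attribute-value pairs copied from the entity records e0 and e1.
E0 = {"prov:type": "File", "ex:path": "/shared/crime.txt", "ex:creator": "Alice"}
E1 = {"prov:type": "File", "ex:path": "/shared/crime.txt", "ex:creator": "Alice",
      "ex:content": ""}


def agree_on_shared(a, b):
    """True when a and b give the same value to every attribute they both have."""
    return all(a[key] == b[key] for key in a.keys() & b.keys())


def added_attributes(a, b):
    """Attributes present in a but absent from b."""
    return set(a) - set(b)
```

Here e1 agrees with e0 on type, path and creator, and adds the attribute ex:content.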

Agent Records (described in Section Agent): the various users are represented as agents, themselves being a type of entity.

agent(ag1, [ prov:type="prov:Person" %% xsd:QName, ex:name="Alice" ])
agent(ag2, [ prov:type="prov:Person" %% xsd:QName, ex:name="Bob" ])
agent(ag3, [ prov:type="prov:Person" %% xsd:QName, ex:name="Charles" ])
agent(ag4, [ prov:type="prov:Person" %% xsd:QName, ex:name="David" ])
agent(ag5, [ prov:type="prov:Person" %% xsd:QName, ex:name="Edith" ])

Activity Association Records (described in Section Activity Association): the association of an agent with an activity is expressed with wasAssociatedWith, and the nature of this association is described by attributes. Illustrations of such attributes include the role of the participating agent, as creator, author and communicator (role is a reserved attribute in PROV-DM).

wasAssociatedWith(a0, ag1, [prov:role="creator"])
wasAssociatedWith(a1, ag2, [prov:role="author"])
wasAssociatedWith(a2, ag3, [prov:role="communicator"])
wasAssociatedWith(a3, ag4, [prov:role="author"])
wasAssociatedWith(a4, ag5, [prov:role="communicator"])
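Combining the generation records with the activity association records answers a typical provenance question: which agent, in which role, was associated with the activity that generated a given entity. The sketch below hard-codes the records of this example in plain dictionaries; the function name agent_behind is hypothetical, and the one-agent-per-activity layout reflects this example only, not a restriction of the data model.

```python
# entity -> generating activity (from the wasGeneratedBy records)
GENERATED_BY = {"e0": "a0", "e1": "a0", "e2": "a1", "e3": "a3",
                "e4": "a2", "e5": "a4", "e6": "a5"}

# activity -> (agent, role) (from the wasAssociatedWith records)
ASSOCIATED_WITH = {"a0": ("ag1", "creator"), "a1": ("ag2", "author"),
                   "a2": ("ag3", "communicator"), "a3": ("ag4", "author"),
                   "a4": ("ag5", "communicator")}

# agent -> name (from the agent records)
AGENT_NAMES = {"ag1": "Alice", "ag2": "Bob", "ag3": "Charles",
               "ag4": "David", "ag5": "Edith"}


def agent_behind(entity):
    """Return (name, role) for the agent linked to the entity's generation, or None."""
    activity = GENERATED_BY.get(entity)
    if activity not in ASSOCIATED_WITH:
        # No association was asserted, as for the spell checker run a5.
        return None
    agent, role = ASSOCIATED_WITH[activity]
    return AGENT_NAMES[agent], role
```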

4.3 Graphical Illustration

Provenance assertions can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance assertions. Therefore, it cannot be seen as an alternate notation for expressing provenance.

The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and half-hexagonal shapes, respectively. Usage, Generation, Derivation, Activity Association, and Complementarity are represented as directed edges.

Entities are laid out according to the ordering of their generation event. We endeavor to show time progressing from left to right. This means that edges for Usage, Generation and Derivation typically point from right to left.

[Figure: graphical illustration of the example]

5. PROV-DM Core

This section contains the normative specification of PROV-DM core, the core of the PROV data model.

In a next iteration of this document, it is proposed to reorganize section 5 as follows. First, the presentation of the data model alone. Second, its temporal interpretation. Third, the constraints and inferences associated with well-formed accounts.

5.1 Record

PROV-DM consists of a set of constructs, referred to as records, to formulate representations of the world and constraints that must be satisfied by them.

Furthermore, PROV-DM includes a "house-keeping construct", a record container, used to wrap PROV-DM records and facilitate their interchange.

In PROV-ASN, such representations of the world must be conformant with the top-level production record of the grammar. These records are grouped in three categories: elementRecord (see section Element), relationRecord (see section Relation), and accountRecord (see section Account).

record ::= elementRecord | relationRecord | accountRecord

elementRecord ::= entityRecord | activityRecord | agentRecord | noteRecord

relationRecord ::= generationRecord | usageRecord | derivationRecord | activityAssociationRecord | responsibilityRecord | startRecord | endRecord | complementRecord | annotationRecord
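To make the grouping concrete, the sketch below sorts a PROV-ASN record into a category by its leading keyword. It covers only the keywords that appear in the example of Section 4.2; the pairing of each keyword with a production name is inferred from the record names used in this document and should be read as an assumption, and the function classify is hypothetical.

```python
ELEMENT_KEYWORDS = {
    "entity": "entityRecord",
    "activity": "activityRecord",
    "agent": "agentRecord",
}

RELATION_KEYWORDS = {
    "wasGeneratedBy": "generationRecord",
    "used": "usageRecord",
    "wasDerivedFrom": "derivationRecord",
    "wasAssociatedWith": "activityAssociationRecord",
    "wasComplementOf": "complementRecord",
}


def classify(record_text):
    """Return (category, production) for a record, based on its leading keyword."""
    keyword = record_text.split("(", 1)[0].strip()
    if keyword in ELEMENT_KEYWORDS:
        return "elementRecord", ELEMENT_KEYWORDS[keyword]
    if keyword in RELATION_KEYWORDS:
        return "relationRecord", RELATION_KEYWORDS[keyword]
    # Keywords not covered by this sketch (for instance notes or accounts).
    return "unknown", None
```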

In PROV-ASN, a record container is compliant with the production recordContainer (see section Record Container).

5.2 Element

This section describes all the PROV-DM records referred to as element records. (They are conformant to the elementRecord production of the grammar.)

5.2.1 Entity Record

In PROV-DM, an entity record is a representation of an entity.

Examples of entities include a linked data set, a sparse matrix of floating-point numbers, a document in a directory, the same document published on the Web, and meta-data embedded in a document.

An entity record, noted entity(id, [ attr1=val1, ...]) in PROV-ASN, contains:

  • id: an identifier id identifying an entity; the identifier of the entity record is defined to be the same as the identifier of the entity;
  • attributes: an optional set of attribute-value pairs [ attr1=val1, ...], representing this entity's situation in the world.

The assertion of an entity record, entity(id, [ attr1=val1, ...]), states, from a given asserter's viewpoint, the existence of an entity, whose situation in the world is represented by the attribute-value pairs, which remain unchanged during a characterization interval, i.e. a continuous interval between two events in the world.

In PROV-ASN, an entity record's text matches the entityRecord production of the grammar defined in this specification document.

entityRecord ::= 'entity' '(' identifier optional-attribute-values ')'

optional-attribute-values ::= ',' '[' attribute-values ']'
attribute-values ::= attribute-value | attribute-value ',' attribute-values
attribute-value ::= attribute '=' Literal
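A small reader for entity records can help when experimenting with PROV-ASN. The sketch below is deliberately limited and is not a conformant parser: it accepts only double-quoted string values (the Literal production is broader), it treats the attribute list as omissible because the example in Section 4.2 writes records such as entity(e4), and it does not reject stray text between attribute-value pairs. The function name parse_entity is hypothetical.

```python
import re

# entity ( identifier [ , [ attribute-values ] ] )
RECORD = re.compile(r'^\s*entity\s*\(\s*([^\s,()]+)\s*(?:,\s*\[(.*)\]\s*)?\)\s*$')

# attribute = "value"   (assumption: values are double-quoted strings only)
PAIR = re.compile(r'([^\s=,\[\]]+)\s*=\s*"([^"]*)"')


def parse_entity(text):
    """Return (identifier, attributes) for a PROV-ASN entity record."""
    match = RECORD.match(text)
    if match is None:
        raise ValueError(f"not an entity record: {text!r}")
    identifier, body = match.group(1), match.group(2)
    attributes = {}
    # body is None when the record has no attribute list.
    for name, value in PAIR.findall(body or ""):
        attributes[name] = value
    return identifier, attributes
```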

The following entity record,

entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
states the existence of an entity, denoted by identifier e0, with type File, path /shared/crime.txt in the file system, and creator Alice. The attributes path and creator are application specific, whereas the attribute type is reserved in the PROV-DM namespace.
Further considerations:
  • If an asserter wishes to characterize an entity with the same attribute-value pairs over several intervals, then they are required to assert multiple entity records, each with its own identifier (so as to allow potential dependencies between the various entity records to be expressed).
  • There is no assumption that the set of attributes is complete and that the attributes are independent/orthogonal of each other.
  • A characterization interval may collapse into a single instant.
  • An entity assertion is about a thing, whose situation in the world may be variant. An entity record is asserted at a particular point and is invariant, in the sense that its attributes are given a value as part of that assertion.
  • Activities are not represented by entity records, but instead by activity records, as explained below.
The characterization interval of an entity record is currently implicit. Making it explicit would allow us to define wasComplementOf more precisely. It would also allow us to address ISSUE-108. Beginning and end of characterization interval could be expressed by attributes (similarly to activities).

5.2.2 Activity Record

In PROV-DM, an activity record is a representation of an identifiable activity, which performs a piece of work.

An activity, represented by an activity record, is delimited by its start and its end events; hence, it occurs over an interval delimited by two events. However, an activity record need not mention time information, nor duration, because they may not be known.

Such start and end times constitute attributes of an activity, where the interpretation of attribute in the context of an activity record is the same as the interpretation of attribute for entity record: an activity record's attribute remains constant for the duration of the activity it represents. Further characteristics of the activity in the world can be represented by other attribute-value pairs, which must also remain unchanged during the activity duration.

Examples of activities include assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a SPARQL query over a triple store, editing a file, and publishing a web page.

An activity record, written activity(id, rl, st, et, [ attr1=val1, ...]) in PROV-ASN, contains:

  • id: an identifier id identifying an activity; the identifier of the activity record is defined to be the same as the identifier of the activity;
  • recipeLink: an optional recipe link rl, which consists of a domain specific specification of the activity;
  • startTime: an optional time st indicating the start of the activity;
  • endTime: an optional time et indicating the end of the activity;
  • attributes: a set of attribute-value pairs [ attr1=val1, ...], representing other attributes of this activity that hold for its whole duration.

In PROV-ASN, an activity record's text matches the activityRecord production of the grammar defined in this specification document.

activityRecord ::= activity ( identifier , recipeLink , time , time optional-attribute-values )

The following activity assertion

activity(a1,add-crime-in-london,2011-11-16T16:05:00,2011-11-16T16:06:00,[ex:host="server.example.org",prov:type="ex:edit" %% xsd:QName])

identified by identifier a1, states the existence of an activity with recipe link add-crime-in-london, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit (declared in some namespace with prefix ex). The attribute host is application specific, but must hold for the duration of the activity. The attribute type is a reserved attribute of PROV-DM, allowing for subtyping to be expressed.

The mere existence of an activity assertion entails some event ordering in the world, since an activity start event always precedes the corresponding activity end event. This is expressed by constraint start-precedes-end.

The following temporal constraint holds for any activity record: the start event precedes the end event.
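A minimal sketch of how an application might check this constraint when both times are present is given below. The function name start_precedes_end is hypothetical, times are assumed to be ISO 8601 strings as in the example above, and accepting equal start and end instants is an assumption of the sketch rather than something this specification states.

```python
from datetime import datetime
from typing import Optional


def start_precedes_end(start: Optional[str], end: Optional[str]) -> bool:
    """Return False only when both times are given and the start is after the end."""
    if start is None or end is None:
        # Start and end times are optional, so there is nothing to compare.
        return True
    return datetime.fromisoformat(start) <= datetime.fromisoformat(end)


# Times taken from the activity record a1 above.
print(start_precedes_end("2011-11-16T16:05:00", "2011-11-16T16:06:00"))
```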

An activity record is not an entity record. Indeed, an entity record represents an entity that exists in full at any point in its characterization interval, persists during this interval, and preserves the characteristics that make it identifiable. In contrast, an activity is something that happens, unfolds or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].

5.2.3 Agent Record

An agent record is a representation of an agent, which is an entity that can be assigned some degree of responsibility for an activity taking place.

Many agents can have an association with a given activity. An agent may do the ordering of the activity, another agent may do its design, another agent may push the button to start it, another agent may run it, etc. As many agents as one wishes to mention can occur in the provenance record, if it is important to indicate that they were associated with the activity.

From an inter-operability perspective, it is useful to define some basic categories of agents since it will improve the use of provenance records by applications. There should be very few of these basic categories to keep the model simple and accessible. There are three types of agents in the model:

  • Person: agents of type Person are people. (This type is equivalent to a "foaf:person" [FOAF])
  • Organization: agents of type Organization are social institutions such as companies, societies etc. (This type is equivalent to a "foaf:organization" [FOAF])
  • SoftwareAgent: a software agent is a piece of software.

These types are mutually exclusive, though they do not cover all kinds of agent.

An agent record, noted agent(id, [ attr1=val1, ...]) in PROV-ASN, contains:

  • id: an identifier id identifying an agent; the identifier of the agent record is defined to be the same as the identifier of the agent;
  • attributes: contains a set of attribute-value pairs [ attr1=val1, ...], representing this agent's situation in the world.

In PROV-ASN, an agent record's text matches the agentRecord production of the grammar defined in this specification document.

agentRecord ::= agent ( identifier optional-attribute-values )

With the following assertions,

agent(e1, [ex:employee="1234", ex:name="Alice", prov:type="prov:Person" %% xsd:QName])

entity(e2) and wasStartedBy(a1,e2,[prov:role="author"])

entity(e3) and wasAssociatedWith(a1,e3,[prov:role="sponsor"])

the agent record identified by e1 is an explicit agent assertion that holds irrespective of activities it may be associated with. On the other hand, from the entity records identified by e2 and e3, one can infer agent records, as per the following inference.

One can assert an agent record or alternatively, one can infer an agent record by its association with an activity.

If the records entity(e,attrs) and wasAssociatedWith(a,e) hold for some identifiers a, e, and attribute-values attrs, then the record agent(e,attrs) also holds.
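The inference above can be sketched as follows. The function name infer_agents and the dictionary and tuple encodings of records are assumptions made for illustration; only wasAssociatedWith records are considered, exactly as in the rule.

```python
def infer_agents(entities, associations):
    """Infer agent records from entity records and activity associations.

    entities:     dict mapping entity identifier -> attribute dict
    associations: iterable of (activity_id, entity_id) pairs, one per
                  wasAssociatedWith(a, e) record
    Returns a dict mapping agent identifier -> attribute dict.
    """
    inferred = {}
    for _activity, e in associations:
        if e in entities:
            # agent(e, attrs) holds with the same attributes as entity(e, attrs).
            inferred[e] = entities[e]
    return inferred


entities = {"e2": {}, "e3": {}, "e9": {"prov:type": "File"}}
associations = [("a1", "e3")]
print(infer_agents(entities, associations))
```

Here e9 is a hypothetical entity with no association, so no agent record is inferred for it.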

5.2.4 Note Record

As provenance records are exchanged between systems, it may be useful to add extra-information about such records. For instance, a "trust service" may add value-judgements about the trustworthiness of some of the assertions made. Likewise, an interactive visualization component may want to enrich a set of provenance records with information helping reproduce their visual representation. To help with inter-operability, PROV-DM introduces a simple annotation mechanism allowing any identifiable record to be associated with notes.

A note record is a set of attribute-value pairs, whose meaning is application specific. It may or may not be a representation of something in the world.

In PROV-ASN, a note record's text matches the noteRecord production of the grammar defined in this specification document.

noteRecord ::= note ( identifier , attribute-values )

A separate PROV-DM record is used to associate a note with an identifiable record (see Section on annotation). A given note may be associated with multiple records.

The following note record

note(ann1,[ex:color="blue", ex:screenX=20, ex:screenY=30])

consists of a set of application-specific attribute-value pairs, intended to help the rendering of the record it is associated with, by specifying its color and its position on the screen. In this example, these attribute-value pairs do not constitute a representation of something in the world; they are just used to help render provenance.

Attribute-value pairs occurring in notes differ from attribute-value pairs occurring in entity records and activity records. In entity and activity records, attribute-value pairs must be a representation of something in the world, which remains constant for the duration of the characterization interval (for entity records) or the activity duration (for activity records). In note records, it is optional for attribute-value pairs to be representations of something in the world. If they are a representation of something in the world, then they may change value for the corresponding duration. If attribute-value pairs of a note record are a representation of something in the world that does not change, they are not regarded as determining characteristics of an entity or activity, for the purpose of provenance.

5.3 Relation

This section describes all the PROV-DM records representing relations between the elements introduced in Section Element. While these relations are not binary, they all involve two primary elements. They can be summarized as follows.

PROV-DM Core Relation Summary

            Entity             Activity          Agent                Note
Entity      wasDerivedFrom,    wasGeneratedBy    -                    hasAnnotation
            wasComplementOf
Activity    used               -                 wasStartedBy,        hasAnnotation
                                                 wasEndedBy,
                                                 wasAssociatedWith
Agent       -                  -                 actedOnBehalfOf      hasAnnotation
Note        -                  -                 -                    hasAnnotation

In PROV-ASN, all these relation records are conformant to the relationRecord production of the grammar.
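For illustration, the summary table can be read as a lookup from a pair of element kinds to the relations defined between them, where the row is the first primary element of the relation and the column is the second. The names CORE_RELATIONS, ELEMENT_KINDS and relations_between are hypothetical.

```python
ELEMENT_KINDS = ("Entity", "Activity", "Agent", "Note")

# Keys are (row, column) of the summary table; empty cells are simply absent.
CORE_RELATIONS = {
    ("Entity", "Entity"): ["wasDerivedFrom", "wasComplementOf"],
    ("Entity", "Activity"): ["wasGeneratedBy"],
    ("Activity", "Entity"): ["used"],
    ("Activity", "Agent"): ["wasStartedBy", "wasEndedBy", "wasAssociatedWith"],
    ("Agent", "Agent"): ["actedOnBehalfOf"],
}


def relations_between(source, target):
    """Return the core relations whose primary elements are (source, target)."""
    if source not in ELEMENT_KINDS or target not in ELEMENT_KINDS:
        raise ValueError("unknown element kind")
    if target == "Note":
        # Every row of the table has hasAnnotation in the Note column.
        return ["hasAnnotation"]
    return CORE_RELATIONS.get((source, target), [])


print(relations_between("Activity", "Agent"))
```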

5.3.1 Activity-Entity Relation

5.3.1.1 Generation Record

In PROV-DM, a generation record is a representation of a world event, the creation of a new entity by an activity. This entity did not exist before creation. The representation of this event encompasses a description of the modalities of generation of this entity by this activity.

A generation event may be, for example, the creation of a file by a program, the creation of a linked data set, the production of a new version of a document, and the sending of a value on a communication channel.

A generation record, written wasGeneratedBy(id,e,a,attrs,t) in PROV-ASN, has the following components:

  • id: an optional identifier id identifying the generation record;
  • entity: an identifier e identifying an entity record that represents the entity that is created;
  • activity: an identifier a identifying an activity record that represents the activity that creates the entity;
  • time: an optional "generation time" t, the time at which the entity was created;
  • attributes: an optional set of attribute-value pairs attrs that describes the modalities of generation of this entity by this activity.

In PROV-ASN, a generation record's text matches the generationRecord production of the grammar defined in this specification document.

generationRecord ::= wasGeneratedBy ( identifier , eIdentifier , aIdentifier , time optional-attribute-values )

A generation record's id is optional. It must be used when annotating generation records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation Record).

The following generation assertions

  wasGeneratedBy(e1,a1, 2001-10-26T21:32:52, [ex:port="p1", ex:order=1])
  wasGeneratedBy(e2,a1, 2001-10-26T10:00:00, [ex:port="p1", ex:order=2])

state the existence of two events in the world (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, represented by entity records identified by e1 and e2, are created by an activity, itself represented by an activity record identified by a1. The first one is available as the first value on port p1, whereas the other is the second value on port p1. The semantics of port and order in these records are application specific.

The assertion of a generation record implies ordering of events in the world.

If an assertion wasGeneratedBy(x,a,attrs) or wasGeneratedBy(x,a,attrs,t) holds, then the following temporal constraint also holds: the generation of the entity denoted by x precedes the end of a and follows the start of a.

A given entity record can be referred to in a single generation record in the scope of a given account. The rationale for this constraint is as follows. If two activities sequentially set different values to some attribute by means of two different generation events, then they generate distinct entities. Alternatively, for two activities to generate an entity simultaneously, they would require some synchronization by which they agree the entity is released for use; the end of this synchronization would constitute the actual generation of the entity, but is performed by a single activity. This unicity constraint is formalized as follows.

Given an entity record denoted by e, two activity records denoted by a1 and a2, and two sets of attribute-value pairs attrs1 and attrs2, if the records wasGeneratedBy(e,a1,attrs1) and wasGeneratedBy(e,a2,attrs2) exist in the scope of a given account, then a1=a2 and attrs1=attrs2.
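A possible check for this constraint over the generation records of a single account is sketched below. The function name check_generation_unicity and the tuple encoding of generation records are assumptions; attribute values are assumed to be hashable.

```python
def check_generation_unicity(generations):
    """Return the entity identifiers that violate generation unicity.

    generations: iterable of (entity_id, activity_id, attrs) triples taken
                 from one account, where attrs is a dict of attribute values.
    """
    first_seen = {}
    violations = set()
    for e, a, attrs in generations:
        key = (a, frozenset(attrs.items()))
        if e in first_seen and first_seen[e] != key:
            # Same entity, but a different activity or different attributes.
            violations.add(e)
        first_seen.setdefault(e, key)
    return violations


records = [
    ("e1", "a1", {"ex:port": "p1"}),
    ("e1", "a2", {"ex:port": "p1"}),  # hypothetical conflicting record
    ("e2", "a1", {}),
]
print(check_generation_unicity(records))
```

Repeating an identical generation record is not reported, since the constraint only requires a1=a2 and attrs1=attrs2.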
TODO: Introduce the well-formed-ness constraint in an entirely separate section.
5.3.1.2 Usage Record

In PROV-DM, a usage record is a representation of a world event: the consumption of an entity by an activity. The representation includes a description of the modalities of usage of this entity by this activity.

A usage event may be the consumption of a parameter by a procedure, the reading of a value on a port by a service, the reading of a configuration file by a program, or the adding of an ingredient, such as eggs, in a baking activity. Usage may entirely consume an entity (e.g. eggs are no longer available after being added to the mix), or leave it as such, ready for further uses (e.g. a file on a file system can be read indefinitely).

A usage record, written used(id,a,e,attrs,t) in PROV-ASN, has the following constituents:

  • id: an optional identifier id identifying the usage record;
  • activity: an identifier a for an activity record, which represents the consuming activity;
  • entity: an identifier e for an entity record, which represents the entity that is consumed;
  • time: an optional "usage time" t, the time at which the entity was used;
  • attributes: an optional set of attribute-value pairs attrs that describe the modalities of usage of this entity by this activity.

In PROV-ASN, a usage record's text matches the usageRecord production of the grammar defined in this specification document.

usageRecord ::= used ( identifier , aIdentifier , eIdentifier , time optional-attribute-values )

A usage record's id is optional, but comes in handy when annotating usage records (see Section Annotation Record) or when defining derivations.

The following usage records

  used(a1,e1,2011-11-16T16:00:00,[ex:parameter="p1"])
  used(a1,e2,2011-11-16T16:00:01,[ex:parameter="p2"])

state that the activity, represented by the activity record identified by a1, consumed two entities, represented by entity records identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as value of parameter p2. The semantics of parameter in these records is application specific.

A usage record's id is optional. It must be present when annotating usage records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation Record).

A reference to a given entity record may appear in multiple usage records that share a given activity record identifier.

For any entity, the following temporal constraint holds: the generation of an entity always precedes any of its usages.
Given an activity record identified by a, an entity record identified by e, a set of attribute-value pairs attrs, and optional time t, if assertion used(a,e,attrs) or used(a,e,attrs,t) holds, then the following temporal constraint holds: the usage of the entity represented by entity record identified by e precedes the end of activity represented by record identified by a and follows its start.
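The two temporal constraints above could be checked for a single usage, when the relevant times happen to be known, along the following lines. The function name usage_is_consistent is hypothetical, the times in the example are made up for illustration, and the use of non-strict comparisons is an assumption of the sketch.

```python
from datetime import datetime


def _t(stamp):
    """Parse an ISO 8601 time string such as 2011-11-16T16:05:30."""
    return datetime.fromisoformat(stamp)


def usage_is_consistent(usage, activity_start, activity_end, generation=None):
    """Check one usage time against the activity interval and, if known,
    against the generation time of the used entity."""
    ok = _t(activity_start) <= _t(usage) <= _t(activity_end)
    if generation is not None:
        # The generation of an entity precedes any of its usages.
        ok = ok and _t(generation) <= _t(usage)
    return ok


print(usage_is_consistent(
    "2011-11-16T16:05:30",
    "2011-11-16T16:05:00",
    "2011-11-16T16:06:00",
    generation="2011-11-16T15:00:00",
))
```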
Should we define a taxonomy of use? This is ISSUE-23.

5.3.2 Activity-Agent Relation

5.3.2.1 Activity Association Record

The key purpose of agents in PROV-DM is to assign responsibility for activities. It is important to reflect that there are degrees of responsibility among agents; this is a major reason for distinguishing among all the agents that have some association with an activity and determining which ones are really the originators of the entity. For example, a programmer and a researcher could both be associated with running a workflow, but it may not matter which programmer clicked the button to start the workflow, while it would matter a lot which researcher told the programmer to do so. Another example: a student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity; it may not matter which student published the web page, but it matters a lot that the department told the student to put up the web page. So there is some notion of responsibility that needs to be captured.

To this end, PROV-DM offers two kinds of records. The first, introduced in this section, represents an association between an agent and an activity; the second, introduced in Section Responsibility record, represents the fact that an agent was acting on behalf of another, in the context of an activity.

Examples of activity association include designing, participation, initiation and termination, timetabling or sponsoring.

An activity association record, written wasAssociatedWith(id,a,ag2,attrs) in PROV-ASN, has the following constituents:

  • id: an optional identifier id identifying the activity association record;
  • activity: an identifier a for an activity record;
  • agent: an identifier ag2 for an agent record, which represents the agent associated with the activity;
  • attributes: an optional set of attribute-value pairs attrs that describe the modalities of association of this activity with this agent.

In PROV-ASN, an activity association record's text matches the activityAssociationRecord production of the grammar defined in this specification document.

activityAssociationRecord ::= wasAssociatedWith ( identifier, aIdentifier, agIdentifier optional-attribute-values )
In the following example, a programmer agent and a researcher agent are asserted to be associated with an activity.
activity(a,[prov:type="workflow"])
agent(ag1,[prov:type="programmer"])
agent(ag2,[prov:type="researcher"])
wasAssociatedWith(a,ag1,[prov:role="loggedInUser", ex:how="webapp"])
wasAssociatedWith(a,ag2,[prov:role="designer", ex:context="phd"])
5.3.2.2 Start and End Records

A start record is a representation of an agent starting an activity. An end record is a representation of an agent ending an activity. Both relations are specialized forms of wasAssociatedWith. They contain attributes describing the modalities of starting/ending activities.

A start record, written wasStartedBy(id,a,ag,attrs) in PROV-ASN, contains:

  • id: an optional identifier id identifying the start record;
  • activity: an identifier a denoting an activity record, representing the started activity;
  • agent: an identifier ag for an agent record, representing the starting agent;
  • attributes: an optional set of attribute-value pairs attrs, describing modalities according to which the agent started the activity.

An end record, written wasEndedBy(id,a,ag,attrs) in PROV-ASN, contains:

  • id: an optional identifier id identifying the end record;
  • activity: an identifier a denoting an activity record, representing the ended activity;
  • agent: an identifier ag for an agent record, representing the ending agent;
  • attributes: an optional set of attribute-value pairs attrs, describing modalities according to which the agent ended the activity.

In PROV-ASN, the texts of start and end records match the startRecord and endRecord productions of the grammar defined in this specification document.

startRecord ::= wasStartedBy ( identifier, aIdentifier, agIdentifier optional-attribute-values )
endRecord ::= wasEndedBy ( identifier, aIdentifier, agIdentifier optional-attribute-values )

The following assertions

wasStartedBy(a,ag,[ex:mode="manual"])
wasEndedBy(a,ag,[ex:mode="manual"])

state that the activity, represented by the activity record denoted by a, was started and ended by an agent, represented by the record denoted by ag, in "manual" mode, an application-specific characterization of these relations.

Temporal constraints for these relations remain to be written. The temporal constraints should ensure that the agent started its existence before the effect it may have on the activity.

5.3.3 Entity-Entity or Agent-Agent Relation

5.3.3.1 Responsibility Record

To promote take-up, PROV-DM offers a mild version of responsibility in the form of a relation to represent when an agent acted on another agent's behalf. So in the example of someone running a mail program, the program is an agent of that activity and the person is also an agent of the activity, but we would also add that the mail software agent is running on the person's behalf. In the other example, the student acted on behalf of his supervisor, who acted on behalf of the department chair, who acts on behalf of the university, and all those agents are responsible in some way for the activity to take place but we don't say explicitly who bears responsibility and to what degree.

We could also say that an agent can act on behalf of several other agents (a group of agents). This would also make it possible to indirectly reflect chains of responsibility. This also indirectly reflects control without requiring that control is explicitly indicated. In some contexts there will be a need to represent responsibility explicitly, for example to indicate legal responsibility, and that could be added as an extension to this core model. The same applies to control, since in particular contexts there might be a need to define specific aspects of control that various agents exert over a given activity.

Given an activity association record wasAssociatedWith(a,ag2,attrs), a responsibility record, written actedOnBehalfOf(id,ag2,ag1,a,attrs) in PROV-ASN, has the following constituents:

  • id: an optional identifier id identifying the responsibility record;
  • subordinate: an identifier ag2 for an agent record, which represents an agent associated with an activity, acting on behalf of the responsible agent;
  • responsible: an identifier ag1 for an agent record, which represents the agent on behalf of which the subordinate agent ag2 acts;
  • activity: an optional identifier a of an activity record for which the responsibility record holds;
  • attributes: an optional set of attribute-value pairs attrs that describe the modalities of this relation.
responsibilityRecord ::= actedOnBehalfOf ( identifier, agIdentifier, agIdentifier, aIdentifier optional-attribute-values )
In the following example, a programmer agent, a researcher agent and a funder agent are asserted. The programmer and researcher are associated with a workflow activity. The programmer acts on behalf of the researcher (delegation), encoding the commands specified by the researcher; the researcher acts on behalf of the funder, who has a contractual agreement with the researcher.
activity(a,[prov:type="workflow"])
agent(ag1,[prov:type="programmer"])
agent(ag2,[prov:type="researcher"])
agent(ag3,[prov:type="funder"])
wasAssociatedWith(a,ag1,[prov:role="loggedInUser"])
wasAssociatedWith(a,ag2)
actedOnBehalfOf(ag1,ag2,a,[prov:type="delegation"])
actedOnBehalfOf(ag2,ag3,a,[prov:type="contract"])
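The chain of responsibility suggested by the example above can be followed with a short sketch. The function name responsibility_chain is hypothetical, and the sketch assumes that each subordinate acts on behalf of at most one responsible agent for the activity considered, which PROV-DM does not require.

```python
def responsibility_chain(agent, acted_on_behalf_of):
    """Follow actedOnBehalfOf links upward from the given agent.

    acted_on_behalf_of: dict mapping subordinate identifier -> responsible
                        identifier, for one activity (an assumption of this sketch).
    """
    chain = [agent]
    while chain[-1] in acted_on_behalf_of:
        responsible = acted_on_behalf_of[chain[-1]]
        if responsible in chain:
            break  # guard against cyclic assertions
        chain.append(responsible)
    return chain


# Links taken from the two actedOnBehalfOf records above.
links = {"ag1": "ag2", "ag2": "ag3"}
print(responsibility_chain("ag1", links))
```

The sketch only enumerates asserted links; it does not claim that actedOnBehalfOf itself is transitive.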
5.3.3.2 Derivation Record

In PROV-DM, a derivation record is a representation that some entity is transformed from, created from, or affected by another entity in the world.

Examples of derivation include the transformation of a canvas into a painting, the transportation of a person from London to New York, the transformation of a relational table into a linked data set, and the melting of ice into water.

According to Section Conceptualization, for an entity to be transformed from, created from, or affected by another in some way, there must be some underpinning activities performing the necessary actions resulting in such a derivation. However, asserters may not assert or have knowledge of these activities and associated details: they may not assert or know their number, they may not assert or know their identity, they may not assert or know the attributes characterizing how the relevant entities are used or generated. To accommodate the varying circumstances of the various asserters, PROV-DM allows more or less precise records of derivation to be asserted. Hence, PROV-DM uses the terms precise and imprecise to characterize the different kinds of derivation record. We note that the derivation itself is exact (i.e., deterministic, non-probabilistic), but it is its description, expressed in a derivation record, that may be imprecise.

The lack of precision may come from two sources:

  • the number of activities that underpin a derivation is not asserted or known, or
  • any of the other details that are involved in the derivation is not asserted or known; these include activity identities, generation and usage records, and their attributes.

Hence, given a precision axis, with values precise and imprecise, and an activity axis, with values one activity and n activities, we can then form a matrix of possible derivations, precise or imprecise, or corresponding to one activity or n activities. Out of the four possibilities, PROV-DM offers three forms of derivation, while the fourth one is not meaningful. The following table summarises names for the three kinds of derivation, which we then explain.

PROV-DM Derivation Type Summary

                                   precision axis
activity axis     precise                         imprecise
one activity      precise-1 derivation record     imprecise-1 derivation record
n activities      ---                             imprecise-n derivation record
  • The asserter asserts that derivation is due to exactly one activity, and all the details are asserted. We call this a precise-1 derivation record.
  • The asserter asserts that derivation is due to exactly one activity, but other details, whether known or unknown, are not asserted. We call this an imprecise-1 derivation record.
  • The asserter does not know how many activities are involved in the derivation, and other details, whether known or unknown, are also not asserted. We call this an imprecise-n derivation record.

We note that the fourth theoretical case, a precise derivation in which the number of activities is not known or asserted, cannot occur.

The three kinds of derivation records are successively introduced. To minimize the number of relation types in PROV-DM, we introduce a PROV-DM reserved attribute steps, which allows us to distinguish the various derivation types.

A precise-1 derivation record, written wasDerivedFrom(id, e2, e1, a, g2, u1, attrs) in PROV-ASN, contains:

  • id: an optional identifier id identifying the derivation record;
  • generatedEntity: the identifier e2 of an entity record, which is a representation of the generated entity;
  • usedEntity: the identifier e1 of an entity record, which is a representation of the used entity;
  • activity: an identifier a of an activity record, which is a representation of the activity using and generating the above entities;
  • generation: an identifier g2 of the generation record pertaining to e2 and a;
  • usage: an identifier u1 of the usage record pertaining to e1 and a;
  • attributes: an optional set of attribute-value pairs attrs that describe the modalities of this derivation, optionally including the attribute-value pair prov:steps="1".

It is optional to include the attribute prov:steps in a precise-1 derivation since the record already refers to the one and only one activity underpinning the derivation.

An imprecise-1 derivation record, written wasDerivedFrom(id, e2,e1, attrs) in PROV-ASN, contains:

  • id: an optional identifier id identifying the derivation record;
  • generatedEntity: the identifier e2 of an entity record, which is a representation of the generated entity;
  • usedEntity: the identifier e1 of an entity record, which is a representation of the used entity;
  • attributes: a set of attribute-value pairs attrs that describe the modalities of this derivation; it must include the attribute-value pair prov:steps="1".

An imprecise-1 derivation must include the attribute prov:steps, since it is the only means to distinguish this record from an imprecise-n derivation record.

An imprecise-n derivation record, written wasDerivedFrom(id, e2, e1, attrs) in PROV-ASN, contains:

  • id: an optional identifier id identifying the derivation record;
  • generatedEntity: the identifier e2 of an entity record, which is a representation of the generated entity;
  • usedEntity: the identifier e1 of an entity record, which is a representation of the used entity;
  • attributes: an optional set of attribute-value pairs attrs that describe the modalities of this derivation; it optionally includes the attribute-value pair prov:steps="n".

It is optional to include the attribute prov:steps in an imprecise-n derivation record. It defaults to prov:steps="n".
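As a sketch of how the three kinds can be told apart, the following function classifies a derivation record from the number of its arguments and the value of prov:steps. The function name derivation_kind is hypothetical, and the optional record identifier id is assumed to have been removed from the argument tuple beforehand.

```python
def derivation_kind(args, attrs):
    """Classify a wasDerivedFrom record.

    args:  tuple of identifiers without the optional record id, either
           (e2, e1, a, g2, u1) or (e2, e1)
    attrs: dict of attribute-value pairs
    """
    steps = attrs.get("prov:steps")
    if len(args) == 5:
        if steps not in (None, "1"):
            raise ValueError("a precise-1 derivation cannot have prov:steps other than 1")
        return "precise-1"
    if len(args) == 2:
        if steps == "1":
            return "imprecise-1"
        if steps in (None, "n"):
            # prov:steps defaults to "n" when absent.
            return "imprecise-n"
    raise ValueError("not a recognised derivation record")


print(derivation_kind(("e3", "e2"), {"prov:steps": "1"}))
```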

None of the three kinds of derivation is defined to be transitive. Domain-specific specializations of these derivations may be defined in such a way that the transitivity property holds.

In PROV-ASN, a derivation record's text matches the derivationRecord production of the grammar defined in this specification document.

derivationRecord ::= precise-1-derivationRecord | imprecise-1-derivationRecord | imprecise-n-derivationRecord

precise-1-derivationRecord ::= wasDerivedFrom ( identifier, eIdentifier , eIdentifier , aIdentifier , gIdentifier , uIdentifier optional-attribute-values )
imprecise-1-derivationRecord ::= wasDerivedFrom ( identifier, eIdentifier , eIdentifier , attribute-values )
imprecise-n-derivationRecord ::= wasDerivedFrom ( identifier, eIdentifier , eIdentifier optional-attribute-values )
The grammar should make it clear that attribute prov:steps="1" is required for imprecise-1-derivationRecord.
PM: suggestion -- remove the distinction between imprecise-1 and imprecise-n in the grammar and instead explain that the qualification (1 vs n) is through attribute prov:steps.

The following assertions state the existence of derivations.

wasDerivedFrom(e5,e3,a4,g2,u2,[])
wasDerivedFrom(e5,e3,a4,g2,u2,[prov:steps="1"])

wasDerivedFrom(e3,e2,[prov:steps="1"])

wasDerivedFrom(e2,e1,[])
wasDerivedFrom(e2,e1,[prov:steps="n"])

The first two are precise-1 derivation records expressing that the activity represented by the activity a4, by using the entity denoted by e3 according to usage record u2 derived the entity denoted by e5 and generated it according to generation record g2. The third record is an imprecise-1 derivation, which is similar for e3 and e2, but it leaves the activity record and associated attributes implicit. The fourth and fifth records are imprecise-n derivation records between e2 and e1, but no information is provided as to the number and identity of activities underpinning the derivation.

A precise-1 derivation record is richer than an imprecise-1 derivation record, which is itself more informative than an imprecise-n derivation record. Hence, the following implications hold.

Given two entity records denoted by e1 and e2, if the assertion wasDerivedFrom(e2, e1, a, g2, u1, attrs) holds for some generation record identified by g2, and usage record identified by u1, then wasDerivedFrom(e2,e1,[prov:steps="1"] ∪ attrs) also holds.
Given two entity records denoted by e1 and e2, if the assertion wasDerivedFrom(e2, e1, [prov:steps="1"] ∪ attrs) holds, then wasDerivedFrom(e2,e1,attrs) also holds.
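The two implications can be sketched as a function that lists the weaker derivation records entailed by a given one. The function name implied_derivations and the (args, attrs) encoding are assumptions made for illustration; the optional record identifier is omitted.

```python
def implied_derivations(args, attrs):
    """Return the weaker derivation records implied by the given record.

    args:  (e2, e1, a, g2, u1) for precise-1, or (e2, e1) otherwise
    attrs: dict of attribute-value pairs
    Each result is a pair ((e2, e1), attrs).
    """
    rest = {k: v for k, v in attrs.items() if k != "prov:steps"}
    implied = []
    if len(args) == 5:
        e2, e1 = args[0], args[1]
        # precise-1 implies imprecise-1, which in turn implies imprecise-n.
        implied.append(((e2, e1), {"prov:steps": "1", **rest}))
        implied.append(((e2, e1), rest))
    elif len(args) == 2 and attrs.get("prov:steps") == "1":
        # imprecise-1 implies imprecise-n.
        implied.append((tuple(args), rest))
    return implied


print(implied_derivations(("e5", "e3", "a4", "g2", "u2"), {}))
```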

If a derivation record holds for e2 and e1, then this means that the entity represented by the entity record identified by e1 has an influence on the entity represented by the entity record identified by e2, which at the minimum implies temporal ordering, specified as follows. First, we consider one-activity derivations.

Given an activity record identified by a, entity records identified by e1 and e2, generation record identified by g2, and usage record identified by u1, if the record wasDerivedFrom(e2,e1,a,g2,u1,attrs) or wasDerivedFrom(e2,e1,[prov:steps="1"] ∪ attrs) holds, then the following temporal constraint holds: the usage of entity denoted by e1 precedes the generation of the entity denoted by e2.

Then, imprecise-n derivations.

Given two entity records denoted by e1 and e2, if the record wasDerivedFrom(e2,e1,[prov:steps="n"] ∪ attrs) holds, then the following temporal constraint holds: the generation event of the entity denoted by e1 precedes the generation event of the entity denoted by e2.

Note that temporal ordering is between generations of e1 and e2, as opposed to precise-1 derivation, which implies temporal ordering between the usage of e1 and generation of e2. Indeed, in the case of imprecise-n derivation, nothing is known about the usage of e1, since there is no associated activity.
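
The difference between the two temporal constraints can be sketched as follows. The event labels, the use of integer positions on a timeline, and the strict reading of "precedes" are all assumptions of this illustration.

```python
# Illustrative only: event labels and integer positions are assumptions.
def implied_ordering(steps):
    """Return the (earlier, later) events implied by wasDerivedFrom(e2,e1,...)."""
    if steps == "1":
        # precise-1 and imprecise-1: usage of e1 precedes generation of e2
        return ("usage(e1)", "generation(e2)")
    # imprecise-n: only the two generation events can be ordered
    return ("generation(e1)", "generation(e2)")


def satisfied(positions, steps):
    """Check the implied ordering against hypothetical event positions."""
    earlier, later = implied_ordering(steps)
    return positions[earlier] < positions[later]


timeline = {"generation(e1)": 1, "usage(e1)": 2, "generation(e2)": 3}
late_use = {"generation(e1)": 1, "generation(e2)": 3, "usage(e1)": 5}
```

With late_use, the one-activity constraint fails while the imprecise-n constraint is still satisfied, since the latter says nothing about the usage of e1.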

The imprecise-1 derivation has the same meaning as the precise-1 derivation, except that an activity is known to exist, though it does not need to be asserted. This is formalized by the following inference rule, referred to as activity introduction:

If wasDerivedFrom(e2,e1) holds, then there exist an activity record identified by a, a usage record identified by u, and a generation record identified by g such that:
activity(a,aAttrs)
wasGeneratedBy(g,e2,a,gAttrs)
used(u,a,e1,uAttrs)
for sets of attribute-value pairs gAttrs, uAttrs, and aAttrs.
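
As an illustration of activity introduction, the following sketch produces the three records whose existence the rule guarantees for a derivation between e2 and e1. The tuple encoding and the generated identifiers (_a1, _g1, _u1) are hypothetical; the rule itself only states that such records exist and does not prescribe how they are named.

```python
import itertools

# Hypothetical source of fresh identifiers for the introduced records.
_fresh = itertools.count(1)


def introduce_activity(e2, e1):
    """Return activity, generation and usage records implied by the rule."""
    n = next(_fresh)
    a, g, u = f"_a{n}", f"_g{n}", f"_u{n}"
    return [
        ("activity", a),               # activity(a,aAttrs)
        ("wasGeneratedBy", g, e2, a),  # wasGeneratedBy(g,e2,a,gAttrs)
        ("used", u, a, e1),            # used(u,a,e1,uAttrs)
    ]


records = introduce_activity("e3", "e2")
```

Attribute sets are omitted here because the rule leaves gAttrs, uAttrs, and aAttrs unspecified.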

Note that inferring derivation from usage and generation does not hold in general. Indeed, when a generation wasGeneratedBy(g, e2, a, attrs2) precedes used(u, a, e1, attrs1), for some e1, e2, attrs1, attrs2, and a, one cannot infer derivation wasDerivedFrom(e2, e1, a, g, u) or wasDerivedFrom(e2,e1), since e2 cannot possibly be determined by e1, given that the creation of e2 precedes the use of e1.

The following property holds for account where generation-unicity applies. Move it to separate section with all related material.

A further inference is permitted from the imprecise-1 derivation record:

Given an activity record identified by pe, entity records identified by e1 and e2, and set of attribute-value pairs attrs2, if wasDerivedFrom(e2,e1, [prov:steps="1"]) and wasGeneratedBy(e2,pe,attrs2) hold, then used(pe,e1,attrs1) also holds for some set of attribute-value pairs attrs1.

This inference is justified by the fact that the entity represented by entity record identified by e2 is generated by at most one activity in a given account (see generation-unicity). Hence, this activity record is also the one referred to in the usage record of e1.
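
A minimal sketch of this inference follows. It assumes generation-unicity, which is modelled by a dictionary mapping each entity identifier to the single activity that generated it; the function name and data layout are illustrative only.

```python
def infer_usages(derivations, generations):
    """Infer (activity, entity) usages within one account.

    derivations: set of (e2, e1) pairs asserted with prov:steps="1"
    generations: dict mapping an entity identifier to its unique activity
    """
    return {(generations[e2], e1)
            for (e2, e1) in derivations
            if e2 in generations}


derivations = {("e3", "e2")}  # wasDerivedFrom(e3,e2,[prov:steps="1"])
generations = {"e3": "pe"}    # wasGeneratedBy(e3,pe,attrs2)
usages = infer_usages(derivations, generations)  # used(pe,e2,attrs1)
```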

We note that the converse inference does not hold. From wasDerivedFrom(e2,e1) and used(pe,e1), one cannot derive wasGeneratedBy(e2,pe,attrs2) because identifier e1 may occur in usage records referring to many activity records, but they may not be referred to in generation records containing identifier e2.

Should derivation have a time? Which time? This is ISSUE-43.

5.3.3.3 Complementarity Record

While the working group recognizes the importance of the complementarity record concept, its name and its exact semantics are still being discussed.

A complementarity record is a relationship between two entities stated to have compatible characterization over some continuous interval between two events.

The rationale for introducing this relationship is that in general, at any given time, for an entity in the world, there may be multiple ways of characterizing it, and hence multiple representations can be asserted by different asserters. In the example that follows, suppose the thing "Royal Society" is represented by two asserters, each using a different set of attributes. If the asserters agree that both representations refer to "The Royal Society", the question of whether any correspondence can be established between the two representations arises naturally. This is particularly relevant when (a) the sets of attributes used by the two representations overlap partially, or (b) one set is subsumed by the other. In both cases, each of the two asserters has a partial view of "The Royal Society", and establishing a correspondence between them on the shared attributes is beneficial: in case (a), each of the two representations complements the other, and in case (b), one of the two (the one with the additional attributes) complements the other.

This intuition is made more precise by considering the entities that form the representations of entities at a certain point in time. An entity record represents, by means of attribute-value pairs, a thing and its situation in the world, which remain constant over a characterization interval. As soon as the thing's situation changes, this marks the end of the characterization interval for the entity record representing it. The thing's novel situation is represented by an attribute with a new value, or an entirely different set of attribute-value pairs, embodied in another entity record, with a new characterization interval. Thus, if we overlap the timelines (or, more generally, the sequences of value-changing events) for the two entities, we can hope to establish correspondences amongst the entity records that represent them at various points along that events line. The figure below illustrates this intuition.

illustration complementOf

Relation complement-of between two entity records is intended to capture these correspondences, as follows. Suppose entity records A and B share a set P of attributes, and each of them has other attributes in addition to P. If the values assigned to each attribute in P are compatible between A and B, then we say that A is-complement-of B, and B is-complement-of A, in a symmetrical fashion. In the particular case where the set of attributes of B is a strict superset of A's attributes, we say that B is-complement-of A, but the opposite does not hold; in this case, the relation is not symmetric. As a special case, A and B may not share any attributes at all, and yet the asserters may still stipulate that they are representing the same thing "Royal Society"; the symmetric relation may hold trivially in this case.

The term compatible used above means that a mapping can be established amongst the values of the attributes in P found in the two entity records. This generalizes to the case where attribute sets P1 and P2 of A and B, respectively, are not identical but can be mapped to one another. The simplest case is the identity mapping, in which A and B share attribute set P, and furthermore the values assigned to attributes in P match exactly.

It is important to note that the relation holds only for the characterization intervals of the entity records involved. As soon as one attribute changes value in one of them, new correspondences need to be found amongst the new entities. Thus, the relation has a validity span that can be expressed in terms of the event lines of the entity.

A complementarity record is written wasComplementOf(e2,e1), where e1 and e2 are two identifiers denoting entity records.

The following example illustrates the entity "Royal Society" and its perspectives at various points in time.

entity(rs,[ex:created=1870])

entity(rs_l1,[prov:location="loc2"])
entity(rs_l2,[prov:location="The Mall"])

entity(rs_m1,[ex:membership=250, ex:year=1900])
entity(rs_m2,[ex:membership=300, ex:year=1945])
entity(rs_m3,[ex:membership=270, ex:year=2010])

wasComplementOf(rs_m3, rs_l2)
wasComplementOf(rs_m2, rs_l1)
wasComplementOf(rs_m2, rs_l2)
wasComplementOf(rs_m1, rs_l1)

wasComplementOf(rs_m3, rs)
wasComplementOf(rs_m2, rs)
wasComplementOf(rs_m1, rs)
wasComplementOf(rs_l1, rs)
wasComplementOf(rs_l2, rs)

An assertion "wasComplementOf(B,A)" holds over the temporal intersection of A and B, only if:
  1. if a mapping can be established from an attribute X of entity record identified by B to an attribute Y of entity record identified by A, then the values of A and B must be consistent with that mapping;
  2. entity record identified by B has some attribute that entity record identified by A does not have.

The complementarity relation is not transitive. Let us consider identifiers e1, e2, and e3 identifying three entity records such that wasComplementOf(e3,e2) and wasComplementOf(e2,e1) hold. The record wasComplementOf(e3,e1) may not hold because the characterization intervals of the denoted entity records may not overlap.
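
The necessary conditions above, and the failure of transitivity, can be illustrated with the following sketch. It assumes the identity mapping between attributes, represents each characterization interval as a hypothetical (start, end) pair of numbers, and treats a non-empty temporal intersection as a prerequisite; the attribute names and intervals are invented for illustration. Since the conditions are stated as necessary ("only if"), the function reports whether the relation may hold, not that it does.

```python
def may_be_complement_of(b, a):
    """Check the necessary conditions for wasComplementOf(b, a).

    b, a: dicts with "attrs" (attribute -> value) and "interval" (start, end).
    Assumes the identity mapping between attribute names.
    """
    start = max(a["interval"][0], b["interval"][0])
    end = min(a["interval"][1], b["interval"][1])
    overlap = start < end  # non-empty temporal intersection
    shared = a["attrs"].keys() & b["attrs"].keys()
    consistent = all(a["attrs"][k] == b["attrs"][k] for k in shared)  # condition 1
    extra = bool(b["attrs"].keys() - a["attrs"].keys())               # condition 2
    return overlap and consistent and extra


e1 = {"attrs": {"ex:x": 1}, "interval": (0, 10)}
e2 = {"attrs": {"ex:x": 1, "ex:y": 2}, "interval": (5, 20)}
e3 = {"attrs": {"ex:x": 1, "ex:y": 2, "ex:z": 3}, "interval": (15, 30)}
```

Here the conditions are met for (e2,e1) and for (e3,e2), but not for (e3,e1), because the intervals of e3 and e1 do not overlap.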

In PROV-ASN, a complementarity record's text matches the complementarityRecord production of the grammar defined in this specification document.

complementarityRecord ::= wasComplementOf ( eIdentifier , eIdentifier optional-attribute-values )
| wasComplementOf ( eIdentifier , accIdentifier , eIdentifier , accIdentifier optional-attribute-values )

An entity record identifier can optionally be accompanied by an account identifier. When this is the case, it becomes possible to link two entity record identifiers that appear in different accounts. (In particular, the entity record identifiers in two different accounts are allowed to be the same.) When account identifiers are not available, the linking of entity records through complementarity can only take place within the scope of a single account.

In the following example, the same description of the Royal Society is structured according to two different accounts. In the second account, we find a complementarity record linking rs_m1 in account ex:acc2 to rs in account ex:acc1.

account(ex:acc1,
        http://example.org/asserter1, 

    ...
    entity(rs,[ex:created=1870])
    ...
    )


account(ex:acc2,
        http://example.org/asserter2, 

    ...
    entity(rs_m1,[ex:membership=250, ex:year=1900])
    ...
    wasComplementOf(rs_m1, ex:acc2, rs, ex:acc1)

)

It is suggested that the name 'wasComplementOf' does not capture the meaning of this relation adequately. No concrete suggestion has been made so far. Furthermore, there is a suggestion that an alternative relation that is transitive may also be useful. This is raised in the following email.
A discussion on alternative definition of wasComplementOf has not reached a satisfactory conclusion yet. This is ISSUE-29.
Comments on ivpof in ISSUE-57.

5.3.4 Annotation Record

An annotation record establishes a link between an identifiable PROV-DM record and a note record referred to by its identifier. Multiple note records can be associated with a given PROV-DM record; symmetrically, multiple PROV-DM records can be associated with a given note record. Since note records have identifiers, they can also be annotated. The annotation mechanism (with note record and the annotation record) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).

An annotation record, written hasAnnotation(r,n,attrs) in PROV-ASN, has the following constituents:

  • record: an identifier r of the record being annotated;
  • note: an identifier n of a note record;
  • attributes: an optional set attrs of attribute-value pairs to further describe this record.

In PROV-ASN, an annotation record's text matches the annotationRecord production of the grammar defined in this specification document.

annotationRecord ::= hasAnnotation ( identifier , nIdentifier optional-attribute-values )

The interpretation of notes is application-specific. See Section Note for a discussion of the difference between note attributes and other record attributes. We also note the present tense in this term to indicate that it may not denote something in the past.

The following records

entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
activity(a,transform,t1,t2,[])
used(u1,a,e1,[ex:file="stdin"])
wasGeneratedBy(e2, a, [ex:file="stdout"])

note(n1,[ex:icon="doc.png"])
hasAnnotation(e1,n1)
hasAnnotation(e2,n1)

note(n2,[ex:style="dotted"])
hasAnnotation(u1,n2)

assert the existence of two documents in the world (attribute-value pair: prov:type="document") represented by entity records identified by e1 and e2, and annotate these records with a note indicating that the icon (an application-specific way of rendering provenance) is doc.png. They also assert an activity, its usage of the first entity, and its generation of the second entity. The usage record is annotated with a style (an application-specific way of rendering this edge graphically). To be able to express this annotation, the usage record was provided with an identifier u1, which was then referred to in hasAnnotation(u1,n2).

5.4 Bundle

In this section, two constructs are introduced to group PROV-DM records. The first one, the account record, is itself a record, whereas the second one, the record container, is not.

5.4.1 Account Record

In PROV-DM, an account record is a wrapper of records with a dual purpose:

  • It is the mechanism by which attribution of provenance can be asserted; it allows asserters to bundle up their assertions, and assert suitable attribution;
  • It provides a scoping mechanism for record identifiers and for some constraints (such as generation-unicity and derivation-use).

An account record, written account(id, assertIRI, recs, attrs) in PROV-ASN, contains:

  • id: an identifier id that identifies this account globally;
  • asserter: an IRI, denoted by assertIRI, to identify an asserter; such IRI has no specific interpretation in the context of PROV-DM;
  • records: a set recs of provenance records;
  • attributes: an optional set attrs of attribute-value pairs to further describe this record.

In PROV-ASN, an account record's text matches the accountRecord production of the grammar defined in this specification document.

accountRecord ::= account ( identifier , asserter , record optional-attribute-values )

Currently, the non-terminal asserter is defined as IRI and its interpretation is outside PROV-DM. We may want the asserter to be an agent instead, and therefore use PROV-DM to express the provenance of PROV-DM assertions. The editors seek inputs on how to resolve this issue.

The following account record

account(ex:acc0,
        http://example.org/asserter, 
          entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
          ...
          wasDerivedFrom(e2,e1)
          ...
          activity(a0,create-file,t)
          ...
          wasGeneratedBy(e0,a0,[])     
          ...
          wasAssociatedWith(a4, ag5, [prov:role="communicator"])  )

contains the set of provenance records of section example-prov-asn-encoding, is asserted by agent http://example.org/asserter, and is identified by identifier ex:acc0.

Account records constitute a scope for record identifiers. A record identifier within the scope of an account is intended to denote a single record. However, nothing prevents an asserter from asserting an account containing, for example, multiple entity records with the same identifier but different attribute-values. In that case, they should be understood as a single entity record with this identifier and the union of all attribute-values, as formalized in identified-entity-in-account.

Given an entity record identifier e, two sets of attribute-values denoted by av1 and av2, two entity records entity(e,av1) and entity(e,av2) occurring in an account are equivalent to the entity record entity(e,av) where av is the set of attribute-value pairs formed by the union of av1 and av2.

Whilst constraint identified-entity-in-account specifies how to understand multiple entity records with the same identifier within a given account, it does not guarantee that the entity record formed with the union of all attribute-value pairs is consistent. Indeed, a given attribute may be assigned multiple values, resulting in an inconsistent entity record, as illustrated by the following example.

In the following account record, we find two entity records with the same identifier e.

account(ex:acc1,
        http://example.org/id,
          entity(e,[prov:type="person", ex:age=20])
          entity(e,[prov:type="person", ex:age=30])
          ...)

Application of identified-entity-in-account results in an entity record containing the attribute-value pairs age=20 and age=30. This results in an inconsistent characterization of a person. We note that deciding whether a set of attribute-values is consistent or not is application specific and outside the scope of this specification.
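
The effect of identified-entity-in-account on the example above can be sketched as follows. The list-of-pairs encoding and the helper multi_valued are assumptions of this illustration; in particular, multi_valued only flags attributes that received several values, and whether that constitutes an inconsistency remains application specific.

```python
from collections import defaultdict


def merge_entity_records(records):
    """Union the attribute-value pairs of entity records sharing an identifier.

    records: iterable of (identifier, [(attribute, value), ...])
    """
    merged = defaultdict(set)
    for ident, pairs in records:
        merged[ident].update(pairs)
    return dict(merged)


def multi_valued(pairs):
    """Return the attributes that occur with more than one value."""
    values = defaultdict(set)
    for attr, value in pairs:
        values[attr].add(value)
    return {attr for attr, vals in values.items() if len(vals) > 1}


acc1 = [
    ("e", [("prov:type", "person"), ("ex:age", 20)]),
    ("e", [("prov:type", "person"), ("ex:age", 30)]),
]
merged = merge_entity_records(acc1)
```

For this account, the merged record for e carries both ex:age=20 and ex:age=30, and multi_valued(merged["e"]) reports ex:age as the only attribute with several values.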

Account records can be nested since an account record can occur among the records being wrapped by another account.

An account is said to be well-formed if it satisfies the constraints generation-unicity and derivation-use.

The union of two accounts is another account, containing the unions of their respective records, where records with the same identifier should be understood according to constraint identified-entity-in-account. Well-formed accounts are not closed under union because the constraint generation-unicity may no longer be satisfied in the resulting union.

Indeed, let us consider another account record

account(ex:acc2,
        http://example.org/asserter2, 
          entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
          ...
          activity(a1,create-file,t1)
          ...
          wasGeneratedBy(e0,a1,[ex:fct="create"])     
          ... )

with identifier ex:acc2, containing assertions by the asserter http://example.org/asserter2 stating that the entity represented by the entity record identified by e0 was generated by an activity represented by the activity record identified by a1, instead of a0 as in the previous account ex:acc0. If accounts ex:acc0 and ex:acc2 are merged together, the resulting set of records violates generation-unicity.
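
The loss of well-formedness under union can be illustrated by the following sketch. It simplifies generation-unicity to "at most one generating activity per entity within an account" and ignores attributes; the set-of-pairs encoding is an assumption of this illustration.

```python
def generation_unicity_violations(generations):
    """Return the entities generated by more than one activity.

    generations: iterable of (entity, activity) pairs from a single account
    """
    by_entity = {}
    for entity, activity in generations:
        by_entity.setdefault(entity, set()).add(activity)
    return {e for e, activities in by_entity.items() if len(activities) > 1}


acc0 = {("e0", "a0")}  # wasGeneratedBy(e0,a0,[]) in ex:acc0
acc2 = {("e0", "a1")}  # wasGeneratedBy(e0,a1,[ex:fct="create"]) in ex:acc2
union = acc0 | acc2
```

Each account satisfies the simplified constraint on its own, whereas their union reports e0 as generated by two distinct activities.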

Account records constitute a scope for record identifiers. Since accounts can be nested, scopes can also be nested; thus, the scope of record identifiers should be understood in the context of such nested scopes. When a record with an identifier occurs directly within an account, then its identifier denotes this record in the scope of this account, except in sub-accounts where records with the same identifier occur.

The following account record is inspired from section example-prov-asn-encoding. This account, identified by ex:acc3, declares entity record with identifier e0, which is being referred to in the nested account ex:acc4. The scope of identifier e0 is account ex:acc3, including subaccount ex:acc4.

account(ex:acc3,
        http://example.org/asserter1, 
          entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
          activity(a0,create-file,t)
          wasGeneratedBy(e0,a0,[])  
          account(ex:acc4,
                  http://example.org/asserter2,
                    entity(e1, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="" ])
                    activity(a0,copy-file,t)
                    wasGeneratedBy(e1,a0,[ex:fct="create"])
                    wasComplementOf(e1,e0)))

By contrast, an activity record identified by a0 occurs in each of the two accounts. Each of these activity records is therefore asserted in a separate scope, and the two may represent different activities in the world.

The account record is the hook by which further meta information can be expressed about provenance, such as asserter, time of creation, signatures. The annotation mechanism can be used for this purpose, but how general meta-information is expressed is beyond the scope of this specification, except for asserters.

5.4.2 Record Container

A record container is a house-keeping construct of PROV-DM, also capable of bundling PROV-DM records. A record container is not a record, but can be exploited to return assertions in response to a request for the provenance of something ([PROV-PAQ]).

A record container, written container decls recs endContainer in PROV-ASN, contains:

  • namespaceDeclarations: a set decls of namespace declarations, declaring namespaces and associated prefixes, which can be used in attributes and identifiers occurring inside recs;
  • records: a non-empty set of records recs.

All the records in recs are implicitly wrapped in a default account, scoping all the record identifiers they declare directly, and constituting a top-level account in the hierarchy of accounts. Consequently, every provenance record is always expressed in the context of an account, either explicitly in an asserted account, or implicitly in a container's default account.

In PROV-ASN, a record container's text matches the recordContainer production of the grammar defined in this specification document.

recordContainer ::= container namespaceDeclarations record endContainer

The following container

container
  prefix ex: http://example.org/,

  account(ex:acc1,http://example.org/asserter1,...)
  account(ex:acc2,http://example.org/asserter1,...)
endContainer

illustrates how two accounts with identifiers ex:acc1 and ex:acc2 can be returned in a PROV-ASN serialization of the provenance of something.

Asserter needs to be defined. This is ISSUE-51.
Scope and Identifiers. This is ISSUE-81.

5.5 Further Terms in Records

This section defines further terms used in PROV-DM records.

5.5.1 Attribute

An attribute is a qualified name. A qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part (see detailed rule in [RDF-SPARQL-QUERY], Section 4.1.1).

attribute ::= qualifiedName
qualifiedName  ::= prefixedName | unprefixedName
prefixedName  ::= prefix : localPart
unprefixedName  ::= localPart
prefix  ::= a name without colon compatible with the NC_NAME production [XML-NAMES]
localPart  ::= a name without colon compatible with the NC_NAME production [XML-NAMES]

A qualified name's prefix is optional. If a prefix occurs in a qualified name, it refers to a namespace declared in the record container. In the absence of prefix, the qualified name refers to the default namespace declared in the container.
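
The mapping of a qualified name to an IRI by concatenation can be sketched as follows. The function name expand, the dictionary of prefix bindings, and the default namespace IRI used in the example are assumptions made for illustration; the detailed rule is the one referenced in [RDF-SPARQL-QUERY].

```python
def expand(qname, prefixes, default=None):
    """Map a qualified name to an IRI by concatenating namespace and local part.

    prefixes: dict of prefix -> namespace IRI declared in the record container
    default:  default namespace IRI, if one is declared
    """
    prefix, sep, local = qname.partition(":")
    if sep:
        # prefixedName: the prefix must be declared in the container
        return prefixes[prefix] + local
    if default is None:
        raise ValueError("no default namespace declared for " + qname)
    # unprefixedName: resolved against the default namespace
    return default + qname


namespaces = {"ex": "http://example.org/"}
iri = expand("ex:acc1", namespaces)
```

An undeclared prefix raises KeyError in this sketch; how such an error should be reported is not addressed here.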

Note that the XML NC_NAME production does not allow local identifiers to start with a number. Instead, should we use the productions used in SPARQL or TURTLE?

From this specification's viewpoint, the interpretation of an attribute declared in a namespace other than prov-dm is out of scope.

The PROV data model introduces a fixed set of attributes in the PROV-DM namespace:

  • The attribute prov:role denotes the function of an entity with respect to an activity, in the context of a usage, generation, activity association, start, or end record. The value associated with a prov:role attribute must be conformant with Literal.

    The following start record describes the role of the agent identified by ag in this start relation with activity a.

       wasStartedBy(a,ag, [prov:role="program-operator"])
    
  • The attribute prov:type provides further typing information for the element or relation asserted in the record. The value associated with a prov:type attribute must be conformant with Literal.

    The following record declares an agent of type software agent

       agent(ag, [prov:type="prov:SoftwareAgent" %% xsd:QName])
    
  • The attribute prov:steps defines the level of precision associated with a derivation record. The value associated with a prov:steps attribute must be "1" or "n".

5.5.2 Identifier

An identifier is a qualified name. A qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part (see detailed rule in [RDF-SPARQL-QUERY], Section 4.1.1).

identifier ::= qualifiedName
eIdentifier ::= identifier (intended to denote an entity record)
aIdentifier ::= identifier (intended to denote an activity record)
agIdentifier ::= identifier (intended to denote an agent record)
gIdentifier::= identifier (intended to denote a generation record)
uIdentifier::= identifier (intended to denote a usage record)
nIdentifier::= identifier (intended to denote a note record)
accIdentifier::= identifier (intended to denote an account record)

5.5.3 Literal

A PROV-DM Literal represents a data value such as a particular string or number. A PROV-DM Literal represents a value whose interpretation is outside the scope of PROV-DM.

In PROV-ASN, a Literal's text matches the Literal production of the grammar defined in this specification document.

Literal  ::= typedLiteral | convenienceNotation
typedLiteral ::= quotedString %% datatype
datatype ::= qualifiedName
convenienceNotation  ::= stringLiteral | intLiteral
stringLiteral ::= quotedString
quotedString ::= a finite sequence of characters in which " (U+22) and \ (U+5C) occur only in pairs of the form \" (U+5C, U+22) and \\ (U+5C, U+5C), enclosed in a pair of " (U+22) characters
intLiteral ::= a finite-length sequence of decimal digits (#x30-#x39) with an optional leading negative sign (-)

The non terminals stringLiteral and intLiteral are syntactic sugar for quoted strings with datatype xsd:string and xsd:int, respectively.
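
The desugaring of the convenience notation can be sketched as follows. The regular expressions are an approximation written for this illustration: the quoted-string pattern follows the quotedString production, but the datatype is matched loosely as any run of non-space characters rather than as a full qualifiedName.

```python
import re

# A quoted string in which " and \ occur only as \" and \\.
QUOTED = r'"(?:[^"\\]|\\["\\])*"'


def desugar(text):
    """Return (quoted_string, datatype) for a Literal in either notation."""
    text = text.strip()
    typed = re.fullmatch(rf'({QUOTED})\s*%%\s*(\S+)', text)
    if typed:
        # typedLiteral: already explicit
        return typed.group(1), typed.group(2)
    if re.fullmatch(QUOTED, text):
        # stringLiteral is sugar for a quoted string typed xsd:string
        return text, "xsd:string"
    if re.fullmatch(r'-?[0-9]+', text):
        # intLiteral is sugar for a quoted string typed xsd:int
        return f'"{text}"', "xsd:int"
    raise ValueError(f"not a Literal: {text!r}")
```

Under this sketch, the convenience forms and their explicit counterparts desugar to the same pair.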

In particular, a PROV-DM Literal may be an IRI-typed string (with datatype xsd:anyURI); such IRI has no specific interpretation in the context of PROV-DM.

The following examples respectively are the string "abc" (expressed using the convenience notation), the string "abc", the integer number 1, the integer number 1 (expressed using the convenience notation) and the IRI "http://example.org/foo".

  "abc"
  "abc" %% xsd:string
  "1" %% xsd:int
  1
  "http://example.org/foo" %% xsd:anyURI

The following example shows a literal of type xsd:QName (see QName [XMLSCHEMA-2]). The prefix ex must be bound to a namespace declared in the record container.

  "ex:value" %% xsd:QName

Should we define structural equivalence of literals as in OWL2? [OWL2-SYNTAX] (see section Literals).

5.5.4 Time

Time instants are defined according to xsd:dateTime [XMLSCHEMA-2].

It is optional to assert time in usage, generation, and activity records.

5.5.5 Asserter

An asserter is a creator of PROV-DM records. An asserter is denoted by an IRI. Such IRI has no specific interpretation in the context of PROV-DM.

asserter ::= IRI
IRI ::= an IRI compatible with production IRI in [IRI], enclosed in a pair of < (U+3C) and > (U+3E) characters

Currently, the non-terminal asserter is defined as IRI. We may want the asserter to be an agent instead, and therefore use PROV-DM to express the provenance of PROV-DM. We seek inputs on how to resolve this issue.

5.5.6 Namespace Declaration

A PROV-DM namespace is identified by an IRI reference [IRI]. In PROV-DM, attributes, identifiers, and literals with datatype xsd:QName can be placed in a namespace using the mechanisms described in this specification.

A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace. A default namespace declaration consists of a namespace. Every unprefixed qualified name in the scope of this default namespace declaration refers to this namespace.

namespaceDeclarations ::= | defaultNamespaceDeclaration | namespaceDeclaration namespaceDeclaration
namespaceDeclaration ::= prefix prefix IRI
defaultNamespaceDeclaration ::= default IRI

5.5.8 Location

Location is an identifiable geographic place (ISO 19112). As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, row, column, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations in assertions.

Location is an optional attribute of entity records and activity records. The value associated with a location attribute must be a Literal, expected to denote a location.

6. PROV-DM Common Relations

This section contains the normative specification of common relations of PROV-DM.

We have defined a set of common relations, in response to ISSUE-44. Is this set complete?
The types of these relations need to be made explicit.

The following figure summarizes the additional relations described in subsections 6.2 onwards.

common relations

6.1 Collections

The purpose of this section is to enable modelling of part-of relationships amongst entities. In particular, a form of collection entity type is introduced, consisting of a set of key-value pairs. Key-value pairs provide a generic indexing structure that can be used to model commonly used data structures, including associative lists (also known as "dictionaries" in some programming languages), relational tables, ordered lists, and more.
The relations introduced here are all specializations of the wasDerivedFrom relation, specifically precise-1 or imprecise-1. They are designed to model the insertion of key-value pairs into a collection and their removal from it.

A collection record is defined as follows.
collectionRecord ::= collectionDerivationRecord | keyDerivationRecord | entityMembershipRecord
collectionDerivationRecord ::= wasAddedTo_Coll ( identifier , identifier ) | wasRemovedFrom_Coll ( identifier , identifier )
keyDerivationRecord ::= wasAddedTo_Key ( identifier , identifier ) | wasRemovedFrom_Key ( identifier , identifier )
entityMembershipRecord ::= wasAddedTo_Entity ( identifier , identifier )

Record: wasAddedTo_Coll(c2,c1) (resp. wasRemovedFrom_Coll(c2,c1)) denotes that collection c2 is an updated version of collection c1, following an insertion (resp. deletion) operation.

Record: wasAddedTo_Key(c,k) (resp. wasRemovedFrom_Key(c,k)) denotes that collection c had a new value with key k added to (resp. removed from) it.

Record: wasAddedTo_Entity(c,e) denotes that collection c had entity e added to it.

Consider the following assertions:


  wasAddedTo_Coll(c2,c1)
  wasAddedTo_Key(c2,k1)
  wasAddedTo_Entity(c2,e1)

  wasAddedTo_Coll(c3,c2)
  wasAddedTo_Key(c3,k2)
  wasAddedTo_Entity(c3,e2)

  wasRemovedFrom_Coll(c4,c3)
  wasRemovedFrom_Key(c4,k1)

The corresponding graphical representation is shown below.

collections

With these assertions:

  • c2 is known to contain the key-value pair (k1,e1)
  • c3 is known to contain pairs (k1,e1) and (k2,e2).
  • c4 is known not to contain pair (k1,e1) and to contain pair (k2,e2).
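
The conclusions listed above can be reproduced by replaying the assertions in order. In the following sketch, each group of assertions is encoded as one hypothetical update tuple, and collection c1 is assumed to be empty for illustration (the assertions themselves say nothing about the content of c1).

```python
def apply_updates(initial, updates):
    """Replay collection updates and return the content of every collection.

    initial: dict of collection name -> dict of key -> entity
    updates: list of (op, new, old, key, entity), with op "add" or "remove"
    """
    collections = dict(initial)
    for op, new, old, key, entity in updates:
        content = dict(collections[old])  # the new collection derives from the old
        if op == "add":
            content[key] = entity
        else:
            content.pop(key, None)
        collections[new] = content
    return collections


collections = apply_updates({"c1": {}}, [
    ("add", "c2", "c1", "k1", "e1"),     # wasAddedTo_Coll(c2,c1), _Key(c2,k1), _Entity(c2,e1)
    ("add", "c3", "c2", "k2", "e2"),     # wasAddedTo_Coll(c3,c2), _Key(c3,k2), _Entity(c3,e2)
    ("remove", "c4", "c3", "k1", None),  # wasRemovedFrom_Coll(c4,c3), _Key(c4,k1)
])
```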

6.2 Traceability Record

A traceability record states the existence of a "dependency path" between two entities, indicating that one entity can be shown to be in the lineage of another, and may have influenced it in some way. This relation is transitive.

A traceability record, written tracedTo(id,e2,e1,attrs) in PROV-ASN:

In PROV-ASN, a traceability record's text matches the traceabilityRecord production of the grammar defined in this specification document.

traceabilityRecord ::= tracedTo ( identifier , eIdentifier , eIdentifier optional-attribute-values )

A traceability record can be inferred from existing relations, or can be asserted stating that such a dependency path exists without the asserter knowing its individual steps, as expressed by the following constraints.

Given two identifiers e2 and e1 identifying entity records, the following statements hold:
  1. If wasDerivedFrom(e2,e1,a,g2,u1) holds, for some a, g2, u1, then tracedTo(e2,e1) also holds.
  2. If wasDerivedFrom(e2,e1) holds, then tracedTo(e2,e1) also holds.
  3. If wasGeneratedBy(e2,a,gAttr) and wasAssociatedWith(a,e1) hold, for some a and gAttr, then tracedTo(e2,e1) also holds.
  4. If wasGeneratedBy(e2,a,gAttr), wasAssociatedWith(a,e) and actedOnBehalfOf(e,e1) hold, for some a, e, and gAttr, then tracedTo(e2,e1) also holds.
  5. If wasGeneratedBy(e2,a,gAttr) and wasStartedBy(a,e1,sAttr) hold, for some a, gAttr, and sAttr, then tracedTo(e2,e1) also holds.
  6. If tracedTo(e2,e) and tracedTo(e,e1) hold for some e, then tracedTo(e2,e1) also holds.

If the record tracedTo(r2,r1) holds for two identifiers r2 and r1 identifying entity records, then there exist e0, e1, ..., en for n≥1, with e0=r2 and en=r1, and for any i such that 0≤i≤n-1, at least one of the following statements holds:
  • wasDerivedFrom(ei,ei+1,a,g2,u1) holds, for some a, g2, u1, or
  • wasDerivedFrom(ei,ei+1) holds, or
  • wasBasedOn(ei,ei+1) holds, or
  • wasGeneratedBy(ei,a,gAttr) and wasAssociatedWith(a,ei+1) hold, for some a and gAttr, or
  • wasGeneratedBy(ei,a,gAttr), wasAssociatedWith(a,e) and actedOnBehalfOf(e,ei+1) hold, for some a, e and gAttr, or
  • wasGeneratedBy(ei,a,gAttr) and wasStartedBy(a,ei+1,sAttr) hold, for some a, gAttr, and sAttr.

We note that the previous constraint is not really an inference rule, since there is nothing that we can actually infer. Instead, this constraint should simply be seen as part of the definition of the traceability record.
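
The six inference rules listed earlier in this section can be read as a small fixed-point computation. The following Python sketch is illustrative only: the function name, the argument names, and the encoding of records as pairs of identifiers (with attributes dropped) are assumptions made for the example and are not part of PROV-DM or PROV-ASN.

```python
from itertools import product

def infer_traced_to(derived, generated, associated, delegated, started_by_entity):
    """Return the (e2, e1) pairs for which tracedTo(e2,e1) follows from rules 1-6.

    Each argument is a set of identifier pairs (hypothetical encoding):
      derived:           (e2, e1) for wasDerivedFrom(e2,e1,...)
      generated:         (e, a)   for wasGeneratedBy(e,a,...)
      associated:        (a, x)   for wasAssociatedWith(a,x)
      delegated:         (x, y)   for actedOnBehalfOf(x,y)
      started_by_entity: (a, x)   for wasStartedBy(a,x,...)
    """
    traced = set(derived)                          # rules 1 and 2
    for (e2, a) in generated:
        for (a_assoc, x) in associated:
            if a_assoc == a:
                traced.add((e2, x))                # rule 3
                for (x_del, y) in delegated:
                    if x_del == x:
                        traced.add((e2, y))        # rule 4
        for (a_start, x) in started_by_entity:
            if a_start == a:
                traced.add((e2, x))                # rule 5
    changed = True
    while changed:                                 # rule 6: transitive closure
        changed = False
        for (x, y), (y2, z) in product(list(traced), repeat=2):
            if y == y2 and (x, z) not in traced:
                traced.add((x, z))
                changed = True
    return traced

traced = infer_traced_to(
    derived={("e3", "e2"), ("e2", "e1")},
    generated={("e1", "a1")},
    associated={("a1", "ag1")},
    delegated={("ag1", "ag2")},
    started_by_entity=set(),
)
```

In this example, tracedTo(e3,e1) is obtained by transitivity, and tracedTo(e3,ag2) by combining rules 4 and 6. The sketch only derives records from asserted ones; it does not cover traceability records that are asserted directly without their individual steps being known.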

6.3 Activity Ordering Record

PROV-DM allows dependencies amongst activities to be expressed. An information flow ordering record is a representation that an entity was generated by an activity, before it was used by another activity. A control ordering record is a representation that an activity was initiated by another activity.

In PROV-ASN, an activity ordering record's text matches the activityOrderingRecord production of the grammar defined in this specification document.

activityOrderingRecord ::= informationFlowOrderingRecord | controlOrderingRecord
informationFlowOrderingRecord  ::= wasInformedBy ( identifier , aIdentifier , aIdentifier optional-attribute-values )
controlOrderingRecord  ::= wasStartedBy ( identifier , aIdentifier , aIdentifier optional-attribute-values )

An information flow ordering record, written as wasInformedBy(id,a2,a1,attrs) in PROV-ASN, contains:

An information flow ordering record is formally defined as follows.

Given two activity records identified by a1 and a2, the record wasInformedBy(a2,a1) holds, if and only if there is an entity record identified by e and sets of attribute-value pairs attrs1 and attrs2, such that wasGeneratedBy(e,a1,attrs1) and used(a2,e,attrs2) hold.
Given two activity records denoted by a1 and a2, if the record wasInformedBy(a2,a1) holds, then the following temporal constraint holds: the start event of the activity record denoted by a1 precedes the end event of the activity record denoted by a2.
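
As an informal illustration of the definition above, the following Python sketch derives information flow ordering records from generation and usage records. The function name and the encoding of records as identifier pairs (attributes omitted) are assumptions made for this example only.

```python
def infer_was_informed_by(generated, used):
    """Derive (a2, a1) pairs such that wasInformedBy(a2,a1) holds.

    generated: set of (e, a1) pairs for wasGeneratedBy(e,a1,attrs1)
    used:      set of (a2, e) pairs for used(a2,e,attrs2)
    """
    return {
        (a2, a1)
        for (e, a1) in generated
        for (a2, e_used) in used
        if e == e_used            # the same entity links the two activities
    }

# e1 is generated by a1 and used by a2; e2 is generated by a2 and used by a3.
generated = {("e1", "a1"), ("e2", "a2")}
used = {("a2", "e1"), ("a3", "e2")}
informed = infer_was_informed_by(generated, used)
```

Here the derived set contains wasInformedBy(a2,a1) and wasInformedBy(a3,a2), but not wasInformedBy(a3,a1), since no single entity was generated by a1 and used by a3.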

The relationship wasInformedBy is not transitive. Indeed, consider the following records.

wasInformedBy(a2,a1)
wasInformedBy(a3,a2)

We cannot infer wasInformedBy(a3,a1) from them. Indeed, from wasInformedBy(a2,a1), we know that there exists e1 such that e1 was generated by a1 and used by a2. Likewise, from wasInformedBy(a3,a2), we know that there exists e2 such that e2 was generated by a2 and used by a3. The following illustration shows a case where transitivity cannot hold. The horizontal axis represents time. We see that e1 was generated after e2 was used. Furthermore, the illustration also shows that a3 completes before a1. So it is impossible for a3 to have used an entity generated by a1.

[Figure: non-transitivity of wasInformedBy]
The relation wasScheduledAfter was dropped, and replaced by a simpler relation wasStartedBy(a2,a1). It is intentional that the name wasStartedBy is overloaded.

A control ordering record, written as wasStartedBy(a2,a1) in PROV-ASN, contains:

Such a record states control ordering between a2 and a1, specified as follows.

Given two activity records identified by a1 and a2, the record wasStartedBy(a2,a1) holds if and only if there exist an entity record identified by e and some attributes gAttr and sAttr, such that wasGeneratedBy(e,a1,gAttr) and wasStartedBy(a2,e,sAttr) hold.

In the following assertions, we find two activity records, identified by a1 and a2, representing two activities, which took place on two separate hosts. The third record indicates that the latter activity was started by the former.

activity(a1,workflow,t1,t2,[ex:host="server1.example.org"])
activity(a2,sub-workflow,t3,t4,[ex:host="server2.example.org"])
wasStartedBy(a2,a1)

Alternatively, we could have asserted the existence of an entity, representing a request to create a sub-workflow. This request, issued by a1, triggered the start of a2.

entity(e,[prov:type="creation-request"])
wasGeneratedBy(e,a1)
wasStartedBy(a2,e)
Given two activity records denoted by a1 and a2, if the record wasStartedBy(a2,a1) holds, then the following temporal constraint holds: the start event of the activity record denoted by a1 precedes the start event of the activity record denoted by a2.
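
The definition of control ordering and its temporal constraint can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: the function names, the pair encoding of records, the numeric start times, and the reading of "precedes" as "is not later than" are all hypothetical choices made for the example.

```python
def derive_control_ordering(generated, started_by_entity):
    """Derive (a2, a1) pairs such that wasStartedBy(a2,a1) holds.

    generated:         set of (e, a1) pairs for wasGeneratedBy(e,a1,gAttr)
    started_by_entity: set of (a2, e) pairs for wasStartedBy(a2,e,sAttr)
    """
    return {
        (a2, a1)
        for (e, a1) in generated
        for (a2, e_start) in started_by_entity
        if e == e_start
    }

def violates_start_order(control, start_time):
    """Return the (a2, a1) pairs where a1 starts after a2 (assumed reading of the constraint)."""
    return {(a2, a1) for (a2, a1) in control if start_time[a1] > start_time[a2]}

# The request e is generated by a1 and starts a2, as in the example above.
generated = {("e", "a1")}
started_by_entity = {("a2", "e")}
control = derive_control_ordering(generated, started_by_entity)

start_time = {"a1": 1, "a2": 3}   # hypothetical values standing in for t1 and t3
violations = violates_start_order(control, start_time)
```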
Suggested definition for process ordering. This is ISSUE-50.

6.4 Revision Record

A revision record is a representation of the creation of an entity considered to be a variant of another. Deciding whether something is made available as a revision of something else usually involves an agent who represents someone in the world who takes responsibility for approving that the former is a due variant of the latter.

A revision record, written wasRevisionOf(e2,e1,ag,attrs) in PROV-ASN, contains:

In PROV-ASN, a revision record's text matches the revisionRecord production of the grammar defined in this specification document.

revisionRecord ::= wasRevisionOf ( eIdentifier , eIdentifier , agIdentifier optional-attribute-values )

A revision record needs to satisfy the following constraint, linking the two entity records by a derivation, and stating them to be a complement of a third entity record.

Given two identifiers old and new identifying two entities, and an identifier ag identifying an agent, if a record wasRevisionOf(new,old,ag) is asserted, then there exists an entity record identifier e and attribute-values eAttrs, dAttrs, such that the following records hold:
  • wasDerivedFrom(new,old,dAttrs);
  • entity(e,eAttrs);
  • wasComplementOf(new,e);
  • wasComplementOf(old,e).
The derivation record may be imprecise-1 or imprecise-n.

wasRevisionOf is a strict sub-relation of wasDerivedFrom since two entities e2 and e1 may satisfy wasDerivedFrom(e2,e1) without being a variant of each other.

The following revision assertion

agent(ag,[prov:type="QualityController"])
entity(e1,[prov:type="document"])
entity(e2,[prov:type="document"])
wasRevisionOf(e2,e1,ag)

states that the document represented by the entity record identified by e2 is a revision of the document represented by the entity record identified by e1; the agent denoted by ag is responsible for this new versioning of the document.
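
The constraint on revision records can be illustrated by expanding a revision into the records it implies. The Python sketch below is only an illustration: the function name, the tuple encoding of records, and the scheme for minting a fresh identifier for the third entity record are assumptions, since the constraint merely states that such an entity record exists.

```python
from itertools import count

_fresh = count(1)   # hypothetical source of fresh identifiers

def expand_revision(new, old, ag):
    """Return the records implied by wasRevisionOf(new,old,ag), attributes omitted."""
    e = f"_e{next(_fresh)}"   # stands for the entity record whose existence is required
    return [
        ("wasDerivedFrom", new, old),
        ("entity", e),
        ("wasComplementOf", new, e),
        ("wasComplementOf", old, e),
    ]

implied = expand_revision("e2", "e1", "ag")
```

Note that the agent does not appear in any implied record; it only identifies who takes responsibility for the revision.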

Revision should be a class not a property. This is ISSUE-48.

6.5 Attribution Record

An attribution record represents that an entity is ascribed to an agent and is compliant with the attributionRecord production.

An attribution record, written wasAttributedTo(e,ag,attr), contains the following components:

Attribution models the notion of an activity generating an entity identified by e being controlled by an agent ag, which takes responsibility for generating e. Formally, this is expressed as the following necessary condition.

In PROV-ASN, an attribution record's text matches the attributionRecord production of the grammar.

attributionRecord ::= wasAttributedTo ( eIdentifier , agIdentifier optional-attribute-values )
If wasAttributedTo(e,ag) holds for some identifiers e and ag, then there exists an activity identified by pe such that the following statements hold:
activity(pe,recipe,t1,t2,attr1)
wasGeneratedBy(e,pe)
wasAssociatedWith(pe,ag,attr2)
for some sets of attribute-value pairs attr1 and attr2, time t1, and t2.
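
The necessary condition above can be checked against a set of asserted records by searching for a witness activity. The following Python sketch assumes a hypothetical encoding of records as identifier pairs and a hypothetical function name. Because the condition is existential, a negative result only means that no witness has been asserted explicitly, not that the attribution is invalid.

```python
def find_attribution_witness(e, ag, generated, associated):
    """Return an activity pe with wasGeneratedBy(e,pe) and wasAssociatedWith(pe,ag), or None.

    generated:  set of (e, pe) pairs for wasGeneratedBy(e,pe)
    associated: set of (pe, ag) pairs for wasAssociatedWith(pe,ag,attr2)
    """
    for (e_gen, pe) in sorted(generated):
        if e_gen == e and (pe, ag) in associated:
            return pe
    return None

generated = {("e", "pe")}
associated = {("pe", "ag")}
witness = find_attribution_witness("e", "ag", generated, associated)
missing = find_attribution_witness("e", "other-agent", generated, associated)
```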

6.6 Quotation Record

A quotation record is a representation of the repeating or copying of some part of an entity, compatible with the quotationRecord production.

A quotation record, written wasQuotedFrom(e2,e1,ag2,ag1,attrs), contains:

In PROV-ASN, a quotation record's text matches the quotationRecord production of the grammar.

quotationRecord ::= wasQuotedFrom ( eIdentifier , eIdentifier , agIdentifier , agIdentifier optional-attribute-values )
If wasQuotedFrom(e2,e1,ag2,ag1) holds for some identifiers e2, e1, ag2, ag1, then the following records hold:
wasDerivedFrom(e2,e1)
wasAttributedTo(e2,ag2)
wasAttributedTo(e1,ag1)

6.7 Summary Record

A summary record represents that an entity is a synopsis or abbreviation of another entity. A summary record is compliant with the summaryRecord production.

An assertion wasSummaryOf, written wasSummaryOf(e2,e1,attrs), contains:

In PROV-ASN, a summary record's text matches the summaryRecord production of the grammar.

summaryRecord ::= wasSummaryOf ( eIdentifier , eIdentifier optional-attribute-values )

wasSummaryOf is a strict sub-relation of wasDerivedFrom.

6.8 Original Source Record

An original source record represents an entity in which another entity first appeared. An original source record is compliant with the originalSourceRecord production.

An assertion hadOriginalSource, written hadOriginalSource(e2,e1,attrs), contains:

hadOriginalSource is a strict sub-relation of wasDerivedFrom.

In PROV-ASN, an original source record's text matches the originalSourceRecord production of the grammar.

originalSourceRecord ::= hadOriginalSource ( eIdentifier , eIdentifier optional-attribute-values )

7. PROV-DM Extensibility Points

The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:

The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure inter-operability, specializations of the PROV data model that exploit the extensibility points summarized in this section must preserve the semantics specified in this document. For instance, a qualified attribute on a domain specific entity record must represent an aspect of an entity and this aspect must remain unchanged during the characterization's interval of this entity record.

8. Resources, URIs, Entities, Identifiers, and Scope

This specification introduces the notion of an identifiable entity in the world. In PROV-DM, an entity record is a representation of such an identifiable entity. An entity record includes an identifier identifying this entity. Identifiers are qualified names, which can be mapped to IRIs.

The term 'resource' is used in a general sense for whatever might be identified by a URI [RFC3986]. On the Web, a URI denotes a resource, without any expectation that the resource is accessed.

The purpose of this section is to clarify the relationship between resource and the notions of entity and entity record.

In the context of PROV-DM, a resource is just a thing in the world. One may take multiple perspectives on such a thing and its situation in the world, fixing some of its aspects.

We refer to the example of section 2.1 for a resource (at some URL) and three different perspectives, referred to as entities. Three different entity records can be expressed for this report, which, in the PROV-ASN sample below, are expressed within the same account.

container
prefix app urn:example:
prefix cr  http://example.org/crime/

   account(acc1,
           http://example.org/asserter1,

           entity(app:0, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
           entity(app:1, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
           entity(app:2, [ prov:type="Document", cr:author="John" ])
        ...)
endContainer

Each entity record contains an identifier that identifies the entity it represents. In this example, three identifiers were minted, and their prefix uses the URN syntax with the "example" namespace.

Given that the report is a resource denoted by the URI http://example.org/crime.txt, we could simply use this URI as the identifier of an entity. This would save us from minting new URIs. Hence, the report URI would play a double role: as a URI it denotes a resource accessible at that URI, and as a PROV-DM identifier, it identifies a specific characterization of this report. A given identifier identifies a single entity record within the scope of an account. Hence, below, all entity records have been given the same identifier but appear in the scope of different accounts.

container 
prefix app http://example.org/
prefix cr  http://example.org/crime/

   account(acc2,
           http://example.org/asserter1,

           entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
           ...)

   account(acc3,
           http://example.org/asserter1,

           entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
           ...)

   account(acc4,
           http://example.org/asserter1,
           entity(app:crime.txt, [ prov:type="Document", cr:author="John" ])
           ...)
endContainer

In this case, the qualified name app:crime.txt, which maps to the URI http://example.org/crime.txt, still denotes the same resource; however, the perspective we take about that resource is expressed as a different entity record, which happens to have the same identifier in different accounts.

Alternatively, if we need to assert the existence of two different perspectives on the report within the same account, then alternate identifiers must be used, one of them being allowed to be the resource URI.

container 
 prefix app  http://example.org/
 prefix app2 urn:example:
 prefix cr   http://example.org/crime/

   account(acc5,
           http://example.org/asserter1,

           entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
           entity(app2:1, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])

           ...)
endContainer
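
The scoping rule discussed in this section can be sketched in Python as follows. This is an illustrative sketch only: the expand function, the Account class, and the choice of concatenating a prefix's namespace with the local part to obtain the identifier are assumptions for the example, not definitions from this specification.

```python
def expand(qname, prefixes):
    """Map a qualified name such as app:crime.txt to a full identifier (assumed mapping)."""
    prefix, _, local = qname.partition(":")
    return prefixes[prefix] + local

class Account:
    """Hypothetical container holding at most one entity record per identifier."""

    def __init__(self, name, prefixes):
        self.name = name
        self.prefixes = prefixes
        self.entities = {}            # expanded identifier -> attribute dictionary

    def add_entity(self, qname, attrs):
        identifier = expand(qname, self.prefixes)
        if identifier in self.entities:
            raise ValueError(f"{identifier} already identifies an entity record in {self.name}")
        self.entities[identifier] = attrs

prefixes = {"app": "http://example.org/", "app2": "urn:example:"}

# Two perspectives in one account need distinct identifiers.
acc5 = Account("acc5", prefixes)
acc5.add_entity("app:crime.txt", {"prov:type": "Document"})
acc5.add_entity("app2:1", {"prov:type": "Document", "cr:version": "2.1"})

# The same identifier may be reused in a different account.
acc2 = Account("acc2", prefixes)
acc2.add_entity("app:crime.txt", {"prov:type": "Document"})
```

Adding a second entity record for app:crime.txt to acc5 would raise an error in this sketch, which mirrors the requirement that alternate identifiers be used within a single account.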

A. Changes Since First Public Working Draft

B. Acknowledgements

WG membership to be listed here.

C. References

C.1 Normative references

[IRI]
M. Duerst, M. Suignard. Internationalized Resource Identifiers (IRI). January 2005. Internet RFC 3987. URL: http://www.ietf.org/rfc/rfc3987.txt
[OWL2-SYNTAX]
Boris Motik; Peter F. Patel-Schneider; Bijan Parsia. OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax. 27 October 2009. W3C Recommendation. URL: http://www.w3.org/TR/2009/REC-owl2-syntax-20091027/
[RDF-SPARQL-QUERY]
Andy Seaborne; Eric Prud'hommeaux. SPARQL Query Language for RDF. 15 January 2008. W3C Recommendation. URL: http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt
[RFC3986]
T. Berners-Lee; R. Fielding; L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. January 2005. Internet RFC 3986. URL: http://www.ietf.org/rfc/rfc3986.txt
[XML-NAMES]
Richard Tobin; et al. Namespaces in XML 1.0 (Third Edition). 8 December 2009. W3C Recommendation. URL: http://www.w3.org/TR/2009/REC-xml-names-20091208/
[XMLSCHEMA-2]
Paul V. Biron; Ashok Malhotra. XML Schema Part 2: Datatypes Second Edition. 28 October 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

C.2 Informative references

[CLOCK]
Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21 (7): 558–565. 1978. URL: http://research.microsoft.com/users/lamport/pubs/time-clocks.pdf DOI: 10.1145/359545.359563.
[CSP]
Hoare, C. A. R. Communicating Sequential Processes. Prentice-Hall. 1985. URL: http://www.usingcsp.com/cspbook.pdf
[FOAF]
Dan Brickley, Libby Miller. FOAF Vocabulary Specification 0.98. 9 August 2010. URL: http://xmlns.com/foaf/spec/
[Logic]
W. E. Johnson. Logic: Part III. 1924. URL: http://www.ditext.com/johnson/intro-3.html
[PROV-O]
Satya Sahoo and Deborah McGuinness. Provenance Formal Model. 2011, Work in progress. URL: http://www.w3.org/TR/prov-o/
[PROV-PAQ]
Graham Klyne and Paul Groth. Provenance Access and Query. 2011, Work in progress. URL: http://dvcs.w3.org/hg/prov/tip/paq/prov-aq.html
[PROV-PRIMER]
Yolanda Gil and Simon Miles. Prov Model Primer. 2011, Work in progress. URL: http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html
[PROV-SEMANTICS]
James Cheney. Formal Semantics Strawman. 2011, Work in progress. URL: http://www.w3.org/2011/prov/wiki/FormalSemanticsStrawman