Copyright © 2011 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
PROV-DM is a data model for provenance for building representations of the entities, people and activities involved in producing a piece of data or thing in the world. PROV-DM is domain-agnostic, but with well-defined extensibility points allowing further domain-specific and application-specific extensions to be defined. It is accompanied by PROV-ASN, a technology-independent abstract syntax notation, which allows serializations of PROV-DM instances to be created for human consumption, which facilitates its mapping to concrete syntax, and which is used as the basis for a formal semantics.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is part of a set of specifications aiming to define the various aspects that are necessary to achieve the vision of inter-operable interchange of provenance information in heterogeneous environments such as the Web. This document defines the PROV-DM data model for provenance, accompanied with a notation to express instances of that data model for human consumption. Three other documents are: 1) a normative serialization of PROV-DM in RDF, specified by means of a mapping to the OWL2 Web Ontology Language; 2) the mechanisms for accessing and querying provenance; 3) a primer for the provenance data model.
This document was published by the Provenance Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-prov-wg@w3.org (subscribe, archives). All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable: provenance can help those users to make trust judgments.
The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today. Instead, a pragmatic approach is to consider a core data model for provenance that allows domain and application specific representations of provenance to be translated into such a data model and exchanged between systems. Heterogeneous systems can then export their provenance into such a core data model, and applications that need to make sense of provenance in heterogeneous systems can then import it, process it, and reason over it.
Thus, the vision is that different provenance-aware systems natively adopt their own model for representing their provenance, but a core provenance data model can be readily adopted as a provenance interchange model across such systems.
A set of specifications define the various aspects that are necessary to achieve this vision in an inter-operable way, the first of which is contained in this document:
The PROV-DM data model for provenance consists of a set of core concepts, and a few common relations, based on these core concepts. PROV-DM is a domain-agnostic model, but with well-defined extensibility points allowing further domain-specific and application-specific extensions to be defined.
This specification also introduces PROV-ASN, an abstract syntax that is primarily aimed at human consumption. PROV-ASN allows serializations of PROV-DM instances to be written in a technology independent manner, it facilitates its mapping to concrete syntax, and it is used as the basis for a formal semantics. This specification uses instances of provenance written in PROV-ASN to illustrate the data model.
In section 2, a set of preliminaries are introduced, including concepts that underpin PROV-DM and motivations for the PROV-ASN notation.
Section 3 provides an overview of PROV-DM listing its core types and their relations.
In section 4, PROV-DM is applied to a short scenario, encoded in PROV-ASN, and illustrated graphically.
Section 5 provides the normative definition of PROV-DM and the notation PROV-ASN.
Section 6 introduces common relations used in PROV-DM, including relations for data collections and domain-independent common relations.
Section 7 summarizes PROV-DM extensibility points.
Section 8 discusses how PROV-DM can be applied to the notion of resource.
The PROV-DM namespace is http://www.w3.org/ns/prov-dm/ (TBC).
All the elements, relations, reserved names and attributes introduced in this specification belong to the PROV-DM namespace.
The key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" in this document are to be interpreted as described in [RFC2119].
This specification is based on a conceptualization of the world that is described in this section. In the world (whether real or not), there are things, which can be physical, digital, conceptual, or otherwise, and activities involving things.
When we talk about things in the world in natural language and even when we assign identifiers, we are often imprecise in ways that make it difficult to clearly and unambiguously report provenance: a resource with a URL may be understood as referring to a report available at that URL, the version of the report available there today, the report independent of where it is hosted over time, etc.
Hence, to accommodate different perspectives on things and their situation in the world as perceived by us, we introduce the idea of a characterized thing, which refers to a thing and its situation in the world, as characterized by someone. We then define an entity as an identifiable characterized thing. An entity fixes some aspects of a thing and its situation in the world, so that it becomes possible to express its provenance, and what causes these specific aspects to be as such. An alternative entity may fix other aspects, and its provenance may be different.
We do not assume that any characterization is more important than any other, and in fact, it is possible to describe the processing that occurred for the report to be commissioned, for individual versions to be created, for those versions to be published at the given URL, etc., each via a different entity that characterizes the report appropriately.
In the world, activities involve entities in multiple ways: they consume them, they process them, they transform them, they modify them, they change them, they relocate them, they use them, they generate them, they are controlled by them, etc.
An agent is a type of entity that takes an active role in an activity such that it can be assigned some degree of responsibility for the activity taking place. This definition intentionally stays away from using concepts such as enabling, causing, initiating, affecting, etc., because any entity also enables, causes, initiates, and affects activities in some way. So the notion of having some degree of responsibility is really what makes an agent.
Even software agents can be assigned some responsibility for the effects they have in the world, so for example if one is using a Text Editor and one's laptop crashes, then one would say that the Text Editor was responsible for crashing the laptop. If one invokes a service to buy a book, that service can be considered responsible for drawing funds from one's bank to make the purchase (the company that runs the service and the web site would also be responsible, but the point here is that we assign some measure of responsibility to software as well). So when someone models software as an agent for an activity in our model, they mean the agent has some responsibility for that activity.
In this specification, the qualifier 'identifiable' is implicit whenever a reference is made to an activity, agent, or an entity.
Time is critical in the context of provenance, since it can help corroborate provenance claims. For instance, if an entity is claimed to be obtained by transforming another, then the latter must have existed before the former. If it is not the case, then there is something wrong in such a provenance claim.
Although time is critical, we should also recognize that provenance can be used in many different contexts: in a single system, across the Web, or in spatial data management, to name a few. Hence, it is a design objective of PROV-DM to minimize the assumptions about time, so that PROV-DM can be used in varied contexts.
Furthermore, consider two activities that started at the same time instant. Just by referring to that instant, we cannot distinguish which activity start we refer to. This is particularly relevant if we try to explain that the start of these activities had different reasons. We need to be able to refer to the start of an activity as a first class concept, so that we can talk about it and about its relation with respect to other similar starts.
Hence, in our conceptualization of the world, an instantaneous event, or event for short, happens in the world and marks a change in the world, in its activities and in its entities. The term "event" is commonly used in process algebra with a similar meaning. For instance, in CSP [CSP], events represent communications or interactions; they are assumed to be atomic and instantaneous.
Four kinds of events underpin the PROV-DM data model. The activity start and activity end events demarcate the beginning and the end of activities, respectively. The entity generation and entity usage events demarcate the characterization interval for entities. More specifically:
An entity generation event is the event that marks the final instant of an entity's creation timespan, after which it becomes available for use.
An entity usage event is the event that marks the first instant of an entity's consumption timespan by an activity.
An activity start event is the event that marks the instant an activity starts.
An activity end event is the event that marks the instant an activity ends.
To allow for minimalistic clock assumptions, like Lamport [CLOCK], PROV-DM relies on a notion of relative ordering of events, without using physical clocks. This specification assumes that a partial order exists between events.
Specifically, follows is a partial order between events, indicating that an event occurs after another. For symmetry, precedes is defined as the inverse of follows.
How such partial order is realized in practice is beyond the scope of this specification. This specification only assumes that each event can be mapped to an instant in some form of timeline. The actual mapping is not in scope of this specification. Likewise, whether this timeline is formed of a single global timeline or whether it consists of multiple Lamport's style clocks is also beyond this specification. It is anticipated that follows and precedes correspond to some ordering over this timeline.
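The ordering relations can be illustrated with a small Python sketch. It is not part of this specification: the event names p, q, r and s, the set of asserted pairs, and the helper functions are all hypothetical, and the sketch assumes a strict reading of follows in which no event follows itself. It shows precedes as the inverse of follows, and that two events may be left unordered.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the transitive closure of a set of (later, earlier) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


# Hypothetical assertions: q follows p, r follows q, s follows q.
# Nothing is said about the relative order of r and s.
FOLLOWS = transitive_closure({("q", "p"), ("r", "q"), ("s", "q")})


def follows(e1, e2):
    """True when event e1 is known to occur after event e2."""
    return (e1, e2) in FOLLOWS


def precedes(e1, e2):
    """Inverse of follows."""
    return follows(e2, e1)
```

Here follows("r", "p") holds by transitivity, whereas neither follows("r", "s") nor follows("s", "r") holds, which is what makes the order partial rather than total.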
This specification introduces a set of "temporal interpretation" rules that allow event ordering constraints to be derived from provenance records. According to such temporal interpretation, provenance records must satisfy such constraints. We note that the actual verification of such temporal constraints is also outside the scope of this specification.
PROV-DM also allows for time observations to be inserted in specific provenance records, for each recognized event introduced in this specification. The presence of a time observation for a given event fixes the mapping of this event to the timeline. It can also help with the verification of associated temporal constraints (though, again, this verification is outside the scope of this specification).
This specification defines PROV-DM, a data model for provenance, consisting of records describing how people, entities, and activities, were involved in producing, influencing, or delivering a piece of data or a thing in the world.
This specification also relies on a language, PROV-ASN, the Provenance Abstract Syntax Notation, to express instances of that data model. For each construct of PROV-DM, a corresponding ASN expression is introduced, by way of a production in the ASN grammar.
PROV-ASN is an abstract syntax, whose goals are:
This specification provides a grammar for PROV-ASN. Each record of the PROV-DM data model is explained in terms of the production of this grammar.
The formal semantics of PROV-DM is defined at [PROV-SEMANTICS] and its encoding in the OWL2 Web Ontology Language at [PROV-O].
PROV-DM is a provenance data model designed to express representations of the world.
These representations are relative to an asserter, and in that sense constitute assertions stating properties of the world, as represented by an asserter. Different asserters will normally contribute different representations. This specification does not define a notion of consistency between different sets of assertions (whether by the same asserter or different asserters). The data model provides the means to associate attribution to assertions.
The data model is designed to capture activities that happened in the past, as opposed to activities that may or will happen. However, this distinction is not formally enforced. Therefore, all PROV-DM assertions should be interpreted as a record of what has happened, as opposed to what may or will happen.
This specification does not prescribe the means by which an asserter arrives at assertions; for example, assertions can be composed on the basis of observations, reasoning, or any other means.
Sometimes, inferences about the world can be made from representations conformant to the PROV-DM data model. When this is the case, this specification defines such inferences, allowing new provenance records to be inferred from existing ones. Hence, representations of the world can result either from direct assertions by asserters or from application of inferences defined by this specification.
This specification includes a grammar for PROV-ASN expressed using the Extended Backus-Naur Form (EBNF) notation.
Each rule in the grammar defines one symbol, in the form:
E ::= expression
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
The following ER diagram provides a high level overview of the structure of PROV-DM records. Examples of provenance assertions that conform to this schema are provided in the next section.
The model includes the following elements:
A set of attribute-value pairs can be associated to elements and relations of the PROV model in order to further characterize their nature. The wasComplementOf relationship is used to denote that two entities complement each other, in the sense that they each represent a partial, but mutually compatible characterization of the same thing. The attributes role and type are pre-defined.
The set of relations presented here forms a core, which is further extended with additional relations, defined in Section Common Relations.
The model includes a further additional element: notes. These are also structured as sets of attribute-value pairs. Notes are used to provide additional, "free-form" information regarding any identifiable construct of the model, with no prescribed meaning. Notes are described in detail here.
Attributes and notes are the main extensibility points in the model: individual interest groups are expected to extend PROV-DM by introducing new attributes and notes as needed to address applications-specific provenance modelling requirements.
This section is non-normative.
This scenario is concerned with the evolution of a crime statistics file (referred to as e0) stored on a shared file system and which journalists Alice, Bob, Charles, David, and Edith can share and edit. We consider various events in the evolution of file e0; events listed below follow each other, unless otherwise specified.
Event evt1: Alice creates (a0) an empty file in /share/crime.txt. We denote this file e1.
Event evt2: Bob appends (a1) the following line to /share/crime.txt:
There was a lot of crime in London last month.
We denote the revised file e2.
Event evt3: Charles emails (a2) the contents of /share/crime.txt, as an attachment, which we refer to as e4. (We specifically refer to a copy of the file that is uploaded on the mail server.)
Event evt4: David edits (a3) file /share/crime.txt as follows.
There was a lot of crime in London and New-York last month.
We denote the revised file e3.
Event evt5: Edith emails (a4) the contents of /share/crime.txt as an attachment, referred to as e5.
Event evt6: between events evt4 and evt5, someone (unspecified) runs a spell checker (a5) on the file /share/crime.txt. The file after spell checking is referred to as e6.
Entity Records (described in Section Entity). The file in its various forms and its copies are modelled as entity records, corresponding to multiple characterizations, as per scenario. The entity records are identified by e0, ..., e6.
entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
entity(e1, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="" ])
entity(e2, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="There was a lot of crime in London last month."])
entity(e3, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="There was a lot of crime in London and New York last month."])
entity(e4)
entity(e5)
entity(e6, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="There was a lot of crime in London and New York last month.", ex:spellchecked="yes"])
These entity records list attributes that have been given values during intervals delimited by events; such intervals are referred to as characterization intervals. The following table lists all entity identifiers and their corresponding characterization intervals. When the end of the characterization interval is not delimited by an event described in this scenario, it is marked by "...".
Entity | Characterization Interval |
e0 | evt1 - ... |
e1 | evt1 - evt2 |
e2 | evt2 - evt4 |
e3 | evt4 - ... |
e4 | evt3 - ... |
e5 | evt5 - ... |
e6 | evt6 - ... |
Activity Records (described in Section Activity) represent activities in the scenario.
activity(a0, create-file, 2011-11-16T16:00:00,)
activity(a1, add-crime-in-london, 2011-11-16T16:05:00,)
activity(a2, email, 2011-11-16T17:00:00,)
activity(a3, edit-London-New-York, 2011-11-17T09:00:00,)
activity(a4, email, 2011-11-17T09:30:00,)
activity(a5, spellcheck,,)
Generation Records (described in Section Generation) represent the event at which a file is created in a specific form. Attributes are used to describe the modalities according to which a given entity is generated by a given activity. The interpretation of attributes is application specific. Illustrations of such attributes for the scenario are: no attribute is provided for e0; e2 was generated by the editor's save function; e4 can be found on the smtp port, in the attachment section of the mail message; e6 was produced on the standard output of a5. Two identifiers g1 and g2 identify the generation records referenced in derivations introduced below.
wasGeneratedBy(e0, a0)
wasGeneratedBy(e1, a0, [ex:fct="create"])
wasGeneratedBy(e2, a1, [ex:fct="save"])
wasGeneratedBy(e3, a3, [ex:fct="save"])
wasGeneratedBy(g1, e4, a2, [ex:port="smtp", ex:section="attachment"])
wasGeneratedBy(g2, e5, a4, [ex:port="smtp", ex:section="attachment"])
wasGeneratedBy(e6, a5, [ex:file="stdout"])
Usage Records (described in Section Usage) represent the event by which a file is read by an activity. Likewise, attributes describe the modalities according to which the various entities are used by activities. Illustrations of such attributes are: e1 is used in the context of a1's load functionality; e2 is used by a2 in the context of its attach functionality; e3 is used on the standard input by a5. Two identifiers u1 and u2 identify the Usage records referenced in derivations introduced below.
used(a1,e1,[ex:fct="load"])
used(a3,e2,[ex:fct="load"])
used(u1,a2,e2,[ex:fct="attach"])
used(u2,a4,e3,[ex:fct="attach"])
used(a5,e3,[ex:file="stdin"])
Derivation Records (described in Section Derivation Relation) express that an entity is derived from another. The first two are expressed in their compact version, whereas the following two are expressed in their full version, including the activity underpinning the derivation, and associated usage (u1, u2) and generation (g1, g2) records.
wasDerivedFrom(e2,e1)
wasDerivedFrom(e3,e2)
wasDerivedFrom(e4,e2,a2,g1,u1)
wasDerivedFrom(e5,e3,a4,g2,u2)
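As an informal illustration, the four derivation records of the scenario can be navigated as a graph. The Python sketch below is an assumption-laden convenience, not something this specification defines: the dictionary DERIVED_FROM and the function derivation_chain are hypothetical names, and walking the asserted edges is shown only as navigation, without claiming that wasDerivedFrom itself is transitive.

```python
# Derivation edges copied from the scenario: derived entity -> source entity.
# The optional activity, generation and usage identifiers are omitted here.
DERIVED_FROM = {
    "e2": "e1",
    "e3": "e2",
    "e4": "e2",
    "e5": "e3",
}


def derivation_chain(entity):
    """List the entities reached by repeatedly following asserted derivations."""
    chain = []
    while entity in DERIVED_FROM:
        entity = DERIVED_FROM[entity]
        chain.append(entity)
    return chain
```

For example, derivation_chain("e5") visits e3, then e2, then e1, while derivation_chain("e1") is empty because no derivation is asserted for e1.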
wasComplementOf: (this relation is described in Section wasComplementOf). The crime statistics file (e0) has various contents over its existence (e1, e2, e3); the entity records identified by e1, e2, e3 complement e0 with an attribute content. Likewise, the one denoted by e6 complements the record denoted by e3 with an attribute spellchecked.
wasComplementOf(e1,e0)
wasComplementOf(e2,e0)
wasComplementOf(e3,e0)
wasComplementOf(e6,e3)
Agent Records (described at Section Agent): the various users are represented as agents, themselves being a type of entity.
agent(ag1, [ prov:type="prov:Person" %% xsd:QName, ex:name="Alice" ])
agent(ag2, [ prov:type="prov:Person" %% xsd:QName, ex:name="Bob" ])
agent(ag3, [ prov:type="prov:Person" %% xsd:QName, ex:name="Charles" ])
agent(ag4, [ prov:type="prov:Person" %% xsd:QName, ex:name="David" ])
agent(ag5, [ prov:type="prov:Person" %% xsd:QName, ex:name="Edith" ])
Activity Association Records (described in Section Activity Association): the association of an agent with an activity is expressed with wasAssociatedWith, and the nature of this association is described by attributes. Illustrations of such attributes include the role of the participating agent, as creator, author and communicator (role is a reserved attribute in PROV-DM).
wasAssociatedWith(a0, ag1, [prov:role="creator"])
wasAssociatedWith(a1, ag2, [prov:role="author"])
wasAssociatedWith(a2, ag3, [prov:role="communicator"])
wasAssociatedWith(a3, ag4, [prov:role="author"])
wasAssociatedWith(a4, ag5, [prov:role="communicator"])
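To show how these records combine, the following Python sketch joins the generation and activity association records of the scenario to answer the question "which agent was associated with the activity that generated this entity?". The dictionaries and the function associated_agent are hypothetical names chosen for illustration. The sketch also assumes one agent per activity, which happens to hold in this scenario but is not a restriction of the model.

```python
# Generation records from the scenario: entity -> generating activity.
GENERATED_BY = {
    "e0": "a0", "e1": "a0", "e2": "a1", "e3": "a3",
    "e4": "a2", "e5": "a4", "e6": "a5",
}

# Activity association records: activity -> (agent, role).
ASSOCIATED_WITH = {
    "a0": ("ag1", "creator"),
    "a1": ("ag2", "author"),
    "a2": ("ag3", "communicator"),
    "a3": ("ag4", "author"),
    "a4": ("ag5", "communicator"),
}

# Agent records: agent -> ex:name attribute.
AGENT_NAMES = {
    "ag1": "Alice", "ag2": "Bob", "ag3": "Charles",
    "ag4": "David", "ag5": "Edith",
}


def associated_agent(entity_id):
    """Return (name, role) for the agent tied to the entity's generating activity."""
    activity = GENERATED_BY.get(entity_id)
    association = ASSOCIATED_WITH.get(activity)
    if association is None:
        # No association was asserted, as for the spell checker activity a5.
        return None
    agent_id, role = association
    return AGENT_NAMES[agent_id], role
```

With the scenario data, associated_agent("e3") yields David in the author role, and associated_agent("e6") yields nothing because no agent is asserted for a5.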
Provenance assertions can be illustrated graphically. The illustration is not intended to represent all the details of the model, but it is intended to show the essence of a set of provenance assertions. Therefore, it cannot be seen as an alternate notation for expressing provenance.
The graphical illustration takes the form of a graph. Entities, activities and agents are represented as nodes, with oval, rectangular, and half-hexagonal shapes, respectively. Usage, Generation, Derivation, Activity Association, and Complementarity are represented as directed edges.
Entities are laid out according to the ordering of their generation event. We endeavor to show time progressing from left to right. This means that edges for Usage, Generation and Derivation typically point from right to left.
This section contains the normative specification of PROV-DM core, the core of the PROV data model.
PROV-DM consists of a set of constructs, referred to as records, to formulate representations of the world and constraints that must be satisfied by them.
Furthermore, PROV-DM includes a "house-keeping construct", a record container, used to wrap PROV-DM records and facilitate their interchange.
In PROV-ASN, such representations of the world must be conformant with the toplevel production record of the grammar. These records are grouped in three categories: elementRecord (see section Element), relationRecord (see section Relation), and accountRecord (see section Account).
In PROV-ASN, a record container is compliant with the production recordContainer (see section Record Container).
This section describes all the PROV-DM records referred to as element records. (They are conformant to the elementRecord production of the grammar.)
In PROV-DM, an entity record is a representation of an entity.
Examples of entities include a linked data set, a sparse matrix of floating-point numbers, a document in a directory, the same document published on the Web, and meta-data embedded in a document.
An entity record, noted entity(id, [ attr1=val1, ...]) in PROV-ASN, contains:
The assertion of an entity record, entity(id, [ attr1=val1, ...]), states, from a given asserter's viewpoint, the existence of an entity, whose situation in the world is represented by the attribute-value pairs, which remain unchanged during a characterization interval, i.e. a continuous interval between two events in the world.
In PROV-ASN, an entity record's text matches the entityRecord production of the grammar defined in this specification document.
The following entity record,
entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ])
states the existence of an entity, denoted by identifier e0, with type File and path /shared/crime.txt in the file system, and creator Alice. The attributes path and creator are application specific, whereas the attribute type is reserved in the PROV-DM namespace.
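For readers who want to experiment with the notation, a rough Python sketch of reading such an entity record is given below. It is not an implementation of the entityRecord production: the regular expressions and the function parse_entity are hypothetical, and they handle only the simple shape used in the example, that is, an identifier followed by an optional list of double-quoted string attributes.

```python
import re

# Assumed shape: entity(id) or entity(id, [ name="value", ... ]).
RECORD = re.compile(r'^entity\(\s*(?P<id>[\w:-]+)\s*(?:,\s*\[(?P<attrs>.*)\]\s*)?\)$')
ATTR = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def parse_entity(text):
    """Split a simple entity record into its identifier and attribute dictionary."""
    match = RECORD.match(text.strip())
    if match is None:
        raise ValueError(f"not an entity record: {text!r}")
    # A record without an attribute list yields an empty dictionary.
    attrs = dict(ATTR.findall(match.group("attrs") or ""))
    return match.group("id"), attrs
```

Applied to the record above, parse_entity returns the identifier e0 together with the three attribute-value pairs. Typed literals such as those written with %% and attribute values containing double quotes are outside what this sketch handles.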
In PROV-DM, an activity record is a representation of an identifiable activity, which performs a piece of work.
An activity, represented by an activity record, is delimited by its start and its end events; hence, it occurs over an interval delimited by two events. However, an activity record need not mention time information, nor duration, because they may not be known.
Such start and end times constitute attributes of an activity, where the interpretation of attribute in the context of an activity record is the same as the interpretation of attribute for entity record: an activity record's attribute remains constant for the duration of the activity it represents. Further characteristics of the activity in the world can be represented by other attribute-value pairs, which must also remain unchanged during the activity duration.
Examples of activities include assembling a data set based on a set of measurements, performing a statistical analysis over a data set, sorting news items according to some criteria, running a sparql query over a triple store, editing a file, and publishing a web page.
An activity record, written activity(id, rl, st, et, [ attr1=val1, ...]) in PROV-ASN, contains:
In PROV-ASN, an activity record's text matches the activityRecord production of the grammar defined in this specification document.
The following activity assertion
activity(a1,add-crime-in-london,2011-11-16T16:05:00,2011-11-16T16:06:00,[ex:host="server.example.org",prov:type="ex:edit" %% xsd:QName])
identified by identifier a1, states the existence of an activity with recipe link add-crime-in-london, start time 2011-11-16T16:05:00, and end time 2011-11-16T16:06:00, running on host server.example.org, and of type edit (declared in some namespace with prefix ex). The attribute host is application specific, but must hold for the duration of the activity. The attribute type is a reserved attribute of PROV-DM, allowing for subtyping to be expressed.
The mere existence of an activity assertion entails some event ordering in the world, since an activity start event always precedes the corresponding activity end event. This is expressed by constraint start-precedes-end.
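When an activity record carries both a start time and an end time, the ordering above can be checked against those time observations. The Python sketch below is illustrative only: the function name satisfies_start_precedes_end is hypothetical, the times are assumed to be ISO 8601 strings as in the example record, and treating equal start and end times as acceptable is an assumption rather than something stated here.

```python
from datetime import datetime


def satisfies_start_precedes_end(start, end):
    """Check that a recorded start time is not later than the recorded end time.

    Either argument may be None, since an activity record need not mention
    time information; in that case there is nothing to contradict.
    """
    if start is None or end is None:
        return True
    return datetime.fromisoformat(start) <= datetime.fromisoformat(end)
```

For the activity a1 above, the check passes for start 2011-11-16T16:05:00 and end 2011-11-16T16:06:00, and it would fail if the two times were swapped.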
An activity record is not an entity record. Indeed, an entity record represents an entity that exists in full at any point in its characterization interval, persists during this interval, and preserves the characteristics that make it identifiable. In contrast, an activity is something that happens, unfolds or develops through time, but is typically not identifiable by the characteristics it exhibits at any point during its duration. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].
An agent record is a representation of an agent, which is an entity that can be assigned some degree of responsibility for an activity taking place.
Many agents can have an association with a given activity. An agent may do the ordering of the activity, another agent may do its design, another agent may push the button to start it, another agent may run it, etc. As many agents as one wishes to mention can occur in the provenance record, if it is important to indicate that they were associated with the activity.
From an inter-operability perspective, it is useful to define some basic categories of agents since it will improve the use of provenance records by applications. There should be very few of these basic categories to keep the model simple and accessible. There are three types of agents in the model:
These types are mutually exclusive, though they do not cover all kinds of agent.
An agent record, noted agent(id, [ attr1=val1, ...]) in PROV-ASN, contains:
In PROV-ASN, an agent record's text matches the agentRecord production of the grammar defined in this specification document.
With the following assertions,
agent(e1, [ex:employee="1234", ex:name="Alice", prov:type="prov:Person" %% xsd:QName])
entity(e2) and wasStartedBy(a1,e2,[prov:role="author"])
entity(e3) and wasAssociatedWith(a1,e3,[prov:role="sponsor"])
the agent record identified by e1 is an explicit agent assertion that holds irrespective of activities it may be associated with. On the other hand, from the entity records identified by e2 and e3, one can infer agent records, as per the following inference.
One can assert an agent record or alternatively, one can infer an agent record by its association with an activity.
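The inference can be pictured with a short Python sketch. It is a hedged illustration rather than a normative rule: the function infer_agents and its data layout are hypothetical, and the choice of wasStartedBy and wasAssociatedWith as the relations that trigger the inference is an assumption taken from the example above.

```python
# Assumption: these two relations, used in the example, link an activity
# to an entity acting as an agent.
AGENT_RELATIONS = {"wasStartedBy", "wasAssociatedWith"}


def infer_agents(entities, relations):
    """Infer agent records for entities associated with an activity.

    entities:  dict mapping entity identifier -> attribute dictionary
    relations: iterable of (relation name, activity identifier, entity identifier)
    Returns a dict of inferred agents, reusing the entity's attributes.
    """
    inferred = {}
    for name, _activity, entity_id in relations:
        if name in AGENT_RELATIONS and entity_id in entities:
            inferred[entity_id] = entities[entity_id]
    return inferred
```

Given entity records for e2 and e3 and the two relation records of the example, both e2 and e3 are inferred to be agents, whereas an entity that is only used by an activity is not.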
As provenance records are exchanged between systems, it may be useful to add extra information about such records. For instance, a "trust service" may add value-judgements about the trustworthiness of some of the assertions made. Likewise, an interactive visualization component may want to enrich a set of provenance records with information helping reproduce their visual representation. To help with inter-operability, PROV-DM introduces a simple annotation mechanism allowing any identifiable record to be associated with notes.
A note record is a set of attribute-value pairs, whose meaning is application specific. It may or may not be a representation of something in the world.
In PROV-ASN, a note record's text matches the noteRecord production of the grammar defined in this specification document.
A separate PROV-DM record is used to associate a note with an identifiable record (see Section on annotation). A given note may be associated with multiple records.
The following note record
note(ann1,[ex:color="blue", ex:screenX=20, ex:screenY=30])
consists of a set of application-specific attribute-value pairs, intended to help the rendering of the record it is associated with, by specifying its color and its position on the screen. In this example, these attribute-value pairs do not constitute a representation of something in the world; they are just used to help render provenance.
Attribute-value pairs occurring in notes differ from attribute-value pairs occurring in entity records and activity records. In entity and activity records, attribute-value pairs must be a representation of something in the world, which remains constant for the duration of the characterization interval (for entity records) or the activity duration (for activity records). In note records, it is optional for attribute-value pairs to be representations of something in the world. If they are a representation of something in the world, then their values may change over the corresponding duration. If attribute-value pairs of a note record are a representation of something in the world that does not change, they are not regarded as determining characteristics of an entity or activity, for the purpose of provenance.
This section describes all the PROV-DM records representing relations between the elements introduced in Section Element. While these relations are not binary, they all involve two primary elements. They can be summarized as follows.
| Entity | Activity | Agent | Note |
Entity | wasDerivedFrom, wasComplementOf | wasGeneratedBy | - | hasAnnotation |
Activity | used | - | wasStartedBy, wasEndedBy, wasAssociatedWith | hasAnnotation |
Agent | - | - | actedOnBehalfOf | hasAnnotation |
Note | - | - | - | hasAnnotation |
In PROV-ASN, all these relation records are conformant to the relationRecord production of the grammar.
In PROV-DM, a generation record is a representation of a world event, the creation of a new entity by an activity. This entity did not exist before creation. The representation of this event encompasses a description of the modalities of generation of this entity by this activity.
A generation event may be, for example, the creation of a file by a program, the creation of a linked data set, the production of a new version of a document, or the sending of a value on a communication channel.
A generation record, written wasGeneratedBy(id,e,a,attrs,t) in PROV-ASN, has the following components:
In PROV-ASN, a generation record's text matches the generationRecord production of the grammar defined in this specification document.
A generation record's id is optional. It must be used when annotating generation records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation Record).
The following generation assertions
wasGeneratedBy(e1,a1, 2001-10-26T21:32:52, [ex:port="p1", ex:order=1]) wasGeneratedBy(e2,a1, 2001-10-26T10:00:00, [ex:port="p1", ex:order=2])
state the existence of two events in the world (with respective times 2001-10-26T21:32:52 and 2001-10-26T10:00:00), at which new entities, represented by entity records identified by e1 and e2, are created by an activity, itself represented by an activity record identified by a1. The first one is available as the first value on port p1, whereas the other is the second value on port p1. The semantics of port and order in these records are application specific.
The assertion of a generation record implies ordering of events in the world.
A given entity record can be referred to in a single generation record in the scope of a given account. The rationale for this constraint is as follows. If two activities sequentially set different values to some attribute by means of two different generation events, then they generate distinct entities. Alternatively, for two activities to generate an entity simultaneously, they would require some synchronization by which they agree the entity is released for use; the end of this synchronization would constitute the actual generation of the entity, but is performed by a single activity. This unicity constraint is formalized as follows.
In PROV-DM, a usage record is a representation of a world event: the consumption of an entity by an activity. The representation includes a description of the modalities of usage of this entity by this activity.
A usage event may be the consumption of a parameter by a procedure, the reading of a value on a port by a service, the reading of a configuration file by a program, or the adding of an ingredient, such as eggs, in a baking activity. Usage may entirely consume an entity (e.g. eggs are no longer available after being added to the mix), or leave it as such, ready for further uses (e.g. a file on a file system can be read indefinitely).
A usage record, written used(id,a,e,attrs,t) in PROV-ASN, has the following constituents:
In PROV-ASN, a usage record's text matches the usageRecord production of the grammar defined in this specification document.
A usage record's id is optional, but comes in handy when annotating usage records (see Section Annotation Record) or when defining derivations.
The following usage records
used(a1,e1,2011-11-16T16:00:00,[ex:parameter="p1"]) used(a1,e2,2011-11-16T16:00:01,[ex:parameter="p2"])
state that the activity, represented by the activity record identified by a1, consumed two entities, represented by entity records identified by e1 and e2, at times 2011-11-16T16:00:00 and 2011-11-16T16:00:01, respectively; the first one was found as the value of parameter p1, whereas the second was found as value of parameter p2. The semantics of parameter in these records is application specific.
A usage record's id is optional. It must be present when annotating usage records (see Section Annotation Record) or when defining precise-1 derivations (see Derivation Record).
A reference to a given entity record may appear in multiple usage records that share a given activity record identifier.
The key purpose of agents in PROV-DM is to assign responsibility for activities. Agents bear responsibility to varying degrees, which is a major reason for distinguishing among all the agents that have some association with an activity and determining which ones are really the originators of the entity. For example, a programmer and a researcher could both be associated with running a workflow, but it may not matter which programmer clicked the button to start the workflow, while it would matter a lot which researcher told the programmer to do so. Another example: a student publishing a web page describing an academic department could result in both the student and the department being agents associated with the activity; it may not matter which student published the web page, but it matters a lot that the department told the student to put it up. So there is some notion of responsibility that needs to be captured.
To this end, PROV-DM offers two kinds of records. The first, introduced in this section, represents an association between an agent and an activity; the second, introduced in Section Responsibility record, represents the fact that an agent was acting on behalf of another, in the context of an activity.
Examples of activity association include designing, participation, initiation and termination, timetabling or sponsoring.
An activity association record, written wasAssociatedWith(a,ag2,attrs) in PROV-ASN, has the following constituents:
In PROV-ASN, an activity association record's text matches the activityAssociationRecord production of the grammar defined in this specification document.
activity(a,[prov:type="workflow"]) agent(ag1,[prov:type="programmer"]) agent(ag2,[prov:type="researcher"]) wasAssociatedWith(a,ag1,[prov:role="loggedInUser", ex:how="webapp"]) wasAssociatedWith(a,ag2,[prov:role="designer", ex:context="phd"])
A start record is a representation of an agent starting an activity. An end record is a representation of an agent ending an activity. Both relations are specialized forms of wasAssociatedWith. They contain attributes describing the modalities of starting/ending activities.
A start record, written wasStartedBy(id,a,ag,attrs) in PROV-ASN, contains:
An end record, written wasEndedBy(id,a,ag,attrs) in PROV-ASN, contains:
In PROV-ASN, start and end records' texts match the startRecord and endRecord productions of the grammar defined in this specification document.
The following assertions
wasStartedBy(a,ag,[ex:mode="manual"]) wasEndedBy(a,ag,[ex:mode="manual"])
state that the activity, represented by the activity record denoted by a, was started and ended by an agent, represented by the record denoted by ag, in "manual" mode, an application-specific characterization of these relations.
To promote take-up, PROV-DM offers a mild version of responsibility in the form of a relation to represent when an agent acted on another agent's behalf. So in the example of someone running a mail program, the program is an agent of that activity and the person is also an agent of the activity, but we would also add that the mail software agent is running on the person's behalf. In the other example, the student acted on behalf of his supervisor, who acted on behalf of the department chair, who acts on behalf of the university, and all those agents are responsible in some way for the activity to take place but we don't say explicitly who bears responsibility and to what degree.
We could also say that an agent can act on behalf of several other agents (a group of agents). This would also make it possible to indirectly reflect chains of responsibility. This also indirectly reflects control without requiring that control is explicitly indicated. In some contexts there will be a need to represent responsibility explicitly, for example to indicate legal responsibility, and that could be added as an extension to this core model. Similarly with control, since in particular contexts there might be a need to define specific aspects of control that various agents exert over a given activity.
Given an activity association record wasAssociatedWith(a,ag2,attrs), a responsibility record, written actedOnBehalfOf(id,ag2,ag1,a,attrs) in PROV-ASN, has the following constituents:
activity(a,[prov:type="workflow"]) agent(ag1,[prov:type="programmer"]) agent(ag2,[prov:type="researcher"]) agent(ag3,[prov:type="funder"]) wasAssociatedWith(a,ag1,[prov:role="loggedInUser"]) wasAssociatedWith(a,ag2) actedOnBehalfOf(ag1,ag2,a,[prov:type="delegation"]) actedOnBehalfOf(ag2,ag3,a,[prov:type="contract"])
In PROV-DM, a derivation record is a representation that some entity is transformed from, created from, or affected by another entity in the world.
Examples of derivation include the transformation of a canvas into a painting, the transportation of a person from London to New York, the transformation of a relational table into a linked data set, and the melting of ice into water.
According to Section Conceptualization, for an entity to be transformed from, created from, or affected by another in some way, there must be some underpinning activities performing the necessary actions resulting in such a derivation. However, asserters may not assert or have knowledge of these activities and associated details: they may not assert or know their number, they may not assert or know their identity, they may not assert or know the attributes characterizing how the relevant entities are used or generated. To accommodate the varying circumstances of the various asserters, PROV-DM allows more or less precise records of derivation to be asserted. Hence, PROV-DM uses the terms precise and imprecise to characterize the different kinds of derivation record. We note that the derivation itself is exact (i.e., deterministic, non-probabilistic), but it is its description, expressed in a derivation record, that may be imprecise.
The lack of precision may come from two sources:
Hence, given a precision axis, with values precise and imprecise, and an activity axis, with values one activity and n activities, we can then form a matrix of possible derivations, precise or imprecise, or corresponding to one activity or n activities. Out of the four possibilities, PROV-DM offers three forms of derivation, while the fourth one is not meaningful. The following table summarises names for the three kinds of derivation, which we then explain.
precision axis: | precise | imprecise |
activity axis: one activity | precise-1 derivation record | imprecise-1 derivation record |
activity axis: n activities | --- | imprecise-n derivation record |
We note that the fourth theoretical case, a precise derivation where the number of activities is not known or asserted, cannot occur.
The three kinds of derivation records are successively introduced. To minimize the number of relation types in PROV-DM, we introduce a PROV-DM reserved attribute steps, which allows us to distinguish the various derivation types.
A precise-1 derivation record, written wasDerivedFrom(id, e2, e1, a, g2, u1, attrs) in PROV-ASN, contains:
It is optional to include the attribute prov:steps in a precise-1 derivation since the record already refers to the one and only one activity underpinning the derivation.
An imprecise-1 derivation record, written wasDerivedFrom(id, e2,e1, attrs) in PROV-ASN, contains:
An imprecise-1 derivation must include the attribute prov:steps, since it is the only means to distinguish this record from an imprecise-n derivation record.
An imprecise-n derivation record, written wasDerivedFrom(id, e2, e1, attrs) in PROV-ASN, contains:
It is optional to include the attribute prov:steps in an imprecise-n derivation record. It defaults to prov:steps="n".
None of the three kinds of derivation is defined to be transitive. Domain-specific specializations of these derivations may be defined in such a way that the transitivity property holds.
In PROV-ASN, a derivation record's text matches the derivationRecord production of the grammar defined in this specification document.
The following assertions state the existence of derivations.
wasDerivedFrom(e5,e3,a4,g2,u2,[]) wasDerivedFrom(e5,e3,a4,g2,u2,[prov:steps="1"]) wasDerivedFrom(e3,e2,[prov:steps="1"]) wasDerivedFrom(e2,e1,[]) wasDerivedFrom(e2,e1,[prov:steps="n"])
The first two are precise-1 derivation records expressing that the activity represented by activity record a4, by using the entity denoted by e3 according to usage record u2, derived the entity denoted by e5 and generated it according to generation record g2. The third record is an imprecise-1 derivation, which is similar for e3 and e2, but it leaves the activity record and associated attributes implicit. The fourth and fifth records are imprecise-n derivation records between e2 and e1, but no information is provided as to the number and identity of activities underpinning the derivation.
A precise-1 derivation record is richer than an imprecise-1 derivation record, itself being more informative than an imprecise-n derivation record. Hence, the following implications hold.
If a derivation record holds for e2 and e1, then this means that the entity represented by the entity record identified by e1 has an influence on the entity represented by the entity record identified by e2, which at the minimum implies temporal ordering, specified as follows. First, we consider one-activity derivations.
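As an informal illustration, the three kinds of derivation record can be told apart mechanically from the optional constituents and the prov:steps attribute. The following Python sketch is not part of PROV-DM: the function name, the keyword arguments, and the choice to reject partially specified records or unknown prov:steps values are assumptions made for the example.

```python
def derivation_kind(activity=None, generation=None, usage=None, attrs=None):
    """Classify a wasDerivedFrom record as precise-1, imprecise-1 or imprecise-n."""
    attrs = attrs or {}
    detail = (activity, generation, usage)
    if all(x is not None for x in detail):
        # Activity, generation and usage are all referenced: precise-1.
        # prov:steps is optional here, so it is not inspected.
        return "precise-1"
    if any(x is not None for x in detail):
        # Assumption of this sketch: partially specified records are rejected.
        raise ValueError("activity, generation and usage must be given together")
    steps = attrs.get("prov:steps", "n")  # prov:steps defaults to "n"
    if steps == "1":
        return "imprecise-1"
    if steps == "n":
        return "imprecise-n"
    raise ValueError(f"unexpected prov:steps value: {steps!r}")


# The five example records, with the entity identifiers left out.
print(derivation_kind("a4", "g2", "u2"))                       # precise-1
print(derivation_kind("a4", "g2", "u2", {"prov:steps": "1"}))  # precise-1
print(derivation_kind(attrs={"prov:steps": "1"}))              # imprecise-1
print(derivation_kind())                                       # imprecise-n
print(derivation_kind(attrs={"prov:steps": "n"}))              # imprecise-n
```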
Then, imprecise-n derivations.
Note that temporal ordering is between generations of e1 and e2, as opposed to precise-1 derivation, which implies temporal ordering between the usage of e1 and generation of e2. Indeed, in the case of imprecise-n derivation, nothing is known about the usage of e1, since there is no associated activity.
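The difference between the two ordering constraints can be illustrated with a small check on asserted times. The sketch below is only an informal reading: the function name and arguments are hypothetical, "precedes" is assumed to allow equal instants, and only timezone-free xsd:dateTime values such as those used in this document are handled.

```python
from datetime import datetime


def _instant(text):
    # Parses timezone-free xsd:dateTime values such as 2011-11-16T16:00:00.
    return datetime.fromisoformat(text)


def derivation_ordering_holds(kind, generated_e2, used_e1=None, generated_e1=None):
    """Check the temporal ordering implied by wasDerivedFrom(e2,e1).

    One-activity derivations compare the usage of e1 with the generation of e2;
    imprecise-n derivations compare the generation of e1 with the generation of e2.
    """
    earlier = generated_e1 if kind == "imprecise-n" else used_e1
    if earlier is None:
        raise ValueError("the time required for this kind of derivation is missing")
    return _instant(earlier) <= _instant(generated_e2)


print(derivation_ordering_holds("precise-1", "2011-11-16T16:00:05",
                                used_e1="2011-11-16T16:00:00"))       # True
print(derivation_ordering_holds("imprecise-n", "2001-10-26T10:00:00",
                                generated_e1="2001-10-26T21:32:52"))  # False
```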
The imprecise-1 derivation has the same meaning as the precise-1 derivation, except that an activity is known to exist, though it does not need to be asserted. This is formalized by the following inference rule, referred to as activity introduction:
activity(a,aAttrs) wasGeneratedBy(g,e2,a,gAttrs) used(u,a,e1,uAttrs) for sets of attribute-value pairs gAttrs, uAttrs, and aAttrs.
Note that inferring derivation from usage and generation does not hold in general. Indeed, when a generation wasGeneratedBy(g, e2, a, attrs2) precedes used(u, a, e1, attrs1), for some e1, e2, attrs1, attrs2, and a, one cannot infer derivation wasDerivedFrom(e2, e1, a, g, u) or wasDerivedFrom(e2,e1), since e2 cannot possibly be determined by e1, given that the creation of e2 precedes the use of e1.
A further inference is permitted from the imprecise-1 derivation record:
Given an activity record identified by pe, entity records identified by e1 and e2, and set of attribute-value pairs attrs2, if wasDerivedFrom(e2,e1, [prov:steps="1"]) and wasGeneratedBy(e2,pe,attrs2) hold, then used(pe,e1,attrs1) also holds for some set of attribute-value pairs attrs1.
This inference is justified by the fact that the entity represented by entity record identified by e2 is generated by at most one activity in a given account (see generation-unicity). Hence, this activity record is also the one referred to in the usage record of e1.
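The inference above can be sketched in a few lines of Python. The tuple encodings of derivation and generation records and the function name are assumptions made for this illustration; the sketch also assumes that the account satisfies generation-unicity, so that each entity maps to at most one activity.

```python
def infer_usages(derivations, generations):
    """Infer used(pe,e1) from wasDerivedFrom(e2,e1,[prov:steps="1"]) and wasGeneratedBy(e2,pe).

    derivations: list of (e2, e1, attrs) triples, attrs being a dict.
    generations: list of (entity, activity) pairs from the same account.
    """
    generated_by = dict(generations)  # relies on generation-unicity
    inferred = []
    for e2, e1, attrs in derivations:
        if attrs.get("prov:steps") == "1" and e2 in generated_by:
            inferred.append((generated_by[e2], e1))
    return inferred


derivations = [("e2", "e1", {"prov:steps": "1"}), ("e4", "e3", {})]
generations = [("e2", "pe"), ("e4", "pe2")]
print(infer_usages(derivations, generations))  # [('pe', 'e1')]
```

The second derivation yields nothing because, without prov:steps="1", it is an imprecise-n derivation and the activity that generated e4 need not have used e3.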
We note that the converse inference does not hold. From wasDerivedFrom(e2,e1) and used(pe,e1), one cannot derive wasGeneratedBy(e2,pe,attrs2) because identifier e1 may occur in usage records referring to many activity records, but they may not be referred to in generation records containing identifier e2.
A complementarity record is a relationship between two entities stated to have compatible characterization over some continuous interval between two events.
The rationale for introducing this relationship is that in general, at any given time, for an entity in the world, there may be multiple ways of characterizing it, and hence multiple representations can be asserted by different asserters. In the example that follows, suppose thing "Royal Society" is represented by two asserters, each using a different set of attributes. If the asserters agree that both representations refer to "The Royal Society", the question of whether any correspondence can be established between the two representations arises naturally. This is particularly relevant when (a) the sets of attributes used by the two representations overlap partially, or (b) when one set is subsumed by the other. In both these cases, we have a situation where each of the two asserters has a partial view of "The Royal Society", and establishing a correspondence between them on the shared attributes is beneficial, as in case (a) each of the two representations complements the other, and in case (b) one of the two (that with the additional attributes) complements the other.
This intuition is made more precise by considering the entities that form the representations of entities at a certain point in time. An entity record represents, by means of attribute-value pairs, a thing and its situation in the world, which remain constant over a characterization interval. As soon as the thing's situation changes, this marks the end of the characterization interval for the entity record representing it. The thing's novel situation is represented by an attribute with a new value, or an entirely different set of attribute-value pairs, embodied in another entity record, with a new characterization interval. Thus, if we overlap the timelines (or, more generally, the sequences of value-changing events) for the two entities, we can hope to establish correspondences amongst the entity records that represent them at various points along that events line. The figure below illustrates this intuition.
Relation complement-of between two entity records is intended to capture these correspondences, as follows. Suppose entity records A and B share a set P of attributes, and each of them has other attributes in addition to P. If the values assigned to each attribute in P are compatible between A and B, then we say that A is-complement-of B, and B is-complement-of A, in a symmetrical fashion. In the particular case where the set of attributes of B is a strict superset of A's attributes, then we say that B is-complement-of A, but the opposite does not hold, so the relation is not symmetric. As a special case, A and B may not share any attributes at all, and yet the asserters may still stipulate that they are representing the same thing "Royal Society"; the symmetric relation may hold trivially in this case.
The term compatible used above means that a mapping can be established amongst the values of attributes in P and found in the two entity expressions. This generalizes to the case where attribute sets P1 and P2 of A, and B, respectively, are not identical but they can be mapped to one another. The simplest case is the identity mapping, in which A and B share attribute set P, and furthermore the values assigned to attributes in P match exactly.
It is important to note that the relation holds only for the characterization intervals of the entity expressions involved. As soon as one attribute changes value in one of them, new correspondences need to be found amongst the new entities. Thus, the relation has a validity span that can be expressed in terms of the event lines of the entity.
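For the simplest case, the identity mapping, the attribute-based part of this definition can be sketched as follows. This is one possible reading rather than a normative test: the function name, the dictionary encoding of entity records and the ex:name attribute are hypothetical, two records with identical attribute sets are treated as not complementing each other (a case the text does not address), and characterization intervals are ignored altogether.

```python
def is_complement_of(b, a):
    """Return True if record b is-complement-of record a under the identity mapping."""
    shared = a.keys() & b.keys()
    if any(a[name] != b[name] for name in shared):
        return False  # a shared attribute has incompatible values
    return bool(b.keys() - a.keys())  # b must add attributes that a lacks


# Hypothetical attribute-value pairs for two views of the Royal Society.
by_location = {"ex:name": "Royal Society", "prov:location": "The Mall"}
by_members = {"ex:name": "Royal Society", "ex:membership": 270}
# Partial overlap: the relation holds in both directions.
print(is_complement_of(by_members, by_location), is_complement_of(by_location, by_members))  # True True

# Strict superset: only the richer record complements the other.
core = {"ex:name": "Royal Society"}
print(is_complement_of(by_members, core), is_complement_of(core, by_members))  # True False
```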
A complementarity record is written wasComplementOf(e2,e1), where e1 and e2 are two identifiers denoting entity records.
The following example illustrates the entity "Royal Society" and its perspectives at various points in time.
entity(rs,[ex:created=1870]) entity(rs_l1,[prov:location="loc2"]) entity(rs_l2,[prov:location="The Mall"]) entity(rs_m1,[ex:membership=250, ex:year=1900]) entity(rs_m2,[ex:membership=300, ex:year=1945]) entity(rs_m3,[ex:membership=270, ex:year=2010]) wasComplementOf(rs_m3, rs_l2) wasComplementOf(rs_m2, rs_l1) wasComplementOf(rs_m2, rs_l2) wasComplementOf(rs_m1, rs_l1) wasComplementOf(rs_m3, rs) wasComplementOf(rs_m2, rs) wasComplementOf(rs_m1, rs) wasComplementOf(rs_l1, rs) wasComplementOf(rs_l2, rs)
The complementarity relation is not transitive. Let us consider identifiers e1, e2, and e3 identifying three entity records such that wasComplementOf(e3,e2) and wasComplementOf(e2,e1) hold. The record wasComplementOf(e3,e1) may not hold because the characterization intervals of the denoted entity records may not overlap.
In PROV-ASN, a complementarity record's text matches the complementarityRecord production of the grammar defined in this specification document.
An entity record identifier can optionally be accompanied by an account identifier. When this is the case, it becomes possible to link two entity record identifiers that appear in different accounts. (In particular, the entity record identifiers in two different accounts are allowed to be the same.) When account identifiers are not available, then the linking of entity records through complementarity can only take place within the scope of a single account.
In the following example, the same description of the Royal Society is structured according to two different accounts. In the second account, we find a complementarity record linking rs_m1 in account ex:acc2 to rs in account ex:acc1.
account(ex:acc1, http://example.org/asserter1, ... entity(rs,[ex:created=1870]) ... ) account(ex:acc2, http://example.org/asserter2, ... entity(rs_m1,[ex:membership=250, ex:year=1900]) ... wasComplementOf(rs_m1, ex:acc2, rs, ex:acc1) )
An annotation record establishes a link between an identifiable PROV-DM record and a note record referred to by its identifier. Multiple note records can be associated with a given PROV-DM record; symmetrically, multiple PROV-DM records can be associated with a given note record. Since note records have identifiers, they can also be annotated. The annotation mechanism (with note record and the annotation record) forms a key aspect of the extensibility mechanism of PROV-DM (see extensibility section).
An annotation record, written hasAnnotation(r,n,attrs) in PROV-ASN, has the following constituents:
In PROV-ASN, a note record's text matches the noteRecord production of the grammar defined in this specification document.
The interpretation of notes is application-specific. See Section Note for a discussion of the difference between note attributes and other record attributes. We also note the present tense in this term to indicate that it may not denote something in the past.
The following records
entity(e1,[prov:type="document"]) entity(e2,[prov:type="document"]) activity(a,transform,t1,t2,[]) used(u1,a,e1,[ex:file="stdin"]) wasGeneratedBy(e2, a, [ex:file="stdout"]) note(n1,[ex:icon="doc.png"]) hasAnnotation(e1,n1) hasAnnotation(e2,n1) note(n2,[ex:style="dotted"]) hasAnnotation(u1,n2)
assert the existence of two documents in the world (attribute-value pair: prov:type="document") represented by entity records identified by e1 and e2, and annotate these records with a note indicating that the icon (an application specific way of rendering provenance) is doc.png. It also asserts an activity, its usage of the first entity, and its generation of the second entity. The usage record is annotated with a style (an application specific way of rendering this edge graphically). To be able to express this annotation, the usage record was provided with an identifier u1, which was then referred to in hasAnnotation(u1,n2).
In this section, two constructs are introduced to group PROV-DM records. The first one, account record, is itself a record, whereas the second one, record container, is not.
In PROV-DM, an account record is a wrapper of records with a dual purpose:
An account record, written account(id, assertIRI, recs, attrs) in PROV-ASN, contains:
In PROV-ASN, an account record's text matches the accountRecord production of the grammar defined in this specification document.
The following account record
account(ex:acc0, http://example.org/asserter, entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ]) ... wasDerivedFrom(e2,e1) ... activity(a0,create-file,t) ... wasGeneratedBy(e0,a0,[]) ... wasAssociatedWith(a4, ag5, [prov:role="communicator"]) )
contains the set of provenance records of section example-prov-asn-encoding, is asserted by agent http://example.org/asserter, and is identified by identifier ex:acc0.
Account records constitute a scope for record identifiers. A record identifier within the scope of an account is intended to denote a single record. However, nothing prevents an asserter from asserting an account containing, for example, multiple entity records with the same identifier but different attribute-values. In that case, they should be understood as a single entity record with this identifier and the union of all attribute-values, as formalized in identified-entity-in-account.
Whilst constraint identified-entity-in-account specifies how to understand multiple entity records with the same identifier within a given account, it does not guarantee that the entity record formed with the union of all attribute-value pairs is consistent. Indeed, a given attribute may be assigned multiple values, resulting in an inconsistent entity record, as illustrated by the following example.
In the following account record, we find two entity records with the same identifier e.
account(ex:acc1, http://example.org/id, entity(e,[prov:type="person", ex:age=20]) entity(e,[prov:type="person", ex:age=30]) ...)
Application of identified-entity-in-account results in an entity record containing the attribute-value pairs age=20 and age=30. This results in an inconsistent characterization of a person. We note that deciding whether a set of attribute-values is consistent or not is application specific and outside the scope of this specification.
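The merging behaviour described by identified-entity-in-account can be sketched as follows. The dictionary encoding and the function names are assumptions of this illustration. Since consistency is application specific, the sketch does not decide it; it only reports attributes that end up with more than one value, which an application may then inspect.

```python
def merge_entity_records(records):
    """Merge entity records sharing an identifier into one record per identifier.

    records: list of (identifier, attrs) pairs from a single account.
    Returns {identifier: {attribute: set of values}}.
    """
    merged = {}
    for identifier, attrs in records:
        target = merged.setdefault(identifier, {})
        for name, value in attrs.items():
            target.setdefault(name, set()).add(value)
    return merged


def multi_valued(merged_record):
    """List the attributes that carry more than one value after merging."""
    return sorted(name for name, values in merged_record.items() if len(values) > 1)


account = [
    ("e", {"prov:type": "person", "ex:age": 20}),
    ("e", {"prov:type": "person", "ex:age": 30}),
]
merged = merge_entity_records(account)
print(multi_valued(merged["e"]))  # ['ex:age']
```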
Account records can be nested since an account record can occur among the records being wrapped by another account.
An account is said to be well-formed if it satisfies the constraints generation-unicity and derivation-use.
The union of two accounts is another account, containing the unions of their respective records, where records with the same identifier should be understood according to constraint identified-entity-in-account. Well-formed accounts are not closed under union because the constraint generation-unicity may no longer be satisfied in the resulting union.
Indeed, let us consider another account record
account(ex:acc2, http://example.org/asserter2, entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ]) ... activity(a1,create-file,t1) ... wasGeneratedBy(e0,a1,[ex:fct="create"]) ... )
with identifier ex:acc2, containing assertions by asserter http://example.org/asserter2 stating that the entity represented by the entity record identified by e0 was generated by an activity represented by the activity record identified by a1 instead of a0 in the previous account ex:acc0. If accounts ex:acc0 and ex:acc2 are merged together, the resulting set of records violates generation-unicity.
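The loss of well-formedness under union can be reproduced with a minimal sketch. Only generation records are modelled, as (entity, activity) pairs held in a set; this encoding and the function name are assumptions of the example, and the derivation-use constraint is not checked.

```python
from collections import Counter


def satisfies_generation_unicity(generations):
    """True if every entity is referred to by a single generation record."""
    counts = Counter(entity for entity, _activity in generations)
    return all(count == 1 for count in counts.values())


acc0 = {("e0", "a0")}  # generation records of ex:acc0
acc2 = {("e0", "a1")}  # generation records of ex:acc2
union = acc0 | acc2

print(satisfies_generation_unicity(acc0))   # True
print(satisfies_generation_unicity(acc2))   # True
print(satisfies_generation_unicity(union))  # False
```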
Account records constitute a scope for record identifiers. Since accounts can be nested, scopes can also be nested; thus, the scope of record identifiers should be understood in the context of such nested scopes. When a record with an identifier occurs directly within an account, then its identifier denotes this record in the scope of this account, except in sub-accounts where records with the same identifier occur.
The following account record is inspired from section example-prov-asn-encoding. This account, identified by ex:acc3, declares entity record with identifier e0, which is being referred to in the nested account ex:acc4. The scope of identifier e0 is account ex:acc3, including subaccount ex:acc4.
account(ex:acc3, http://example.org/asserter1, entity(e0, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice" ]) activity(a0,create-file,t) wasGeneratedBy(e0,a0,[]) account(ex:acc4, http://example.org/asserter2, entity(e1, [ prov:type="File", ex:path="/shared/crime.txt", ex:creator="Alice", ex:content="" ]) activity(a0,copy-file,t) wasGeneratedBy(e1,a0,[ex:fct="create"]) wasComplementOf(e1,e0)))
By contrast, an activity record identified by a0 occurs in each of the two accounts. Therefore, each activity record is asserted in a separate scope, and the two records may represent different activities in the world.
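The nested scoping rule behaves like lexical scoping in programming languages: an identifier is looked up in the innermost account first, then in the enclosing ones. The Python sketch below illustrates this with the accounts ex:acc3 and ex:acc4; the Account class, its resolve method and the string descriptions of records are hypothetical and serve only the illustration.

```python
class Account:
    """An account as a scope for record identifiers, possibly nested in a parent."""

    def __init__(self, identifier, records, parent=None):
        self.identifier = identifier
        self.records = dict(records)  # record identifier -> record description
        self.parent = parent

    def resolve(self, record_id):
        """Return (account identifier, record) for the innermost declaration."""
        scope = self
        while scope is not None:
            if record_id in scope.records:
                return scope.identifier, scope.records[record_id]
            scope = scope.parent
        raise KeyError(record_id)


acc3 = Account("ex:acc3", {"e0": "entity", "a0": "activity create-file"})
acc4 = Account("ex:acc4", {"e1": "entity", "a0": "activity copy-file"}, parent=acc3)

print(acc4.resolve("e0"))  # ('ex:acc3', 'entity')
print(acc4.resolve("a0"))  # ('ex:acc4', 'activity copy-file')
print(acc3.resolve("a0"))  # ('ex:acc3', 'activity create-file')
```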
The account record is the hook by which further meta information can be expressed about provenance, such as asserter, time of creation, signatures. The annotation mechanism can be used for this purpose, but how general meta-information is expressed is beyond the scope of this specification, except for asserters.
A record container is a house-keeping construct of PROV-DM, also capable of bundling PROV-DM records. A record container is not a record, but can be exploited to return assertions in response to a request for the provenance of something ([PROV-PAQ]).
A record container, written container decls recs endContainer in PROV-ASN, contains:
All the records in recs are implicitly wrapped in a default account, scoping all the record identifiers they declare directly, and constituting a top-level account in the hierarchy of accounts. Consequently, every provenance record is always expressed in the context of an account, either explicitly in an asserted account, or implicitly in a container's default account.
In PROV-ASN, a record container's text matches the recordContainer production of the grammar defined in this specification document.
The following container
container prefix ex: http://example.org/, account(ex:acc1,http://example.org/asserter1,...) account(ex:acc2,http://example.org/asserter1,...) endContainer
illustrates how two accounts with identifiers ex:acc1 and ex:acc2 can be returned in a PROV-ASN serialization of the provenance of something.
An attribute is a qualified name. A qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part (see detailed rule in [RDF-SPARQL-QUERY], Section 4.1.1).
A qualified name's prefix is optional. If a prefix occurs in a qualified name, it refers to a namespace declared in the record container. In the absence of prefix, the qualified name refers to the default namespace declared in the container.
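The mapping from a qualified name to an IRI can be sketched as a simple concatenation. The function name and the default namespace IRI used below are assumptions made for this illustration; the detailed rule remains the one referenced in [RDF-SPARQL-QUERY].

```python
def expand_qualified_name(qname, prefixes, default_namespace=None):
    """Map a qualified name to an IRI by concatenating namespace IRI and local part.

    prefixes: dict mapping each prefix declared in the record container to its namespace IRI.
    default_namespace: namespace IRI used for unprefixed names, if one is declared.
    """
    prefix, colon, local = qname.partition(":")
    if not colon:
        # No prefix: the name refers to the default namespace.
        if default_namespace is None:
            raise ValueError(f"no default namespace declared for {qname!r}")
        return default_namespace + qname
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix {prefix!r}")
    return prefixes[prefix] + local


prefixes = {"ex": "http://example.org/"}
print(expand_qualified_name("ex:acc1", prefixes))
# http://example.org/acc1
print(expand_qualified_name("e1", prefixes, "http://example.org/default/"))
# http://example.org/default/e1
```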
From this specification's viewpoint, the interpretation of an attribute declared in a namespace other than prov-dm is out of scope.
The PROV data model introduces a fixed set of attributes in the PROV-DM namespace:
The following start record describes the role of the agent identified by ag in this start relation with activity a.
wasStartedBy(a,ag, [prov:role="program-operator"])
The following record declares an agent of type software agent.
agent(ag, [prov:type="prov:SoftwareAgent" %% xsd:QName])
An identifier is a qualified name. A qualified name can be mapped into an IRI by concatenating the IRI associated with the prefix and the local part (see detailed rule in [RDF-SPARQL-QUERY], Section 4.1.1).
A PROV-DM Literal represents a data value such as a particular string or number. A PROV-DM Literal represents a value whose interpretation is outside the scope of PROV-DM.
In PROV-ASN, a Literal's text matches the Literal production of the grammar defined in this specification document.
The nonterminals stringLiteral and intLiteral are syntactic sugar for quoted strings with datatype xsd:string and xsd:int, respectively.
In particular, a PROV-DM Literal may be an IRI-typed string (with datatype xsd:anyURI); such an IRI has no specific interpretation in the context of PROV-DM.
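The relationship between the convenience notation and the explicit form of a Literal can be sketched as follows. This is a minimal illustration under stated assumptions: the Literal class and the desugar and to_asn helpers are hypothetical names introduced here, and escaping of quotes inside strings is not handled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    lexical: str   # the quoted string
    datatype: str  # a qualified name such as "xsd:string"


def desugar(value: Union[str, int, Literal]) -> Literal:
    """Turn the convenience notation into an explicitly typed Literal."""
    if isinstance(value, Literal):
        return value
    if isinstance(value, bool):
        # bool is a subclass of int in Python; it is not covered by this sketch.
        raise TypeError("booleans are not handled by this sketch")
    if isinstance(value, int):
        return Literal(str(value), "xsd:int")
    if isinstance(value, str):
        return Literal(value, "xsd:string")
    raise TypeError(f"unsupported value: {value!r}")


def to_asn(literal: Literal) -> str:
    """Render a Literal in the explicit quoted-string form with a datatype."""
    return f'"{literal.lexical}" %% {literal.datatype}'


explicit_string = to_asn(desugar("abc"))
explicit_int = to_asn(desugar(1))
```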
The following examples respectively are the string "abc" (expressed using the convenience notation), the string "abc", the integer number 1, the integer number 1 (expressed using the convenience notation) and the IRI "http://example.org/foo".
"abc"
"abc" %% xsd:string
"1" %% xsd:int
1
"http://example.org/foo" %% xsd:anyURI
The following example shows a literal of type xsd:QName (see QName [XMLSCHEMA-2]). The prefix ex must be bound to a namespace declared in the record container.
"ex:value" %% xsd:QName
Time instants are defined according to xsd:dateTime [XMLSCHEMA-2].
It is optional to assert time in usage, generation, and activity records.
An asserter is a creator of PROV-DM records. An asserter is denoted by an IRI. Such an IRI has no specific interpretation in the context of PROV-DM.
A PROV-DM namespace is identified by an IRI reference [IRI]. In PROV-DM, attributes, identifiers, and literals with datatype xsd:QName can be placed in a namespace using the mechanisms described in this specification.
A namespace declaration consists of a binding between a prefix and a namespace. Every qualified name with this prefix in the scope of this declaration refers to this namespace. A default namespace declaration consists of a namespace. Every unprefixed qualified name in the scope of this default namespace declaration refers to this namespace.
A recipe link is an association between an activity record and a process specification that underpins the represented activity. Where a recipe link is given as an IRI, such an IRI has no specific interpretation in the context of PROV-DM.
It is optional to assert recipe links in activities.
Process specifications, as referred to by recipe links, are out of scope of this specification.
Location is an identifiable geographic place (ISO 19112). As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, row, column, and so forth. This document does not specify how to concretely express locations, but instead provides a mechanism to introduce locations in assertions.
Location is an optional attribute of entity records and activity records. The value associated with the attribute location must be a Literal, expected to denote a location.
This section contains the normative specification of common relations of PROV-DM.
The following figure summarizes the additional relations described in subsections 6.2 onwards.
Record: wasAddedTo_Coll(c2,c1) (resp. wasRemovedFrom_Coll(c2,c1)) denotes that collection c2 is an updated version of collection c1, following an insertion (resp. deletion) operation.
Record: wasAddedTo_Key(c,k) (resp. wasRemovedFrom_Key(c,k)) denotes that collection c had a new value with key k added to (resp. removed from) it.
Record: wasAddedTo_Entity(c,e) denotes that collection c had entity e added to it.
Consider the following assertions:
wasAddedTo_Coll(c2,c1)
wasAddedTo_Key(c2,k1)
wasAddedTo_Entity(c2,e1)
wasAddedTo_Coll(c3,c2)
wasAddedTo_Key(c3,k2)
wasAddedTo_Entity(c3,e2)
wasRemovedFrom_Coll(c4,c3)
wasRemovedFrom_Key(c4,k1)
The corresponding graphical representation is shown below.
With these assertions:
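One way to read the assertions above is to replay them and reconstruct the content of each collection. The sketch below is illustrative only. It assumes that collection c1 is initially empty, that the key and entity records asserted for the same collection belong together, and that records appear in the order in which collections were derived; the replay function and the tuple encoding of records are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (relation, collection, other argument)

RECORDS: List[Record] = [
    ("wasAddedTo_Coll", "c2", "c1"),
    ("wasAddedTo_Key", "c2", "k1"),
    ("wasAddedTo_Entity", "c2", "e1"),
    ("wasAddedTo_Coll", "c3", "c2"),
    ("wasAddedTo_Key", "c3", "k2"),
    ("wasAddedTo_Entity", "c3", "e2"),
    ("wasRemovedFrom_Coll", "c4", "c3"),
    ("wasRemovedFrom_Key", "c4", "k1"),
]


def replay(records: List[Record],
           initial: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Rebuild collection contents as key-to-entity mappings."""
    state = {name: dict(content) for name, content in initial.items()}
    # Group the key and entity arguments asserted for each collection.
    details: Dict[str, Dict[str, str]] = {}
    for relation, coll, arg in records:
        details.setdefault(coll, {})[relation] = arg
    for relation, coll, previous in records:
        if relation == "wasAddedTo_Coll":
            content = dict(state[previous])  # copy the earlier version
            key = details[coll]["wasAddedTo_Key"]
            content[key] = details[coll]["wasAddedTo_Entity"]
            state[coll] = content
        elif relation == "wasRemovedFrom_Coll":
            content = dict(state[previous])
            content.pop(details[coll]["wasRemovedFrom_Key"], None)
            state[coll] = content
    return state


state = replay(RECORDS, {"c1": {}})
```

Under these assumptions, c2 holds e1 under key k1, c3 additionally holds e2 under key k2, and c4 holds only e2 under key k2.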
A traceability record states the existence of a "dependency path" between two entities, indicating that one entity can be shown to be in the lineage of another, and may have influenced it in some way. This relation is transitive.
A traceability record, written tracedTo(id,e2,e1,attrs) in PROV-ASN:
In PROV-ASN, a traceability record's text matches the traceabilityRecord production of the grammar defined in this specification document.
A traceability record can be inferred from existing relations, or can be asserted stating that such a dependency path exists without the asserter knowing its individual steps, as expressed by the following constraints.
We note that the previous constraint is not really an inference rule, since there is nothing that we can actually infer. Instead, this constraint should simply be seen as part of the definition of the traceability record.
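Since the traceability relation is stated to be transitive, further tracedTo pairs follow from asserted ones. The sketch below illustrates only that transitivity; it does not reproduce the constraints of this section, and the function name traced_to_closure and the pair encoding are assumptions of the example.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (e2, e1) reads "e2 tracedTo e1"


def traced_to_closure(edges: Set[Edge]) -> Set[Edge]:
    """Return the transitive closure of a set of tracedTo pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # a tracedTo b and b tracedTo d give a tracedTo d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical asserted records: tracedTo(e3,e2) and tracedTo(e2,e1).
asserted = {("e3", "e2"), ("e2", "e1")}
inferred = traced_to_closure(asserted)
```

The closure contains the two asserted pairs and the additional pair (e3, e1).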
PROV-DM allows dependencies amongst activities to be expressed. An information flow ordering record is a representation that an entity was generated by an activity, before it was used by another activity. A control ordering record is a representation that an activity was initiated by another activity.
In PROV-ASN, an activity ordering record's text matches the activityOrderingRecord production of the grammar defined in this specification document.
An information flow ordering record, written as wasInformedBy(id,a2,a1,attrs) in PROV-ASN, contains:
An information flow ordering record is formally defined as follows.
The relationship wasInformedBy is not transitive. Indeed, consider the following records.
wasInformedBy(a2,a1) wasInformedBy(a3,a2)
We cannot infer wasInformedBy(a3,a1) from them. Indeed, from wasInformedBy(a2,a1), we know that there exists e1 such that e1 was generated by a1 and used by a2. Likewise, from wasInformedBy(a3,a2), we know that there exists e2 such that e2 was generated by a2 and used by a3. The following illustration shows a case where transitivity cannot hold. The horizontal axis represents time. We see that e1 was generated after e2 was used. Furthermore, the illustration also shows that a3 completes before a1. So it is impossible for a3 to have used an entity generated by a1.
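The argument above can be checked on a small set of records. The sketch assumes, as the explanation suggests, that wasInformedBy(a2,a1) holds when some entity generated by a1 was used by a2; the set-of-pairs encoding and the function name informed_by are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical records matching the explanation above.
generated_by: Set[Tuple[str, str]] = {("e1", "a1"), ("e2", "a2")}  # (entity, activity)
used: Set[Tuple[str, str]] = {("a2", "e1"), ("a3", "e2")}          # (activity, entity)


def informed_by(generated_by: Set[Tuple[str, str]],
                used: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return pairs (a2, a1) such that a2 used an entity generated by a1."""
    return {
        (user, generator)
        for (entity_g, generator) in generated_by
        for (user, entity_u) in used
        if entity_g == entity_u
    }


pairs = informed_by(generated_by, used)
```

Here pairs contains (a2, a1) and (a3, a2) but not (a3, a1): no entity generated by a1 was used by a3, so the relation derived from these records is not transitive.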
A control ordering record, written as wasStartedBy(a2,a1) in PROV-ASN, contains:
Such a record states control ordering between a2 and a1, specified as follows.
In the following assertions, we find two activity records, identified by a1 and a2, representing two activities, which took place on two separate hosts. The third record indicates that the latter activity was started by the former.
activity(a1,workflow,t1,t2,[ex:host="server1.example.org"])
activity(a2,sub-workflow,t3,t4,[ex:host="server2.example.org"])
wasStartedBy(a2,a1)
Alternatively, we could have asserted the existence of an entity, representing a request to create a sub-workflow. This request, issued by a1, triggered the start of a2.
entity(e,[prov:type="creation-request"]) wasGeneratedBy(e,a1) wasStartedBy(a2,e)
A revision record is a representation of the creation of an entity considered to be a variant of another. Deciding whether something is made available as a revision of something else usually involves an agent who represents someone in the world who takes responsibility for approving that the former is a due variant of the latter.
A revision record, written wasRevisionOf(e2,e1,ag,attrs) in PROV-ASN, contains:
In PROV-ASN, a revision record's text matches the revisionRecord production of the grammar defined in this specification document.
A revision record needs to satisfy the following constraint, linking the two entity records by a derivation, and stating them to be a complement of a third entity record.
wasRevisionOf is a strict sub-relation of wasDerivedFrom since two entities e2 and e1 may satisfy wasDerivedFrom(e2,e1) without being a variant of each other.
The following revision assertion
agent(ag,[prov:type="QualityController"]) entity(e1,[prov:type="document"]) entity(e2,[prov:type="document"]) wasRevisionOf(e2,e1,ag)
states that the document represented by the entity record identified by e2 is a revision of the document represented by the entity record identified by e1; the agent denoted by ag is responsible for this new version of the document.
An attribution record represents that an entity is ascribed to an agent and is compliant with the attributionRecord production.
An attribution record, written wasAttributedTo(e,ag,attr), contains the following components:
Attribution models the notion of an activity generating an entity identified by e being controlled by an agent ag, which takes responsibility for generating e. Formally, this is expressed as the following necessary condition.
In PROV-ASN, an attribution record's text matches the attributionRecord production of the grammar.
activity(pe,recipe,t1,t2,attr1) wasGeneratedBy(e,pe) wasAssociatedWith(pe,ag,attr2) for some sets of attribute-value pairs attr1 and attr2, and times t1 and t2.
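The necessary condition for attribution can be illustrated by searching a set of records for an activity that both generated e and was associated with ag. This is a sketch only: the function name supporting_activities and the pair encoding are assumptions, and attributes and times are ignored. Because the condition is existential, an empty result means only that the given records do not exhibit such an activity, not that the attribution is invalid.

```python
from typing import Set, Tuple


def supporting_activities(e: str,
                          ag: str,
                          generated_by: Set[Tuple[str, str]],
                          associated_with: Set[Tuple[str, str]]) -> Set[str]:
    """Activities pe with wasGeneratedBy(e,pe) and wasAssociatedWith(pe,ag)."""
    return {
        pe
        for (entity, pe) in generated_by
        if entity == e and (pe, ag) in associated_with
    }


# Hypothetical records: wasGeneratedBy(e,pe) and wasAssociatedWith(pe,ag).
generated_by = {("e", "pe")}
associated_with = {("pe", "ag")}
found = supporting_activities("e", "ag", generated_by, associated_with)
```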
A quotation record is a representation of the repeating or copying of some part of an entity, compatible with the quotationRecord production.
A quotation record, written wasQuotedFrom(e2,e1,ag2,ag1,attrs), contains:
In PROV-ASN, a quotation record's text matches the quotationRecord production of the grammar.
wasDerivedFrom(e2,e1) wasAttributedTo(e2,ag2) wasAttributedTo(e1,ag1)
A summary record represents that an entity is a synopsis or abbreviation of another entity. A summary record is compliant with the summaryRecord production.
An assertion wasSummaryOf, written wasSummaryOf(e2,e1,attrs), contains:
In PROV-ASN, a summary record's text matches the summaryRecord production of the grammar.
wasSummaryOf is a strict sub-relation of wasDerivedFrom.
An original source record represents an entity in which another entity first appeared. An original source record is compliant with the originalSourceRecord production.
An assertion hadOriginalSource, written hadOriginalSource(e2,e1,attrs), contains:
hadOriginalSource is a strict sub-relation of wasDerivedFrom.
In PROV-ASN, an original source record's text matches the originalSourceRecord production of the grammar.
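Because wasRevisionOf, wasSummaryOf and hadOriginalSource are each described as a strict sub-relation of wasDerivedFrom, a derivation can be inferred from any such record, while the converse does not hold. The sketch below illustrates that one-way inference; the tuple encoding of records and the function name infer_derivations are assumptions of the example.

```python
from typing import List, Set, Tuple

SUB_RELATIONS = {"wasRevisionOf", "wasSummaryOf", "hadOriginalSource"}


def infer_derivations(records: List[Tuple[str, ...]]) -> Set[Tuple[str, str, str]]:
    """Infer wasDerivedFrom(e2,e1) from each sub-relation record on (e2,e1)."""
    return {
        ("wasDerivedFrom", record[1], record[2])
        for record in records
        if record[0] in SUB_RELATIONS
    }


# Hypothetical records; the agent argument of wasRevisionOf is ignored.
records = [
    ("wasRevisionOf", "e2", "e1", "ag"),
    ("wasSummaryOf", "e4", "e3"),
    ("hadOriginalSource", "e6", "e5"),
    ("wasAttributedTo", "e2", "ag"),
]
derived = infer_derivations(records)
```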
The PROV data model provides several extensibility points that allow designers to specialize it to specific applications or domains. We summarize these extensibility points here:
The PROV-DM namespace declares a set of reserved attributes: type, location.
The PROV-DM namespace declares a reserved attribute: role.
The PROV data model is designed to be application and technology independent, but specializations of PROV-DM are welcome and encouraged. To ensure inter-operability, specializations of the PROV data model that exploit the extensibility points summarized in this section must preserve the semantics specified in this document. For instance, a qualified attribute on a domain specific entity record must represent an aspect of an entity and this aspect must remain unchanged during the characterization's interval of this entity record.
This specification introduces the notion of an identifiable entity in the world. In PROV-DM, an entity record is a representation of such an identifiable entity. An entity record includes an identifier identifying this entity. Identifiers are qualified names, which can be mapped to IRIs.
The term 'resource' is used in a general sense for whatever might be identified by a URI [RFC3986]. On the Web, a URI denotes a resource, without any expectation that the resource is accessed.
The purpose of this section is to clarify the relationship between resource and the notions of entity and entity record.
In the context of PROV-DM, a resource is just a thing in the world. One may take multiple perspectives on such a thing and its situation in the world, fixing some of its aspects.
We refer to the example of section 2.1 for a resource (at some URL) and three different perspectives, referred to as entities. Three different entity records can be expressed for this report, which, in the PROV-ASN sample below, are expressed within the same account.
container
  prefix app urn:example:
  prefix cr http://example.org/crime/
  account(acc1, http://example.org/asserter1,
    entity(app:0, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
    entity(app:1, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
    entity(app:2, [ prov:type="Document", cr:author="John" ])
    ...)
endContainer
Each entity record contains an identifier that identifies the entity it represents. In this example, three identifiers were minted, and their prefix uses the URN syntax with the "example" namespace.
Given that the report is a resource denoted by the URI http://example.org/crime.txt, we could simply use this URI as the identifier of an entity. This would save us from minting new URIs. Hence, the report URI would play a double role: as a URI it denotes a resource accessible at that URI, and as a PROV-DM identifier, it identifies a specific characterization of this report. A given identifier identifies a single entity record within the scope of an account. Hence, below, all entity records have been given the same identifier but appear in the scope of different accounts.
container
  prefix app http://example.org/
  prefix cr http://example.org/crime/
  account(acc2, http://example.org/asserter1,
    entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
    ...)
  account(acc3, http://example.org/asserter1,
    entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
    ...)
  account(acc4, http://example.org/asserter1,
    entity(app:crime.txt, [ prov:type="Document", cr:author="John" ])
    ...)
endContainer
In this case, the qualified name app:crime.txt, which maps to the URI http://example.org/crime.txt, still denotes the same resource; however, the perspective we take on that resource is expressed as a different entity record in each account, each happening to have the same identifier.
Alternatively, if we need to assert the existence of two different perspectives on the report within the same account, then alternate identifiers must be used, one of them being allowed to be the resource URI.
container
  prefix app http://example.org/
  prefix app2 urn:example:
  prefix cr http://example.org/crime/
  account(acc5, http://example.org/asserter1,
    entity(app:crime.txt, [ prov:type="Document", cr:path="http://example.org/crime.txt" ])
    entity(app2:1, [ prov:type="Document", cr:path="http://example.org/crime.txt", cr:version="2.1", cr:content="...", cr:date="2011-10-07" ])
    ...)
endContainer
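The scoping rule illustrated by these examples, namely that an identifier identifies a single entity record within the scope of an account, can be sketched with a store keyed by account and identifier. The class name AccountScopedStore and its methods are hypothetical and are not defined by PROV-DM; rejecting a second record with the same identifier in the same account is a simplification made for this example.

```python
from typing import Dict, Tuple


class AccountScopedStore:
    """Entity records indexed by (account, identifier)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], Dict[str, str]] = {}

    def add_entity(self, account: str, identifier: str,
                   attrs: Dict[str, str]) -> None:
        key = (account, identifier)
        if key in self._records:
            # The same identifier may not name two records in one account.
            raise ValueError(f"{identifier} already declared in {account}")
        self._records[key] = dict(attrs)

    def lookup(self, account: str, identifier: str) -> Dict[str, str]:
        return self._records[(account, identifier)]


store = AccountScopedStore()
# The same identifier used in two different accounts is accepted.
store.add_entity("acc2", "app:crime.txt", {"prov:type": "Document"})
store.add_entity("acc3", "app:crime.txt",
                 {"prov:type": "Document", "cr:version": "2.1"})
```

Two perspectives within one account would need distinct identifiers, as in the account acc5 above.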
WG membership to be listed here.