# Overview

The idea of this document is to sketch what aspects of the provenance model can be formalized and how they can be formalized, as a first step towards establishing a consensus on the (intended) meaning of the components of the model and the consistency constraints or inferences that can be applied to the model to distinguish good from bad provenance records.

## Idea of the semantics

As a starting point, I will assume that we intend the assertions made in a PROV-DM instance to be intended to describe one, consistent state of the world, much like a logical formula is said to be satisfied in a mathematical model. That is, I propose an approach similar to that taken in model theory, where the PROV-DM instance corresponds to a formula or theory of a logic, and the semantics corresponds to what logicians call a model.

For example, the formula $\forall x. x \geq 0$ is satisfied in a mathematical model where the individual elements denotable by x are natural numbers, and the ordering relation symbol $\geq$ is interpreted as the usual ordering relation on natural numbers. Here, the goal is to come up with a plausible "intended model" for interpreting PROV-DM instances, where the formulas are assertions in PROV-DM and the individuals are things and agents. This is complicated by the fact that many statements about provenance involve talking about objects that change over time.

The modeling methodology I'm using below is similar to that used in some (old, but still relevant) work on preservation: see [1]

The word "world" is used in PROV-DM to talk about the actual state of affairs that the PROV-DM instance describes, which is what I would usually call a "model". The word "model" is used in PROV-DM mainly in the sense of "data model", that is, to talk about what I would otherwise call the syntax of PROV-DM. To avoid confusion with the uses of terms in PROV-DM, I will use "world" to describe the actual state of affairs, and will try to avoid ambiguous uses of the word "model".

# Basics

I will use syntax for PROV-DM assertions (= formulas) as described in the second working draft of PROV-DM.

A PROV-DM instance consisting of assertions φ1...φn is interpreted as a conjunction, that is, the overall instance is considered to hold if each atomic formula in it holds.

The rest of the document will discuss the structure of the worlds and when an atomic assertion holds in a given world.

## Identifiers

A lowercase symbol x,y,... on its own denotes an identifier. Identifiers may or may not be URIs. I view identifiers as being like variables in logic (or blank nodes in RDF): just because we have two different identifiers x and y doesn't tell us that they denote different things, since we could discover that they are actually the same later.

## Times and Intervals

We assume an ordered set $(T,\leq)$ of time instants. This could be a total order (i.e. a linear timeline) or a partial order describing events that are not necessarily reconciled to a single global clock. We also consider a set Intervals of pairs of times [t1,t2] such that $t_1 \leq t_2$. Intuitively, an interval denotes the set of all time instants between t1 and t2, that is, $\{t \mid t_1 \leq t \leq t_2\}$.

## Attributes and Values

We assume a set Attributes of attribute labels and a set Values of possible values of attributes.

# Formulas

Summary of syntax of formulas from PROV-DM.

 

 formula ::= element_formula | relation_formula | ... element_formula  ::= entity(id,attrs) | activity(id,st,et,attrs) | agent(id,attrs) | note(id, attrs) relation_formula  ::= wasGeneratedBy(id,e,a,attrs,t) | used(id,e,a,attrs,t) | wasStartedBy(id,a,ag,attrs) | wasEndedBy(id,a,ag,attrs) | hasAnnotation(r,n,attrs) | alternateOf(e1,e2) | specializationOf(e1,e2) | ... 

Some additional formulas defined in PROV-DM but not yet handled in the semantics are omitted.

# Worlds

## Things

Things are things in the world. Each thing has a lifetime during which it exists and attributes whose values can change over time.

To model this, a world includes

• a set Things of things
• a function $lifetime : Things \to Intervals$ from objects to time intervals
• a function $value : Things \times Attributes \times Times \to Values$

Note that this description does not say what the structure of an object is, only how it may be described in terms of its time interval and attribute values. An object could just be a record of fixed attribute values; it could be a bear; it could be the Royal Society; it could be a transcendental number like π. All that matters from our point of view is that we know how to map the object to its time interval and attribute mapping.

It is possible for two Things to be indistinguishable by their attribute values and lifetime, but have different identity.

## Objects

A Object is described by a time interval and attributes with unchanging values. Objects encompass entities, agents, events and activities.

To model this, a world includes

• a set Objects
• a function $lifetime : Objects \to Intervals$ from objects to time intervals
• a function $value : Objects \times Attributes \to Values$

Intuitively, lifetime(e) is the time interval during which object e exists; and for each $t \in lifetime(e)$, the value value(e,a) is the value of attribute a during the object's lifetime.

As with Things, it is also possible to have two different objects that are indistinguishable by their attributes and time intervals.

### Entities

An entity is a kind of object that describes a time-slice of a thing, during which some of the thing's attributes are fixed. We assume:

• a set $Entities \subseteq Objects$ of entities, disjoint from Activities and Events below.
• a function $thingOf : Entities \to Things$ that associates each Entity with a Thing, such that for each $t \in lifetime(obj)$, and each attribute defined by obj, we have value(obj,a) = value(thingOf(obj),a,t)

### Agents

An agent is an entity that can act, by controlling, starting, ending, or participating in activities. Agents can act on behalf of other agents. We introduce:

• a set $Agents \subseteq Entities$ of agents, which is also necessarily disjoint from Activities and Events
• a relation $ActsFor \subseteq Agents \times Agents$ [TODO: This probably needs to be made time- and activity-dependent]

### Events

(Note: The term event is controversial).

An event is an object whose lifetime is a single time instant, and relates an activity to an entity (which could be an agent). We introduce:

• A set Events of events, disjoint from Entities and Activities
• A function $time : Events \to Times$ giving the time of each event
• The derived ordering on events given by $evt_1 \leq evt_2 \iff t(evt_1) \leq t(evt_2)$
• A function $type: Events \to \{start,end,use,generate\}$
• A relation $Used \subseteq Events \times Entities$ saying when an event used an entity. An event can use at most one entity, and if $(evt,e)\in Used$ then $time(evt) \in lifetime(e)$ must hold.
• A relation $Generated \subseteq Events \times Entities$ saying when an event generated an entity. An event can generate at most one entity, and an entity is generated by at most one event, and , and if $(evt,e)\in Generated$ then lifetime(e) = [time(evt),_] must hold.

### Actvities

An activity is an object that encompasses a set of events. We introduce

• a set $Activities \subseteq Objects$ of activities, disjoint from Entities and Events
• a relation $ActivityEvent \subseteq Activities \times Events$ associating activities with events, satisfying $time(evt) \in lifetime(act)$ whenever $(act,evt) \in ActivitiesEvents$.

1. Question: Does every activity need to be associated with some event?
2. Question: Does every event need to be associated with some activity?
3. Question: Can an event be associated with more than one activity?

(ISSUE-214: Luc suggests replaces Generated,ActivityEvents with $Generated : Events \times Entities \times Activities$ and similar for Used).

# Semantics

## Interpretations

Note that the above discussion does not mention identifiers. We need to add structure to the world to link identifiers to the objects they denote. We do this using a function which we shall call an interpretation.

I will assume that the mapping from identifiers to objects may not change over time. Thus, we consider interpretations as follows:

• An interpretation function $I : Identifiers \to Objects$ describing which object is the target of each identifier. We write the time argument as a subscript, i.e., I(id).

## Satisfaction

Consider an assertion φ, a world W and an interpretation I. We define notation $W,I \models \phi$ which means that φ is satisfied in W,I. For basic assertions, the definition of the satisfaction relation is given in the next few subsections. For a conjunction of assertions $\phi_1,\ldots,\phi_n$ we write $W,I \models \phi_1,\ldots,\phi_n$ to indicate that $W,I \models \phi_1$ and ... and $W,I \models \phi_n$ hold.

# Semantics of records

## Entity Records

PROV-DM refers to entity assertions entity(id,[attr1 = val1,...]).

Entity assertions can be interpreted as follows:

• $W,I \models entity(id,[attr_1=val_1,...])$ if and only if:
1. $I(id) \in Entities$
2. for each attribute attri, we have value(I(id),attri) = vali.

## Activity Records

An activity record is of the form activity(id,plan,st,et,attrs) where id is a identifier referring to the activity, st is a start time and et is an end time. There is also a set of attribute-value pairs describing additional features of the activity. We don't do anything to interpret the plan here, it is purely an annotation (e.g. pointing to a web page describing the program or service that was executed, to source code, or to some other information that might be useful).

• We say that $W,I \models activity(id,plan,st,et,attrs)$ if and only if:
1. The identifier id maps to an element of Activities, i.e. $I(id) \in Activities$
2. If st is specified then it is equal to the start time of the activity, that is: lifetime(id) = [st,_]
3. If et is specified then it is equal to the end time of the activity, that is: lifetime(id) = [_,et]
4. The values of attributes given in attrs satisfy value(I(id),attri) = vali.

## Agent Records

For the purpose of the semantics so far, an agent record holds at a time instant or interval if the corresponding entity record holds and the entity denoted by the record identifier is an agent. [TODO: Write this out more carefully.]

## Note records

The note records represent ad hoc annotations to other parts of PROV-DM. As such, there is little we can say about their meaning, and so we won't discuss note records in the formal semantics. However, dialects of PROV-DM that use annotations in a systematic way could be justified by extending the semantics and interpreting note records.

# Semantics of Relations

## Generation

The generation assertion is of the form wasGeneratedBy(id,e,a,attrs,t) where id is an event (identifier), e is an object identifier, a is an activity identifier, attrs is a set of attribute-value pairs, and t is an optional time.

• $W,I \models wasGeneratedBy(id,e,a,attrs,t)$ if and only if:
1. The identifier id denotes an event $evt = I(id) \in Events$
2. The identifier e denotes an object $obj = I(e) \in Objects$
3. The identifier a denotes an activity $act = I(a) \in Activities$.
4. The event evt is part of act, that is, such that $(act,evt) \in ActivityEvents$.
5. The type of evt is generation, i.e. type(evt) = generation
6. The event evt occurred at time t, i.e. time(evt) = t
7. The object obj came into existence at time t, that is, lifetime(I(e)) = [t,_]
8. The event evt generated obj, i.e. $(evt,obj) \in Generated$
9. The attribute values are correct, i.e., for each attribute-value pair attri,vali in attrs we have value(evt,attri) = vali.

Question: How should we enforce the intuitive constraint that an object is created by exactly one event/PE?

Question: How to enforce the constraint that e's attribute values are determined by the activity's "state" at time t?

Question: Should uses refer to an activity or to the specific event? A use event might be part of more than one activity?

## Use

The use assertion is of the form used(id,a,e,attrs,t) where id denotes an event, a is an activity identifier, e is an object identifier, attrs is a set of attribute-value pairs, and t is an optional time.

• $W,I \models used(id,a,e,attrs,t)$ if and only if:
1. The identifier id denotes an event $evt = I(id) \in Events$.
2. The identifier a denotes an activity $act = I(id) \in Activities$
3. The identifier e denotes an object $obj = I(e) \in Objects$
4. The event evt is part of act, i.e. $(act,evt) \in ActivityEvents$.
5. The type of evt is use, i.e., type(evt) = use.
6. The event evt occurred at time t, i.e. time(evt) = t
7. The entity obj existed at time t, i.e., $t \in lifetime(obj)$.
8. The event evt used obj, i.e. $(evt,obj) \in Used$.
9. The attribute values are correct, i.e., for each attribute-value pair attri,vali in attrs we have value(evt,attri) = vali.

## Association Records

We probably need a relation to connect agents to activities in order to interpret association. [TODO]

## Start Records

A start record is interpreted as follows:

• $W,I \models wasStartedBy(id,a,ag,attrs)$ holds if and only if:
1. id denotes an event evt = I(id)
2. a denotes an activity act = I(a)
3. ag denotes an agent agent = I(ag)
4. The event evt has type start, i.e. type(evt) = start.
5. [TODO] The agent was associated with the event.
6. The event was part of the activity, that is, $(act,evt) \in ActivitiesEvents$, and lifetime(act) = [time(evt),_].
7. The attribute values are correct, i.e., for each attribute-value pair attri,vali in attrs we have value(evt,attri) = vali.

## End Records

An activity end record is interpreted as follows:

• $W,I \models wasEndedBy(id,a,ag,attrs)$ holds if and only if:
1. id denotes an event evt = I(id)
2. a denotes an activity act = I(a)
3. ag denotes an agent agent = I(ag)
4. The event evt has type end, i.e. type(evt) = start.
5. [TODO] The agent was associated with the event.
6. The event was part of the activity, that is, $(act,evt) \in ActivitiesEvents$, and lifetime(act) = [_,time(evt)].
7. The attribute values are correct, i.e., for each attribute-value pair attri,vali in attrs we have value(evt,attri) = vali.

## Responsibility

The actedOnBehalfOf relation is interpreted using the ActsFor relation. [TODO]

## Derivation

The derivation relations are complicated and are not yet modeled. [TODO]

## Specialization

This section is an attempt to formally describe the new specializationOf relation, which indicates when one entity record "provides a more concrete characterization of" another.

• $W,I \models specializationOf(a,b)$ if and only if:
1. Both a and b are entity identifiers, denoting obj1 = I(a) and obj2 = I(b).
2. The two Entities refer to the same Thing, that is, thingOf(obj1) = thingOf(obj2).
3. The lifetime of obj1 is contained in that of obj2
4. Every attribute attr of obj2 is also an attribute of obj1, and for each such attribute we have value(obj1,attr) = value(obj2,attr).

The second criterion says that the two Entities are descriptions of the same Things. Note that the third criterion allows obj1 and obj2 to have the same lifetime (or that of obj2 can be larger). The last criterion allows obj1 to define more attributes than obj2, but they must agree on their common attributes.

Question: There has been controversy over whether specializationOf is transitive and/oranti-symmetric:

• Transitivity: If specializationOf(a,b) and specializationOf(b,c) hold then specializationOf(a,c) hold. This holds for the above definition.
• Antisymmetry: If specializationOf(a,b) and specializationOf(b,a) hold then a = b. This doesn't follow from the current definition (but it would if we stipulated that two entities that have the same interval, attribute and thing are equal).

## Alternate

The alternateOf relation indicates when two entity records "provide different characterizations of the same thing".

• $W,I \models alternateOf(a,b)$ if and only if:
1. Both a and b are entity identifiers, denoting obj1 = I(a) and obj2 = I(b).
2. The lifetime of obj1 overlaps that of obj2. That is, $lifetime(obj_1) \cap lifetime(obj_2) \neq \emptyset$.
3. The two objects refer to the same underlying Thing $thingOf(obj_1) \neq thingOf(obj_2)$
4. For every attribute attr common to both obj1 and obj2, we have value(obj1,attr) = value(obj2,attr).

Question: There has been controversy whether alternateOf is symmetric and transitive:

• Transitivity: If alternateOf(a,b) holds then alternateOf(b,a) holds.
• Symmetry: If alternateOf(a,b) and alternateOf(b,c) hold then alternateOf(a,c) hold. This does not hold of the above definition because we require the intervals to overlap (a and c do not necessarily have overlapping lifetimes.)

We also consider the following properties which have been suggested:

• specializationOf(e1,e2) implies alternateOf(e1,e2)? (This holds at the moment.)
• alternateOf(a,b) if and only if there exists c such that specializationOf(a,c) and specializationOf(b,c)? (This does not necessarily hold without further assumptions about the Entities).

Problem: The above definition is symmetric. However, it is not transitive. Suppose

and suppose alternateOf(obj1,obj2) and alternateOf(obj2,obj3) hold. Then alternateOf(obj1,obj3) cannot hold since the intervals of obj1,obj3 do not overlap.

## Annotation records

As with note assertions, annotation records are considered out of scope of the formal semantics.

[TODO]