Constraints of the Provenance Data Model

Abstract

Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications.

This document defines a subset of PROV instances called valid PROV instances. The intent of validation is to ensure that a PROV instance represents a history of objects and their interactions which is consistent, and thus safe to use for the purpose of logical reasoning and other kinds of analysis. Valid PROV instances satisfy certain definitions, inferences, and constraints. These definitions, inferences, and constraints provide a measure of consistency checking for provenance and reasoning over provenance. They can also be used to normalize PROV instances to forms that can easily be compared in order to determine whether two PROV instances are equivalent. Validity and equivalence are also defined for PROV bundles (that is, named instances) and documents (that is, a toplevel instance together with zero or more bundles).

2. Rationale

This section is non-normative.

This section gives a high-level rationale that provides some further background for the constraints, but does not affect the technical content of the rest of the specification.

2.1 Entities, Activities and Agents

One of the central challenges in representing provenance information is how to deal with change. Real-world objects, information objects and Web resources change over time, and the characteristics that make them identifiable in a given situation are sometimes subject to change as well. PROV allows for things to be described in different ways, with different descriptions of their state.

An entity is a thing one wants to provide provenance for and whose situation in the world is described by some fixed attributes. An entity has a lifetime, defined as the period between its generation event and its invalidation event. An entity's attributes are established when the entity is created and (partially) describe the entity's situation and state during the entirety of the entity's lifetime. Does this atomicity make it impractical to describe a continuously evolving entity like a SPARQL graph store?

A different entity (perhaps representing a different user or system perspective) may fix other aspects of the same thing, and its provenance may be different. Different entities that fix aspects of the same thing are called alternates, and the PROV relations of specializationOf and alternateOf can be used to link such entities. Must alternates agree on the lifetime of an entity?

Besides entities, a variety of other PROV objects have attributes, including activity, generation, usage, invalidation, start, end, communication, attribution, association, delegation, and derivation. Each object has an associated duration interval (which may be a single time point), and attribute-value pairs for a given object are expected to be descriptions that hold for the object's duration.

However, the attributes of entities have special meaning because they are considered to be fixed aspects of underlying, changing things. This motivates constraints on alternateOf and specializationOf relating the attribute values of different entities.

In order to describe the provenance of something during an interval over which relevant attributes of the thing are not fixed, a PROV instance would describe multiple entities, each with its own identifier, lifetime, and fixed attributes, and express dependencies between the various entities using events. For example, in order to describe the provenance of several versions of a document, involving attributes such as authorship that change over time, one can use different entities for the versions linked by appropriate generation, usage, revision, and invalidation events.

There is no assumption that the set of attributes listed in an entity statement is complete, nor that the attributes are independent of or orthogonal to each other. Similarly, there is no assumption that the attributes of an entity uniquely identify it. Two different entities that present the same aspects of possibly different things can have the same attributes; this leads to potential ambiguity, which is mitigated through the use of identifiers.

An activity's lifetime is delimited by its start and its end events. It occurs over an interval delimited by two instantaneous events. However, an activity statement need not mention start or end time information, because they may not be known. An activity's attribute-value pairs are expected to describe the activity's situation during its lifetime.

An activity is not an entity. Indeed, an entity exists in full at any point in its lifetime, persists during this interval, and preserves the characteristics provided. In contrast, an activity is something that occurs, happens, unfolds, or develops through time. This distinction is similar to the distinction between 'continuant' and 'occurrent' in logic [Logic].

2.2 Events

Although time is important for provenance, provenance can be used in many different contexts within individual systems and across the Web. Different systems may use different clocks which may not be precisely synchronized, so when provenance statements are combined by different systems, an application may not be able to align the times involved to a single global timeline. Hence, PROV is designed to minimize assumptions about time. Instead, PROV talks about (identified) events.

The PROV data model is implicitly based on a notion of instantaneous events (or just events), that mark transitions in the world. Events include generation, usage, or invalidation of entities, as well as start or end of activities. This notion of event is not first-class in the data model, but it is useful for explaining its other concepts and its semantics [PROV-SEM]. Thus, events help justify inferences on provenance as well as validity constraints indicating when provenance is self-consistent.

Five kinds of instantaneous events are used in PROV. The activity start and activity end events delimit the beginning and the end of activities, respectively. The entity generation, entity usage, and entity invalidation events apply to entities, and the generation and invalidation events delimit the lifetime of an entity. More precisely:

An activity start event is the instantaneous event that marks the instant an activity starts.

An activity end event is the instantaneous event that marks the instant an activity ends.

An entity generation event is the instantaneous event that marks the final instant of an entity's creation timespan, after which it is available for use. The entity did not exist before this event.

An entity usage event is the instantaneous event that marks the first instant of an entity's consumption timespan by an activity. The described usage had not started before this instant, although the activity could potentially have used the same entity at a different time.

An entity invalidation event is the instantaneous event that marks the initial instant of the destruction, invalidation, or cessation of an entity, after which the entity is no longer available for use. The entity no longer exists after this event.

2.3 Types

As set out in other specifications, the identifiers used in PROV documents have associated type information. An identifier can have more than one type, reflecting subtyping or allowed overlap between types, and so we define a set of types of each identifier, typeOf(id). Some types are, however, required not to overlap (for example, no identifier can describe both an entity and an activity). In addition, an identifier cannot be used to identify both an object (that is, an entity, activity or agent) and a property (that is, a named event such as usage or generation, or a relationship such as attribution). This specification includes disjointness and typing constraints that check these requirements. Here, we summarize the type constraints in Table 1.

Table 1: Summary of Typing Constraints (recapitulates PROV-DM Table 4)
In relation...	identifier	has type(s)...

entity(e,attrs)	e	'entity'
activity(a,t1,t2,attrs)	a	'activity'
agent(ag,attrs)	ag	'agent'
used(id; a,e,t,attrs)	e	'entity'
used(id; a,e,t,attrs)	a	'activity'
wasGeneratedBy(id; e,a,t,attrs)	e	'entity'
wasGeneratedBy(id; e,a,t,attrs)	a	'activity'
wasInformedBy(id; a2,a1,attrs)	a2	'activity'
wasInformedBy(id; a2,a1,attrs)	a1	'activity'
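wasStartedBy(id; a2,e,a1,t,attrs)	a2	'activity'
wasStartedBy(id; a2,e,a1,t,attrs)	e	'entity'
wasStartedBy(id; a2,e,a1,t,attrs)	a1	'activity'
wasEndedBy(id; a2,e,a1,t,attrs)	a2	'activity'
wasEndedBy(id; a2,e,a1,t,attrs)	e	'entity'
wasEndedBy(id; a2,e,a1,t,attrs)	a1	'activity'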
wasInvalidatedBy(id; e,a,t,attrs)	e	'entity'
wasInvalidatedBy(id; e,a,t,attrs)	a	'activity'
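wasDerivedFrom(id; e2,e1,a,g,u,attrs)	e2	'entity'
wasDerivedFrom(id; e2,e1,a,g,u,attrs)	e1	'entity'
wasDerivedFrom(id; e2,e1,a,g,u,attrs)	a	'activity'
wasAttributedTo(id; e,ag,attrs)	e	'entity'
wasAttributedTo(id; e,ag,attrs)	ag	'agent'
wasAssociatedWith(id; a,ag,pl,attrs)	a	'activity'
wasAssociatedWith(id; a,ag,pl,attrs)	ag	'agent'
wasAssociatedWith(id; a,ag,pl,attrs)	pl	'entity'
actedOnBehalfOf(id; ag2,ag1,a,attrs)	ag2	'agent'
actedOnBehalfOf(id; ag2,ag1,a,attrs)	ag1	'agent'
actedOnBehalfOf(id; ag2,ag1,a,attrs)	a	'activity'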
alternateOf(e1,e2)	e1	'entity'
alternateOf(e1,e2)	e2	'entity'
specializationOf(e1,e2)	e1	'entity'
specializationOf(e1,e2)	e2	'entity'
mentionOf(e1,e2,b)	e1	'entity'
mentionOf(e1,e2,b)	e2	'entity'
mentionOf(e1,e2,b)	b	'entity'
hadMember(c,e)	c	'entity' 'prov:Collection'
hadMember(c,e)	e	'entity'
entity(c,[prov:type='prov:EmptyCollection',...])	c	'entity' 'prov:Collection' 'prov:EmptyCollection'
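
As a rough illustration of how the typing summary might be used, the following Python sketch accumulates typeOf(id) from a handful of relations and then looks for identifiers typed as both 'entity' and 'activity'. The tuple encoding of statements, the TYPING dictionary (which covers only a few rows of Table 1 and omits the optional statement identifier), and the function names are assumptions made for this example; they are not defined by PROV.

```python
from collections import defaultdict

# Argument position -> type, for a few rows of Table 1.
# Hypothetical encoding: args exclude the optional statement id.
TYPING = {
    "entity": {0: "entity"},
    "activity": {0: "activity"},
    "agent": {0: "agent"},
    "used": {0: "activity", 1: "entity"},
    "wasGeneratedBy": {0: "entity", 1: "activity"},
    "wasAttributedTo": {0: "entity", 1: "agent"},
}

def type_of(statements):
    """Build typeOf(id) from statements given as (relation, args) pairs."""
    types = defaultdict(set)
    for relation, args in statements:
        for position, type_name in TYPING.get(relation, {}).items():
            # Placeholders ("-") are simply skipped in this sketch.
            if position < len(args) and args[position] != "-":
                types[args[position]].add(type_name)
    return types

def violates_disjointness(types):
    """Identifiers that are typed as both 'entity' and 'activity'."""
    return sorted(i for i, ts in types.items() if {"entity", "activity"} <= ts)

instance = [
    ("entity", ("ex:report",)),
    ("wasGeneratedBy", ("ex:report", "ex:writing")),
    ("used", ("ex:report", "ex:draft")),  # ex:report appears in activity position
]
types = type_of(instance)
conflicts = violates_disjointness(types)
```

Here ex:report ends up with both types, so the sketch reports it as a conflict, while ex:writing and ex:draft each receive a single type.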

2.4 Validation Process Overview

This section collects common concepts and operations that are used throughout the specification, and relates them to background terminology and ideas from logic [Logic], constraint programming [CHR], and database constraints [DBCONSTRAINTS]. This section does not attempt to provide a complete introduction to these topics, but it is provided in order to aid readers familiar with one or more of these topics in understanding the specification, and, for all readers, to clarify some of the motivations for choices in the specification.

Constants, Variables and Placeholders

PROV statements involve identifiers, literals, placeholders, and attribute lists. Identifiers are, according to PROV-N, expressed as qualified names which can be mapped to URIs [IRI]. However, in order to specify constraints over PROV instances, we also need variables that represent unknown identifiers, literals, or placeholders. These variables are similar to those in first-order logic [Logic]. A variable is a symbol that can be replaced by other symbols, including either other variables or constant identifiers, literals, or placeholders. In a few special cases, we also use variables for unknown attribute lists. To help distinguish identifiers and variables, we also term the former 'constant identifiers' to highlight their non-variable nature.

Several definitions and inferences conclude by saying that some objects exist such that some other formulas hold. Such an inference introduces fresh existential variables into the instance. An existential variable denotes a fixed object that exists, but its exact identity is unknown. Existential variables can stand for unknown identifiers or literal values only; we do not allow existential variables that stand for unknown attribute lists.

In particular, many occurrences of the placeholder symbol "-" stand for unknown objects; these are handled by expanding them to existential variables. Some placeholders, however, indicate the absence of an object, rather than an unknown object. In other words, the placeholder is overloaded, with different meanings in different places.

An expression is called a term if it is either a constant identifier, literal, placeholder, or variable. We write t to denote an arbitrary term.

Substitution

A substitution is a function that maps variables to terms. Concretely, since we only need to consider substitutions of finite sets of variables, we can write substitutions as [x₁ = t₁,...,x_n=t_n]. A substitution S = [x₁ = t₁,...,x_n=t_n] can be applied to a term as follows.

If the term is a variable x_i, one of the variables in the domain of S, then S(x_i) = t_i.
If the term is a constant identifier or literal c, then S(c) = c.

In addition, a substitution can be applied to an atomic formula (PROV statement) p(t₁,...,t_n) by applying it to each term, that is, S(p(t₁,...,t_n)) = p(S(t₁),...,S(t_n)). Likewise, a substitution S can be applied to an instance I by applying it to each atomic formula (PROV statement) in I, that is, S(I) = {S(A) | A ∈ I}.
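
The three levels of application above (term, statement, instance) can be sketched in Python as follows. This is only an illustration under stated assumptions: variables are encoded as strings beginning with an underscore, statements as (relation, args) tuples, and a variable outside the domain of S is left unchanged. None of these conventions is prescribed by this specification.

```python
def is_variable(term):
    """Assumed convention for this sketch: variables start with '_'."""
    return isinstance(term, str) and term.startswith("_")

def apply_to_term(subst, term):
    # Variables in the domain of S are replaced; constants, literals, and
    # (by assumption) variables outside the domain are returned unchanged.
    if is_variable(term) and term in subst:
        return subst[term]
    return term

def apply_to_statement(subst, statement):
    """S(p(t1,...,tn)) = p(S(t1),...,S(tn))."""
    relation, args = statement
    return (relation, tuple(apply_to_term(subst, t) for t in args))

def apply_to_instance(subst, instance):
    """S(I) = {S(A) | A in I}."""
    return {apply_to_statement(subst, s) for s in instance}

S = {"_x": "ex:e1"}
I = {("wasGeneratedBy", ("_x", "ex:a1", "_t"))}
result = apply_to_instance(S, I)
```

In this usage, _x is replaced by ex:e1 while _t, which is not in the domain of S, is kept as a variable.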

Formulas

For the purpose of constraint checking, we view PROV statements (possibly involving existential variables) as formulas. An instance is analogous to a "theory" in logic, that is, a set of formulas all thought to describe the same situation. The set can also be thought of as a single, large formula: the conjunction of all of the atomic formulas.

The atomic constraints considered in this specification can be viewed as atomic formulas:

Uniqueness constraints employ atomic equational formulas t = t'.
Ordering constraints employ atomic precedence relations that can be thought of as binary formulas precedes(t,t') or strictly_precedes(t,t').
Typing constraints 'type' ∈ typeOf(id) can be represented as atomic formulas typeOf(id,'type').
Impossibility constraints employ the conclusion INVALID, which is equivalent to the logical constant False.

Similarly, the definitions, inferences, and constraint rules in this specification can also be viewed as logical formulas, built up out of atomic formulas, logical connectives "and" (∧), "implies" (⇒), and quantifiers "for all" (∀) and "there exists" (∃). For more background on logical formulas, see a logic textbook such as [Logic].

A definition of the form A IF AND ONLY IF there exists y₁...y_m such that B₁ and ... and B_k can be thought of as a formula ∀ x₁,....,x_n. A ⇔ ∃ y₁...y_m . B₁ ∧ ... ∧ B_k, where x₁...x_n are the free variables of the definition.
An inference of the form IF A₁ and ... and A_l THEN there exists y₁...y_m such that B₁ and ... and B_k can be thought of as a formula ∀ x₁,....,x_n. A₁ ∧ ... ∧ A_l ⇒ ∃ y₁...y_m . B₁ ∧ ... ∧ B_k, where x₁...x_n are the free variables of the inference.
A uniqueness, ordering, or typing constraint of the form IF A₁ ∧ ... ∧ A_l THEN C can be viewed as a formula ∀ x₁...x_n. A₁ ∧ ... ∧ A_l ⇒ C.
A constraint of the form IF A₁ ∧ ... ∧ A_l THEN INVALID can be viewed as a formula ∀ x₁...x_n. A₁ ∧ ... ∧ A_l ⇒ False.

Satisfying definitions, inferences, and constraints

In logic, a formula's meaning is defined by saying when it is satisfied. We can view definitions, inferences, and constraints as being satisfied or not satisfied in a PROV instance, augmented with information about the constraints.

A logical equivalence as used in a definition is satisfied when the formula ∀ x₁,....,x_n. A ⇔ ∃ y₁...y_m . B₁ ∧ ... ∧ B_k holds, that is, for any substitution of the variables x₁,....,x_n, formula A and formula ∃ y₁...y_m. B₁ ∧ ... ∧ B_k are either both true or both false.
A logical implication as used in an inference is satisfied when the formula ∀ x₁,....,x_n. A₁ ∧ ... ∧ A_l ⇒ ∃ y₁...y_m . B₁ ∧ ... ∧ B_k holds, that is, for any substitution of the variables x₁,....,x_n, if A₁ ∧ ... ∧ A_l is true, then for some further substitution of terms for variables y₁...y_m, formula B₁ ∧ ... ∧ B_k is also true.
A uniqueness, ordering, or typing constraint is satisfied when its associated formula ∀ x₁...x_n. A₁ ∧ ... ∧ A_l ⇒ C holds, that is, for any substitution of the variables x₁,....,x_n, if A₁ ∧ ... ∧ A_l is true, then C is also true.
An impossibility constraint is satisfied when the formula ∀ x₁...x_n. A₁ ∧ ... ∧ A_l ⇒ False holds. This is logically equivalent to ∄ x₁...x_n. A₁ ∧ ... ∧ A_l, that is, there exists no substitution for x₁...x_n making A₁ ∧ ... ∧ A_l true.

Merging

Merging is an operation that takes two terms and compares them to see if they are equal, or can be made equal by substituting an existential variable with another term. This operation is a special case of unification, a common operation in logical reasoning, including logic programming and automated reasoning. Merging two terms t,t' results in either a substitution S such that S(t) = S(t'), or failure indicating that there is no substitution that can be applied to both t and t' to make them equal.
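
A minimal Python sketch of merging two terms is given below. It assumes, purely for illustration, that existential variables are strings beginning with an underscore and that every other term is a constant; failure is represented by None.

```python
def is_variable(term):
    """Assumed convention for this sketch: variables start with '_'."""
    return isinstance(term, str) and term.startswith("_")

def merge(t1, t2):
    """Return a substitution (dict) S with S(t1) == S(t2), or None on failure."""
    if t1 == t2:
        return {}            # already equal: the empty substitution suffices
    if is_variable(t1):
        return {t1: t2}      # bind the variable to the other term
    if is_variable(t2):
        return {t2: t1}
    return None              # two distinct constants cannot be made equal
```

For example, merge("_x", "ex:e1") yields a substitution binding _x to ex:e1, whereas merge("ex:e1", "ex:e2") fails.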

Applying definitions, inferences, and constraints

Formulas can also be interpreted as having computational content. That is, if an instance does not satisfy a formula, we can often apply the formula to the instance to produce another instance that does satisfy the formula. Definitions, inferences, and uniqueness constraints can be applied to instances:

A definition of the form ∀ x₁,....,x_n. A ⇔ ∃ y₁...y_m . B₁ ∧ ... ∧ B_k can be applied by searching for any occurrences of A in the instance and adding B₁, ..., B_k, generating fresh existential variables y₁,...,y_m, and conversely, whenever there is an occurrence of B₁, ..., B_k, adding A. In our setting, the defined formulas A are never used in other formulas, so it is sufficient to replace all occurrences of A with their definitions. The formula A is then redundant, and can be removed from the instance.
An inference of the form ∀ x₁,....,x_n. A₁ ∧ ... ∧ A_p ⇒ ∃ y₁...y_m . B₁ ∧ ... ∧ B_k can be applied by searching for any occurrences of A₁ ∧ ... ∧ A_p in the instance and, for each such match, for which the entire conclusion does not already hold (for some y₁,...,y_m), adding B₁ ∧ ... ∧ B_k to the instance, generating fresh existential variables y₁,...,y_m.
A uniqueness constraint of the form ∀ x₁...x_n. A₁ ∧ ... ∧ A_p ⇒ t = t' can be applied by searching for an occurrence A₁ ∧ ... ∧ A_p in the instance, and if one is found, merging the terms t and t'. If successful, the resulting substitution is applied to the instance; otherwise, the application of the uniqueness constraint fails.
A key constraint can similarly be applied by searching for different occurrences of a statement with the same identifier, merging the corresponding parameters of the statements, and concatenating their attribute lists. The substitutions obtained by merging are applied to the instance.

As noted above, uniqueness or key constraint application can fail, if a required merging step fails. Failure of constraint application means that there is no way to add information to the instance to satisfy the constraint, which in turn implies that the instance is invalid.
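
To make the application of a uniqueness constraint concrete, the following Python sketch repeatedly merges the identifiers of generation statements that share the same entity and activity, in the spirit of a unique-generation rule. The simplified statement shape (id, entity, activity), the underscore convention for existential variables, and all function names are assumptions of this example rather than definitions from this specification.

```python
def is_variable(term):
    return isinstance(term, str) and term.startswith("_")

def merge(t1, t2):
    """Substitution making t1 and t2 equal, or None if none exists."""
    if t1 == t2:
        return {}
    if is_variable(t1):
        return {t1: t2}
    if is_variable(t2):
        return {t2: t1}
    return None

def substitute(subst, instance):
    return {(rel, tuple(subst.get(t, t) for t in args)) for rel, args in instance}

def apply_unique_generation(instance):
    """Merge ids of generations that share entity and activity.

    Statements are ("wasGeneratedBy", (id, entity, activity)) tuples.
    Returns the rewritten instance, or None if a merge fails (invalid).
    """
    changed = True
    while changed:
        changed = False
        gens = sorted(s for s in instance if s[0] == "wasGeneratedBy")
        for i, (_, (id1, e1, a1)) in enumerate(gens):
            for _, (id2, e2, a2) in gens[i + 1:]:
                if (e1, a1) == (e2, a2) and id1 != id2:
                    subst = merge(id1, id2)
                    if subst is None:
                        return None      # no way to satisfy the constraint
                    instance = substitute(subst, instance)
                    changed = True
                    break
            if changed:
                break                    # restart the search on the new instance
    return instance

ok = apply_unique_generation({
    ("wasGeneratedBy", ("_g", "ex:e", "ex:a")),
    ("wasGeneratedBy", ("ex:g1", "ex:e", "ex:a")),
})
bad = apply_unique_generation({
    ("wasGeneratedBy", ("ex:g1", "ex:e", "ex:a")),
    ("wasGeneratedBy", ("ex:g2", "ex:e", "ex:a")),
})
```

In the first call the existential variable _g is merged with ex:g1 and the two statements collapse into one; in the second, two distinct constant identifiers cannot be merged, so the sketch reports failure.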

The process of applying definitions, inferences, and constraints to a PROV instance until all of them are satisfied is similar to what is sometimes called chasing [DBCONSTRAINTS] or saturation [CHR]. We call this process normalization.

Termination

In general, applying sets of logical formulas of the above definition, inference, and constraint forms is not guaranteed to terminate. A simple example is the inference R(x,y) ⇒ ∃z. R(x,z) ∧ R(z,y), which can be applied to {R(a,b)} to generate an infinite sequence of larger and larger instances. To ensure that normalization, validity, and equivalence are decidable, we require that normalization terminates. There is a great deal of work on termination of the chase in databases, or of sets of constraint handling rules. The termination of the notion of normalization defined in this specification is guaranteed because the definitions, inferences and uniqueness/key constraints correspond to a weakly acyclic set of tuple-generating and equality-generating dependencies, in the terminology of [DBCONSTRAINTS]. The termination of the remaining ordering, typing, and impossibility constraints is easy to show. Appendix C gives a proof that the definitions, inferences, and uniqueness and key constraints are weakly acyclic and therefore terminating.
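
The non-terminating example can be observed experimentally. The Python sketch below applies the inference R(x,y) ⇒ ∃z. R(x,z) ∧ R(z,y) for a fixed number of rounds and records the instance size after each round. The encoding of R as a set of pairs, the round-based strategy, and the naming of fresh variables are assumptions made for illustration only.

```python
from itertools import count

def chase_rounds(instance, rounds):
    """Apply R(x,y) => exists z. R(x,z) and R(z,y) for a bounded number of rounds.

    The conclusion is added only for matches where it does not already
    hold for some z. Returns the instance size after each round.
    """
    fresh = count(1)
    sizes = [len(instance)]
    for _ in range(rounds):
        new_facts = set()
        for (x, y) in instance:
            holds = any((x, z) in instance and (z, y) in instance
                        for (_, z) in instance)
            if not holds:
                z = f"_z{next(fresh)}"          # fresh existential variable
                new_facts |= {(x, z), (z, y)}
        instance = instance | new_facts
        sizes.append(len(instance))
    return sizes

sizes = chase_rounds({("a", "b")}, 3)
```

Starting from {R(a,b)}, every round leaves some facts whose conclusion does not yet hold, so the instance keeps growing however many rounds are allowed.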

There is an important subtlety that is essential to guarantee termination. This specification draws a distinction between knowing that an identifier has type 'entity', 'activity', or 'agent', and having an explicit entity(id), activity(id), or agent(id) statement in the instance. For example, focusing on entity statements, we can infer 'entity' ∈ typeOf(id) if entity(id) holds in the instance. In contrast, if we only know that 'entity' ∈ typeOf(id), this does not imply that entity(id) holds.

This distinction (for both entities and activities) is essential to ensure termination of the inferences, because we allow inferring that a declared entity(id,attrs) has a generation and invalidation event, using Inference 7 (entity-generation-invalidation-inference). Likewise, for activities, we allow inferring that a declared activity(id,t1,t2,attrs) has a start and end event, using Inference 8 (activity-start-end-inference). These inferences do not apply to identifiers whose types are known, but for which there is not an explicit entity or activity statement. If we strengthened the type inference constraints to add new entity or activity statements for the entities and activities involved in generating or starting other declared entities or activities, then we could keep generating new entities and activities in an unbounded chain into the past (as in the "chicken and egg" paradox). The design adopted here requires that instances explicitly declare the entities and activities that are relevant for validity checking, and only these can be inferred to have invalidation/generation and start/end events. This inference is not supported for identifiers that are indirectly referenced in other relations and therefore have type 'entity' or 'activity'.

Figure 1: Overview of the Validation Process

Checking ordering, typing, and impossibility constraints

The ordering, typing, and impossibility constraints are checked rather than applied. This means that they do not generate new formulas expressible in PROV, but they do generate basic constraints that might or might not be consistent with each other. Checking such constraints follows a saturation strategy similar to that for normalization:

For ordering constraints, we check by generating all of the precedes and strictly-precedes relationships specified by the rules. These can be thought of as a directed graph whose nodes are terms, and whose edges are precedes or strictly-precedes relationships. An ordering constraint of the form ∀ x₁...x_n. A₁ ∧ ... ∧ A_p ⇒ precedes(t,t') can be applied by searching for occurrences of A₁ ∧ ... ∧ A_p and for each such match adding the atomic formula precedes(t,t') to the instance, and similarly for strictly-precedes constraints. After all such constraints have been checked, and the resulting edges added to the graph, the ordering constraints are violated if there is a cycle in the graph that includes a strictly-precedes edge, and satisfied otherwise.
For typing constraints, we check by constructing a function typeOf(id) mapping identifiers to sets of possible types. We start with a function mapping each identifier to the empty set, reflecting no constraints on the identifiers' types. A typing constraint of the form ∀ x₁...x_n. A₁ ∧ ... ∧ A_p ⇒ 'type' ∈ typeOf(id) is checked by adjusting the function by adding 'type' to typeOf(id) for each conclusion 'type' ∈ typeOf(id) of the rule. Typing constraints with multiple conclusions are handled analogously. Once all constraints have been checked in all possible ways, we check that the disjointness constraints hold for the resulting typeOf function. (These are essentially impossibility constraints.)
For impossibility constraints, we check by searching for the forbidden pattern that the impossibility constraint describes. Any match of this pattern leads to failure of the constraint checking process. An impossibility constraint of the form ∀ x₁...x_n. A₁ ∧ ... ∧ A_p ⇒ False can be applied by searching for occurrences of A₁ ∧ ... ∧ A_p in the instance, and if any such occurrence is found, signaling failure.
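
The final step of ordering-constraint checking, detecting a cycle that includes a strictly-precedes edge, can be sketched in Python as follows. The sketch assumes the precedes and strictly-precedes edges have already been generated and are given as lists of (earlier, later) pairs; the function name and the event names in the example are hypothetical.

```python
from collections import defaultdict

def ordering_violated(precedes, strictly_precedes):
    """True if some cycle in the combined graph contains a strict edge."""
    graph = defaultdict(set)
    for u, v in list(precedes) + list(strictly_precedes):
        graph[u].add(v)

    def reachable(start, goal):
        # Iterative depth-first search over all edges, strict or not.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return False

    # A strict edge u -> v lies on a cycle exactly when u is reachable from v.
    return any(reachable(v, u) for u, v in strictly_precedes)

cyclic = ordering_violated(
    precedes=[("gen", "use"), ("use", "inv")],
    strictly_precedes=[("inv", "gen")],
)
harmless = ordering_violated(
    precedes=[("start", "end"), ("end", "start")],
    strictly_precedes=[],
)
```

A cycle consisting only of precedes edges is not a violation, since it merely forces the events involved to be simultaneous; only a cycle through a strict edge is contradictory.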

A normalized instance that satisfies all of the checked constraints is called valid. Validity can be, but is not required to be, checked by normalizing and then checking constraints. Any other algorithm that provides equivalent behavior (that is, accepts the same valid instances and rejects the same invalid instances) is allowed. In particular, the checked constraints and the applied definitions, inferences and uniqueness constraints do not interfere with one another, so it is also possible to mix checking and application. This may be desirable in order to detect invalidity more quickly.

Equivalence and Isomorphism

Given two normal forms, a natural question is whether they contain the same information, that is, whether they are equivalent (if so, then the original instances are also equivalent.) By analogy with logic, if we consider normalized PROV instances with existential variables to represent sets of possible situations, then two normal forms may describe the same situation but differ in inessential details such as the order of statements or of elements of attribute-value lists. To remedy this, we can easily consider instances to be equivalent up to reordering of attributes. However, instances can also be equivalent if they differ only in choice of names of existential variables. Because of this, the appropriate notion of equivalence of normal forms is isomorphism. Two instances I₁ and I₂ are isomorphic if there is an invertible substitution S mapping existential variables to existential variables such that S(I₁) = I₂. This is similar to the notion of equivalence used in [RDF], where blank nodes play an analogous role to existential variables.

Equivalence can be checked by normalizing instances, checking that both instances are valid, then testing whether the two normal forms are isomorphic. (It is technically possible for two invalid normal forms to be isomorphic, but to be considered equivalent, the two instances must also be valid.) As with validity, the algorithm suggested by this specification is just one of many possible ways to implement equivalence checking; it is not required that implementations compute normal forms explicitly, only that their determinations of equivalence match those obtained by the algorithm in this specification.
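
The isomorphism test on two normal forms can be illustrated with a brute-force Python sketch that tries every bijective renaming of existential variables. It assumes instances are sets of (relation, args) tuples, that existential variables are strings beginning with an underscore, and that attribute lists have been omitted; it is exponential in the number of variables and is meant only to show the idea, not to suggest an implementation strategy.

```python
from itertools import permutations

def is_variable(term):
    return isinstance(term, str) and term.startswith("_")

def variables(instance):
    return sorted({t for _, args in instance for t in args if is_variable(t)})

def isomorphic(i1, i2):
    """Search for an invertible renaming S of variables with S(i1) == i2."""
    v1, v2 = variables(i1), variables(i2)
    if len(v1) != len(v2) or len(i1) != len(i2):
        return False
    for image in permutations(v2):
        subst = dict(zip(v1, image))   # a bijection from v1 onto v2
        renamed = {(rel, tuple(subst.get(t, t) for t in args)) for rel, args in i1}
        if renamed == i2:
            return True
    return False

left = {("used", ("_u1", "ex:a", "ex:e1")), ("used", ("_u2", "ex:a", "ex:e2"))}
right = {("used", ("_x", "ex:a", "ex:e2")), ("used", ("_y", "ex:a", "ex:e1"))}
other = {("used", ("_x", "ex:a", "ex:e2")), ("used", ("_y", "ex:a", "ex:e3"))}
```

Here left and right differ only in the names of their existential variables and are therefore isomorphic, while left and other are not, because they mention different constant identifiers.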

From Instances to Bundles and Documents

PROV documents can contain multiple instances: a toplevel instance consisting of the set of statements not appearing within a bundle, and zero or more named instances called bundles. For the purpose of inference and constraint checking, these instances are treated independently. That is, a PROV document is valid provided that each instance in it is valid and the names of its bundles are distinct. Similarly, a PROV document is equivalent to another if their toplevel instances are equivalent, they have the same number of bundles with the same names, and the instances of their corresponding bundles are equivalent. Analogously to blank nodes in [RDF], the scope of an existential variable in PROV is the instance level, so existential variables with the same name occurring in different instances do not necessarily denote the same term. This is a consequence of the fact that the instances of two equivalent documents only need to be pairwise isomorphic; this is a weaker property than requiring that there be a single isomorphism that works for all of the corresponding instances.
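
Document-level validity, as described above, reduces to independent checks on each instance plus a check that bundle names are distinct. A small Python sketch follows; the representation of a document as a toplevel instance plus a list of (name, instance) pairs, and the instance_valid callback standing in for instance-level validity checking, are assumptions of this example.

```python
def document_valid(toplevel, bundles, instance_valid):
    """Check a document given as a toplevel instance and (name, instance) pairs.

    instance_valid is a caller-supplied predicate (hypothetical here) that
    decides validity of a single instance; instances are checked independently.
    """
    names = [name for name, _ in bundles]
    if len(names) != len(set(names)):
        return False                      # bundle names must be distinct
    return instance_valid(toplevel) and all(instance_valid(i) for _, i in bundles)

# Stand-in predicate for demonstration only: treats non-empty instances as valid.
def nonempty(instance):
    return len(instance) > 0

doc_ok = document_valid({"s0"}, [("ex:b1", {"s1"}), ("ex:b2", {"s2"})], nonempty)
doc_dup = document_valid({"s0"}, [("ex:b1", {"s1"}), ("ex:b1", {"s2"})], nonempty)
```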

2.5 Summary of inferences and constraints

Table 2 summarizes the inferences and constraints specified in this document, broken down by component and type or relation involved.

Table: work in progress; these entries might change when the document is updated.

Table 2: Summary of inferences and constraints for PROV Types and Relations
Type or Relation Name	Inferences and Constraints	Component

Entity	Inference 7 (entity-generation-invalidation-inference) Inference 21 (specialization-attributes-inference) Constraint 23 (key-object) Constraint 56 (impossible-object-property-overlap) Constraint 57 (entity-activity-disjoint)	1
Activity	Inference 8 (activity-start-end-inference) Constraint 23 (key-object) Constraint 29 (unique-startTime) Constraint 30 (unique-endTime) Constraint 56 (impossible-object-property-overlap) Constraint 57 (entity-activity-disjoint)
Generation	Inference 6 (generation-use-communication-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 25 (unique-generation) Constraint 36 (generation-within-activity) Constraint 38 (generation-precedes-invalidation) Constraint 39 (generation-precedes-usage) Constraint 41 (generation-generation-ordering) Constraint 43 (derivation-usage-generation-ordering) Constraint 44 (derivation-generation-generation-ordering) Constraint 45 (wasStartedBy-ordering) Constraint 46 (wasEndedBy-ordering) Constraint 47 (specialization-generation-ordering) Constraint 49 (wasAssociatedWith-ordering) Constraint 50 (wasAttributedTo-ordering) Constraint 51 (actedOnBehalfOf-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Usage	Inference 6 (generation-use-communication-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 35 (usage-within-activity) Constraint 39 (generation-precedes-usage) Constraint 40 (usage-precedes-invalidation) Constraint 43 (derivation-usage-generation-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Communication	Inference 5 (communication-generation-use-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 37 (wasInformedBy-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Start	Inference 9 (wasStartedBy-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 27 (unique-wasStartedBy) Constraint 29 (unique-startTime) Constraint 32 (start-precedes-end) Constraint 35 (usage-within-activity) Constraint 36 (generation-within-activity) Constraint 37 (wasInformedBy-ordering) Constraint 33 (start-start-ordering) Constraint 45 (wasStartedBy-ordering) Constraint 49 (wasAssociatedWith-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
End	Inference 10 (wasEndedBy-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 28 (unique-wasEndedBy) Constraint 30 (unique-endTime) Constraint 32 (start-precedes-end) Constraint 35 (usage-within-activity) Constraint 36 (generation-within-activity) Constraint 37 (wasInformedBy-ordering) Constraint 34 (end-end-ordering) Constraint 46 (wasEndedBy-ordering) Constraint 49 (wasAssociatedWith-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Invalidation	Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 26 (unique-invalidation) Constraint 38 (generation-precedes-invalidation) Constraint 40 (usage-precedes-invalidation) Constraint 42 (invalidation-invalidation-ordering) Constraint 45 (wasStartedBy-ordering) Constraint 46 (wasEndedBy-ordering) Constraint 48 (specialization-invalidation-ordering) Constraint 49 (wasAssociatedWith-ordering) Constraint 50 (wasAttributedTo-ordering) Constraint 51 (actedOnBehalfOf-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)

Derivation	Inference 11 (derivation-generation-use-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 43 (derivation-usage-generation-ordering) Constraint 44 (derivation-generation-generation-ordering) Constraint 52 (typing)	2
Revision	Inference 12 (revision-is-alternate-inference)
Quotation	No specific constraints
Primary Source	No specific constraints
Influence	No specific constraints

Agent	Constraint 23 (key-object) Constraint 56 (impossible-object-property-overlap)	3
Attribution	Inference 13 (attribution-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 50 (wasAttributedTo-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Association	Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 49 (wasAssociatedWith-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Delegation	Inference 14 (delegation-inference) Inference 15 (influence-inference) Constraint 24 (key-properties) Constraint 51 (actedOnBehalfOf-ordering) Constraint 55 (impossible-property-overlap) Constraint 52 (typing)
Influence	Inference 15 (influence-inference) Constraint 24 (key-properties)

Bundle constructor	No specific constraints; see section 6.2 Bundles and Documents	4
Bundle type	No specific constraints; see section 6.2 Bundles and Documents	4

Alternate	Inference 16 (alternate-reflexive) Inference 17 (alternate-transitive) Inference 18 (alternate-symmetric) Constraint 52 (typing)	5
Specialization	Inference 19 (specialization-transitive) Inference 20 (specialization-alternate-inference) Inference 21 (specialization-attributes-inference) Constraint 47 (specialization-generation-ordering) Constraint 48 (specialization-invalidation-ordering) Constraint 54 (impossible-specialization-reflexive) Constraint 52 (typing)
Mention	Inference 22 (mention-specialization-inference) Constraint 31 (unique-mention) Constraint 52 (typing)

Collection	No specific constraints	6
Membership	Constraint 58 (membership-empty-collection) Constraint 52 (typing)	6

3. Compliance with this document

For the purpose of compliance, the normative sections of this document are section 3. Compliance with this document, section 4. Definitions and Inferences, section 5. Constraints, and section 6. Normalization, Validity, and Equivalence. To be compliant:

When processing provenance, an application may apply the inferences and definitions in section 4. Definitions and Inferences.
If determining whether a PROV instance or document is valid, an application must check that all of the constraints of section 5. Constraints are satisfied on the normal form of the instance or document.
If producing provenance meant for other applications to use, the application should produce valid provenance, as specified in section 6. Normalization, Validity, and Equivalence.
If determining whether two PROV instances or documents are equivalent, an application must determine whether their normal forms are equal, as specified in section 6. Normalization, Validity, and Equivalence.

Compliant applications are not required to explicitly compute normal forms; however, if checking validity or equivalence, the results should be the same as would be obtained by computing normal forms as defined in this specification.

All figures are for illustration purposes only. Information in tables is normative if it appears in a normative section; specifically, Table 3 is normative. Text in appendices and in boxes labeled "Remark" is informative. Where there is any apparent ambiguity between the descriptive text and the formal text in a "definition", "inference" or "constraint" box, the formal text takes priority.

4. Definitions and Inferences

This section describes definitions and inferences that may be used on provenance data, and preserve equivalence on valid PROV instances (as detailed in section 6. Normalization, Validity, and Equivalence). A definition is a rule that can be applied to PROV instances to replace defined expressions with definitions. An inference is a rule that can be applied to PROV instances to add new PROV statements. A definition states that a provenance statement is equivalent to some other statements, whereas an inference only states one direction of an implication; thus, defined provenance statements can be replaced by their definitions.

Definitions have the following general form:

Definition-example NNN (definition-example)

defined_stmt IF AND ONLY IF there exists a₁,..., a_m such that defining_stmt₁ and ... and defining_stmt_n.

A definition can be applied to a PROV instance, since its defined_stmt is defined in terms of other statements. Applying a definition to an instance means that if an occurrence of a defined provenance statement defined_stmt can be found in a PROV instance, then we can remove it and add all of the statements defining_stmt₁ ... defining_stmt_n to the instance, possibly after generating fresh identifiers a₁,...,a_m for existential variables. In other words, it is safe to replace a defined statement with its definition.

Inferences have the following general form:

Inference-example NNN (inference-example)

IF hyp₁ and ... and hyp_k THEN there exists a₁ and ... and a_m such that concl₁ and ... and concl_n.

Inferences can be applied to PROV instances. Applying an inference to an instance means that if all of the provenance statements matching hyp₁... hyp_k can be found in the instance, then we check whether the conclusion concl₁ ... concl_n is satisfied for some values of existential variables. If so, application of the inference has no effect on the instance. If not, then a copy of the conclusion should be added to the instance, after generating fresh identifiers a₁,...,a_m for the existential variables. These fresh identifiers might later be found to be equal to known identifiers; they play a similar role in PROV constraints to existential variables in logic, to "labeled nulls" in database theory [DBCONSTRAINTS], or to blank nodes in [RDF]. In general, omitted optional parameters to [PROV-N] statements, or explicit - markers, are placeholders for existentially quantified variables; that is, they denote unknown values. There are a few exceptions to this general rule, which are specified in Definition 4 (optional-placeholders).

Definitions and inferences can be viewed as logical formulas; similar formalisms are often used in rule-based reasoning [CHR] and in databases [DBCONSTRAINTS]. In particular, the identifiers a₁ ... a_m should be viewed as existentially quantified variables, meaning that through subsequent reasoning steps they may turn out to be equal to other identifiers that are already known, or to other existentially quantified variables. Their treatment is analogous to that of blank nodes in RDF. In contrast, distinct URIs or literal values in PROV are assumed to be distinct for the purpose of checking validity or inferences. This issue is discussed in more detail under Uniqueness Constraints.

In a definition or inference, term symbols such as id, start, end, e, a, attrs, are assumed to be variables unless otherwise specified. These variables are scoped at the level of the definition, inference, or constraint, so the rule is equivalent to any one-for-one renaming of the variable names. When several rules are collected within a definition or inference as an ordered list, the scope of the variables in each rule is at the level of list elements, and so reuse of variable names in different rules does not affect the meaning.

4.1 Optional Identifiers and Attributes

Many PROV relation statements have an identifier, identifying a link between two or more related objects. Identifiers can sometimes be omitted in [PROV-N] notation. For the purpose of inference and validity checking, we generate special identifiers called existential variables denoting the unknown values.

Existential variables can be substituted with other terms. Specifically, a substitution is a function from a set of existential variables to identifiers, literals, the placeholder -, or other existential variables. A substitution S can be applied to an instance I by replacing all occurrences of existential variables x in the instance with S(x).
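As a rough illustration, the sketch below applies a substitution to an instance. It assumes, purely for the sake of the example, that statements are Python tuples of terms and that existential variables are strings beginning with an underscore. Neither representation is prescribed by this specification, and apply_substitution is a hypothetical name.

```python
def apply_substitution(subst, instance):
    """Return S(I): each term that is a key of subst is replaced by its image."""
    def sub(term):
        # Attribute lists are modelled as tuples here and are left untouched.
        return subst.get(term, term) if isinstance(term, str) else term
    return {tuple(sub(t) for t in stmt) for stmt in instance}


# Hypothetical instance containing one usage with two existential variables.
instance = {("used", "_u", "a", "e", "_t", ())}
print(apply_substitution({"_u": "u1", "_t": "-"}, instance))
```

In this sketch a variable may be mapped to an identifier, a literal, the placeholder -, or another variable, mirroring the range of a substitution described above.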

Definition 1 (optional-identifiers), Definition 2 (optional-attributes), and Definition 3 (definition-short-forms), explain how to expand the compact forms of PROV-N notation into a normal form. Definition 4 (optional-placeholders) indicates when other optional parameters can be replaced by existential variables.

Definition 1 (optional-identifiers)

For each r in { used, wasGeneratedBy, wasInvalidatedBy, wasInfluencedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf}, the following definitional rules hold:

r(a₁,...,a_n) IF AND ONLY IF there exists id such that r(id; a₁,...,a_n).
r(-; a₁,...,a_n) IF AND ONLY IF there exists id such that r(id; a₁,...,a_n).

Likewise, many PROV-N statements allow for an optional attribute list. If it is omitted, this is the same as specifying an empty attribute list:

Definition 2 (optional-attributes)

For each p in {entity, activity, agent}, if a_n is not an attribute list parameter then the following definitional rule holds:
p(a₁,...,a_n) IF AND ONLY IF p(a₁,...,a_n,[]).
For each r in { used, wasGeneratedBy, wasInvalidatedBy, wasInfluencedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf}, if a_n is not an attribute list parameter then the following definition holds:
r(id; a₁,...,a_n) IF AND ONLY IF r(id; a₁,...,a_n,[]).

Definition 1 (optional-identifiers) and Definition 2 (optional-attributes) do not apply to alternateOf, specializationOf, and mentionOf, which do not have identifiers and attributes.

Finally, many PROV statements have other optional arguments or short forms that can be used if none of the optional arguments is present. These are handled by specific rules listed below.

Definition 3 (definition-short-forms)

activity(id,attrs) IF AND ONLY IF activity(id,-,-,attrs).
wasGeneratedBy(id; e,attrs) IF AND ONLY IF wasGeneratedBy(id; e,-,-,attrs).
used(id; a,attrs) IF AND ONLY IF used(id; a,-,-,attrs).
wasStartedBy(id; a,attrs) IF AND ONLY IF wasStartedBy(id; a,-,-,-,attrs).
wasEndedBy(id; a,attrs) IF AND ONLY IF wasEndedBy(id; a,-,-,-,attrs).
wasInvalidatedBy(id; e,attrs) IF AND ONLY IF wasInvalidatedBy(id; e,-,-,attrs).
wasDerivedFrom(id; e2,e1,attrs) IF AND ONLY IF wasDerivedFrom(id; e2,e1,-,-,-,attrs).
wasAssociatedWith(id; e,attrs) IF AND ONLY IF wasAssociatedWith(id; e,-,-,attrs).
actedOnBehalfOf(id; a2,a1,attrs) IF AND ONLY IF actedOnBehalfOf(id; a2,a1,-,attrs).

There are no expansion rules for entity, agent, communication, attribution, influence, alternate, specialization, or mention relations, because these have no optional parameters aside from the identifier and attributes, which are expanded by the rules in Definition 1 (optional-identifiers) and Definition 2 (optional-attributes).
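The expansion rules of Definitions 1 to 3 can be sketched for relation statements as follows. This is a minimal sketch under stated assumptions: the function name expand, the tuple representation, and the _idN naming of fresh identifiers are all hypothetical, and the activity short form is left out because activity takes its identifier as an ordinary argument.

```python
import itertools

_counter = itertools.count(1)

# (arguments in the short form, arguments in the full form), read off
# Definition 3. The identifier and the attribute list are not counted.
SHORT_FORMS = {
    "wasGeneratedBy": (1, 3),
    "used": (1, 3),
    "wasStartedBy": (1, 4),
    "wasEndedBy": (1, 4),
    "wasInvalidatedBy": (1, 3),
    "wasDerivedFrom": (2, 5),
    "wasAssociatedWith": (1, 3),
    "actedOnBehalfOf": (2, 3),
}


def expand(rel, args, ident=None, attrs=None):
    """Expand a compact relation statement into (rel, id, args, attrs)."""
    # Definition 1: a missing or '-' identifier becomes a fresh existential variable.
    if ident is None or ident == "-":
        ident = f"_id{next(_counter)}"
    # Definition 2: a missing attribute list is the empty attribute list.
    if attrs is None:
        attrs = []
    # Definition 3: pad the short form with placeholders.
    args = list(args)
    short_len, full_len = SHORT_FORMS.get(rel, (None, None))
    if len(args) == short_len:
        args += ["-"] * (full_len - short_len)
    return (rel, ident, tuple(args), tuple(attrs))


print(expand("used", ["a1"], ident="u1"))
print(expand("wasDerivedFrom", ["e2", "e1"]))
```

Relations absent from SHORT_FORMS, such as wasInformedBy, pass through with only their identifier and attribute list filled in.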

Finally, most optional parameters (written -) are, for the purpose of this document, considered to be distinct, fresh existential variables. Optional parameters are defined in [PROV-DM] and in [PROV-N] for each type of PROV statement. Thus, before proceeding to apply other definitions or inferences, most occurrences of - are to be replaced by fresh existential variables, distinct from any others occurring in the instance. The only exceptions to this general rule, where - are to be left in place, are the activity, generation, and usage parameters in wasDerivedFrom and the plan parameter in wasAssociatedWith. This is further explained in remarks below.

The treatment of optional parameters is specified formally using the auxiliary concept of expandable parameter. An expandable parameter is one that can be omitted using the placeholder -, and if so, it is to be replaced by a fresh existential identifier. Table 3 defines the expandable parameters of the properties of PROV, needed in Definition 4 (optional-placeholders). For emphasis, the four optional parameters that are not expandable are also listed. Parameters that cannot have value -, and identifiers that are expanded by Definition 1 (optional-identifiers), are not listed.

Table 3: Expandable and Non-Expandable Parameters
Relation	Expandable	Non-expandable

used(id; a,e,t,attrs)	e,t
wasGeneratedBy(id; e,a,t,attrs)	a,t
wasStartedBy(id; a2,e,a1,t,attrs)	e,a1,t
wasEndedBy(id; a2,e,a1,t,attrs)	e,a1,t
wasInvalidatedBy(id; e,a,t,attrs)	a,t
wasDerivedFrom(id; e2,e1,-,g,u,attrs)		g,u
wasDerivedFrom(id; e2,e1,a,g,u,attrs) (where a is not placeholder -)	g,u	a
wasAssociatedWith(id; a,ag,pl,attrs)	ag	pl
actedOnBehalfOf(id; ag2,ag1,a,attrs)	a

Definition 4 (optional-placeholders) states how parameters are to be expanded, using the expandable parameters defined in Table 3. The last two parts, 4 and 5, indicate how to handle expansion of the generation and use parameters of wasDerivedFrom, which is only allowed when the activity is specified. Essentially, the definitions state that parameters g,u are expandable only if the activity is specified, i.e., if parameter a is provided. The rationale for this is that when a is provided, then there have to be two events, namely u and g, which account for the usage of e1 and the generation of e2, respectively, by a. Conversely, if a is not provided, then one cannot tell whether one or more activities are involved in the derivation, and the explicit introduction of such events, which correspond to a single activity, would therefore not be justified.

A later constraint, Constraint 53 (impossible-unspecified-derivation-generation-use), forbids specifying generation and use parameters when the activity is unspecified.

Definition 4 (optional-placeholders)

activity(id,-,t2,attrs) IF AND ONLY IF there exists t1 such that activity(id,t1,t2,attrs). Here, t2 may be a placeholder.
activity(id,t1,-,attrs) IF AND ONLY IF there exists t2 such that activity(id,t1,t2,attrs). Here, t1 must not be a placeholder.
For each r in { used, wasGeneratedBy, wasStartedBy, wasEndedBy, wasInvalidatedBy, wasAssociatedWith, actedOnBehalfOf }, if the ith parameter of r is an expandable parameter of r as specified in Table 3 then the following definition holds:
r(a₀;...,a_i-1, -, a_i+1, ...,a_n) IF AND ONLY IF there exists a' such that r(a₀;...,a_i-1,a',a_i+1,...,a_n).
If a is not the placeholder -, and u is any term, then the following definition holds:
wasDerivedFrom(id;e2,e1,a,-,u,attrs) IF AND ONLY IF there exists g such that wasDerivedFrom(id; e2,e1,a,g,u,attrs).
If a is not the placeholder -, and g is any term, then the following definition holds:
wasDerivedFrom(id;e2,e1,a,g,-,attrs) IF AND ONLY IF there exists u such that wasDerivedFrom(id; e2,e1,a,g,u,attrs).
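Parts 3 to 5 of Definition 4 can be illustrated with the sketch below. The positions in EXPANDABLE are transcribed from Table 3 as zero-based indexes into the argument list after the identifier; the function name, the tuple representation, and the _vN naming of fresh variables are assumptions made only for this example.

```python
import itertools

_fresh = itertools.count(1)

# Zero-based positions of expandable parameters, excluding id and attrs.
EXPANDABLE = {
    "used": (1, 2),               # e, t
    "wasGeneratedBy": (1, 2),     # a, t
    "wasStartedBy": (1, 2, 3),    # e, a1, t
    "wasEndedBy": (1, 2, 3),      # e, a1, t
    "wasInvalidatedBy": (1, 2),   # a, t
    "wasAssociatedWith": (1,),    # ag only; the plan pl is not expandable
    "actedOnBehalfOf": (2,),      # a
}


def expand_placeholders(rel, args):
    """Replace expandable '-' parameters with fresh existential variables."""
    args = list(args)
    if rel == "wasDerivedFrom":
        # g and u (positions 3 and 4) are expandable only when the activity
        # (position 2) is given; the activity itself is never expanded.
        positions = (3, 4) if args[2] != "-" else ()
    else:
        positions = EXPANDABLE.get(rel, ())
    for i in positions:
        if args[i] == "-":
            args[i] = f"_v{next(_fresh)}"
    return tuple(args)


print(expand_placeholders("wasDerivedFrom", ("e2", "e1", "-", "-", "-")))
print(expand_placeholders("wasDerivedFrom", ("e2", "e1", "a", "-", "-")))
print(expand_placeholders("wasAssociatedWith", ("a", "-", "-")))
```

The first call leaves all three placeholders in place, while the second replaces only the generation and usage placeholders.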

In an association of the form wasAssociatedWith(id; a,ag,-,attr), the absence of a plan means: either no plan exists, or a plan exists but it is not identified. Thus, it is not equivalent to wasAssociatedWith(id; a,ag,p,attr) where a plan p is given.

A derivation wasDerivedFrom(id; e2,e1,a,gen,use,attrs) that specifies an activity explicitly indicates that this activity achieved the derivation, with a usage use of entity e1, and a generation gen of entity e2. It differs from a derivation of the form wasDerivedFrom(id; e2,e1,-,-,-,attrs) with missing activity, generation, and usage. In the latter form, it is not specified if one or more activities are involved in the derivation.

Let us consider a system in which a derivation is underpinned by two activities. Conceptually, one could also model such a system with a new activity that encompasses the two original activities and underpins the derivation. The inferences defined in this specification do not allow the latter modelling to be inferred from the former. Hence, the two modellings of the same system are regarded as different in the context of this specification.

4.2 Entities and Activities

Communication between activities implies the existence of an underlying entity generated by one activity and used by the other, and vice versa.

Inference 5 (communication-generation-use-inference)

IF wasInformedBy(_id; a2,a1,_attrs) THEN there exist e, _gen, _t1, _use, and _t2, such that wasGeneratedBy(_gen; e,a1,_t1,[]) and used(_use; a2,e,_t2,[]) hold.

Inference 6 (generation-use-communication-inference)

IF wasGeneratedBy(_gen; e,a1,_t1,_attrs1) and used(_id2; a2,e,_t2,_attrs2) hold THEN there exists _id such that wasInformedBy(_id; a2,a1,[])
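A minimal sketch of Inference 6 follows. It assumes statements are tuples of the form (relation, id, arguments..., attrs) and treats any existing wasInformedBy(a2,a1) statement as already satisfying the conclusion, whatever its attributes; both choices are simplifying assumptions, and infer_communication is a hypothetical name.

```python
import itertools

_counter = itertools.count(1)


def infer_communication(instance):
    """Add wasInformedBy(a2,a1) wherever a1 generated an entity that a2 used."""
    result = set(instance)
    generations = [s for s in instance if s[0] == "wasGeneratedBy"]
    usages = [s for s in instance if s[0] == "used"]
    for _, _gen, e, a1, _t1, _attrs1 in generations:
        for _, _use, a2, used_e, _t2, _attrs2 in usages:
            if used_e != e:
                continue
            # Skip the addition if the conclusion is already present.
            known = any(s[0] == "wasInformedBy" and s[2:4] == (a2, a1) for s in result)
            if not known:
                result.add(("wasInformedBy", f"_id{next(_counter)}", a2, a1, ()))
    return result


instance = {
    ("wasGeneratedBy", "g", "e", "a1", "_t1", ()),
    ("used", "u", "a2", "e", "_t2", ()),
}
print(infer_communication(instance))
```

Applying the function a second time adds nothing, which matches the rule that an inference has no effect when its conclusion is already satisfied.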

The relationship wasInformedBy is not transitive. Indeed, consider the following statements.

wasInformedBy(a2,a1)
wasInformedBy(a3,a2)

We cannot infer wasInformedBy(a3,a1) from these statements alone. Indeed, from wasInformedBy(a2,a1), we know that there exists e1 such that e1 was generated by a1 and used by a2. Likewise, from wasInformedBy(a3,a2), we know that there exists e2 such that e2 was generated by a2 and used by a3. Figure 2 shows a counterexample to transitivity, in which the horizontal axis represents the event line. We see that e1 was generated after e2 was used, and that a3 completed before a1 started. So in this example (with no other information) it is impossible for a3 to have used an entity generated by a1.

Figure 2: Counter-example for transitivity of wasInformedBy

From an entity statement, we can infer the existence of generation and invalidation events.

Inference 7 (entity-generation-invalidation-inference)

IF entity(e,_attrs) THEN there exist _gen, _a1, _t1, _inv, _a2, and _t2 such that wasGeneratedBy(_gen; e,_a1,_t1,[]) and wasInvalidatedBy(_inv; e,_a2,_t2,[]).
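The way an inference with existential variables is applied can be sketched for Inference 7 as follows. The sketch assumes statements are tuples, with entities written as (relation, id, attrs) and generations and invalidations as (relation, id, entity, activity, time, attrs). It also assumes that any existing generation or invalidation of e satisfies the corresponding part of the conclusion. The names fresh and apply_inference_7 are hypothetical.

```python
import itertools

_counter = itertools.count(1)


def fresh(prefix):
    """Return a fresh existential variable such as '_gen1' (hypothetical naming)."""
    return f"{prefix}{next(_counter)}"


def apply_inference_7(instance):
    """Add a generation and an invalidation for each entity that lacks one."""
    result = set(instance)
    for stmt in instance:
        if stmt[0] != "entity":
            continue
        e = stmt[1]
        for rel, id_prefix in (("wasGeneratedBy", "_gen"), ("wasInvalidatedBy", "_inv")):
            # Only add the statement if no matching one is already present.
            if not any(s[0] == rel and s[2] == e for s in result):
                result.add((rel, fresh(id_prefix), e, fresh("_a"), fresh("_t"), ()))
    return result


once = apply_inference_7({("entity", "e1", ())})
print(sorted(s[0] for s in once))
print(apply_inference_7(once) == once)
```

Because the second application finds both conclusions already present, it generates no further variables.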

From an activity statement, we can infer start and end events whose times match the start and end times of the activity, respectively.

Inference 8 (activity-start-end-inference)

IF activity(a,t1,t2,_attrs) THEN there exist _start, _e1, _a1, _end, _a2, and _e2 such that wasStartedBy(_start; a,_e1,_a1,t1,[]) and wasEndedBy(_end; a,_e2,_a2,t2,[]).

The start of an activity a triggered by entity e1 implies that e1 was generated by the starting activity a1.

Inference 9 (wasStartedBy-inference)

IF wasStartedBy(_id; a,e1,a1,_t,_attrs), THEN there exist _gen and _t1 such that wasGeneratedBy(_gen; e1,a1,_t1,[]).

Likewise, the ending of activity a by triggering entity e1 implies that e1 was generated by the ending activity a1.

Inference 10 (wasEndedBy-inference)

IF wasEndedBy(_id; a,e1,a1,_t,_attrs), THEN there exist _gen and _t1 such that wasGeneratedBy(_gen; e1,a1,_t1,[]).

4.3 Derivations

Derivations with explicit activity, generation, and usage admit the following inference:

Inference 11 (derivation-generation-use-inference)

In this inference, none of a, gen2 or use1 can be placeholders -.

IF wasDerivedFrom(_id; e2,e1,a,gen2,use1,_attrs), THEN there exist _t1 and _t2 such that used(use1; a,e1,_t1,[]) and wasGeneratedBy(gen2; e2,a,_t2,[]).

A revision admits the following inference, stating that the two entities linked by a revision are also alternates.

Inference 12 (revision-is-alternate-inference)

In this inference, any of _a, _g or _u may be placeholders.

IF wasDerivedFrom(_id; e2,e1,_a,_g,_u,[prov:type='prov:Revision']), THEN alternateOf(e2,e1).

There is no inference stating that wasDerivedFrom is transitive.

4.4 Agents

Attribution is the ascribing of an entity to an agent. An entity can only be ascribed to an agent if the agent was associated with an activity that generated the entity. If the activity, generation and association events are not explicit in the instance, they can be inferred.

Inference 13 (attribution-inference)

IF wasAttributedTo(_att; e,ag,_attrs) THEN there exist a, _t, _gen, _assoc, _pl, such that wasGeneratedBy(_gen; e,a,_t,[]) and wasAssociatedWith(_assoc; a,ag,_pl,[]).

In the above inference, _pl is an existential variable, so it can be merged with a constant identifier, another existential variable, or a placeholder -, as explained in the definition of merging.

Delegation relates agents where one agent acts on behalf of another, in the context of some activity. The supervising agent delegates some responsibility for part of the activity to the subordinate agent, while retaining some responsibility for the overall activity. Both agents are associated with this activity.

Inference 14 (delegation-inference)

IF actedOnBehalfOf(_id; ag1, ag2, a, _attrs) THEN there exist _id1, _pl1, _id2, and _pl2 such that wasAssociatedWith(_id1; a, ag1, _pl1, []) and wasAssociatedWith(_id2; a, ag2, _pl2, []).

The two associations between the agents and the activity may have different identifiers, different plans, and different attributes. In particular, the plans of the two agents need not be the same, and one, both, or neither can be the placeholder - indicating that there is no plan, because the existential variables _pl1 and _pl2 can be replaced with constant identifiers, existential variables, or placeholders - independently, as explained in the definition of merging.

The wasInfluencedBy relation is implied by other relations, including usage, start, end, generation, invalidation, communication, derivation, attribution, association, and delegation. To capture this explicitly, we allow the following inferences:

Inference 15 (influence-inference)

IF wasGeneratedBy(id; e,a,_t,attrs) THEN wasInfluencedBy(id; e, a, attrs).
IF used(id; a,e,_t,attrs) THEN wasInfluencedBy(id; a, e, attrs).
IF wasInformedBy(id; a2,a1,attrs) THEN wasInfluencedBy(id; a2, a1, attrs).
IF wasStartedBy(id; a2,e,_a1,_t,attrs) THEN wasInfluencedBy(id; a2, e, attrs).
IF wasEndedBy(id; a2,e,_a1,_t,attrs) THEN wasInfluencedBy(id; a2, e, attrs).
IF wasInvalidatedBy(id; e,a,_t,attrs) THEN wasInfluencedBy(id; e, a, attrs).
IF wasDerivedFrom(id; e2, e1, a, g, u, attrs) THEN wasInfluencedBy(id; e2, e1, attrs). Here, a, g, u may be placeholders -.
IF wasAttributedTo(id; e,ag,attrs) THEN wasInfluencedBy(id; e, ag, attrs).
IF wasAssociatedWith(id; a,ag,_pl,attrs) THEN wasInfluencedBy(id; a, ag, attrs). Here, _pl may be a placeholder -.
IF actedOnBehalfOf(id; ag2,ag1,_a,attrs) THEN wasInfluencedBy(id; ag2, ag1, attrs).

The inferences above permit the use of the same identifier for an influence relationship and a more specific relationship.
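Since every rule of Inference 15 keeps the identifier, the first two arguments, and the attributes of the more specific statement, the rules can be sketched with a single loop. The sketch assumes that every statement in the instance is a 4-tuple (relation, id, arguments, attrs); this representation and the name infer_influence are illustrative only.

```python
INFLUENCE_SOURCES = {
    "wasGeneratedBy", "used", "wasInformedBy", "wasStartedBy", "wasEndedBy",
    "wasInvalidatedBy", "wasDerivedFrom", "wasAttributedTo",
    "wasAssociatedWith", "actedOnBehalfOf",
}


def infer_influence(instance):
    """Add wasInfluencedBy(id; o2,o1,attrs) for each more specific relation."""
    result = set(instance)
    for rel, ident, args, attrs in instance:
        if rel in INFLUENCE_SOURCES:
            # The influencee and influencer are the first two arguments.
            result.add(("wasInfluencedBy", ident, args[:2], attrs))
    return result


start = ("wasStartedBy", "s1", ("a2", "e", "a1", "_t"), (("ex:k", "v"),))
print(infer_influence({start}))
```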

4.5 Alternate and Specialized Entities

The relation alternateOf is an equivalence relation: that is, it is reflexive, transitive and symmetric. As a consequence, the following inferences can be applied:

Inference 16 (alternate-reflexive)

IF entity(e) THEN alternateOf(e,e).

Inference 17 (alternate-transitive)

IF alternateOf(e1,e2) and alternateOf(e2,e3) THEN alternateOf(e1,e3).

Inference 18 (alternate-symmetric)

IF alternateOf(e1,e2) THEN alternateOf(e2,e1).
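Taken together, Inferences 16 to 18 close alternateOf under reflexivity, transitivity, and symmetry. A minimal fixpoint sketch, assuming entities are given as a set of identifiers and alternateOf statements as a set of pairs (the function name is hypothetical):

```python
def alternate_closure(entities, pairs):
    """Return the reflexive, symmetric, transitive closure of alternateOf."""
    closure = {(e, e) for e in entities} | set(pairs)   # Inference 16
    changed = True
    while changed:
        new = {(b, a) for (a, b) in closure}            # Inference 18
        new |= {(a, d) for (a, b) in closure
                for (c, d) in closure if b == c}        # Inference 17
        changed = not new <= closure
        closure |= new
    return closure


closure = alternate_closure({"e1", "e2", "e3"}, {("e1", "e2"), ("e2", "e3")})
print(("e3", "e1") in closure)
```

The loop stops once a pass adds no new pair, so it terminates on any finite instance.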

Similarly, specialization is a strict partial order: it is irreflexive and transitive. Irreflexivity is handled later as Constraint 54 (impossible-specialization-reflexive).

Inference 19 (specialization-transitive)

IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).

If one entity specializes another, then they are also alternates:

Inference 20 (specialization-alternate-inference)

IF specializationOf(e1,e2) THEN alternateOf(e1,e2).

If one entity specializes another then all attributes of the more general entity are also attributes of the more specific one.

Inference 21 (specialization-attributes-inference)

IF entity(e1, attrs) and specializationOf(e2,e1), THEN entity(e2, attrs).

Note: The following inference is associated with a feature "at risk" and may be removed from this specification based on feedback. Please send feedback to public-prov-comments@w3.org.

If one entity is a mention of another in a bundle, then the former is also a specialization of the latter:

Inference 22 (mention-specialization-inference)

IF mentionOf(e2,e1,b) THEN specializationOf(e2,e1).

5. Constraints

This section defines a collection of constraints on PROV instances. There are three kinds of constraints:

uniqueness constraints that say that a PROV instance can contain at most one statement of each kind with a given identifier. For example, if we describe the same generation event twice, then the two statements should have the same times;
event ordering constraints that say that it should be possible to arrange the events (generation, usage, invalidation, start, end) described in a PROV instance into a preorder that corresponds to a sensible "history" (for example, an entity should not be generated after it is used); and
impossibility constraints, which forbid certain patterns of statements in valid PROV instances.

As in a definition or inference, term symbols such as id, start, end, e, a, attrs in a constraint are assumed to be variables unless otherwise specified. These variables are scoped at the constraint level, so the rule is equivalent to any one-for-one renaming of the variable names. When several rules are collected within a constraint as an ordered list, the scope of the variables in each rule is at the level of list elements, and so reuse of variable names in different rules does not affect the meaning.

5.1 Uniqueness Constraints

In the absence of existential variables, uniqueness constraints could be checked directly by checking that no identifier appears more than once for a given statement. However, in the presence of existential variables, we need to be more careful to combine partial information that might be present in multiple compatible statements, due to inferences. Uniqueness constraints are enforced through merging pairs of statements subject to equalities. For example, suppose we have two activity statements activity(a,2011-11-16T16:00:00,_t1,[a=1]) and activity(a,_t2,2011-11-16T18:00:00,[b=2]), with existential variables _t1 and _t2. The merge of these two statements (describing the same activity a) is activity(a,2011-11-16T16:00:00,2011-11-16T18:00:00,[a=1,b=2]).

Merging is an operation that can be applied to a pair of terms, or a pair of attribute lists. The result of merging is either a substitution (mapping existentially quantified variables to terms) or failure, indicating that the merge cannot be performed. Merging of pairs of terms, attribute lists, or statements is defined as follows.

If t and t' are constant identifiers or values (including the placeholder -), then their merge exists only if they are equal, otherwise merging fails.
If x is an existential variable and t' is any term (identifier, constant, placeholder -, or existential variable), then their merge is t', and the resulting substitution is [x=t']. In the special case where t'=x, the merge is x and the resulting substitution is empty.
If t is any term (identifier, constant, placeholder -, or existential variable) and x' is an existential variable, then their merge is the same as the merge of x' and t.
The merge of two attribute lists attrs1 and attrs2 is their union, considered as sets of key-value pairs, written attrs1 ∪ attrs2. Duplicate keys with different values are allowed, but equal key-value pairs are merged.

Merging for terms is analogous to unification in logic programming and theorem proving, restricted to flat terms with no function symbols. No occurs check is needed because there are no function symbols.
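The merging rules above can be sketched as follows. The sketch assumes that existential variables are strings beginning with an underscore and that every other value (including the placeholder -) is a constant; it signals failure by returning None. These conventions and the function names are illustrative, not part of this specification.

```python
def is_existential(term):
    """Hypothetical convention: existential variables start with '_'."""
    return isinstance(term, str) and term.startswith("_")


def merge_terms(t, u):
    """Return (merged term, substitution), or None if the merge fails."""
    if is_existential(t):
        # Merging a variable with itself yields the empty substitution.
        return (u, {} if t == u else {t: u})
    if is_existential(u):
        return merge_terms(u, t)
    # Two constants merge only if they are equal.
    return (t, {}) if t == u else None


def merge_attrs(attrs1, attrs2):
    """Union of key-value pairs; equal pairs collapse, differing values coexist."""
    return set(attrs1) | set(attrs2)


print(merge_terms("_t1", "2011-11-16T18:00:00"))
print(merge_terms("a", "b"))
print(merge_attrs([("a", 1)], [("a", 1), ("a", 2)]))
```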

A typical uniqueness constraint is as follows:

Constraint-example NNN (uniqueness-example)

IF hyp₁ and ... and hyp_n THEN t₁ = u₁ and ... and t_n = u_n.

Such a constraint is enforced as follows:

Suppose PROV instance I contains all of the hypotheses hyp₁ and ... and hyp_n.
Attempt to merge all of the equated terms in the conclusion t₁ = u₁ and ... and t_n = u_n.
If merging fails, then the constraint is unsatisfiable, so application of the constraint to I fails. If this failure occurs during normalization prior to validation, then I is invalid, as explained in Section 6.
If merging succeeds with a substitution S, then S is applied to the instance I, yielding result S(I).

Key constraints are uniqueness constraints that specify that a particular key field of a relation uniquely determines the other parameters. Key constraints are written as follows:

Constraint-example NNN (key-example)

The a_k field is a KEY for relation r(a₀; a₁,...,a_n).

Because of the presence of attributes, key constraints do not reduce directly to uniqueness constraints. Instead, we enforce key constraints as follows.

Suppose r(a₀; a₁,...a_n,attrs1) and r(b₀; b₁,...b_n,attrs2) hold in PROV instance I, where the key fields a_k = b_k are equal.
Attempt to merge all of the corresponding parameters a₀ = b₀ and ... and a_n = b_n.
If merging fails, then the constraint is unsatisfiable, so application of the key constraint to I fails.
If merging succeeds with substitution S, then we remove r(a₀; a₁,...a_n,attrs1) and r(b₀; b₁,...b_n,attrs2) from I, obtaining instance I', and return instance {r(S(a₀); S(a₁),...S(a_n),attrs1 ∪ attrs2)} ∪ S(I').

Thus, if a PROV instance contains an apparent violation of a uniqueness constraint or key constraint, merging can be used to determine whether the constraint can be satisfied by instantiating some existential variables with other terms. For key constraints, this is the same as merging pairs of statements whose keys are equal and whose corresponding arguments are compatible, because after merging respective arguments and attribute lists, the two statements become equal and one can be omitted.
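The statement-level merge used for key constraints can be sketched as below, using the two activity statements from the start of this section. The sketch assumes statements are triples (relation, parameters, attrs) whose first parameter is the key, and that existential variables are strings beginning with an underscore. It returns the merged statement and the substitution, or None on failure. Applying the substitution to the rest of the instance is omitted, and all names are hypothetical.

```python
def is_existential(term):
    return isinstance(term, str) and term.startswith("_")


def merge_statements(s1, s2):
    """Merge two statements with equal keys; return (statement, subst) or None."""
    rel1, params1, attrs1 = s1
    rel2, params2, attrs2 = s2
    if rel1 != rel2 or len(params1) != len(params2):
        return None
    subst = {}

    def resolve(term):
        # Follow bindings made by earlier parameter merges.
        while term in subst:
            term = subst[term]
        return term

    for p, q in zip(params1, params2):
        p, q = resolve(p), resolve(q)
        if p == q:
            continue
        if is_existential(p):
            subst[p] = q
        elif is_existential(q):
            subst[q] = p
        else:
            return None  # two distinct constants: the constraint is unsatisfiable
    merged = tuple(resolve(p) for p in params1)
    return (rel1, merged, frozenset(attrs1) | frozenset(attrs2)), subst


s1 = ("activity", ("a", "2011-11-16T16:00:00", "_t1"), {("a", 1)})
s2 = ("activity", ("a", "_t2", "2011-11-16T18:00:00"), {("b", 2)})
print(merge_statements(s1, s2))
```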

The various identified objects of PROV must have unique statements describing them within a valid PROV instance. This is enforced through the following key constraints:

Constraint 23 (key-object)

The identifier field e is a KEY for the entity(e,attrs) statement.
The identifier field a is a KEY for the activity(a,t1,t2,attrs) statement.
The identifier field ag is a KEY for the agent(ag,attrs) statement.

Likewise, the statements in a valid PROV instance must provide consistent information about each identified object or relationship. The following key constraints require that all of the information about each identified statement can be merged into a single, consistent statement:

Constraint 24 (key-properties)

The identifier field id is a KEY for the wasGeneratedBy(id; e,a,t,attrs) statement.
The identifier field id is a KEY for the used(id; a,e,t,attrs) statement.
The identifier field id is a KEY for the wasInformedBy(id; a2,a1,attrs) statement.
The identifier field id is a KEY for the wasStartedBy(id; a2,e,a1,t,attrs) statement.
The identifier field id is a KEY for the wasEndedBy(id; a2,e,a1,t,attrs) statement.
The identifier field id is a KEY for the wasInvalidatedBy(id; e,a,t,attrs) statement.
The identifier field id is a KEY for the wasDerivedFrom(id; e2, e1, a, g2, u1, attrs) statement.
The identifier field id is a KEY for the wasAttributedTo(id; e,ag,attrs) statement.
The identifier field id is a KEY for the wasAssociatedWith(id; a,ag,pl,attrs) statement.
The identifier field id is a KEY for the wasAssociatedWith(id; a,ag,-,attrs) statement.
The identifier field id is a KEY for the actedOnBehalfOf(id; ag2,ag1,a,attrs) statement.
The identifier field id is a KEY for the wasInfluencedBy(id; o2,o1,attrs) statement.

Entities may have multiple generation or invalidation events (either or both may, however, be left implicit). An entity can be generated by more than one activity, with one generation event for each entity-activity pair. These events must be simultaneous, as required by Constraint 41 (generation-generation-ordering) and Constraint 42 (invalidation-invalidation-ordering).

Constraint 25 (unique-generation)

IF wasGeneratedBy(gen1; e,a,_t1,_attrs1) and wasGeneratedBy(gen2; e,a,_t2,_attrs2), THEN gen1 = gen2.

Constraint 26 (unique-invalidation)

IF wasInvalidatedBy(inv1; e,a,_t1,_attrs1) and wasInvalidatedBy(inv2; e,a,_t2,_attrs2), THEN inv1 = inv2.

It follows from the above uniqueness and key constraints that the generation and invalidation events linking an entity and activity are unique, if specified. However, because we apply the constraints by merging, it is possible for a valid PROV instance to contain multiple statements about the same generation or invalidation event, for example:

wasGeneratedBy(id1; e,a,-,[prov:location="Paris"])
wasGeneratedBy(-; e,a,-,[color="Red"])

When the uniqueness and key constraints are applied, the instance is normalized to the following form:

wasGeneratedBy(id1; e,a,_t,[prov:location="Paris",color="Red"])

where _t is a new existential variable.
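The normalization of this example can be reproduced with the sketch below. It assumes that placeholders have already been expanded, so the two statements carry the existential variables _id2, _t1, and _t2, and it represents a generation as a tuple (id, e, a, t, attrs). The names unify and merge_generations, and the underscore convention for variables, are assumptions made for illustration.

```python
def is_existential(term):
    return isinstance(term, str) and term.startswith("_")


def unify(pairs):
    """Merge each pair of terms in turn; return a substitution or None."""
    subst = {}

    def resolve(term):
        while term in subst:
            term = subst[term]
        return term

    for p, q in pairs:
        p, q = resolve(p), resolve(q)
        if p == q:
            continue
        if is_existential(p):
            subst[p] = q
        elif is_existential(q):
            subst[q] = p
        else:
            return None
    return {v: resolve(v) for v in subst}


def merge_generations(g1, g2):
    """Apply unique-generation, then the key constraint, to two generations."""
    (id1, e1, a1, t1, attrs1), (id2, e2, a2, t2, attrs2) = g1, g2
    if (e1, a1) != (e2, a2):
        return [g1, g2]  # Constraint 25 does not apply
    subst = unify([(id1, id2), (t1, t2)])
    if subst is None:
        return None      # the instance cannot be made valid
    merged_attrs = frozenset(attrs1) | frozenset(attrs2)
    return [(subst.get(id1, id1), e1, a1, subst.get(t1, t1), merged_attrs)]


g1 = ("id1", "e", "a", "_t1", {("prov:location", "Paris")})
g2 = ("_id2", "e", "a", "_t2", {("color", "Red")})
print(merge_generations(g1, g2))
```

If both statements carried distinct constant identifiers, the merge would fail, which corresponds to a violation of Constraint 25.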

An activity may have more than one start and end event, each having a different activity (either or both may, however, be left implicit). However, the triggering entity linking any two activities in a start or end event is unique. That is, an activity may be started by several other activities, with shared or separate triggering entities. If an activity is started or ended by multiple events, they must all be simultaneous, as specified in Constraint 33 (start-start-ordering) and Constraint 34 (end-end-ordering).

Constraint 27 (unique-wasStartedBy)

IF wasStartedBy(start1; a,_e1,a0,_t1,_attrs1) and wasStartedBy(start2; a,_e2,a0,_t2,_attrs2), THEN start1 = start2.

Constraint 28 (unique-wasEndedBy)

IF wasEndedBy(end1; a,_e1,a0,_t1,_attrs1) and wasEndedBy(end2; a,_e2,a0,_t2,_attrs2), THEN end1 = end2.

An activity start event is the instantaneous event that marks the instant an activity starts. It allows for an optional time attribute. Activities also allow for an optional start time attribute. If both are specified, they must be the same, as expressed by the following constraint.

Constraint 29 (unique-startTime)

IF activity(a2,t1,_t2,_attrs) and wasStartedBy(_start; a2,_e,_a1,t,_attrs1), THEN t1 = t.

An activity end event is the instantaneous event that marks the instant an activity ends. It allows for an optional time attribute. Activities also allow for an optional end time attribute. If both are specified, they must be the same, as expressed by the following constraint.

Constraint 30 (unique-endTime)

IF activity(a2,_t1,t2,_attrs) and wasEndedBy(_end; a2,_e,_a1,t,_attrs1), THEN t2 = t.
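As a rough illustration of Constraint 29 and Constraint 30, the following Python sketch checks that the time given on an activity statement agrees with the times given on its start (or end) events. The function name agreed_time, the use of None for an unspecified time, and the comparison of times as plain strings are assumptions of this sketch; it also assumes that an activity statement for the activity is present.

```python
def agreed_time(activity_time, event_times):
    """Return the one time all statements agree on, or None if none is given.

    None stands for an unspecified time ("-" in PROV-N). Times are compared
    as plain strings, which is an assumption of this sketch; a real checker
    would compare xsd:dateTime values.
    """
    known = {t for t in (activity_time, *event_times) if t is not None}
    if len(known) > 1:
        raise ValueError(f"conflicting times: {sorted(known)}")
    return next(iter(known), None)
```

The same function can be called once with the activity's start time and the times of its wasStartedBy statements, and once with its end time and the times of its wasEndedBy statements.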

Note: The following constraint is associated with a feature "at risk" and may be removed from this specification based on feedback. Please send feedback to public-prov-comments@w3.org.

An entity can be the subject of at most one mention relation.

Constraint 31 (unique-mention)

IF mentionOf(e, e1, b1) and mentionOf(e, e2, b2), THEN e1=e2 and b1=b2.

5.2 Event Ordering Constraints

Given that provenance consists of a description of past entities and activities, valid provenance instances must satisfy ordering constraints between instantaneous events, which are introduced in this section. For instance, an entity can only be used after it was generated; in other words, an entity's generation event precedes any of this entity's usage events. Should this ordering constraint be violated, the associated generation and usage would not be credible. The rest of this section defines the temporal interpretation of provenance instances as a set of instantaneous event ordering constraints.

To allow for minimalistic clock assumptions, like Lamport [CLOCK], PROV relies on a notion of relative ordering of instantaneous events, without using physical clocks. This specification assumes that a preorder exists between instantaneous events.

Specifically, precedes is a preorder between instantaneous events. A constraint of the form e1 precedes e2 means that e1 happened at the same time as or before e2. For symmetry, follows is defined as the inverse of precedes; that is, a constraint of the form e1 follows e2 means that e1 happened at the same time as or after e2. Both relations are preorders, meaning that they are reflexive and transitive. Moreover, we sometimes consider strict forms of these orders: we say e1 strictly precedes e2 to indicate that e1 happened before e2, but not at the same time. This is a transitive relation.

PROV also allows for time observations to be inserted in specific provenance statements, for each of the five kinds of instantaneous events introduced in this specification. Times in provenance records arising from different sources might be with respect to different timelines (e.g. different time zones) leading to apparent inconsistencies. For the purpose of checking ordering constraints, the times associated with events are irrelevant; thus, there is no inference that time ordering implies event ordering, or vice versa. However, an application may flag time values that appear inconsistent with the event ordering as possible inconsistencies. When generating provenance, an application should use a consistent timeline for related PROV statements within an instance.

A typical ordering constraint is as follows.

Constraint-example NNN (ordering-example)

IF hyp₁ and ... and hyp_n THEN evt1 precedes/strictly precedes evt2.

The conclusion of an ordering constraint is either precedes or strictly precedes. One way to check ordering constraints is to generate all precedes and strictly precedes relationships arising from the ordering constraints to form a directed graph, with edges marked precedes or strictly precedes, and check that there is no cycle containing a strictly precedes edge.
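The graph check described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the edge representation (source, target, strict) and the function name violates_ordering are hypothetical, and the sketch uses a plain depth-first reachability test rather than a more efficient strongly-connected-components algorithm.

```python
from collections import defaultdict


def violates_ordering(edges):
    """Return True if some cycle contains a 'strictly precedes' edge.

    `edges` is an iterable of (source, target, strict) triples, where
    strict=True marks 'strictly precedes' and strict=False marks 'precedes'.
    """
    edges = list(edges)
    successors = defaultdict(set)
    for source, target, _strict in edges:
        successors[source].add(target)

    def reachable(start, goal):
        # Iterative depth-first search over all edges, strict or not.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            for nxt in successors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        return False

    # A strict edge u -> v lies on a cycle exactly when u is reachable from v.
    return any(strict and reachable(target, source)
               for source, target, strict in edges)
```

A cycle made only of precedes edges is acceptable (the events are simultaneous), whereas a cycle through a single strictly precedes edge makes the instance invalid.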

5.2.1 Activity constraints

This section specifies ordering constraints from the perspective of the lifetime of an activity. An activity starts, then during its lifetime can use, generate or invalidate entities, communicate with, start, or end other activities, or be associated with agents, and finally it ends. The following constraints amount to checking that all of the events associated with an activity take place within the activity's lifetime, and the start and end events mark the start and endpoints of its lifetime.

Figure 3 summarizes the ordering constraints on activities in a graphical manner. For this and subsequent figures, an event time line points to the right. Activities are represented by rectangles, whereas entities are represented by circles. Usage, generation and invalidation are represented by the corresponding edges between entities and activities. The five kinds of instantaneous events are represented by vertical dotted lines (adjacent to the vertical sides of an activity's rectangle, or intersecting usage and generation edges). The ordering constraints are represented by triangles: an occurrence of a triangle between two instantaneous event vertical dotted lines represents that the event denoted by the left line precedes the event denoted by the right line.

Miscellaneous suggestions about figures (originally from Tim Lebo):

I think it would help if the "corresponding edges between entities and activities" were the same visual style as the vertical line marking the time the Usage, generation and derivation occurred. A matching visual style provides a Gestalt that matches the concept. I am looking at subfigures b and c in 5.2.

Figure 3: Summary of instantaneous event ordering constraints for activities

The existence of an activity implies that the activity start event always precedes the corresponding activity end event. This is illustrated by Figure 3 (a) and expressed by Constraint 32 (start-precedes-end).

Constraint 32 (start-precedes-end)

IF wasStartedBy(start; a,_e1,_a1,_t1,_attrs1) and wasEndedBy(end; a,_e2,_a2,_t2,_attrs2) THEN start precedes end.

If an activity is started by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two start events that start the same activity, then one precedes the other. Using this constraint in both directions means that each event precedes the other.

Constraint 33 (start-start-ordering)

IF wasStartedBy(start1; a,_e1,_a1,_t1,_attrs1) and wasStartedBy(start2; a,_e2,_a2,_t2,_attrs2) THEN start1 precedes start2.

If an activity is ended by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two end events that end the same activity, then one precedes the other. Using this constraint in both directions means that each event precedes the other, that is, they are simultaneous.

Constraint 34 (end-end-ordering)

IF wasEndedBy(end1; a,_e1,_a1,_t1,_attrs1) and wasEndedBy(end2; a,_e2,_a2,_t2,_attrs2) THEN end1 precedes end2.

A usage implies ordering of events, since the usage event had to occur during the associated activity. This is illustrated by Figure 3 (b) and expressed by Constraint 35 (usage-within-activity).

Constraint 35 (usage-within-activity)

IF wasStartedBy(start; a,_e1,_a1,_t1,_attrs1) and used(use; a,_e2,_t2,_attrs2) THEN start precedes use.
IF used(use; a,_e1,_t1,_attrs1) and wasEndedBy(end; a,_e2,_a2,_t2,_attrs2) THEN use precedes end.
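To illustrate how such a constraint produces ordering edges, here is a small Python sketch for usage-within-activity. The data layout (lists of (event identifier, activity) pairs) and the function name are assumptions made for the example; entities, times and attributes are left out because the constraint does not depend on them.

```python
def usage_within_activity(starts, ends, usages):
    """Return the 'precedes' edges required by usage-within-activity.

    starts, ends and usages are lists of (event_id, activity) pairs taken
    from wasStartedBy, wasEndedBy and used statements respectively.
    Each returned pair (x, y) means "x precedes y".
    """
    edges = []
    for use, activity in usages:
        # Every start of the activity precedes the usage ...
        edges += [(start, use) for start, a in starts if a == activity]
        # ... and the usage precedes every end of the activity.
        edges += [(use, end) for end, a in ends if a == activity]
    return edges
```

The resulting pairs can be added, as non-strict edges, to the graph used for checking the ordering constraints.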

A generation implies ordering of events, since the generation event had to occur during the associated activity. This is illustrated by Figure 3 (c) and expressed by Constraint 36 (generation-within-activity).

Constraint 36 (generation-within-activity)

IF wasStartedBy(start; a,_e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen; _e2,a,_t2,_attrs2) THEN start precedes gen.
IF wasGeneratedBy(gen; _e,a,_t,_attrs) and wasEndedBy(end; a,_e1,_a1,_t1,_attrs1) THEN gen precedes end.

Communication between two activities a1 and a2 also implies ordering of events, since some entity must have been generated by the former and used by the latter, which implies that the start event of a1 cannot follow the end event of a2. This is illustrated by Figure 3 (d) and expressed by Constraint 37 (wasInformedBy-ordering).

Constraint 37 (wasInformedBy-ordering)

IF wasInformedBy(_id; a2,a1,_attrs) and wasStartedBy(start; a1,_e1,_a1',_t1,_attrs1) and wasEndedBy(end; a2,_e2,_a2',_t2,_attrs2) THEN start precedes end.

5.2.2 Entity constraints

The figure(s) in this section should have vertical lines with visual styles that match the diagonal arrow that they go with.

As with activities, entities have lifetimes: they are generated, then can be used, other entities can be derived from them, and finally they can be invalidated. The constraints on these events are illustrated graphically in Figure 4 and Figure 5.

Figure 4: Summary of instantaneous event ordering constraints for entities

Generation of an entity precedes its invalidation. (This follows from other constraints if the entity is used, but it is stated explicitly here to cover the case of an entity that is generated and invalidated without being used.)

Constraint 38 (generation-precedes-invalidation)

IF wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN gen precedes inv.

A usage and a generation for a given entity implies ordering of events, since the generation event had to precede the usage event. This is illustrated by Figure 4(a) and expressed by Constraint 39 (generation-precedes-usage).

Constraint 39 (generation-precedes-usage)

IF wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and used(use; _a2,e,_t2,_attrs2) THEN gen precedes use.

All usages of an entity precede its invalidation, which is captured by Constraint 40 (usage-precedes-invalidation) (without any explicit graphical representation).

Constraint 40 (usage-precedes-invalidation)

IF used(use; _a1,e,_t1,_attrs1) and wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN use precedes inv.

If an entity is generated by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two generation events that generate the same entity, then one precedes the other. Using this constraint in both directions means that each event precedes the other.

Constraint 41 (generation-generation-ordering)

IF wasGeneratedBy(gen1; e,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e,_a2,_t2,_attrs2) THEN gen1 precedes gen2.

If an entity is invalidated by more than one activity, the events must all be simultaneous. The following constraint requires that if there are two invalidation events that invalidate the same entity, then one precedes the other. Using this constraint in both directions means that each event precedes the other, that is, they are simultaneous.

Constraint 42 (invalidation-invalidation-ordering)

IF wasInvalidatedBy(inv1; e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv2; e,_a2,_t2,_attrs2) THEN inv1 precedes inv2.

If there is a derivation relationship linking e2 and e1, then this means that the entity e1 had some influence on the entity e2; for this to be possible, some event ordering must be satisfied. First, we consider derivations, where the activity and usage are known. In that case, the usage of e1 has to precede the generation of e2. This is illustrated by Figure 4 (b) and expressed by Constraint 43 (derivation-usage-generation-ordering).

Constraint 43 (derivation-usage-generation-ordering)

In this constraint, _a, gen2, use1 must not be placeholders.

IF wasDerivedFrom(_d; _e2,_e1,_a,gen2,use1,_attrs) THEN use1 precedes gen2.

When the activity, generation or usage is unknown, a similar constraint exists, except that it refers to the generation event of e1 rather than its usage event, as illustrated by Figure 4 (c) and expressed by Constraint 44 (derivation-generation-generation-ordering).

Constraint 44 (derivation-generation-generation-ordering)

In this constraint, any _a, _g, _u may be placeholders.

IF wasDerivedFrom(_d; e2,e1,_a,_g,_u,attrs) and wasGeneratedBy(gen1; e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e2,_a2,_t2,_attrs2) THEN gen1 strictly precedes gen2.

This constraint requires the derived entity to be generated strictly following the generation of the original entity. This follows from the [PROV-DM] definition of derivation: A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity, thus the derived entity must be newer than the original entity.

The event ordering is between generations of e1 and e2, as opposed to derivation where usage is known, which implies ordering between the usage of e1 and generation of e2.

The entity that triggered the start of an activity must exist before the activity starts. This is illustrated by Figure 5(a) and expressed by Constraint 45 (wasStartedBy-ordering).

Constraint 45 (wasStartedBy-ordering)

IF wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and wasStartedBy(start; _a,e,_a2,_t2,_attrs2) THEN gen precedes start.
IF wasStartedBy(start; _a,e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN start precedes inv.

Similarly, the entity that triggered the end of an activity must exist before the activity ends, as illustrated by Figure 5(b).

Constraint 46 (wasEndedBy-ordering)

IF wasGeneratedBy(gen; e,_a1,_t1,_attrs1) and wasEndedBy(end; _a,e,_a2,_t2,_attrs2) THEN gen precedes end.
IF wasEndedBy(end; _a,e,_a1,_t1,_attrs1) and wasInvalidatedBy(inv; e,_a2,_t2,_attrs2) THEN end precedes inv.

Figure 5: Summary of instantaneous event ordering constraints for trigger entities

If an entity is a specialization of another, then the more specific entity must have been generated after the less specific entity was generated.

Constraint 47 (specialization-generation-ordering)

IF specializationOf(e2,e1) and wasGeneratedBy(gen1; e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e2,_a2,_t2,_attrs2) THEN gen1 precedes gen2.

Similarly, if an entity is a specialization of another entity, then the invalidation event of the more specific entity precedes that of the less specific entity.

Constraint 48 (specialization-invalidation-ordering)

IF specializationOf(e1,e2) and wasInvalidatedBy(inv1; e1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv2; e2,_a2,_t2,_attrs2) THEN inv1 precedes inv2.

5.2.3 Agent constraints

Like entities and activities, agents have lifetimes that follow a familiar pattern. An agent that is also an entity can be generated and invalidated; an agent that is also an activity can be started or ended. During its lifetime, an agent can participate in interactions such as starting or ending other activities, association with an activity, attribution, or delegation.

Further constraints associated with agents appear in Figure 6 and are discussed below.

Figure 6: Summary of instantaneous event ordering constraints for agents

An activity that was associated with an agent must have some overlap with the agent. The agent may have been generated (or started), or may have become associated with the activity, after the activity start: so, the agent is required to exist before the activity end. Likewise, the agent may be destroyed (or ended), or may terminate its association with the activity, before the activity end: hence, the agent invalidation (or end) is required to happen after the activity start. This is illustrated by Figure 6 (a) and expressed by Constraint 49 (wasAssociatedWith-ordering).

Constraint 49 (wasAssociatedWith-ordering)

In the following inferences, _pl may be a placeholder -.

IF wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasStartedBy(start1; a,_e1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv2; ag,_a2,_t2,_attrs2) THEN start1 precedes inv2.
IF wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasGeneratedBy(gen1; ag,_a1,_t1,_attrs1) and wasEndedBy(end2; a,_e2,_a2,_t2,_attrs2) THEN gen1 precedes end2.
IF wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasStartedBy(start1; a,_e1,_a1,_t1,_attrs1) and wasEndedBy(end2; ag,_e2,_a2,_t2,_attrs2) THEN start1 precedes end2.
IF wasAssociatedWith(_assoc; a,ag,_pl,_attrs) and wasStartedBy(start1; ag,_e1,_a1,_t1,_attrs1) and wasEndedBy(end2; a,_e2,_a2,_t2,_attrs2) THEN start1 precedes end2.

An agent to which an entity was attributed must exist before this entity was generated. This is illustrated by Figure 6 (b) and expressed by Constraint 50 (wasAttributedTo-ordering).

Constraint 50 (wasAttributedTo-ordering)

IF wasAttributedTo(_at; e,ag,_attrs) and wasGeneratedBy(gen1; ag,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e,_a2,_t2,_attrs2) THEN gen1 precedes gen2.
IF wasAttributedTo(_at; e,ag,_attrs) and wasStartedBy(start1; ag,_e1,_a1,_t1,_attrs1) and wasGeneratedBy(gen2; e,_a2,_t2,_attrs2) THEN start1 precedes gen2.

For delegation, two agents need to have some overlap in their lifetime.

Constraint 51 (actedOnBehalfOf-ordering)

IF actedOnBehalfOf(_del; ag2,ag1,_a,_attrs) and wasGeneratedBy(gen1; ag1,_a1,_t1,_attrs1) and wasInvalidatedBy(inv2; ag2,_a2,_t2,_attrs2) THEN gen1 precedes inv2.
IF actedOnBehalfOf(_del; ag2,ag1,_a,_attrs) and wasStartedBy(start1; ag1,_e1,_a1,_t1,_attrs1) and wasEndedBy(end2; ag2,_e2,_a2,_t2,_attrs2) THEN start1 precedes end2.

5.3 Type Constraints

The following rule establishes types denoted by identifiers from their use within expressions. The function typeOf gives the set of types denoted by an identifier. That is, typeOf(e) returns the set of types associated with identifier e. The function typeOf is not a term of PROV, but a construct introduced to validate PROV statements.

For any identifier id, typeOf(id) is a subset of {'entity', 'activity', 'agent', 'prov:Collection', 'prov:EmptyCollection'}. For identifiers that do not have a type, typeOf gives the empty set. Identifiers can have more than one type, because of subtyping (e.g. 'prov:EmptyCollection' is a subtype of 'prov:Collection') or because certain types are not disjoint (such as 'agent' and 'entity'). The set of types does not reflect all of the distinctions among objects, only those relevant for checking validity. In particular, subtypes such as 'plan' and 'bundle' are omitted, and statements such as wasAssociatedWith and mentionOf that have plan or bundle parameters only check that these parameters are entities.

To check if a PROV instance satisfies type constraints, one obtains the types of identifiers by application of Constraint 52 (typing) and checks that none of the impossibility constraints Constraint 57 (entity-activity-disjoint) and Constraint 58 (membership-empty-collection) are violated as a result.

Constraint 52 (typing)

IF entity(e,attrs) THEN 'entity' ∈ typeOf(e).
IF agent(ag,attrs) THEN 'agent' ∈ typeOf(ag).
IF activity(a,t1,t2,attrs) THEN 'activity' ∈ typeOf(a).
IF used(u; a,e,t,attrs) THEN 'activity' ∈ typeOf(a) AND 'entity' ∈ typeOf(e).
IF wasGeneratedBy(gen; e,a,t,attrs) THEN 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a).
IF wasInformedBy(id; a2,a1,attrs) THEN 'activity' ∈ typeOf(a2) AND 'activity' ∈ typeOf(a1).
IF wasStartedBy(id; a2,e,a1,t,attrs) THEN 'activity' ∈ typeOf(a2) AND 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a1).
IF wasEndedBy(id; a2,e,a1,t,attrs) THEN 'activity' ∈ typeOf(a2) AND 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a1).
IF wasInvalidatedBy(id; e,a,t,attrs) THEN 'entity' ∈ typeOf(e) AND 'activity' ∈ typeOf(a).
IF wasDerivedFrom(id; e2, e1, a, g2, u1, attrs) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1) AND 'activity' ∈ typeOf(a). In this constraint, a, g2, and u1 must not be placeholders.
IF wasDerivedFrom(id; e2, e1, -, -, -, attrs) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1).
IF wasAttributedTo(id; e,ag,attrs) THEN 'entity' ∈ typeOf(e) AND 'agent' ∈ typeOf(ag).
IF wasAssociatedWith(id; a,ag,pl,attrs) THEN 'activity' ∈ typeOf(a) AND 'agent' ∈ typeOf(ag) AND 'entity' ∈ typeOf(pl). In this constraint, pl must not be a placeholder.
IF wasAssociatedWith(id; a,ag,-,attrs) THEN 'activity' ∈ typeOf(a) AND 'agent' ∈ typeOf(ag).
IF actedOnBehalfOf(id; ag2,ag1,a,attrs) THEN 'agent' ∈ typeOf(ag2) AND 'agent' ∈ typeOf(ag1) AND 'activity' ∈ typeOf(a).
IF alternateOf(e2, e1) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1).
IF specializationOf(e2, e1) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1).
IF mentionOf(e2,e1,b) THEN 'entity' ∈ typeOf(e2) AND 'entity' ∈ typeOf(e1) AND 'entity' ∈ typeOf(b).
IF hadMember(c,e) THEN 'prov:Collection' ∈ typeOf(c) AND 'entity' ∈ typeOf(c) AND 'entity' ∈ typeOf(e).
IF entity(c,[prov:type='prov:EmptyCollection']) THEN 'entity' ∈ typeOf(c) AND 'prov:Collection' ∈ typeOf(c) AND 'prov:EmptyCollection' ∈ typeOf(c).
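The typing rules can be read as a table that maps each argument position of a statement to a type. The Python sketch below covers only a handful of the rules and then tests for identifiers typed as both 'entity' and 'activity', in the spirit of Constraint 57 (entity-activity-disjoint). The statement layout (kind followed by the typed arguments only, without identifiers, times or attributes), the ARGUMENT_TYPES table and both function names are assumptions made for illustration.

```python
from collections import defaultdict

# Argument position -> type for a few statement kinds, following
# Constraint 52. Only a subset of the rules is covered in this sketch.
ARGUMENT_TYPES = {
    "entity": ("entity",),
    "activity": ("activity",),
    "agent": ("agent",),
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAttributedTo": ("entity", "agent"),
    "wasInformedBy": ("activity", "activity"),
}


def infer_types(statements):
    """Build typeOf as a mapping from identifier to a set of type names.

    Each statement is a tuple (kind, arg1, arg2, ...) holding only the
    arguments that Constraint 52 assigns a type to.
    """
    type_of = defaultdict(set)
    for kind, *args in statements:
        for identifier, type_name in zip(args, ARGUMENT_TYPES[kind]):
            type_of[identifier].add(type_name)
    return type_of


def entity_activity_clashes(type_of):
    """Return the identifiers typed as both 'entity' and 'activity'."""
    return sorted(identifier for identifier, types in type_of.items()
                  if {"entity", "activity"} <= types)
```

An identifier that is both 'entity' and 'agent' is not reported, since those two types are not disjoint.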

5.4 Impossibility constraints

Impossibility constraints require that certain patterns of statements never appear in valid PROV instances. Impossibility constraints have the following general form:

Constraint-example NNN (impossible-example)

IF hyp₁ and ... and hyp_n THEN INVALID.

Checking an impossibility constraint on instance I means checking whether there is any way of matching the pattern hyp₁, ..., hyp_n. If there is, then checking the constraint on I fails (which implies that I is invalid).

A derivation with unspecified activity wasDerivedFrom(id;e1,e2,-,g,u,attrs) represents a derivation that takes one or more steps, whose activity, generation and use events are unspecified. It is forbidden to specify a generation or use event without specifying the activity.

Constraint 53 (impossible-unspecified-derivation-generation-use)

In the following rules, g and u must not be -.

IF wasDerivedFrom(_id;_e2,_e1,-,g,-,attrs) THEN INVALID.
IF wasDerivedFrom(_id;_e2,_e1,-,-,u,attrs) THEN INVALID.
IF wasDerivedFrom(_id;_e2,_e1,-,g,u,attrs) THEN INVALID.
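The three rules amount to a single test on the optional arguments of a derivation. A minimal Python sketch, assuming that None represents the placeholder - and using a hypothetical function name:

```python
def derivation_is_impossible(activity, generation, usage):
    """Return True if a derivation violates Constraint 53.

    None stands for the placeholder "-". A derivation is impossible when
    the activity is unspecified but a generation or a usage is given.
    """
    return activity is None and (generation is not None or usage is not None)
```

A derivation with all three arguments unspecified, or with the activity specified, passes this test.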

As noted previously, specialization is a strict partial order: it is irreflexive and transitive.

Constraint 54 (impossible-specialization-reflexive)

IF specializationOf(e,e) THEN INVALID.

Furthermore, identifiers of basic relationships are disjoint.

Constraint 55 (impossible-property-overlap)

For each r and s in { used, wasGeneratedBy, wasInvalidatedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf} such that r and s are different relation names, the following constraint holds:

IF r(id; a₁,...,a_m) and s(id; b₁,...,b_n) THEN INVALID.

Since wasInfluencedBy is a superproperty of many other properties, it is excluded from the set of properties whose identifiers are required to be pairwise disjoint. The following example illustrates this observation:

wasInfluencedBy(id;e2,e1)
wasDerivedFrom(id;e2,e1)

This satisfies the disjointness constraint.

There is, however, no constraint requiring that every influence relationship is accompanied by a more specific relationship having the same identifier. The following valid example illustrates this observation:

wasInfluencedBy(id; e2,e1)

This is valid; there is no inferrable information about what kind of influence relates e2 and e1, other than its identity.
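The disjointness check of Constraint 55 can be sketched as a single pass over the identified statements. The input layout (pairs of relation name and identifier) and the function name property_overlaps are assumptions of this example; the set of relation names is the one listed in the constraint.

```python
# Relation names whose identifiers must be pairwise disjoint (Constraint 55).
DISJOINT_RELATIONS = {
    "used", "wasGeneratedBy", "wasInvalidatedBy", "wasStartedBy",
    "wasEndedBy", "wasInformedBy", "wasAttributedTo",
    "wasAssociatedWith", "actedOnBehalfOf",
}


def property_overlaps(statements):
    """Return identifiers shared by two different relations in the set.

    `statements` is an iterable of (relation, identifier) pairs; an
    identifier of None means the statement has no explicit identifier.
    """
    first_relation = {}
    clashes = set()
    for relation, identifier in statements:
        if relation not in DISJOINT_RELATIONS or identifier is None:
            continue  # e.g. wasInfluencedBy is deliberately not checked
        if first_relation.setdefault(identifier, relation) != relation:
            clashes.add(identifier)
    return clashes
```

With this sketch, the pair wasInfluencedBy(id;e2,e1) and wasDerivedFrom(id;e2,e1) produces no clash, while a used statement and a wasGeneratedBy statement sharing an identifier do.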

Identifiers of entities, agents and activities cannot also be identifiers of properties.

Constraint 56 (impossible-object-property-overlap)

For each p in {entity, activity or agent} and for each r in { used, wasGeneratedBy, wasInvalidatedBy, wasInfluencedBy, wasStartedBy, wasEndedBy, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf}, the following impossibility constraint holds:

IF p(id,a₁,...,a_n) and r(id; b₁,...,b_n) THEN INVALID.

The sets of entities and activities are disjoint, as expressed by the following constraint:

Constraint 57 (entity-activity-disjoint)

IF 'entity' ∈ typeOf(id) AND 'activity' ∈ typeOf(id) THEN INVALID.

There is no disjointness between entities and agents. This is because one might want to make statements about the provenance of an agent, by making it an entity. For example, one can assert both entity(a1) and agent(a1) in a valid PROV instance. Similarly, there is no disjointness between activities and agents, and one can assert both activity(a1) and agent(a1) in a valid PROV instance. However, one should keep in mind that some specific types of agents may not be suitable as activities. For example, asserting statements such as agent(Bob, [type=prov:Person]) and activity(Bob) is discouraged. In these cases, disjointness can be ensured by explicitly asserting the agent as both agent and entity, and applying Constraint 57 (entity-activity-disjoint).

An empty collection cannot contain any member, expressed by the following constraint:

Constraint 58 (membership-empty-collection)

IF hadMember(c,e) and 'prov:EmptyCollection' ∈ typeOf(c) THEN INVALID.

Stage #	Inference	Hypotheses	Conclusions
1	19, 20, 21, 22	specializationOf, mentionOf	specializationOf, entity
2	7, 8, 13, 14	entity, activity, wasAttributedTo, actedOnBehalfOf	wasInvalidatedBy, wasStartedBy, wasEndedBy
3	9, 10	wasStartedBy, wasEndedBy	wasGeneratedBy
4	11, 12	wasDerivedFrom	wasGeneratedBy, used, alternateOf
5	16, 17, 18	alternateOf, entity	alternateOf
6	5, 6	wasInformedBy, wasGeneratedBy, used	wasInformedBy, wasGeneratedBy, used
7	15	many	wasInfluencedBy

ericP's notes on: Constraints of the Provenance Data Model

notes on 11 September 2012 LC

Abstract

Status of This Document

PROV Family of Specifications

How to read the PROV Family of Specifications

Table of Contents

1. Introduction

1.1 Conventions

1.2 Purpose of this document

1.3 Structure of this document

1.4 Audience