ConsolidatedConcepts

From Provenance WG Wiki
Revision as of 17:38, 9 July 2011 by Tlebo

This document lists the consolidated concepts as part of the F2F1 Model Proposal. Members of the Model task force are here curating the ProvenanceConcepts. For further comments, please use the discussion page. This page led to F2F1ConceptDefinitions during F2F1.

Introduction

In the real world, there are "stuffs" that can be physical, digital, logical, conceptual, or otherwise.

PIL (name to be determined!) is an assertion language which allows asserters to make assertions about stuffs and activities in the real world (as they view it) and how they influence each other, in other words, to describe their provenance.

In this document, we define a set of key provenance-related concepts following two months of discussion in the Provenance WG, and we raise topics for discussion or clarification.

Thing

Persistent page about this concept: ConceptThing

For the genesis of the concepts described in this subsection, please refer to the section "Genesis of these concepts" (by Khalid).

Thing: "things" represent real-world stuffs and have properties modeling aspects of stuff states. Things have:

  • an identity
  • a set of invariant (== immutable) properties
  • a set of mutable properties

There are no assumptions that the sets of properties are complete, or that the properties are independent/orthogonal of each other.

Mutable, immutable properties:

  • mutable properties may change their value during the lifecycle (lifetime? check) of the thing they describe
  • immutable properties do not change their value
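As a rough illustration of the definition above, a thing can be sketched as a record with an identity, a read-only set of invariant properties, and a freely editable set of mutable properties. This is only a sketch: the class name `Thing` and its field names are hypothetical and are not part of PIL, and the property values are borrowed from the File example further down this page.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass
class Thing:
    """Hypothetical model of a thing; not part of PIL itself."""
    identity: str
    invariant: Mapping[str, Any] = field(default_factory=dict)
    mutable: dict = field(default_factory=dict)

    def __post_init__(self):
        # Copy into a read-only view so invariant values cannot be reassigned.
        self.invariant = MappingProxyType(dict(self.invariant))


# Property values taken from the File example.
i0 = Thing(
    "i0",
    invariant={"name": "/home/towns.txt", "creator": "Alice"},
    mutable={"content": ""},
)

# A mutable property may change its value.
i0.mutable["content"] = "London and Edinburgh"

# An invariant property may not: the read-only view rejects assignment.
try:
    i0.invariant["creator"] = "Bob"
    invariant_was_changed = True
except TypeError:
    invariant_was_changed = False
```

Nothing in this sketch assumes that the property sets are complete or independent of each other, in line with the note above.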

Asserter/Observer:

An Asserter/Observer asserts a description of the provenance of one or more things. (Paul Groth)

Invariant View or Perspective on a Thing (IVP)

Persistent page about this concept: ConceptInvariantViewOnThing (the discussion of "IVP of" is taken from here)

"IVP of" (an Invariant View or Perspective of) is a relation between two things A and B.

  • We say that "B is an IVP of A" if (sufficient condition? check) for its asserter, A and B represent the same stuff in the real world, and the stuff states modelled by A and B are consistent.
  • "B is an IVP of A" is valid relative to an asserter if, for its asserter, the following holds:
    • the properties they share must have corresponding values
    • some mutable properties of A correspond to some immutable properties of B

Example: rectangle A may have varying length and width, whereas B, an IVP of A, may have an invariant area. (Khalid, Stian)

Note (Paolo): the relation "IVP of" is anti-symmetric (B IVP of A => not (A IVP of B); check).
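The two conditions listed above can be sketched as a simple check. This is a deliberately simplified illustration, not a definition: the names `Thing` and `is_ivp_of` are hypothetical, "corresponding values" is reduced to equality of same-named invariant properties, and mutable properties are represented by name only. Correspondences that require conversion (such as the rectangle's length and width versus its area) are left to the asserter and are not captured here.

```python
from dataclasses import dataclass, field


@dataclass
class Thing:
    """Hypothetical thing: fixed property values plus names of mutable properties."""
    identity: str
    invariant: dict = field(default_factory=dict)  # property name -> fixed value
    mutable: set = field(default_factory=set)      # names of properties that may change


def is_ivp_of(b: Thing, a: Thing) -> bool:
    """Simplified check of the two conditions for 'b is an IVP of a'."""
    shared = set(a.invariant) & set(b.invariant)
    # Condition 1 (simplified): shared invariant properties have equal values.
    values_correspond = all(a.invariant[k] == b.invariant[k] for k in shared)
    # Condition 2: some mutable property of a is an invariant property of b.
    fixes_something = bool(a.mutable & set(b.invariant))
    return values_correspond and fixes_something


# i0 and i2 from the File example: i2 fixes the content that is mutable in i0.
i0 = Thing("i0", {"name": "/home/towns.txt", "creator": "Alice"}, {"content"})
i2 = Thing("i2", {"name": "/home/towns.txt", "creator": "Alice",
                  "content": "London and Edinburgh"})

i2_is_ivp_of_i0 = is_ivp_of(i2, i0)
i0_is_ivp_of_i2 = is_ivp_of(i0, i2)
```

Under these assumptions the check holds in one direction only for i0 and i2, which matches the note that the relation is not expected to hold both ways.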

Consistency of states:

States may be modelled by different ontologies. It is left to the asserter to establish their consistency. This is outside the scope of PIL. (Luc)

Correspondence amongst properties:

  • A property P1 of A corresponds to one or more properties of B when P1 can be converted into those properties of B (e.g. temperature conversion from Fahrenheit to Celsius) or can be merged (Luc, Khalid)
  • such correspondence is not explicit in PIL. Rather, it is the asserter's responsibility (Luc)

Examples

Journalism use case

Note: The digital journalism example does not yet highlight IVP very well. Some relations might be IVPs, such as d1 and d2 both being IVPs of an abstract d0 (Stian)

File example

From the FileExample:

  • i0: A file, for which we have a property name (/home/towns.txt) and a property creator (Alice), which are invariant in the interval [t,t+4[
  • i1: A file (i0) with added property content which is empty; it exists in the interval [t,t+1[
  • i2: A file (i0) with added property content with value London and Edinburgh; it exists in the interval [t+1,t+3[
  • i3: A file (i0) with added property content with value London, Edinburgh, NY, LA; it exists in the interval [t+3,t+4[
  • i4: the information sent to sendmail at t+2 (that's a copy of i2's content)
  • i5: the information sent to sendmail at t+4 (that's a copy of i3's content)

Here i1, i2 and i3 are all IVPs of i0. i0 as a thing identifies the file across edits. Its content property is mutable, but with i2 we talk about an IVP of i0, namely i0 with a particular content. When we say that Charles emailed the content "London, Edinburgh", that email i4 is derived from i2, but not from i3.

Here i2 can't be said to be derived from i0; i2 is just a more specific, restricted perspective over i0 that allows us to talk about the file with a certain content. You could similarly go 'higher up' and talk about i00 with name being mutable (tracking the abstract 'file' across renames), or go further down and lock down properties such as last viewed or file permissions.

In this example you can also talk about the File being an IVP of a Document (where file encoding and format are mutable), which is an IVP of a List of Towns. The provenance about that list would talk about who proposed to add NY and LA rather than who edited the file.

If you print the file, then that Printout will be derived from the File, but will also be an IVP of the same Document as the File.

Issues for discussion

  1. Terminology: do we keep the terms "stuff" and "thing", and how do we name "IVP of"?
  2. on the need for invariant (immutable) properties:
    1. we need invariance because we need a snapshot of something (some properties), which we can refer to, so that we can explain its provenance (Luc)
    2. I think we should make an effort to see if we can do without this assumption. Assuming that a property is invariant may be too strong in practice. (Paolo)
  3. "Derivation" and "IVP of" seem to overlap in a sufficient number of examples to raise a flag. Can a clear and rigorous distinction be made? (Paolo)
  4. Can the mutable/immutable nature of a property always be determined? This seems less than crisp, since some of the examples (see FileExample) indicate that some properties are only immutable within a certain scope (a time interval, for example) (Paolo)

Additional notes (Paolo): I have put some of my personal notes in "Additional thoughts on IVPT" as a contribution to the discussion on these concepts.

Genesis of these concepts

This section is written by Khalid.

Initially the working group started by discussing the concept of “resource”. More specifically, three concepts were borrowed from the Web Architecture:

  • resource
  • resource state, and
  • resource state representation.

There were many discussions about these three concepts, as shown by the working group mailing list, which contains over 70 emails with a subject containing the term “resource”. Among the questions discussed without reaching consensus was whether we want to capture the provenance of a resource and/or resource state and/or resource state representation. Some members of the working group expressed concerns that the term resource has a particular meaning within the Web Architecture, and that it would be better to use a different term. Another candidate term that was quickly discarded is “Entity”, because it has a specific meaning in the context of HTTP [and another in classic data modelling -PM]. The discussion of resource led to terms such as “resource snapshot” and “resource text”. After a month, on the 6th of June 2011, discussion shifted from resource to “Invariant View or Perspective on a Thing” (or IVPT). Although many members felt that the term was not “elegant”, progress and consensus on what it means were reached quickly compared with the term “resource”. Discussion on IVPT led to two concepts, “stuff” and “thing”, and one relationship, “IVP of”.

Note on the provenance of these concepts:

An initial notion of thing (as defined here: definition of thing) was agreed upon during Teleconference 2011-06-16, as a way to allow definitions of other concepts.

It has since evolved here and the list of concepts that follows reflects for the most part the latest version, along with recorded comments on it.

Thus, attributions to people may reflect only the latest edition, rather than the original proposal.

Process execution

Persistent page about this concept: ConceptProcessExecution

Summarised by Stian (To be reviewed by Satya, Paolo)

A process execution is an activity that uses (zero or more) things, performs a piece of work, and generates (zero or more) new things. (Graham, Paolo)

The activity can be automatic (typically having a process definition and being in the form of a script, workflow or service) or manual (e.g. reviewing, decision making, authoring). (Khalid, Jun)

A process execution has a duration, i.e. it spans a time interval. Statements denoting this duration are optional. (email vote). A process execution has either completed (occurred in the past) or is occurring in the present (partially complete). For the asserter, the start of an execution is always in the past, from the instant referred to by any assertion made about it. (Teleconference 2011-06-16)

  1. A distinction is made between process execution and process specification/definition (Teleconference 2011-06-09). Specifically, a process execution refers to an instance of a process specification (or action specification). Process specification/definition is referred to as recipe in the charter. However, this distinction is not pursued any further than this, as it has been deemed out of scope for this WG (Teleconference 2011-06-09)
  2. Terminology (for process specification/definition, process execution, recipe) needs to be agreed on, if appropriate (Teleconference 2011-06-09)
  3. A process execution has a duration, i.e. it spans a time interval. Statements denoting this duration are optional. (email vote)
  4. A process execution has either completed (occurred in the past) or is occurring in the present (partially complete). In other words, the start of a process execution is always in the past, from the instant referred to by any assertion made about it. (Teleconference 2011-06-16, http://www.w3.org/2011/prov/meeting/2011-06-16#resolution_2)
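The definition above can be sketched as a small record type. This is an illustration only, under assumptions of our own: the class name `ProcessExecution` and its fields are hypothetical, things are referred to by identifier strings, and times are plain numbers. Both time fields are optional because statements about the duration are optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessExecution:
    """Hypothetical record of a process execution; not part of PIL itself."""
    identity: str
    used: List[str] = field(default_factory=list)       # zero or more things
    generated: List[str] = field(default_factory=list)  # zero or more new things
    begin: Optional[float] = None                        # optional time statement
    end: Optional[float] = None                          # optional time statement

    def __post_init__(self):
        # When both times are stated, the beginning cannot come after the end.
        if self.begin is not None and self.end is not None and self.begin > self.end:
            raise ValueError("a process execution cannot end before it begins")

    def duration(self) -> Optional[float]:
        """Return the stated duration, or None when it has not been asserted."""
        if self.begin is None or self.end is None:
            return None
        return self.end - self.begin


# Journalism example: gov converts data (d1) to RDF (f1); only the start is stated.
convert = ProcessExecution("convert", used=["d1"], generated=["f1"], begin=1.0)

# An execution may use and generate nothing, and may carry no time statements.
noop = ProcessExecution("noop")

# An execution with both times stated has a computable duration.
timed = ProcessExecution("timed", begin=1.0, end=3.0)
```

Leaving `end` unset does not by itself say whether the execution is still occurring or whether its end was simply not asserted; the sketch makes no attempt to distinguish the two.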


Examples

Examples of process execution in the Data journalism example: (Satya)

  • government (gov) converts data (d1) to RDF (f1) at time (t1) (Jun, Khalid)
  • government (gov) publishes RDF data (f1) (Satya)
  • analyst (alice) downloads a turtle serialization (lcp1). (Satya)
  • alice generates chart (c1) from the turtle (lcp1) file and the statistical assumptions (stats1), using some software (tools1) (Jun)
  • newspaper (news) obtains image (img1) (Satya)
  • blogger (bob) publishes the chart (c2) (Khalid)

Issues for discussion

  1. It should be understood that, in the definition, use, perform a piece of work, and generate do not have to be performed sequentially, e.g. some generate can happen before some use. ACCEPTED F2F1
  2. A process execution should be associated with an actor. (Proposed by Jun on 2011-05-31) POSTPONED
  3. A process specification can be either pre-defined or not. (Proposed by Khalid on 2011-05-31) ACCEPTED F2F1
  4. A process execution may consume and/or generate IVPTs. (Proposed by Paolo on 2011-05-20) NO LONGER RELEVANT
  5. A process execution represents a specific data processing activity in which all inputs and outputs are fully determined. (Proposed by Graham and curated by Jun on 2011-06-20) REJECTED F2F1 -- (see alternative statement in the minutes)
  6. If we adopt an “OS Style” process model, then a distinction needs to be made between process specification, process, which is an instance of a process specification, and process execution, which is the state of a process within a time interval, when the activities specified in the process specification take place. This may have been resolved by the agreement above, where the distinction is partially made (process spec vs process exec), and it was decided that process spec == recipe is out of scope. I will not insist on process (Paolo) -- NO LONGER RELEVANT F2F1

Time

Persistent page about this concept: ConceptTime

curated by Stian, Paolo

The notion of Time is latent in the definition of [Process Execution].

This is because, intuitively, data generated by a process must have been created within the duration of the execution, and data used must have existed before the process execution finishes. This was proposed for the 2011-06-23 telcon, but the proposal brought up further issues. These issues are now collected under this Time heading.

Issues for discussion

  1. Time is currently not a first-class concept in the model.
  2. One of the reasons is that the consistent determination of time in a distributed setting may be difficult to achieve. It may require knowledge of the time measurements themselves (who made them, when and how the clock was synchronized, the frame of reference for satellites in orbit, etc.).
  3. Despite these difficulties, there is a recognition that time is naturally part of users' provenance practice, for instance in a scientist's lab book.
  4. Ordering of processes can be used as a suitable surrogate for actual time.
  5. On the same note as (4): can we introduce an event line as a surrogate for a timeline? (Paolo)
  6. We use the set of constraints as a starting point for building an understanding of PIL (Teleconference 2011-06-30). In PIL, there are different kinds of events: beginning of process execution, end of process execution, generation of thing, use of thing, which satisfy some ordering, according to relation "precede":
      • Ordering Property: Thing
        • Creation of a thing precedes any of its use
      • Ordering Property: Process Execution
        • Beginning of a process execution precedes its end
      • Ordering Property: Generation
        • Generation of a thing by process execution P is preceded by beginning of P and precedes end of P
      • Ordering Property: Use
        • Use of a thing by process execution P is preceded by beginning of P and precedes end of P
      • Ordering Property: Derivation
        • If a thing B is derived from a thing A, then the use of A precedes the generation of B.
      • Ordering holds irrespective of time annotations to events, which may be provided by different clocks.
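The ordering properties above can be read as edges of a "precedes" graph over events, and a set of assertions is orderable only if that graph has no cycle. The following is a minimal sketch under assumptions of our own: the function names and the event encoding (tuples such as `("generate", "f1")`) are hypothetical, and the Derivation property is applied only when the process execution that generated B is also asserted to have used A. No clock values are involved, in line with the last point above.

```python
from graphlib import CycleError, TopologicalSorter


def precedence_edges(generated, used, derived):
    """Build (earlier, later) event pairs from hypothetical assertions.

    generated: dict mapping thing -> process execution that generated it
    used:      list of (process execution, thing) pairs
    derived:   list of (b, a) pairs meaning "b is derived from a"
    """
    edges = set()
    processes = set(generated.values()) | {p for p, _ in used}
    for p in processes:
        # Process Execution: beginning precedes end.
        edges.add((("begin", p), ("end", p)))
    for thing, p in generated.items():
        # Generation: preceded by the beginning of P and precedes the end of P.
        edges.add((("begin", p), ("generate", thing)))
        edges.add((("generate", thing), ("end", p)))
    for p, thing in used:
        # Use: preceded by the beginning of P and precedes the end of P.
        edges.add((("begin", p), ("use", p, thing)))
        edges.add((("use", p, thing), ("end", p)))
        if thing in generated:
            # Thing: creation precedes any use.
            edges.add((("generate", thing), ("use", p, thing)))
    for b, a in derived:
        p = generated.get(b)
        if p is not None and (p, a) in used:
            # Derivation: the use of A precedes the generation of B.
            edges.add((("use", p, a), ("generate", b)))
    return edges


def is_consistent(edges):
    """Return True when the events admit at least one total order."""
    sorter = TopologicalSorter()
    for earlier, later in edges:
        sorter.add(later, earlier)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


# Journalism example: d1 is converted to f1, then f1 is published as r1.
journalism = precedence_edges(
    generated={"f1": "convert", "r1": "publish"},
    used=[("convert", "d1"), ("publish", "f1")],
    derived=[("f1", "d1"), ("r1", "f1")],
)

# Contrived assertions in which x and y are each derived from the other.
circular = precedence_edges(
    generated={"x": "p2", "y": "p1"},
    used=[("p1", "x"), ("p2", "y")],
    derived=[("y", "x"), ("x", "y")],
)

journalism_ok = is_consistent(journalism)
circular_ok = is_consistent(circular)
```

Within a single process execution the sketch leaves uses and generations unordered unless a derivation links them, so some generation may precede some use.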


The provenance of a provenance account can define more details (out of scope for WG) about how the asserter's statements about time (and other properties) have been determined. This can then be taken into account before making assumptions about the timing of different events. [don't get this --PM]

Derivation

Persistent page about this concept: ConceptDerivation

To be summarised by Satya (+Jun)

Derivation expresses that some stuff is transformed from, created from, or affected by other stuff. A thing B is derived from a thing A if the values of some invariant properties of B are at least partially determined by the values of some invariant properties of A. (Accepted by WG in June 30 telcon)

Derivation is a property linking two (or more) distinct things that represents how a thing X is affected by or transformed/created from another thing(s) Y, where Y existed before X began to exist. (Simon, Luc, Jun, Satya)


Example

Examples of derivation in the journalism example:

  • RDF (f1), converted by gov, is derived from data (d1), because f1 is transformed from d1 (Jun)

Issues for discussion

  1. Is temporal dimension explicitly associated with the things participating in the derivation property? (Graham)
  2. The "derivation" or "partially determined by" relationship could be a subjective or context-dependent assertion, not an objectively true or false statement. (James)
  3. Does derivation include control dependency? If so, is this reflected in this definition?

Use

Persistent page about this concept: ConceptUse

Summarised by Jun

Use is the action/transition/event by which a process execution consumes a thing.

Use is associated with a time (the time at which the thing is used), though statements about use do not have to mention time.

Example

  • Data (d1) was used in a process execution at time (t1), that generated RDF data (f1)
  • RDF data (f1) along with its provenance (prov) were used to generate a Web resource (r1)
  • The turtle serialization (lcp1) of the resource (r1) and software (tools1) with some statistical assumptions (stats1) were used to generate a chart (c1)

Issues for discussion

  1. For a thing X to be used by a process execution P, the following must hold (see discussion):
    1. X was generated before its use
    2. Use occurs after P's beginning and before P's end
  2. P exploits/reads the values of some of X's invariant properties but not the values of its variant properties
  3. P exploits/reads the values of some of X's invariant properties
  4. Symmetrically to the discussion on generation, we need to decide whether use should be modelled as a concept or as a relationship.
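The two timing conditions in item 1 can be sketched as a check over optional time statements. This is an illustration under assumptions of our own: the function names are hypothetical, times are plain numbers, "before" is read as strictly earlier, and a time that has not been asserted (`None`) is treated as unable to violate a condition, since statements about use do not have to mention time.

```python
def _before(earlier, later):
    """True unless both times are known and 'earlier' is not strictly earlier."""
    return earlier is None or later is None or earlier < later


def violated_use_conditions(generated_at, used_at, p_begin, p_end):
    """List which of the two timing conditions for use are violated."""
    violations = []
    if not _before(generated_at, used_at):
        violations.append("X was generated before its use")
    if not (_before(p_begin, used_at) and _before(used_at, p_end)):
        violations.append("use occurs after P's beginning and before P's end")
    return violations


# X generated at 0 and used at 2 by an execution spanning 1 to 3: no violation.
ok = violated_use_conditions(generated_at=0, used_at=2, p_begin=1, p_end=3)

# Use asserted at 4, after the execution ended at 3.
late = violated_use_conditions(generated_at=0, used_at=4, p_begin=1, p_end=3)

# Only the time of use is asserted: nothing can be checked, so nothing is violated.
unknown = violated_use_conditions(None, 2, None, None)
```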

Generation

Persistent page about this concept: ConceptGeneration

Summarised by Jun (+Stian +Paolo)

Generation is the action/transition/event by which a process execution creates a thing.

Generation is associated with a time (the time at which the new thing begins its existence), though statements about generation do not have to mention time.

Example

Journalism use case

Examples of generation in the journalism example:

  • The generation of RDF (f1) at time t1 through the conversion process by gov
  • The generation of the provenance information (prov) regarding RDF (f1) through the process of the gov generating this provenance information
  • The generation of rdf data (r1) as a Web resource through the publication process that publishes RDF data (f1) along with its provenance (prov) on a portal with a license (li1)

File example

From the FileExample:

  • The generation of file i1
  • The generation of file i2
  • The generation of file i3

Here i1, i2 and i3 are all IVPs of file i0, created by Alice.

Issues for discussion

  1. Whether generation should be modelled as a concept itself or as a relationship between concepts, such as a process execution and a thing. This issue is raised based on the initial definitions proposed by Jun. However, Luc noted that "Whether this is a concept or a relationship seems to me more relevant to the formalization of the vocabulary, and may depend on what we want to say (or infer) about such events." Luc 08:45, 2 June 2011 (UTC) and related note.
  2. Should we also mention in the definition that, for a thing X to be generated by a process execution P, the following must hold (see discussion):
    1. X must be something that did not exist before generation time (this means that nothing had the thing's identity before that time)
    2. generation occurs after P's beginning and before P's end
    3. P and things used by P determine the values of X's invariant properties, but not the values of variant properties (too(?) strict)
    4. P and things used by P determine values of some of X's invariant properties (less strict)
  3. Whether generation should be an action/transition/event relating IVP of things or things or does it not matter - NO LONGER RELEVANT
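Conditions 2.1 and 2.2 above can be sketched as a small log that refuses a generation when the thing's identity already exists or when the generation falls outside the process execution. This is an illustration only: `GenerationLog`, `GenerationError`, and the process names `create` and `edit` are hypothetical, times are plain numbers, and conditions 2.3 and 2.4 about invariant properties are not modelled.

```python
class GenerationError(ValueError):
    """Raised when a hypothetical generation condition is violated."""


class GenerationLog:
    """Hypothetical record of which things have been generated, and when."""

    def __init__(self):
        self._generated = {}  # thing identity -> (process identity, time)

    def generate(self, thing, process, at, p_begin, p_end):
        # Condition 2.1: nothing had the thing's identity before this generation.
        if thing in self._generated:
            raise GenerationError(f"{thing} already exists")
        # Condition 2.2: generation occurs after P's beginning and before P's end.
        if not (p_begin < at < p_end):
            raise GenerationError(f"{thing} is not generated within {process}")
        self._generated[thing] = (process, at)

    def generated_at(self, thing):
        return self._generated[thing][1]


log = GenerationLog()
log.generate("i1", "create", at=0.5, p_begin=0.0, p_end=1.0)

# A second generation of the same identity is rejected.
try:
    log.generate("i1", "edit", at=1.5, p_begin=1.0, p_end=2.0)
    duplicate_rejected = False
except GenerationError:
    duplicate_rejected = True
```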

Agent

Persistent page about this concept: ConceptAgent

Summarized by Satya (To be reviewed by Stian, Jun)

Agent is an active thing that is defined with respect to a process execution (Paul, Stian, Jun, Stephan, Jim, Satya).

An agent can have provenance assertions about it (Jim).

Example

Examples from the Data journalism example:

  • Organization: government (gov) is an agent, involved in the conversion of data (d1) to RDF (f1) (Jun)
  • Person: a document (art1) written by agent Joe (joe)
  • Computer Agent: new chart (c2) based on the data (lcp2) using some software (tools2), where tools2 is an agent

Issues for discussion

  1. Is there a distinction between software tools and humans as agents? (Martin, Jim)
  2. Should agent and process execution be linked by causal relation? (Martin, Paul, Satya)
  3. Should agents that may/should have played an active role in a process execution be part of provenance assertions? (Stian, Graham)
  4. Can agents be defined as roles (of agency) associated with things (in context of process execution)? (Stephan, Jim)
  5. It is not necessary that an agent must be involved in a process execution (Paul)
  6. What if we agreed that (a) Agents are things, and (b) a Process Execution involves Participants, which are things, with a Role in the execution. Then we could talk about Agents independently of an execution, get them involved in executions with a Role, naturally associate provenance to them, etc. (Paolo)

Ordering of process execution

Persistent page about this concept: ConceptOrderingOfProcesses

Summarized by Satya

Ordering of process executions (in provenance) needs to be modeled as a property linking process entities in a specific order along a particular dimension (temporal or control flow) (Satya)

Example

Examples from the Data journalism example:

  • government (gov) converts data (d1) before government (gov) publishes RDF data (f1)
  • government (gov) publishes an update (d2) after government (gov) publishes RDF data (f1)

Issues for discussion

  • Note: this concept has not been actively discussed; the definition is thus the first one proposed by Satya.

Other Concepts

Consensus has not been reached on other concepts. We provide here a few pointers to discussions and proposals:

  • Collections: (Paolo) I have added one more page, where I propose specific relations to be added for expressing assertions concerning elements of data structures, specifically ordered trees (nested ordered lists). In the future these can be extended to other data structures. The rationale is to provide a way to express precise provenance for structured data. The page is available here.
  • Version: ConceptVersion