Re: PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]

Hi Simon,

That's a good example, thanks!

Let me try to explain how I see it:

With

isDerivedFrom (report1, data1)

the asserter has deep knowledge of the process execution that underpins
this derivation. In particular, it is PE workflow1 that generates report1
and uses data1. Hence, both the generation event for report1 and the use
event for data1 occur during workflow1.

In the provenance challenge, when you were using slicing techniques to
extract derivations from process definitions, I would argue you were
generating similar derivations.

With

isDerivedFromInMultipleSteps (report2, data2)

the asserter is much less precise: it does not state whether a single
process is involved for generation/use, nor which interval they occur in.

Furthermore, in this example, with the provenance given, one cannot
ascertain whether 'unpublished2' is in the derivation path between
report2 and data2.

A stronger provenance would have been

isDerivedFrom (report2, unpublished2)

isDerivedFrom(unpublished2, data2)


from which we can infer by transitive closure

isDerivedFrom+ (report2, data2)
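
To make that inference concrete, here is a minimal sketch in Python. The
function name transitive_closure and the (derived, source) tuple encoding
are illustrative assumptions, not part of the model; the code simply
computes isDerivedFrom+ from the asserted isDerivedFrom edges.

```python
def transitive_closure(edges):
    """Return all (derived, source) pairs linked by one or more steps.

    `edges` is a set of (derived, source) tuples, each standing for one
    asserted isDerivedFrom(derived, source). Hypothetical encoding.
    """
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Join every pair (a, b) with every pair (b, d) to obtain (a, d).
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# The two one-step derivations of the "stronger provenance" above.
is_derived_from = {
    ("report2", "unpublished2"),
    ("unpublished2", "data2"),
}

is_derived_from_plus = transitive_closure(is_derived_from)
```

With these two edges, is_derived_from_plus also contains the pair
("report2", "data2"), which is the inferred isDerivedFrom+ statement.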


So, to me,
1. isDerivedFrom is fundamental in the model, and requires deep/precise
   knowledge of process executions.
2. isDerivedFrom+ is a useful inference (transitive closure).
3. isDerivedFromInMultipleSteps is a convenience assertion, but not
   as precise as 1 and 2.

We could drop 3, but then you wouldn't be able to express your second
example.

Asserting that
  isDerivedFrom(report2, data2)
would be very different. It would mean that the process execution that
generated report2 also used data2.

So,

used (workflow1.2, data2, r) for some role r.

But that's not the intent.
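
This reading can be checked mechanically. The sketch below is in Python;
the dictionary and set encodings and the name supports_one_step are
hypothetical, and roles are omitted. It tests whether the process
execution that generated a BOB also used the claimed source, using the
assertions from your two examples. It assumes a BOB is generated by at
most one process execution.

```python
# isGeneratedBy assertions: BOB -> the process execution that generated it.
generated_by = {
    "report1": "workflow1",
    "unpublished2": "workflow1.1",
    "report2": "workflow1.2",
}

# used assertions, as (process execution, BOB) pairs.
used = {
    ("workflow1", "data1"),
    ("workflow1.1", "data2"),
    ("workflow1.2", "unpublished2"),
}


def supports_one_step(derived, source):
    """True if the PE that generated `derived` also used `source`."""
    pe = generated_by.get(derived)
    return pe is not None and (pe, source) in used


one_step_report1 = supports_one_step("report1", "data1")
one_step_report2 = supports_one_step("report2", "data2")
one_step_via_unpublished = supports_one_step("report2", "unpublished2")
```

Here one_step_report1 is True, while one_step_report2 is False because
workflow1.2 used unpublished2, not data2. The one-step reading only holds
for report2 with respect to unpublished2.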

What do you think?
Regards,
Luc




On 01/08/11 16:53, Simon Miles wrote:
> Hi Luc,
>
> OK. Here's my stab at a motivating example.
>
> An organisation, Org, wants to use the WG standard to record and
> provide access to provenance data on the documents it makes available
> online to its clients. It has storage limits on the provenance it can
> maintain.
>
> Alice regularly receives government data sets and for each, creates a
> report which is published online. Looking for a minimal way to express
> this using PIL, Org decides on one BOB for each data set, one for each
> report, one process representing the create-and-publish workflow, and
> a derivation link to show that the report is based on the data set. A
> given instance of this, for one data set, is:
>
>    bob (data1, [ type: "File", location: "/shared/crime1.data" ])
>    bob (report1, [ type: "File", location:
> "http://example.com/report1.pdf", creator: "Alice" ])
>    processExecution (workflow1, create-and-publish, t)
>    isGeneratedBy (report1, workflow1, out)
>    used (workflow1, data1, in)
>    isDerivedFrom (report1, data1)
>
> A client, Clive, finds a mistake in report1, looks at the provenance
> and, being "creator", Alice gets the blame. However, the error is
> actually due to Bob, who published Alice's report, messing up the axes
> on a graph. To avoid Alice's anger, Org agrees to refine what is
> modelled to a finer granularity: create, then publish. As they have
> storage constraints, they will make available only one granularity of
> provenance information, and use this finer granularity only for
> subsequent reports. A given instance would be:
>
>    bob (data2, [ type: "File", location: "/shared/crime2.data" ])
>    bob (unpublished2, [ type: "File", location: "/shared/report2.pdf",
> creator: "Alice" ])
>    bob (report2, [ type: "File", location:
> "http://example.com/report2.pdf", creator: "Alice", publisher: "Bob"
> ])
>    processExecution (workflow1.1, create, t)
>    processExecution (workflow1.2, publish, t+4)
>    isGeneratedBy (unpublished2, workflow1.1, out)
>    isGeneratedBy (report2, workflow1.2, out)
>    used (workflow1.1, data2, in)
>    used (workflow1.2, unpublished2, in)
>    isDerivedFromInMultipleSteps (report2, data2)
>
> Clive queries to find out what data sets the reports available are
> derived from. He finds that while report1 is derived from data1 in one
> step (isDerivedFrom), report2 is derived from data2 in multiple steps
> (isDerivedFromInMultipleSteps). He (like me) does not understand how
> he should interpret the distinction between the two. There is
> apparently something different in the way that report2 is related to
> data2 compared to how report1 is derived from data1, and possibly he
> should trust report2 less because of this indirect link to its source
> data. But Org is adamant that nothing has changed in their procedures,
> and there is no distinction.
>
> Thanks,
> Simon
>
> On 1 August 2011 12:15, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>    
>> Hi Simon,
>>
>> Sorry, but I don't understand.  Your initial example was not valid
>> because you had two PEs generating a single BOB.
>>
>> If they are different ways of describing something happening in the
>> world, I assume that you will identify different activities, and hence
>> multiple process executions will be asserted.
>>
>> Can you reformulate an example that illustrates your concern?
>>
>> Luc
>>
>> On 08/01/2011 12:02 PM, Simon Miles wrote:
>>      
>>> Hi Luc,
>>>
>>> I follow your argument, but it seems tangential to my point. The
>>> following argument still seems inevitably true to me:
>>>
>>> Activity in the world that uses one BOB and generates another *can* be
>>> described in PIL as multiple process executions or a single process
>>> execution (regardless of whether it actually is described in these
>>> different ways or not, or whether accounts are required or not).
>>>
>>> Therefore, what one process execution denotes is not distinct from
>>> what multiple process executions denote; we have just provided more
>>> detail in the latter description (and this detail is, in any case,
>>> removed when saying "is derived from").
>>>
>>> Therefore, isDerivedFrom and isDerivedFromInMultipleSteps as defined
>>> do not describe anything different in the world, so we have two terms
>>> for representing the same thing.
>>>
>>> I know that we've debated this or similar before, but it is still not
>>> clear to me where the fault lies in my argument, or what
>>> isDerivedFromInMultipleSteps really represents. If it's only me that's
>>> confused, I understand there are more urgent concerns (though I'd
>>> still like to understand).
>>>
>>> Thanks,
>>> Simon
>>>
>>> On 1 August 2011 09:25, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>
>>>        
>>>> Hi Simon,
>>>>
>>>> If I understand you correctly, you are suggesting that the following two
>>>> assertions hold together.
>>>>
>>>> isGeneratedBy(e5,pe5,out)
>>>> isGeneratedBy(e5,pe4,out)
>>>>
>>>> But this is not legal, since it is stated that one BOB is generated by
>>>> at most one process execution.
>>>>
>>>> What you are suggesting should be encoded in a separate account (though
>>>> we have not defined this yet!). A one-step derivation then expands to
>>>> one process execution in a given account. In a separate account, there
>>>> may be a multi-step derivation between the same two BOBs, and it would
>>>> expand into multiple process executions.
>>>>
>>>> Does it make sense?
>>>> Regards,
>>>>
>>>> Luc
>>>>
>>>>
>>>> On 07/29/2011 05:52 PM, Provenance Working Group Issue Tracker wrote:
>>>>
>>>>          
>>>>> PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]
>>>>>
>>>>> http://www.w3.org/2011/prov/track/issues/67
>>>>>
>>>>> Raised by: Simon Miles
>>>>> On product: Conceptual Model
>>>>>
>>>>> By the definition, "a process execution represents an identifiable activity". This does not seem to preclude one process execution assertion denoting, at a coarse granularity, the same events in the world denoted by multiple process executions in other assertions.
>>>>>
>>>>> If so, then in the File Scenario example, I could add a coarse-grained process execution representing the whole e1-to-e5 activity:
>>>>>      processExecution(pe5,collaboratively-edit,t)
>>>>>      used(pe5,e1,in)
>>>>>      isGeneratedBy(e5,pe5,out)
>>>>>
>>>>> But then Section 5.5.2 distinguishes between "a single process execution" and "one or more process executions". Following the argument above, these could represent exactly the same occurrences in the world.
>>>>>
>>>>> So there is no difference between what is denoted by one and multiple process executions, and so no difference between isDerivedFrom and isDerivedFromInMultipleSteps as described. Whether e5 was derived from e1 appears to me to be entirely independent of how many process executions were involved.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> --
>>>> Professor Luc Moreau
>>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>>> University of Southampton          fax:   +44 23 8059 2865
>>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>>
>>>
>>>        
>
>
>    

Received on Monday, 1 August 2011 22:58:43 UTC