FileExample

From Provenance WG Wiki

File Example

An alternative example to discuss concepts

  • smaller than the data journalism example
  • focuses on mutable entities
  • moves away from the web architecture, to avoid uncertainties of its definition.

File Scenario

We assume here a file system in which users Alice, Bob, Charles, David, and Edith can share and edit files.

  • Time t: Alice creates an empty file, /home/towns.txt
  • Time t+1: Bob appends the following lines to /home/towns.txt:
London
Edinburgh
  • Time t+2: Charles emails the contents of /home/towns.txt
cat /home/towns.txt | sendmail ...
  • Time t+3: David edits file /home/towns.txt and appends two lines
New York
Los Angeles
  • Time t+4: Edith emails the contents of /home/towns.txt
cat /home/towns.txt | sendmail ...
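
The scenario above can be replayed in a few lines of Python. This is only a sketch: the integer offsets 0..4 stand in for t..t+4, the event tuples and the replay function are hypothetical names introduced for illustration, and "email" simply records a copy of the current content rather than calling sendmail.

```python
# Hypothetical replay of the file scenario; offsets 0..4 stand in for t..t+4.
events = [
    (0, "Alice", "create", None),
    (1, "Bob", "append", "London\nEdinburgh\n"),
    (2, "Charles", "email", None),
    (3, "David", "append", "New York\nLos Angeles\n"),
    (4, "Edith", "email", None),
]

def replay(events):
    """Return (content after each event, content emailed at each email event)."""
    content = None
    history = {}  # time offset -> file content after the event at that time
    emails = {}   # time offset -> copy of the content that was emailed
    for time, user, action, payload in events:
        if action == "create":
            content = ""
        elif action == "append":
            content += payload
        elif action == "email":
            emails[time] = content  # strings are immutable, so this is a copy
        history[time] = content
    return history, emails

history, emails = replay(events)
for time in sorted(emails):
    print(f"t+{time}: emailed {emails[time]!r}")
```

The two emailed values differ even though both commands read the same path, which is the point of treating the file as a mutable entity.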

Some Invariant "Things"

According to Definition of "Thing and IVP of", in the real world we have some "stuff": a file, and data items sent to sendmail.

We also have several identifiable things modelling that file:

  • i0: A file, for which we have a property name (/home/towns.txt) and a property creator (Alice), which are invariant in the interval [t,t+4[
  • i1: A file (i0) with added property content which is empty; it exists in the interval [t,t+1[
  • i2: A file (i0) with added property content with value London and Edinburgh; it exists in the interval [t+1,t+3[
  • i3: A file (i0) with added property content with value London, Edinburgh, NY, LA; it exists in the interval [t+3,t+4[

There are also two further things:

  • i4: the information sent to sendmail at t+2 (that's a copy of i2's content)
  • i5: the information sent to sendmail at t+4 (that's a copy of i3's content)
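
As a small illustration of the intervals listed for i0 to i3, the following sketch answers "which things exist at a given time?". It assumes integer offsets from t and treats each interval as half-open, matching the [a,b[ notation; the INTERVALS table and things_at function are names made up for this example.

```python
# Half-open validity intervals [start, end[ for the file things i0..i3,
# with integer offsets standing in for t, t+1, ...
INTERVALS = {
    "i0": (0, 4),
    "i1": (0, 1),
    "i2": (1, 3),
    "i3": (3, 4),
}

def things_at(time):
    """Return the names of the things that exist at the given time offset."""
    return sorted(name for name, (start, end) in INTERVALS.items()
                  if start <= time < end)

print(things_at(2))  # ['i0', 'i2']
```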


Other things:

  • ...

Relative notion of invariance

The above pieces of information i0, i1, ..., i5 are invariant at the level of abstraction described here. For instance, i4 and i5 can be explained in terms of information stored in a pipe, read from the pipe, stored in some buffer, etc. (see Groth FGCS11 for an illustration of this). Likewise, the file contents could be stored on a cloud, moved to various locations, cached in buffers, etc.

Mapping concepts

Let T denote the set of things.

We have some derivations: T -> T

i2 -> i1
i3 -> i2
i4 -> i2 
i5 -> i3

We have some process executions (type PE):

pe1=append("London Edinburgh")
pe2=copy
pe3=append("New York Los Angeles")
pe4=copy

We have some Generations T->PE and some Uses PE->T

i2 -> pe1,     pe1 -> i1
i3 -> pe3,     pe3 -> i2
i4 -> pe2,     pe2 -> i2 
i5 -> pe4,     pe4 -> i3
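
The relations above can be written down as plain Python data to check how they fit together. In this example every listed derivation happens to coincide with a generation followed by a use; the sketch below checks that for this data only and does not claim it holds in general. The function names are hypothetical.

```python
# Edges copied from the lists above: "a -> b" is stored as (a, b).
DERIVED_FROM = {("i2", "i1"), ("i3", "i2"), ("i4", "i2"), ("i5", "i3")}
GENERATED_BY = {"i2": "pe1", "i3": "pe3", "i4": "pe2", "i5": "pe4"}  # T -> PE
USED = {"pe1": {"i1"}, "pe3": {"i2"}, "pe2": {"i2"}, "pe4": {"i3"}}  # PE -> T

def derivations_via_processes(generated_by, used):
    """Compose Generation with Use: thing -> process execution -> thing."""
    return {(thing, source)
            for thing, pe in generated_by.items()
            for source in used.get(pe, ())}

def ancestors(thing, derived_from):
    """Return every thing reachable from `thing` along derivation edges."""
    found, frontier = set(), [thing]
    while frontier:
        current = frontier.pop()
        for derived, source in derived_from:
            if derived == current and source not in found:
                found.add(source)
                frontier.append(source)
    return found

print(derivations_via_processes(GENERATED_BY, USED) == DERIVED_FROM)
print(sorted(ancestors("i5", DERIVED_FROM)))  # ['i1', 'i2', 'i3']
```

Following the derivation edges backwards from i5 reaches i3, i2 and i1, i.e. the second email depends on every earlier state of the file content.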

In addition, we have IVPof: T -> T

IVPof(i1,i0)
IVPof(i2,i0)
IVPof(i3,i0)

since i0 has a variant property (content) that is invariant for each of i1, i2 and i3.
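
One possible reading of IVPof can be sketched as a predicate. This is an assumption, not the WG definition: here IVPof(a, b) is taken to hold when a keeps all of b's invariant properties, fixes at least one further property, and exists only within b's interval. The Thing class and ivp_of function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Thing:
    name: str
    props: dict      # properties that are invariant for this thing
    interval: tuple  # half-open (start, end), offsets from t

def ivp_of(a, b):
    """Assumed reading of IVPof(a, b): a keeps all of b's invariant
    properties, fixes at least one more, and lives within b's interval."""
    shares = all(a.props.get(key) == value for key, value in b.props.items())
    fixes_more = len(a.props) > len(b.props)
    within = b.interval[0] <= a.interval[0] and a.interval[1] <= b.interval[1]
    return shares and fixes_more and within

base = {"name": "/home/towns.txt", "creator": "Alice"}
i0 = Thing("i0", dict(base), (0, 4))
i1 = Thing("i1", {**base, "content": ""}, (0, 1))
i2 = Thing("i2", {**base, "content": "London\nEdinburgh\n"}, (1, 3))
i3 = Thing("i3", {**base, "content": "London\nEdinburgh\nNew York\nLos Angeles\n"}, (3, 4))

print([ivp_of(x, i0) for x in (i1, i2, i3)])  # [True, True, True]
```

Under this reading the relation is not symmetric: ivp_of(i0, i1) is false, because i0 does not fix content.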

Example encoding

An example encoding in N3/Turtle by Stian Soiland-Reyes (Ssoiland), mainly to exercise Graham's suggestion of Dynamic Resource vs. View Resource. Feel free to edit or clone to include further ProvenanceConcepts; I only used a minimum here to highlight the discussion.


:i0Dyn a :DynamicResource .
# We can't describe i0's "real" properties here 
# because that would be another view

:i0 a :ViewResource, :DynamicResource ; 
	# :DynamicResource implied if object of :viewOf
   :viewOf :i0Dyn ;
   :name "/home/towns.txt" ;
   :creator :Alice .


# Metadata stored in filesystem attributes
:i0Provenance a :ProvenanceResource ;
   :provenanceOf :i0 ;
   :account :FileSystem ;
   :processes (
      [  :agent :Alice ;
         :location :server1 ;
         :process :fileCreation ;
         :time "2011-06-15 18:00:01 UTC"  ]
    ) .

# however the server log file claims the file was created on her 
# workstation (not server), and 1 second later (clocks out of sync?)

:i0Provenance2 a :ProvenanceResource ;
   :provenanceOf :i0 ;
   :account :ServerLogFile ;
   :processes (
      [  :agent :Alice ;
         :location :AliceWorkstation;
         :process :fileCreation ;
         :time "2011-06-15 18:00:02 UTC"  ]
    ) .
## Q: Is this a different view? Is creation time part of identifying i0?



:i1 a :ViewResource ;     
  :viewOf :i0 ;  # Or directly to :i0Dyn?
  :name "/home/towns.txt" ;
  :creator :Alice ;
  :content [ :bytes ""  ] .

:i1Provenance a :ProvenanceResource ;
  :provenanceOf :i1 ;
   :account :FileSystem ;
   :processes (
      [  :agent :Alice ;
         :location :server1 ;
         :process :fileCreation ;
         :time "2011-06-15 18:00:01 UTC"  ]
     [   :agent :Alice ;
         :location :server1 ;
         :process :fileWrite ;
         :time "2011-06-15 18:00:03 UTC"  ]
   ) .


:i2 a :ViewResource ;
  :viewOf :i0, :i0Dyn; #:viewOf is transitive
  :name "/home/towns.txt" ;
  :creator :Alice ;
  :content [ :bytes "London\nEdinburgh\n"  ] .
  
:i2Provenance a :ProvenanceResource ;
  :provenanceOf :i2 ;
  :account :FileSystem ;
  :processes (
     [  :agent :Alice ;
         :location :server1 ;
         :process :fileCreation ;
         :time "2011-06-15 18:00:03 UTC"  ]
# Lost as file system metadata only keeps last-modified
#     [   :agent :Alice ;
#         :location :server1 ;
#         :process :fileWrite ;
#         :time "2011-06-15 18:00:03 UTC"  ]
     [   :agent :Bob;
         :location :server1 ;
         :process :fileWrite ;
         :time "2011-06-15 18:14:12 UTC"  ]
    ) .  



:i3 a :ViewResource ;
  :viewOf :i0 ;
  :name "/home/towns.txt" ;
  :creator :Alice ;
  :content [ :bytes "London\nEdinburgh\nNew York\nLos Angeles\n"  ] .

:i3Provenance a :ProvenanceResource ;
  :provenanceOf :i3 ;
  :account :FileSystem ;
  # Derivation from i1 is only indirect through i0 - as :FileSystem does not
  # implicitly track file edits. We can calculate the provenance later as long
  # as the file is not renamed and the creator/creation time matches.
  :processes (
      [  :agent :Alice ;
         :location :server1 ;
         :process :fileCreation ;
         :time "2011-06-15 18:00:03 UTC"  ]
      [  :agent :David;
         :location :server1 ;
         :process :fileWrite ;
         :time "2011-06-16 12:01:35 UTC"  ]
  ) .




Including derivedFrom in a versioned FS

What if we had used a versioned file system instead, like Dropbox's "Show previous versions", where versions are tracked internally using an id for "the file towns.txt"?


:i0Dropbox a :ViewResource, :DynamicResource ;
  :creator :Alice ;
  :dropBoxId "1234567" .

:i0 :viewOf :i0Dropbox .
   # The :i0Dropbox view does not include file name in its identifier and can
   # therefore be seen as the dynamic resource 'above' the :i0 view
   # identification

:i3Dropbox a :ViewResource ;
  :viewOf :i0Dropbox ; # also :i0 and :i3 - but :Dropbox does not know that
  :name "/home/towns.txt" ;
  :creator :Alice ;
  :content [ :bytes "London\nEdinburgh\nNew York\nLos Angeles\n"  ] ;
  :dropBoxId "1234567" .


:i3DropboxProvenance a :ProvenanceResource ;
  :provenanceOf :i3Dropbox ;
  :account :Dropbox ;
  :processes (
      [  :agent :Alice ;
         :location :AliceWorkstation ;
         :process :fileCreation ;
         :time "2011-06-15 18:00:03 UTC"  ]
     [   :agent :Alice ;
         :location :AliceWorkstation ;
         :process :fileWrite ;
         :time "2011-06-15 18:00:03 UTC"  ]
     [   :agent :Bob;
         :location :BobLaptop ;
         :process :fileWrite ;
         :time "2011-06-15 18:14:12 UTC"  ]
     [   :agent :David;
         :location :DavidWorkstation ;
         :process :fileWrite ;
         :time "2011-06-16 12:01:35 UTC"  ]
  ) ;
  :derivedFrom :i2Dropbox .


# Shortened for brevity
:i2Dropbox a :ViewResource ;
  :viewOf :i0Dropbox ;
  :content [ :bytes "London\nEdinburgh\n"  ] ;
  # .. 
  :dropBoxId "1234567" .
:i2DropboxProvenance a :ProvenanceResource ;
  :provenanceOf :i2Dropbox ;
  # ...
  :derivedFrom :i1Dropbox .
:i1Dropbox a :ViewResource ;
  :viewOf :i0Dropbox ;
  :content [ :bytes ""  ] ;
  # ..
  :dropBoxId "1234567" .
:i1DropboxProvenance a :ProvenanceResource ;
  :provenanceOf :i1Dropbox .
  # ...


# The Dropbox id allows us to track the file across renames
# (presumably DB uses checksums to detect renames)

:renamedDropbox a :ViewResource ;
  :viewOf :i0Dropbox ;
  :name "/home/towns_to_visit.txt" ;
# Note - this cannot be inferred to be a view of :i0 - as 
# :name is an immutable property part of the identity of :i0.
#
# This might or might not be a view of 'real world' :i0Dyn - or we can say
# there exists a common implicit super-resource that is common for both
# :i0Dropbox and :i0.
  :content [ :bytes "London\nEdinburgh\nNew York\nLos Angeles\n"  ] ;
  :dropBoxId "1234567" . 


:renamedDropboxProvenance a :ProvenanceResource ;
  :provenanceOf :renamedDropbox;
  :account :Dropbox ;
  :derivedFrom :i3Dropbox ;
  :processes (
      [  :agent :Alice ;
         :location :AliceWorkstation ;
         :process :fileCreation ;
         :time "2011-06-15 18:00:03 UTC"  ]
      # ...
      [  :agent :Richard ;
         :location :RichardLaptop;
         :process :fileRename;
         :time "2011-06-17 12:10:04 UTC"  ]

   ) .
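
The difference between the :FileSystem and :Dropbox accounts above can be illustrated with two toy stores. This is a sketch under stated assumptions: it does not describe how any real file system or Dropbox works, the class and method names are hypothetical, and Alice's initial empty write is folded into creation for brevity.

```python
class LastModifiedStore:
    """Keeps only the creation record and the most recent write,
    in the spirit of the :FileSystem account."""
    def __init__(self, creator, time):
        self.created = (creator, time)
        self.last_write = None
        self.content = ""

    def write(self, agent, time, appended):
        self.content += appended
        self.last_write = (agent, time)  # overwrites the previous record

class VersionedStore:
    """Keeps every version with a derived-from link,
    in the spirit of the :Dropbox account."""
    def __init__(self, creator, time):
        self.versions = [{"agent": creator, "time": time,
                          "content": "", "derived_from": None}]

    def write(self, agent, time, appended):
        previous = len(self.versions) - 1
        self.versions.append({
            "agent": agent, "time": time,
            "content": self.versions[previous]["content"] + appended,
            "derived_from": previous,
        })

    def lineage(self, index):
        """Agents of the versions from `index` back to the first version."""
        agents = []
        while index is not None:
            agents.append(self.versions[index]["agent"])
            index = self.versions[index]["derived_from"]
        return agents

fs = LastModifiedStore("Alice", "2011-06-15 18:00:01 UTC")
vs = VersionedStore("Alice", "2011-06-15 18:00:03 UTC")
for store in (fs, vs):
    store.write("Bob", "2011-06-15 18:14:12 UTC", "London\nEdinburgh\n")
    store.write("David", "2011-06-16 12:01:35 UTC", "New York\nLos Angeles\n")

print(fs.last_write)   # Bob's write is no longer recorded
print(vs.lineage(2))   # ['David', 'Bob', 'Alice']
```

Both stores end with the same content, but only the versioned one can answer who contributed to it without extra inference.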


Comments (Paolo)

I seem to understand the intended meaning of invariant, but the term seems misleading to me. The way this is used in the example seems to indicate the span (of time, or events) over which the value of a property is constant.

I have added a figure with my interpretation of this example. Let's see if it makes sense.

IPVT-fileExample.001.png
  • As suggested, I will use the term Thing to denote the element/entity that we want to express the provenance of. However, I see this as an abstract object with known identity, which I don't necessarily identify with anything physical, i.e. bits (I would like to be able to study the provenance of a traditional folk song through the ages, for example).
  • I have removed i0, because I think property content, which appears in i1, is really also part of the definition of i0, where it has null value, therefore I see no distinction between i0 and i1. In other words, I would like to say that i0...i5 are instances of some class I which is defined by attributes {name, creator, content}. This is a special case: in general, views can be instances of different classes.
  • Horizontally you have an events line (which may or may not be a timeline, but I'd rather avoid mentioning time at all as we know it gets messy when you have more than one observer, and they are distributed).
  • Vertically you have "views". A view is defined as the set of values for all the properties (attributes?) defined on the Thing at a certain point in time. Note that this is not the standard DB definition of views (rather, it is the definition of snapshot, or database state at a certain point in time). The set of properties of each view i0..i5 in this example is the same (name, creator, content; I would argue that they all apply to each of them, and that in some of the views the value may be N/A, or null), but in general the properties may differ.
  • For each view, the example specifies a validity for the value of each property. In the example this is a time interval, but I think it would be more general to think in terms of span across a sequence of observed events. The validity is depicted graphically as a horizontal rightwards bold line segment.
    • the name (n) and creator (cr) properties have values in i0, and these values remain the same for all subsequent views. In those views, the property names are in parentheses when they don't change from a previous view (i.e., they are inside a sequence of events for which their value is constant).
    • the validity of the content (co) property values is as indicated in the example: [t0,t1[ for i1, etc.
    • the processes that lead to state changes (append, copy) are denoted by the blue arcs. In this case the dependencies are obtained simply by traversing the arcs backwards, however in general views can be drawn at arbitrary times and need not correspond to states that are reached through observed events.
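
The idea of validity as a span across observed events, rather than a time interval, can be sketched as follows. This is one possible interpretation: views are listed in event order, spans are expressed as event indices, and the validity_spans function is a hypothetical name introduced here.

```python
def validity_spans(views):
    """views: list of property dicts, one per observed event, in order.
    Returns {property: [(value, first_index, last_index), ...]}, where each
    tuple is a maximal run of consecutive views sharing the same value."""
    spans = {}
    for index, view in enumerate(views):
        for prop, value in view.items():
            runs = spans.setdefault(prop, [])
            if runs and runs[-1][0] == value and runs[-1][2] == index - 1:
                runs[-1] = (value, runs[-1][1], index)  # extend the current run
            else:
                runs.append((value, index, index))      # start a new run
    return spans

base = {"name": "/home/towns.txt", "creator": "Alice"}
views = [
    {**base, "content": ""},
    {**base, "content": "London\nEdinburgh\n"},
    {**base, "content": "London\nEdinburgh\nNew York\nLos Angeles\n"},
]

spans = validity_spans(views)
print(spans["name"])          # [('/home/towns.txt', 0, 2)]
print(len(spans["content"]))  # 3
```

Here name and creator each have a single span covering all three views, while content has one span per view, which mirrors the bold line segments described for the figure.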

My hope is that one can similarly represent more general examples, namely those in which:

  • the set of properties for each view overlap but are not identical, and
  • a mapping must be established between the values of those properties.