Exploring provenance model complexity

From Provenance WG Wiki
Revision as of 12:31, 3 November 2011 by Gklyne (Talk | contribs)

Background

It has been suggested (http://lists.w3.org/Archives/Public/public-prov-wg/2011Oct/0140.html) that the provenance model, as presented, appears complex enough to be off-putting to developers.

Paul Groth indicated in a blog post (http://www.w3.org/blog/SW/2011/10/23/5-simple-provenance-statements/) that there are very simple ways in which provenance can be represented in RDF.

This page explores some ways in which the simple expressive forms might be connected to the data model and abstract syntax described in http://www.w3.org/TR/prov-dm/

This note makes extensive use of Notation3 (@@ref), a superset of Turtle (@@ref), to represent RDF examples.

A very simple example

Suppose we wish to express the following as a provenance assertion:

  ex:aDocument dcterms:creator "Meritorious Meerkat" .

The problem with this form of expression is that there is no clear way indicated by the provenance specification to know that it is intended as a provenance statement.

Expressing the example using provenance ASN

This is just one way in which the above example might be represented using ASN. The use of dcterms:creator may be arguable, but that isn't really the crucial point here, I think.

 entity(aDocument, [type=ex:Document])
 agent(meritoriousMeerkat)
 entity(meritoriousMeerkat, [foaf:name="Meritorious Meerkat"])
 wasGeneratedBy(aDocument, pe1, qualifier())
 wasControlledBy(pe1, meritoriousMeerkat, qualifier(role=dcterms:creator))

So, given this Provenance ASN, how can we represent it in RDF so that its intended use as a provenance expression can be distinguished?

Direct translation of ASN to RDF

  ex:aDocument a prov:Entity ; prov:wasGeneratedBy
   [ a prov:ProcessExecution ;
     prov:wasControlledBy
       [ a prov:Agent ; foaf:name  "Meritorious Meerkat" ]
   ] .

or, almost equivalently (without using blank node expressions):

  ex:MeritoriousMeerkat a prov:Agent ; foaf:name "Meritorious Meerkat" .
  ex:aDocument a prov:Entity ; prov:wasGeneratedBy ex:pe1 .
  ex:pe1 a prov:ProcessExecution ; prov:wasControlledBy ex:MeritoriousMeerkat .

In this formulation, the process execution is used as a mediating "event" to link the provenance details to the entity described. The prov:wasGeneratedBy statement effectively signals that provenance information about its subject is linked via its object (where "subject" and "object" are used in the sense defined by RDF).

This may be viewed as a rather cumbersome way to express the original example, but it's not clear how this can be simplified without losing the structure that distinguishes the provenance expression.
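
To make the role of the mediating process execution concrete, here is a minimal sketch in Python of how a consumer might recover the creator's name from the direct translation. It assumes the triples of the second (non-blank-node) form are held as plain (subject, predicate, object) tuples; the helper names `objects` and `controlling_agent_names` are invented for this illustration and are not part of any provenance specification or library.

```python
# Triples of the direct translation, as plain (subject, predicate, object)
# tuples. Prefixed names are kept as strings; no RDF library is assumed.
TRIPLES = [
    ("ex:MeritoriousMeerkat", "rdf:type", "prov:Agent"),
    ("ex:MeritoriousMeerkat", "foaf:name", "Meritorious Meerkat"),
    ("ex:aDocument", "rdf:type", "prov:Entity"),
    ("ex:aDocument", "prov:wasGeneratedBy", "ex:pe1"),
    ("ex:pe1", "rdf:type", "prov:ProcessExecution"),
    ("ex:pe1", "prov:wasControlledBy", "ex:MeritoriousMeerkat"),
]


def objects(triples, subject, predicate):
    """Return every o such that (subject, predicate, o) is in triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def controlling_agent_names(triples, entity):
    """Follow entity -> process execution -> agent -> foaf:name."""
    names = []
    for pe in objects(triples, entity, "prov:wasGeneratedBy"):
        for agent in objects(triples, pe, "prov:wasControlledBy"):
            names.extend(objects(triples, agent, "foaf:name"))
    return names


print(controlling_agent_names(TRIPLES, "ex:aDocument"))
# prints ['Meritorious Meerkat']
```

Three lookups are needed to answer a question that the original dcterms:creator statement answers in one, which is the cumbersomeness referred to above.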

Use named graph for provenance

A different approach might be to wrap the provenance statements in a separate graph resource. The following example uses Notation3 syntax, starting with a fairly direct representation of the ASN form and introducing a "hasProvenance" statement:

 ex:aDocument a prov:Entity ;  prov:hasProvenance 
   { ex:MeritoriousMeerkat a prov:Agent ; foaf:name  "Meritorious Meerkat" .
     ex:aDocument prov:wasGeneratedBy ex:pe1 .
     ex:pe1 a prov:ProcessExecution ;  prov:wasControlledBy ex:MeritoriousMeerkat .
   } .

Because the provenance has been separated and clearly signalled, we can now envisage a simpler form more closely recognizable as the original example:

 ex:aDocument a prov:Entity ;  prov:hasProvenance 
   { ex:aDocument dcterms:creator "Meritorious Meerkat" } .

This form requires a repetition of the entity URI, and could be awkward to express if the entity node does not have an explicit URI. This may turn out to be an advantage; more discussion is needed here.

Using this form requires that RDF processing software has support for "named graphs" or equivalent, which are not part of the original RDF standard.
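
As a rough illustration of what "named graphs or equivalent" means for an implementation, the following Python sketch stores each statement with a fourth component naming the graph it belongs to. This is only one possible reading: Notation3 formulae are not identical to named graphs, and the graph label `_:g1`, the `DEFAULT` marker and the function `provenance_of` are assumptions made for this example.

```python
DEFAULT = None  # marker for statements outside any provenance graph

# The simplified named-graph example as (subject, predicate, object, graph)
# tuples. "_:g1" is a made-up label standing in for the quoted { ... } formula.
QUADS = [
    ("ex:aDocument", "rdf:type", "prov:Entity", DEFAULT),
    ("ex:aDocument", "prov:hasProvenance", "_:g1", DEFAULT),
    ("ex:aDocument", "dcterms:creator", "Meritorious Meerkat", "_:g1"),
]


def provenance_of(quads, entity):
    """Return the triples held in graphs linked to entity by prov:hasProvenance."""
    graphs = {
        o
        for s, p, o, g in quads
        if g is DEFAULT and s == entity and p == "prov:hasProvenance"
    }
    return [(s, p, o) for s, p, o, g in quads if g in graphs]


print(provenance_of(QUADS, "ex:aDocument"))
# prints [('ex:aDocument', 'dcterms:creator', 'Meritorious Meerkat')]
```

A store that keeps only three components per statement has nowhere to record the graph label, so the dcterms:creator statement would become indistinguishable from any other assertion about ex:aDocument.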

Use blank node for provenance

This is similar to the above approach, but introduces a new node as a placeholder "context" on which to hang provenance statements:

 ex:aDocument a prov:Entity ;  prov:hasProvenance 
   [ 
     prov:wasGeneratedBy
       [ a prov:ProcessExecution ;  
         prov:wasControlledBy [ a prov:Agent ; foaf:name  "Meritorious Meerkat" ] 
       ]
   ] .

or, almost equivalently (without using blank node expressions):

 ex:aDocument a prov:Entity ;  
   prov:hasProvenance ex:prov .
 ex:prov prov:wasGeneratedBy ex:pe1 .
 ex:pe1 a prov:ProcessExecution ;  
   prov:wasControlledBy ex:MeritoriousMeerkat .
 ex:MeritoriousMeerkat a prov:Agent ; 
   foaf:name  "Meritorious Meerkat" .

The simplified form of this might then be:

 ex:aDocument a prov:Entity ;  prov:hasProvenance 
   [ dcterms:creator "Meritorious Meerkat" ] .

From an implementation perspective, this approach seems easier, but it presents a conceptual difficulty in that it is not clear what the newly introduced "context" node actually denotes.

Relating simplified forms to the full provenance data model

Not addressed above are mechanisms whereby the simplified forms can be related to the full provenance model. It seems that it should be possible to construct some reasonably simple rules to recognize and map simple provenance expressions, but this has not yet been done.
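
As a starting point for discussion only, here is a sketch in Python of what one such rule might look like. It takes the statements found inside a provenance context (the simplified forms above) and expands each dcterms:creator statement into the entity / process execution / agent structure used in the direct translation. The rule itself, the function name `expand_simple_provenance` and the generated blank node labels are all hypothetical; nothing here has been agreed by the working group.

```python
from itertools import count


def expand_simple_provenance(simple_triples):
    """Expand (entity, dcterms:creator, name) triples into the longer form.

    Triples with any other predicate are passed through unchanged.
    """
    fresh = count(1)  # source of numbers for new blank node labels
    expanded = []
    for s, p, o in simple_triples:
        if p != "dcterms:creator":
            expanded.append((s, p, o))
            continue
        n = next(fresh)
        pe, agent = f"_:pe{n}", f"_:agent{n}"
        expanded += [
            (s, "rdf:type", "prov:Entity"),
            (s, "prov:wasGeneratedBy", pe),
            (pe, "rdf:type", "prov:ProcessExecution"),
            (pe, "prov:wasControlledBy", agent),
            (agent, "rdf:type", "prov:Agent"),
            (agent, "foaf:name", o),
        ]
    return expanded


simple = [("ex:aDocument", "dcterms:creator", "Meritorious Meerkat")]
for triple in expand_simple_provenance(simple):
    print(triple)
```

Note that this sketch creates a fresh agent node for every creator statement and does not record the dcterms:creator role carried by the ASN qualifier; a real rule set would need to decide how to handle both points.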