ProvenanceOfW3CReport

From Provenance WG Wiki

Provenance of a W3C Report

This example describes the provenance of the prov-dm (version 2) document http://www.w3.org/TR/2011/WD-prov-dm-20111215. The purpose of this example is to bring up potential issues with identifiers in the data model.

We can envisage at least three different accounts.

1. Account 1: the W3C view

This account contains the kind of provenance records that W3C could keep, for auditors to check that due processes are followed.

In this account, we see that:

  • WD-prov-dm-20111215 was derived from WD-prov-dm-20111018
  • both WD-prov-dm-20111215 and WD-prov-dm-20111018 were published by the w3c agent
  • the publication activity for WD-prov-dm-20111215 used a publication request (req3)
  • the publication activity for WD-prov-dm-20111018 used another publication request (req2) and a transition request (req1)
  • publications followed the process rules (rec-advance, a plan).

A simple visualization of Account 1 may be as follows.

[Figure: W3c-publication1.png]


2. Account 2: the webmaster view

This account contains the kind of provenance records that the webmaster could keep.

In this account, we see that:

  • WD-prov-dm-20111215 was derived from Overview.html in the mercurial repo
  • WD-prov-dm-20111215 is the result of copying file(s) to the remote web server
  • webmaster was involved
  • webmaster did all this on the basis of the publication request (req3)

A simple visualization of this account may be as follows.

[Figure: W3c-publication2.png]


3. Account 3: the authors' view

This account contains the kind of provenance records that each author could keep (for instance, to list in their CV).

  • the document WD-prov-dm-20111215 was co-edited by a set of authors, with various roles

Again, a simple visualization could be as follows.

[Figure: W3c-publication3.png]

4. We can envisage other accounts.

  • The mercurial log with all edits.
  • An account where other agents claim contribution to the document (faithfully or not)
  • etc

Comments

To illustrate this example, I have written the three accounts in three separate asn files:


In this example, each WD is identified by its URI:


The entity records in the different accounts have different attributes and different values. (This is similar to the example in Section 9 of the prov-dm document.)

In account 1, the type that matters for the W3C process is that it is a Working Draft.

entity(w3:WD-prov-dm-20111215, [ prov:type="WD" ])

In account 2, the type that matters for the webmaster is that it is an html4 document.

entity(w3:WD-prov-dm-20111215, [ prov:type="html4" ])

Other attributes that may usefully be captured include that the document was checked for compliance with pubrules, that it is html4 compliant, etc.

In account 3, the authors know that this is the second version of the prov-dm report. Hence, the entity record has the following shape.

entity(w3:WD-prov-dm-20111215, [ prov:type="document", ex:version="2" ])

When in account 1, we write

wasGeneratedBy(w3:WD-prov-dm-20111215, ex:pub2)

we really mean to refer to the record of WD-prov-dm-20111215 in the same account. It would not make sense to say that ex:pub2 generated WD-prov-dm-20111215 in account 3.

In Account 2, there is a further example that uses record identifiers. In this derivation, we need to identify the generation and usage.

wasDerivedFrom(w3:WD-prov-dm-20111215,hg:Overview.html, ex:rcp, rec:g, rec:u)


Discussion about Identifiers

Note: I am not saying that the way I modelled this example and expressed it in prov-dm is the only way to do so. However, I believe that this way is compatible with the current definition of the data model. I believe that it exposes issues with identifiers, as currently defined in the document.

Entity Record vs Entity

To understand this discussion it is necessary to make the distinction between entity record and entity. Section 5.1 defines a record as a body of information about something which is of interest from a provenance viewpoint.

Likewise, Definition dfn-entity defines an entity record as a representation of an entity. For instance, the following text is the expression of an entity record in the ASN notation (production entityRecord):

entity(w3:WD-prov-dm-20111215, [ prov:type="WD" ])


Most other records also represent something. For instance, a usage record (Definition dfn-Use) is a representation of an instantaneous world event: an activity beginning to consume an entity.

A notable exception is an account record (Definition dfn-Account), which is a wrapper of records.


Minimizing Identifier Minting

An assumption in the design of prov-dm was to minimize the number of new identifiers to mint when producing provenance records. A good example was Paul's blog (http://www.w3.org/blog/SW/2011/10/23/5-simple-provenance-statements/).

A consequence of this design is that we just reuse the URI of WD-prov-dm-20111215. Different entity records containing this URI occur in the three accounts of the example.


Entity Record identification vs Entity Identifier

The identifier w3:WD-prov-dm-20111215 identifies the W3C Working Draft, which we have conceptualized as an entity.

In the first account (expressed in ASN here), the same identifier w3:WD-prov-dm-20111215 allows us to find the entity record:

entity(w3:WD-prov-dm-20111215, [ prov:type="WD" ])

Hence, we don't seem to have explicit entity record identifiers, but an entity identifier helps us find an entity record inside a given account. This is summed up as follows:

 entity id + account id = natural key for entity record 

Expressed differently, an entity identifier acts as a local identifier for an entity record in an account.
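To make the natural key concrete, here is a minimal sketch in Python (not part of prov-dm). It stores the three entity records of the example under the pair (account id, entity id). The account identifiers "account1", "account2" and "account3" and the function name are assumptions made for illustration only.

```python
# Hypothetical in-memory store: (account id, entity id) -> record attributes.
# The account ids are made up; the attributes come from the three accounts above.
records = {
    ("account1", "w3:WD-prov-dm-20111215"): {"prov:type": "WD"},
    ("account2", "w3:WD-prov-dm-20111215"): {"prov:type": "html4"},
    ("account3", "w3:WD-prov-dm-20111215"): {"prov:type": "document", "ex:version": "2"},
}


def find_entity_record(account_id, entity_id):
    """Look up an entity record by its natural key (account id + entity id)."""
    return records[(account_id, entity_id)]


# The same entity identifier leads to a different record in each account.
w3c_view = find_entity_record("account1", "w3:WD-prov-dm-20111215")
authors_view = find_entity_record("account3", "w3:WD-prov-dm-20111215")
```

The entity identifier alone is ambiguous here, since it matches three records. Only the pair selects exactly one.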

Usage Records

The story for usage records is slightly different. Indeed, the derivation record above (taken from account 2) needs to reference a generation record and a usage record. For instance, the identifier rec:u is given to the usage record, so that it can be referenced in the derivation record.

used(rec:u, ex:rcp,hg:Overview.html)
wasDerivedFrom(w3:WD-prov-dm-20111215,hg:Overview.html, ex:rcp, rec:g, rec:u)

What is rec:u? Is it a record identifier, or the identifier of the event represented by this usage record? Currently prov-dm seems (?) to indicate that it is the former. Here I used a different namespace prefix, 'rec', to indicate that it is a record identifier.
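Under the record-identifier reading, the derivation record simply points at two other records of the same account. The following Python sketch illustrates that reading. The dictionary layout and the function name are assumptions, and the shape of the generation record rec:g is guessed, since only its identifier appears in the example.

```python
# Records of account 2 that carry an explicit record identifier.
records_by_id = {
    "rec:u": ("used", "ex:rcp", "hg:Overview.html"),
    # Assumed shape: the example names rec:g but does not show this record.
    "rec:g": ("wasGeneratedBy", "w3:WD-prov-dm-20111215", "ex:rcp"),
}

# The derivation record, with its five components named for readability.
derivation = {
    "derived": "w3:WD-prov-dm-20111215",
    "source": "hg:Overview.html",
    "activity": "ex:rcp",
    "generation": "rec:g",
    "usage": "rec:u",
}


def resolve(derivation, records_by_id):
    """Return the generation and usage records that a derivation refers to."""
    return (
        records_by_id[derivation["generation"]],
        records_by_id[derivation["usage"]],
    )


generation_record, usage_record = resolve(derivation, records_by_id)
```

If rec:u instead identified the event, the lookup table would have to be keyed by events rather than by records, which is exactly the ambiguity raised above.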

And then we have notes

For extensibility and interoperability, it's necessary to be able to add notes to records. This requires each record to have an identifier.

To allow third parties to add notes to any record, we would need a mechanism to identify all records. However, record identifiers are optional for most records (to minimize minting), which goes against extensibility.
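The tension can be sketched in a few lines of Python. This is an illustration under assumed names (Record, add_note, notes), not a prov-dm API: a note can only be attached to a record that carries an identifier, so a record written without one is out of reach for third parties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    kind: str                        # e.g. "used", "wasGeneratedBy"
    args: tuple                      # the remaining components of the record
    record_id: Optional[str] = None  # optional, as for most records


notes = {}  # record id -> list of note texts


def add_note(record, text):
    """Attach a note to a record; only possible if the record has an identifier."""
    if record.record_id is None:
        raise ValueError("cannot attach a note to a record without an identifier")
    notes.setdefault(record.record_id, []).append(text)


# The usage record from account 2 has an identifier; the generation record
# from account 1 was written without one.
usage = Record("used", ("ex:rcp", "hg:Overview.html"), record_id="rec:u")
generation = Record("wasGeneratedBy", ("w3:WD-prov-dm-20111215", "ex:pub2"))

add_note(usage, "checked by a third party")  # works: rec:u names the record
```

Calling add_note(generation, ...) raises ValueError in this sketch, because there is nothing to hang the note on.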

What about ER Diagram?

Those who take a database perspective will interpret the E/R diagram http://dvcs.w3.org/hg/prov/raw-file/default/model/overview.png to mean that all records, in the context of an account, have identifiers.

SW Perspective on URIs

In the second published working draft, the component id of an entity record was defined as:

id: an identifier id identifying an entity; the identifier of the entity record is defined to be the same as the identifier of the entity;

This was regarded as breaking the semantic web principle, since the same URI was then denoting different resources: an entity and an entity record.


To address this problem, the current version of prov-dm defines the component id as follows:

id: an identifier id identifying an entity; 

It's no longer a record identifier!

The current document also contains the following text:

  • The entity identifier id contained in an entity record is expected to be unique among all the identifiers contained in the current account's records. This constraint is elaborated upon in identifiable-record-in-account. It means that the current account does not contain any other record for this identifier. Effectively, id acts as a local identifier for this record. In this specification, whenever we write "an entity record identified by ... ", this identification is to be understood in the context of the account that defines it.
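A rough Python sketch of this uniqueness constraint follows. It is a simplification: it only looks at entity records, it represents each one as an (entity id, attributes) pair, and the function name is made up. The record for WD-prov-dm-20111018 is assumed to have type "WD" for the sake of the example.

```python
from collections import Counter


def duplicate_entity_ids(entity_records):
    """Return the entity ids used by more than one entity record in one account."""
    counts = Counter(entity_id for entity_id, _attributes in entity_records)
    return sorted(eid for eid, n in counts.items() if n > 1)


# Entity records of account 1 (the attributes of the 20111018 draft are assumed).
account1 = [
    ("w3:WD-prov-dm-20111215", {"prov:type": "WD"}),
    ("w3:WD-prov-dm-20111018", {"prov:type": "WD"}),
]

# Merging the account 2 record into account 1 would break the constraint.
merged = account1 + [("w3:WD-prov-dm-20111215", {"prov:type": "html4"})]
```

Here duplicate_entity_ids(account1) is empty, while the merged list reports w3:WD-prov-dm-20111215 as a duplicate. This is why the entity identifier can only act as a local identifier within one account.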


But then, how do we annotate an entity record? What about the E/R diagram view? Should we introduce entity record identifiers? But then, this goes against the minimization of identifier minting.

And Then We Have Accounts

So far, my accounts were expressed in separate files.

How should account records be structured?

For the following to hold:

 entity id + account id = natural key for entity record 

an account identifier must identify an account uniquely (it can't be an identifier that identifies an account only in the context of an account).

What distinguishes account records from the other records is that accounts are the mechanism by which we want to express attribution of provenance records. Hence, we probably want to say that accounts are attributed to some agents, etc.

The challenge is the following. Reusing Graham's terminology, account records now become part of the domain of discourse, since they are things we want to express provenance about.

Making the Record's Natural Key Explicit

In the first public working draft of the document, in Section 5.5.4, we had the following note:

  • We are going to introduce a notion of qualified identifier, which allows us to refer to an identifier in the scope of a given account.

The term 'qualified' may not have been the best choice, but we wanted to introduce the natural key discussed above.

The reason for this is that we need to be able to create relations across accounts. An entity in an account may have been generated by an activity from another account. Or an entity described in an account may be an alternate of another entity described in another account.
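As an illustration of what such a key could look like, here is a small Python sketch. It is not the dropped 'qualified identifier' proposal itself. The names QualifiedId, Relation and crosses_accounts, the account identifiers ex:account1 and ex:account3, and the relation name "alternateOf" are all assumptions.

```python
from typing import NamedTuple


class QualifiedId(NamedTuple):
    account: str   # hypothetical account identifier
    local_id: str  # identifier as written inside that account


class Relation(NamedTuple):
    kind: str
    subject: QualifiedId
    target: QualifiedId


def crosses_accounts(relation):
    """True when the two ends of the relation live in different accounts."""
    return relation.subject.account != relation.target.account


# The W3C view and the authors' view describe the same Working Draft;
# only the account part of the key tells the two entity records apart.
alternate = Relation(
    "alternateOf",
    QualifiedId("ex:account1", "w3:WD-prov-dm-20111215"),
    QualifiedId("ex:account3", "w3:WD-prov-dm-20111215"),
)
```

Without the account component, both ends of this relation would be the same identifier and the relation could not be expressed.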

There was some pushback against this 'qualified identifier', and we dropped the concept. But some trace of it remains in the current draft's Alternate and Specialization Records, which allow an optional account identifier to be specified.

In fact, this should be the case for every relation.

Nested Accounts

... and I have not discussed whether accounts should be nested or not.