Talk:Provenance Vocabulary Mappings

From XG Provenance Wiki

Preliminary comments on provenance vocabulary mapping to OPM (by Luc)

- opm:Process does not have to have terminated, it must have started in the past, but it can complete in the future.

  • Olaf: Okay. I will change the explanation and (since prv:Execution explicitly refers to completed executions of actions or processes) I will also adjust the mapping - prv:Execution is narrower than opm:Process.

- prv:DataAccess and prv:DataCreation could be defined as subtype of opm:Process (by means of the type annotation). Likewise for all other specializations.

  • Olaf: I agree. However, for the mapping table we decided to use the SKOS vocabulary. Hence, my use of narrower.

- prv:Actor: it seems that the "maintainer" of a web server could also be seen as an opm:Agent controlling the process.

  • Olaf: I wouldn't actually talk about "controlling the process." However, I agree with Simon's argument that there is a causal connection between the maintainer of a Web server and the execution of data access processes during which data was retrieved from that server.

- account: so how do you distinguish that statements belong to different descriptions (is it with named graphs)?

  • Olaf: I would understand each RDF graph with a provenance description that I retrieve from the Web as a separate description; all (provenance-related) statements in such a graph belong to the same description. Simon is right, however, that RDF itself does not provide the means to explicitly relate different descriptions (of the same situation) to each other. I removed the "not necessary" comment from the mapping table.

- prv:involvedActor: the key question from an OPM viewpoint is whether the involvedActor had a causal relationship with the process. If the involvedActor happened to be there (like looking at the process, or minuting the process, or getting inspired by the process) without affecting it, then involvedActor is not a concept of opm.

  • Olaf: Out of curiosity: What about involvedActors who might have had an impact on the process? This means, an impact which cannot be verified explicitly. For instance, fans cheering for their team in a football stadium might have had an implicit influence on the outcome of a game. Or, maybe more subtle, a specific person in the audience during a talk might increase the nervousness of the speaker and, thus, might affect the (quality of the) talk. Do you understand this as a causal relationship?

Comments specifically on Provenance Vocabulary to OPM Mapping

- You might be taking opm:wasControlledBy too literally on its name - I agree that prv:involvedActor may be a broader version of opm:wasControlledBy, but the example you give in the justification seems to fit opm:wasControlledBy as it is a causal connection between process and agent, i.e. the data access would not have occurred if the Web Service had not been there to perform it.

- I think that the notion of account in OPM is not just to allow for distinct assertions about the same occurrence, but also to say where two assertions should be understood in conjunction as part of the same record (i.e. in the same account) - this is important for describing refinement of OPM processes into multiple steps (all part of one account).

--- SimonMiles

Comments on Provenir mapping to OPM (by Luc)

- I don't understand "provenir:process allows modeling of processes that may or may not result in creation of new entities (provenir:data)." Do you mean that a provenir:process potentially does not result in a new provenir:data? If so, then it's the same as an opm:process (a process without output is still a process). Or do you mean that it may result in an already existing provenir:data? In that case, it's different from an opm:process.

- "OPM does not define the relationship between opm:Artifact and opm:Account.": I don't understand how this justifies this mapping. OPM defines account membership (for edges and nodes).

- In provenir, if you allow provenir:data to be mutable (and preserving their identity), I suppose that you allow multiple processes to "generate/affect" that data. How do you distinguish their order?

- OPM models persistent identities with the annotation "pname".

- I don't understand the mapping "provenir:data is mapped to opm:Account using skos:broadMatch". Are you saying that provenir:data allows you to regard a provenance graph as an entity you can describe the provenance of? So, you can potentially do that in OPM too, but that entity is an opm:artifact which has a provenance graph as an opm:value. The opm:value could be an opm:OPMGraph or an opm:Account. However, generally, we've been hesitant in talking about the provenance of provenance.

- " in other words the provenir:process may or may not require the existence of the provenir:data to initiate/terminate. ". In that case, is it fair to say that provenir:has_participant is not necessarily causal? (whereas opm:used is causal in the sense that the opm:artifact was required for the opm:process to complete)

Comments specifically on SWAN Vocabulary to OPM Mapping (by Simon)

The mapping and justifications broadly make sense to me. The only thing that would be worth mentioning is that, as with Dublin Core, those terms described as "narrower than opm:wasControlledBy" are not exactly in the same hierarchy. This is because the SWAN terms relate a resource to an agent manipulating it, while opm:wasControlledBy relates a process manipulating a resource to the agent doing the manipulation, i.e. the range is the same (agent) but the domain (resource vs process) of the relations is not.

  • Jun: I agree. Similar to DC, the SWAN provenance ontology takes a resource-centric view. It does not explicitly describe the process involved in the creation of a resource, unlike OPM and the Provenance Vocabulary. I changed narrower to relatedMatch for all of them, except for pav:contributors, because SWAN does not provide a clear definition for the term and there are no domain and range definitions for the property either.

Further Comments on Provenir Terms (by Luc)

The terms provenir:part_of and provenir:contained_in have similar counterparts in the draft opm profile for collections. They are mapped to specializations of opm:wasDerivedFrom.

Some conclusions (by Luc)

  • Each provenance vocabulary has a core that maps quite naturally to OPM
  • Provenance vocabularies introduce rich sets of useful concepts that can be seen as subtypes of OPM nodes or edges. A systematic structuring of these, per application domain, in the form of profiles, would be necessary to preserve this rich expressiveness
  • Several vocabularies provide a notion of artifact identity that persists across transformations. This notion is important but does not seem to be entirely captured by the 'pname' property in OPM.
  • Several vocabularies provide containment relationships and collections. OPM has a draft collection profile attempting to capture these. Effort would be required to integrate them all in a coherent manner.
  • Time is important in most models, but OPM seems to be the only one to specify precise time constraints implied by causal dependencies.
    • Jun: But the time dimension is not yet included in the matching table. At least DC and PAV both have a lot of time-related properties. And to me, many of them imply a causal dependency, such as pav:createdOn, pav:importedOn, etc.
  • Several vocabularies provide non-causal relationships: provenir:adjacent_to (spatial proximity), provenir:preceded_by (temporal order), whereas opm focuses on causal dependencies. It is debatable whether they are provenance related or whether they can be inferred from artifact properties (such as their location). The definition of these relationships should ensure that no conflict with causality can result.
  • Some vocabularies are already standardized and are strongly adopted (e.g. Dublin Core), but do not have the full expressiveness of OPM (in terms of processes and artifacts and arbitrary dependencies). Better bridges between OPM and these would be desirable.
  • Cryptographic hashes and signatures are supported by some vocabularies. This is essential, but not supported by OPM 1.1.
  • Versioning is also an important issue, but not well captured by OPM currently.
  • Work on mapping could be continued and made more formal, but this is a substantial activity. It could be complemented by a reverse mapping from opm to each provenance vocabulary. This would allow us to define expressiveness more formally.
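
As a rough illustration of the point above about time constraints implied by causal dependencies, the following Python sketch checks one simplified rule: an artifact should not be used before it was generated. The rule as coded here, the function name, and the sample identifiers and timestamps are all assumptions made for illustration; they are not quoted from the OPM specification.

```python
# Minimal sketch (assumption): one simplified time constraint derived from
# causal edges. If a process used an artifact, the time of use should not
# precede the time at which that artifact was generated.

def find_time_violations(generated_at, used_at):
    """Return (process, artifact) pairs where use precedes generation.

    generated_at: dict mapping artifact id -> generation time
    used_at:      dict mapping (process id, artifact id) -> time of use
    Artifacts with no known generation time are skipped, not flagged.
    """
    violations = []
    for (process, artifact), t_use in used_at.items():
        t_gen = generated_at.get(artifact)
        if t_gen is not None and t_use < t_gen:
            violations.append((process, artifact))
    return violations


# Hypothetical sample data: times are plain integers for simplicity.
generated_at = {"ex:a1": 10, "ex:a2": 20}
used_at = {
    ("ex:p1", "ex:a1"): 15,  # fine: used after generation
    ("ex:p2", "ex:a2"): 5,   # suspicious: used before generation
    ("ex:p3", "ex:a3"): 1,   # unknown generation time: skipped
}
violations = find_time_violations(generated_at, used_at)
```

A check of this kind only becomes possible once the time-related terms are part of the mapping, which is the gap Jun points out.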

Some conclusions (by Olaf)

  • The approach of using the OPM terms as the common provenance terms (while being aware of the potential limitations such a predefinition would involve) turned out to be the right choice for this exercise. It helped us to focus on defining the mapping and understanding differences and similarities between models instead of digressing into endless discussions on what qualifies as a common provenance term. Such a discussion could follow now in a more informed manner using the results of our mapping exercise as a starting point.
  • Some provenance models include relationships that are not causal, something explicitly left out of OPM. For instance, the Provenance Vocabulary allows users to describe who was responsible for a data providing service that was accessed during the execution of a data access process. While it can be argued whether this is provenance information or not, it is of great value in provenance-based measurement of trustworthiness of data or IQ assessment in general.
  • Time-related terms are missing from the common provenance terms we selected.
  • While we decided to use the SKOS vocabulary to describe the mappings, it would have been possible to use even more precise relationships (e.g. rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass) in some cases. For instance, prv:Execution rdfs:subClassOf opm:Process.
  • The provenance-related vocabularies I mapped to the selected OPM terms (that is, the Provenance Vocabulary and ChangeSet) can be defined as OPM profiles.
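
To make the point about SKOS versus more precise relationships concrete, here is a small Python sketch that rewrites a SKOS mapping statement into a stricter RDFS/OWL one where a counterpart exists. The lookup table and the helper names `strengthen` and `to_turtle` are hypothetical, and whether a given mapping can really be strengthened remains a modelling decision to be made case by case; the sketch only shows what the two forms look like side by side.

```python
# Minimal sketch (assumption): candidate stricter counterparts for SKOS
# mapping relations. "A skos:broadMatch B" says that B is the broader
# concept, i.e. A is narrower than B.
SKOS_TO_STRICT = {
    ("skos:broadMatch", "class"): "rdfs:subClassOf",
    ("skos:broadMatch", "property"): "rdfs:subPropertyOf",
    ("skos:exactMatch", "class"): "owl:equivalentClass",
    ("skos:exactMatch", "property"): "owl:equivalentProperty",
}


def strengthen(subject, skos_relation, obj, kind):
    """Return the triple with a stricter predicate if one is listed.

    kind is "class" or "property". Relations without a listed
    counterpart (e.g. skos:relatedMatch) are returned unchanged.
    """
    strict = SKOS_TO_STRICT.get((skos_relation, kind))
    if strict is None:
        return (subject, skos_relation, obj)
    return (subject, strict, obj)


def to_turtle(triple):
    """Render a (subject, predicate, object) triple as one Turtle-style line."""
    return "{} {} {} .".format(*triple)


# The example from the discussion: prv:Execution is narrower than opm:Process.
skos_form = ("prv:Execution", "skos:broadMatch", "opm:Process")
strict_form = strengthen(*skos_form, kind="class")
strict_line = to_turtle(strict_form)
```

The SKOS form only records that two concepts are related across schemes, while the RDFS form lets a reasoner infer that every prv:Execution is also an opm:Process.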

One more conclusion (by Jun)

A lot of the compared vocabularies take a process-centric view, while some others take a resource-centric view, such as Dublin Core and the SWAN Provenance Ontology.

It is understandable that, particularly in the workflow provenance context, it is very important and mostly feasible to explicitly describe the process involved in causing the existence of a resource. However, it might not always be the case. For example, when expressing that some brain tissues were obtained from a Disease Center, it might be sufficient to say that some tissues (Artifact) were contributed by (dct:contributor, or pav:contributedBy) a disease center (Agent). Although we can introduce a process that represents the contribution process, it will make things more verbose than necessary. These resource-centric terms are important shortcuts to complement process-centric provenance vocabularies.
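
The verbosity trade-off can be sketched in Python by expanding a single resource-centric statement into a process-centric description. This is an illustration under assumptions: the helper `expand_shortcut`, the `ex:` identifiers, and the choice to link the artifact to the introduced process with opm:wasGeneratedBy are all hypothetical, and other modellings of the contribution process are possible.

```python
# Minimal sketch (assumption): expand a resource-centric shortcut such as
#   tissue dct:contributor center
# into a process-centric description with an explicit process node.

def expand_shortcut(artifact, agent, process_id, process_type="ex:Contribution"):
    """Return process-centric triples for one artifact-to-agent shortcut.

    process_id names the introduced process node; process_type is a
    hypothetical specialization of opm:Process.
    """
    return [
        (process_id, "rdf:type", "opm:Process"),
        (process_id, "rdf:type", process_type),
        (artifact, "opm:wasGeneratedBy", process_id),
        (process_id, "opm:wasControlledBy", agent),
    ]


# Resource-centric form: one triple.
shortcut = ("ex:tissueSample", "dct:contributor", "ex:diseaseCenter")

# Process-centric form: four triples describing the same situation.
expanded = expand_shortcut(shortcut[0], shortcut[2], "ex:contribution1")
```

The expanded form makes the process explicit and available for further description, at the cost of an extra node and three extra statements per contribution.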

  • Response by Luc:

I agree entirely with your proposal. I think you are making two points:

  • It's important to define good shortcuts that are intuitive and allow developers to express provenance in a compact manner.
  • Specifically, you refer to a shortcut for a resource centric view. Without mentioning it, I believe you refer to your experience with OPM, where there is no edge between artifact and agent. In OPM, we can have resource-oriented descriptions with edges between artifacts, or process-oriented descriptions with edges to/from processes. Agents are kind-of falling between the cracks here.

  • Response to Luc by Jun:
  • This is not only the case with OPM: there is no edge between artifact and agent in either OPMV or the Provenance Vocabulary. For OPMV, the intention is to be compliant with OPM as much as possible and to reuse existing vocabularies as much as possible. For the Provenance Vocabulary, we wanted to be more precise because making the responsible entities explicit is very important in the Linked Data context. And the event information is available in most cases.