Provenance Vocabulary Mappings

About this document

Source: W3C Provenance Incubator Group

Authors: Satya Sahoo, Paul Groth, Olaf Hartig, Simon Miles, Sam Coppens, James Myers, Yolanda Gil, Luc Moreau, Jun Zhao, Michael Panzer, Daniel Garijo

Acknowledgements: Chris Bizer for initiating the discussion on creation of mappings between provenance terminologies. Luc Moreau and Simon Miles for reviewing and giving feedback on the mapping rationale.

Release Date: August 06, 2010.

Description: This document describes an initiative by the W3C Provenance Incubator Group to identify correspondence between a set of core provenance concepts defined in the Open Provenance Model (OPM) and other provenance terminologies. The document is expected to facilitate interoperability and help users better understand the commonalities and differences between OPM and other provenance terminologies. The mappings between the provenance terms are formally encoded using the W3C recommended Simple Knowledge Organization System (SKOS) vocabulary. This report also documents the rationale for the mappings.

Introduction

The need to represent provenance information has led to the creation of a number of provenance models that cater to a diverse set of application domains. These models often reflect the requirements of the user community that developed them. For example, the Open Provenance Model (OPM) was developed as a generic provenance model in the context of workflow provenance, the Provenir ontology was developed as an upper-level provenance ontology for use in Semantic Web applications, and the Provenance Vocabulary was proposed for publishing provenance-aware content using the Linked Data (LD) principles. Given the large number of such models, it is useful to identify the correspondence between their respective provenance terms in order to:

  • Help users better understand the similarities and differences between the provenance terminologies,
  • Facilitate the development of applications that can utilize the mappings for provenance interoperability, and
  • Enable the provenance research community to move towards the adoption of a common provenance terminology.

This work has been done by the Provenance Incubator Group, part of the W3C Incubator Activity, with a charter to provide a state-of-the-art understanding and develop a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization. The group's activities are public, and are recorded on the W3C Provenance Incubator Group wiki. The group agreed to document the mappings between existing provenance models and vocabularies at the first Face-to-Face meeting held on April 25-26, 2010.

Approach and Provenance Terminologies

The mappings are structured following the approach used to create the Ontology for Media Resource 1.0 by the W3C Media Annotations Working Group. Essentially, a reference model is selected and then all other vocabularies and models are mapped to it. In the case of the media resource vocabularies, a new reference model was created, while here we agreed that it was best not to develop a new model but rather select an already existing one. Another departure was our desire to use formal mappings using semantic relations when possible, rather than textual comments.

We chose a core set of provenance vocabularies and models. Some were chosen because they represent a significant community effort, others because they have significant adoption, others because they are well known in the community. Others can be easily mapped following the same process that we established. The core set selected for mapping includes:

  • Open Provenance Model
  • Provenir ontology
  • Provenance Vocabulary
  • Proof Markup Language
  • Dublin Core
  • PREMIS
  • WOT Schema
  • SWAN Provenance Ontology
  • Semantic Web Publishing Vocabulary
  • Changeset Vocabulary

We selected OPM as the reference provenance model for three reasons. First, OPM is a general and broad model that encompasses many aspects of provenance. Second, it already represents a community effort that spans several years and is still ongoing, and it has benefited from many discussions, practical use, and several versions. Finally, many groups are already working to map their vocabularies to OPM, and the OPM group has already developed some mappings (called profiles in OPM) to existing vocabularies. There was consensus on using OPM, and a clear sense that the effort involved in designing mappings to OPM would be manageable and at the same time very useful to the community.

Briefly, OPM is used to describe histories in terms of processes (things happening), artifacts (what things happen to), and agents (what controls things happening). These three are kinds of nodes within a graph, where each edge denotes a causal relationship. Edges have named types depending on the kinds of node they relate: a process used an artifact; an artifact was generated by a process; one artifact was derived from another artifact; one process was triggered by another process; a process was controlled by an agent. As records of the same occurrences may be observed by different actors and/or from different perspectives, OPM allows subgraphs to belong to one or more accounts.
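
The node and edge rules just described can be sketched as a small data structure. This is only an illustration and not part of the OPM specification: the class name `OPMGraph`, the account name "kitchen-log", and the baking example are hypothetical, and accounts are modelled simply as labels attached to edges.

```python
from dataclasses import dataclass, field

# Allowed (source kind, target kind) for each edge type, as listed in the text.
EDGE_SIGNATURES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasTriggeredBy": ("process", "process"),
    "wasControlledBy": ("process", "agent"),
}


@dataclass
class OPMGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    edges: list = field(default_factory=list)  # (type, source, target, accounts)

    def add_node(self, node_id, kind):
        if kind not in {"process", "artifact", "agent"}:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, edge_type, source, target, accounts=("default",)):
        # Reject edges whose endpoints have the wrong kinds for this edge type.
        expected = EDGE_SIGNATURES[edge_type]
        actual = (self.nodes[source], self.nodes[target])
        if actual != expected:
            raise TypeError(f"{edge_type} relates {expected}, got {actual}")
        self.edges.append((edge_type, source, target, frozenset(accounts)))

    def account(self, name):
        """Return the edges (a subgraph) that belong to the named account."""
        return [edge for edge in self.edges if name in edge[3]]


g = OPMGraph()
for node_id, kind in [("bake", "process"), ("flour", "artifact"),
                      ("cake", "artifact"), ("baker", "agent")]:
    g.add_node(node_id, kind)

g.add_edge("used", "bake", "flour")
g.add_edge("wasGeneratedBy", "cake", "bake")
g.add_edge("wasControlledBy", "bake", "baker", accounts=("kitchen-log",))
g.add_edge("wasDerivedFrom", "cake", "flour", accounts=("default", "kitchen-log"))
```

Attaching a set of account labels to each edge lets one edge belong to several accounts, which mirrors the statement that subgraphs may belong to one or more accounts.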

We agreed to make the mappings as formal as possible. We agreed to use SKOS terms (skos:broader, skos:narrower, skos:related) and OWL2 terms (owl:equivalentClass, owl:equivalentProperty) to indicate the mapping. To this end, we used a prefix for each of the models and vocabularies.
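
As a rough illustration of what such a formal encoding could look like, the sketch below serializes a few mappings from the Mappings table as Turtle triples. Two things are assumptions here: the OPM term is taken as the subject of each triple (one reading of the table), and every namespace URI except the SKOS one is a placeholder, since the real URIs are not given in this document.

```python
# Placeholder namespaces (assumed for illustration); only the SKOS URI is real.
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "opm": "http://example.org/opm#",
    "provenir": "http://example.org/provenir#",
    "premis": "http://example.org/premis#",
}

# (OPM term, SKOS mapping property, model term), taken from the Mappings table.
MAPPINGS = [
    ("opm:Process", "skos:broadMatch", "provenir:process"),
    ("opm:Artifact", "skos:narrowMatch", "premis:Object"),
    ("opm:Agent", "skos:relatedMatch", "provenir:agent"),
]


def to_turtle(mappings, prefixes):
    """Render prefix declarations followed by one triple per mapping."""
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in sorted(prefixes.items())]
    lines.append("")
    lines += [f"{s} {p} {o} ." for s, p, o in mappings]
    return "\n".join(lines)


turtle = to_turtle(MAPPINGS, PREFIXES)
print(turtle)
```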

We created a single table with concise proposals for all the mappings. To complement the table, we wrote detailed notes about the mappings summarized in the table.

Mappings

Terms from the Reference Model (OPM), each followed by the mapped terms from the other provenance models. The mapping relation is given in parentheses after the term or terms it applies to; models with no mapped term are omitted.

  • opm:Process
    • Provenir ontology: provenir:process (skos:broadMatch)
    • Provenance Vocabulary: prv:Execution (narrower), prv:DataAccess (narrower), prv:DataCreation (narrower)
    • Proof Markup Language: pmlj:InferenceStep (skos:relatedMatch)
    • Dublin Core: dcmitype:Event (skos:relatedMatch)
    • PREMIS: premis:Event (skos:relatedMatch)
    • WOT Schema: wot:SigEvent (skos:narrowMatch)
    • Semantic Web Publishing Vocabulary: swp:signatureMethod, swp:digestMethod (skos:narrower)
    • Changeset Vocabulary: cs:ChangeSet (skos:relatedMatch)
  • opm:Artifact
    • Provenir ontology: provenir:data (skos:relatedMatch)
    • Provenance Vocabulary: prv:Artifact (related), prv:DataItem (narrower), prv:CreationGuideline (narrower), prv:File (narrower)
    • Proof Markup Language: pmlp:Information (closeMatch), pmlp:Source (related)
    • Dublin Core: dct:Collection, dct:BibliographicResource, dct:PhysicalResource, dct:MethodOfAccrual (skos:narrowMatch)
    • PREMIS: premis:Object (skos:narrowMatch)
    • WOT Schema: wot:PubKey (skos:narrowMatch)
    • Semantic Web Publishing Vocabulary: swp uses rdfg:Graph (broader)
    • Changeset Vocabulary: cs uses rdf:Statement (skos:narrowMatch)
  • opm:Agent
    • Provenir ontology: provenir:agent (skos:relatedMatch)
    • Provenance Vocabulary: prv:Actor (broader)
    • Proof Markup Language: pmlp:Agent (narrower), pmlp:Source (related)
    • Dublin Core: dct:Agent (skos:exactMatch)
    • PREMIS: premis:Agent (skos:narrowMatch)
    • Semantic Web Publishing Vocabulary: swp:Authority (narrower)
  • opm:Account
    • Provenir ontology: provenir:data (skos:broadMatch)
    • Proof Markup Language: pmlj:NodeSet (narrower)
    • Dublin Core: dct:ProvenanceStatement (skos:narrowMatch)
    • SWAN PAV Ontology: pav:versionNumber (narrower)
    • Semantic Web Publishing Vocabulary: swp:Warrant (narrower)
  • opm:wasDerivedFrom
    • Provenir ontology: provenir:derives_from (skos:relatedMatch)
    • Provenance Vocabulary: prv:precededBy (narrower)
    • Proof Markup Language: pmlp:hasSourceUsage (narrower)
    • Dublin Core: dct:replaces, dct:source, dct:hasPart, dct:references (skos:narrowMatch)
    • PREMIS: premis:relatedObjectIdentification (skos:broadMatch)
    • SWAN PAV Ontology: pav:importedFromSource (narrower), pav:previousVersion (narrower)
  • opm:used
    • Provenir ontology: provenir:has_participant (skos:broadMatch)
    • Provenance Vocabulary: prv:employedArtifact (related), prv:usedData (narrower), prv:usedGuideline (narrower)
    • Proof Markup Language: pmlj:hasAntecedentList (relatedMatch)
    • Dublin Core: dcterms:requires (skos:relatedMatch), dcterms:isRequiredBy (skos:relatedMatch)
    • PREMIS: premis:linkingObjectIdentifier (skos:relatedMatch)
    • WOT Schema: wot:signer (skos:narrowMatch)
    • Semantic Web Publishing Vocabulary: swp:quotedBy (skos:narrower)
    • Changeset Vocabulary: cs:statement (skos:narrowMatch), cs:removal (skos:narrowMatch), cs:addition (skos:narrowMatch)
  • opm:wasGeneratedBy
    • Provenir ontology: provenir:has_participant (skos:broadMatch)
    • Provenance Vocabulary: prv:yieldedBy (related), prv:createdBy (narrower), prv:retrievedBy (narrower)
    • Proof Markup Language: pmlj:IsConsequentOf (relatedMatch), pmlp:hasConclusion (related)
    • Dublin Core: dct:source (skos:broadMatch)
    • PREMIS: premis:relatedEventIdentification (skos:broadMatch)
    • SWAN PAV Ontology: pav:createdBy (broader)
    • Semantic Web Publishing Vocabulary: swp:assertedBy (skos:narrower)
  • opm:wasControlledBy
    • Provenir ontology: provenir:has_agent (skos:broadMatch)
    • Provenance Vocabulary: prv:involvedActor (broader), prv:performedBy (equivalent)
    • Proof Markup Language: pmlp:hasSourceUsage (related), pmlj:hasInferenceEngine (relatedMatch)
    • Dublin Core: dct:contributor (skos:relatedMatch)
    • PREMIS: premis:linkingAgentIdentifier (skos:relatedMatch)
    • SWAN PAV Ontology: pav:contributors (narrower), pav:authors (related), pav:curators (related), pav:importedBy (related), pav:publishedBy (related), pav:submittedBy (related)
    • Changeset Vocabulary: cs:creatorName (skos:narrowMatch)
  • opm:wasTriggeredBy
    • Provenir ontology: provenir:preceded_by (skos:broadMatch)
    • Dublin Core: dct:source (skos:broadMatch)

Notes on Models

This section contains notes that provide additional details about the mappings summarized in the table.

Provenir ontology

Provenir ontology properties for inclusion in table of provenance terms:

  • provenir:part_of - This property is used to represent the parthood relation between entities (both class and instance-level). For example, a mass analyzer is provenir:part_of a mass spectrometer.
  • provenir:contained_in - This property is used to represent the containment relation between entities. For example, a temperature sensor is provenir:contained_in an ocean buoy.
  • provenir:adjacent_to - Spatial proximity is represented by this property. For example, the quality of observations made by a sensor may be affected if it is provenir:adjacent_to a sensor generating a magnetic field.
  • provenir:transformation_of - This property is similar to the ro:transformation_of property that is asserted between two entities that preserve their identity between the two transformation stages. For example, a cancer cell is a provenir:transformation_of a normal cell.
  • provenir:preceded_by - This property is used to define a temporal ordering of processes, which may or may not be linked by a common artifact (as they are in opm:wasTriggeredBy). Example from RO: aging is provenir:preceded_by development.
  • provenir:located_in - An instance of data or agent is associated with exactly one spatial region that is its exact location at a given instant of time. For example, a sensor is provenir:located_in a specific geospatial region at time instant t.
  • provenir:has_temporal_value - This property is used to explicitly associate a temporal value with individuals of Provenir classes. For example, the duration of a liquid chromatography process provenir:has_temporal_value 20 minutes.

Provenir ontology to OPM Mapping Rationale

The following table provides an explanation for the decisions that resulted in the given mapping from the common provenance terms to the Provenir ontology. All OPM term definitions are from "The Open Provenance Model Core Specification (v1.1)" by Moreau et al., 2009.

Mappings and explanations:

  • provenir:process is mapped to opm:Process using skos:broadMatch - provenir:process allows modeling of processes that may or may not result in the creation of new entities (provenir:data), whereas opm:Process is defined as "...actions resulting in new artifacts."
  • provenir:data is mapped to opm:Artifact using skos:relatedMatch - OPM does not define the relationship between opm:Artifact and opm:Account. Specializations (sub-classes) of provenir:data can be used to model information entities represented by both opm:Artifact and opm:Account. Further, an opm:Artifact is an "immutable piece of state", whereas provenir:data allows representation of both immutable entities and entities that can undergo change or modification without losing their identities (for example, an organism retains its "identity" from its birth to its death).
  • provenir:agent is mapped to opm:Agent using skos:relatedMatch - opm:Agent is defined as "... a cause of a process taking place". provenir:agent, on the other hand, allows modeling of agents that may or may not be causally linked to a process (for example, an ocean buoy containing a sensor measuring temperature is not causally linked to the measurement process).
  • provenir:data is mapped to opm:Account using skos:broadMatch - opm:Account describes the creation of provenance graphs at "...different levels of abstraction or from different viewpoints". provenir:data can be specialized (sub-classed) to model provenance information at multiple levels of granularity (including referring to provenance graphs as entities using RDF named graph identifiers).
  • provenir:derives_from is mapped to opm:wasDerivedFrom using skos:relatedMatch - The provenir:derives_from property represents the derivation history of data entities as a chain or pathway. Unlike opm:wasDerivedFrom, provenir:derives_from may or may not represent an existential relationship between entities.
  • provenir:has_participant is mapped to opm:used using skos:broadMatch - The provenir:has_participant property describes the participation of provenir:data entities in a provenir:process. Unlike opm:used, provenir:has_participant may or may not represent an existential relationship between the provenir:data and the provenir:process; in other words, the provenir:process may or may not require the existence of the provenir:data to initiate or terminate.
  • provenir:has_participant is mapped to opm:wasGeneratedBy using skos:broadMatch - opm:wasGeneratedBy can be interpreted as an inverse property of opm:used. provenir:has_participant allows modeling of more types of relationships between data and process, in addition to the existential relationship modeled by opm:wasGeneratedBy.
  • provenir:has_agent is mapped to opm:wasControlledBy using skos:broadMatch - The provenir:has_agent property is a "...causal property that links agent to process where the agent is directly responsible for the change in state of the process". In addition to the "causal dependency" represented by opm:wasControlledBy, provenir:has_agent also allows modeling of causal participation.
  • provenir:preceded_by is mapped to opm:wasTriggeredBy using skos:broadMatch - The provenir:preceded_by property allows ordering of provenir:process entities along multiple dimensions, including time. Unlike opm:wasTriggeredBy, the provenir:preceded_by property links processes that may or may not be causally dependent.

Provenance Vocabulary

The Provenance Vocabulary was developed to describe provenance of Linked Data on the Web. The openness of the Web of Linked Data allows everyone to publish anything. Applications that are based on data from the Web have to evaluate the provenance of this data in order to estimate its reliability. To obtain such provenance information the applications rely on provenance-related metadata from third parties, e.g. the data providers. However, a recent study revealed a general lack of provenance-related metadata about Linked Data on the Web. One reason - among others - might be the lack of suitable vocabularies to describe provenance of Linked Data. The Provenance Vocabulary aims to fill this void.

The Provenance Vocabulary is defined as an OWL ontology and it is partitioned into a core ontology and supplementary modules. To avoid making the core ontology too complex the modules provide less frequently used concepts and a broad range of specializations of the core concepts. At present the Provenance Vocabulary provides three supplementary modules: Types, Files and Integrity Verification.

The development of the Provenance Vocabulary was motivated by the need to describe the main aspects of provenance of data consumed from the Web. The authors of the vocabulary identified two main dimensions of provenance that are typical in this context: data creation and data access. Some more general concepts, such as actors, processes, and artifacts, are relevant in both of these dimensions. Consequently, the Provenance Vocabulary consists of three parts: general terms, terms for data creation, and terms for data access. Detailed information about using the Provenance Vocabulary and many examples can be found in the Guide to the Provenance Vocabulary.

Related Publications:

  • Olaf Hartig and Jun Zhao: Publishing and Consuming Provenance Metadata on the Web of Linked Data. In Proceedings of the 3rd International Provenance and Annotation Workshop (IPAW), Troy, New York, USA, June 2010.
  • Olaf Hartig: Provenance Information in the Web of Data. In Proceedings of the Linked Data on the Web (LDOW) Workshop at WWW, Madrid, Spain, April 2009.

Rationale of the Mapping

The following table provides an explanation for the decisions that resulted in the given mapping from the common provenance terms to the Provenance Vocabulary. The current mapping is based on release v0.5 of the vocabulary.

Mappings and explanations:

  • prv:Execution is narrower than opm:Process - Both terms refer to a specific execution of a process. However, while the definition of opm:Process only requires that this execution must have started in the past, prv:Execution explicitly refers to executions that have also already been completed.
  • prv:DataAccess is narrower than opm:Process - prv:DataAccess is a sub-class of prv:Execution, and since prv:Execution is narrower than opm:Process, prv:DataAccess is narrower as well.
  • prv:DataCreation is narrower than opm:Process - prv:DataCreation is a sub-class of prv:Execution, and since prv:Execution is narrower than opm:Process, prv:DataCreation is narrower as well.
  • prv:Artifact is similar to opm:Artifact - prv:Artifact is anything that can be the input to the execution of a process or (one of) the result(s) of such an execution. Hence, the Provenance Vocabulary does not understand artifacts as an "immutable piece of state" as OPM does. (Question: is prv:Artifact broader than opm:Artifact?)
  • prv:DataItem is narrower than opm:Artifact - prv:DataItem is a special kind of artifact as represented by opm:Artifact.
  • prv:CreationGuideline is narrower than opm:Artifact - prv:CreationGuideline is a special kind of artifact as represented by opm:Artifact.
  • prv:File is narrower than opm:Artifact - prv:File is a special kind of artifact as represented by opm:Artifact.
  • prv:Actor is broader than opm:Agent - Each opm:Agent is directly related to a process (OPM defines opm:Agent as "a catalyst of a process"). A prv:Actor can be basically any active entity. This includes entities that are directly involved in the processes described (as represented by opm:Agent) but also entities that are not directly involved (e.g. the person who maintains the Web server that served a prv:DataItem in a prv:DataAccess execution).
  • nothing for opm:Account - The Provenance Vocabulary is meant to be used to provide provenance descriptions as Linked Data on the Web. The possibility of providing multiple accounts of descriptions is an inherent characteristic of the Web of Linked Data. However, the common Linked Data publication practices do not provide the means to explicitly relate different descriptions (of the same situation) to each other.
  • prv:precededBy is narrower than opm:wasDerivedFrom - Deriving something (a prv:DataItem in the case of the Provenance Vocabulary) from a preceding version of it is a special kind of deriving something from something else.
  • prv:employedArtifact is similar to opm:used - While prv:employedArtifact refers to anything that can be any kind of input to the execution of a process, opm:used focuses on input that adheres to the OPM notion of artifact.
  • prv:usedData is narrower than opm:used - Since prv:DataItem is a special kind of opm:Artifact, using a data item for the execution of a process is a special kind of using an opm:Artifact for the process.
  • prv:usedGuideline is narrower than opm:used - Since prv:CreationGuideline is a special kind of opm:Artifact, using a creation guideline for the execution of a process is a special kind of using an opm:Artifact for the process.
  • prv:yieldedBy is similar to opm:wasGeneratedBy - While prv:yieldedBy refers to anything that can be any kind of output of the execution of a process, opm:wasGeneratedBy focuses on output that adheres to the OPM notion of artifact.
  • prv:createdBy is narrower than opm:wasGeneratedBy - Since prv:DataItem is a special kind of opm:Artifact and prv:DataCreation is a special kind of prv:Execution (itself narrower than opm:Process), creating a data item by the execution of a data creation process is a special kind of generating an opm:Artifact by the process.
  • prv:retrievedBy is narrower than opm:wasGeneratedBy - Retrieving a data item from the execution of a data access process is a special kind of generating an opm:Artifact by the process.
  • prv:performedBy is equivalent to opm:wasControlledBy - Both relationships refer to the active entity that was responsible for controlling the execution of a process.
  • prv:involvedActor is broader than opm:wasControlledBy - prv:involvedActor is a super-property of prv:performedBy and it refers to active entities that were somehow involved in the execution of a process. This involvement does not necessarily mean that the referent was responsible for controlling the execution. An example of an involved actor that did not control a process execution is a Web service that responded in the execution of a prv:DataAccess.
  • nothing for opm:wasTriggeredBy - opm:wasTriggeredBy cannot be mapped directly to a Provenance Vocabulary term. However, because RDF supports blank nodes, it is easy to represent the semantics of opm:wasTriggeredBy, which assumes the existence of an unknown artifact: the artifact becomes a blank node that was prv:yieldedBy a prv:Execution P1 and that was a prv:employedArtifact in prv:Execution P2.
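
The blank-node pattern for opm:wasTriggeredBy can be sketched with plain tuples standing in for RDF triples. The function names and the identifiers `:P1` and `:P2` are hypothetical. The triple directions (artifact prv:yieldedBy execution, execution prv:employedArtifact artifact) are an assumption based on how these properties are aligned with opm:wasGeneratedBy and opm:used in the table.

```python
from itertools import count

_blank_ids = count(1)


def triggered_by(p2, p1):
    """Expand 'p2 wasTriggeredBy p1' into two prv triples sharing a blank node."""
    artifact = f"_:a{next(_blank_ids)}"  # the unknown intermediate artifact
    return [
        (artifact, "prv:yieldedBy", p1),
        (p2, "prv:employedArtifact", artifact),
    ]


def find_triggers(triples):
    """Recover (p2, p1) pairs: p2 employed an artifact that p1 yielded."""
    yielded = {s: o for s, p, o in triples if p == "prv:yieldedBy"}
    return [
        (s, yielded[o])
        for s, p, o in triples
        if p == "prv:employedArtifact" and o in yielded
    ]


triples = triggered_by(":P2", ":P1")
pairs = find_triggers(triples)
```

Note that `find_triggers` would also match a named artifact shared by two executions, which arguably still implies the same ordering between them.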

Terms not Present in the Mapping

The following Provenance Vocabulary terms are not mentioned in the mapping yet: prv:DataProvidingService, prv:DataPublisher, prv:HumanActor, prv:NonHumanActor, prv:accessedResource, prv:accessedService, prv:containedBy, prv:deployedSoftware, prv:serializedBy, prv:operatedBy, prv:performedAt, prv:usedBy

Proof Markup Language

The following table provides an explanation for the decisions that resulted in the given mapping from the common provenance terms to the Proof Markup Language (PML). Note that PML includes three subvocabularies: provenance (pmlp), justification (pmlj), and trust (pmlt).

Mappings and explanations:

  • pmlj:InferenceStep is a relatedMatch to opm:Process - Both terms refer to a specific execution of a process. While the term InferenceStep might seem to imply a subtype of step, it is used broadly to apply to many types of mathematical/computational process executions as well as logical inference, and thus appears to be a match for opm:Process.
  • pmlp:Information is a closeMatch to opm:Artifact - Information "supports references to information at various levels of granularity and structure" and is used in examples to represent text strings and scientific data files, and thus appears to be a close match to the opm:Artifact concept.
  • pmlp:Source is related to opm:Artifact - Source appears to be used both for things that would map to opm:Artifact (e.g. documents, web pages) and for things that would map to opm:Agent (e.g. an agent/person). Sources are associated with Information that comes from them (hasSourceUsage), which, as discussed later, appears to be a form of the opm:wasDerivedFrom relation where the 'usage' process is not described.
  • pmlp:Agent is narrower than opm:Agent - pmlp:Agent is a subtype of Source and thus does not have an association with a Process as in OPM. However, as a source, pmlp:Agents do appear to control undescribed processes resulting in Information that can be inputs (antecedent) to InferenceSteps. Thus pmlp:Agent appears narrower than opm:Agent. opm:Agent also appears to be used more broadly for non-human agents than pmlp:Agent, e.g. with an oven controlling a baking process. pmlp:Source thus appears to overlap with opm:Agent as well (e.g. if ovens can be pmlp:Sources of cakes).
  • pmlj:NodeSet is narrower than opm:Account - NodeSet describes a set of InferenceSteps, Information, and Sources leading to a conclusion. It thus appears to serve a similar role to opm:Account as a way to aggregate information provided by a witness. It is narrower in that Account does not require causal connections between all the artifacts and processes being reported.
  • pmlp:hasSourceUsage is narrower than opm:wasDerivedFrom - PML doesn't appear to have a general causal connection between pmlp:Information instances but does provide such a link between Sources (which can be documents) and Information (e.g. a text string from that document).
  • pmlj:hasAntecedentList is a relatedMatch to opm:used - The antecedent list documents the inputs to an InferenceStep and thus appears to be similar to opm:used. The list is ordered whereas opm:used provides roles, so the match is not exact.
  • pmlj:IsConsequentOf is a relatedMatch to opm:wasGeneratedBy - InferenceSteps create Information which is a result of that step. It thus appears to mirror the OPM wasGeneratedBy relationship of process and artifact. In PML, NodeSets may also have pmlp:hasConclusion relationships with Information, which appears to make it related to wasGeneratedBy in some sense (NodeSet as an aggregate opm:Process).
  • pmlp:hasSourceUsage is related to opm:wasControlledBy - hasSourceUsage, when the Source is an Agent, appears to imply that the Agent controlled some process by which the output Information was created.
  • pmlj:hasInferenceEngine is a relatedMatch to opm:wasControlledBy - Inference engines appear to control InferenceSteps in a way analogous to an opm:Agent controlling processes, and the engine is reported in PML documentation as "the agent who ran this step". (The range of hasInferenceEngine has not been checked, so it is unknown whether it is restricted to Source, Agent, or something else.)
  • nothing matches opm:wasTriggeredBy - The paper "Towards Usable and Interoperable Workflow Provenance: Empirical Case Studies Using PML" notes that PML had no equivalent of wasTriggeredBy as of the 3rd Provenance Challenge...
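
The mismatch between an ordered antecedent list and role-based opm:used edges can be illustrated with a small conversion. This is a sketch under one assumed convention, namely recording the list position as the role; neither PML nor OPM prescribes it, and the function name and the step and premise identifiers are hypothetical.

```python
def antecedents_to_used(step, antecedents):
    """Turn an ordered antecedent list into OPM-style 'used' edges.

    The list position is kept as the role so that the ordering is not lost.
    """
    return [
        {"edge": "used", "process": step, "artifact": item, "role": f"antecedent-{i}"}
        for i, item in enumerate(antecedents, start=1)
    ]


edges = antecedents_to_used("step1", ["premiseA", "premiseB"])
```

Going the other way is not generally possible: arbitrary OPM roles carry no ordering, which is one way to see why the match is related rather than exact.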

PML terms not mentioned in the mapping yet: PML 2 has a broad range of concepts and relationships that are outside the scope of OPM. These include specific properties describing and structuring artifacts, agents/sources, and processes/InferenceSteps, some of which appear to overlap Dublin Core, some of which appear to be text related (e.g. Information that "hasRawString"), and some of which describe the algorithm and environment in which processes (process events) occur (e.g. describing InferenceEngines and InferenceRules). PML also includes a Trust vocabulary that documents beliefs/trust in sources, which is not covered in OPM.

One other note: While the provenance of PML in the semantic web is clear in the choice of names (e.g. InferenceEngines), there are numerous examples where PML has been applied to non-text data and non-logic-based processing and thus the underlying term definitions do not appear restrictive.

Dublin Core

Dublin Core Metadata Terms provide a means to describe resources such that others will be able to interpret those descriptions. In particular, it provides a common vocabulary of core terms which can act as metadata keys, qualifications of those terms for specific applications, definitions of data types for the values of resource metadata, and so on. Amongst the terms available are many which relate to the provenance of the resource: who created it, when it was changed, etc.

The following table provides an explanation for the decisions that resulted in the given mapping from the common provenance terms to Dublin Core Metadata Terms.

Mappings and explanations:

  • dcmitype:Event is related to opm:Process - dcmitype:Event represents a non-persistent, time-based occurrence. An opm:Process is similarly an individual non-persistent occurrence, though with a causation-based rather than time-based identity. dcmitype:Event could also denote a future occurrence, while opm:Process refers to past occurrences only.
  • dct:Collection, dct:BibliographicResource, dct:PhysicalResource, and dct:MethodOfAccrual are all narrower than opm:Artifact - Each of the given Dublin Core entities is a 'thing' which can be used or generated by processes, and is therefore a type of opm:Artifact.
  • dct:Agent is an exact match to opm:Agent - dct:Agent is a resource that acts or has the power to act, e.g. a person, organization, or software agent. Given that an action is, in OPM, a process, and an opm:Agent is the entity controlling a process, these two concepts are equivalent.
  • dct:ProvenanceStatement is narrower than opm:Account - A dct:ProvenanceStatement is a statement of any changes in ownership and custody of a resource since its creation that are significant for its authenticity, integrity, and interpretation. Such changes are a subset of the processes which may be described in an OPM graph regarding a resource's provenance. Further, a dct:ProvenanceStatement is a statement from a single source, therefore denoting one view of a resource's history. In OPM, accounts are distinguished subgraphs, and a specific use of them is to assert provenance from a single viewpoint. Therefore, dct:ProvenanceStatement is a narrowing of the opm:Account concept.
  • dct:replaces, dct:source, dct:hasPart, and dct:references are narrower than opm:wasDerivedFrom - Each of the Dublin Core terms listed relates an artifact to another from which it is in some way derived, so is a kind of opm:wasDerivedFrom.
  • dct:requires, dct:isRequiredBy are narrower than opm:used - ??? (dct:requires and dct:isRequiredBy seem not to be about the past, so how could they map to any OPM concept?)
  • dct:source is broader than opm:wasGeneratedBy - ??? (dct:source by definition relates data (two resources), while opm:wasGeneratedBy relates data (an artifact) to a process. Where is the match?)
  • dct:contributor is related to opm:wasControlledBy - dct:contributor relates a resource (data) to an agent which contributed to it. This implies an opm:Process/dct:Event in which the contribution happened, controlled by the agent. Therefore, dct:contributor is related to opm:wasControlledBy, which denotes the control of a process by an agent.
  • dct:source is broader than opm:wasTriggeredBy - ??? (dct:source by definition relates data (two resources), while opm:wasTriggeredBy relates two processes. Where is the match?)
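
One way an application might use these mappings is to decide when a Dublin Core type can be replaced by an OPM type. The sketch below treats only exact and narrower mappings as substitutable, following the reasoning above that the narrower terms "are therefore types of opm:Artifact". That rule is an assumption of this sketch rather than something SKOS itself guarantees, and the function and dictionary names are hypothetical.

```python
# Dublin Core term -> (mapping relation from the table above, OPM term)
DC_TO_OPM = {
    "dct:Agent": ("exact", "opm:Agent"),
    "dct:Collection": ("narrower", "opm:Artifact"),
    "dct:ProvenanceStatement": ("narrower", "opm:Account"),
    "dcmitype:Event": ("related", "opm:Process"),
}


def opm_type_for(term):
    """Return the OPM type that can stand in for a Dublin Core term, or None.

    Only exact and narrower mappings are treated as substitutable; related
    mappings and unknown terms yield None.
    """
    relation, opm_term = DC_TO_OPM.get(term, (None, None))
    return opm_term if relation in {"exact", "narrower"} else None


agent_type = opm_type_for("dct:Agent")
event_type = opm_type_for("dcmitype:Event")
```

Here a dct:Agent can be reported as an opm:Agent, while a dcmitype:Event is left untranslated because it may denote a future occurrence.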

Some vocabularies, such as DC and WoT, are largely in the form of RDF-style predicates. Rather than these predicates themselves mapping to OPM terms, it is their domains or ranges which do.

Dublin Core terms which do not map directly to OPM concepts are:

  • dct:contributor, dct:creator, dct:publisher, which map to an OPM graph following a particular pattern involving an agent, a process and multiple artifacts denoting a resource in multiple states
  • dct:accrualMethod, whose range is an artifact (the specification of a method of accrual)
  • dct:accrualPeriodicity, whose range is a recurring period of time, while OPM describes only particular past instances
  • dct:available, dct:created, dct:dateAccepted, dct:dateCopyrighted, dct:dateSubmitted, dct:modified, whose ranges are equivalent to the OPM 'timestamp' concept
  • dct:identifier, which is related to the OPM 'pname' annotation
  • dct:isReferencedBy, dct:isReplacedBy, which are narrow matches of the inverse of opm:wasDerivedFrom (dct:references, dct:replaces are narrow matches of opm:wasDerivedFrom, as shown in the table)
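
The last item in the list, the inverse relationship, can be sketched as a rewrite that swaps subject and object for the inverse predicates. Triples are plain tuples, and the function name and the `:v1`/`:v2` identifiers are hypothetical. Because the Dublin Core terms are narrow matches, rewriting to opm:wasDerivedFrom loses the more specific meaning.

```python
DIRECT = {"dct:references", "dct:replaces"}
INVERSE = {"dct:isReferencedBy", "dct:isReplacedBy"}


def to_derivations(triples):
    """Rewrite Dublin Core triples as opm:wasDerivedFrom triples."""
    derived = []
    for s, p, o in triples:
        if p in DIRECT:
            derived.append((s, "opm:wasDerivedFrom", o))
        elif p in INVERSE:
            # Inverse predicate: the object is the derived resource.
            derived.append((o, "opm:wasDerivedFrom", s))
    return derived


result = to_derivations([
    (":v2", "dct:replaces", ":v1"),
    (":v1", "dct:isReplacedBy", ":v2"),
    (":v2", "dct:title", "Report"),
])
```

Both the direct and the inverse statement produce the same derivation triple, and unrelated predicates such as dct:title are skipped.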

PREMIS

PREMIS is a data dictionary for supporting long-term preservation. It focuses on the provenance of the archived, digital objects (files, bitstreams, aggregations), not on the provenance of the descriptive metadata.


Mapping Explanation
premis:Event is mapped to opm:Process using skos:relatedMatch A premis:Event describes any event applied to a premis:Object (bitstream, file, representation). This event may or may not change the premis:Object. Examples are a file format migration or an MD5 check. The premis:Event is timebased, that is why it is related to opm:Process.
premis:Object is mapped to opm:Artifact using skos:narrowMatch A premis:Object can only be a bitstream, file or aggregation (representation). It does not refer to metadata, which is the reason for the narrow match.
premis:Agent is mapped to opm:Agent using skos:narrowMatch A premis:Agent can be a person, institution or software. The premis:Agent initiates a premis:Event and can hold some premis:Rights. An opm:Agent can refer to anything causing a process, which is why premis:Agent is a narrow match to opm:Agent.
premis:relatedObjectIdentification is mapped to opm:wasDerivedFrom using skos:broadMatch A premis:relatedObjectIdentification relates two premis:Objects to each other. The relationship can be structural (a premis:Object as part of another premis:Object) or a derivation (a premis:Object can be migrated from another premis:Object). Hence the broader match.
premis:linkingObjectIdentifier is mapped to opm:used using skos:relatedMatch
premis:relatedEventIdentification is mapped to opm:wasGeneratedBy using skos:broadMatch A premis:relatedEventIdentification relates a premis:Object to a premis:Event. It is broadly matched to opm:wasGeneratedBy because the relationship between the premis:Object and the premis:Event can be broader than just causal. The premis:Object could be used, e.g., as input for the premis:Event.
premis:linkingAgentIdentifier is mapped to opm:wasControlledBy using skos:relatedMatch premis:linkingAgentIdentifier links a premis:Agent to a premis:Event for describing that the premis:Agent initiated the premis:Event.

Apart from the terms in the mapping table, the following PREMIS terms can be relevant for provenance:

  • signature: PREMIS allows describing the signature an object is signed with.
    • signatureEncoding: The encoding used for the values of signatureValue, keyInformation. E.g. Base64
    • signer: The individual, institution, or authority (Agent) responsible for generating the signature. Could also be carried in keyInformation.
    • signatureMethod: A designation for the encryption and hash algorithms used for signature generation. E.g. DSA-SHA1
    • signatureValue: A value generated from the application of a private key to a message digest.
    • keyInformation: Information about the signer’s public key needed to validate the digital signature.

WOT Schema

The Web of Trust ontology provides a vocabulary for describing how data items' validity has been assured through being encrypted or signed, relating encrypted data to its key, keys to their users and so on. Included in this are assertions about the provenance of data items, such as when in the past a data item was signed and by what key.

The following table provides an explanation for the decisions that resulted in the given mapping from the common provenance terms to Web of Trust.

Mapping Explanation
wot:SigEvent is narrower than opm:Process wot:SigEvent is an event describing the action of a public key being signed by some other public key, whereas opm:Process denotes an event in general.
wot:PubKey is narrower than opm:Artifact wot:PubKey represents a PGP/GPG public key, which is a kind of data artifact.
wot:signer is narrower than opm:used wot:signer links from a signing event to the public key used to sign, and so is an example of a process using an artifact.
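
The "is narrower than" wording in the table can be turned mechanically into SKOS mapping statements. The following Python sketch assumes that mappings run from the common (OPM) term to the vocabulary term, as stated above, so the OPM term becomes the subject. The function name and the tuple representation are illustrative only, and prefix declarations are omitted.

```python
# Wording used in the mapping tables -> SKOS mapping property.
# With the OPM term as subject, "X is narrower than opm:Y" becomes
# opm:Y skos:narrowMatch X (X is the narrower concept).
PHRASE_TO_SKOS = {
    "is narrower than": "skos:narrowMatch",
    "is broader than": "skos:broadMatch",
    "is related to": "skos:relatedMatch",
}


def to_skos_triple(statement):
    """Turn 'X is narrower than opm:Y' into (opm:Y, SKOS property, X)."""
    for phrase, skos_property in PHRASE_TO_SKOS.items():
        term, found, opm_term = statement.partition(f" {phrase} ")
        if found:
            return (opm_term.strip(), skos_property, term.strip())
    raise ValueError(f"no mapping relation found in: {statement!r}")


rows = [
    "wot:SigEvent is narrower than opm:Process",
    "wot:PubKey is narrower than opm:Artifact",
    "wot:signer is narrower than opm:used",
]
for subject, predicate, obj in map(to_skos_triple, rows):
    print(f"{subject} {predicate} {obj} .")
```

Each printed line is a Turtle-style statement with the OPM term in subject position; a real encoding would also need the namespace declarations for the prefixes used.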

WoT has a couple of terms which do not map to concepts in the table above, but have some relation to other OPM concepts.

  • wot:sigdate is a non-causal property so does not map directly to an OPM concept, but its range is comparable to the OPM 'timestamp' concept
  • wot:signed is the same as wot:signer but in the opposite direction (range and domain are swapped), so maps indirectly to an opm:used relation

SWAN

The table below explains the decisions that resulted in the mapping from the common provenance terms to the SWAN Provenance, Authoring and Version Vocabulary, available at http://purl.org/swan/1.2/pav/.

Mapping Explanation
pav:importedFromSource is narrower than opm:wasDerivedFrom pav:importedFromSource refers to "the original source of the encoded information (PubMed, UniProt...)", which is more specific than opm:wasDerivedFrom.
pav:previousVersion is narrower than opm:wasDerivedFrom pav:previousVersion refers to "the previous version of the resource.", which might have contributed to this latest version of the resource, and it is more specific than opm:wasDerivedFrom.
pav:createdBy is broader than opm:wasGeneratedBy pav:createdBy refers to "an entity primary responsible for making the resource". Such an entity can be interpreted as either a process or an agent. Hence it is broader than opm:wasGeneratedBy.
pav:contributors is narrower than opm:wasControlledBy pav:contributors refers to an agent contributing to the creation of a resource. It is narrower than opm:wasControlledBy. Its sub-properties pav:authors and pav:curators are also narrower than opm:wasControlledBy.
pav:importedBy is narrower than opm:wasControlledBy pav:importedBy refers to an "entity responsible for importing the data from an external source". It is narrower than opm:wasControlledBy.
pav:publishedBy is narrower than opm:wasControlledBy pav:publishedBy refers to an "entity responsible for making the resource available". It is narrower than opm:wasControlledBy.
pav:submittedBy is narrower than opm:wasControlledBy pav:submittedBy refers to an "entity responsible for submitting some resources". It is narrower than opm:wasControlledBy.

Some terms from SWAN-PAV are not well documented, and the domains and ranges of these properties are not defined in the ontology either, which made it hard to judge the mapping relationships between them and the core OPM terms. These terms include:

  • pav:contributedBy and its sub-properties pav:authoredBy and pav:curatedBy.

Several time-related properties from SWAN-PAV were not mapped either, such as pav:createdOn, pav:importedOn, and pav:lastUpdatedOn.

Semantic Web Publishing Vocabulary

The mapping is based on the SWP user manual at http://www4.wiwiss.fu-berlin.de/bizer/WIQA/swp/SWP-UserManual.pdf. Some of the tables in that document are duplicated here for readability.

The following table explains the decisions behind the mapping from SWP to OPM.

Mapping Explanation
swp:SignatureMethod and swp:digestMethod are narrower than opm:Process swp:SignatureMethod defines the method by which a digital signature was constructed; this is a very specific form of a process. Similar reasoning applies to swp:digestMethod.
rdfg:Graph is broader than opm:Artifact SWP essentially allows information about who asserts or commits to a named graph (rdfg:Graph) to be represented. In this sense, an rdfg:Graph is like an opm:Artifact because it is a piece of state. However, it is broader than opm:Artifact because it is not necessarily a snapshot in time.
swp:Authority is narrower than opm:Agent An swp:Authority defines who commits to a particular named graph. One could see this as an OPM process; however, it can also be seen as the entity that controls the process of asserting or quoting a named graph. In any case, it is more specific because it controls only this "commitment" process.
swp:Warrant is narrower than opm:Account Accounts within OPM provide a mechanism to denote different views of the same execution. A Warrant identifies that a particular party has authorized a named graph, thus identifying one authorized view of what has occurred.
swp:quotedBy is narrower than opm:used swp:quotedBy means that a particular authority has quoted (or used) a particular named graph. Quotation is a more specific form of usage.
swp:assertedBy is narrower than opm:wasGeneratedBy swp:assertedBy means that a particular authority says that a named graph should be taken as a claim made by it. Claiming is a specific type of generation. (This may be debatable, but that was the reasoning.)


SWP has a number of elements that are not readily mapped to OPM because they are expressed as properties. For the following signature-related terms, the object of the property is the artifact, and the property in some way describes the "type" of that artifact.

Property Definition
swp:signature The value of this property is the signature to be used to authenticate the graphs with which the subject warrant is associated.
swp:signatureMethod The value of this property is the signature method by which the signature specified for the subject warrant was constructed.
swp:digest The value of this property contains a digest value for the subject graph.
swp:digestMethod The value is the digest method by which the digest value specified for the graph subject was constructed.
swp:hasKey The value is some kind of public key which belongs to the authority. The key is represented by an XML literal containing a XML Signature keyInfo element.
swp:certificate The value is the base64 encoding of a binary (ASN.1 DER) X.509 certificate containing the public key of the authority.

SWP also defines a number of signature terms. These could be modeled as artifacts.

Finally, swp:validFrom and swp:validUntil define when a Warrant is valid. In OPM this could be encoded by annotations on the account representing the warrant.

Changeset

The Changeset Vocabulary describes changes to RDF-based resource descriptions. A resource description is a set of RDF triples that "in some way comprise a description of a resource." [Tunnicliffe and Davis, 2009] The change of a resource description is represented by a cs:ChangeSet entity which encapsulates the delta between two versions of the description. Such deltas are represented by additions and removals of RDF triples.

While the Changeset Vocabulary is not a provenance vocabulary per se, a change set can be understood as the description of a process to change the content of an RDF repository. However, this process need not have been executed. Hence, a change set description is not necessarily a provenance description. If the described change actually happened, then the change set description qualifies as a description of provenance. For the mapping we assume the Changeset Vocabulary is being used to describe changes that actually happened.

Rationale of the Mapping

The following table provides an explanation for the decisions that resulted in the given mapping from the common provenance terms to the Changeset Vocabulary. The current mapping is based on the 2009-05-18 version of the vocabulary.

Mapping Explanation
cs:ChangeSet is related to opm:Process A change set can be understood as the description of a process to change the content of an RDF repository. This process, however, need not have been executed. Hence, a change set description is not necessarily a provenance description. If the described change actually happened, then the change set description qualifies as a description of provenance.
rdf:Statement is narrower than opm:Artifact The ChangeSet vocabulary uses entities of type rdf:Statement as the things that are added to or removed from an RDF repository. rdf:Statement is a special kind of artifact as represented by opm:Artifact.
nothing for opm:Agent
nothing for opm:Account
nothing for opm:wasDerivedFrom The ChangeSet vocabulary provides cs:precedingChangeSet but this refers to the changeset that immediately precedes the described one. From this relationship we cannot infer that the described change set was derived in any way from the preceding change set.
cs:statement is narrower than opm:used cs:statement is an abstract property referring to an rdf:Statement included in a set of changes. Since rdf:Statement is narrower than opm:Artifact this property must be narrower than opm:used.
cs:removal is narrower than opm:used cs:removal is a sub-property of cs:statement
cs:addition is narrower than opm:used cs:addition is a sub-property of cs:statement
nothing for opm:wasGeneratedBy
cs:creatorName is narrower than opm:wasControlledBy Referring to the name of someone who created a change set is a special kind of referring to someone/something which controlled a process execution.
nothing for opm:wasTriggeredBy
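
To illustrate how the rows above combine, the following Python sketch rewrites one change set description as OPM-style edges. It assumes the described change actually happened, treats the change set itself as the process node (the table only states a related match), and uses a plain dictionary as a hypothetical input format; none of these names come from the Changeset Vocabulary beyond the cs: terms themselves.

```python
def changeset_to_opm_edges(changeset):
    """Return OPM-style edges for a change set given as a dictionary.

    Expected keys (illustrative): 'id', and optionally 'cs:removal',
    'cs:addition' (lists of statements) and 'cs:creatorName'.
    """
    process = changeset["id"]
    edges = []
    # cs:removal and cs:addition are sub-properties of cs:statement,
    # which the table maps as narrower than opm:used.
    for prop in ("cs:removal", "cs:addition"):
        for statement in changeset.get(prop, []):
            edges.append((process, "opm:used", statement))
    # cs:creatorName is mapped as narrower than opm:wasControlledBy.
    creator = changeset.get("cs:creatorName")
    if creator is not None:
        edges.append((process, "opm:wasControlledBy", creator))
    return edges


example = {
    "id": "ex:change1",
    "cs:creatorName": "Alice",
    "cs:removal": [("ex:doc", "dct:title", "Old title")],
    "cs:addition": [("ex:doc", "dct:title", "New title")],
}
for edge in changeset_to_opm_edges(example):
    print(edge)
```

No opm:wasDerivedFrom, opm:wasGeneratedBy or opm:wasTriggeredBy edges are produced, in line with the "nothing for" rows of the table.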

Terms not Present in the Mapping

The following provenance-related terms from the Changeset Vocabulary are not mentioned in the mapping yet: cs:changeReason, cs:createdDate, cs:precedingChangeSet

The Changeset Vocabulary additionally provides the following terms that we do not consider as related to provenance: cs:subjectOfChange

Conclusions

This document reports on the W3C Provenance Incubator Group's mapping approach between a set of core provenance terms and several provenance-related models and vocabularies. As core provenance terms we selected the provenance-related terms in OPM. While we were aware of the potential limitations such a predefinition would involve, it turned out to be the right choice for this first attempt to identify correspondences between existing provenance terminologies. Choosing OPM terms allowed us to focus on defining the mapping instead of discussing what qualifies as common provenance terms. Such a discussion could have deviated too easily into the development of another model, which was not our aim. Instead, the conducted mapping exercise enables a better understanding of the differences and similarities of the existing models.

Our first finding is that many of the considered models and vocabularies have a set of core concepts that correspond to the notion of processes, artifacts, and agents as defined in OPM. These concepts can be mapped quite naturally between the models. While the modeling of these concepts indicates a process-centric view, several vocabularies take a resource-centric view. Specifying the mappings for these resource-centric vocabularies was more difficult. In particular, we experienced such difficulties for vocabularies that use relationships between entities as "shortcuts" for a detailed process description. It is understandable that, particularly in the workflow provenance context, it is very important and mostly feasible to explicitly describe the process involved in causing the existence of a resource. However, this might not always be the case. For example, when expressing that some brain tissues were obtained from disease centers, it might be sufficient to say that some tissues were contributed by a specific disease center. Although a provenance description could introduce a process that represents the contribution process, it would make things more verbose than necessary. Hence, resource-centric terms are important shortcuts to complement process-centric provenance vocabularies; they allow for a more compact representation of provenance which is more intuitive in some cases.
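
The trade-off between the two views can be sketched in Python. The example expands a resource-centric "contributed by" statement into a process-centric pair of OPM edges and collapses it again. The ex: identifiers, the ex:contributedBy shortcut, and the choice to model the contribution as the process that generated the resource are all assumptions made for illustration.

```python
def expand_shortcut(resource, agent, process_id):
    """Replace 'resource contributed by agent' with an explicit process."""
    return [
        (resource, "opm:wasGeneratedBy", process_id),
        (process_id, "opm:wasControlledBy", agent),
    ]


def collapse_to_shortcut(edges, shortcut="ex:contributedBy"):
    """Join generation and control edges over their shared process."""
    controllers = {}
    for subject, predicate, obj in edges:
        if predicate == "opm:wasControlledBy":
            controllers.setdefault(subject, []).append(obj)
    return [
        (subject, shortcut, agent)
        for subject, predicate, obj in edges
        if predicate == "opm:wasGeneratedBy"
        for agent in controllers.get(obj, [])
    ]


detailed = expand_shortcut("ex:tissueSample", "ex:diseaseCenter", "ex:contribution1")
print(detailed)
print(collapse_to_shortcut(detailed))
```

The expanded form needs an extra process node and two edges where the shortcut needs a single statement, which is the verbosity referred to above.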

Several vocabularies provide non-causal relationships, something explicitly left out of OPM. For instance, Provenir includes the property provenir:preceded_by to represent temporal order or provenir:adjacent_to for spatial proximity; the Provenance Vocabulary allows users to describe who was responsible for a data providing service that was accessed during the execution of a data access process. While it is debatable whether such relationships are provenance-related, they may be of great value in several application areas of provenance descriptions, such as provenance-based measurement of the trustworthiness of content or information quality assessment in general. However, the definition of non-causal relationships in provenance vocabularies should ensure that no conflict with causality can result.

While many vocabularies provide time-related terms, the time dimension is not represented in our mapping. The main reason for this omission is that OPM does not represent time-related properties explicitly among the terms defined for node types and their relationships. While OPM enables the specification of time constraints using timestamps attached to relationships, we selected only the node types and relationships in OPM as our core provenance terms.

Further aspects of provenance that are not well captured by OPM and, thus, missing from the core provenance terms are:

  • versioning,
  • a notion of artifact identity that persists across transformations,
  • containment relationships and collections, and
  • cryptographic hashes and digital signatures.

Some of the considered vocabularies introduce rich sets of useful concepts for these aspects. In many cases these concepts can be seen as sub-types of OPM terms. To preserve their rich expressiveness, a systematic structuring of these concepts, per application domain, in the form of OPM profiles, would be necessary. Similarly, better bridges between OPM and vocabularies that are already standardized and strongly adopted (e.g. Dublin Core), but do not have the full expressiveness of OPM, would be desirable.

The presented work could be developed further in two main directions. First, the work on the mappings could be continued and made more formal. While we decided to use the SKOS vocabulary to describe the mappings, it would have been possible to use even more precise relationships (e.g. rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass) in some cases. In other cases a more expressive mapping language would be required, rendering the formalization a substantial activity. Additionally, the presented mapping could be complemented by a reverse mapping from the core provenance terms to each provenance vocabulary. An alternative direction would be the evolution of the selected set of core provenance terms into a more comprehensive set which also models aspects of provenance that are currently missing from the core terms. This project would basically entail the discussions we wanted to avoid in order to complete the presented mapping. However, the results and the lessons learned from our mapping exercise would be a valuable input to such an endeavor.
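
As a rough illustration of the first direction, the following Python sketch proposes a more precise RDFS or OWL statement for a SKOS mapping triple whose subject is the OPM term. It is only a candidate generator under stated assumptions: the caller says whether the terms are classes or properties, the skos:exactMatch row is an assumption not taken from the mapping tables, and whether the stronger statement is actually justified would still need case-by-case review.

```python
def refine(triple, kind):
    """Suggest a stronger statement for (opm_term, skos_property, term).

    kind is 'class' or 'property'. Returns None when no candidate exists,
    e.g. for skos:relatedMatch.
    """
    if kind not in ("class", "property"):
        raise ValueError("kind must be 'class' or 'property'")
    opm_term, skos_property, term = triple
    sub_of = "rdfs:subClassOf" if kind == "class" else "rdfs:subPropertyOf"
    equivalent = "owl:equivalentClass" if kind == "class" else "owl:equivalentProperty"
    if skos_property == "skos:narrowMatch":
        # The vocabulary term is the narrower one, so it becomes the subject.
        return (term, sub_of, opm_term)
    if skos_property == "skos:broadMatch":
        # The vocabulary term is the broader one, so the OPM term stays subject.
        return (opm_term, sub_of, term)
    if skos_property == "skos:exactMatch":
        return (opm_term, equivalent, term)
    return None


print(refine(("opm:Process", "skos:narrowMatch", "wot:SigEvent"), "class"))
print(refine(("opm:wasGeneratedBy", "skos:broadMatch", "pav:createdBy"), "property"))
print(refine(("opm:Process", "skos:relatedMatch", "cs:ChangeSet"), "class"))
```

Related matches yield no candidate, which reflects the cases where a more expressive mapping language would be needed.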

Mapping Contributors

  • Open Provenance Model
  • Provenir -- Satya
  • Provenance Vocabulary -- Olaf and Jun
  • Proof Markup Language -- Jim Myers
  • Dublin Core -- Simon and Michael and Jun
  • PREMIS -- Sam
  • WOT Schema -- Simon
  • SWAN Provenance Ontology -- Jun and Satya
  • Semantic Web Publishing Vocabulary -- Paul
  • Changeset Vocabulary -- Olaf

Contact:
Satya Sahoo