Proposal for a Working Group on Provenance

From XG Provenance Wiki

Proposing a Working Group

A proposal for a working group will be more successful if we draft a charter for it, with concrete goals and deliverables that are feasible.

Based on this page and the page Suggested_Concepts, we put together a well-formatted version under Draft_Charter.

Discussion on Goals and Deliverables

What would be the objectives of the WG?

The group agreed to the following objectives:

  • define a provenance exchange language and protocol to publish and access provenance
  • the scope of this language will be any resource, not just Semantic Web objects
  • the exchange language should have a low entry point to facilitate widespread adoption, therefore it should be easy to do simple things
  • it should have a small core model and allow for extensions (i.e., species/profiles, integration of other more expressive/complementary vocabularies/frameworks)
  • the WG should release some deliverables early and end in 18 months or 2 years.

What would be the deliverables of the WG?

3.1 Deliverables

By FOO, we refer to the provenance interchange language to be defined, and the inferences allowed over it. The working group will select an appropriate name for the language.

  • D1. Conceptual Model (W3C Recommendation). This document consists of a natural language description and graphical illustration of concepts involved in FOO. Such a document will help broaden the appeal of provenance beyond the community of technical experts.
  • D2. Formal Model (W3C Recommendation). The purpose of this document is to provide a normative formalization of the conceptual model, making use of Semantic Web languages beginning with RDFS and OWL.
  • D3. Formal Semantics (W3C Note). This note consists of a mathematical definition of FOO.
  • D4. Accessing and Querying Provenance (W3C Recommendation). This document specifies how provenance can be accessed or queried in embedded documents and from remote services. Specifically, it defines how to access provenance embedded in an HTML document using RDFa, how to access provenance from a service by means of HTTP, and how to query provenance through a SPARQL endpoint.
  • D5. Guidelines for producing XML of the model (W3C Recommendation). This document specifies an XML serialization for FOO.
  • D6. Interoperability Guidelines (W3C Recommendation). This document explains how extant provenance models can be mapped into FOO to ensure interoperable exchange of provenance across heterogeneous systems.
  • D7. Best Practice Cookbook (W3C Note). This document includes a limited set of best practice profiles that link with other relevant models, such as Dublin Core provenance-related concepts, licensing in Creative Commons, and the OpenID identity mechanism for people.
  • D8. Primer (W3C Note). This educational document provides users with an easy-to-understand description of the model.

Comments about the deliverables:

  • The conceptual model and the formal model will be developed in parallel, ensuring that concepts can be formalized adequately and, vice versa, that the formalization is explained intuitively.
  • The Working Group is committed to formalizing the provenance interchange language using RDFS and OWL in the first instance. Depending on the kinds of inferences to be supported, other Semantic Web languages may also be considered where appropriate. A by-product of this formalization is the mapping of the provenance interchange language to RDF graphs.
  • The working group will define the scope of FOO's formal semantics. Its intent is to disambiguate concepts to ensure interoperability.
  • A serialization to XML will help disseminate FOO to communities beyond the Semantic Web community.

Concepts

Dublin Core

Concepts from Dublin Core, using the prefix dc = http://purl.org/dc/terms/

  • dc:contributor - agent A contributed to resource R
    • e.g. "Report dc:contributor Alice" means 'The report had material contributed to it by Alice.'
  • dc:creator - agent A created resource R
    • e.g. "Report dc:creator Alice" means 'The report was created (written) by Alice.'
  • dc:hasPart - resource R1 has a part resource R2
    • e.g. "Report dc:hasPart DataPlot" means 'The report contains the data plot.'
  • dc:modified - resource R was modified at time T
    • e.g. "Report dc:modified 12:00" means 'The report was modified (edited) at 12:00.'
  • dc:replaces - resource R1 replaces R2 (for whatever implied use)
    • e.g. "ReportEdition2 dc:replaces ReportEdition1" means 'Edition 2 of the report should now be used instead of edition 1 of the report.'
  • dc:provenance - refers to a ProvenanceStatement about a resource that reflects any changes to it since its creation that are significant for its authenticity, integrity, and interpretation.
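
The example statements above can be sketched as plain (subject, predicate, object) triples. This is only an illustration under stated assumptions: the "ex:" names and the helper functions objects and current_edition are hypothetical and not part of Dublin Core; only the dc namespace and the term names come from the list above.

```python
# Namespace given in the text above.
DC = "http://purl.org/dc/terms/"

# Hypothetical resources ("ex:" names are invented for this sketch).
triples = [
    ("ex:Report", DC + "contributor", "ex:Alice"),
    ("ex:Report", DC + "creator", "ex:Alice"),
    ("ex:Report", DC + "hasPart", "ex:DataPlot"),
    ("ex:Report", DC + "modified", "12:00"),
    ("ex:ReportEdition2", DC + "replaces", "ex:ReportEdition1"),
]


def objects(subject, term):
    """Return the objects of all statements with this subject and dc term."""
    return [o for s, p, o in triples if s == subject and p == DC + term]


def current_edition(edition):
    """Follow dc:replaces links to the edition that is not itself replaced."""
    replaced_by = {o: s for s, p, o in triples if p == DC + "replaces"}
    while edition in replaced_by:
        edition = replaced_by[edition]
    return edition
```

For instance, current_edition("ex:ReportEdition1") follows the single dc:replaces statement and yields the second edition, which is the reading of dc:replaces given above.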

OPM

Concepts from the Open Provenance Model, using opm: as a shorthand prefix.

Graph

  • opm:OPMGraph
    • Definition: a provenance graph is defined to be a record of a past execution
    • Example: Bob's Website Factory provides proof in the form of a provenance graph that the contract was executed as agreed.
  • opm:Account
    • Definition: An account of some past execution. Accounts offer different levels of explanation for the same execution.
    • Example: Bob's Website Factory and Customers Inc provide two different and conflicting sets of information (i.e. accounts) describing the provenance of the production of the same website.

Nodes

  • opm:Artifact
    • Definition: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
    • Example: BlogAgg would like to know the state of an image before and after modification to see if it was modified appropriately
  • opm:Process
    • Definition: Action or series of actions performed on, or dependent upon, artifacts, and resulting in new artifacts.
    • Example: Alice collects data from public sources and "natural experiment" data. Alice then processes and interprets the results and writes a report summarizing the conclusions. All these steps should be captured.
  • opm:Agent (*1)
    • Definition: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, or affecting its execution.
    • Example: Alice starts and facilitates the use of the tool SPSS when doing data analysis.

Edges:

  • opm:Time (*2)
    • Example: BlogAgg wants to find the correct originator of the microblog who first got the word out.
  • opm:Role
    • Definition: A role designates an artifact’s or agent’s function in a process
    • Example: Whether a data file was used as a training or test data set when running machine learning algorithms.
  • opm:Used, opm:UsedStar
    • Definition: property to express that an artifact was used by a process.
    • Example: The panda image was used by BlogAgg to generate a thumbnail image.
  • opm:WasGeneratedBy, opm:WasGeneratedByStar
    • Definition: property to express that an artifact was generated by a process.
    • Example: A thumbnail image was generated by BlogAgg using the panda image.
  • opm:WasControlledBy (*1)
    • Definition: property to express that a process was controlled by an agent.
    • Example: SPSS was controlled by Alice.
  • opm:WasDerivedFrom, opm:WasDerivedFromStar
    • Definition: property to express that an artifact was derived from another artifact.
    • Example: The thumbnail image was derived from the panda image.
  • opm:WasTriggeredBy
    • Definition: property to express that a process was triggered by another process.
    • Example: Report writing was triggered by the interpretation of results.
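
The edges above can be sketched as a small graph, using the thumbnail example. This is a minimal sketch, assuming that the Star variant of WasDerivedFrom denotes the multi-step (transitive) form of the one-step edge; the node names, the extra "blogPage" artifact, and the function derived_from_star are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

# Edges point from effect to cause.
used = [("makeThumbnail", "pandaImage")]             # process used artifact
was_generated_by = [("thumbnail", "makeThumbnail")]  # artifact generated by process
was_derived_from = [
    ("thumbnail", "pandaImage"),
    ("blogPage", "thumbnail"),  # hypothetical extra artifact for the sketch
]


def derived_from_star(edges):
    """Map each artifact to every artifact it derives from in one or more steps."""
    direct = defaultdict(set)
    for effect, cause in edges:
        direct[effect].add(cause)

    closure = {}
    for start in list(direct):
        seen, stack = set(), list(direct[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                # .get avoids creating new keys in the defaultdict
                stack.extend(direct.get(node, ()))
        closure[start] = seen
    return closure
```

Here the blog page derives from the thumbnail in one step and from the panda image in two steps, so both appear in its multi-step set.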

Extensibility (*3):

  • Some form of annotation, based on predicate-value pairs.
    • Example: The data is of type customer sales records. The data has size 100 megabytes.
  • Profile mechanisms, including common types, common annotations, and common graph templates
    • Example: The image has a Creative Commons attribution license. This pattern represents the exchange of messages in the HTTP protocol.
  • (*) indicates terms that require refinement
  • (*1) Requires better, stricter guidelines for better interoperability
  • (*2) To be better aligned with the Time ontology
  • (*3) To be better specified to facilitate extensibility and to be better aligned with RDF-like annotations


Provenir Ontology

Available at: http://wiki.knoesis.org/index.php/Provenir_Ontology

  • provenir:part_of
    • Definition: This property is used to represent parthood relation between entities (both class and instance-level).
    • Example: A mass analyzer is part of a mass spectrometer
  • provenir:contained_in
    • Definition: This property is used to represent containment relation between entities.
    • Example: A temperature sensor is contained in an ocean buoy.
  • provenir:adjacent_to
    • Definition: Spatial proximity is represented by this property. It is defined only for agent class, where the adjacent spatial location of individuals of agent class may have an effect on data values.
    • Example: Quality of observations made by a sensor may be affected if it is adjacent to a sensor generating a magnetic field.
  • provenir:transformation_of
    • Definition: This property is similar to the ro:transformation_of property that is asserted between two entities that preserve their identity between the two transformation stages.
    • Example: A cancer cell is a transformation of a normal cell
  • provenir:preceded_by
    • Definition: This property is used to define a temporal ordering of processes, which may or may not be linked by a common artifact.
    • Example: Example from RO, aging preceded by development.
  • provenir:located_in
    • Definition: An instance of data or agent is associated with exactly one spatial region that is its exact location at a given instant of time.
    • Example: A sensor is located in a specific geospatial region at time instance t
  • provenir:has_temporal_value
    • Definition: This property is used to explicitly associate temporal value with individuals of Provenir classes.
    • Example: The duration of a liquid chromatography process has temporal value 20 minutes.
  • provenir:preceded_by *
    • Definition: Defines a temporal (and causal or non-causal) property for distinct instances of provenir:process.
    • Example: A researcher starts a process to send email about the status of a (long-running) experiment process. The notification process is preceded by the experiment process.
  • provenir:has_participant @
    • Definition: Property linking data to process, where the individual of data class participates in a process.
    • Example: Trypsin enzyme (used to digest protein sample) participates in a proteome analysis experiment
  • provenir:derives_from $
    • Definition: Property represents the derivation history of data entities as a chain or pathway.
    • Example: The average rainfall (specific to geospatial-temporal instance) is derived from sensor readings.
  • provenir:temporal_parameter &
    • Definition: This class captures the temporal details associated with individuals of provenir:data_collection, provenir:process, and provenir:agent.
    • Example: The timestamp associated with a sensor reading
    • Example: The duration of a protein analysis process
    • Example: The time period during which a sensor was working correctly
  • provenir:spatial_parameter
    • Definition: The spatial metadata associated with instances of provenir:process or provenir:agent or provenir:data_collection classes is represented by this class.
    • Example: The geographical location of an ocean buoy is an example of spatial parameter.

Notes:
* Unlike opm:wasTriggeredBy, the provenir:preceded_by property links processes that may or may not be causally dependent.
@ Unlike opm:used, provenir:has_participant may or may not represent an existential relationship between the provenir:data and provenir:process, in other words the provenir:process may or may not require the existence of the provenir:data to initiate/terminate.
$ Unlike opm:wasDerivedFrom, provenir:derives_from may or may not represent an existential relationship between entities.
& Extensions of the Provenir ontology, such as the Janus ontology for Taverna and the Parasite Experiment ontology for biomedicine, use OWL-Time ontology terms to represent temporal notions.

The following Provenir terms were mapped to OPM terms during the mapping exercise, but often represented broader notions of provenance (see the mapping wiki for details). These terms need to be considered during the refinement of the corresponding OPM terms:

  • provenir:data
    • Definition: This class models BFO continuant entities that represent the starting material, intermediate material, end products of a scientific experiment, and parameters that affect the execution of a scientific process. Data inherit the properties of continuants such as enduring or existing while undergoing changes.
    • Example: A protein sample, digested with trypsin proteolytic enzyme, used as input in a proteome analysis experiment.
  • provenir:process
    • Definition: This class models the occurrent entities that affect (process, modify, create, delete among other dynamic activities) individuals of data.
    • Example: The proteome analysis experiment is a process, and its constituent steps are also processes.
  • provenir:agent
    • Definition: This class models the continuant entities that causally affect the individuals of process.
    • Example: The researcher performing the proteome analysis experiment and the microarray instrument used in the experiment are agents.

Basic Formal Ontology

obo: http://purl.obolibrary.org/obo/

Concepts

  • "process" (obo:BFO_0000007)
  • "role" (obo:BFO_0000023)
  • "continuant" (obo:BFO_0000002) and its subclasses
  • "temporal region" (obo:BFO_0000008)
  • "function" (obo:BFO_0000034) for "recipes"

There are also some relations of interest there:

  • "begins to exist during" (obo:BFO_0000068)
  • "ceases to exist during" (obo:BFO_0000069)
  • "participates in" (obo:BFO_0000056) / "has participant" (obo:BFO_0000057)
  • "is granular part of process" (obo:BFO_0000074) for relating processes to each other.

PML

The list below is restricted to PML concepts that do not overlap with existing OPM concepts. To keep the list small, we included neither the properties of these concepts nor some of their specializations.

  • pmlp:IdentifiedThing: The abstract root of provenance-related concepts. It organizes a collection of common metadata about the referenced object, and it does not have any instances.
    • pmlp:InferenceRule: It is the recipe of a process. We can say that it is the rule applied on the input information of a process execution and used to derive the product (or conclusion) of the process execution. In the Cake scenario, it is the recipe for Bake.
      • pmlp:DeclarativeRule: It is an inference rule (or recipe) that describes the logic of the transformation of input data into a product without specifying how the transformation occurs. It is often used for representing formal inference rules, including deductive and inductive rules.
      • pmlp:MethodRule: It is an inference rule that describes how a product is derived from input information (e.g., an algorithm that describes how its result is derived from the algorithm’s arguments). This kind of inference rule is also used to represent named recipes where the exact way input information is transformed is unknown (e.g., “black boxes”).
 Recipe - a link between provenance and the plan/recipe that was being followed (a description of the nature of the process
 being executed). What that recipe is seems to differ across domains/systems (a workflow template, logical rules, a
 mathematical function, a scientific experiment protocol, a business contract, etc.), but the basic capability to link a
 process to its recipe seems like a useful and relatively non-controversial extension that a working group could address.
 PML has such a construct, as do other languages represented/analyzed in the group. The business scenario where provenance
 is to be compared with the contract (the recipe for what was supposed to occur) is a use case for this. PML inferences,
 scientific workflow systems, etc. provide others.
    • pmlp:Source (it is a generalization of opm:Agent): It is an identified thing from where we obtain information
      • pmlp:Agent: An actionable entity capable of asserting information
      • pmlp:Document: A physical information container that is not actionable. It functions like a database.
      • pmlp:DocumentFragment: A fragment of document that can be used as source
 Sources - the idea of an agent or mutable resource from which a resource of interest (the thing we're documenting the
 provenance of) comes. Nominally this could be dealt with by recording an agent controlling a publication process to produce
 the resource. The question to resolve is whether a special construct would be useful, since an article derived from the
 NYTimes differs in importance from the same article being handed to you by Joe the newspaper seller, even though both are
 just agent-process-resource constructs. With others in the XG group having special constructs for publication/retrieval
 from a service, it seems like consensus might be possible on this, and I think having discussion of this be part of the
 working group scope would be useful.
    • pmlp:SourceUsage: it is the connection between a source (i.e., a mutable identified thing such as an agent or document) and information (i.e., immutable things) obtained from the source
 Versioning - a connection between mutable resources and the 'versions' of them that are affected by the processes being
 documented. Something as simple as a 'hasVersion' link (e.g. as in WebDAV and other versioning models) might suffice, though a
 link such as process 'isPartOfLifecycleOf' resource might also be useful. This is a hard problem in the general case (thousands
 of years of discussion around continuants and occurrents), but some extension to standardize how one might link resources to a
 mutable thing as versions might be something that could be agreed to. A website URI whose content changes, a document that has
 versions going through edits, etc. are good use cases.
  • JustificationElement:
    • pmlp:NodeSet: The justification collection for a resource is a directed acyclic graph of node sets connected by inference steps. Each node set has a conclusion and any number of inference steps, including zero. We have speculated whether opm:Account is a mechanism for alternative or complementary provenance, which is why we keep this concept in the list.
    • pmlp:Query (no OPM equivalent): It is a formal representation of a user's question. For example, the interest of a customer in a cake was triggered by the following request from the customer: “What are the desserts available today in your restaurant?”
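
The Versioning note above suggests that a simple 'hasVersion' link between a mutable resource and its immutable versions might suffice. A minimal sketch of that idea follows; the class MutableResource, its methods, the "#v1"-style version identifiers, and the example URI are all hypothetical and not taken from PML or any other model discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class MutableResource:
    """A mutable thing (e.g. a website URI) linked to its immutable versions."""
    uri: str
    versions: list = field(default_factory=list)  # (version_id, content), oldest first

    def add_version(self, content: str) -> str:
        """Record a new immutable snapshot and return its identifier."""
        version_id = f"{self.uri}#v{len(self.versions) + 1}"
        self.versions.append((version_id, content))
        return version_id

    def has_version(self, version_id: str) -> bool:
        """The 'hasVersion' link: does this resource have the given version?"""
        return any(vid == version_id for vid, _ in self.versions)


page = MutableResource("http://example.org/page")
first = page.add_version("initial text")
second = page.add_version("edited text")
```

Provenance statements about edits would then refer to the version identifiers, while the URI of the mutable resource stays stable.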

The Provenance Vocabulary

The namespace of prv is http://purl.org/net/provenance/ns#

  • prv:Actor - It is broader than opm:Agent. Each opm:Agent is directly related to a process (OPM defines opm:Agent as "a catalyst of a process"). A prv:Actor can be basically any active entity. This includes entities that are directly involved in the processes described (as represented by opm:Agent) but also entities that are not directly involved (e.g. the person who maintains the Web server that served a prv:DataItem in a prv:DataAccess execution).
  • prv:involvedActor - prv:involvedActor refers to active entities that were somehow involved in the execution of a process. It is broader than opm:wasControlledBy because this involvement does not necessarily mean that the referent was responsible for controlling the execution.
  • prv:containedBy - refers to a data item that contains the described data item.
  • prv:operatedBy - refers to a human actor who was operating a non-human actor at the time the provenance description refers to. OPM does not have any properties between opm:Agent instances.
  • prv:usedBy - refers to a data publisher (a human actor) who used a data-providing service (a non-human actor) at the time the provenance description refers to. Again, OPM does not have properties between opm:Agent instances.
  • prv:wasPerformedAt - the time an execution has been performed at. The range is an xsd:dateTime.
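
Because the range of prv:wasPerformedAt is an xsd:dateTime, executions can be ordered in time once the values are parsed. The sketch below assumes the values are written with an explicit numeric time-zone offset, which Python's datetime.fromisoformat accepts; the full xsd:dateTime lexical space is wider than what this handles. The "ex:" execution names, the timestamps, and the helper functions are hypothetical.

```python
from datetime import datetime

# Hypothetical executions with prv:wasPerformedAt values.
executions = {
    "ex:access1": "2010-06-01T12:00:00+00:00",
    "ex:access2": "2010-06-01T09:30:00+02:00",
}


def performed_at(execution):
    """Parse the recorded time of an execution into a time-zone-aware datetime."""
    return datetime.fromisoformat(executions[execution])


def earliest(names):
    """Return the execution that was performed first."""
    return min(names, key=performed_at)
```

Comparing parsed values rather than raw strings matters here: the second timestamp has the earlier wall-clock reading only because of its offset, and it is also the earlier instant once both are compared as absolute times.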