Draft Charter

From XG Provenance Wiki

W3C Provenance Interchange Working Group Draft Charter

(November 29, 2010)


The mission of the Provenance Interchange Working Group is to support the widespread publication and use of the provenance of Web documents, data, and resources. It will define a language for exchanging provenance, and publish concrete specifications of the language using existing W3C standards (RDFS, OWL, XML,...).

1. Background

The W3C Incubator Group on Provenance (http://www.w3.org/2005/Incubator/prov/) has identified rapidly growing needs for provenance in social, scientific, industry, and government contexts, involving data and information integration across the Web. Provenance is unique in that it inherently draws on distributed information and thus collecting it and making sense of it require consulting different heterogeneous systems.

Over time, multiple techniques to capture and represent various forms of provenance have been devised, and are sometimes known under the names of lineage, pedigree, proof, or traceability. As noted in the Incubator's state-of-the-art report, the lack of a standard model is a significant impediment to realizing applications that rely on provenance. This matters because provenance is key to establishing trust in documents, data, and resources. However, the Incubator's work also indicates that many provenance models exist with significantly different expressivity, fundamentally different assumptions about the system they are embedded in, and radically different performance impact. The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today.

A pragmatic approach is to consider a core provenance language and extension mechanisms that allow any provenance model to be translated into such a lingua franca and exchanged between systems. Heterogeneous systems can then export their provenance into such a core language, and applications that need to make sense of provenance in heterogeneous systems can then import it and reason over it.
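The export/import idea in the paragraph above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two record layouts, the function names, and the placeholder term `core:derivedFrom` are hypothetical and are not part of any language the Working Group has defined.

```python
# Hypothetical sketch: two systems with different internal provenance
# models export to one shared core vocabulary, and a consumer reasons
# over the merged result. "core:derivedFrom" is a placeholder term.

CORE_DERIVED_FROM = "core:derivedFrom"


def export_from_system_a(record):
    """System A (assumed) stores lineage as {"output": ..., "inputs": [...]}."""
    return [(record["output"], CORE_DERIVED_FROM, src) for src in record["inputs"]]


def export_from_system_b(rows):
    """System B (assumed) stores pedigree as (parent, child) rows."""
    return [(child, CORE_DERIVED_FROM, parent) for parent, child in rows]


def import_core(*exports):
    """Merge core statements without knowing either source model."""
    merged = set()
    for statements in exports:
        merged.update(statements)
    return merged


def sources_of(statements, resource):
    """Return the direct sources of a resource, sorted by name."""
    return sorted(
        o for s, p, o in statements if s == resource and p == CORE_DERIVED_FROM
    )


a = export_from_system_a({"output": "report", "inputs": ["dataset1", "dataset2"]})
b = export_from_system_b([("raw_feed", "dataset1")])
merged = import_core(a, b)
print(sources_of(merged, "report"))
print(sources_of(merged, "dataset1"))
```

The consumer only needs to understand the core statements; each system keeps its own internal model and supplies a translation at the boundary.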

2. Scope

The Provenance Interchange Working Group has the following objectives:

  • define a provenance interchange language and methods to publish and access provenance;
  • the scope of this language will be any resource, not just Semantic Web objects;
  • the provenance interchange language should have a low entry point to facilitate widespread adoption; it should therefore be easy to do simple things;
  • it should have a small core model and allow for extensions (i.e., profiles, integration of other more expressive/complementary vocabularies/frameworks);
  • the WG should release some deliverables early and end in 18 months.

3. Deliverables and Schedule

3.1 Deliverables

By FOO, we refer to the provenance interchange language to be defined, and the inferences allowed over it. The working group will select an appropriate name for the language.

  • D1. Conceptual Model (W3C Recommendation). This document consists of a natural language description and a graphical illustration of concepts involved in FOO. Such a document will help broaden the appeal of provenance beyond the community of technical experts.
  • D2. Formal Model (W3C Recommendation). The purpose of this document is to provide a normative formalization of the conceptual model, making use of Semantic Web languages beginning with RDFS and OWL.
  • D3. Formal Semantics (W3C Note, optional). This optional note consists of a mathematical definition of FOO. It will focus on facets of formalization that have not been captured in the formal model.
  • D4. Accessing and Querying Provenance (W3C Note). This document specifies how provenance can be accessed or queried in embedded documents and from remote services. Specifically, it defines how to access provenance embedded in an HTML document using RDFa, how to access provenance from a service by means of HTTP, and how to query provenance through a SPARQL endpoint.
  • D5. Guidelines for producing XML of the model (W3C Note). This document specifies an XML serialization for FOO.
  • D6. Best Practice Cookbook (W3C Note). This document includes a limited set of best practice profiles that link with other relevant models, such as Dublin Core provenance-related concepts, licensing in Creative Commons, and the OpenID identity mechanism for people.
  • D7. Primer (W3C Note). This educational document provides users with an easy-to-understand description of the model.

Comments about the deliverables:

  • The conceptual model (D1) and the formal model (D2) will be developed in parallel, ensuring that concepts can be formalized adequately and, vice versa, that the formalization is explained intuitively.
  • The Working Group is committed to formalizing the provenance interchange language using RDFS and OWL in the first instance (D2). Depending on the kinds of inferences to be supported, other Semantic Web languages may also be considered, where appropriate. A by-product of this formalization is the mapping of the provenance interchange language to RDF graphs.
  • The Working Group will consider defining FOO's formal semantics (D3). Its intent is to disambiguate concepts to ensure interoperability; the Working Group will specify its exact scope.
  • A serialization to XML (D5) will help disseminate FOO to communities beyond the Semantic Web community.

3.2 Milestones

Reports will undergo the W3C development process: Working Draft (WD), Working Draft in Last Call (LC), Candidate Recommendation (CR), Proposed Recommendation (PR) and Recommendation (Rec).

Specification   FPWD   LC     CR     PR     Rec
D1              T+6    T+9    T+12   T+15   T+18
D2              T+6    T+9    T+12   T+15   T+18
D3 (Optional)   T+12   T+18   n/a    n/a    n/a
D4              T+9    T+15   n/a    n/a    n/a
D5              T+9    T+12   n/a    n/a    n/a
D6              T+15   T+18   n/a    n/a    n/a
D7              T+12   T+18   n/a    n/a    n/a

4. Provenance Concepts

The Working Group will leverage the activities of the Incubator Group on provenance, its understanding of the state-of-the-art, its extensive requirements capture, its use cases and flagship scenarios, and its mapping of provenance vocabularies.

Drawing on existing vocabularies/ontologies (namely: Changeset Vocabulary, Dublin Core, Open Provenance Model (OPM), PREMIS, Proof Markup Language (PML), Provenance Vocabulary, Provenir ontology, SWAN Provenance Ontology, Semantic Web Publishing Vocabulary, WOT Schema), a set of concepts has been identified to constitute the core of a standard provenance interchange language. The number of concepts is intentionally limited, so as to ensure a cohesive and tractable core. Other concepts can be relevant to provenance, but it is anticipated that those would be defined by means of the extension mechanism of the provenance interchange language.

In the following list, the names appearing as titles are used for intuition. Concepts with similar intuition in existing vocabularies are provided.

  1. Resource: includes both static and dynamic (immutable or mutable) resources; the WG can decide whether to subclass this concept and make the distinction explicit.
    • opm:Artifact, pmlp:IdentifiedThing, provenir:data, "continuant" (obo:BFO_0000002), pmlp:Document, pmlp:DocumentFragment
      • Example: BlogAgg would like to know the state of an image before and after modification to see if it was modified appropriately
    • may include a user query (e.g., pmlp:Query)
  2. Process execution: refers to execution of a computation, workflow, program, service, etc. Does not refer to a query.
    • opm:Process, provenir:process, "process" (obo:BFO_0000007)
      • Example: Alice collects data from public sources and "natural experiment" data. Alice then processes and interprets the results and writes a report summarizing the conclusions. All these steps should be captured.
  3. Recipe link: a standard way to refer (point) to a recipe. We will not define what the recipe is, and standard ways to describe recipes are out of scope.
    • pmlp:InferenceRule, pmlp:DeclarativeRule, pmlp:MethodRule, "function" (obo:BFO_0000034)
      • Example: Alice is processing data and executes a linear regression implementation as one of the steps; the recipe could refer to a linear regression algorithm.
  4. Agent: entity (human or otherwise) involved in the process execution. An agent can be the creator or a contributor.
    • opm:Agent, provenir:agent, prv:Actor, pmlp:Agent
      • Example: Alice starts and facilitates the tool SPSS when doing data analysis.
  5. Role
    • opm:Role, "role" (obo:BFO_0000023)
      • Example: Whether a data file was used as a training or test data set when running machine learning algorithms.
  6. Location: a link to a description of location. Defining how spatial information is represented is out of scope; the language will point to an existing ontology.
    • provenir:spatial_parameter, provenir:located_in, provenir:adjacent_to
      • Example: The location where the disease was declared.
  7. Derivation
    • opm:WasDerivedFrom, opm:WasDerivedFromStar, provenir:derives_from
      • Example: The thumbnail image was derived from the panda image.
  8. Generation
    • opm:WasGeneratedBy, opm:WasGeneratedByStar
      • Example: A thumbnail image was generated by BlogAgg using the panda image.
  9. Use
    • opm:Used, opm:UsedStar, prv:usedBy
      • Example: The panda image was used by BlogAgg to generate a thumbnail image.
      • Example: John Markoff used SPSS.
  10. Ordering of Processes
    • opm:WasTriggeredBy, provenir:preceded_by, provenir:preceded_by*
      • Example: Report writing was triggered by the interpretation of results.
      • Example: Bob, a researcher of the flu epidemic, starts a process to send email about the status of a (long-running) experiment process. The notification process is preceded by the experiment process.
  11. Version
    • dc:replaces, provenir:transformation_of
      • Example: When Alice releases a new report this would express that this version should be used rather than the previous one.
      • Example: Alice consults a website URI whose content changes over time, a document that has versions going through edits, etc.
  12. Participation
    • provenir:has_participant, "participates in" (obo:BFO_0000056), "has participant" (obo:BFO_0000057), prv:involvedActor
      • Example: Alice participates in reviewing a paper (NEED BETTER EXAMPLE HERE)
  13. Control: a subclass of participation. Related to this is a notion of "responsibility": an entity that stands behind the artifact that was produced (Alice controls the process, but the organization that she worked for is responsible, so that even after she leaves, the organization is still responsible). This may be a useful shortcut to add.
    • opm:WasControlledBy, prv:operatedBy
      • Example: SPSS was operated by Alice.
  14. Provenance Container
    • opm:OPMGraph, dc:provenance, pmlp:NodeSet
      • Example: Bob's Website Factory provides proof in the form of a set of provenance statements that the contract was executed as agreed.
  15. Views or Accounts
    • opm:Account
      • Example: Bob's Website Factory and Customers Inc provide two different and conflicting sets of information (i.e., accounts) describing the provenance of the production of the same website.
  16. Time
    • opm:Time, opm:Used, opm:WasGeneratedBy, opm:WasDerivedFrom, opm:WasControlledBy, prv:wasPerformedAt, dc:modified, provenir:has_temporal_value, provenir:temporal_parameter, "begins to exist during" (obo:BFO_0000068), "ceases to exist during" (obo:BFO_0000069), "temporal region" (obo:BFO_0000008)
      • Example: BlogAgg wants to find the correct originator of the microblog who first got the word out.
      • Example: Example from RO, aging preceded by development.
      • Example: The duration of a liquid chromatography process has temporal value 20 minutes.
      • Example: The timestamp associated with a sensor reading.
      • Example: The duration of a protein analysis process.
      • Example: The time period during which a sensor was working correctly.
      • Example: "Report dc:modified 12:00" means 'The report was modified (edited) at 12:00.'
  17. Collections: SHOULD BE A LIGHTWEIGHT NOTION. Mainly focused on "part of" relationships. Might be treated as a resource ultimately.
    • prv:containedBy, provenir:contained_in, dc:hasPart
      • Example: A mass analyzer is part of a mass spectrometer.
      • Example: A temperature sensor is contained in an ocean buoy.
      • Example: "Report dc:hasPart DataPlot" means 'The report contains the data plot.'
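
As a rough illustration of how several of the concepts above (Derivation, Generation, Use, Control) might fit together, the BlogAgg thumbnail example can be written as a handful of statements. This is a sketch under stated assumptions: the predicate names simply echo the concept titles and are not terms of the language to be defined, and `make_thumbnail` and `raw_photo` are hypothetical names added so that the transitive form of derivation (compare opm:WasDerivedFromStar) has something to traverse.

```python
# Illustrative only: predicate names are placeholders taken from the
# concept titles, not vocabulary defined by the Working Group.

statements = [
    ("thumbnail", "wasDerivedFrom", "panda_image"),     # Derivation
    ("thumbnail", "wasGeneratedBy", "make_thumbnail"),  # Generation (hypothetical process name)
    ("make_thumbnail", "used", "panda_image"),          # Use
    ("make_thumbnail", "wasControlledBy", "BlogAgg"),   # Control
    ("panda_image", "wasDerivedFrom", "raw_photo"),     # hypothetical earlier step
]


def objects(subject, predicate):
    """Return all objects of statements matching subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


def derived_from_star(resource):
    """Follow wasDerivedFrom links transitively, in discovery order."""
    seen = []
    stack = [resource]
    while stack:
        for source in objects(stack.pop(), "wasDerivedFrom"):
            if source not in seen:
                seen.append(source)
                stack.append(source)
    return seen


print(objects("thumbnail", "wasGeneratedBy"))
print(derived_from_star("thumbnail"))
```

A flat list of subject, predicate, object statements is used here only because it is the simplest structure that shows the relationships; the charter itself commits to RDFS and OWL for the normative formalization.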