Connection Task Force Informal Report

Connection Task Force

Purpose and Scope of Report

Source: W3C Provenance Working Group

Authors: Eric Stephan, Kai Eckert, Stephen Cresswell, Yolanda Gil, Mike Linksvayer, Alexander Kroener, Carl Reed, Christine Runnegar

Release Date: September 29, 2011.

This Connection Task Force Informal Report is a deliverable of the first W3C Provenance Working Group face-to-face meeting, held in Boston on July 6-7, 2011. At the meeting the Connection Task Force shared its preliminary results:

  • a catalog documenting an initial set of prospective connections
  • a presentation of the methods used to develop the catalog, for which preliminary feedback was provided.

One concern shared by the task force was how to direct our future connection efforts. A consensus was reached to prepare this informal report based on four targeted focus areas: Communities Addressing Important Issues in Provenance, Groups within W3C Developing Related Standards, Groups within W3C for Special Interest Areas, and Other Groups in Special Interest Areas. It was agreed that the task force would focus on establishing relationships with other groups and implementers, and on keeping them ready to review our first public working drafts as they are released at T+6. The report is meant to be an initial, useful collection of descriptions and links to other efforts that can help other groups interfacing with the Provenance WG, as well as a means to help the Provenance WG understand additional needs, conflicts and promising fields for collaboration.

Task Force Charter

The charter of the Connection Task Force is still emerging. At T+6, its purpose is to help the Provenance Working Group be aware of provenance needs and practices in different high-impact communities, and to serve as an outreach capability that connects to these communities and makes them aware of products emerging from the working group.

Connectivity Approaches to Community Outreach and Feedback Mechanisms

The Provenance WG membership represents diverse communities, experiences and interests relating to provenance. Most community outreach is initiated through direct contact by a Provenance WG member who is either a member of the external community or has an active collaboration with that community. In a few cases, the Connection Task Force acknowledges the existence of a relevant community but no direct contact currently exists. In these cases the Connection Task Force documents the existence of the relevant community and makes every effort to establish a future direct connection.

The connections are classified based on the following scheme:

  • 1-star *: We have a link to a connection together with a very short summary (maybe only one sentence). This is the basis of the report and we try to collect as much as possible, include contributions by others and ask for missing links before the report is finished.
  • 2-star **: We extend the description to the best of our knowledge and identify interesting links between a connection and our WG. We will do this for some of the connections, prioritized by importance for the WG and personal interests.
  • 3-star ***: We contact the connection and get our description extended, reworked and especially approved by a representative of the connection. We hope that this will be accomplished by contacting single persons or a call for contributions via mailing lists.
  • 4-star ****: We find and name a person for the connection that writes and/or approves the description of the connection and is willing to function as a bridge between the WG and the connection to make sure that conflicts are addressed and to foster communication and collaboration between the groups. We would like to have these bridge persons especially for other W3C groups.
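
The scheme above can be read as a cumulative checklist, where each level presupposes the ones below it. The following is a minimal sketch of that reading, not a tool used by the task force; the `Connection` record, its field names and the `star_rating` function are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Connection:
    """Hypothetical catalog entry; all field names are illustrative assumptions."""
    name: str
    link: Optional[str] = None                 # link plus short summary (1-star)
    extended_description: bool = False         # description extended by the WG (2-star)
    approved_by_representative: bool = False   # approved by the connection (3-star)
    bridge_person: Optional[str] = None        # named bridge person (4-star)


def star_rating(c: Connection) -> int:
    """Return 0-4 stars, assuming each level requires all lower levels."""
    checks = [
        bool(c.link),
        c.extended_description,
        c.approved_by_representative,
        bool(c.bridge_person),
    ]
    stars = 0
    for passed in checks:
        if not passed:
            break  # a missing lower level caps the rating
        stars += 1
    return stars


cc = Connection(
    name="Creative Commons",
    link="http://creativecommons.org/",
    extended_description=True,
    approved_by_representative=True,
    bridge_person="Mike Linksvayer",
)
print(star_rating(cc))
```

Under this reading, an entry that has a link and an approval but no extended description would still count as 1-star, because the levels are treated as strictly cumulative. That strictness is an assumption of the sketch.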


This informal report should be treated as preliminary findings with varying degrees of detail, depending on the report contributors' knowledge, their direct access to each community, or in some cases a community's familiarity with provenance. It should also be noted that, at the time of this report, the provenance model and all aspects related to its implementation are still evolving within the Provenance Working Group. In the future, we anticipate writing a follow-up report to include mappings of existing community vocabularies to PROV, proposed extensions to PROV, and new implementations using PROV, written in a similar fashion to the previous analysis undertaken by the W3C Provenance XG.

Target Communities

As noted in the Purpose and Scope, this report targets four major focus areas: Communities Addressing Important Issues in Provenance, Groups within W3C Developing Related Standards, Groups within W3C for Special Interest Areas, and Other Groups in Special Interest Areas.

Communities Addressing Important Issues in Provenance

Creative Commons: Using Provenance in the Context of Sharing Creative Works (****)

Creative Commons (CC) provides licenses and public domain tools that can be used for any kind of creative work, such as texts, images, websites, or other media, as well as databases. CC tools are well known and widely used, especially in online publications. Each CC license and public domain tool is identified by a unique URL, allowing proper identification and reference of these as part of a work's provenance information.

Additionally, Creative Commons provides a vocabulary to describe its tools and works licensed or marked with those tools in a machine interpretable way: The Creative Commons Rights Expression Language (CC REL). CC REL can be expressed in RDF.

The provenance of assertions about a work's license or public domain status is of great importance for licensors, licensees, curators, and future potential users. All CC licenses legally require that certain information (attribution and license notice) be retained; even in the case of its public domain tools, retaining such information is a service to readers and in accordance with research and other norms. To the extent that license and related information is not retained or cannot be trusted, users' ability to find and rely upon freedoms to use such works is degraded. In many cases, the original publication location of a work will disappear (linkrot) or rights information will be removed, either unintentionally (eg template changes) or intentionally (here especially, provenance is important; CC licenses are irrevocable). In the degenerate case, a once CC-licensed work becomes just another orphan work.

The core statements needed are: who licensed, dedicated to the public domain, or marked as being in the public domain, which work, and when? Each of these statements has sub-statements, eg the relationship of "who" to rights in the work or knowledge about the work, and exactly what work and at what granularity?
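
As a rough illustration of the who / which work / when statements, the sketch below serializes one license assertion as Turtle. It is a minimal sketch under stated assumptions: `cc:license` and `cc:attributionName` are CC REL terms, but the use of `dcterms:date` for the "when", the `LicenseAssertion` record, and the example.org work URL are illustrative choices, not something the report or CC REL prescribes. The statements are attached directly to the work, so the sketch does not model the provenance of the assertion itself.

```python
from dataclasses import dataclass


@dataclass
class LicenseAssertion:
    """Hypothetical record of one licensing statement."""
    who: str          # name of the party making the assertion
    work: str         # URL identifying the work
    license_url: str  # URL of the CC license or public domain tool
    when: str         # ISO 8601 date of the assertion


def to_turtle(a: LicenseAssertion) -> str:
    """Serialize the assertion as Turtle (no escaping of quotes in `who`)."""
    return "\n".join([
        "@prefix cc: <http://creativecommons.org/ns#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{a.work}>",
        f"    cc:license <{a.license_url}> ;",
        f'    cc:attributionName "{a.who}" ;',
        f'    dcterms:date "{a.when}" .',  # assumption: dcterms:date carries the "when"
    ])


example = LicenseAssertion(
    who="Example Author",
    work="http://example.org/works/photo-1",
    license_url="http://creativecommons.org/licenses/by/3.0/",
    when="2011-09-29",
)
print(to_turtle(example))
```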

Provenance information is also necessary for discovering the uses of shared works and building new metrics of cultural relevance, scientific contribution, etc, that do not strictly rely on centralized intermediaries.

Finally, in CC's broader context, an emphasis on machine-assisted provenance aligns with renewed interest in copyright formalities (eg work registries), puts a work's relationship to society's conception of knowledge in a different light (compare intellectual provenance and intellectual property), and is in contrast with technical restrictions which aim to make works less useful to users rather than more.

Mike Linksvayer acts as a bridge person between Creative Commons and the Provenance WG.

Provenance Supporting Identity Management (IdM) (***)

On the Internet, an individual’s identity (i.e. a set of characteristics that can define that individual) is often represented by digital credentials (e.g. a username/password combination, signed certificate, etc.).

Credentials can be “proofed” (i.e. validated as being accurate) to different levels of assurance. “An assurance level describes the degree to which a relying party in an electronic business transaction can be confident that the identity information being presented by a [credential service provider] actually represents the entity named in it and that it is the represented entity who is actually engaging in the electronic transaction” (ref). The IdM community generally agrees with (or at least recognizes as reasonable) the set of 4 levels of assurance defined by the US National Institute of Standards and Technology (NIST), namely:

  • Level 1 - Little or no confidence in the asserted identity’s validity
  • Level 2 - Some confidence in the asserted identity’s validity
  • Level 3 - High confidence in the asserted identity’s validity
  • Level 4 - Very high confidence in the asserted identity’s validity

(See NIST)

Once an identity has been proofed to a level of assurance by a credential service provider, it may be used by one or more service providers to identify the user with whom they are interacting. Through their interaction with the user, these service providers may discover additional attributes (i.e. information about the individual), which could be useful when the user interacts with other service providers. However, while they can use the assurance level to gauge the validity of the identity as a whole (i.e. that the holder of the credentials is the same person to whom they were issued), they have little or no guidance as to the confidence they can place in the validity of specific assertions. Knowing the source of the assertions could help the receiving service provider decide how much confidence to place in the asserted attributes.

One possible solution that has started to emerge in discussions within the IdM community is to state the provenance of a given assertion, i.e. to tag the attribute with its source so that the relying party can decide whether or not to trust that source. Going a step further, the provenance field could perhaps be cryptographically validated (i.e. digitally signed) to establish who asserted the attribute.
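
The idea of tagging an attribute with its source and then validating that tag can be sketched as follows. This is an illustration only, not a proposal from the IdM community: it uses an HMAC with a shared secret from the Python standard library as a stand-in for a real public-key digital signature, and the function names, attribute name and source identifier are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(name: str, value: str, source: str) -> bytes:
    """Canonical byte encoding of the attribute and its asserted source."""
    return json.dumps(
        {"name": name, "value": value, "source": source}, sort_keys=True
    ).encode("utf-8")


def tag_attribute(name: str, value: str, source: str, key: bytes) -> dict:
    """Attach the asserting source and a MAC over (name, value, source).

    Assumption: an HMAC stands in for a digital signature; a real deployment
    would use the asserting party's private key instead of a shared secret.
    """
    tag = hmac.new(key, _payload(name, value, source), hashlib.sha256).hexdigest()
    return {"name": name, "value": value, "source": source, "tag": tag}


def verify_attribute(attr: dict, key: bytes) -> bool:
    """Check that the attribute and its source were not altered after tagging."""
    expected = hmac.new(
        key, _payload(attr["name"], attr["value"], attr["source"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, attr["tag"])


key = b"shared-secret-for-illustration-only"
attr = tag_attribute("age_over_18", "true", "registry.example", key)
print(verify_attribute(attr, key))                         # intact attribute
print(verify_attribute(dict(attr, value="false"), key))    # tampered value
```

A relying party would still need its own policy for which sources it trusts; verification only shows that the stated source and value have not been changed since tagging.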

The Connection Task Force suggests the W3C Provenance Working Group connect with the Security Services (SAML) TC at the Organization for the Advancement of Structured Information Standards (OASIS), the OAuth WG at the Internet Engineering Task Force (IETF), and the OpenID Connect WG at the OpenID Foundation.

Provenance Supporting Privacy (***)

Regulatory and self-regulatory approaches to the protection of personal data (i.e. information relating to an identified or identifiable individual, as defined by the OECD Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data ) rely on the principle of accountability. In the OECD Guidelines, this is described as follows:

“A data controller should be accountable for complying with measures which give effect to the [national privacy principles]”.

Those principles are:

  • collection limitation principle
  • data quality principle
  • purpose specification principle
  • use limitation principle
  • security safeguards principle
  • openness principle
  • individual participation principle

The APEC Privacy Framework contains a similar principle. Further, APEC is currently working on the APEC Data Privacy Pathfinder, “a set of collaborative projects … to develop and test the essential practical elements of a system that would enable accountable cross-border data flows under the guidance of APEC data privacy principles.…” ref

The accountability principle was examined closely by The Galway Accountability Project, an international group of experts convened in 2009 “… to define the essential elements of accountability and consider how an accountability approach to information privacy protection would work in practice.” ref One of the essential elements the Galway Project identified was “systems for internal, ongoing oversight and assurance reviews and external verification”. ref

Provenance may have a role to play in recording and communicating how an entity handles “personal data” through the data lifecycle (collection, storage, access, use, disclosure and deletion), thereby allowing the entity to more easily demonstrate compliance.

Some recognition of the potential role that provenance could play with respect to privacy is to be found in the OECD paper - The Evolving Privacy Landscape: 30 Years After the OECD Privacy Guidelines :

"... These standards and tools can record and describe the actual lifecycle of personal data collected and held by an organisation (such as, provenance) and may assist organisations’ management of personal data and facilitate accountability. ..."

Accordingly, the Connection Task Force suggests the W3C Provenance Working Group connect with the W3C Privacy Interest Group (to be formed) and privacy groups that may be interested in developing accountability-based approaches to data protection.

Privacy Considerations Associated With Provenance (***)

Provenance data may, in some circumstances, contain information relating to an identified or identifiable individual (i.e. “personal data” as defined by the OECD Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data ).


  • Provenance data for a photograph may include the following personal data – the name of the individual who created the photograph; the date and time the photograph was taken; GPS coordinates of the location of the subject of the photograph; the names of any individuals appearing in the photograph; the reason(s) the photographer created the photograph; etc.

Provenance data that is “personal data” may be subject to privacy and data protection laws and/or non-binding guidelines. Such laws and guidelines typically prescribe how and in what circumstances personal data may be collected, stored, accessed, used and disclosed.

In addition to legal solutions for the protection of personal data, there is increasing interest in using technology to achieve this goal. This has led to the development of concepts such as “privacy-by-design”, “privacy enhancing technologies” and “privacy-by-default”. Note that while these expressions are used widely, they may have different meanings depending on the context and the person using them. For example, the W3C Working Draft dated 4 August 2011 of Web Application Best Practices from the W3C Device APIs Working Group describes the principles of “privacy-by-design” as:

  • Proactive not Reactive; Preventative not Remedial
  • Privacy as the Default Setting
  • Privacy Embedded into Design
  • Full Functionality - Positive-Sum, not Zero-Sum
  • End-to-End Security - Full Lifecycle Protection
  • Visibility and Transparency - Keep it Open
  • Respect for User Privacy - Keep it User-Centric

There is presently no globally applicable binding privacy or data protection law. Consequently, technical solutions to address privacy considerations in the handling of provenance data will need to be flexible and adaptive.
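
One way to picture a flexible, adaptive approach is a disclosure policy that is passed in as a parameter rather than hard-coded. The sketch below revisits the photograph example from earlier in this section; the field names, the `PERSONAL_FIELDS` set and the `redact` function are hypothetical, and which fields count as personal data in practice depends on the applicable law or guideline.

```python
# Hypothetical field names, following the photograph example in this section.
PERSONAL_FIELDS = frozenset(
    {"creator", "taken_at", "gps", "people_depicted", "reason"}
)


def redact(record, allowed=frozenset()):
    """Return a copy of `record` without personal fields, except those in `allowed`."""
    return {
        key: value
        for key, value in record.items()
        if key not in PERSONAL_FIELDS or key in allowed
    }


photo = {
    "format": "image/jpeg",
    "creator": "A. Photographer",
    "taken_at": "2011-07-06T10:15:00",
    "gps": (42.36, -71.06),
    "people_depicted": ["B. Subject"],
    "reason": "conference documentation",
}

strict = redact(photo)                        # only non-personal fields remain
lenient = redact(photo, allowed={"creator"})  # a policy that permits attribution
print(strict)
print(sorted(lenient))
```

Keeping the policy (`allowed`) separate from the record lets the same provenance data be released differently under different rules, without changing the stored record.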

In developing optimal solutions for provenance, consideration should be given to the privacy considerations associated with provenance data and to possible privacy-respecting approaches. Accordingly, the Connection Task Force suggests the W3C Provenance Working Group connect with the W3C Privacy Interest Group (to be formed) and the W3C Device APIs Working Group.

Digital Signatures: Authenticating Provenance Records (*)

W3C Security Activity, OpenID

DCMI Metadata Provenance Task Group (****)

The DCMI Metadata Provenance Task Group (DC-PROV) aims to define a very simple data model, based on the Dublin Core Abstract Model (DCAM), that allows the representation of provenance information about data/metadata (DCAM Description Sets). For the provenance information itself, Dublin Core is used. The goal is to use Dublin Core on the next level and in that way provide a very simple and incomplete, but understandable and interoperable, way to express provenance information. As such, DC-PROV can be seen as a least common denominator for provenance information, just as Dublin Core is for general metadata.

The design of the model is driven by the concepts for provenance information in RDF, which are currently being discussed in the RDF working group. There will probably be a mechanism very similar to named graphs. The motivation of the DC-PROV task group is therefore twofold: on the one hand, the way should be paved for the representation of provenance in the Dublin Core model in general; on the other hand, the model should be ready for such information when the standardization and implementation of a named-graph-like mechanism in RDF is finished.

As Dublin Core is a completely domain-agnostic data model, there are no use cases that are specific to the DC-PROV extension. The model is kept as general as possible and does not even state that the statements about description sets are provenance statements. They CAN be used to represent provenance information, but other use cases are possible, too.

An example that was used in the development process is the Europeana Data Model (EDM): it uses Dublin Core but, because no standardized form of provenance information in RDF and Dublin Core was available, is based on OAI-ORE and its definition of aggregations. The aggregations are used to encapsulate sets of metadata and make them addressable and describable. If provenance information becomes part of RDF (the next version is expected for 2012), the EDM will be reworked and will then probably be in line with the DC-PROV model.

Compared to the work of the W3C PROV-WG, the DC-PROV model can be seen as another provenance model, but clearly with a different focus. Adhering to the philosophy of Dublin Core, it is meant to be as simple as possible and, from a vocabulary point of view, restricts itself mainly to Dublin Core (one or two extensions are currently being discussed). As Dublin Core itself can be seen as a simple provenance vocabulary, the interesting point is mainly that the DC-PROV model allows the representation of metadata about metadata. For RDF, this is only possible in a clean way once it is extended for such meta-information; at the same time, the PROV-WG model could immediately be used, too.

For Dublin Core as a kind of least common denominator, interoperability is a key motivation. In cooperation between the two groups, a mapping of the PROV-WG model to the DCAM/DC-PROV model is planned and very desirable, since for many applications a coarse but simple representation of information is very helpful. The PROV-WG model should instead be seen as a comprehensive model for the fine-grained representation of provenance information. By means of a mapping of other provenance models to PROV-WG, all these models could benefit from an existing translation to Dublin Core.

The PROV-WG model faces the same problem as DC-PROV regarding RDF: Without a proper extension, it will not be possible to describe the provenance of RDF data.

For suggestions and further information, contact Kai Eckert, who leads the DC-PROV task group and is a member of the PROV-WG, mainly to function as a bridge.

PREMIS: Provenance for Archival and Versioning (*)

PREMIS is a vocabulary supported by the US Library of Congress for long-term preservation of documents. It focuses on the provenance of the archived digital objects (files, bitstreams, aggregations), not on the provenance of the descriptive metadata. It includes terms to describe versions of documents as well as signatures.

A mapping of SWAN to OPM was carried out as part of the W3C Provenance XG activities.

InterPARES: Provenance for Preservation and Authenticity (*)

The International Research on Permanent Authentic Records in Electronic Systems (InterPARES) project focuses on archival preservation in the library sciences community. It includes an international community that has been working for more than a decade on the project. The ontologies and vocabularies published in the last phase of the project are available.

Groups within W3C Developing Related Standards

RDF Working Group (****)

One of the work items on the RDF WG's charter is to “Standardize a model and semantics for multiple graphs and graphs stores”. This is expected to unify approaches such as “named graphs”, “quoted graphs”, “graph literals”, “quads” and so on.

These features are widely used in RDF deployments to model the provenance of information recorded in RDF. Indeed, the working group's collection of use cases for this “multigraphs support” identifies several provenance-related use cases. The ability to address provenance-related use cases will be an important benchmark in evaluating proposals for this work item.

At the time of writing, a straw-man proposal, directly based on the notion of “RDF datasets” in SPARQL, is included in the Editor's Draft of the updated RDF Concepts document. Being an Editor's Draft, it is a work in progress and does not have consensus within the WG. The section also contains links into the RDF-WG issue tracker where relevant mailing list discussion is archived.

The interest of the working group is limited to provenance of RDF data. Given the positioning of RDF as a web technology, the main focus is on provenance of RDF data on the web, but this is not exclusive as RDF is also used “behind the firewall” in enterprise information integration, knowledge representation and other scenarios.

Relevant output of the working group is likely to include an extension of the RDF data model to support multiple RDF graphs, with an account of the formal semantics of the extension. To actually address provenance use cases, most likely additional RDF vocabulary will be required. The RDF WG is unlikely to standardize such vocabularies, as no new vocabularies are listed in its chartered deliverables, although addition of new terms to RDF Schema would not be impossible given the charter. The RDF WG is more likely to defer the creation of such vocabulary to other or future working groups, most of all the W3C Provenance Working Group, but also the Government Linked Data WG which has work items on describing best practices for versioning, and on recommending/blessing (or standardizing) a metadata vocabulary suitable for provenance.

It is likely that the working group will also standardize a related syntax for multiple graphs, which may end up being similar to existing proposals such as TriG, Notation 3 or N-Quads. Other syntaxes may be retrofitted with multigraph support.
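
To make the multiple-graph idea concrete, the sketch below writes a small dataset in N-Quads, one of the existing proposals named above, where an optional fourth term names the graph a triple belongs to. It is an illustration under assumptions: the example.org IRIs are made up, terms are passed in already written in N-Quads syntax, and treating the graph IRI as the subject of a provenance statement is one common convention rather than anything the RDF WG has agreed.

```python
from typing import Iterable, Optional, Tuple

# (subject, predicate, object, graph name); terms are already in N-Quads syntax.
Quad = Tuple[str, str, str, Optional[str]]


def to_nquads(quads: Iterable[Quad]) -> str:
    """Serialize quads; a graph name of None places the triple in the default graph."""
    lines = []
    for s, p, o, g in quads:
        terms = [s, p, o] + ([g] if g else [])
        lines.append(" ".join(terms) + " .")
    return "\n".join(lines)


G1 = "<http://example.org/g1>"  # hypothetical graph name

dataset = [
    # A data statement placed inside the named graph.
    ("<http://example.org/doc>", "<http://purl.org/dc/terms/title>", '"Report"', G1),
    # A statement in the default graph about the named graph itself
    # (assumption: the graph IRI is used as the subject of provenance statements).
    (G1, "<http://purl.org/dc/terms/creator>", "<http://example.org/alice>", None),
]

print(to_nquads(dataset))
```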

Richard Cyganiak is a member of the RDF Working Group and functions as a bridge. Note that there is a special mailing list dedicated to the communication between the RDF WG and the Provenance WG.


POWDER, the Protocol for Web Description Resources, provides a mechanism to describe and discover Web resources and helps users decide whether a given resource is of interest. There are a variety of use cases, from providing a better means of describing Web resources and creating trustmarks to aiding content discovery, child protection and Semantic Web searches.

Object Memory Modeling (****)


Entry in Connection Catalog

An object memory is meant to support collecting data about a physical artifact (at the artifact and/or in the Web) and in this way to improve documentation and communication in processes focused on artifacts, as well as user interaction with artifacts. The Object Memory Model (OMM) describes the structure of an object memory: its organization as well as the nature of the contained data. On the one hand, this structure aims to enable different parties to contribute data to the memory (e.g., in order to record the exchange of an artifact between business partners); on the other hand, it aims to provide a clear distinction between all of these contributions. As such, the provenance of an object memory is two-fold. First, the provenance of contributed data has to be represented. Second, and depending on the use cases, the overall memory may represent the provenance of an artifact.

The OMM is conceptually related to the Physical Markup Language, but tries to be less focused on supply chain support and to reduce constraints concerning contained data. Furthermore, there is a general overlap with activities concerning the modeling of data logs, e.g., for modeling instances of business processes. Since the OMM is meant to support use cases spanning several application domains (e.g., from production over logistics to sale of an artifact), the OMM is independent from a particular application domain.

Provenance of contained data is not constrained to a particular subject. If the OMM as a whole is interpreted to be a provenance model for artifacts, then the artifact would become the subject. The OMM is supposed to be a container structure which imposes few constraints on the format of contained data. A reference implementation of the Object Memory Model has been defined using XML; other format encodings are currently being investigated.

The OMM relies on metadata in order to support retrieval of contained data. The reference implementation uses metadata based on Dublin Core Metadata in order to describe provenance of contents within the memory.

OMM Provenance Use Cases:

  • User-generated Data for Documenting Object Provenance: In retail scenarios it is often perceived as desirable to have provenance information about the origin of an object or previous interactions with an object that may impact consumer behavior. Examples include added information about the producer of a product, such as in a fair trade shop; previous owners of the object in scenarios that deal with second-hand goods; more generally, people's interactions with a product over the course of its lifecycle; and other kinds of product experiences ranging from ratings to extensive reviews.
  • Event Logging: Complex production processes are error-prone. To make production more efficient, avoid errors and increase product quality, real-life process data is recorded on object memories linked with objects. These objects can be products as well as the machines and tools needed for production. The objects interact and communicate with each other, so a product can communicate a failure during a routing process, with the consequence that the next routing process cannot start. In addition, such specific process information might be of interest to other companies or end users.

Groups within W3C for Special Interest Areas

Health Care and Life Sciences (HCLS) (****)

The W3C Semantic Web Activity formed a special interest group on Health Care and Life Sciences, recognizing scientific research as a driver of web technologies and in particular life sciences as a leading area in pushing semantic web infrastructure.

The W3C HCLS activity is composed of several task forces, two of which have particular relevance to provenance.

The HCLS BioRDF task force is looking into provenance representations to describe experimental methods in life sciences. One of the current goals of the HCLS BioRDF Task Force is to transform microarray gene expression results into RDF format and preserve provenance information about these gene expression results, such as what samples were used, which institutions contributed the samples, and what experiment factors were used to produce the results. In the first iteration, a provenance data model was created that captures provenance information at four different levels:

  • the institutional level, which describes the laboratory performing an experiment and the publication reporting the results
  • the experimental context level, which describes samples used in the experiment and the list of genes being studied
  • the statistical and significance analysis level, which describes the statistical and significance analysis tools used in an experiment and the results of the analysis
  • the dataset description level, which provides descriptive metadata about the gene list results from each study

In that iteration, existing provenance vocabularies/ontologies were not reused, in order to keep the data modelling independent. At the moment, the model is being refactored and mapped to some existing provenance vocabulary. Relevant provenance vocabularies in this area include MGED (a vocabulary specific to representing microarray experiments), EFO (a vocabulary used in the widely used ArrayExpress microarray data repository) and OBI (Ontology of Biomedical Investigations).

The W3C HCLS Scientific Discourse Task Force is looking at representing the provenance of, and relationships among, hypotheses and claims in different scientific articles, so that they can be better related to one another to facilitate understanding of the state of the art in a scientific area. There are several vocabularies that represent some form of scientific provenance. The Semantic Web Applications in Neuromedicine (SWAN) vocabulary is used to represent hypotheses and claims and relate them to scientific publications and authors. SWAN includes the Provenance, Authoring, and Version (PAV) vocabulary to represent authorship. A mapping of SWAN to OPM was carried out as part of the W3C Provenance XG activities. Based on that mapping, it is apparent that many of the terms in SWAN could be easily mapped to the PROV model. Other aspects of SWAN that focus on relating hypotheses and claims are out of the scope of PROV and could be part of a profile. As part of the Task Force's work, the SWAN vocabulary was aligned with the Semantically Interlinked Online Communities (SIOC) vocabulary. A recent effort is DEXI (Data + Experiment), a vocabulary that unifies SWAN, OBI, MO, and myExperiment. The Task Force is also developing rhetorical document models to represent scientific document structure, integrating SWAN with other discourse representations and connecting with bibliographic ontologies such as PRISM and CiTO. One of these is the Ontology of Rhetorical Blocks (ORB), which focuses on the markup of scientific articles with salient sections as well as authorship relations. Current discussions include representing research objects and nanopublications and annotating their provenance.

eGovernment (*)


Government Linked Data Working Group (**)


According to its charter, the mission of the Government Linked Data (GLD) Working Group is to provide standards and other information which help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies.

The group will develop one or more W3C Recommendations to guide governments publishing data on which RDF vocabulary terms to use for information about certain common concept areas. One of the aims is to gather and publish use cases and requirements for vocabularies covering several areas. One of these areas is Metadata, for which vocabularies are required to be suitable for provenance, data catalogs (see the dcat data catalog vocabulary, the Comprehensive Knowledge Archive Network CKAN, and VoiD), data quality, timeliness of data, status, refresh rate, etc.

The GLD WG charter names the Provenance WG as a liaison, with the intention of making sure that its use cases are understood and addressed by the Provenance WG. The working group was started in June 2011 and, at the time of writing, has not yet developed technical requirements or use cases for provenance, but there are members of the Provenance WG who work with government linked data, and contacts between the two groups have been established.

Semantic News Community Group (***)


The Semantic News Community Group is a forum for exploring the intersection of W3C semantic technologies and news gathering, production, distribution and consumption. The group is interested in provenance for a number of related reasons. The source of information is important in evaluating its believability. News providers stake their reputation on their ability to provide news in which the facts within have been checked and recorded to be checked. Also, the provenance of news may indicate to whom it is relevant.

Other Groups in Special Interest Areas

Geospatial Provenance (***)

Geospatial provenance was not fully realized until the digital age, when geospatial information systems emerged and maps went from being seen as a singular fixed product to a composite, editable product acquired using a wide range of technologies and sources. Early provenance emerged from a need to track data origins and subsequent changes through data lineage (Grady R, “The Lineage of Data in Land Geographic Information Systems (LIS/GIS)”. URISA Journal. Fall 1990). Data lineage is still vital today.

As APIs were exposed and open standards for data representation and web protocols emerged through communities such as the Open Geospatial Consortium, a backbone formed for processing pipelines that automatically prepare, integrate, analyze, and compute geospatial data. The advent of the web also made it easier for the geospatial community to integrate multimedia products such as video feeds and interactive ground-level perspectives, and to tie semantic knowledge into existing spatial query and text query capabilities. Current standards specific to geospatial data provenance are provided in ISO 19115 for geospatial metadata. An excellent reference on current thinking about geospatial processing provenance in a service-oriented environment is the paper: Sharing geospatial provenance in a service-oriented environment.

Additional topics that are important but not yet addressed in the geospatial provenance community:

  • conveying purpose, errors and confidence levels to different communities
  • privacy issues when disseminating provenance
  • data licensing
  • legal and ethical considerations

eScience Provenance (****)

eScience provenance is associated with a wide range of products drawn from experimental, observational (field sensor), computational (numerical experiments), and analytic work, as well as from scientific insights. While eScience provenance has many possible uses, what is captured and how it is presented to targeted communities must be carefully considered. In 2005, fundamental eScience provenance requirements and a proposed provenance architecture were identified based on eScience use case experiments in biology, chemistry, physics, and computer science. Extensive research has also shown that provenance must not only provide a syntactic historical explanation of what occurred, but also associate domain knowledge so that provenance is conveyed with a scientific context. Automated and semi-automated scientific workflow systems such as Taverna and Kepler have become increasingly popular, and as a result workflow-based provenance data models have emerged to capture workflow events.

One example of an active eScience community is DataONE, a large NSF-funded project on data preservation.

The project is organized as a core effort, plus a number of Working Groups: "The Working Group model allows DataONE to conduct targeted research and education activities with a broad group of scientists and users. Working Groups are also designed to enable research and education activities to evolve over time. Each Working Group will have two co-leaders who organize the activity and propose solutions to particular research, education, and cyberinfrastructure problems."

The DataONE Provenance Working Group is led by Prof. Bertram Ludascher and Paolo Missier. The primary goal of the WG is to investigate the role of workflow-based provenance in the DataONE use cases, and to formulate a provenance data model and management architecture implementation that suits DataONE's needs.

As part of this effort, we are designing the "Data One Provenance Model", or D-OPM (pun intended). Inspired by the OPM, D-OPM is focused specifically on workflow-based data products and their provenance. As such, the model explicitly includes a representation of the workflow-based processes that generate the provenance.

As of June 2011, the WG has met only twice, and D-OPM is still in a preliminary state. However, interesting work was done by the WG in 2010, within the scope of the DataONE summer internship program, to define a model for composing provenance over multiple heterogeneous workflow runs and traces. A summary of this work appears in this published paper.
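
As a rough illustration of what composing provenance over multiple workflow runs can involve, the following Python sketch merges the traces of two runs into a single derivation graph and then queries the lineage of a data product. It is not D-OPM, which was still preliminary at the time of writing, and it does not reproduce the model from the paper mentioned above. The record layout, function names, and file names are all hypothetical, and the sketch assumes that the runs already refer to shared data products by the same identifier. With heterogeneous workflow systems, establishing that correspondence is a substantial part of the problem.

```python
from collections import defaultdict

# Hypothetical trace format: one run is a list of
# (process_name, input_artifacts, output_artifacts) records.


def compose_traces(traces):
    """Merge per-run traces into one map: artifact -> artifacts it was derived from."""
    derived_from = defaultdict(set)
    for trace in traces:
        for _process, inputs, outputs in trace:
            for out in outputs:
                derived_from[out].update(inputs)
    return derived_from


def lineage(derived_from, artifact):
    """Return every upstream artifact of `artifact` across all composed runs."""
    seen = set()
    stack = [artifact]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, ()):
            if parent not in seen:  # also guards against cycles in bad data
                seen.add(parent)
                stack.append(parent)
    return seen


# Two runs, possibly from different workflow systems, linked only by
# the shared (assumed) identifier "clean.csv".
run_a = [("clean", ["raw.csv"], ["clean.csv"])]
run_b = [("model", ["clean.csv", "params.json"], ["forecast.nc"])]

graph = compose_traces([run_a, run_b])
upstream = lineage(graph, "forecast.nc")
print(sorted(upstream))
```

Neither run on its own records that forecast.nc depends on raw.csv. That dependency only becomes visible once the two traces are composed.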

Analytic Provenance (*)



    Initial Outline and Timetable July 14, 2011
    Identify Report Contributors August 4, 2011
    Initial Draft Report Delivery August 25, 2011
    Collect Feedback September 15, 2011
    Final Review September 22, 2011
    Final Report due to W3C September 29, 2011


This informal report was compiled by members of the connection task force and invited experts. The report describes, in varying degrees of detail, 18 communities: communities addressing issues important to the provenance community, active W3C standards bodies, W3C special interest groups, and other special interest groups that rely heavily on provenance.