Use Case Report

From XG Provenance Wiki
Revision as of 16:46, 24 June 2010 by Ygil (Talk | contribs)

Provenance Dimensions

This is a list of important dimensions in provenance that the group identified in order to guide the collection of use cases. We have grouped these dimensions into three major categories: the content of provenance information, the management of provenance as it exists on the web, and the use of provenance.

Note: These dimensions are not mutually exclusive. A use case may include more than one dimension. It is helpful in a use case to identify the primary dimension that the use case is trying to illustrate and then list other secondary dimensions.

The group is currently soliciting comments for these dimensions. Please put comments and suggestions in the discussion section.

Content

  • Object - what the provenance is about
  • Attribution - provenance as the sources or entities that were used to create a new result
    • Responsibility - knowing who endorses a particular piece of information or result
    • Origin - recorded vs reconstructed, verified vs non-verified (eg with digital signatures), asserted vs inferred
  • Process - provenance as the process that yielded an artifact
    • Reproducibility (eg workflows, mashups, text extraction)
    • Data Access (e.g. access time, accessed server, party responsible for accessed server)
  • Evolution and versioning
    • Republishing (e.g. retweeting, reblogging, republishing)
    • Updates (eg a document that assembles content from various sources and that changes over time)
  • Justification for Decisions - capturing why and how a particular decision is made
    • argumentation - what was considered and debated (eg pros and cons) before reaching a solution
    • hypothesis management (eg in the HCLS scientific discourse task, when complementary/contrary evidence is provided by different sources)
    • why-not questions - capturing why a particular choice was not made
  • Entailment - given the results of a particular query in a reasoning system or DB, capture how the system produced the answer, i.e. which of the axioms or tuples it contained led to those results

Management

  • Publication - Making provenance information available on the web (how do you expose it, how do you distribute it)
  • Access - Finding and querying provenance information
    • Finding the provenance information, perhaps through an authoritative service
    • Query formulation and execution mechanisms
  • Dissemination control - Using provenance to track the policies for when/how an entity can be used as specified by the creator of that entity
    • Access Control - incorporate access control policies to access provenance information
    • Licensing - stating what rights the object creators and users have based on provenance
    • Law enforcement (eg enforcing privacy policies on the use of personal information)
  • Scale - how to operate with large amounts of provenance information

Use

  • Understanding - End user consumption of provenance.
    • abstraction, multiple levels of description, summary
    • presentation, visualization
  • Interoperability - combining provenance produced by multiple different systems
  • Comparison - finding what's in common in the provenance of two or more entities (eg two experimental results)
  • Accountability - the ability to check the provenance of an object with respect to some expectation
    • Verification - of a set of requirements
    • Compliance - with a set of policies
  • Trust - making trust judgements based on provenance
    • Information quality - choosing among competing evidence from diverse sources (eg linked data use cases)
    • Incorporating reputation and reliability ratings with attribution information
  • Imperfections - reasoning about provenance information that is not complete or correct
    • Incomplete provenance
    • Uncertain/probabilistic provenance
    • Erroneous provenance
    • Fraudulent provenance
  • Debugging

Use Case List and Organisation

Proposed Use Case Dimensions

The group identified a set of key issues in provenance. These dimensions will be used to guide the group in terms of assessing coverage of use cases.

Template for Use Cases

Use cases follow the Use Case Template and the guidelines for curation of use cases. Note that this is not a MediaWiki template, just a structure to be copied. For those interested, the rationale for the template used can be found here.

Original Use Cases Proposed

Below are the initial use cases gathered by this incubator group. The cases were reviewed by a selected team of curators. This list is included here for the record. An organization and merging of use cases was developed based on this original list and is shown in the next section.

Use cases pending:

  • Ian Oliver: expressing provenance
  • Bertram Ludaescher: scientific workflow
  • Michael Panzer: digital library
  • Raphael Troncy: multimedia metadata
  • Deborah McGuinness: text analytics
  • Deborah McGuinness: combining proofs
  • Deborah McGuinness: improving decision processes
  • Jim McCuskey: closures for multiple provenance graphs
  • Lalana Kagal: private data
  • Satya Sahoo: multiple hypotheses

Other provenance-related use cases captured elsewhere include:

Actions Taken on Use Cases

The table below explains the rationale for how the original use cases proposed by the group were used and which ones were merged or further edited.


Identifier | Description | Curated? | Action
differences | Result Differences | Yes | Used as exemplar for Comparison dimension
anonymous | Anonymous Information | | Merge with "privacy" use case to illustrate Dissemination dimension
quality | Information Quality Assessment for Linked Data | Yes | Merge relevant pieces into "timeliness", "assessment", and "unreliability" use cases, since this one is not really a use case in itself
timeliness | Timeliness | Yes | Use as exemplar for the Publication dimension
assessment | Simple Trustworthiness Assessment | Yes | Use as exemplar for the Trust dimension
unreliability | Ignoring Unreliable Data | Yes | Merge with "assessment" use case to illustrate the Trust dimension
domain | Answering user queries that require semantically annotated provenance | Yes | Use as exemplar for the Understanding dimension
biomedicine | Provenance in Biomedicine | Yes | Merge into "domain" use case to illustrate Understanding dimension
experiments | Closure of Experimental Metadata | | Use as exemplar for Access dimensions
reproducibility | Experimental Reproducibility Analysis | | Merge with "differences" use case to illustrate Commonality dimension
biospecimens | Locating Biospecimens With Sufficient Quality | | Needs to be generalized as it is too application specific as currently written, then use as exemplar for Process dimension
products | Using process provenance for assessing the quality of Information products | | Merge into assessment use case to illustrate Trust dimension
blogs | Provenance Tracking in the Blogosphere | Yes | Use as exemplar for the Scale dimension
tweets | Provenance of a Tweet | | Use as exemplar for Versioning dimension
privacy | Provenance and Private Data Use | Yes | Use as exemplar for Dissemination dimension
emergency | Provenance of Decision Making in Emergency Response | | Use as exemplar for Imperfections dimension
collections | Provenance of Collections vs Objects in Cultural Heritage | | Use as exemplar for Attribution dimension
granularity | Provenance at different levels in Cultural Heritage | | Use as exemplar for Understanding dimension
associations | Identifying attribution and associations | | Use as exemplar for Trust dimension
compliance | Determining Compliance with a License | Yes | Use as exemplar for Accountability dimension
axioms | Documenting axiom formulation | | Use as exemplar for Entailment dimension
policy | Evidence for public policy | Yes | Use as exemplar for Justification dimension
engineering | Evidence for engineering design | Yes | Use as exemplar for Dissemination dimension
contracts | Fulfilling Contractual Obligations | Yes | Use as exemplar for Accountability dimension
versions | Attribution for a versioned document | Yes | Use as exemplar for the Attribution dimension
environment | Provenance for Environmental Marine Data | Yes | Use as exemplar for Entailment dimension, needs to be modified to emphasize that aspect
crosswalk | Crosswalk Maintenance | | Use as exemplar for the Debugging dimension
merging | Metadata Merging | | Use as exemplar for the Interoperability dimension
bug | Hidden Bug | Yes | Use as exemplar for the Debugging dimension

Exemplifying Provenance Dimensions with Use Cases

The table below shows up to two major exemplar use cases to illustrate each of the provenance dimensions. Each use case is relevant to several dimensions, which is indicated in the description of the use case.


Dimensions | Exemplar Use Case 1 | Exemplar Use Case 2
Content
Attribution | collections | versions
Process | biospecimens |
Versioning | tweets |
Justification | policy |
Entailment | environment | axioms
Management
Publication | timeliness |
Access | experiments |
Dissemination | privacy | engineering
Scale | blogs |
Use
Understanding | domain | granularity
Interoperability | merging |
Comparison | differences |
Accountability | contracts | compliance
Trust | associations | assessment
Imperfections | emergency |
Debugging | crosswalk | bug

Use Cases

Result Differences

Owner

Simon Miles

Provenance Dimensions

  • Primary: Commonality (Use)
  • Secondary: Process (Content), Scale (Management), Imperfections/Debugging (Use)

Background and Current Practice

This use case is taken from the following paper: The Requirements of Using Provenance in e-Science Experiments, and comes from interviews with Klaus-Peter Zauner. Please see the paper for more background details.

A bioinformatics experiment, encoded as a workflow, uses a number of services, some externally provided, some written by the biologist, that analyse data drawn from publicly accessible databases. When a potentially interesting result is found, the biologist re-runs parts of the workflow with different configuration parameters to try to determine why that result was produced.

Goal

Determine or disambiguate why two processes produced different results.

In order to do this, we use the provenance of the results to examine the processes producing them.

Use Case Scenario

A user, B, downloads data from source D and performs a process using that data as input. Later, B downloads data from D again using the same query and performs the same process. B compares the two process outputs and notices a difference. B determines whether the difference was caused by the process or its configuration having been changed, or by the downloaded data being different (or both).

In the original bioinformatics use case, the data was that of a human chromosome, D was GenBank, and the process was the experiment itself (largely written as Tcl scripts). However, this scenario applies wherever differences in outcome (where equivalence was expected) may be due to differences in parts of the process producing the outcome or the inputs to that process.

Problems and Limitations

Without a record of the salient differences between workflow runs, i.e. the provenance of each outcome, it becomes hard or impossible to determine what caused the difference. Where the experiments involve large amounts of complex data, as was the case in the bioinformatics experiment, human records in a lab book are not a feasible way to provide these records.

Technical Challenges:

  • Representing the full record of what occurred in a process
  • Extracting the above record
  • Representing the provenance of data which is derived from large and complex data
  • Determining the differences between two complex provenance records
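
The last challenge can be illustrated with a deliberately small sketch. It assumes that each run leaves behind a provenance record holding the process version, its configuration, and a digest of the downloaded input; the RunRecord type, its field names, and the sample values are hypothetical and not part of the original use case. Real provenance records would be far richer, so this only shows the shape of the comparison.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical provenance record for one run of the process."""
    process_version: str  # e.g. a digest of the scripts that were executed
    config: dict          # configuration parameters used for the run
    input_digest: str     # digest of the data downloaded from source D


def explain_difference(first, second):
    """List the recorded aspects in which two runs differ."""
    causes = []
    if first.process_version != second.process_version:
        causes.append("process")
    # Compare configuration parameters key by key, including keys
    # that are present in only one of the two runs.
    changed = sorted(
        key
        for key in first.config.keys() | second.config.keys()
        if first.config.get(key) != second.config.get(key)
    )
    if changed:
        causes.append("configuration: " + ", ".join(changed))
    if first.input_digest != second.input_digest:
        causes.append("input data")
    return causes


# Same scripts, one parameter changed, and a different download from D.
run1 = RunRecord("scripts-v1", {"window": 20}, "digest-a")
run2 = RunRecord("scripts-v1", {"window": 25}, "digest-b")
print(explain_difference(run1, run2))
```

An empty result would mean that the recorded provenance cannot explain the difference, which itself suggests that the record is incomplete.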

Existing Work

An approach to addressing this use case is discussed in the paper Recording and Using Provenance in a Protein Compressibility Experiment.

Anonymous Information

Owner

Ronald Reck

Background

Adapted from here. Any errors caused in that adaptation are the fault of Simon Miles.

Goal

To provide information without being the attributed source or "whistle blower".

Use Case Scenario

Alice takes her child to the doctor for a cold. The doctor recommends a routine test. Later, Alice discovers the routine test was not so routine after all, as it was to rule out the possibility of swine flu, a recent epidemic. Alice feels that the doctor misrepresented the test, and wants to share this with her friends, but is reluctant to do so, since casting the doctor in a negative light could have repercussions for her care at a later time. The information is made available to a network of people (Alice's friends) without the exact source of that information being made apparent.

Problems and Limitations

A user feels important information should be shared, but is reluctant to share if the information is attributed to them.

Information Quality Assessment for Linked Data

Owner

Olaf Hartig, Jun Zhao, and Chris Bizer

Curator

Paolo Missier

Background

Information quality (IQ) is a multidimensional concept with different criteria such as accuracy, believability, completeness, timeliness, etc. (a more comprehensive list of criteria can be found in, e.g., [1] and [2]). IQ assessment is the process of assigning numerical values, called IQ scores, to certain IQ criteria. IQ assessment is commonly considered to be a complex problem. The methods that can be applied for IQ assessment are diverse and depend on the specific criterion as well as the use case ([1] and [2] outline various methods).

Because of the openness of the Web, much of the linked data on the Web is derived from other data by replication, queries, modification or merging. Little is known about who created data on the Web and how. This means that poor data quality can quickly propagate on the Web of Data. Unless an approach for evaluating the quality of data is established, the Web of Data could soon be widely contaminated, and applications built upon it would lose their value.

Goal

Information quality assessment for Linked Data

Use Case Scenario

With the rapid growth of Linked Data on the Web, more and more applications emerge that make use of this data. It is to be expected that these applications will consume Linked Data from a large number of different sources on the Web. Due to the openness of the Web, these applications have to take the IQ of the consumed data into account. Hence, these applications have to apply IQ assessment methods to assess certain IQ criteria such as timeliness, accuracy, and believability. The applied methods may vary depending on the importance and the criticality of the application. For many applications fairly simple methods may suffice. These methods can be based on the provenance of the assessed data.

Problems and Limitations

To apply provenance-based IQ assessment methods, Linked Data consuming applications require provenance-related metadata. Hence, data publishers have to be enabled and encouraged to provide this metadata. For very simple assessments, information about the provider, the creator and the creation time may be enough. Further information that might be useful is: what source data has been used for creation; where, how, and when the data (or the source data) has been retrieved from the Web; who is responsible for accessed services; how the data was created. However, the provenance information that is required depends on the assessment method applied by the users and is, therefore, difficult to predetermine. To get at least an idea of the diversity of possible assessment methods, take a look at Use Case Linked Data Timeliness, Use Case Simple Trustworthiness Assessment, and Use Case Ignoring Unreliable Data. Please note that in certain assessment scenarios provenance information by itself would not be sufficient; in these cases additional information is required, such as other metadata or an analysis of the data content.

Existing Work

Sig.Ma is a user interface for the Sindice Semantic Web Search Engine which allows users to filter information based on provenance (data source).

Many Linked Data browsers and search engines display basic provenance information (the URL from which an RDF triple has been retrieved) next to the actual data. Examples include Disco, Marbles, and VisiNav.

WIQA - Information Quality Assessment Framework is a set of software components that empowers information consumers to employ a wide range of different information quality assessment policies to filter information from the Web. WIQA includes an RDF data browser. In order to facilitate the user's understanding of the filtering decisions, the browser can create explanations [3] of why displayed information fulfils a selected policy.

The Provenance Vocabulary provides classes and properties to describe the provenance of data from the Web. Hence, this vocabulary enables providers of Web data to publish provenance-related metadata about their data. The vocabulary is based on a model for Web data provenance as presented in [5]. Based on the Provenance Vocabulary different Linked Data publishing tools have been extended with a metadata component that automatically provides provenance information:

The tRDF4Jena library extends the Jena RDF framework with classes to represent, determine, and manage trust values that represent the trustworthiness of RDF statements and RDF graphs. Furthermore, tRDF4Jena contains a query engine for tSPARQL [4], a trust-aware extension to the query language SPARQL.

References

[1] Felix Naumann: Quality-Driven Query Answering for Integrated Information Systems. Springer Berlin / Heidelberg, 2002.

[2] Christian Bizer: Quality-Driven Information Filtering in the Context of Web-Based Information Systems. Thesis, Freie Universität Berlin, 2007.

[3] Tim Berners-Lee: Cleaning Up the User Interface, Section: The "Oh,yeah?"-Button, 1997.

[4] Olaf Hartig: Querying Trust in RDF Data with tSPARQL. In Proceedings of the 6th European Semantic Web Conference (ESWC), Heraklion, Greece, June 2009.

[5] Olaf Hartig: Provenance Information in the Web of Data. In Proceedings of the Linked Data on the Web (LDOW) Workshop at WWW, Madrid, Spain, April 2009.

Timeliness

Owner

Olaf Hartig

(Curator: Paolo Missier)

Provenance Dimensions

  • Primary:
    • Use: Trust (Information Quality)
  • Secondary:
    • Content: Evolution and versioning -> Republishing, Process (Data Creation, Data Access)
    • Management: Publication, Access
    • Use: Interoperability

Background and Current Practice Scenario

Timeliness refers to the property of a piece of data to be "recent enough" to still be useful for a specific application. A typical example is stock ticks, which may become obsolete very quickly for the purpose of real-time trading, but stay timely for a long time for time series analysis on historical data, for example.

Estimating or measuring the age of data, and thereby determining its timeliness relative to a certain use, has been a long-standing problem of interest in the data quality community. Assessing timeliness relies on knowledge of the creation date/time of a piece of data. This metadata may be made available in various ways, which are typically data- and application-specific. In particular, when the creation time of a piece of data is not available, surrogates can sometimes be used, such as the time of last access, which however yields only an approximate timeliness assessment.

Goal

Thus, the goal is to enable users of Web data to make informed decisions on data fitness for purpose, based (in part) on its age, and therefore on its timeliness. This use case explores the use of provenance as a novel way to address the timeliness assessment problem, in the context of Web data.

Use Case Scenario

Alice uses an application that provides her with a particular stream of data. For concreteness, we will consider data that contains the latest traffic news. To achieve a more complete and more balanced view, and to guarantee the latest information, the application takes data from multiple data sources on the Web into account.

In turn, Bob and Carol publish local traffic data on the Web, as two separate data sources. However, they both take the data from the same data source, X, which holds traffic data about the whole country. This data source changes frequently.

During its execution, Alice's application compares two data items, B and C, that carry different traffic information for the same road. Alice needs to choose one of the two, and the criterion she uses is data timeliness, which is directly related to the creation date of the items.

These items are published by Bob and Carol, respectively, using X as their common data source, and they both come with some form of provenance associated with them. In Bob's case, the provenance includes the creation date/time of the version of the data item provided by X that B is based upon. This is precisely the metadata that Alice needs to determine B's timeliness. On the other hand, Carol's application, which is used to publish C using some version of the data provided by X, is unaware of the creation date of that version. Instead, Carol includes the date of her last access to X as a surrogate.

In practice, B and C carry two types of time-related provenance metadata, stronger for B, and weaker for C. Alice uses this metadata to make her decision, for instance, to ignore B as it is older than C, conscious however that there is a chance of making the wrong choice (for example when C has been updated recently from a version of the data provided by X that is, however, older than the version used to update B).
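
Alice's decision can be sketched as follows. This is only an illustration under stated assumptions: the Item type, its field names, and the timestamps are hypothetical, and the rule for when a choice counts as certain relies on the assumption that a last-access time is an upper bound on the creation time of the version that was accessed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Item:
    """Hypothetical data item with time-related provenance."""
    name: str
    source_created: Optional[datetime] = None  # creation time of the version of X used (strong)
    last_accessed: Optional[datetime] = None   # publisher's last access to X (weak surrogate)


def timestamp(item):
    """Return (time, kind) for the best time evidence an item carries."""
    if item.source_created is not None:
        return item.source_created, "creation"
    if item.last_accessed is not None:
        return item.last_accessed, "last-access"
    raise ValueError(f"{item.name} carries no time-related provenance")


def pick_most_recent(a, b):
    """Pick the apparently newer item and say whether the choice is certain."""
    (time_a, kind_a), (time_b, _) = timestamp(a), timestamp(b)
    chosen, chosen_kind = (a, kind_a) if time_a >= time_b else (b, timestamp(b)[1])
    # If the chosen item is dated by a true creation time, the other item
    # cannot be newer, because its timestamp is either its creation time
    # or an upper bound on it. A choice resting on a last-access time may
    # be wrong: the accessed version could be much older than the access.
    certain = chosen_kind == "creation"
    return chosen, certain


b = Item("B", source_created=datetime(2010, 6, 24, 9, 0))
c = Item("C", last_accessed=datetime(2010, 6, 24, 10, 0))
chosen, certain = pick_most_recent(b, c)
print(chosen.name, certain)
```

Here Alice's application prefers C but flags the choice as uncertain, which mirrors the risk described above. The flag is only possible because the two kinds of timestamp are distinguished explicitly in the provenance.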

Problems and Limitations

Here are the main technical challenges in this use case, followed by a brief description of a specific setting in which, we argue, provenance can be used effectively in combination with data published according to Linked Data principles.

  • Both providers, Bob and Carol, must associate several pieces of provenance metadata with the data they publish. This must include the originating source (X), and should include the creation date of the version of X that the data is based upon or, at least, the last access date. Without these, timeliness assessment is inaccurate at best, or impossible.
    • This is a provenance content and management issue.
  • The difference in semantics between the two dates, described above, must be made explicit so that Alice is aware of the potential errors. Without this, provenance is insufficient for Alice to reach a correct decision.
    • This is a provenance content and management issue.
  • Alice must be able to understand the representation of provenance for both B and C. Ideally, this is represented in a uniform way, although it has been generated independently by two different providers. Without this, Alice's code will be "hard-wired" to Bob and Carol's provenance formats, which makes it hard to reuse and to extend.
    • This is a provenance use issue.

We propose to cast the use case in a setting where data is published according to the Linked Data principles. In this case, Alice's application has access to it through a Web interface, in the form of RDF graphs. This setting, which is becoming increasingly common in practice, partially facilitates addressing the two technical challenges above, by providing a uniform way of addressing and accessing the data. By itself it does not, however, solve the problem of provenance semantics and interoperability.

Existing Work

[Hartig and Zhao SWPM09] describe an approach to develop a timeliness assessment method for Web data.

Simple Trustworthiness Assessment

Owner

Olaf Hartig

Curator

Paolo Missier

Provenance Dimensions

  • Primary:
    • Use: Trust (Information Quality)
  • Secondary:
    • Content: Attribution (Responsibility), Process (Data Access), Evolution and versioning (Republishing)
    • Management: Publication, Access
    • Use: Understanding

Background and Current Practice

Trustworthiness of an information object such as a data item is the subjective belief or disbelief in the truth of the information represented by the object. The trustworthiness of information objects is often used to filter them or to make decisions while processing them.

While trustworthiness and trust are studied extensively in the context of active entities such as persons, agents, or peers, few works study trustworthiness as an information quality criterion. Hence, computer systems that use the trustworthiness of information objects for filtering or decision making usually apply a very simple assessment approach: the object is related to some kind of source for which a trust score can be determined using one of the methods that exist for active entities; this score is then adopted as the trustworthiness of the information object.

Goal

The goal is to enable users of information objects such as Web data to assess the trustworthiness of these objects in order to make informed decisions on their fitness for use.

This use case discusses provenance information as a means to enable such trustworthiness assessments.

Use Case Scenario

Alice operates facilities to provide information objects to users. Various information providers make use of this possibility. A user, Eve, consumes the provided information. However, Eve does not want to use available information objects that, even if relevant to her task, are not trustworthy enough. Since Eve does not want to check and verify all information, she decides to consider trustworthy those information objects that originate from trusted providers. Based on this simple method, Eve assesses the trustworthiness of relevant information objects and ignores unsuitable objects accordingly.

This general scenario is motivated by the need to apply trustworthiness-based filtering methods in applications that consume Linked Data from the Web. In a Linked Data specific instance of the general scenario, Alice may operate a Linked Data server on which she publishes a dataset that includes RDF links to other datasets. Some of these links are provided by other parties because Alice allows others to upload relevant linksets to her server. Bob and Carol took this opportunity and use Alice's server to publish linksets to their own datasets as part of Alice's dataset. Eve uses a Linked Data based application that accesses and processes data from Alice's dataset. While processing this data, the application discovers several RDF links that seem to link to further relevant data and, therefore, are worth following. However, the application has the information that Eve does not trust Bob and, thus, decides to ignore the RDF links from Bob.

In addition to trustworthiness assessment of Linked Data, many other specializations of the general scenario can be considered (e.g. trustworthiness of blog posts in a feed aggregator, trustworthiness of photos uploaded to a news portal, trustworthiness of soccer match results reported via a sports platform).
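
The Linked Data instance of the scenario can be sketched in a few lines. The sketch assumes that the provenance of each RDF link names the party who contributed it; the Link type, the example.org URIs, and the set of distrusted providers are hypothetical. It also keeps a reason for every ignored link, so that the user can later be shown why something was filtered out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical RDF link together with its attribution."""
    target: str    # URI the RDF link points to
    provider: str  # party that contributed the link, taken from provenance


def split_by_trust(links, distrusted):
    """Separate links to follow from links to ignore, keeping the reason."""
    follow, ignored = [], []
    for link in links:
        if link.provider in distrusted:
            ignored.append((link, f"provider {link.provider} is not trusted"))
        else:
            follow.append(link)
    return follow, ignored


# Links found in Alice's dataset, attributed to the parties who uploaded them.
links = [
    Link("http://example.org/bob/data", "Bob"),
    Link("http://example.org/carol/data", "Carol"),
]
follow, ignored = split_by_trust(links, distrusted={"Bob"})
print([link.target for link in follow])
print([reason for _, reason in ignored])
```

Note that the filter is only as good as the attribution: if Alice's server did not record who uploaded each linkset, every link would appear to come from Alice.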

Problems and Limitations

The main technical challenges of this use case are:

  • Alice must associate the provided information with provenance-related metadata. This must include information which allows Eve to attribute the different information objects retrieved from Alice to the original providers. This is a provenance content and management issue.
  • The different kinds of sources that participate in the scenario must be taken into account when representing provenance to support trustworthiness assessment as outlined. While Alice is the source from which the information has been retrieved, she is not the original provider of all the information. Nonetheless, the information that Alice controls the providing service may be relevant for trustworthiness assessments too, because it may give her the chance for manipulation. This is a provenance content and use issue.
  • If an application filters information objects as the result of provenance-based trustworthiness assessments, the user should get the chance to understand the decisions made. This is a provenance use issue.

Existing Work

Hartig ESWC09 describes tSPARQL, which is a trust-aware extension to the query language SPARQL. tSPARQL allows trust requirements to be described in SPARQL queries. Using tSPARQL, an application can filter (intermediate) solutions for graph patterns in SPARQL queries based on the trustworthiness of the data from which the solutions originate. The tRDF4Jena library provides a query engine for tSPARQL.

Ignoring Unreliable Data

Owner

Olaf Hartig, Chris Bizer

Curator

Paolo Missier


Provenance Dimensions

  • Primary:
    • Content: Process (Data Access + Data Creation)
  • Secondary:
    • Content: Attribution (Verifying attribution), Evolution and versioning (Republishing)
    • Management: Publication, Access
    • Use: Trust (Information Quality)

Background and Current Practice

The decision to rely on data or to ignore it may have very different reasons. This use case focuses on the requirement that data be verifiably unmodified. Data that is created by some party and provided by others is vulnerable to manipulation; so is data transferred over an insecure channel. However, the creator as well as the publisher may provide a digital signature for the data so that any attempted manipulation can be detected.

Goal

The goal is to enable a user who consumes data that has been created based on multiple source data items, to ignore data items for which the integrity of involved source data cannot be guaranteed.

Use Case Scenario

Bob publishes a statistical dataset that he created by combining data provided by many different sources. Bob accesses these sources using various channels, some of which are insecure. Alice, a friend of Bob, realizes Bob's dataset is a valuable source for her studies. However, she considers reliable only those statistical records that are based on source data guaranteed to be unmodified.

The domain, statistical data, used in this scenario is an example; it can be replaced by other domains where data is harvested from multiple sources and combined afterwards. The scenario itself does not depend on a specific technology. A possible instance of the scenario could be a Linked Data application that aggregates data from multiple sources that publish statistical data according to the Linked Data principles.

Problems and Limitations

To enable Alice to ignore data items for which the integrity of involved source data cannot be guaranteed, she requires provenance information about all the data items she considers using. This information must include the source data items used to create each data item as well as information about how Bob retrieved these source data items. The latter should include information about the corresponding transmission channel and the result of Bob's attempts to verify digital signatures in case the retrieved data was signed. The technical challenges are:

  • Bob has to make available provenance-related metadata about the pieces of his dataset. This is a provenance management issue (dimensions: Publication, Access).
  • The metadata must include the aforementioned information. This is a provenance content issue (dimensions: Attribution, Process).
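
Given such metadata, Alice's filter could look like the sketch below. The SourceAccess and StatRecord types and their fields are hypothetical, and the acceptance policy is only one possible assumption: a source counts as intact if Bob successfully verified its signature, or if it was unsigned but retrieved over a secure channel; a failed verification always disqualifies it. Alice may well prefer a stricter policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceAccess:
    """Hypothetical provenance of how Bob retrieved one source data item."""
    source: str
    secure_channel: bool                       # retrieved over a secure channel?
    signature_verified: Optional[bool] = None  # None means the data was not signed


@dataclass
class StatRecord:
    """Hypothetical statistical record with the source accesses it is based on."""
    record_id: str
    sources: List[SourceAccess] = field(default_factory=list)


def integrity_ok(access):
    """Assumed policy: trust a verified signature, else fall back to the channel."""
    if access.signature_verified is not None:
        return access.signature_verified
    return access.secure_channel


def reliable_records(records):
    """Keep only records whose every source passes the integrity policy."""
    return [r for r in records if all(integrity_ok(a) for a in r.sources)]


records = [
    StatRecord("r1", [SourceAccess("s1", secure_channel=False, signature_verified=True)]),
    StatRecord("r2", [SourceAccess("s2", secure_channel=False)]),
    StatRecord("r3", [SourceAccess("s3", secure_channel=True),
                      SourceAccess("s4", secure_channel=True, signature_verified=False)]),
]
print([r.record_id for r in reliable_records(records)])
```

Record r1 survives despite the insecure channel because the signature was verified, while r3 is dropped because one of its two sources failed verification.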

Existing Work

See WIQA framework on IQ in Linked Data main page.

Answering user queries that require semantically annotated provenance

Name

Answering domain-specific queries by end users

Owner

Paolo Missier, Jun Zhao, Marco Roos, M. Scott Marshall

Provenance Dimensions

  • Primary
    • Use: Understanding

Background and Current Practice Scenario

Provenance metadata bears the potential of helping users achieve a better understanding of data products, as well as of the processes that led to them. An example is given in the scenario section below. Answering questions that user scientists have regarding their data products requires the ability to store and later query domain-specific information. Other use cases in this collection articulate this need in various ways. The need for domain-specific metadata has been clear to the information retrieval community for a long time. More recently, the Semantic Web community has been active in addressing this need through the development of a wealth of domain ontologies for a broad variety of application domains, as well as of languages for expressing such ontologies in a standard way, and conventions for sharing and exchanging them within and across communities.

Goal

The goal of this use case is to illustrate the role of semantics and domain-specific metadata in answering a variety of users' questions regarding data products that have been created through a known and documented process.

Use Case Scenario

Paul, a bioinformatician, uses a workflow to match an input list of his genes to gene identifiers from both the UniProt database and the Entrez Gene database. Using these genes, he then searches for encoded proteins and the protein pathways associated with these genes, using the KEGG Pathway database. Paul would like to be able to find out:

  1. all the genes that participate in some pathway p;
  2. all the pathways derived from UniProt genes;
  3. how a particular data product (such as a pathway p) was derived from other specific data products (say a collection of genes).

(More queries of a similar nature can be devised if needed.)

The question is what kind of domain-specific metadata is needed, or potentially useful, for a system that is aware of Paul's workflows to answer Paul's questions.

Problems and Limitations

Provenance captured during workflow execution, and more generally, provenance that describes the users' interaction with a number of databases, is a natural source of metadata that should be leveraged to answer the users' questions. Specifically, we need:

  1. The "raw" provenance trail for the workflow execution, as well as its structure. OPM provenance graphs can be used for this purpose.
  2. Annotations of the nodes in the provenance graph, providing semantic descriptions of the services, data products, and parameters that appear in the scenario.

The technical challenges include the need for a better understanding of how best to associate annotations with the structural lineage layer, and how to preserve the intrinsic identifiers of data products, for example the UniProt gene IDs associated with the UniProt genes.
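
To make the two ingredients concrete, the sketch below pairs a toy derivation graph (standing in for an OPM graph) with a separate dictionary of semantic annotations, and answers Paul's three queries over them. All identifiers (gene:g1, uniprot:U1, pathway:p and so on) and the type labels are invented for illustration, and the plain Python structures are an assumption of this example rather than the OPM or RDF representations discussed in the existing work.

```python
# Structural layer: (effect, cause) means "effect was derived from cause".
derived_from = [
    ("uniprot:U1", "gene:g1"),
    ("entrez:E2", "gene:g2"),
    ("pathway:p", "uniprot:U1"),
    ("pathway:q", "entrez:E2"),
]

# Annotation layer: semantic type of each node (hypothetical labels).
annotations = {
    "gene:g1": "InputGene", "gene:g2": "InputGene",
    "uniprot:U1": "UniProtGene", "entrez:E2": "EntrezGene",
    "pathway:p": "Pathway", "pathway:q": "Pathway",
}


def ancestors(node):
    """All nodes that `node` was directly or indirectly derived from."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for effect, cause in derived_from:
            if effect == current and cause not in found:
                found.add(cause)
                stack.append(cause)
    return found


def genes_in_pathway(pathway):
    """Query 1: genes that participate in a pathway."""
    return sorted(n for n in ancestors(pathway)
                  if annotations[n].endswith("Gene"))


def pathways_from(gene_type):
    """Query 2: pathways derived from genes of a given type."""
    return sorted(n for n, t in annotations.items()
                  if t == "Pathway"
                  and any(annotations[a] == gene_type for a in ancestors(n)))


def derivation_path(product, source):
    """Query 3: one chain of derivations from `source` up to `product`."""
    if product == source:
        return [product]
    for effect, cause in derived_from:
        if effect == product:
            rest = derivation_path(cause, source)
            if rest is not None:
                return rest + [product]
    return None


print(genes_in_pathway("pathway:p"))
print(pathways_from("UniProtGene"))
print(derivation_path("pathway:p", "gene:g1"))
```

Note that the node keys double as the intrinsic identifiers of the data products, which is one simple way of preserving them; queries 1 and 2 cannot be answered from the structural layer alone.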

Existing work

We are aware of an early prototype where domain-specific provenance is added to OPM, and such semantics-augmented OPM is represented using RDF. This is described in a paper presented at the SWPM'09 workshop (ISWC'09): SWPM'09 paper

Semantic extensions to OPM have also been recently proposed in this paper, presented at the 2009 All Hands Meeting, Oxford, UK.

Additionally, [KDG+08] describes reasoning about semantic properties of datasets in the workflow as part of provenance records. [GGR+09] describes how this is done for the case of data collections.

Provenance in Biomedicine

Owner

Satya Sahoo, Brent Weatherly, Amit Sheth

Provenance Dimensions

  • Primary: Process (Content), Agent (Content), Justification for Decisions (Content)
  • Secondary: Attribution (Content), Query (Management), Scale (Management), Trust (Use), Commonality (Use)

Background and Current Practice

This use case is derived from the biomedicine domain. The research objective is to identify vaccine, diagnostic, and chemotherapeutic targets in the human pathogen Trypanosoma cruzi (T. cruzi) [1]. Parasite researchers generate data using different experiment protocols such as expression profiling, proteome analysis, and creation of new strains of pathogens through gene knockout. These experiment datasets are also combined with data from external sources, such as biology databases (NCBI Entrez Gene, TriTrypDB) and information from biomedical literature (PubMed), that have different curation methods and quality associated with them.

The biologists issue queries over the integrated datasets and the results are interpreted according to the source details including curation method used in external databases, confidence measures associated with experiment material and methods, and research personnel/institution.

Goal

Compare, integrate, and process large volumes of biomedical data from different experiment processes (using heterogeneous materials, equipment, protocols, and parameters) and external sources including databases and published literature.

In this use case, the goal is to capture and store domain-specific provenance to support both biology and administrative queries including:

1) Enable researchers in the parasite research community to infer phenotype of the related organisms from the work done on T. cruzi.

2) Enable project managers or principal investigators to track progress and/or view successful creation of pathogen strains.

3) Allow new researchers such as visiting faculty or post-docs to learn the lab-specific methods by studying existing results and the associated experiment protocols.

Use Case Scenario

Biologists interpret two result sets A and B according to the source of the data. In A, the experiment data is combined with data from a database with human-curated information. In B, the experiment data is combined with data from a database with results of a prediction algorithm. The result set A has a higher confidence value and is used in further analysis.

A set of queries in the context of this use case:

Query 1: List all groups using a target region plasmid X.

Query 2: Find the name of the researcher who created a strain Y of the T. cruzi parasite.

Query 3: Which gene was used to create a cloned sample Z?
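
The three queries can be read as simple patterns over provenance statements. The following sketch stores such statements as subject/predicate/object triples and answers each query with a wildcard match. The predicate names (uses, createdBy, clonedFrom) and all identifiers are hypothetical and are not drawn from the ontologies cited under Existing Work.

```python
# Toy provenance statements as (subject, predicate, object) triples.
triples = [
    ("group:A", "uses", "plasmid:X"),
    ("group:B", "uses", "plasmid:X"),
    ("group:C", "uses", "plasmid:W"),
    ("strain:Y", "createdBy", "researcher:R1"),
    ("sample:Z", "clonedFrom", "gene:G7"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Query 1: all groups using target region plasmid X.
query1 = sorted(t[0] for t in match(p="uses", o="plasmid:X"))
# Query 2: the researcher who created strain Y.
query2 = [t[2] for t in match(s="strain:Y", p="createdBy")]
# Query 3: the gene used to create cloned sample Z.
query3 = [t[2] for t in match(s="sample:Z", p="clonedFrom")]

print(query1, query2, query3)
```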

Problems and Limitations

Provenance information in the life sciences is, in general, difficult to capture and represent. The provenance of biomedical data is essential to accurately understand the significance of results that integrate information from multiple sources.

Technical Challenges:

  • Capturing and modeling domain-specific provenance in life sciences is essential (for example, instruments and sample types used in an experiment)
  • The scale of provenance information in life sciences is extremely large and effective storage mechanisms are needed
  • A dedicated query mechanism is required that takes into consideration the characteristics of provenance queries and data (for example, an efficient pattern matching algorithm to compare the provenance of biomedical data)

Existing Work

[1] Semantic Provenance for eScience: ‘Meaningful’ Metadata to Manage the Deluge of Scientific Data

[2] Ontology-driven Provenance Management in eScience: an Application in Parasite Research

Closure of Experimental Metadata

Name

Closure of Experimental Metadata

Owner

Jim McCusker

Background

Experimental workflow covers numerous laboratories, systems, and information models. For example, specimens are managed in a biospecimen management software package, the details of a particular experiment are encoded in MAGE object models, and the final analytic workflow is executed in statistical systems or workflow management tools like Taverna or GenePattern. A consistent data model for provenance for all of these tools allows researchers to have a complete picture of how the data was produced and what it would take to reproduce it.

Goal

Gain a complete understanding of the experiment and its artifacts.

Use Case Scenario

A biologist is evaluating the results of an experiment, and wants to ensure that there were no confounding classes that weren't controlled for. A number of samples were used in the experiment, and the biologist needs to discover the full history of the samples in question, including how they were prepared, and the (deidentified) clinical history of the sources of the samples. The biologist must pull together the closure of information about artifacts that were used in the experiment.

Problems and Limitations

Currently, information about experiments and their samples can only be pulled together by querying multiple database systems with different information models. A unified model for this history would allow users to perform one federated query (possibly many times, to follow the transitive closure) over one information model.
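
The "one federated query, repeated to follow the transitive closure" idea can be sketched as follows. Two dictionaries stand in for separate systems that have been mapped to the same question ("what was this artifact derived from?"). The system names, artifact identifiers, and the dictionary-based interface are assumptions of this example, not the actual interfaces of any biospecimen or workflow tool.

```python
from collections import deque

# Each system maps an artifact to the artifacts it was directly derived from.
biospecimen_db = {"sample:S1": ["specimen:T1"], "specimen:T1": ["patient:P1"]}
experiment_db = {"array:A1": ["sample:S1"], "result:R1": ["array:A1"]}
sources = [biospecimen_db, experiment_db]


def federated_parents(artifact):
    """One federated query: ask every system for direct predecessors."""
    parents = []
    for db in sources:
        parents.extend(db.get(artifact, []))
    return parents


def closure(artifact):
    """Repeat the federated query until no new artifacts are discovered."""
    seen, queue = set(), deque([artifact])
    while queue:
        for parent in federated_parents(queue.popleft()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(closure("result:R1")))
```

The closure of the result crosses from the experiment system into the biospecimen system without the caller needing to know which system holds which step.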

Locating Biospecimens With Sufficient Quality

Name

Locating Biospecimens With Sufficient Quality

Owner

Joshua Phillips

Background

In order to identify specimens of sufficient quality, researchers need information about how the specimen has been stored, when it was thawed, what procedure was used to collect it, etc. The Cancer Biomedical Informatics Grid (caBIG) [1] program has implemented caTissue [2], a biorepository tool for biospecimen inventory management, tracking, and annotation. The caTissue information model includes information about collection, storage, quality assurance, and distribution of specimens. This information could be represented as, for example, an OPM graph. Multiple provenance-related queries are described in [3].

[1] https://cabig.nci.nih.gov/

[2] https://cabig.nci.nih.gov/tools/catissuesuite

[3] https://cabig-kc.nci.nih.gov/Biospecimen/KC/index.php/CaTissue_1.1_Deployment_Guide_Chapter_6:_Deploying_caTissue_caGrid_Data_Service#Running_the_caGrid_test_queries

Goal

Describing caTissue data in a common provenance model would enable:

  1. Answering more sophisticated provenance-related queries than are now possible using the caBIG Query Language (CQL).
  2. A more flexible approach to describing provenance than is possible using the current caTissue information model.
  3. Combining this information with provenance information from other tissue banking repositories to enable high-level, federated query.
  4. Combining this information with other provenance information (e.g. microarray experiment provenance) to enable assembling more complete analyses.

Use Case Scenario

A researcher executes a query across multiple tissue repositories for all specimens that

  • were collected by procedure XYZ
  • have a clinical diagnosis of ABC
  • were fixed in formalin for 30 minutes or less and were embedded in a low melting point paraffin

The researcher receives candidate specimens from each system.
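
A minimal sketch of this federated query is shown below, assuming each repository's data has already been mapped to a common record shape. The Specimen fields, the repository names, and the sample records are invented for illustration and do not reflect the actual caTissue information model or CQL.

```python
from dataclasses import dataclass


@dataclass
class Specimen:
    """Hypothetical common record shape shared by all repositories."""
    specimen_id: str
    procedure: str
    diagnosis: str
    formalin_minutes: int
    embedding: str


def matches(s: Specimen) -> bool:
    # The three criteria from the scenario above.
    return (s.procedure == "XYZ"
            and s.diagnosis == "ABC"
            and s.formalin_minutes <= 30
            and s.embedding == "low melting point paraffin")


repositories = {
    "repo_a": [
        Specimen("a1", "XYZ", "ABC", 25, "low melting point paraffin"),
        Specimen("a2", "XYZ", "ABC", 45, "low melting point paraffin"),
    ],
    "repo_b": [
        Specimen("b1", "XYZ", "ABC", 30, "low melting point paraffin"),
        Specimen("b2", "QRS", "ABC", 10, "low melting point paraffin"),
    ],
}


def federated_query(repos):
    """Return candidate specimen ids per repository."""
    return {name: [s.specimen_id for s in specimens if matches(s)]
            for name, specimens in repos.items()}


print(federated_query(repositories))
```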

Problems and Limitations

  1. caBIG data sources are currently not exposed through SPARQL endpoints.
  2. The caTissue model is expressed as UML, which would need to be mapped to some appropriate ontology.

Using process provenance for assessing the quality of Information products

Owner/Curator

Paolo Missier

Provenance Dimensions

  • Primary: Trust (Use)
  • Secondary: Attribution (Content)

Background and Current Practice

Assessing the quality of information for specific application domains, notably in eScience, is predominantly an information consumer task, aimed at establishing whether a piece of information is fit for use in the context of an application. Unfortunately, a quantitative assessment based on well-defined quality metrics is not always possible. Indeed, for many common but complex types of scientific data, no agreed-upon metrics are available. One reason is that many variables (indicators) contribute to the accuracy, precision, and ultimately, reliability and trustworthiness of a data product that emerges from a complex experimental pipeline, and analytical models that explain overall quality in terms of those variables are difficult to develop.

A promising alternative to analytical models is to learn (heuristic) models that correlate indicator variables, obtained from large collections of datasets, with their user-perceived quality levels. Such a machine learning approach relies on a large number of examples and the corresponding user decisions, i.e., accept vs reject, to establish correlations between the state of the indicator variables and the outcome.

Information quality assessment is rarely performed on the output of complex eScience processes. We have described some of the information quality problems that arise in a specific area of the life sciences, namely qualitative proteomics, in a recent survey [1], and a partial, non-inductive but practical approach to the problem is described in a recent demo [2].

Goal

This scenario illustrates a case where such correlation between indicator variables and perceived quality of a piece of information is computed. This is relevant to process provenance in that the indicator variables represent statements about the history (for example derivation, attribution) of information.

Use Case Scenario

Suppose a scientist runs a workflow to identify some of the proteins that are manifested on a 2D gel. A number of technologies are routinely used for this purpose. Protein identification by mass spectrometry, for example, relies on a "mass fingerprint" that describes the protein peptides found on the gel, and works by matching such a fingerprint against a database holding a large set of pre-computed fingerprints for known proteins. In this example, a web service may be used to match a fingerprint against a database. Throughout the experimental pipeline, a number of experimental problems may contribute to a poor outcome. These may include some environmental parameters in the lab, details of the wet lab portion of the experiment, as well as the parameters used for the match, the type and version of the database, and more. Each of these details can potentially be captured as the experiment is performed, and later used as a source of quality indicators.

The user's goal in this case is twofold: (a) to manually label a set of experiments with a personal assessment of the quality of the outcome, and (b) to use such labelling, in combination with the available quality indicators, captured as described above, in order to induce quality models to be made available to the community and later applied to further, unlabelled experiments.
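
The induction step in (b) can be illustrated with a deliberately simple learner. The use case does not prescribe a learning algorithm; the sketch below substitutes a one-rule (OneR-style) classifier, which picks the single indicator whose values best predict the user's accept/reject labels. The indicator names (db_version, lab_temp), their values, and the labelled examples are invented for this example.

```python
from collections import Counter, defaultdict

# Labelled experiments: quality indicators captured from provenance,
# plus the user's personal accept/reject assessment.
experiments = [
    ({"db_version": "v2", "lab_temp": "normal"}, "accept"),
    ({"db_version": "v2", "lab_temp": "high"}, "accept"),
    ({"db_version": "v1", "lab_temp": "normal"}, "reject"),
    ({"db_version": "v1", "lab_temp": "high"}, "reject"),
    ({"db_version": "v2", "lab_temp": "normal"}, "accept"),
]


def learn_one_rule(examples):
    """Pick the single indicator whose values best predict the label."""
    best = None
    for indicator in examples[0][0]:
        counts = defaultdict(Counter)
        for indicators, label in examples:
            counts[indicators[indicator]][label] += 1
        # For each indicator value, predict the most frequent label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[ind[indicator]] for ind, label in examples)
        if best is None or errors < best[2]:
            best = (indicator, rule, errors)
    return best


indicator, rule, errors = learn_one_rule(experiments)


def predict(indicators, default="reject"):
    """Apply the induced model to a further, unlabelled experiment."""
    return rule.get(indicators[indicator], default)


print(indicator, rule, errors)
print(predict({"db_version": "v1", "lab_temp": "normal"}))
```

A realistic quality model would need many more examples and a richer learner; the point here is only that the indicators come from provenance and the labels from the user.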

Problems and Limitations

There are two types of problems. Firstly, there are those common to all inductive methods: the method is inherently heuristic, in that it induces a model that is subject to errors (misclassifications), and the minimal amount and variety of labelled examples needed to generate a useful model varies widely, as it depends upon the amount of correlation that can be established.

Secondly, and more to the point in the context of provenance, the types of indicator variables that can be captured during the experiment can be very heterogeneous, i.e., specific to the type of data and the type of experiment, making it difficult to generalize to a commonly useful provenance model. It should be clear that a generic provenance model, used for example to describe causal relationships across data elements consumed and produced by a workflow, is not sufficient. Domain-specific annotations on such a causal graph are also needed in order to use provenance as a valuable source of quality indicators.

Existing Work

  1. D. Stead, N. Paton, P. Missier, S. Embury, C. Hedeler, B. Jin, A. Brown, and A. Preece, "Information Quality in Proteomics," Briefings in Bioinformatics, vol. 9, 2008, pp. 174-188.
  2. P. Missier, S.M. Embury, R.M. Greenwood, A.D. Preece, and B. Jin, "Managing information quality in e-science: the Qurator workbench," SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, New York, NY, USA: ACM, 2007, pp. 1150-1152.

Provenance Tracking in the Blogosphere

Owner

Chris Bizer

(Curator: Satya Sahoo)

Provenance Dimensions

  • Primary: Attribution (Content), Evolution and versioning (Content)
  • Secondary: Scale (Management), Law Enforcement (Management), Understanding (Use), Trust (Use), Incomplete provenance (Use)

Background and Current Practice

Within the Blogosphere, topics are discussed across blogs that refer to each other, for example on personal blogs, project weblogs, and on company blogs. The cross references are in the form of links at the bottom of a blog post, hyperlinks within a blog post, and quotation of text from other blogs. Blog posts are also aggregated and republished by services like Technorati, BlogPulse, Tailrank, and BlogScope, that track the interconnections between bloggers. Correct attribution of blogs, as they are processed, aggregated and republished on the Web, is an important requirement in the blogosphere.

Goal

Enable applications on the Web to attribute content from different sources to a specific individual or an organization.

In this use case, blogs are an example of content flow between websites, and it is important to trace back republished posts to their original source.

Use Case Scenario

A website X collates Web content from multiple sites on a particular topic that is processed and aggregated for use by its customers. It is imperative for website X to present only credible and trusted content on its site to satisfy its customer, attract new business, and avoid legal issues.

In the context of this blogosphere use case, a blog aggregator service or a user wants to identify the author of a blog without violating privacy laws. In some scenarios, the aggregator service or user may have only incomplete attribution information. When the author of a blog is listed by name (first name, last name), disambiguating the author is difficult if multiple blog authors share the same name. This may require the use of additional user information (for example, an email address) without violating user privacy or privacy laws.

Problems and Limitations

The provenance of Web content in general, and of blog posts in particular, is necessary both to users, for correct attribution, and to aggregating services. Aggregating services require provenance information not only to attribute content but also to offer additional services such as ranking popular blog posts.

Technical Challenges:

  • Enable trace-back and correct attribution without violating user privacy and privacy laws
  • Disambiguating content authors with incomplete provenance information
  • Extend existing vocabulary for representing posts, such as SIOC, to model finer granularity provenance information.

Existing Work

The SIOC project has developed a vocabulary for representing posts. This vocabulary is often used together with FOAF (which represents information about the physical person related to a sioc:User, e.g. their first name, last name, phone, social network, etc.) and SKOS, which is used mainly to represent topics and the taxonomy relationships between these topics.
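
One existing FOAF mechanism that bears on the disambiguation challenge is the foaf:mbox_sha1sum property, which holds the SHA-1 hash of a person's mailto: URI instead of the address itself. The sketch below computes that hash and uses it, together with the displayed name, to separate authors who share a name. The post records and email addresses are invented; grouping by (name, hash) is an assumption of this example, not a method proposed by the use case. A hash of a guessable address can still be confirmed by anyone who guesses it, so this limits casual exposure rather than guaranteeing privacy.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """SHA-1 of the mailto: URI, as used by foaf:mbox_sha1sum."""
    return hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()


# Hypothetical aggregated posts, all signed with the same author name.
posts = [
    {"post": "p1", "author_name": "John Smith", "email": "john@example.org"},
    {"post": "p2", "author_name": "John Smith", "email": "jsmith@example.com"},
    {"post": "p3", "author_name": "John Smith", "email": "john@example.org"},
]


def group_by_author(posts):
    """Group posts by (name, hashed mailbox) without storing the address."""
    groups = {}
    for p in posts:
        key = (p["author_name"], mbox_sha1sum(p["email"]))
        groups.setdefault(key, []).append(p["post"])
    return groups


groups = group_by_author(posts)
print(sorted(groups.values()))
```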

Provenance of a Tweet

Owner

Proposed by Paul Groth

(Curator: Satya Sahoo)

Provenance Dimensions

  • Primary: Republishing
  • Secondary: Interoperability, Responsibility, Understanding

Background and Current Practice (optional)

Many services are now available that allow one to microblog or publish short status messages about events, links, what one is doing, etc. A canonical example is the service Twitter. Users post public short messages of up to 140 characters, called Tweets. A phenomenon within Twitter is retweeting: reposting someone else's tweet as your own message. Initially, users denoted a retweet with the prefix RT and the @user sign to attribute it to the original user. For example:

RT @ivan_herman A good link to have for #linkeddata: Data Incubator http://bit.ly/6lComj
7:21 AM Nov 26th from TwitBird iPhone in reply to ivan_herman

However, after a while the provenance of a tweet is lost as people retweet retweets. Because of the 140-character limit, provenance metadata soon overwhelms the actual content, and the original content of a tweet is sometimes lost. This spurred Twitter to introduce new functionality that allows users to retweet from within the site without including provenance metadata within the Tweet itself.
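
The in-text convention can be made concrete with a small parser that peels off leading "RT @user" prefixes to recover the attribution chain. The regular expression reflects only the informal convention described above, and the user carol in the usage example is hypothetical. The sketch also shows how much of the message the prefixes consume.

```python
import re

# Informal convention: "RT @user" (optionally followed by a colon).
RT_PREFIX = re.compile(r"^\s*RT\s+@(\w+):?\s*")


def attribution_chain(tweet: str):
    """Strip leading 'RT @user' prefixes; return (users, remaining text).

    The last user in the list is the presumed original author.
    """
    users = []
    while True:
        m = RT_PREFIX.match(tweet)
        if not m:
            break
        users.append(m.group(1))
        tweet = tweet[m.end():]
    return users, tweet


retweet = ("RT @carol RT @ivan_herman A good link to have for "
           "#linkeddata: Data Incubator http://bit.ly/6lComj")
users, content = attribution_chain(retweet)
print(users)
print(content)
print(len(retweet) - len(content), "characters spent on provenance")
```

Once a retweeter truncates or rewrites the prefixes to fit the length limit, this kind of recovery no longer works, which is the loss described above.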

Current practice includes tracking microblog messages within a single website. See twitter retweet functionality.

Goal

Broadly, determine how content changes (i.e. its version) across the web and who is responsible for those changes. In this specific instance, determine the original author and content of a microblog message (e.g. tweet). Determine any changes and the attribution of those changes as the microblog message is reposted (e.g. retweeted). Do this across many different web sites.

Use Case Scenario

The general use case is as follows:

  1. Alice retrieves (copies) content from Bob's website.
  2. Alice modifies this content and posts it to another website.
  3. Carol accesses the new website and wants to determine who is responsible for what part of its content.

Twitter Specific Use Case:

  1. Bob posts a message on Twitter, which is automatically cross-posted to Facebook.
  2. Alice, who is a friend of Bob's, reposts the message on Facebook, adding a comment.
  3. Carol sees Alice's message on Facebook and wants to know what the original message was, what is comment, and who is responsible for which part of the message.


Problems and Limitations

Current solutions only work inside one web site. It is difficult to keep track of what is comment and what is original content. There is also a distinction between editing a post and just passing it on.

Technical Challenges:

  • How to track the alterations to Web content outside of a web browser.
  • Identifying users that create and modify content. Are signatures or OpenIDs needed, or just a URL?
  • Aggregating changes from multiple websites requires agreed upon representations.
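
One possible agreed-upon representation is a record per post that holds only the text its author contributed, plus a link to the post it passes on. The sketch below uses such records to answer Carol's question in the Twitter-specific scenario. The Post fields, the example.org URLs, and the message texts are all hypothetical; authors are identified by plain names here, which sidesteps the identification challenge rather than solving it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    url: str
    author: str
    site: str
    text: str                          # text contributed by this author only
    reposts: Optional["Post"] = None   # the post being passed on, if any


def responsibility(post):
    """List (author, site, text) from the original message to the latest comment."""
    chain = []
    while post is not None:
        chain.append((post.author, post.site, post.text))
        post = post.reposts
    return list(reversed(chain))


def original(post):
    """Follow the repost links back to the first message."""
    while post.reposts is not None:
        post = post.reposts
    return post


tweet = Post("https://twitter.example.org/bob/1", "Bob", "Twitter",
             "Flood warning lifted")
# An empty text marks a post that was merely passed on, not edited.
crosspost = Post("https://facebook.example.org/bob/7", "Bob", "Facebook",
                 "", reposts=tweet)
alice = Post("https://facebook.example.org/alice/3", "Alice", "Facebook",
             "Good news!", reposts=crosspost)

for author, site, text in responsibility(alice):
    print(author, site, repr(text))
print(original(alice).url)
```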

Unanticipated Uses (optional)

I think the same approach could be used for this use case and the Use_Case_Provenance_in_Blogosphere.

Existing Work (optional)

An excellent discussion of the issues with retweeting and the introduction of retweet functionality can be read in a blog post by Evan Williams: Why Retweet works the way it does

Provenance and Private Data Use

Owner

Rocio Aldeco-Perez and Luc Moreau (contact)

(Curator: Simon Miles)

Provenance Dimensions

  • Primary: Accountability (use)
  • Secondary: Dissemination control - Law Enforcement (use), Process (content)

Background and Current Practice Scenario

Many on-line facilities offer personalised services by requesting private information from their users. Such private information must be used under a set of rules that describe which processing can and cannot be performed on the data. If these usage rules are not followed, personal data could be exposed and used against the interests of its owner.

Evidence of the importance of this issue can be seen in legislative frameworks related to the use of private information, such as the Data Protection Act in the UK, the European Directive on Private Data, and HIPAA and Safe Harbor in the US.

Goal

Here, we adopt Weitzner's notion of accountability [1]: "accountability must become a primary means through which society addresses appropriate use..." Information accountability means the use of information should be transparent so it is possible to determine whether a particular use is appropriate under a given set of rules, and that the system enables individuals and institutions to be held accountable for misuse.

The goal of this use case is to perform auditing tasks on previous usage of private data, and to check that such usage complies with the rules regulating the use of private information. Inspired by the UK Data Protection Act, we identified the following specific tasks:

  • Legal Purpose: To verify that a set of data was processed for a valid purpose.
  • Declared Purpose Compliant: To verify that a set of data was used in processing that is compatible with the purpose for which it was collected.
  • Authentication: To verify that a set of data, which was collected from a user, was used by processes that initiated such collection.
  • Minimal Set: To verify that all the data that was collected from a user was used at some point.

Use Case Scenario

The general scenario structure is as follows.

1. Alice wants to interact with an online service. In order to do so, she needs to provide personal information.

2. The online service uses that personal information for a particular, pre-stated purpose.

3. Later, Alice suspects that the personal information was used in a way other than the pre-stated purpose.

4. Upon request, an independent authority determines whether Alice's doubts are founded and performs an equivalent check across many individuals who have used the service.

This scenario can be applied to the particular domain below, which inspired this use case.

1. Alice wants to buy some medicine from an on-line pharmacy. In order to get her medicine, she needs to provide her name, address, date of birth, gender, social security number, the number of her clinic and her doctor’s name.

2. The pharmacy collected that set of data with the purpose of "on-line sales". So her name, address, date of birth, social security number, the number of her clinic and her doctor’s name are used to register the sale of that medicine with the Health Service. The name and address are used to send the medicine to Alice.

3. Later, the pharmacy creates a record of the monthly sales, which includes the medicine’s name and the quantity sold.


What if the pharmacy decides to include Alice's name next to the medicine she bought in the record of monthly sales? Alice did not provide her name to be used in a record that could be used to find specific individuals who suffer from certain illnesses related to the medicines they bought. How can Alice be sure that her information was used in a way compatible with the purpose for which she initially sent it?

In practice, independent institutions, such as the Information Commissioner in the UK, carry out audits to verify that individuals or institutions that manage personal information are following the data protection rules, so that they can be held accountable for information misuse.

If the pharmacy creates a register containing the information that it plans to collect from Alice, the processes to be performed on it, and the purpose of collecting it, then we can use that register as a set of rules that the pharmacy should follow when using Alice’s information.

If, at the same time, Alice and the pharmacy assert provenance information related to their actions, that provenance information can later be compared against the registered set of rules to verify whether the pharmacy actually used Alice’s information in the right way.

Thus, if the pharmacy registers the creation of a record of monthly sales that includes the medicines' names and the quantities sold under the on-line sales purpose, then it can create the record but it cannot use Alice's name in it. If the pharmacy does so anyway, we can find out by checking the provenance information related to that activity, and then hold the pharmacy accountable for misusing Alice’s information.
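
The comparison of asserted provenance against the register can be sketched as follows. The register layout (purpose, then process, then permitted data fields), the process names, and the field names are assumptions made for this example; they are not the format of an actual Information Commissioner notification. The second function illustrates the Minimal Set task by reporting collected fields that no process ever used.

```python
# Register filed by the pharmacy: purpose -> process -> fields it may use.
register = {
    "on-line sales": {
        "register_sale": {"name", "address", "date_of_birth",
                          "social_security_number", "clinic_number",
                          "doctor_name"},
        "ship_medicine": {"name", "address"},
        "monthly_sales_record": {"medicine_name", "quantity"},
    }
}

# Fields Alice was asked to provide.
collected = {"name", "address", "date_of_birth", "gender",
             "social_security_number", "clinic_number", "doctor_name"}

# Provenance assertions recorded as the processes ran.
provenance = [
    {"process": "register_sale", "purpose": "on-line sales",
     "used": {"name", "address", "date_of_birth", "social_security_number",
              "clinic_number", "doctor_name"}},
    {"process": "ship_medicine", "purpose": "on-line sales",
     "used": {"name", "address"}},
    {"process": "monthly_sales_record", "purpose": "on-line sales",
     "used": {"medicine_name", "quantity", "name"}},
]


def violations(register, provenance):
    """Report processes that ran unregistered or used unregistered data."""
    found = []
    for record in provenance:
        allowed = register.get(record["purpose"], {}).get(record["process"])
        if allowed is None:
            found.append((record["process"], "unregistered process or purpose"))
        else:
            extra = record["used"] - allowed
            if extra:
                found.append((record["process"],
                              "unregistered data: " + ", ".join(sorted(extra))))
    return found


def unused_fields(collected, provenance):
    """Minimal Set check: collected fields that no process ever used."""
    used = set().union(*(r["used"] for r in provenance))
    return collected - used


print(violations(register, provenance))
print(unused_fields(collected, provenance))
```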

Many alternative on-line scenarios can be considered in this use case, such as universities, Facebook, Google, governmental services, etc.

Problems and Limitations

The main technical challenges in this use case are described here.

  • Institutions or individuals that manage personal information (in this case the pharmacy) should register, in a well-defined fashion, the purposes for which and the way in which they plan to collect and use users’ information. This process is similar to the Notification Process established by the Information Commissioner's Office (see [2]). This registered information will be treated as the rules that such institutions should follow while processing personal information. An example of the document produced during the notification by a pharmacy can be found in [3]. This problem can be addressed by using semantic web technologies to represent purposes of collection, tasks performed over users' information and the set of information that will be collected from users. This is a metadata representation issue.
  • All the entities involved (in this case the on-line pharmacy and Alice) need to capture in a standard way the provenance information related to their actions. In that way, the analysis of the actions of the entities can be automated. This is a provenance content and management issue.
  • To effectively make entities accountable for misuse of information, we need to guarantee that the provenance information created by the involved entities implements some form of entity identification and provenance integrity. Then, if a problem is found in the processing of personal information, the right entity can be made accountable by checking its identity. At the same time, if provenance integrity is guaranteed, entities can be sure that the actions that they asserted are represented in the provenance information and that no other entity was able to change it. This problem can be addressed by the use of cryptographic techniques, such as signatures to verify the entities’ identity and cryptographic hashes to check the integrity of provenance chains. This is a provenance content and management issue.
  • Provenance information created by the entities involved in a processing can be compared against the registered rules to verify if they used personal information in the right way. This is a provenance use issue.
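
The entity identification and integrity requirement can be sketched with a hash chain in which every provenance entry records the hash of its predecessor and carries an authentication tag from the asserting entity. To stay within the Python standard library, the tag here is an HMAC computed with a per-entity secret key; this is a stand-in for the public-key signatures mentioned above, and unlike a real signature it can only be checked by a verifier who holds the keys. The entry layout, key values, and actor names are assumptions of this example.

```python
import hashlib
import hmac
import json


def _encode(obj) -> bytes:
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def entry_hash(entry) -> str:
    return hashlib.sha256(_encode(entry)).hexdigest()


def append(chain, actor, action, key):
    """Add an entry linked to the previous one and tagged by the actor."""
    prev = entry_hash(chain[-1]) if chain else ""
    body = {"actor": actor, "action": action, "prev": prev}
    tag = hmac.new(key, _encode(body), hashlib.sha256).hexdigest()
    chain.append({**body, "tag": tag})


def verify(chain, keys) -> bool:
    """Check every link and every tag; any tampering makes this False."""
    prev = ""
    for entry in chain:
        key = keys.get(entry["actor"])
        if key is None or entry["prev"] != prev:
            return False
        body = {k: entry[k] for k in ("actor", "action", "prev")}
        expected = hmac.new(key, _encode(body), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(entry["tag"], expected):
            return False
        prev = entry_hash(entry)
    return True


# Hypothetical keys; a real deployment would use per-entity key pairs.
keys = {"alice": b"alice-secret", "pharmacy": b"pharmacy-secret"}
chain = []
append(chain, "alice", "sent name and address", keys["alice"])
append(chain, "pharmacy", "used name and address to ship medicine",
       keys["pharmacy"])
print(verify(chain, keys))

chain[0]["action"] = "sent nothing"   # a later change to an asserted action
print(verify(chain, keys))
```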

Existing Work (optional)

Aldeco-Pérez, R. & Moreau, L. Provenance-based Auditing of Private Data Use. International Academic Research Conference, Visions of Computer Science, 2008 [4]

Provenance of Decision Making in Emergency Response

Name

Provenance of Decision Making: Tracing Decisions Made in Emergency Response Situations

Owner

Iman Naja and Luc Moreau

(Curator: Simon Miles)

Provenance Dimensions

  • Primary: Justification for Decisions
  • Secondary: Attribution, Accountability (Use), Imperfections (Use)


Background and Current Practice

In response to major incidents, different emergency services (e.g., Police, Fire Brigade, Ambulance Service, Local Authorities, the Military) must perform their designated roles according to their assigned responsibilities, as well as co-ordinate effectively with each other and with volunteers to ensure maximal efficiency in saving and protecting lives and relieving suffering.

In a flooding disaster, several decisions are made by emergency workers and implemented on the ground. Police and Fire Brigade must evacuate civilians from flooded buildings, or buildings under threat of being flooded, according to some prioritisation scheme. For example, critical infrastructure like care homes, hospitals, and schools has higher priority than houses and businesses. In turn, houses with vulnerable occupants have higher priority than other houses. Evacuated civilians who need medical attention are taken to a triage area to be examined by medics. Medics use a triage scheme to prioritise the care and delivery of patients to be taken to hospitals in ambulances.

Goal

Goals:

  • Identify the decisions that may have to be revisited given some new observation on the ground invalidating or complementing previous knowledge.
  • Answer what-if questions (hypothesis management) as well as why-not questions.
  • Trace the flow of decisions to hold the decision-makers accountable for any major incorrect decisions made.

Provenance will help identify dependencies between decisions and facts.

Use Case Scenario

A chain of reasoning, based on some initial data, leads to a decision which affects all instances of a class of problem. Later, the following occur. First, some of the data is found to have been wrong and/or what was true at the time is updated to the current situation. In these cases, the same reasoning is applied to the updated data, changing the future decisions for the same class of problem. Second, a hypothetical situation in which some data is different is considered using the same reasoning, to see if the outcome would be different. Finally, for a specific instance, the data and reasoning which led to the decision are explained. The specific domain which motivates this scenario is as follows.

Argumentation - wrong data: In a flooding event, search and rescue workers decide to evacuate buildings R11 through R15, which are houses. While evacuating, the rescue officers notice that there are people screaming for help in a nearby care home R16, with no one attempting to evacuate them. The provenance of the decision to evacuate R11 to R15 is used to deduce that the initial decision was made based on the information that R16 is empty. A new decision needs to be made on whether to carry on with evacuating R11 to R15 or to re-prioritise and evacuate the critical infrastructure R16, which is not empty as previously assumed.

Argumentation - facts updated: Ambulance A1 was assigned to transport patients from the triage area to Hospital H1. Upon arriving, medics in A1 learn that H1 is full and can no longer take in more patients. A1 reports this new fact to the medics in the triage area. The medic M in charge needs to contact the other ambulances that are on their way to H1 as well as initiate a re-triage for patients waiting to be taken to H1. M uses the provenance of the decision to transport patients, which was based on the information that H1 is available, to re-prioritise triage patients and to produce new goals for the ambulances currently heading towards H1.

Hypothesis (what if): In a related scenario, search and rescue decide not to evacuate buildings R21 through R27 because the street is no longer under threat of flood. They decide to deliver medical supplies, food, and water to the residents of those houses. However, an observing police officer is not sure that this is the correct action to take. Provenance is used to indicate whether the decision was made based on definitive information. Provenance is also used to check whether a different decision should have been made in case the street is not really secure, for example by deducing whether the priority of allocating resources to distribute needed goods outweighs the priority of allocating these same resources to flood defences, e.g. sandbags, to secure the street.

Why-not questions: A few months after the incident, an inquiry committee is put in place to examine what went wrong and highlight where more lives could have been saved. In the reports of medics in the triage area, patient P1 who was categorised as delayed, i.e. non-urgent, suddenly seized and - even though immediate medical care was provided by medics - the patient died. The provenance of the decision to triage the patient as non-urgent is reviewed to answer the question of why the patient was not triaged as urgent.
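
The dependency tracking that the four cases above rely on can be sketched roughly as follows. This is only an illustration in Python, not part of the use case itself: the Decision record, the fact labels, and the helper names are all hypothetical, and the sketch assumes that each decision was recorded together with the facts it was based on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """A decision together with the facts it was based on (hypothetical record)."""
    name: str
    based_on: frozenset


def decisions_to_revisit(decisions, invalidated_fact):
    """Return the names of decisions that depended on a fact now known to be wrong or outdated."""
    return [d.name for d in decisions if invalidated_fact in d.based_on]


def explain(decision):
    """List the recorded facts behind a decision, e.g. for a why or why-not inquiry."""
    return sorted(decision.based_on)


# Illustrative data loosely following the scenario; the fact labels are assumptions.
decisions = [
    Decision("evacuate R11-R15", frozenset({"R16 is empty", "R11-R15 are flooded"})),
    Decision("send A1 to H1", frozenset({"H1 is available"})),
]

print(decisions_to_revisit(decisions, "R16 is empty"))
print(explain(decisions[1]))
```

A what-if question can be asked in the same way, by passing a fact that is only hypothetically invalid and inspecting which decisions would need to be reconsidered.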


Problems and Limitations

Because of the hectic situation on the ground, not all provenance information of how decisions were made may be recorded immediately. Additionally, emergency responders may not record all the information that they took into consideration when making certain decisions. It must be clarified for emergency responders that even though recording such data may look like a waste of time on the spot, it is crucial for tracking back decisions during the emergency as well as for accountability after the emergency.

Also, not all provenance information may be available to everyone involved in responding to the disaster. For example, information produced by the Police may not be accessible to the Ambulance Services or to the voluntary sector. This may restrict the interference of observers because of their inability to judge whether decisions were made based on missing information or whether a wrong decision was made even though the information was available.

Finally, in large scale disasters there may be hundreds of emergency responders and thus the scale of recorded provenance may grow enormously.



Unanticipated Uses

The scenario above describes a particular case of using technology. However, by allowing this scenario to take place, the technology allows for other use cases. This section captures unanticipated uses of the same system apparent in the use case scenario.

Provenance of Collections vs Objects in Cultural Heritage

Use Case Collections vs Objects Cultural Heritage

Provenance at Different Levels in Cultural Heritage

Owner

Laura Hollink and Marieke van Erp

(Curator: Satya Sahoo)

Provenance Dimensions

  • Primary: Understanding (multiple levels of description)
  • Secondary: Trust (Reliability), Interoperability


Background and Current Practice

Online cultural heritage collections may contain art objects and structured metadata. Provenance of the art objects, such as information about who created them and who has owned them over the years, provides information about the cultural meaning of the objects. In addition, provenance information about the metadata is needed to determine whether the metadata is reliable or biased. For example, metadata created by contemporaries of the artist gives a different perspective than metadata created by the current museum curator. Increasingly, metadata is created automatically; it is extracted from text or other sources. In these cases, the reliability and perspective of the metadata depend on the sources it was based on. When, where and by whom were these sources created? Provenance of these underlying sources is therefore needed.

Currently, collections contain only one level of provenance, namely the provenance of the art objects. Provenance of the metadata of those objects, and provenance of the sources on which the metadata was based, are rarely recorded.

Goal

Provide a means to deal with the historical record of art and cultural heritage objects at three different levels: the object itself, its metadata, and the sources on which the metadata was based.

Specifically, in this use case, find cultural heritage objects based on only a particular category of metadata related to those objects.

Use Case Scenario

A researcher, for example a historian, investigates the events around the end of the colonial era. She searches the annotated collection of a cultural heritage institute for objects, such as drawings, sculptures and books, created in the relevant period. She needs to distinguish between different views on the data. Therefore, she makes a distinction between drawings made by Indonesians and drawings made by the Dutch colonizers. Moreover, she wants to base her selections only on metadata created by Indonesians. Part of the collection is annotated automatically, for example by entity extraction from text. In these cases, she wants only those items whose automatic metadata was based on Indonesian resources.
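
The researcher's selection could be expressed roughly as below. This is a minimal Python sketch under stated assumptions: the Annotation and HeritageObject records, their field names, and the sample collection are all hypothetical, and the sketch assumes that provenance has been recorded for the object, for each piece of metadata, and for the source behind any automatically created metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One piece of metadata with its own provenance (hypothetical fields)."""
    value: str
    created_by: str                      # origin of the annotator, or "automatic"
    source_origin: Optional[str] = None  # origin of the source an automatic annotation was extracted from


@dataclass
class HeritageObject:
    title: str
    maker_origin: str   # provenance of the object itself
    annotations: list   # provenance-carrying metadata


def acceptable(annotation, origin):
    """Manual metadata must be created by `origin`; automatic metadata must be based on sources from `origin`."""
    if annotation.created_by == "automatic":
        return annotation.source_origin == origin
    return annotation.created_by == origin


def select(objects, maker_origin, metadata_origin):
    """Keep objects by the given makers, described only by metadata that passes the provenance filter."""
    result = []
    for obj in objects:
        if obj.maker_origin != maker_origin:
            continue
        kept = [a.value for a in obj.annotations if acceptable(a, metadata_origin)]
        if kept:
            result.append((obj.title, kept))
    return result


# Made-up collection for illustration only.
collection = [
    HeritageObject("Drawing 1", "Indonesian", [
        Annotation("harbour scene", "Indonesian"),
        Annotation("colonial port", "Dutch"),
    ]),
    HeritageObject("Drawing 2", "Indonesian", [
        Annotation("market", "automatic", source_origin="Dutch"),
    ]),
    HeritageObject("Drawing 3", "Dutch", [
        Annotation("plantation", "automatic", source_origin="Indonesian"),
    ]),
]

print(select(collection, "Indonesian", "Indonesian"))
```

The filter touches all three levels: the object (maker_origin), the metadata (created_by), and the source of automatic metadata (source_origin).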

Problems and Limitations

We have identified provenance at three levels: provenance of the museum objects, provenance of the metadata of those objects, and provenance of the sources on which the metadata was based. The latter is especially important when the metadata was created automatically.

One thing to note is that in theory the levels of provenance can go on endlessly. We could, for example, record the provenance of the material on which the sources are based. In practice, this is not feasible. In this use case we argue that we need provenance of the sources on which metadata is based, especially if we are dealing with automatically created metadata. Further levels of provenance do not seem necessary at this point.

Not only text extraction algorithms but also human art experts base their metadata on external sources. There is nothing that prevents us from recording the provenance of the sources used by human art experts.

Technical Challenges:

  • Not all of the metadata is digitized, and what is digitized varies in quality. For example, some information may be in databases while other information consists purely of scanned items.
  • Many objects have metadata that is multilingual.
  • Determining at what level the metadata is applicable.

Existing Work (optional)

Identifying Attribution And Associations

Owner

Yolanda Gil

Curator

Paolo Missier

Provenance Dimensions

  • Primary: Attribution (Content)
  • Secondary: Access (Management), Trust (Use)

Background and Current Practice

This use case is inspired by the Trellis project, where we studied how to capture the analysis process when contradictory and complementary information is available. The analysis process is captured as argumentation structures that refer to source attribution and trust metrics.

Millions of people consult web resources daily, and painstakingly analyze complementary and often contradictory information. As they perform this analysis, they look carefully at the attribution of the information, that is, which entities were involved in providing it: the writer (eg Ty Burr), the publication (eg The Boston Globe), the owner (eg the New York Times Company), the origin of the information (eg the person being quoted), etc. Some information providers are generally considered more trustworthy or reliable than others (eg a newspaper of record). Some sources are considered authoritative in specific topics. Some sources are preferred to others depending on the specific context of use of the information (e.g., student travelers may prefer cheaper travel sources, while business people may prefer more reliable ones). Some sources are preferred simply because they are known to keep their information up to date.

However, all this work that millions of people do every day is lost, ie, not captured on the web. Therefore, we all must start from scratch and simply rely on the rankings of search engines and our own limited expertise to do any task on the web. Users cannot easily access information about who is providing the information they are seeing. Worse yet, machines cannot access that information and assist users by reasoning about attribution and by assessment of sources.

Goal

For any resource on the web, users could look up its attribution, ie, the sources or entities that were involved in creating the resource.

Similarly, tools could be developed to do the same and assist users by reasoning about attribution and providing assessments about the entities involved in producing that information.

Use Case Scenario

A user finds a document on the web that quotes a New York Times article from  the REUTERS agency that contains the statement "At a press conference last Monday, a US Federal Reserve spokesperson reiterated that its chairman was not planning to raise the current interest rates".  The user may decide whether to believe (use) this information because of one or more of the entities that created it: the NYT, or REUTERS, the Fed spokesperson, or its chairman. The user would first have to find this set of attributions, and then use some criteria to discern whether to believe it. Some users may have stable criteria to do this kind of assessment, eg, always believe what the NYT publishes, never believe spokespersons for the US Federal Reserve.

In some cases, the user may not be very knowledgeable on the topic. When this is the case, they would want to know for example what other people consider to be the reliability of the entities that the user is currently trying to assess. This could potentially result in queries to some repository regarding what criteria others used before to dismiss or use information from the sources currently being assessed.

Other times, users may be knowledgeable enough to have their own criteria for assessing sources but may simply not find the information credible. In these cases, the user would want to know how many other independent sources can confirm this information. This could result in queries to retrieve similar assertions but exclude any resource that has similar attribution (same writer but maybe different newspaper would be considered similar attribution, while same newspaper but different spokesperson would be considered different).
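
The query for independent confirmations might look roughly like the sketch below. It is written in Python purely for illustration; the Attribution record and the similarity rule are assumptions. In particular, the rule encodes only one possible reading of the example above (sharing a writer makes two attributions similar, sharing only a publication does not), and a real system would need a more carefully defined notion of similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Entities involved in producing a statement (hypothetical record)."""
    writer: str
    publication: str
    origin: str  # the person or body being quoted


def similar(a, b):
    """Assumed rule: two attributions are similar when they share the same writer."""
    return a.writer == b.writer


def independent_confirmations(target, candidates):
    """Keep only candidates whose attribution is not similar to the target's."""
    return [c for c in candidates if not similar(target, c)]


# Placeholder names; none of these refer to real people or publications.
target = Attribution("writer-1", "paper-A", "spokesperson-1")
candidates = [
    Attribution("writer-1", "paper-B", "spokesperson-1"),  # same writer, different newspaper
    Attribution("writer-2", "paper-A", "spokesperson-2"),  # same newspaper, different spokesperson
]

print(independent_confirmations(target, candidates))
```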

Finally, attribution may not be the sole source for making these kinds of assessments. Other kinds of associations may be used as well. Examples include: cited-by, endorsed-by, criticized-by, opposed-to, financed-by, etc.

Problems and Limitations

It is unclear whether all the associations used to assess sources are based on provenance. For example, if a document was financed by an entity, it is unclear whether that entity was really involved in producing the information or bears any responsibility for it.

Other kinds of associations have to be carefully represented. Consider, for example, a Web resource that recommends a set of readings in the history of astronomy, and is maintained by an astronomy department on a university Web site. If the Web page is authored by a faculty member in the astronomy department, then a user would attribute the information to the university, the department, and the authoring professor. If the Web page is authored by a student on a temporary internship, who happens to like astronomy as a hobby, the user would not put as much weight in the association of the resource with the astronomy department or the university. This example illustrates that automatic association with entities can be tricky.

Existing Work

[Gil and Ratnakar ISWC02] describe an approach to enable users to express their  assessment of complementary and contradictory sources of information. As the  user considers information from different sources relevant to their purpose, they can view the ratings that other users assigned to the entities involved, and use those ratings to assess the information at hand.  Sources were assigned a reliability rating, and individual sources could be selected to express the criteria used to accept or dismiss information. The user could also assign credibility ratings based on other information available.

Determining Compliance with a License

Owner

Proposed by Paul Groth

(Curator: Simon Miles)

Provenance Dimensions

Primary: Accountability (licensing)

Secondary: Interoperability, Understanding

Background (optional)

There is a massive number of images on the web. These images come with a variety of licenses, including Creative Commons licenses. Often images come with no license, or the license is not clear. In addition, these licenses often place different requirements on any derivative works. For example, a Creative Commons license may require that all derivative works have the correct attribution, or may require that the original work (e.g. image) only be used in non-commercial works. Thus, using a particular image can affect what license a derivative work can have.

Determining whether a derivative work (e.g. mashup or set of slides) is compliant with the licenses of all of its contents is time consuming. Additionally, if those contents are themselves produced from other content, then determining compliance is even more difficult and may sometimes be impossible.

Goal

Given a document, determine whether that document is compatible with the licenses of all of its constituent parts, and find a compatible license for that document.

Current Practice Scenario (optional)

Current practice is for a user to verify compatibility with all known licenses by hand.

Use Case Scenario

Bob is creating a new PowerPoint presentation for a meeting.

  1. Bob starts with a PowerPoint presentation he received from Carol.
  2. He modifies, adds, and deletes slides. During these modifications he uses images that were already on his hard drive and those that he just got from the web.
  3. Bob wants to post the new slides on the Web. But before doing this, he needs to determine whether his usage of other material prevents him from posting.
  4. If he can post his slides, Bob wants to assign a license to them that is compatible with the licenses of the material used within.
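
Steps 3 and 4 can be sketched with a deliberately simplified model. The Python below is an illustration only and does not implement the terms of any real license: it assumes that each constituent's license has already been reduced to a set of condition labels (the labels "attribution", "non-commercial" and "no-derivatives" are hypothetical), that a part with a "no-derivatives" condition blocks reuse, and that otherwise the derivative work must carry every condition of its parts.

```python
def assess(parts):
    """Assess a derivative work.

    `parts` maps each constituent's name to a set of condition labels,
    or to None when its license is missing or unclear.
    """
    unknown = sorted(name for name, conds in parts.items() if conds is None)
    if unknown:
        # Compliance cannot be determined while any license is unknown.
        return {"status": "undetermined", "unknown": unknown}

    blocking = sorted(name for name, conds in parts.items() if "no-derivatives" in conds)
    if blocking:
        return {"status": "blocked", "blocking": blocking}

    # Assumed rule: the derivative inherits the union of all conditions.
    required = set().union(*parts.values())
    return {"status": "ok", "conditions": sorted(required)}


# Hypothetical contents of Bob's slides.
slides = {
    "carol_slides": {"attribution"},
    "web_image": {"attribution", "non-commercial"},
}
print(assess(slides))

slides["old_image_on_disk"] = None  # license never recorded
print(assess(slides))
```

The second call shows why provenance matters here: without a record of where the old image came from, the check cannot get past "undetermined".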

Problems and Limitations

Many licenses do not have a machine-readable form. Many documents do not have an explicit license associated with them.


Unanticipated Uses (optional)

This applies to any document, not only images.

Existing Work (optional)

Documenting Axiom Formulation

Owner

Yolanda Gil

Curator

Paolo Missier

Provenance Dimensions

  • Primary: Attribution (Content)
  • Secondary: Entailment (Content), Argumentation/Justification (Content), Understanding (Use), Trust (Use)

Background and Current Practice

Much of the content of the semantic web is built by hand, including ontologies, linked data, mashups, etc. This means that many axioms and assertions were formulated by a person based on their understanding of how to model the domain at hand. In creating those axioms, the developer often consults documents and sources, makes some assumptions, and integrates information. It would be useful to record in enough detail what were the original sources consulted, what  pieces seemed contradictory or vague, which were then dismissed, what additional  hypotheses were formulated in order to complement the original sources, and ultimately how axioms came about.  However, this kind of information is not captured in current practice. Ontologies, assertions, and resources lack such records to provide rationale for their design, and as a result it makes it hard for others to reuse those ontologies and data. That kind of information would reveal for example what aspects or areas of the ontology they can be more confident about, for example because more resources were used to develop those areas or because more backing by sources is provided.

There are several other potential benefits to including this rationale  within an ontology, such as supporting its maintenance, facilitating its  integration with other ontologies, and integrating (or transferring) knowledge  among heterogeneous systems. 

This kind of information would also be useful to justify answers to end users not in terms of what reasoning steps were used but in terms of what initial knowledge the system had and where it came from. This type of justification is distinct from justifications of the reasoning itself, because here we are not interested in justifying the system's inferences but rather what sources were consulted to give the system its initial set of axioms.

Goal

A user wants to understand what an ontology axiom or an assertion means in the context of how it was created, in this use case when axioms and assertions are created by hand.

Use Case Scenario

A user is trying to create a resource (eg, a set of RDF triples, or a mashup) with ski resorts near their city for weekend trips. The user consults many sources and creates the resource, indicating that it results from combining information from four different sources. Source A is published by the visitor's bureau of the city and lists all nearby ski stations but only within a 50 mile radius of the city. Source B is published by the state and shows two additional stations that are reasonably close. Source C shows the traffic patterns on weekends for the roads to the ski stations. Source D is a national weather source that shows the snow conditions over winter and spring months. The user builds a small ontology of ski resort properties used to decide whether to include a given ski resort or not.

A second user discovers this resource and wonders why a certain ski station is not included. After checking the sources used to create the resource, he concludes that the missing ski station is in a bordering state and was therefore not included. He also checks and finds out that the traffic source used is unknown to him. This user can decide to add on to what the original user did or to dismiss it as not useful for his purposes.

A third user uses this resource and is surprised to find in it a ski station that he knows is very close to the city but does not get very much snow during the skiing season. He looks at the properties of the ski resorts that are defined in the ontology, and finds the definition of the property for "snow conditions". Its definition is attached to the source D documentation, which discusses average snow fall. The user realizes that the creator of this resource did not take into account the true condition of the snow (eg, packed powder, fresh powder, etc) or the proportion of lifts that are open during the season. This explains why the ski station was included in the resource. The user decides to dismiss the resource, as those criteria do not reflect his own.

Note that the original user who created the resource may have consulted many other documents besides sources A, B, C and D. It may be useful to also record sources that were consulted but not found to be useful. This may help someone determine how thorough or informed the user was (or what aspects they focused on) and, based on that, decide whether they would find the resource useful.

Problems and Limitations

In the simplest cases, axioms can be associated with the documents and resources (or portions) that back them up. In more complex cases, entire groups of axioms (graphs) may need to be associated with such provenance information.

The provenance of a set of axioms may be much larger than the axioms themselves. So there need to be mechanisms in place for management of this provenance: efficient reasoning on the axioms without provenance when needed, and inclusion of the provenance information when needed.
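
One way to picture this separation is to keep axioms and their provenance in two structures that share a key. The Python below is only a sketch of that idea: the AxiomStore class, its method names, and the sample triple are hypothetical and do not correspond to any existing triple store or reasoner API.

```python
class AxiomStore:
    """Keeps axioms and their (possibly much larger) provenance apart, linked by an axiom id."""

    def __init__(self):
        self._axioms = {}      # axiom id -> (subject, predicate, object)
        self._provenance = {}  # axiom id -> list of notes about sources consulted

    def add(self, axiom_id, triple, sources=()):
        self._axioms[axiom_id] = triple
        self._provenance[axiom_id] = list(sources)

    def triples(self):
        """Axioms only, for reasoning that does not need provenance."""
        return list(self._axioms.values())

    def with_provenance(self, axiom_id):
        """An axiom together with its recorded sources, fetched only when asked for."""
        return self._axioms[axiom_id], self._provenance[axiom_id]


# Illustrative entry loosely based on the ski resort scenario; names are made up.
store = AxiomStore()
store.add(
    "ax1",
    ("StationX", "hasSnowConditions", "good"),
    sources=["Source D: average snow fall over winter and spring months"],
)

print(store.triples())
print(store.with_provenance("ax1"))
```

Associating provenance with whole groups of axioms, as mentioned above, would need a key that identifies a graph rather than a single axiom; this sketch does not cover that case.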

As the original resource evolves, the provenance information needs to be updated or extended accordingly.

Existing Work

The use case is described in terms of the use of semantic web ontologies and data, but its motivation comes from uses of ontologies for engineering problems. Consider for example a system developed to estimate  the duration of carrying out specific engineering tasks, such as repairing a damaged  road or leveling uneven terrain.  Users invariably wanted explanations about where the  answers came from in terms of the sources we consulted and the sources that we chose  to pursue.  They wanted to know whether well-known engineering manuals were consulted, which were given more weight, whether practical experience was considered to refine theoretical estimates, and what authoritative sources were consulted to decide among competing recommendations.  In other words, the analysis process that knowledge engineers/developers perform is part of the rationale that needs to be  captured in order to justify answers to user queries.   

[Gil EKAW 02] describes a tool that enables  knowledge base developers to keep track of the knowledge sources and intermediate  knowledge fragments that result in a formalized piece of knowledge.  The resulting  ontology is enhanced with pointers that capture the rationale of its  design and development.

Evidence for Public Policy

Owner

James Cheney

(Curator: Simon Miles)

Provenance Dimensions

Primary: Justification for Decisions: hypothesis management (Content)

Secondary: Understanding (Use)

Background and Current Practice

Government standards (e.g. in the UK, the "green book") mandate how the source and intermediate data are to be linked to the final reports and conclusions produced by the study. In particular, the study needs to fully explain how the primary data were collected, how secondary data were analyzed and interpreted, and which analytical or interpretive processes were used to draw conclusions. The report may also need to link to associated publications or dissemination materials.

These standards are meant to make it possible for other experts to fully understand the quality of the study, and for decision makers (usually non-experts) to make qualitative judgments about the strength of the evidence for the conclusions.

Goal

The conclusions of studies bearing on public policy must be linked to their supporting data in order to meet standards imposed by funding organizations.

Provenance technology can greatly decrease the effort involved in producing acceptable linked data, and doing this in a standard or automatic way may be more reliable or useful than current practice.

Use Case Scenario

A researcher is performing a study involving multiple steps, including data gathering, analysis, and then summary conclusions. The conclusions are passed to decision makers who do not have the same expertise as the researcher. In using the conclusions, the decision makers require the data that informed them to be analysed using different methods or considering different hypotheses. Later, the study data, analysis and conclusions are compared with the actual effects of the decision. This general scenario is derived from the more specific case as follows.

A social scientist is studying the relationship between education and poverty with support from a government grant.

The study involves recruiting participants, distributing and collecting surveys, and performing telephone interviews. The data collected through this process are initially recorded on paper and then transcribed into an electronic form. The paper records are confidential but need to be retained for a set period of time.

Once the data are collected and transcribed, the scientist processes and interprets the results and writes a report summarizing the conclusions. The conclusions may then be incorporated into policy briefing documents used by civil servants or experts on behalf of the government to identify possible policy decisions that will be made by (non-expert) decision makers. This process may involve considering hypotheticals or reevaluating the primary data using different methodologies than those applied by the scientist originally. The report and its linked supporting data may also be provided online for reuse by others or archived permanently in order to make it possible to compare the conclusions of the study with actual effects of decisions.

Problems and Limitations

This use case is hard to achieve using current technology because it requires a great deal of additional effort from scientists to manage the supplementary data and links manually.

Provenance technology that meets standards such as the "green book" could dramatically decrease this effort.

There are additional challenges, such as maintaining links between (confidential) data stored on paper and (usually public) intermediate data and final reports, that may not be solvable solely by provenance technology in computer systems. However, the availability of this technology may make it more attractive to carry out some kinds of studies using Web-based surveys or IP telephony for which provenance technology could provide support.

Unanticipated Uses

Longitudinal studies comparing different methods of analyzing and collecting data.

Comparisons between the predicted and actual effects of policy decisions.

Existing Work

Use case drawn from work reported by Peter Edwards and Lorna Philip, University of Aberdeen. (See http://wiki.esi.ac.uk/UseCasesForProvenanceWorkshop).

Evidence for Engineering Design

Owner

James Cheney

(Curator: Simon Miles)

Provenance Dimensions

Primary: Attribution: responsibility (Content)

Secondary: Process: updates, Justification for Decisions: argumentation

Background and Current Practice

Many different versions of a design are considered in the development of an industrial product or process. Detailed records of these versions and their relationships documenting the evolution of a design (or family of designs) over time are needed for several reasons: intellectual property protection, forensic investigation, and fault analysis (or prevention).

Basic technology for this is widely used in software engineering, e.g. source code management/version control. However, other engineering disciplines have a wide variety of alternative approaches to essentially the same thing (e.g. CAD/CAM systems).

Goal

The goals of this scenario are to be able to assert authorship and defend intellectual property claims, track the re-use of designs, ensure the authenticity and authority of a design (or other engineering record) and uncover the reasons underlying any particular design decision.

Use Case Scenario

Alice is an engineer working for Bob's Widget Factory. Alice is part of a team that designs an extensive range of widgets of various sizes, shapes and purposes. Many widgets are not designed from scratch but are based on an existing design in response to customer requests or problems. Also, sometimes parts of the design of multiple different widgets are combined to make a new widget with multiple uses. When this happens, sometimes the different features interact in unexpected ways, leading to the need for further changes to the design. On the other hand, sometimes problems are found in old designs that have been re-used, leading to questions such as identifying which other designs may be affected by the changes, and identifying products that have been sold to customers which may need to be repaired or recalled.
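
The question of which other designs may be affected by a fault in a re-used design amounts to following derivation links transitively. The Python sketch below illustrates this under stated assumptions: the design names and the derived_from mapping are hypothetical, and derivation records are assumed to exist and to contain no cycles.

```python
from collections import deque


def affected_designs(derived_from, faulty):
    """Return every design that directly or indirectly re-uses `faulty`.

    `derived_from` maps a design to the list of designs it was based on.
    """
    # Invert the mapping so that links can be followed from the old design forward.
    reused_by = {}
    for design, parents in derived_from.items():
        for parent in parents:
            reused_by.setdefault(parent, []).append(design)

    seen = set()
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for child in reused_by.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up widget designs: W3 combines parts of W2 and W4.
derived_from = {
    "W2": ["W1"],
    "W3": ["W2", "W4"],
    "W5": ["W4"],
}

print(sorted(affected_designs(derived_from, "W1")))
```

Given a further record of which products were built from which design, the same result could be used to look up products that may need to be repaired or recalled.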

Eve's Widget Hut competes with Bob's Widget Factory. Eve's company often reverse-engineers widgets from Bob's Widget Factory and sells them more cheaply than Bob can because Eve does not have to pay the design costs. Bob would like to be able to prove that Alice's designs are original, and that Eve is using the same design ideas, to defend patent claims or force Eve to pay royalties.

Problems and Limitations

The kinds of data involved in engineering settings vary widely, from source code text in programs to application-specific CAD/CAM formats to images or physical objects developed as mock-ups or prototypes. Ideally, all of these different (electronic and physical) objects could be linked, making it possible to follow the development of a final design through multiple stages.

Intellectual property claims sometimes need to prove a negative assertion, that is, absence of a link between one design and another. In contrast, provenance information can demonstrate presence of a link but it is not clear how it can be used to prove that two designs were really independent - especially if one party is motivated to deceive. Obtaining completeness guarantees may require developing cryptographic standards and protocols for "notarization" by trusted observers or hardware enforcement mechanisms. (See also "completeness" in Information Quality Assessment for Linked Data and Fulfilling Contractual Obligations.)

Existing Work

This is (very loosely) based on discussion of provenance in engineering design by Alex Ball (see http://wiki.esi.ac.uk/UseCasesForProvenanceWorkshop).

Toyota Recall

Fulfilling Contractual Obligations

Name

Fulfilling Contractual Obligations

Owner

Jim Myers

Provenance Dimensions

Primary: Accountability, Interoperability

Secondary: Attribution, Process, Entailment, Understanding, Imperfections

Background and Current Practice

In scientific collaborations and in business, individual entities often enter into some form of contract to provide specific services and/or to follow certain procedures as part of the overall effort. Proof that work was performed in conformance with the contract (expectations) of the project leadership is often required in order to receive payment and/or to settle disputes. Such proof consists of basic information about what was done by whom, as well as identification of the witnesses asserting that the information is true and information such as signatures (written or digital) that make it suitable as a legal record. Documentation today includes laboratory notebooks, invoices detailing procedures performed, shipping receipts, etc.

Goal

Provide strong proof that work was performed that meets contractual requirements. Such proof must:

  • document work that was performed on specific items (samples, artifacts),
  • provide a variety of evidence that would preclude various types of fraud,
  • allow combination of evidence from multiple witnesses, and
  • be robust to providing partial information, eg, providing only the information required to address contractual concerns in order to protect privacy or trade secrets.

Use Case Scenario

An organisation agrees to perform a process, under a set of requirements on how that process should be performed. The organisation is later asked to provide proof that no obligation or prohibition was violated, deliberately or accidentally, by action or omission. The specific motivating case from which this general scenario derives is as follows.

Foo Corp. accepts a contract to perform an analysis of the contents of several chemical samples provided by Bar Corp. as part of their effort to meet government safety regulations. The contract specifies how the samples are to be handled and requires the use of a technique validated for the class of chemicals involved. When the results indicate contamination that may force a broad recall, Bar Corp. sues Foo Corp. claiming error in their processing. Foo Corp. has not done anything wrong but needs to defend itself against several claims:

  • That the work was never done
  • That the work was done by an untrained technician
  • That equipment was improperly calibrated
  • That one or a few Foo Corp. technicians tampered with the samples
  • That samples were left at room temperature too long
  • That samples were accidentally or intentionally swapped during a transfer between processing steps
  • That records were tampered with to remove evidence of improper work

Problems and Limitations

Foo Corp. has significant experience working in a regulated industry and has purchased equipment that produces exportable provenance and has electronic systems for employees to record their work, as well as internal process checks performed by employees who do not know the identity of samples. It also has electronic records documenting employee training, instrument calibration, etc.

In responding to the Bar Corp. suit, Foo has several concerns:

  • Samples from multiple companies are processed in the same runs and the information about those samples and even the identity of those companies should remain private. However, the fact that samples from those companies do not show evidence of contaminants is useful evidence.
  • Foo has developed enhanced techniques that it feels reduce the error in its analyses and that it wants to keep secret. For example, Foo Corp. is able to keep sample temperature constant to a fraction of a degree but only wishes to prove that it maintained the sample temperature to within the 2-degree range specified in the contract.

In developing its system, Foo had to find a provenance, metadata, and records management solution that addressed numerous challenges:

  • The provenance of a given sample has to be assembled from records provided by multiple independent systems that have their own internal IDs, which have to be matched to the global sample IDs. (A similar issue exists in that one of the instruments has only internal IDs for user accounts.)
  • The required evidence involves finding the provenance of 'related' samples (those processed the same day, on the same instrument, by the same technician, of the same chemical type, etc.) and showing that results across these collections are consistent (an instrument processed other samples in the same batch correctly and without contamination, a technician was performing analyses all day with only one at a time and no odd gaps in their work schedule).
  • Each account of processing is signed and dated as close to the source as possible and with minimal delay. (Foo Corp. primarily uses instruments that produce cryptographically signed provenance statements directly using an internal certificate; others send unsigned statements directly to a central server for signing. All records are given a signed timestamp by a third-party clock (a notary service) as early as possible.)
  • The system is capable of providing 'derived' signed records that include provenance for any of the 'related' sets of samples required (see above) and further to do so
    • while anonymizing some samples,
    • producing summary information, e.g. a signed statement that temperature remained constant within 2 degrees across all processing steps in the record, that hides detail without simply removing metadata of a certain type, and
    • producing less granular records that can be understood by humans (e.g. reducing gigabytes+ of provenance to a printable summary).
  • The system is capable of providing raw provenance records to a trusted third party to generate derived, anonymized and summarized records, and of documenting that it is the third party who is responsible/liable for asserting that the derived records are valid (correctly reflect the content of the original records).
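
The signing and summarizing requirements above can be illustrated with a minimal sketch. It is not Foo Corp.'s system: the keys, record fields and the `derive_summary` helper are hypothetical, and HMAC with a shared secret stands in for the certificate-based signatures the scenario describes. The point is only that a derived statement can assert the contractual claim (temperature within 2 degrees) without exposing the raw readings.

```python
import hashlib
import hmac
import json

# Hypothetical keys. A real deployment would use certificate-based
# signatures rather than shared secrets; HMAC is only a stand-in here.
INSTRUMENT_KEY = b"instrument-secret"
THIRD_PARTY_KEY = b"trusted-third-party-secret"


def sign(record: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 signature over a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes) -> bool:
    """Check that the signature matches the record under the given key."""
    return hmac.compare_digest(sign(record, key), signature)


# Raw records (hypothetical): detailed readings that should stay private.
raw_records = [
    {"sample": "S-17", "step": "prep", "temp_c": 4.02},
    {"sample": "S-17", "step": "extract", "temp_c": 4.11},
    {"sample": "S-17", "step": "analyse", "temp_c": 3.97},
]
raw_signatures = [sign(r, INSTRUMENT_KEY) for r in raw_records]


def derive_summary(records, signatures, limit_c):
    """Verify the raw signatures, then assert only the contractual claim."""
    pairs = zip(records, signatures)
    if not all(verify(r, s, INSTRUMENT_KEY) for r, s in pairs):
        raise ValueError("a raw record failed signature verification")
    temps = [r["temp_c"] for r in records]
    summary = {
        "sample": records[0]["sample"],
        "steps_covered": len(records),
        "claim": f"temperature range within {limit_c} degrees",
        "holds": (max(temps) - min(temps)) <= limit_c,
    }
    # The third party, not the instrument, vouches for the derived record.
    return summary, sign(summary, THIRD_PARTY_KEY)


summary, summary_sig = derive_summary(raw_records, raw_signatures, limit_c=2.0)
```

The derived record carries no individual temperature values, and it is signed with a different key from the raw records, which mirrors the requirement that the party producing the summary is the one liable for it.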

This scenario highlights a few related issues that appear across other scenarios:

  • provenance subsystems often have different identifier schemes and end-to-end provenance management will require means to assert known aliases/correspondences.
    • The relations may not be true aliases, e.g. they could be subpart relationships (e.g. only a part of a sample is run through an instrument). Provenance systems may also need to document when measurements on one thing reflect a property of another (e.g. the chemical composition of the subsample is assumed to be that of the whole sample, we expect a person's blood type to match that of a sample of their blood, the temperature at a thermostat to reflect that in the room, etc.)
  • witnesses and end-users of provenance information have different perspectives and concerns, so it is important that provenance systems be able to combine accounts, extract subaccounts, and shift to different levels of granularity. Managing aliases, synchronization of witness clocks, and mapping across part-of and type-of relationships are all required.
    • Trusted third parties may play an important role and the concept of an 'interpreter' (or 'judge') may be needed along with that of 'witness' to describe how 'derived' records are created.
  • Records-related information (direct signatures, signatures notarizing and timestamping other signatures, mechanisms that provide evidence of completeness, e.g. numbering pages in a bound notebook) will need to be maintained for provenance and propagated to derived records to create chains of evidence.
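
The first issue above, asserting correspondences between subsystem identifiers, might be sketched as follows. The system names, local IDs and the `assemble` helper are hypothetical; the sketch treats every correspondence as a true alias and does not model subpart relationships.

```python
from collections import defaultdict

# Hypothetical alias assertions: (system, local ID) -> global sample ID.
ALIASES = {
    ("balance", "B-0042"): "SAMPLE-17",
    ("spectrometer", "run-9/pos-3"): "SAMPLE-17",
    ("freezer", "slot-12"): "SAMPLE-18",
}

# Hypothetical records as each subsystem might export them.
records = [
    {"system": "balance", "local_id": "B-0042", "event": "weighed"},
    {"system": "spectrometer", "local_id": "run-9/pos-3", "event": "analysed"},
    {"system": "freezer", "local_id": "slot-12", "event": "stored"},
    {"system": "balance", "local_id": "B-9999", "event": "weighed"},
]


def assemble(records, aliases):
    """Group records by global sample ID; keep unmatched records visible."""
    by_sample = defaultdict(list)
    unresolved = []
    for rec in records:
        global_id = aliases.get((rec["system"], rec["local_id"]))
        if global_id is None:
            # No asserted correspondence: report it rather than guess.
            unresolved.append(rec)
        else:
            by_sample[global_id].append(rec)
    return dict(by_sample), unresolved


by_sample, unresolved = assemble(records, ALIASES)
```

Keeping the unresolved records in a separate list, instead of dropping them, matters for the evidence chain: a gap in the alias table is itself something an auditor would want to see.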

Unanticipated Uses (optional)

The use case as presented is oriented towards business and legal defense. Analogous cases could be created for an academic setting and the defense of work as part of peer review, ethics inquiries, etc.

Existing Work (optional)

There is work across the electronic records, e-notebooks, LIMS and asset management systems, workflow, e-Science, and semantic web communities that addresses parts of this scenario. I've drawn from experience as part of the Collaborative Electronic Notebook Systems Association (censa.org, 1998-2008), where many requirements for documentation of scientific research and analytical sample processing in the Chemical and Pharmaceutical industries were discussed in the context of FDA regulations, patent policies, and rules of legal evidence.

Attribution for a Versioned Document

Name

Attribution for a versioned document

Owner

Jim Myers

Provenance Dimensions

  • Primary: Attribution
  • Secondary: Versioning, Publication, Process

Background and Current Practice

When a document (e.g. scientific paper) is created today, assignment of authorship (which maps to recognition in the community and ultimately to fame and fortune), as well as the ordering of author names, is often a judgement call by senior/primary authors. Contributions recognized by inclusion as an author include both direct authorship of text and contributions to the data and analysis reported in the paper. Mistakes can be made if primary authors do not fully comprehend the contributions of other project members.

Systems that can provide more complete information about the contributions of individuals to such an effort will change and potentially improve decisions about authorship.

Goal

Provide more complete information to lead authors about contributions to the text and work being presented in a paper to aid in authorship decisions with the goal of improving decisions and increasing the transparency of the process.

Use Case Scenario

Multiple users contribute to an artifact A, with some contributions made indirectly through contributions to other artifacts used in producing A (i.e. inputs or artifacts representing earlier versions of A). Users wish to explore the record of how A was created to understand who contributed and whether specific contributions truly affected the final result, in order to properly assign credit for (give attribution for) the creation of A. The specific motivating case from which this general scenario derives is as follows.

Alice and Bob take on the task of writing up their recent effort with Charlie, Doug, and Ellen to synthesize a new protein. The group has used a provenance tracking system while working and is using a provenance-aware version tracking system to create the text. Charlie has done the core work in the lab with Bob doing the analysis. Charlie sends Alice some text via email to create the first paper version, which Bob then edits several times. Bob includes a reference to data created by Doug in the document as the source of Figure 2. Ellen writes a few paragraphs outlining a difficult step in the analysis and creates a new version, which Alice, after reading it, ultimately rejects and rewrites starting from an earlier version. After several months of hectic intermittent work, Alice realizes the deadline has come and she quickly adds herself, Bob, and Charlie as co-authors. She does a quick check within her document editor and is reminded that Doug contributed data, so she adds him and fires the paper off. Ellen's name does not come up since she did not contribute to the versions of the document that survived...

Problems and Limitations

Fairly simple provenance systems could be used to provide the type of capability outlined here and reduce the type of mistakes that would have led to Doug being left off the author list in the scenario above - he was a direct contributor to an artifact that was included in the paper.

However, the case with Ellen points out limitations of such a simple system - attribution is based on intellectual contribution, not physical causality, and though the two are often aligned, they are not always. The ideas that Ellen expressed in her text did contribute to how Alice eventually explained that point. The causal history of those ideas is not captured by the document versioning system. One could further imagine that Frank, another collaborator in the same group, contributed to early versions of the paper about his work on a second protein, with a subsequent decision to write that up separately resulting in his text (and intellectual contribution) being removed in later versions. Alice might accidentally include him as an author given a simple provenance report that he contributed text.
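
A minimal sketch of the 'simple provenance system' discussed above makes both the benefit and the limitation concrete. The artifact names and the `contributors` helper are hypothetical; the graph is one possible encoding of the Alice/Bob/Charlie/Doug/Ellen scenario, with each artifact recording its author and the artifacts it was derived from.

```python
# Hypothetical provenance records: artifact -> author and direct inputs.
ARTIFACTS = {
    "email_text": {"author": "Charlie", "derived_from": []},
    "v1": {"author": "Alice", "derived_from": ["email_text"]},
    "v2": {"author": "Bob", "derived_from": ["v1"]},
    "doug_data": {"author": "Doug", "derived_from": []},
    "v3": {"author": "Bob", "derived_from": ["v2", "doug_data"]},
    # Ellen's version branches from v3 but is later rejected.
    "v4_ellen": {"author": "Ellen", "derived_from": ["v3"]},
    # Alice rewrites starting from the earlier version v3.
    "v5_final": {"author": "Alice", "derived_from": ["v3"]},
}


def contributors(artifact_id, artifacts):
    """Collect the authors of every artifact the given one was derived from."""
    seen = set()
    authors = set()
    stack = [artifact_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        authors.add(artifacts[current]["author"])
        stack.extend(artifacts[current]["derived_from"])
    return authors


found = contributors("v5_final", ARTIFACTS)
```

Walking the derivation graph finds Doug through the data artifact, but Ellen never appears because her version is not an ancestor of the final one. Her intellectual influence on Alice's rewrite has no edge in this graph, which is exactly the gap between physical and intellectual causality described above.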

Some of these issues could be solved by more sophistication related to recognizing that papers are not atomic artifacts, and 'editing' processes do not necessarily result in contributions to every byte of artifact state. Similarly, everyone involved in a high-level 'experiment' process may not have contributed to a specific data set and paper from the set of several that were produced. While managing composite artifacts and composite processes does not address the disconnect between physical causality and intellectual causality, it is a start.

Additional capability to recognize that version artifacts are all states of a logical paper and that a paper is just one manifestation of an intellectual contribution defined by a proposal, talks, workflows, multiple papers, etc. would solve more issues - one could document that a co-PI contributed to the ideas in the paper through their contribution to the proposal and that some editing operations resulted in refinement of the intellectual idea whereas others (e.g. editing for grammar) did not and thus would not result in credit as a paper author.

Unanticipated Uses (optional)

This use case is primarily about deciding authorship attribution. However, the core problems are general: some aspects of causality are not recorded (personal communications, A heard B's talk), and there are multiple process spaces (intellectual versus physical document editing here, but more generally there are physical, mathematical, intellectual, management, economic, and other spaces) that do not fully align (do not share common definitions for artifacts and processes). Provenance systems that recognize composite artifacts and processes and mappings between provenance spaces would solve a broad range of other problems. For example, who's responsible when a lot of money is spent with little result? One must follow the trail across economic, management, and physical work processes at least to understand whether fiscal controls, management decisions, lazy workers, bad parts or other problems were involved. Or - when debugging, how would one know to check that a piece of software not only ran without error but also did a "Fourier transform" correctly unless one can map between the mathematical processing that was planned and the results from the physical/digital processing that occurred?

Existing Work (optional)

I've been involved in several discussions in the context of OPM about composite issues and some of the spaces issues. I'm also aware of some work in curation and text mining related to artifacts having several meanings (the string "John Doe" is both an instance of a name and a person and one can talk about the provenance of both...)

Provenance for Environmental Marine Data

Owner

Irini Fundulaki

(Curator: Simon Miles)

Provenance Dimensions

  • Content: Attribution (Responsibility), Process (Reproducibility), Evolution and Versioning (Updates), Justification for Decisions, Entailment
  • Use: Understanding, Trust

Background and Current Practice

This use case is inspired by the use of environmental marine data in forecasting models, marine biology and climate change studies. The idea is that multiple sources (observation points) continuously record the physical, biological and chemical parameters of the coastal areas of the Mediterranean sea. This data is transmitted and stored in the database (warehouse) of a central service that provides this data to forecasting models as well as to marine biologists who study correlations between the changes in the fauna and flora of the coasts and the aforementioned parameters. The central service can also provide the forecasting models and the marine biologists with archival data from manually curated databases. Marine biologists also work with vocabularies (ontologies and schemas) that describe the fauna and flora of the seas. Marine biologists could set up in-house experiments to verify certain findings. The central service can build tools using the collected data in the endeavor to protect the coastal marine environment as well as the businesses and urban development on the coasts of the Mediterranean sea. Users can build materialized views on the data (i.e., extract and copy the data into their own workspace) and it is crucial in this context to be able to maintain these views as efficiently as possible. In addition, one can assign trust values to the data sources, and compute the trustworthiness of the experimental results. We assume that data are represented in the RDF data model.

Goal

Detect and record the origins of sensor data (including faulty sensors), thus allowing, amongst other things, experiments that used the incorrect or affected data to be rerun.

Use Case Scenario

Consider a network of sensors (marine observation points) that transmit information related to (a) wind, (b) sea surface and bottom temperature, (c) wave height and dimension and (d) ecosystem forecasts (nitrates, phosphates etc.), among others. Consider that user A is a marine biologist who is querying the warehoused data in order to extract data to be subsequently integrated with the available ontologies and vocabularies.

A scenario:

(a) imagine that user A has classified an animal under a class of the vocabulary she is using, but new data shows that this classification is not correct. Nevertheless, the user would like to keep the information that had been implied and be able to annotate it with the sources that had provided it.

(b) imagine that one of the sensors was faulty and the marine biologists want to know which of the experiments conducted had used the sensor's data. This knowledge would allow the marine biologist to repeat only the experiments that have used the faulty sensor's data.
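
Scenario (b) amounts to a reachability question over recorded derivations. The sketch below assumes, hypothetically, that the service records for each extract and experiment the items it was directly derived from; the names and the `affected_experiments` helper are illustrative, and a real system would hold this information as RDF rather than a Python dictionary.

```python
# Hypothetical "was derived from" records: item -> direct inputs.
DERIVED_FROM = {
    "extract_1": ["sensor_A", "sensor_B"],
    "extract_2": ["sensor_C"],
    "experiment_1": ["extract_1"],
    "experiment_2": ["extract_2"],
    "experiment_3": ["extract_1", "extract_2"],
}


def depends_on(item, source, derived_from):
    """Return True if `source` is reachable from `item` via derivations."""
    stack = [item]
    seen = set()
    while stack:
        current = stack.pop()
        if current == source:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Raw sources such as sensors have no recorded inputs.
        stack.extend(derived_from.get(current, []))
    return False


def affected_experiments(faulty_sensor, derived_from):
    """List the experiments that used the faulty sensor's data."""
    return sorted(
        name
        for name in derived_from
        if name.startswith("experiment_")
        and depends_on(name, faulty_sensor, derived_from)
    )


to_repeat = affected_experiments("sensor_A", DERIVED_FROM)
# to_repeat == ["experiment_1", "experiment_3"]
```

Only the experiments returned need to be repeated; `experiment_2` is untouched because none of its inputs trace back to the faulty sensor.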


Problems and Limitations

Existing provenance models for relational data are inadequate to achieve the scenario above. Imagine that user A wants to retrieve part of the data from the service's database, using standard query languages (SPARQL in the case of RDF data), for use in experiments. The experiment can itself be a SPARQL query. We want to be able to store the provenance of the result so that if an update occurs (e.g., the data used in the experiment is later shown to be faulty), we could use the stored information to perform just the experiments that used the faulty data. SPARQL operators such as OPTIONAL behave like outer joins, which the existing provenance models for relational data do not handle sufficiently.

We believe that the issues of (a) representing, (b) querying and (c) storing provenance are crucial for the above use case in the following aspects: (a) reproducibility of experiments, (b) view maintenance, (c) trust, (d) entailment and (e) attribution (responsibility). A small number of solutions have been proposed for the representation of provenance information for RDF graphs. An important aspect of RDF graphs is that they have both an extensional and an intensional aspect, which should be taken into account when managing graphs with provenance information. The concept of named graphs was proposed in [1] as a way of representing explicit provenance information of RDF triples. Intuitively, an RDF named graph is a collection of triples associated with a URI that can be referred to by other graphs as a normal resource; this way, one can assign explicit provenance information to this collection of triples. Unfortunately, the authors of [1] do not discuss RDFS inference, queries and updates in the presence of RDF named graphs, and existing work on querying and updating RDF has been extended either with named graphs (such as SPARQL and SPARQL Update) or with RDFS inference support [4,5], but not with both. The authors of [8] discuss the concept of networked graphs, which allow users to define RDF graphs both by extensionally listing content and by using views on other graphs. In [2] the authors showed that named graphs alone are not able to capture the provenance of implicit RDF triples, and introduced the notion of graphsets: a graphset is defined as a set of RDF named graphs, itself associated with a unique identifier and with a set of triples whose ownership is shared by the named graphs that constitute the graphset. [3] proposed the use of colors to capture the provenance of implicit and explicit triples in both RDF data and schema.
The provenance of a triple is recorded as a fourth column, hence obtaining a quadruple, and can be seen as representing the source the triple comes from. Colors can capture provenance at a fine level of granularity and are a generalization of RDF named graphs: an RDF named graph can be modeled by an arbitrary set of triples sharing the same color. To capture the provenance of implicit RDF triples in that work, the authors propose an algebraic structure defined by a set of colors and an operator that works on colors and returns the composite color that represents the provenance of an implicit triple. To perform this computation, implicit triples and their colors are obtained by extending the RDFS inference rules, as defined in RDFS Semantics, to handle quadruples instead of triples.
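
The idea of colored quadruples can be illustrated with a small sketch. This is not the formalism of [3]: it assumes, for illustration only, that a color is a set of source names and that the composition operator is set union, and it applies a single inference rule (transitivity of subClassOf). The class names, source names and the `closure` helper are hypothetical.

```python
from itertools import product

# Quadruples: (subject, predicate, object, color).
# Assumption: a color is modelled as a frozenset of source names.
explicit = {
    ("Sardine", "subClassOf", "Fish", frozenset({"src_A"})),
    ("Fish", "subClassOf", "Animal", frozenset({"src_B"})),
    ("Animal", "subClassOf", "Organism", frozenset({"src_A"})),
}


def combine(color_1, color_2):
    """Composite color of an implicit triple (assumed here to be set union)."""
    return color_1 | color_2


def closure(quads):
    """Apply subClassOf transitivity to quadruples until nothing new appears."""
    result = set(quads)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that `result` can grow inside the loop.
        for (s1, p1, o1, c1), (s2, p2, o2, c2) in product(list(result), repeat=2):
            if p1 == p2 == "subClassOf" and o1 == s2:
                inferred = (s1, "subClassOf", o2, combine(c1, c2))
                if inferred not in result:
                    result.add(inferred)
                    changed = True
    return result


all_quads = closure(explicit)
implicit = all_quads - explicit
```

Each implicit triple carries a composite color naming every source its derivation relied on, so the implicit statement that Sardine is a subclass of Animal is attributed jointly to `src_A` and `src_B` rather than to either source alone.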

The problem of querying the provenance of RDF triples has not been adequately studied in the literature. The authors of [2,3] have studied provenance propagation and querying of the typeOf, subclassOf and subpropertyOf RDF hierarchies when triples with their provenance are modeled as quadruples, but have not discussed the provenance of triples obtained from the evaluation of SPARQL queries. The fundamental question raised in this context concerns the use of provenance information to tackle problems such as view maintenance and trust. The research question is whether, e.g. in the case of trust, a trust value can be re-computed after an update without looking at the input data. The problem is similar for view maintenance. Existing work on provenance in the relational context [6,7] cannot capture the intricacies of the SPARQL OPTIONAL operator, which introduces negation. Last, storage of RDF triples carrying provenance information is a subject that needs to be studied towards a solution for managing the provenance of RDF triples.
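
One simple reading of the trust question can be sketched as follows. The sketch assumes, hypothetically, that the set of contributing sources is stored with each result at query time and that the trust of a result is the minimum trust of its sources; both the data and the `result_trust` helper are illustrative, and other combination rules are equally plausible.

```python
# Hypothetical provenance stored with each result: the contributing sources.
RESULT_PROVENANCE = {
    "avg_temp_bay_1": {"sensor_A", "sensor_B"},
    "avg_temp_bay_2": {"sensor_C"},
}

# Hypothetical trust values assigned to the data sources.
source_trust = {"sensor_A": 0.9, "sensor_B": 0.8, "sensor_C": 0.95}


def result_trust(result_id, provenance, trust):
    """Trust of a result, assumed here to be the minimum over its sources."""
    return min(trust[source] for source in provenance[result_id])


before = result_trust("avg_temp_bay_1", RESULT_PROVENANCE, source_trust)

# sensor_A is found to be faulty: only its trust value is updated.
source_trust["sensor_A"] = 0.2

# The result's trust is recomputed from the stored provenance alone;
# neither the sensor readings nor the original query are consulted.
after = result_trust("avg_temp_bay_1", RESULT_PROVENANCE, source_trust)
```

The recomputation touches only the stored provenance and the trust table. Whether this remains possible for results of arbitrary SPARQL queries, including OPTIONAL, is the open question the paragraph above raises.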

Existing Work

[1] J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, Provenance and Trust. In WWW, 2005.

[2] P. Pediaditis, G. Flouris, I. Fundulaki, and V. Christophides. On Explicit Provenance Management in RDF/S Graphs. In TAPP, 2009.

[3] G. Flouris, I. Fundulaki, P. Pediaditis, Y. Theoharis, and V. Christophides. Coloring RDF Triples to Capture Provenance. In ISWC, 2009.

[4] PSPARQL. psparql.inrialpes.fr.

[5] J. Perez, M. Arenas, and C. Gutierrez. nSPARQL: A Navigational Language for RDF. In ISWC, 2008.

[6] P. Buneman, J. Cheney, and S. Vansummeren. On the Expressiveness of Implicit Provenance in Query and Update Languages. In ICDT, 2007.

[7] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.

[8] S. Schenk and S. Staab. Networked Graphs: A Declarative Mechanism for SPARQL Rules, SPARQL Views and RDF Data Integration on the Web. In WWW, 2008.

Crosswalk Maintenance

Name

Crosswalk Maintenance

Owner

Kai Eckert

Provenance Dimensions

  • Primary: Debugging
  • Secondary: Attribution

Background and Current Practice

This use-case is taken from the DC-09 conference article. Please refer to this paper for more details.

University libraries need to handle metadata from diverse sources that is usually encoded in incompatible metadata formats and of disparate quality. To facilitate a unified search interface on this heterogeneous metadata accumulation, the metadata formats need to be aligned. Typically, a format that forms a common denominator of all formats involved is chosen and the metadata is converted into this target format using crosswalks.

These crosswalks are usually hand-crafted by metadata experts and then transferred into program logic or transformation stylesheets. In the case of errors in the resulting metadata, the crosswalk has to be improved. Identifying the erroneous part of the crosswalk can be tedious, and after the crosswalk change, the whole set of resulting metadata has to be recreated, as it cannot be determined which parts of it are affected by the change.

Goal

The goal is to support the maintenance of crosswalks. Additional provenance information is provided for each resulting metadata record that enables efficient debugging.

Use Case Scenario

The program logic that is derived from the mappings is extended to not only write the resulting metadata elements, but also to record, for every element, the following information:

  • the version of the crosswalk used
  • the number of the mapping rule used
  • the source fields used

With this information, at least the following maintenance steps can be supported:

  • Crosswalk updates: After a change in the crosswalk, we can recreate all records that are affected.
  • Fixing mapping errors: If an error in the metadata is found, the responsible rule in the crosswalk can directly be identified.
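
The scenario can be sketched in a few lines. The rule table, field names and helper functions below are hypothetical and do not come from the DC-09 article; they only show how recording the crosswalk version, rule number and source fields with every element supports the two maintenance steps.

```python
CROSSWALK_VERSION = "1.0"  # hypothetical version label

# Hypothetical mapping rules: number -> (source fields, target element, function).
RULES = {
    1: (("title_main", "title_sub"), "dc:title",
        lambda r: f"{r['title_main']}: {r['title_sub']}"),
    2: (("author",), "dc:creator", lambda r: r["author"]),
    3: (("year",), "dc:date", lambda r: r["year"]),
}


def convert(record_id, source):
    """Apply every applicable rule and attach provenance to each element."""
    elements = []
    for rule_no, (fields, target, fn) in RULES.items():
        if all(field in source for field in fields):
            elements.append({
                "record": record_id,
                "element": target,
                "value": fn(source),
                "provenance": {
                    "crosswalk_version": CROSSWALK_VERSION,
                    "rule": rule_no,
                    "source_fields": fields,
                },
            })
    return elements


def records_affected_by(rule_no, elements):
    """Crosswalk update: which records must be recreated if this rule changes?"""
    return sorted({e["record"] for e in elements if e["provenance"]["rule"] == rule_no})


def responsible_rule(record_id, target, elements):
    """Fixing mapping errors: which rule produced a given faulty element?"""
    for e in elements:
        if e["record"] == record_id and e["element"] == target:
            return e["provenance"]["rule"]
    return None


sources = {
    "rec1": {"title_main": "Provenance", "title_sub": "A Survey",
             "author": "Doe, J.", "year": "2009"},
    "rec2": {"author": "Roe, R.", "year": "2010"},
}
elements = [e for rid, src in sources.items() for e in convert(rid, src)]
```

With this in place, a change to rule 1 requires recreating only `rec1`, since `rec2` has no title fields and never triggered that rule.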

Problems and Limitations

This use-case requires provenance on statement level, which has to be supported by the underlying infrastructure. However, RDF offers two mechanisms that support this: Reification and Named Graphs.

A drawback is the overhead that is produced by the additional information. As this information has to be stored for every statement, the needed storage space might increase by some factor.

Unanticipated Uses

Other use-cases that require provenance on statement-level, like Use_Case_Metadata_Merging

Existing Work

Working examples by means of RDF Reification can be found here: DC-09 conference article

Metadata Merging

Name

Metadata Merging

Owner

Kai Eckert

Provenance Dimensions

  • Primary: Attribution
  • Secondary: Understanding, Trust

Background and Current Practice

This use-case is taken from the DC-09 conference article. Please refer to this paper for more details.

Libraries have to deal with metadata from various sources. Usually the data is just transformed to a common internal format (see Use_Case_Crosswalk_Maintenance), but sometimes there exist several different records from different sources that describe the same resource. In this case, one has to decide on a specific source, or the metadata records have to be merged.

A specific example is subject information for a given resource, which can be provided by various means (manually and automatically created). While manually created subject headings would generally be preferred, it is nevertheless desirable to also store subject headings from other sources and make them accessible.

It is important that there are no compromises regarding the quality of the resulting metadata.

Goal

The goal is to prevent information loss while merging metadata from different sources. Therefore, provenance information on statement-level has to be provided.

Use Case Scenario

Additionally for every element we store the following information:

  • the source used, as well as some characteristics of the source (e.g. automatic or manual indexing)
  • the rank for the subject heading, if one is given by the source

Note that this information is specific to subject headings and might be extended for other metadata fields or applications. With this information, at least the following advanced queries can be supported:

  • Merging annotation sets: Without using the provenance information, we just get the union of all statements. By making use of it, we can regain the metadata statements of a specific source.
  • Extended queries on the merged annotations: We can query the data by some criteria, like using only manually created statements, or only statements with a given rank higher than a threshold.
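
The two kinds of query can be sketched as follows. The `Statement` class, the source names and the helper functions are hypothetical; the sketch keeps provenance as plain attributes on each statement, whereas an RDF implementation would use reification or named graphs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """A subject heading for a resource plus statement-level provenance."""
    resource: str
    subject_heading: str
    source: str
    manual: bool
    rank: Optional[float] = None  # only set if the source provides one


# Hypothetical merged annotation set for one resource.
merged = [
    Statement("book1", "Provenance", "library_catalogue", manual=True),
    Statement("book1", "Metadata", "auto_indexer", manual=False, rank=0.9),
    Statement("book1", "Semantic Web", "auto_indexer", manual=False, rank=0.4),
]


def from_source(statements, source):
    """Regain the statements contributed by one specific source."""
    return [s for s in statements if s.source == source]


def manual_only(statements):
    """Keep only manually created statements."""
    return [s for s in statements if s.manual]


def ranked_above(statements, threshold):
    """Keep statements whose source gave a rank higher than the threshold."""
    return [s for s in statements if s.rank is not None and s.rank > threshold]


auto_headings = [s.subject_heading for s in from_source(merged, "auto_indexer")]
strong_headings = [s.subject_heading for s in ranked_above(merged, 0.5)]
```

Nothing is lost in the merge: the union of all statements is still available as `merged`, and each source's contribution can be recovered on demand.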

Problems and Limitations

This use-case requires provenance on statement level, which has to be supported by the underlying infrastructure. However, RDF offers two mechanisms that support this: Reification and Named Graphs.

If the user is supposed to use this additional information, the retrieval interface has to be adapted to provide the possibility to select between the different sources. But it is also possible to hide this from the user and just use the data internally.

Unanticipated Uses

Other use-cases that require provenance on statement-level, like Use_Case_Crosswalk_Maintenance

Existing Work

Working examples by means of RDF Reification can be found here: DC-09 conference article

Mapping Digital Rights

Owner

Andre Freitas

Provenance Dimensions

Dissemination control: Licensing

Background and Current Practice

The Web is evolving towards a single information space where applications can potentially consume information from massively distributed sources. However, the organizations and individuals behind the published information resources usually have different motivations and constraints about the way the available information can be used. The ability to express and later consume the digital rights associated with published information will play a fundamental role in the way organizations will adopt new web technologies in an increasingly dynamic web environment, where information can be constantly aggregated and transformed.

Goal

Provide a comprehensive description, processable by both humans and machines, of the digital rights provenance associated with the usage of an information resource.

Use Case Scenario

Alice is creating a web application with the objective of providing investors with a comprehensive view of financial markets, including financial statements, stock market time series, qualitative and quantitative analysis, economic and political news and changes in legislation. The application consumes information from different sources, including Government Agencies, Media Companies and Financial Companies from different countries. Since the set of information sources is not completely defined a priori by Alice, the application needs to dynamically assess the digital rights associated with the information.

Examples of constraints and digital rights provided by each source include:

- Media Company A makes its data available for both commercial and non-commercial use.

- Media Company B requires Alice’s application to display the source of the information.

- Media Company C only allows Alice’s application to display the title and the link to a headline.

- Media Company D requires a commercial license if the data is commercially used.

- Government Agency A allows only non-commercial use of information and requires that the information should not be translated, formatted or modified.

- Government Agency B does not allow the provided information to be displayed overseas.

- Government Agency C requires that the license text should be displayed together with the information.

- Financial Company A does not allow the aggregation or derivation of the information.

- Financial Company B limits the number of accesses by the type of commercial contract.

- Financial Company C aggregates data from a different set of organizations, forwarding their usage terms to Alice’s application.

Alice also defines the digital rights of her application and also forwards some of the digital right terms defined by the sources.

Problems and Limitations

In the scenario described above, the application created by Alice needs to assess the digital rights associated with a given information resource. Since Alice cannot predict which information sources will be used, and considering that delegating this decision to the end users could overload them with licensing information, there is a need to define appropriate mechanisms for minimizing the amount of user intervention in the system.
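
A minimal sketch of such an automatic assessment is given below. It is not ODRL and does not use any existing rights vocabulary: the `UsageTerms` and `IntendedUse` classes, their fields and the `violations` helper are hypothetical, and they cover only a few of the constraints listed in the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageTerms:
    """Hypothetical machine-processable terms attached to a source."""
    commercial_use: bool = True
    attribution_required: bool = False
    modification_allowed: bool = True
    aggregation_allowed: bool = True


@dataclass(frozen=True)
class IntendedUse:
    """How the consuming application plans to use the information."""
    commercial: bool
    shows_source: bool
    modifies: bool
    aggregates: bool


def violations(terms, use):
    """Return the list of terms that the intended use would violate."""
    problems = []
    if use.commercial and not terms.commercial_use:
        problems.append("commercial use not permitted")
    if terms.attribution_required and not use.shows_source:
        problems.append("source must be displayed")
    if use.modifies and not terms.modification_allowed:
        problems.append("modification not permitted")
    if use.aggregates and not terms.aggregation_allowed:
        problems.append("aggregation or derivation not permitted")
    return problems


# Terms loosely modelled on three of the sources in the scenario.
SOURCES = {
    "Media Company B": UsageTerms(attribution_required=True),
    "Government Agency A": UsageTerms(commercial_use=False, modification_allowed=False),
    "Financial Company A": UsageTerms(aggregation_allowed=False),
}

alice_use = IntendedUse(commercial=True, shows_source=True, modifies=False, aggregates=True)
report = {name: violations(terms, alice_use) for name, terms in SOURCES.items()}
```

An empty list means the source can be consumed without asking the user anything, so intervention is needed only for the sources whose terms conflict with the intended use.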

The definition of a standardized vocabulary for expressing digital rights provenance will play a fundamental role in allowing information producers to associate machine processable usage terms with information resources. Despite being suitable for some application domains, the attribution of pre-defined license types (such as GPL, LGPL and Creative Commons) does not have the granularity needed to cover digital rights constraints that are common in some domains. Currently, the Open Digital Rights Language (ODRL) [1] initiative is starting to cover some of the dimensions required for a complete digital rights vocabulary.

The definition of an architecture built over a digital rights provenance vocabulary, which can provide the base for the application of policies and for the enforcement of digital rights, is also a critical issue for the practical deployment of digital rights on the Web.

Existing Work (optional)

[1] The Open Digital Rights Language (ODRL): http://odrl.net/

[2] Creative Commons (CC): http://creativecommons.org/

[3] Digital Rights Management (DRM): http://en.wikipedia.org/wiki/Digital_rights_management

[4] liblicense: http://wiki.creativecommons.org/Liblicense

Computer Assisted Research

Name

Computer Assisted Research

Owner

Jim Myers

Provenance Dimensions

  • Primary: Comparison
  • Secondary: Debugging, Process, Justification for Decisions

Background and Current Practice

Researchers 'stand on the shoulders of giants', benefiting from the results of prior and complementary research efforts. Today, most of the work to discover related work is manual - potentially automated search followed by manual reading of papers, attendance at talks, and personal correspondence. The rich records possible through capture of metadata and provenance raise the potential for much more integration of reference and peer research into ongoing research activities.

Goal

To enhance productivity in the research process by automating the discovery of related work.

Use Case Scenario

A user pursues their work goals within a system that captures information about their plans and activities. Using this information the system interacts with community and reference data and literature systems to provide just-in-time information that aids the user in refining their project plans, debugging problems that occur as they work, and analyzing their results.

Alice has decided to extend her research through the use of a shared instrument facility - using an unfamiliar technique to better characterize a molecule she has synthesized. Bob, her graduate student, enters the general plan into his electronic notebook and, after a moment, is presented with several experimental protocols from published work that give him a good sense of best practices in terms of calibrating the instrument, performing interleaved control experiments, etc. Bob selects and modifies one and begins his work. While running the instrument, Bob has trouble using a feature in the instrument control software. With the click of a button, Bob pulls up several cases where other users have seen the same error and he adopts a work-around described by a colleague in the facility's shared database. With the data in hand, Bob meets with Alice and they ponder the existence of several unexpected peaks in their spectra. Suspecting an impurity, they query the literature services provided through their library to get a list of impurities that have been seen in similar chemical syntheses. A few more clicks and Bob and Alice have pulled recent reference spectra of these compounds into their analysis software, but none fit. They then check the instrument's online log and discover that the compound studied by a recent user is a match. Mystery solved, and one more experiment is run before the allotted instrument time is used up. They also alert the facility staff, who then notify two additional groups who may be impacted.

Problems and Limitations

The 'point' of this use case is to emphasize that provenance of related processes can be used in 'real-time' to steer work. The value from such a capability could be a significant factor in encouraging users to record metadata and provenance - they see immediate feedback in terms of useful reference information and recommendations.

The types of queries required to support the use cases above are just the same as those envisioned for historical/after-the-fact uses. They thus share the problems of those use cases, i.e. finding 'relevant' information depends on significant domain-specific metadata in addition to provenance. Real-time use does add a constraint that relevant information would need to be programmatically accessible, efficiently indexed, and aggregated from the multiple sources available to a given user.
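The real-time constraint can be illustrated with a minimal sketch, assuming each source (for example a facility database or a group notebook) can export records as simple dictionaries. The class name ProvenanceIndex, the record fields "keys" and "summary", and the source names are hypothetical and chosen only for illustration; a real system would also need the domain-specific metadata discussed above.

```python
from collections import defaultdict


class ProvenanceIndex:
    """Hypothetical in-memory index that aggregates records from several sources."""

    def __init__(self):
        # lookup key (lower-cased) -> list of (source name, summary) pairs
        self._by_key = defaultdict(list)

    def add_source(self, source_name, records):
        """Index every record of one source under each of its keys."""
        for record in records:
            for key in record["keys"]:
                self._by_key[key.lower()].append((source_name, record["summary"]))

    def lookup(self, key):
        """Return all (source, summary) pairs recorded for a key, in insertion order."""
        return list(self._by_key.get(key.lower(), []))


index = ProvenanceIndex()
index.add_source("facility_db", [
    {"keys": ["ERR-17"], "summary": "work-around: restart acquisition"},
])
index.add_source("group_notebook", [
    {"keys": ["err-17", "calibration"], "summary": "seen after firmware update"},
])
matches = index.lookup("ERR-17")
```

Building the index ahead of time is what makes the lookup cheap enough to run while the user is still working; the lookup itself touches only the entries for one key, regardless of how many sources were aggregated.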

Unanticipated Uses (optional)

The specific use case focuses fairly closely on data, but one can envision uses that overlap with social networking and general recommender systems, e.g. "People like you (who have used this data and that workflow) have read this paper and 60% attend that conference" ...

One might also imagine that information now collected by surveys, such as information on the popularity of different software tools in a community, would be automatically available (perhaps only as statistical aggregates to protect privacy) and dynamically updated. Such information would give an increased sense of presence within a community and allow self-analysis within the community about their practices.

"Research" could also be replaced with "work" in general to create a similar use case in a business setting.

Existing Work (optional)

Human-Executed Processes

Name

Provenance in support of artifacts derived by human-executed processes.

Owner

Paulo Pinheiro da Silva, Jitin Arora

Background

In many scientific scenarios, data is collected over a period of time and processed through complicated processes to produce artifacts. Some scientific processes are fully or semi-automated, but a large number of scientific processes are executed by humans who may use machines mainly to record their artifacts. If the artifacts lead to an unexpected claim, then a higher standard of acceptance may be applicable. It may then be necessary to produce all the datasets that contributed to this artifact, and possibly the detailed sequence of steps that led to the generation of this artifact. During this process, additional input may have come from sources that are hard to capture, such as GUI or keyboard input by a scientist. In the following scenario, we describe the need to capture provenance dynamically, going beyond the static template implied by the execution of an automated process, for a process that can last a long time and that may not be finished yet.

Goal

To enable scientists to justify their final conclusions by providing a detailed trace of the steps leading to the generation of an artifact, most notably the data sources that were involved in its generation. Further, to justify partial conclusions in the same way as final results, with the additional need to explain how these partial conclusions will lead to a final conclusion when the process has not been entirely executed.

Current Practice Scenario

We observe that most scientists are careful about capturing and recording provenance information. However, currently only a very limited amount of provenance can be embedded inside certain data file formats such as JPEG images or Excel spreadsheets, where it may not be easily accessible and is not adequate to provide a complete trace. Another limited option is to capture provenance information and encode it into databases, which eventually become silos of provenance information that are hard to access and use.

Use Case Scenario

Geologist Janet uses a number of off-the-shelf tools to generate gravity maps of some regions using experimental data gathered from field visits. The process of generating the maps is fairly well known and is not complex enough for Janet to justify the creation and use of a workflow engine. Some areas of the maps have significantly different values than nearby areas. In order to lend credibility to her claim that such rapid variations do indeed exist, Janet must discover and make available the specific data sets that were used in generating those maps, as well as the parameters input to the process when it was executed.

Problems and Limitations

To answer these questions, a complete trace of the execution of the process is needed, including the source datasets, the identification of the tools that were used, and input parameters provided in such ways as keyboard input, shell environment variables, etc. More complex is the problem of justifying the existence of intermediate artifacts that are in the process of being used to generate a final conclusion (i.e., an artifact) that does not exist yet.
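As a minimal sketch of what such a trace might hold, assuming each manual step is recorded at the time it is carried out, the following uses a hypothetical StepRecord structure. The field names, tool names and dataset names are illustrative assumptions and do not come from any existing provenance vocabulary. A step whose output is None stands for an intermediate step whose artifact does not exist yet.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepRecord:
    """One human-executed step (hypothetical structure, not a standard)."""
    tool: str                     # tool name and version as noted by the scientist
    inputs: dict                  # dataset name -> SHA-256 digest of its content
    parameters: dict              # values entered by keyboard or GUI
    environment: dict             # selected shell environment variables
    output: Optional[str] = None  # None while the artifact does not exist yet


def record_step(tool, datasets, parameters, env_names, output=None):
    """Build a StepRecord from dataset bytes, typed parameters and env variable names."""
    inputs = {name: hashlib.sha256(data).hexdigest() for name, data in datasets.items()}
    environment = {name: os.environ.get(name, "") for name in env_names}
    return StepRecord(tool, inputs, parameters, environment, output)


def source_datasets(trace, artifact):
    """Follow recorded steps backwards and return the datasets no step produced."""
    produced_by = {step.output: step for step in trace if step.output is not None}
    sources, pending, seen = set(), [artifact], set()
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        step = produced_by.get(name)
        if step is None:
            sources.add(name)            # not produced by a recorded step: a raw source
        else:
            pending.extend(step.inputs)  # keep tracing through intermediate artifacts
    return sources


trace = [
    record_step("gridder 1.0",
                {"survey_a.csv": b"1,2", "survey_b.csv": b"3,4"},
                {"cell_size": 500}, ["HOME"], output="grid.dat"),
    record_step("mapper 2.3", {"grid.dat": b"grid"},
                {"contour_interval": 5}, [], output="gravity_map.png"),
]
sources = source_datasets(trace, "gravity_map.png")
```

Here source_datasets walks from the final map back through the intermediate grid to the two survey files. Recording content digests rather than file names alone is one possible way to show later that the datasets made available are the ones actually used.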

Semantic Disambiguation of Data Provider Identity

Name

Semantic disambiguation of data provider identity

Owner

Aleksey Chayka

Provenance Dimensions

Primary: Content: Attribution (verifying attribution)

Secondary: Content: Evolution and versioning (republishing) Use: Interoperability, Understanding (Presentation)

Background and Current Practice

In the W3C recommendation Uniform Resource Identifier (URI): Generic Syntax, a resource is defined as “anything that has identity”. To date, the good practice of associating a single unique URI with each resource on the Web is not supported by any large-scale web infrastructure. This means that there is no easy and “standard” way of preventing the creation of URI aliases; as a consequence, a new URI is minted for the same resource any time a statement is made about it in a different location on the Web. Identifying the proper source when it may be referred to by many URIs thus becomes a problem.

There are currently two major approaches which can potentially help to solve the problem. The first is the Linking Open Data initiative [2], which has the goal to “connect related data that wasn’t previously linked”. The main approach pursued by the initiative is to establish owl:sameAs statements between resources in RDF. While the community has made a huge effort to link a significant amount of data, their approach depends on specialized, data-source-dependent heuristics to establish the owl:sameAs statements between resources, and it requires the statements to be stored somewhere, along with the data. This approach raises several concerns. First, in most Web scenarios, ordinary web users are unlikely to make the effort to create owl:sameAs statements for their data. Second, an error in an identity statement might have far-reaching ramifications across the entire Web of Data. Finally, reasoning over massive numbers of owl:sameAs statements in distributed ontologies is a computationally complex and highly expensive task, which may lead to the conclusion that these linked data are more suitable for browsing than for reasoning or querying.
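The effect of owl:sameAs statements on URI aliases can be illustrated with a small sketch that groups URIs into alias sets using a union-find structure. This is only an illustration under simplifying assumptions: it treats owl:sameAs purely as an equivalence relation over URI strings held in memory, the class name AliasSets is hypothetical, and the example.org URIs are invented. It is not how any Linked Data tool is known to implement identity reasoning.

```python
class AliasSets:
    """Groups URIs connected by sameAs statements (union-find, illustrative only)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        """Return the representative URI of the alias set containing `uri`."""
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every URI on the walked path directly at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def same_as(self, a, b):
        """Record one sameAs statement between two URIs."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_same(self, a, b):
        """True if the two URIs are linked by a chain of sameAs statements."""
        return self._find(a) == self._find(b)

    def aliases(self, uri):
        """Return every known URI in the same alias set as `uri`."""
        root = self._find(uri)
        return {u for u in list(self._parent) if self._find(u) == root}


links = AliasSets()
links.same_as("http://a.example.org/GeorgeWBush", "http://b.example.org/gwb")
links.same_as("http://b.example.org/gwb", "http://c.example.org/president43")
bush_aliases = links.aliases("http://c.example.org/president43")
```

The sketch also makes the second concern concrete: a single mistaken same_as call merges two alias sets, so every URI in one set is from then on treated as identical to every URI in the other.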

The second approach is presented in Jaffri et al. [3]. In their work resulting from the ReSIST project, these authors concluded that the problem of the proliferation of identifiers and the resulting coreference issues should be addressed at an infrastructural level. As a solution, they propose what they call a Consistent Reference Service. However, their point that URIs may change “meaning” depending on the context in which they are used is philosophically disputable: the fact that several entities might be named in the same way (“Spain” the football team and the country) must not lead to the conclusion that they can be considered the same. Furthermore, their implementation of “coreference bundles”, which establish identity between entities, is in fact very similar to the collection of owl:sameAs statements described in the previous approach.

Goal

The goal is the disambiguation of information source identities, regardless of how the sources are identified by other users. A properly established identity of a source can then be used for reasoning about it.

Use Case Scenario

Bob wants to cite the 43rd President of the United States in his e-mail to a group of people. He wants to refer to the President in such a way that everybody recognizes the correct person. Millions of people may call him “43rd”, which uniquely identifies the person for each of them. Billions of people may identify him as “George W. Bush”; for the rest, the best identifier would probably be “the President of the United States George W. Bush”. Bob’s goal is to let other people find (identify) the source of information that he wants to cite. Note that even if a source is misidentified, its wrong ID can still be used within a group of people who share conventions on identifying such a source [1].

Problems and Limitations

Abstracting from the human perception of a source ID, the problem is to unify the way a source is identified so that it can be consistently queried or reasoned about. For web data, identification of a source itself (with the help of a proper domain name and local URI) is not an issue. But when a source operates with RDF graphs containing statements that refer to other sources, the question of semantic disambiguation of types and of the sources themselves arises. When alternative IDs or some facts about the source are available, the disambiguation can be resolved by a system such as the Entity Name System [1]. The provenance of a source will help to disambiguate the source itself (i.e. differentiate a data originator from a mediator).

References

[1] P. Bouquet, T. Palpanas, H. Stoermer, and M. Vignolo, "A Conceptual Model for a Web-scale Entity Name System," in ASWC, Shanghai, China, 2009

[2] http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

[3] A. Jaffri, H. Glaser, and I. Millard, "URI identity management for semantic web data integration and linkage," in 3rd International Workshop on Scalable Semantic Web Knowledge Base Systems, Springer, 2007

[4] P. Bouquet and H. Stoermer, "OKKAM: Enabling an Entity Name System for the Semantic Web," in Proceedings of the I-ESA2008 Workshop on Semantic Interoperability, 2008

Hidden Bug

Hidden Bug