Use Case Provenance for Environmental Marine Data

From XG Provenance Wiki


Irini Fundulaki

(Curator: Simon Miles)

Provenance Dimensions

Content: Attribution (Responsibility), Process (Reproducibility), Evolution and Versioning (Updates), Justification for Decisions, Entailment

Use: Understanding, Trust

Background and Current Practice

This use case is inspired by the use of environmental marine data in forecasting models, marine biology and climate change studies. Multiple sources (observation points) continuously record the physical, biological and chemical parameters of the coastal areas of the Mediterranean Sea. This data is transmitted to and stored in the database (warehouse) of a central service, which provides it to forecasting models as well as to marine biologists who study correlations between these parameters and changes in the fauna and flora of the coasts. The central service can also provide the forecasting models and the marine biologists with archival data from manually curated databases. Marine biologists also work with vocabularies (ontologies and schemas) that describe the fauna and flora of the seas, and they may set up in-house experiments to verify certain findings. The central service can use the collected data to build tools that support the protection of the coastal marine environment as well as business and urban development along the coasts of the Mediterranean Sea. Users can build materialized views on the data (i.e., extract and copy the data into their own workspace), and it is crucial in this context to be able to maintain these views as efficiently as possible. In addition, one can assign trust values to the data sources and compute the trustworthiness of the experimental results. We assume that data are represented in the RDF data model.


The goal is to detect and record the origins of sensor data (including data from faulty sensors), so that, among other things, experiments that used incorrect or affected data can be rerun.

Use Case Scenario

Consider a network of sensors (marine observation points) that transmit information related to, among others, (a) wind, (b) sea surface and bottom temperature, (c) wave height and dimension, and (d) ecosystem forecasts (nitrates, phosphates, etc.). Consider that user A is a marine biologist who queries the warehoused data in order to extract data that will subsequently be integrated with the available ontologies and vocabularies.

A scenario:

(a) Imagine that user A has classified an animal under a class of the vocabulary she is using, but new data shows that this classification is not correct. Nevertheless, the user would like to keep the information that had been implied by the classification and be able to annotate it with the sources that provided it.

(b) Imagine that one of the sensors was faulty and the marine biologists want to know which of the experiments conducted used that sensor's data. This knowledge would allow them to repeat only the experiments that used the faulty sensor's data.
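Scenario (b) can be illustrated with a minimal Python sketch. It assumes that each experiment records the set of sensors whose readings it consumed; the experiment and sensor identifiers and the function names are hypothetical and are not part of the use case.

```python
from collections import defaultdict

# Hypothetical lineage log: each experiment lists the sensors whose
# readings it consumed. All identifiers are illustrative assumptions.
experiment_inputs = {
    "exp-temperature-trend": {"sensor-01", "sensor-02"},
    "exp-nitrate-levels": {"sensor-03"},
    "exp-wave-forecast": {"sensor-02", "sensor-04"},
}


def build_sensor_index(inputs):
    """Invert the experiment -> sensors mapping into sensor -> experiments."""
    index = defaultdict(set)
    for experiment, sensors in inputs.items():
        for sensor in sensors:
            index[sensor].add(experiment)
    return index


def experiments_to_rerun(inputs, faulty_sensor):
    """Return, in sorted order, the experiments that used the faulty sensor."""
    return sorted(build_sensor_index(inputs).get(faulty_sensor, set()))


affected = experiments_to_rerun(experiment_inputs, "sensor-02")
print(affected)  # ['exp-temperature-trend', 'exp-wave-forecast']
```

Only the experiments returned by the lookup need to be repeated; the experiment that never touched the faulty sensor is left alone.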

Problems and Limitations

Existing provenance models for relational data are inadequate for the scenario above. Imagine that user A wants to retrieve part of the data from the service's database, using a standard query language (SPARQL in the case of RDF data), for use in experiments. An experiment can itself be a SPARQL query. We want to be able to store the provenance of the result so that, if an update occurs (e.g., the data used in the experiment is later shown to be faulty), we can use the stored information to rerun just the experiments that used the faulty data. However, because of the intricacies of the SPARQL query language operators (notably OPTIONAL, which behaves like an outer join), the existing provenance models for relational data are not sufficient: they do not deal adequately with outer joins.
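To make the outer-join difficulty concrete, the following Python sketch mimics an OPTIONAL-style left outer join over rows annotated with their sources. It is an illustration under stated assumptions, not a SPARQL engine: the row layout, the function optional_join, and the sample sites, sensors and values are all hypothetical.

```python
def optional_join(left, right, key):
    """Left outer join over provenance-annotated rows.

    Each input row is a pair (bindings, sources). Each output row is a
    triple (bindings, sources, depends_on_absence), where the flag is True
    when the row exists only because NO matching right-hand row was found.
    """
    results = []
    for l_bind, l_src in left:
        matches = [(r_bind, r_src) for r_bind, r_src in right
                   if r_bind[key] == l_bind[key]]
        if matches:
            for r_bind, r_src in matches:
                # Matched rows depend on the sources of both inputs.
                results.append(({**l_bind, **r_bind}, l_src | r_src, False))
        else:
            # Unmatched rows keep only the left bindings and sources.
            results.append((dict(l_bind), l_src, True))
    return results


# Hypothetical data: observations come from sensor-A, temperatures from sensor-B.
observations = [
    ({"site": "s1", "species": "seagrass"}, frozenset({"sensor-A"})),
    ({"site": "s2", "species": "sponge"}, frozenset({"sensor-A"})),
]
temperatures = [
    ({"site": "s1", "temp": 18.5}, frozenset({"sensor-B"})),
]

for row in optional_join(observations, temperatures, "site"):
    print(row)
```

The row for site s2 lists only sensor-A as its source, yet it would change if sensor-B later reported a temperature for s2. A provenance model that records only the sources that contributed positively to a result cannot detect this dependency.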

We believe that the issues of (a) representing, (b) querying and (c) storing provenance are crucial for the above use case with respect to the following aspects: (a) reproducibility of experiments, (b) view maintenance, (c) trust, (d) entailment and (e) attribution (responsibility). A small number of solutions have been proposed for the representation of provenance information for RDF graphs. An important aspect of RDF graphs is that they have both an extensional and an intensional aspect, and both should be taken into account when managing graphs with provenance information.

The concept of named graphs was proposed in [1] as a way of representing explicit provenance information of RDF triples. Intuitively, an RDF named graph is a collection of triples associated with a URI that other graphs can refer to as a normal resource; in this way, one can assign explicit provenance information to this collection of triples. Unfortunately, the authors of [1] do not discuss RDFS inference, queries and updates in the presence of RDF named graphs, and existing work on querying and updating RDF has been extended either with named graphs (such as SPARQL and SPARQL Update) or with RDFS inference support [4,5], but not with both.

The authors of [8] discuss the concept of networked graphs, which allow users to define RDF graphs both by extensionally listing content and by using views on other graphs. In [2] the authors showed that named graphs alone are not able to capture the provenance of implicit RDF triples, and introduced the notion of graphsets. A graphset is defined as a set of RDF named graphs; it is itself associated with a unique identifier and with a set of triples whose ownership is shared by the named graphs that constitute the graphset.

[3] proposed the use of colors to capture the provenance of both implicit and explicit triples of RDF data and schemas. The provenance of a triple is recorded as a fourth column, hence obtaining a quadruple, and can be seen as representing the source the triple comes from. Colors can capture provenance at a fine level of granularity and are a generalization of RDF named graphs: an RDF named graph can be modeled by an arbitrary set of triples sharing the same color. To capture the provenance of implicit RDF triples, the authors of that work propose an algebraic structure defined by a set of colors and an operator that works on colors and returns the composite color representing the provenance of an implicit triple. To perform this computation, implicit triples and their colors are obtained by extending the RDFS inference rules, as defined in the RDFS Semantics, to handle quadruples instead of triples.
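The idea of colored quadruples can be sketched in a few lines of Python. This is a simplified illustration and not the formalism of [3]: colors are modeled as frozensets of source names, the composition operator is assumed to be set union, only two RDFS-style rules (subclass transitivity and type propagation) are applied, and the example resources are hypothetical.

```python
from itertools import product

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def combine(color1, color2):
    """Composite color of an implicit triple (assumption: set union)."""
    return color1 | color2


def infer(quads):
    """Apply two RDFS-style rules to (subject, predicate, object, color)
    quadruples until no new quadruple is produced."""
    closure = set(quads)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s1, p1, o1, c1), (s2, p2, o2, c2) in product(closure, repeat=2):
            # (x P y) and (y subClassOf z) imply (x P z), for P in
            # {subClassOf, type}; the result gets the composite color.
            if p2 == SUBCLASS and o1 == s2 and p1 in (SUBCLASS, TYPE):
                new.add((s1, p1, o2, combine(c1, c2)))
        if not new <= closure:
            closure |= new
            changed = True
    return closure


# Hypothetical input: the schema comes from a vocabulary, the classification from user A.
quads = {
    ("ex:Seahorse", SUBCLASS, "ex:Fish", frozenset({"vocabulary"})),
    ("ex:Fish", SUBCLASS, "ex:Animal", frozenset({"vocabulary"})),
    ("ex:specimen42", TYPE, "ex:Seahorse", frozenset({"userA"})),
}

inferred = infer(quads)
for quad in sorted(inferred, key=lambda q: (q[0], q[1], q[2])):
    print(quad[:3], sorted(quad[3]))
```

In this sketch the implicit triple stating that ex:specimen42 is an ex:Animal carries the composite color {userA, vocabulary}, so it remains traceable to both the user's classification and the vocabulary that implied it.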

The problem of querying the provenance of RDF triples has not been adequately studied in the literature. The authors of [2,3] have studied provenance propagation and querying for the typeOf, subclassOf and subpropertyOf RDF hierarchies when triples with their provenance are modeled as quadruples, but have not discussed the provenance of triples obtained from the evaluation of SPARQL queries. The fundamental question raised in this context concerns the use of provenance information to tackle problems such as view maintenance and trust. In the case of trust, for example, the research question is whether a trust value can be recomputed after an update without looking at the input data. The problem is similar for view maintenance. Existing work on provenance in the relational context [6,7] cannot capture the intricacies of the SPARQL OPTIONAL operator, which introduces negation. Lastly, the storage of RDF triples carrying provenance information needs to be studied as part of a solution for managing the provenance of RDF triples.
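The trust question can be illustrated with a small Python sketch in the spirit of the semiring approach of [7]. It assumes that the provenance stored for a result is a list of alternative derivations, each being the set of sources used jointly, and that trust is combined with min inside a derivation and max across derivations. This trust scheme, the function name and the sample values are assumptions made for illustration; the sketch covers only positive queries and does not handle OPTIONAL.

```python
def result_trust(derivations, source_trust):
    """Trust of a result from its stored provenance alone.

    derivations: alternative derivations, each a set of source names
                 that were used together to produce the result.
    source_trust: mapping from source name to a trust value in [0, 1].
    """
    return max(min(source_trust[s] for s in d) for d in derivations)


# Hypothetical stored provenance of one query result: it can be derived
# either from sensor-01 together with the archive, or from sensor-02 alone.
provenance = [frozenset({"sensor-01", "archive"}), frozenset({"sensor-02"})]

trust = {"sensor-01": 0.9, "archive": 0.8, "sensor-02": 0.6}
print(result_trust(provenance, trust))  # 0.8

# sensor-01 turns out to be faulty: only the trust assignment changes,
# and the result's trust is recomputed without re-reading the input data.
trust["sensor-01"] = 0.1
print(result_trust(provenance, trust))  # 0.6
```

The recomputation touches only the stored provenance and the updated trust assignment, which is the behavior the research question asks for.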

Existing Work

[1] J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, Provenance and Trust. In WWW, 2005.

[2] P. Pediaditis, G. Flouris, I. Fundulaki, and V. Christophides. On Explicit Provenance Management in RDF/S Graphs. In TAPP, 2009.

[3] G. Flouris, I. Fundulaki, P. Pediaditis, Y. Theoharis, and V. Christophides. Coloring RDF Triples to Capture Provenance. In ISWC, 2009.


[5] J. Perez, M. Arenas, and C. Gutierrez. nSPARQL: A Navigational Language for RDF. In ISWC, 2008.

[6] P. Buneman, J. Cheney, and S. Vansummeren. On the Expressiveness of Implicit Provenance in Query and Update Languages. In ICDT, 2007.

[7] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.

[8] S. Schenk and S. Staab. Networked Graphs: A Declarative Mechanism for SPARQL Rules, SPARQL Views and RDF Data Integration on the Web. In WWW, 2008.