ProvenanceRDFNamedGraph

From Provenance WG Wiki
Revision as of 15:48, 21 September 2011 by Gklyne (Talk | contribs)

RDF Named Graph

Specific requirements that the Prov-WG has on RDF named graphs

RDF WG Dataset Proposal

RDF WG Provenance Use Case

RDF WG Graph use cases

Provenance Requirements

  1. Ability to collectively refer to a set of provenance assertions - to make additional assertions regarding date, author, and related provenance information
    • This is a requirement of the account construct, which also identifies a scope within which some properties must hold (e.g. at most one process execution generating an entity). [Luc]
    • This is supported by XG-Requirement #1 (documented below) [Yolanda]
  2. Ability to retrieve the provenance of an RDF resource (this requires a finer level of granularity, at the level of subject, predicate and object, which the RDF named graph construct does not support)
    • This is supported by XG-Requirement #4 (documented below) [Yolanda]
  3. Ability to retrieve the provenance of a set of RDF statements (named graphs are a means to identify a set of statements and thus make them identifiable, which is a prerequisite for assigning provenance information to them) [Kai, following Telco of Sep 15 2011]
  4. Ability to compare two graphs and determine if they are equal. This is useful if we want to compare provenance.
    • The Jena API has methods for both graph comparison (isIsomorphicWith, implementing the "method described in RDF document") and graph merge (implemented in our work on provenance query operators) [Satya]
  5. Ability to group triples, down to a single triple, so that we can describe that grouping's provenance.
  6. [GK] Possible need to use named graphs to record contextualization of provenance assertions - depends on discussions about provenance - see http://lists.w3.org/Archives/Public/public-prov-wg/2011Sep/0000.html. See in particular the paragraph starting "I then think our discussion becomes one of how the contextualization of an assertion is captured..." about the middle of the email.
  7. An entity contains a fixed list of attribute-value pairs. There must be a mechanism by which it is possible to identify which attribute-value pairs have been asserted. Named graphs may be used for this. [Luc]
    • Is this different from grouping a set of assertions? [Satya]
  8. Would be nice to have a mechanism to sign a provenance graph. Is this a requirement for named graphs? [Luc]
    • We can make assertions about a named graph - author, signature, "validator" etc. [Satya]
    • Signing an RDF document by a publisher is part of the XG-Requirement #4 (documented below) [Yolanda]
  9. Evolution: An important requirement is the ability to describe the provenance of a dynamic, evolving resource. Over time, there may be updates and even new versions that change some aspect of the resource. A challenge is to describe how the new incarnations of the resource relate to one another, and to determine whether provenance records should be self-contained and attached to each incarnation, or instead refer to prior ones for details. As resources may be republished, perhaps repackaging, summarizing, or mixing their contents, their provenance records need to reflect such processes and their implications on the contents. (This is XG-Requirement #2, documented below) [Yolanda]
  10. Entailment: Another important requirement is to distinguish what is directly asserted by the entities and processes that produce the resource from other information that may be inferred from those assertions or perhaps derived or hypothesized by a third party. (This is XG-Requirement #3, documented below) [Yolanda]
  11. Querying: Provenance information may be made accessible in some manner, and there must be mechanisms to find the provenance for a given resource. Query formulation and execution must be provided for provenance information. Ideally, there should be a convenient way to formulate queries that span primary and provenance information. (This is XG-Requirement #5, documented below) [Yolanda]
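
Requirements 1, 3 and 4 above can be illustrated with a minimal sketch in Python, assuming a dataset is modeled as a mapping from graph names to sets of triples. All names here (ex:g1, ex:alice, graphs_equal, provenance_of) are hypothetical and chosen for illustration only. The equality check below only works for graphs without blank nodes; a full comparison such as Jena's isIsomorphicWith must also match blank nodes.

```python
from typing import Dict, FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def graphs_equal(g1: Graph, g2: Graph) -> bool:
    """Compare two ground graphs (no blank nodes) by their triple sets."""
    return g1 == g2


# Hypothetical dataset: each name identifies a group of triples.
dataset: Dict[str, Graph] = {
    "ex:g1": frozenset({("ex:report", "ex:generatedBy", "ex:run1")}),
    "ex:g2": frozenset({("ex:report", "ex:generatedBy", "ex:run1")}),
    "ex:g3": frozenset({("ex:report", "ex:generatedBy", "ex:run2")}),
}

# Assertions about the groups themselves, using graph names as subjects.
graph_metadata: Set[Triple] = {
    ("ex:g1", "ex:author", "ex:alice"),
    ("ex:g1", "ex:date", "2011-09-15"),
    ("ex:g2", "ex:author", "ex:bob"),
}


def provenance_of(graph_name: str) -> Set[Triple]:
    """Return the assertions whose subject is the given graph name."""
    return {t for t in graph_metadata if t[0] == graph_name}


same_content = graphs_equal(dataset["ex:g1"], dataset["ex:g2"])
different_content = graphs_equal(dataset["ex:g1"], dataset["ex:g3"])
```

Here ex:g1 and ex:g2 hold the same triples but carry different metadata, because the metadata is attached to the graph name rather than to the graph content.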

[GK] Requirement from discussion with Andy Seaborne

In a meeting with Andy Seaborne this morning, we discussed provenance requirements and RDF named graphs, in light of some options that the RDF group might be considering.

The resulting requirement that we articulated was that for the purposes of provenance, we must be able to treat two "named graphs" with identical graph content as two distinct entities.

Use-case

Suppose we have some resource R.

Observer A makes a provenance assertion about R on Monday 2011-09-19, which is expressed as an RDF graph Pra.

Observer B makes a provenance assertion about R on Friday 2011-09-23, which is expressed as an RDF graph Prb.

To express provenance about the provenance assertions, we may wish to say:

Pra statedBy A; onDate "2011-09-19" .
Prb statedBy B; onDate "2011-09-23" .

It may be that the provenance assertions Pra and Prb have identical content; i.e. they are RDF graphs containing identical triple sets. For the purposes of provenance recording, it is important that even when they express the same graph, Pra and Prb are distinct RDF nodes. If Pra and Prb are treated as a common RDF node, one might then infer:

_:something statedBy A ; onDate "2011-09-23" .

which in this scenario would be false.
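
The use-case can be replayed in a small Python sketch, offered as an illustration under stated assumptions rather than as a model of any RDF WG proposal. The triple inside the graphs (ex:R ex:derivedFrom ex:S) and the helper names describe and supports are hypothetical. The sketch attaches the statedBy and onDate properties first to nodes identified by name, then to nodes identified by graph content, and checks which of the two allows the false inference.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Set, Tuple

Triple = Tuple[str, str, str]

# Assumption: both observers happen to assert exactly the same triples.
content: FrozenSet[Triple] = frozenset({("ex:R", "ex:derivedFrom", "ex:S")})
graphs: Dict[str, FrozenSet[Triple]] = {"Pra": content, "Prb": content}

# (graph name, observer, date) as in the use-case.
statements = [("Pra", "A", "2011-09-19"), ("Prb", "B", "2011-09-23")]

Description = Dict[Hashable, Set[Tuple[str, str]]]


def describe(key_of: Callable[[str], Hashable]) -> Description:
    """Attach statedBy/onDate properties to the node chosen by key_of."""
    described: Description = {}
    for name, observer, date in statements:
        props = described.setdefault(key_of(name), set())
        props.add(("statedBy", observer))
        props.add(("onDate", date))
    return described


def supports(described: Description, observer: str, date: str) -> bool:
    """True if some single node carries both the observer and the date."""
    return any(
        ("statedBy", observer) in props and ("onDate", date) in props
        for props in described.values()
    )


by_name = describe(lambda name: name)             # Pra and Prb stay distinct
by_content = describe(lambda name: graphs[name])  # Pra and Prb collapse
```

With by_name, no node is both stated by A and dated 2011-09-23. With by_content, the two descriptions merge onto one node and the false combination becomes derivable.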

Discussion

A particular consequence of this is that an RDF "named graph" specification based on graph literals (where RDF literals are self-denoting), somewhat like formulae in Notation 3, would have to be used with care. That is, if Pra and Prb are graph literals, then Pra = Prb, and the given provenance-of-provenance statements could not be expressed as suggested above.

(This does not preclude a graph literal approach being used, but the above use-case might need to be constructed slightly differently.)
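
The page does not say how the use-case would be restructured. One possible reading, offered purely as an assumption, is to give each act of asserting its own node and let that node point at the shared graph literal. The names ex:assertion1, ex:assertion2, asserts and stated_on below are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]

# Assumption: a graph literal is modeled as an immutable value, so two
# literals with the same triples are equal and indistinguishable.
content: FrozenSet[Triple] = frozenset({("ex:R", "ex:derivedFrom", "ex:S")})

# Each assertion event is a distinct node that refers to the literal.
assertion_events: Dict[str, Dict[str, object]] = {
    "ex:assertion1": {"asserts": content, "statedBy": "A", "onDate": "2011-09-19"},
    "ex:assertion2": {"asserts": content, "statedBy": "B", "onDate": "2011-09-23"},
}


def stated_on(observer: str, date: str) -> bool:
    """True if a single assertion event has both this observer and date."""
    return any(
        event["statedBy"] == observer and event["onDate"] == date
        for event in assertion_events.values()
    )
```

In this arrangement the two events share one graph value, yet the observer and date stay attached to separate nodes.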


Note: Provenance Requirements on RDF from the W3C Provenance Incubator Group

This note is included here as documentation for some of the requirements listed above. It documents requirements for RDF that were reported by the W3C Provenance Incubator Group. Details on these requirements are in the report "Provenance Requirements for the Next Version of RDF", presented at the 2010 RDF Next Steps workshop (see also the slides from the presentation).

XG Requirements

  1. Identity -- (XG-Requirement #1) A key challenge is to be able to refer to the artifact that we are describing the provenance for. Within the RDF context, the artifact could be a single RDF statement, a set of statements or an arbitrary set of Web resources.
  2. Evolution -- (XG-Requirement #2) An important requirement is the ability to describe the provenance of a dynamic, evolving resource. Over time, there may be updates and even new versions that change some aspect of the resource. A challenge is to describe how the new incarnations of the resource relate to one another, and to determine whether provenance records should be self-contained and attached to each incarnation, or instead refer to prior ones for details. As resources may be republished, perhaps repackaging, summarizing, or mixing their contents, their provenance records need to reflect such processes and their implications on the contents.
  3. Entailment -- (XG-Requirement #3) Another important requirement is to distinguish what is directly asserted by the entities and processes that produce the resource from other information that may be inferred from those assertions or perhaps derived or hypothesized by a third party.
  4. Publication -- (XG-Requirement #4) A publisher of provenance information needs to use some provenance representation language and link the provenance assertions to the actual resource information. The publisher may choose to publish only a subset of the provenance records, and should be able to identify themselves possibly with a signature that is verifiable by others.
  5. Querying -- (XG-Requirement #5) Provenance information may be made accessible in some manner, and there must be mechanisms to find the provenance for a given resource. Query formulation and execution must be provided for provenance information. Ideally, there should be a convenient way to formulate queries that span primary and provenance information.

Misc. related materials

Some of Tim's musings while developing a data integration system (comments welcome):

http://webr3.org/blog/semantic-web/rdf-named-graphs-vs-graph-literals/