Using named graphs to model Accounts

From Provenance WG Wiki
Revision as of 17:30, 10 November 2011 by Gklyne (Talk | contribs)


author: Tim Lebo

contributors: Richard Cyganiak, Luc Moreau, Andy Seaborne, Sandro Hawke, and Satya Sahoo.

This section describes a proposal to use RDF named graphs and OWL to model PROV Accounts (as part of PROV OWL encoding).

Objective

Accounts have two purposes [1]:

  • It is the mechanism by which attribution of provenance can be assserted[sic]; it allows asserters to bundle up their assertions, and assert suitable attribution;
  • It provides a scoping mechanism for expression identifiers and for some contraints[sic] (such as generation-unicity and derivation-use).

By using RDF and choosing appropriate URIs for PROV Entities, etc., this second purpose is moot. Thus, we can focus on fulfilling the first purpose.

Named graphs

Although named graphs are not a part of the 2004 RDF recommendation [2], Carroll et al. described named graphs in their seminal 2005 paper Named Graphs, Provenance and Trust [3]. The adoption of named graphs evolved from practical needs while developing RDF applications and they are a commonplace feature of most mature RDF APIs and systems. The first formal recognition of named graphs came with the SPARQL 1.0 recommendation in 2008 [4].

The following three points should clarify what named graphs are:

  • Named graphs let you specify a subset of the "global" RDF graph.
  • Named graphs are not graph literals.
  • Named graphs are more like file paths.

Next, we discuss each point in turn.

Named graphs let you specify a subset of the "global" RDF graph. The point of RDF is to interconnect data across the world. Its objective was to make a unified, connected global RDF graph. While RDF's design achieves this rather well, many applications require a "subset" mechanism. Named graphs became that mechanism and permit applications to operate on, query, or specify some RDF, not all of it. The use of RDF "subsets" comes from practical needs, in contrast to the idealistic, theoretical union of all RDF ever found in the world.

Named graphs are not graph literals. Named graphs are passed by reference [5], not by value [6]. When I tell you about a named graph, I can only tell you where it is and what was in it when I last looked. If you go look in the same location, you may find a different RDF graph. Thus, named graphs behave much like URLs and file paths on disk -- they are merely containers for whatever you put in them; their contents can change over time, and they can disappear altogether. Graph literals, on the other hand, provide a by-value representation. They are a feature of the non-standard N3 language [7] and are not widely used.

Named graphs are more like file paths. Since named graphs are not graph literals, and their contents (value) can be transient, they must be treated in the same manner as file paths on disk, or URLs on the web.
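The by-reference behaviour described here can be sketched in a few lines of Python. This is only an illustrative model, not the API of any RDF library: the dataset dictionary, the snapshot helper, and the abbreviated triples are assumptions made for the example.

```python
# Illustrative model only: a dataset maps graph names to mutable sets of triples.
CARD = "http://www.w3.org/People/Berners-Lee/card"

dataset = {
    CARD: {("card:i", "rdfs:label", "Tim Berners-Lee")},
}

def snapshot(name):
    """Return an immutable copy of what the named graph holds right now (by value)."""
    return frozenset(dataset[name])

# Handing someone the name is handing them a reference: nothing is copied.
reference = CARD
seen_earlier = snapshot(reference)

# Later, the contents stored under the same name change.
dataset[CARD].add(("card:i", "con:assistant", "card:amy"))

# Dereferencing the same name now yields a different graph,
# while the earlier snapshot is unaffected.
changed = snapshot(reference) != seen_earlier
print(changed, len(seen_earlier), len(dataset[CARD]))
```

A consumer that needs stable contents would keep the snapshot (or a serialization of it), since the name alone does not pin down what the graph contains.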

Representing a named graph

Named graphs surface in a variety of ways. In all cases, they are used to "section off" a particular subset of RDF -- so that one is not considering all RDF ever found or handled.

Named graphs are predominantly represented in the following ways:

  1. As a URI that resolves to an RDF file
  2. In part of a file (e.g. TRiG)
  3. In part of a SPARQL endpoint
  4. In part of an API method call (e.g. Sesame)
  5. In RDF itself (e.g. SPARQL 1.1 service description)

Next, we discuss each form in turn. While forms 1-4 are discussed to illustrate the use of named graphs, the fifth will be incorporated to model PROV Accounts.

As a URI that resolves to an RDF file. As part of Linked Data design [8], URIs that are used to name and describe resources in the world should be resolvable on the web. For example, a URI for the author of this document, <http://purl.org/twc/id/person/TimLebo>, can be requested using HTTP to receive further descriptions of that person. In terms of named graphs, this URI -- and any other that returns RDF -- can be seen as a named graph in a global, web-wide named graph set [9].


In part of a file (e.g. TRiG). Perhaps the most straightforward introduction to how a named graph can be represented is how it can be encoded in a text file. TRiG [10] is much like Turtle [11], but uses curly braces to section off different parts of the RDF graph. An unnamed section is the default graph. The following example [12] shows one default graph and two named graphs, <http://www.w3.org/People/Berners-Lee/card> and <http://www.cs.rpi.edu/~hendler/foaf.rdf>.

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix con:   <http://www.w3.org/2000/10/swap/pim/contact#> .
@prefix card:  <http://www.w3.org/People/Berners-Lee/card#> .

# This document contains a default graph and two named graphs.

{ 
   card:i rdf:type foaf:Person .
   <http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler> a foaf:Person .
}

<http://www.w3.org/People/Berners-Lee/card> { 
   card:i rdfs:label "Tim Berners-Lee";
        con:assistant card:amy .
}
 
<http://www.cs.rpi.edu/~hendler/foaf.rdf> { 
   <http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler>
       a foaf:Person;
       foaf:depiction <http://www.cs.rpi.edu/~hendler/hendler.gif>;
       foaf:firstName "Jim" .
}


In part of a SPARQL endpoint. By loading a couple of triples into a named graph of a triple store, consumers can query just that small set of triples. For example, even though there are millions of triples in a SPARQL endpoint, the following query will select only the two triples [13] about Tim Berners-Lee that we placed in the endpoint's section called <http://www.w3.org/People/Berners-Lee/card> (see results).

select ?s ?p ?o
where {
  graph <http://www.w3.org/People/Berners-Lee/card> {
    ?s ?p ?o
  }
}
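What the GRAPH clause does can be mimicked with a small in-memory sketch in Python. The quad list and the select_from_graph helper are hypothetical and exist only for this illustration; a real endpoint evaluates the SPARQL query itself.

```python
# Illustrative quad store: each entry is (graph name, subject, predicate, object).
CARD = "http://www.w3.org/People/Berners-Lee/card"
FOAF = "http://www.cs.rpi.edu/~hendler/foaf.rdf"

quads = [
    (CARD, "card:i", "rdfs:label", "Tim Berners-Lee"),
    (CARD, "card:i", "con:assistant", "card:amy"),
    (FOAF, "<http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler>", "foaf:firstName", "Jim"),
]

def select_from_graph(quads, graph_name):
    """Mimic GRAPH <name> { ?s ?p ?o }: keep only triples stored under that name."""
    return [(s, p, o) for (g, s, p, o) in quads if g == graph_name]

rows = select_from_graph(quads, CARD)
for row in rows:
    print(row)
```

However many quads the store holds, only those filed under the requested graph name are returned.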


In part of an API method call (e.g. Sesame). Most mature RDF graph APIs provide an optional parameter to specify which named graphs should be affected. For example, the Sesame Java API [14] accepts a list of 0 or more named graphs (which they call contexts) [15] when adding RDF data to a Repository. The interface for this method is shown here:

org.openrdf.repository

Interface RepositoryConnection

void add(File        file,
         String      baseURI,
         RDFFormat   dataFormat,
         Resource... contexts)
         throws IOException,
                RDFParseException,
                RepositoryException

    Adds RDF data from the specified file to a specific contexts in the repository.

    Parameters:
        file - A file containing RDF data.
        baseURI - The base URI to resolve any relative URIs that are in the data against. This defaults to the value of file.toURI()
                  if the value is set to null.
        dataFormat - The serialization format of the data.
        contexts - The contexts to add the data to. Note that this parameter is a vararg and as such is optional. If no contexts 
                   are specified, the data is added to any context specified in the actual data file, or if the data contains no
                   context, it is added without context. If one or more contexts are specified the data is added to these contexts,
                   ignoring any context information in the data itself. 

The following Java calls to Sesame's RepositoryConnection load the same two triples [16] into the section called <http://www.w3.org/People/Berners-Lee/card>:

            // repository is an initialized Sesame Repository; vf is its ValueFactory
            // (for example, obtained from repository.getValueFactory()).
            conn = repository.getConnection();
            // The trailing vararg names the context (named graph) that receives the triples.
            conn.add(new File("/Users/tlebo/Desktop/prov-wg/hg/prov/ontology/components/Account/tbl.ttl"), 
                              "", RDFFormat.TURTLE, vf.createURI("http://www.w3.org/People/Berners-Lee/card"));
            conn.commit();
            conn.close();
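The rule stated for the contexts parameter can be modelled in Python as follows. This is a sketch of the documented behaviour only, not Sesame code: the add function, the store set, and the tuple layout are assumptions made for the illustration.

```python
def add(store, parsed_statements, *contexts):
    """Model of the documented rule; store is a set of (s, p, o, context) tuples.

    parsed_statements yields (s, p, o, context_in_data); the context may be None.
    """
    for s, p, o, context_in_data in parsed_statements:
        if contexts:
            # Explicit contexts win; context information in the data is ignored.
            for context in contexts:
                store.add((s, p, o, context))
        else:
            # Keep the data's own context, or None for "without context".
            store.add((s, p, o, context_in_data))
    return store

CARD = "http://www.w3.org/People/Berners-Lee/card"
# Turtle carries no graph names, so every parsed statement has context None.
turtle_data = [
    ("card:i", "rdfs:label", "Tim Berners-Lee", None),
    ("card:i", "con:assistant", "card:amy", None),
]

with_context = add(set(), turtle_data, CARD)
without_context = add(set(), turtle_data)
print(sorted(with_context))
```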


In RDF itself

In RDF itself (e.g. SPARQL 1.1 service description). This final form of representing a named graph will be used to model PROV Accounts, since it allows us to describe RDF named graphs in RDF itself. Regardless, the other four forms of representing a named graph remain useful and should continue to be applied.

The SPARQL 1.1 Service Description "provide[s] a mechanism by which a client or end user can discover information about [a] SPARQL service such as ... details about the available dataset." [17]. The Service Description includes an RDF vocabulary [18] that can be used to describe a sd:Service available at a sd:url that will list any sd:NamedGraph it offers, along with the named graph's sd:name. The entire vocabulary is illustrated here.

Using this vocabulary, we can describe [19] the named graphs that were created in the SPARQL endpoint in our example above:

@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix :   <http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/components/Account/tbl-jah-in-logd.ttl#> .

:logd_endpoint
  a sd:Service;
  sd:url <http://logd.tw.rpi.edu/sparql>;
  sd:availableGraphDescriptions [ 
     a sd:GraphCollection;
     sd:namedGraph :tbl_graph, 
                   :jah_graph;
  ];
.

:tbl_graph 
  a sd:NamedGraph; 
  sd:name <http://www.w3.org/People/Berners-Lee/card>;
.

:jah_graph
  a sd:NamedGraph; 
  sd:name <http://www.cs.rpi.edu/~hendler/foaf.rdf>;
.

Knowing that the SPARQL endpoint at http://logd.tw.rpi.edu/sparql contains and offers the named graphs <http://www.w3.org/People/Berners-Lee/card> and <http://www.cs.rpi.edu/~hendler/foaf.rdf> allows any client to construct and submit SPARQL queries to obtain results using the SPARQL Protocol for RDF [20], which we demonstrated above.
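As a rough illustration of how a client could turn such a description into a request, the Python sketch below builds a SPARQL Protocol GET URL from an endpoint URL and a graph name. The graph_query_url helper is hypothetical, and no request is actually sent; only the standard "query" parameter of the protocol is relied upon.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def graph_query_url(endpoint, graph_name):
    """Build a SPARQL Protocol GET URL that selects every triple in one named graph."""
    query = "SELECT ?s ?p ?o WHERE { GRAPH <" + graph_name + "> { ?s ?p ?o } }"
    return endpoint + "?" + urlencode({"query": query})

# Values taken from the service description: sd:url and sd:name.
url = graph_query_url(
    "http://logd.tw.rpi.edu/sparql",
    "http://www.w3.org/People/Berners-Lee/card",
)

# Decode the query string again to check that the graph name survived encoding.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded)
```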

Equivalence of named graphs

It is very important to recognize the following:

A graph's name does not identify its contents.

That is, the fact that two graphs have the same name does not imply that they contain equivalent RDF graphs.

(TODO: clarify using g-box and g-snap)

As a simple example, compare tbl-jah.trig and tbl-jah-2.trig, whose diff is shown below. Both TRiG files name their graphs <http://www.w3.org/People/Berners-Lee/card> and <http://www.cs.rpi.edu/~hendler/foaf.rdf>, but specify different contents.

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-	@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .	@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .			@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix con:   <http://www.w3.org/2000/10/swap/pim/conta	@prefix con:   <http://www.w3.org/2000/10/swap/pim/conta
@prefix card:  <http://www.w3.org/People/Berners-Lee/car	@prefix card:  <http://www.w3.org/People/Berners-Lee/car

# This document contains a default graph and two named g	# This document contains a default graph and two named g

{ 								{ 
   card:i rdf:type foaf:Person .			   |	   card:i rdf:type foaf:Person;
   <http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler> a    |	          foaf:mbox_sha1sum "965c47c5a70db7407210cef6e4e
							   >	   <http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler> a 
							   >	         foaf:surname "Hendler" .
}								}

<http://www.w3.org/People/Berners-Lee/card> { 			<http://www.w3.org/People/Berners-Lee/card> { 
   card:i rdfs:label "Tim Berners-Lee";				   card:i rdfs:label "Tim Berners-Lee";
        con:assistant card:amy .				        con:assistant card:amy .
							   >	   card:i foaf:img <http://www.w3.org/Press/Stock/Berner
}								}
 								 
<http://www.cs.rpi.edu/~hendler/foaf.rdf> { 			<http://www.cs.rpi.edu/~hendler/foaf.rdf> { 
   <http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler>		   <http://www.cs.rpi.edu/~hendler/foaf.rdf#jhendler>
       a foaf:Person;						       a foaf:Person;
       foaf:depiction <http://www.cs.rpi.edu/~hendler/he	       foaf:depiction <http://www.cs.rpi.edu/~hendler/he
       foaf:firstName "Jim" .				   |	       foaf:firstName "Jim";
							   >	       foaf:title "Tetherless World Constellation Chair"
}								}
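The same point can be checked mechanically. The Python sketch below models only the Hendler graph from the two files, with abbreviated triples, and reports graph names that both datasets use for different contents. The dictionaries and the helper function are assumptions made for the illustration.

```python
FOAF = "http://www.cs.rpi.edu/~hendler/foaf.rdf"
JAH = FOAF + "#jhendler"

# Only one graph per file is modelled here, with a subset of its triples.
tbl_jah = {
    FOAF: frozenset({(JAH, "foaf:firstName", "Jim")}),
}
tbl_jah_2 = {
    FOAF: frozenset({
        (JAH, "foaf:firstName", "Jim"),
        (JAH, "foaf:title", "Tetherless World Constellation Chair"),
    }),
}

def same_name_different_contents(a, b):
    """Return the graph names that both datasets use for different sets of triples."""
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])

conflicts = same_name_different_contents(tbl_jah, tbl_jah_2)
print(conflicts)
```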

... and their proliferation

So, when we are given a named graph's name (say, <http://www.cs.rpi.edu/~hendler/foaf.rdf>), we must further inquire as to where that named graph resides.

In addition to its name and location, its last modified time can be used to adequately distinguish (or, identify) a named graph. Since any of the contents of the locations listed above can change, so can any of the named graphs they describe. Thus, we can stand on solid ground when we use the following to cite a named graph that interests us:

  • The named graph's name
  • The named graph's location
  • The named graph's last modified time

These attributes can be used to construct an owl:hasKey [21] to infer that named graph descriptions with different URIs identify the same named graph.
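The effect of such a key can be approximated in Python: descriptions that agree on name, location, and last modified time are grouped as the same named graph. This only mimics what an OWL reasoner would conclude from an owl:hasKey axiom; the describing URIs, the timestamp, and the second endpoint are hypothetical.

```python
from collections import defaultdict

CARD = "http://www.w3.org/People/Berners-Lee/card"
ENDPOINT = "http://logd.tw.rpi.edu/sparql"
MODIFIED = "2011-10-06T12:00:00Z"  # hypothetical last modified time

# Hypothetical descriptions: (describing URI, graph name, location, last modified).
descriptions = [
    ("http://example.org/a#tbl_graph", CARD, ENDPOINT, MODIFIED),
    ("http://example.org/b#card", CARD, ENDPOINT, MODIFIED),
    ("http://example.org/c#card", CARD, "http://example.org/sparql", MODIFIED),
]

def group_by_key(descriptions):
    """Group describing URIs by the (name, location, last modified) key they share."""
    groups = defaultdict(set)
    for uri, name, location, modified in descriptions:
        groups[(name, location, modified)].add(uri)
    return groups

# Keys shared by more than one URI indicate descriptions of the same named graph.
same_graph = [sorted(uris) for uris in group_by_key(descriptions).values() if len(uris) > 1]
print(same_graph)
```

The third description shares the graph's name but not its location, so it is kept apart from the first two.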

TODO: Instead of speaking of a graph's “location”, it might be more accurate to speak of the “RDF dataset” it is in. The term is defined in the SPARQL spec. -Richard

Named meta graphs

Named meta graphs are named graphs that describe other named graphs.

Traditional modeling

Two-graph provenance example [22]

TODO: confirm that this illustrates the traditional modeling of other-graph descriptions.

:G1 {
  [] a :Publishing;
     :date "2011-09-30"^^xsd:date;
     :webAddress <http://example.com/>;
     :triples :G2.
}
:G2 {
  :s1 :p1 :o1.
}

Adding named graphs' location(s)

(TODO: As a principle (of AWWW), one name can only refer to one thing [23]. But direct (lazy?) modeling violates this.)

(TODO: "graph" here seems to refer to graph-as-a-location but also to "graph as the contents of the location". But those are different things. The RDF-WG has the concept of a "graph box" (g-box), which is a thing that holds one "graph-value" (g-snap - snapshot) [24].)

TODO: This is a rephrasing of http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs/Options#The_Tag_Names_Something.2C_Locally

Because citing a named graph's name may not provide adequate information about which graph is being referenced, we need to extend the traditional modeling shown above to prepare to model PROV Accounts. This extension, which reuses the SPARQL Service Description vocabulary, includes the graph's location as well as its name. For example, the following TriG file [25] describes two named graphs. The first, :about_tbl_card, describes the second, <http://www.w3.org/People/Berners-Lee/card>, by saying that the second graph is "about" Tim Berners-Lee, using the popular dcterms:subject property. The prov:hadLocation <> triple distinguishes the named graph in this file from any other named graph named <http://www.w3.org/People/Berners-Lee/card>; only the one in this file is being described.

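@prefix sd:      <http://www.w3.org/ns/sparql-service-description#> .
@prefix prov:    <http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceOntology.owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix con:     <http://www.w3.org/2000/10/swap/pim/contact#> .
@prefix card:    <http://www.w3.org/People/Berners-Lee/card#> .
@prefix :        <#> .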

:about_tbl_card {
   :tbl_card
       a sd:NamedGraph;
       sd:name          <http://www.w3.org/People/Berners-Lee/card>;
       prov:hadLocation <>;
       dcterms:subject  card:i;
       rdfs:comment "The named graph <http://www.w3.org/People/Berners-Lee/card> in this file is about Tim Berners-Lee.";
   .
}

<http://www.w3.org/People/Berners-Lee/card> {
   card:i rdfs:label "Tim Berners-Lee";
        con:assistant card:amy .
}

In addition to describing the other named graph (<http://www.w3.org/People/Berners-Lee/card>), :about_tbl_card also describes itself towards the end of the file. It says that it is about the two named graphs in this file.


@prefix : <#> .

:about_tbl_card {
   :about_tbl_card
       a sd:NamedGraph;
       sd:name          :about_tbl_card;
       prov:hadLocation <>;
       dcterms:subject  :tbl_card,
                        :about_tbl_card;
       rdfs:comment "The named graph #about_tbl_card in this file is about two named graphs in this file.";
   .
}

Named meta graphs of cache graphs

Now that we've shown how "pairs" of named graphs can be used -- where one describes the other -- we can introduce one application of named meta graphs. This will get us one step closer to modeling PROV Accounts, while exercising the more fundamental aspects of PROV to describe the origin of a named graph's contents.

A cache graph is a named graph whose contents are created by constructing a "verbatim" copy of some "external" RDF resource. For example, the following TRiG [26] uses a named meta graph to describe the origin of a second, named, cache graph.

TODO: Sindice uses graph names that correspond to the URL that was retrieved [27].

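@prefix sd:          <http://www.w3.org/ns/sparql-service-description#> .
@prefix prov:        <http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceOntology.owl#> .
@prefix dcterms:     <http://purl.org/dc/terms/> .
@prefix rdfs:        <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop:     <http://dbpedia.org/property/> .
@prefix :            <#> .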

:about_that_big_graph_below {

   :what_I_got_from_dbpedia
      a sd:NamedGraph;
      sd:name <hash:GRAPH_SHA256-488328c605c2f9532819d4ff8258650d2e03087e00454d25f5a39659f01d0b8307>;
      prov:hadLocation <>;

      dcterms:subject <http://dbpedia.org/resource/World_Wide_Web_Consortium>; 
      rdfs:comment 
         "The named graph <hash:GRAPH_SHA256-488328c605c2f9532819d4ff8258650d2e03087e00454d25f5a39659f01d0b8307> in this file is about W3C.";

      prov:wasGeneratedBy :the_process_that_created_that_big_graph;
      rdfs:comment 
         "The named graph <hash:GRAPH_SHA256-488328c605c2f9532819d4ff8258650d2e03087e00454d25f5a39659f01d0b8307> in this file is from DBpedia.";
   .

   :the_process_that_created_that_big_graph
      a prov:ProcessExecution;
      rdfs:seeAlso :download, 
                   :reserialize; # How does PROV handle composition of processes?
   .
   :download
      a prov:ProcessExecution;
      dcterms:description 
        "curl -H 'Accept: application/rdf+xml' -L http://dbpedia.org/resource/World_Wide_Web_Consortium > World_Wide_Web_Consortium.rdf";
   .
   :reserialize
      a prov:ProcessExecution;
      prov:followed :download;
      dcterms:description "rapper -g -o turtle World_Wide_Web_Consortium.rdf";
   .
}

# ...

<hash:GRAPH_SHA256-488328c605c2f9532819d4ff8258650d2e03087e00454d25f5a39659f01d0b8307> {

   <http://dbpedia.org/resource/Acid1>
       dbpedia-owl:owner <http://dbpedia.org/resource/World_Wide_Web_Consortium> .

   <http://dbpedia.org/resource/Agora_%28web_browser%29>
       dbpprop:developer <http://dbpedia.org/resource/World_Wide_Web_Consortium> .

   <http://dbpedia.org/resource/Alan_Kotok>


   # ...

}


curl -H "Accept: application/rdf+xml" -L http://dbpedia.org/resource/World_Wide_Web_Consortium


The same graph is also on a SPARQL endpoint:

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?o
WHERE {
  GRAPH <hash:GRAPH_SHA256-488328c605c2f9532819d4ff8258650d2e03087e00454d25f5a39659f01d0b8307>  {
    ?s ?p ?o
  }
} 
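The graph name used in this example is derived from a hash. The text does not say how that hash was computed, so the Python sketch below shows only one assumed scheme: SHA-256 over the sorted, de-duplicated N-Triples lines of a graph without blank nodes. The content_name helper and the sample statements are illustrative, and the names it produces are not expected to match the one in the example.

```python
import hashlib

def content_name(ntriples_lines):
    """Derive a graph name from contents: SHA-256 over sorted, de-duplicated lines.

    Assumes the graph has no blank nodes and each line is one N-Triples statement.
    """
    canonical = "\n".join(sorted(set(line.strip() for line in ntriples_lines)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "hash:GRAPH_SHA256-" + digest

W3C = "<http://dbpedia.org/resource/World_Wide_Web_Consortium>"
graph = [
    "<http://dbpedia.org/resource/Acid1> <http://dbpedia.org/ontology/owner> " + W3C + " .",
    "<http://dbpedia.org/resource/Agora_%28web_browser%29> "
    "<http://dbpedia.org/property/developer> " + W3C + " .",
]

name_a = content_name(graph)
name_b = content_name(reversed(graph))  # same triples, different order
name_c = content_name(graph[:1])        # different contents
print(name_a == name_b, name_a == name_c)
```

Under this assumption a name changes whenever the contents change, which sidesteps the problem of one name being reused for different contents.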


TODO(Tim): finish out the example and show how named graphs can be combined with PROV's OWL encoding (Showing the distinction between an Account and ProvenanceContainer). The solution should be able to represent RDF circumscribed by named graphs as well as just by file distinctions.


TODO: part/whole [28]

Named meta graphs of PROV assertions

The OWL encoding of PROV [29] defines the classes and properties that can be used to make PROV assertions using RDF. In addition to listing the classes and properties, it includes inferences that can be applied to any PROV assertions found, which can provide a consumer with more associations than what was originally provided.

One of the simplest RDF assertions that can be made using the PROV OWL encoding is the derivation of one file from another:

<http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/components/Account/tbl-jah-2.trig.prov.ttl#result>
<http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceOntology.owl#wasDerivedFrom> 
<http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/components/Account/tbl-jah-2.trig.prov.ttl#original> .

(Perhaps even simpler is to modify the contents of the same file path; for this situation, consider this example.)

This triple is asserted in a file by itself [30], as well as part of a file that gives a more complete picture of what happened [31].

The dcterms:description in the larger file sums up the RDF PROV assertions:

"Tim copy-pasted tbl-jah.trig to tbl-jah-2.trig and added a couple triples from two foaf files on the web."

But who said Tim did this? Well, Tim did. And if someone else wanted to report on what Tim did, we need a way to distinguish these claims, which is why PROV needs Accounts. So, we need to point at that triple (in the simple case) and the larger description (of twenty triples) and say that Tim says that Tim made these changes. Similarly, we need to permit anyone else to point to Tim's materials and results and say what he did with them.

Accounts are named meta graphs that describe a second named graph that went through an "assertion" ProcessExecution.

We could have a second PROV Account that varies from the previous claim:

<http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/components/Account/tbl-jah-2.trig.prov.ttl#original>
<http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceOntology.owl#wasDerivedFrom> 
<http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/components/Account/tbl-jah-2.trig.prov.ttl#result> .

This is much like the crime-file's SpellChecking ProcessExecution: nothing changed, but it was reviewed.
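To make the idea of competing accounts concrete, the Python sketch below stores the two opposing wasDerivedFrom claims in separate accounts and looks up which account asserts a given triple. The account names, the second asserter, and the dictionary layout are hypothetical; only the two triples come from the text.

```python
PROV = "http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceOntology.owl#"
BASE = ("http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/components/"
        "Account/tbl-jah-2.trig.prov.ttl#")

derived = PROV + "wasDerivedFrom"
result, original = BASE + "result", BASE + "original"

# Hypothetical account names and asserters.
accounts = {
    "#account_1": {"asserter": "Tim", "triples": {(result, derived, original)}},
    "#account_2": {"asserter": "someone else", "triples": {(original, derived, result)}},
}

def accounts_asserting(triple):
    """Return (account name, asserter) for every account that contains the triple."""
    return sorted(
        (name, account["asserter"])
        for name, account in accounts.items()
        if triple in account["triples"]
    )

who = accounts_asserting((result, derived, original))
print(who)
```

Keeping each claim inside its own named graph is what lets a consumer attribute it to an asserter instead of merging contradictory statements into one graph.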

PROV-O examples

2011-11-08 (a month after this document was written): http://dvcs.w3.org/hg/prov/file/3ba83e9ffa92/ontology/components/Account/different-accounts-can-include-the-same-entity.ttl

Comments

These should be considered, but have not been incorporated into the document yet.

  • “Sameness of named graphs” – The entire section seems questionable to me. It is true that nothing stops us from using the same graph name with different contents in different datasets, but this is equally true of any URI anywhere in RDF. As far as RDF Concepts and RDF Semantics are concerned, the URI ex:foo might represent one thing in RDF graph G1 and something else in RDF graph G2 – it is only the social contracts and conventions around URI ownership and web architecture that discourage such behaviour and allow us to maintain the fiction that URIs in RDF actually identify specific entities. The same could be said of graph names. -Richard [32]
  • I'm skeptical about the assertion that a named graph's “last modified time” contributes to its identity. In the example, the last modified time of the individual graphs is unknowable. It perhaps makes sense to talk about a last modified time of the enclosing TriG file, but not of the graphs themselves. -Richard [33]

2011 Oct 06 prov-wg telecon

  • Luc: how to find out if a prov assertion is in an Account?
    • what is the PROV OWL construct?
    • Tim: SPARQL query into the named graph
  • Sandro: what if named graphs change?
    • Tim: you can serialize it, if you REALLY want to and are worried that it may change in the current location.
  • Sandro: what is Tim's definition of named graph?
    • Tim: A location to put a subset of RDF (though, Richard has warned against "location").
  • Luc: accounts can be hierarchical. they can be nested in other accounts.
    • Tim: void:subset should do the trick.
    • Tim should get related work from Satya on grouping named graphs.
    • Tim: GraphCollection in sd:
  • Luc: what we are scoping is what we say about those resources (not how we are naming them).
    • Account 1: saying one thing about one resource
    • Account 2: saying another thing about another resource.

public-rdf-prov@w3.org

  • this describes how an application that really wants to track changes might go about naming of the significant concepts: it does not rely on the publisher doing anything (Sandro has written up the version where the publisher publishes in a way that makes the state at a particular time explicit) [34]
  • N3: log:includes is the relationship of a location and its contents. It's at a point in time, when the application rules run. To capture the possibility of observations at different times, each observation generates a URI and makes claims about the observation [35]

RDF-WG F2F2

  • publisher vs. consumer -
  • knowledge has different perspectives.
  • named graphs are about decentralization and pluralism.
    • TODO: avoid the "global RDF" graph pitch - different people have different perspectives.

Graham's ORE note

OAI/ORE uses named graphs to model resource maps. In order to also talk about resource maps, they are modelled as a subclass of named graphs as defined by the TRiX work. In due course, I expect that this will be superseded by the current RDF work, but it does exemplify a way to use RDF named graphs, and also to explicitly model them and make statements about them.

 <rdfs:Class rdf:about="http://www.openarchives.org/ore/terms/ResourceMap">
   <rdfs:label>Resource Map</rdfs:label>
   <rdfs:comment>
     A description of an Aggregation according to the OAI-ORE data model. Resource Maps are serialised to a machine readable format according to the implementation guidelines.
   </rdfs:comment>
   <rdfs:subClassOf rdf:resource="http://www.w3.org/2004/03/trix/rdfg-1/Graph"/>
   <rdfs:isDefinedBy rdf:resource="http://www.openarchives.org/ore/terms/"/>
 </rdfs:Class>

References