Graphs Design 6.1/Crawler Example

From RDF Working Group Wiki

This is a more detailed walk through of the Shared Web Crawler and Archiving Web Crawler use cases, showing several ways they can be addressed using Graphs Design 6.1.


The Scenario

Craig's computer system is doing RDF Web crawling. It has a list of URLs from which it will fetch RDF content. It will parse that content and save the resulting RDF Graphs. It will then make available the information it gathered, and some metadata about how the information was gathered. That information will be obtained and used by Dave's machine.

Some more details:

If we adopt Graphs Design 6.1, there are still many ways to address this scenario. Each section below presents one of these ways; they all use Design 6.1.

These examples do not include as much metadata as one would probably like. In particular, clients probably SHOULD pay attention to cache-management headers like Last-Modified, Expires, ETag, and Cache-Control. The examples should be detailed enough to show how such header information could be included.
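For instance, a DereferenceOperation could carry such headers directly as extra properties. In this sketch, eg:lastModified and eg:etag are hypothetical property names, not used elsewhere on this page:

   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page1>;
     eg:atTime "2012-04-02T16:07:01"^^xs:dateTime;
     eg:status 200;
     eg:lastModified "2012-04-01T09:30:00"^^xs:dateTime;  # hypothetical property
     eg:etag "\"686897696a7c876b7e\"";                    # hypothetical property
   ].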

Blank Nodes Referring to RDF Graphs

With this approach, we refer to the RDF graphs using blank node labels.

_:g1 { <a> <b> 1 }
_:g2 { <a> <b> 2 }
{
   _:g1 a rdf:Graph.   # this says that _:g1 names the graph itself
   _:g2 a rdf:Graph.   # ditto for _:g2
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page1>;
     eg:atTime "2012-04-02T160701"^^xs:dateTime;
     eg:result _:g1;
     eg:status 200;
   ].
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page2>;
     eg:atTime "2012-04-02T160702"^^xs:dateTime;
     eg:result _:g2;
     eg:status 200;
   ].
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page3>;
     eg:atTime "2012-04-02T160703"^^xs:dateTime;
     eg:status 404;
   ].
}
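Dave's machine could then query the default graph for this provenance. A SPARQL sketch (the eg: prefix URI is an assumption; this page never declares it):

   PREFIX eg: <http://example.org/vocab#>
   SELECT ?source ?time ?g
   WHERE {
     ?op a eg:DereferenceOperation ;
         eg:source ?source ;
         eg:atTime ?time ;
         eg:result ?g .
   }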

Pros:

  • Semantically simple; the blank node "_:g1" refers to the RDF graph that was parsed from the content obtained from page1.

Cons:

  • Uses blank nodes

Use the Original URLs as Graph Labels

With this approach, the graph tags are the URLs used to obtain those graphs.

<http://alice.example.org/page1> { <a> <b> 1 }
<http://alice.example.org/page2> { <a> <b> 2 }
 
{
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page1>;
     eg:atTime "2012-04-02T160701"^^xs:dateTime;
     eg:resultTagged <http://alice.example.org/page1>;
     eg:status 200;
   ].
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page2>;
     eg:atTime "2012-04-02T160702"^^xs:dateTime;
     eg:resultTagged <http://alice.example.org/page2>;
     eg:status 200;
   ].
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page3>;
     eg:atTime "2012-04-02T160703"^^xs:dateTime;
     eg:status 404;
   ].
}

Pros:

  • For simple uses, the default graph can be ignored

Cons:

  • If there are multiple dereferences of one URL, the result tag and the source will have to differ for some of them
  • Legitimate, accurate crawler results will logically conflict (different graphs for the same label) if a source changes between crawls
  • Multiple datasets from different crawls (where some sources have changed contents) can't be properly combined without application logic
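To see the conflict concretely, suppose page1 changes between two crawls; merging the two datasets puts two different graphs under the same label (the triples here are made up):

   # crawl 1
   <http://alice.example.org/page1> { <a> <b> 1 }
   # crawl 2, after page1's content changed
   <http://alice.example.org/page1> { <a> <b> 99 }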

Snapshot URLs as Graph Labels

With this approach, the graph tags are URLs minted by the crawler, which can themselves be dereferenced to obtain the graphs.

<http://craig.example.org/snap/e02cce51a67d8ca63f5d2ced5c5068b996ab6026> { <a> <b> 1 }
<http://craig.example.org/snap/93a60479a59194657189180397825328d70e8916> { <a> <b> 2 }
 
{
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page1>;
     eg:atTime "2012-04-02T160701"^^xs:dateTime;
     eg:resultAt <http://craig.example.org/snap/e02cce51a67d8ca63f5d2ced5c5068b996ab6026>;
     eg:status 200;
   ].
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page2>;
     eg:atTime "2012-04-02T160702"^^xs:dateTime;
     eg:resultAt <http://craig.example.org/snap/93a60479a59194657189180397825328d70e8916>;
     eg:status 200;
   ].
   [ a eg:DereferenceOperation;
     eg:source <http://alice.example.org/page3>;
     eg:atTime "2012-04-02T160703"^^xs:dateTime;
     eg:status 404;
   ].
}
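The snapshot labels above look like 40-character hex digests, so one plausible way to mint them is to hash the retrieved content. This is only a guess at how Craig's crawler might work; the base URL and the choice of SHA-1 are assumptions:

```python
import hashlib

# Assumed base URL for Craig's snapshot service (taken from the example labels).
SNAP_BASE = "http://craig.example.org/snap/"

def mint_snapshot_url(content: bytes) -> str:
    """Mint a snapshot URL from the SHA-1 digest of the retrieved bytes.

    Identical content always yields the same URL, so re-crawling an
    unchanged page reuses its existing snapshot.
    """
    return SNAP_BASE + hashlib.sha1(content).hexdigest()

print(mint_snapshot_url(b"<a> <b> 1 ."))
```

A content-addressed scheme like this also deduplicates storage: two sources serving identical RDF share one snapshot graph.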

Pros:

  • Can handle multiple retrievals of the same source URL
  • Clients may be able to store just the graph URLs and dereference them only when needed

Cons:

  •  ???