Rdb2RdfXG/LinkedDataUpdateLogs

From W3C Wiki

Publishing LinkedData Update Logs

Problem

When RDB data is published on the Web, e.g. as LinkedData, it is important to keep track of DB (and hence RDF) updates, so that crawlers know what has changed since the last crawl and which resources should be re-retrieved from that endpoint.

A centralized registry (such as the one implemented by the PingTheSemanticWeb service) does not seem feasible once Linked Data becomes more popular: think of millions of Linked Data endpoints pinging such a registry.

Possible solution: Standardized LinkedData Update Logs

Each LinkedData endpoint provides information about updates performed in a certain timespan as a special/standardized LinkedData source.

Let's assume the Example.com company provides a Linked Data endpoint with information about their products, employees etc. The endpoint is reachable via http://example.com/lod/.

The LOD endpoint contains a special LOD space below http://example.com/lod/updates which contains information about updates.

http://example.com/lod/updates could for example return the following RDF:


http://example.com/lod/updates/2007   rdf:type   r2rul:UpdateCollection .
http://example.com/lod/updates/2008   rdf:type   r2rul:UpdateCollection .


http://example.com/lod/updates/2008 could then return the following RDF:


http://example.com/lod/updates/2008/Jan   rdf:type   r2rul:UpdateCollection .
http://example.com/lod/updates/2008/Feb   rdf:type   r2rul:UpdateCollection .


This nesting could continue until we finally reach a URL that exposes all updates performed in one particular second. Even for very frequently updated LOD endpoints (e.g. Wikipedia), an interval of one second should be small enough that the related update information can still be retrieved easily. For rarely updated LOD endpoints (e.g. a personal weblog), links should only point to non-empty UpdateCollections in order to prevent crawlers from performing unnecessary HTTP requests.
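
The descent described above can be sketched from the crawler's side. This is only an illustration under stated assumptions: the function names `path_key` and `changed_leaves` are hypothetical, and `children` is a stand-in for "fetch the collection URL and read the UpdateCollection links it returns", so no real HTTP or RDF library is involved. It also assumes the year/month/day/hour/minute/second path layout used in the examples on this page.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def path_key(url, root):
    """Turn '<root>/2008/Jan/01' into the comparable tuple (2008, 1, 1)."""
    parts = [p for p in url[len(root):].strip("/").split("/") if p]
    key = []
    for position, part in enumerate(parts):
        # Position 1 is the month abbreviation; everything else is numeric.
        key.append(MONTHS.index(part) + 1 if position == 1 else int(part))
    return tuple(key)

def changed_leaves(root, children, last_crawl):
    """Return per-second collection URLs dated at or after last_crawl.

    `children(url)` is assumed to return the nested UpdateCollection URLs
    that `url` links to (an empty list for unknown or leaf URLs).
    """
    since = (last_crawl.year, last_crawl.month, last_crawl.day,
             last_crawl.hour, last_crawl.minute, last_crawl.second)
    found = []
    stack = [root]
    while stack:
        url = stack.pop()
        key = path_key(url, root)
        if key < since[:len(key)]:
            continue  # the whole time span lies before the last crawl
        if len(key) == 6:
            found.append(url)  # per-second collection: worth fetching
        else:
            stack.extend(children(url))
    return sorted(found, key=lambda u: path_key(u, root))
```

Because whole years, months or days that precede the last crawl are skipped after a single comparison, the number of requests depends on how much has changed rather than on the size of the log. The second in which the last crawl happened is visited again on purpose, since further updates may have been logged in it after the crawler passed.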

http://example.com/lod/updates/2008/Jan/01/17/58/06 would then, for example, contain RDF links (and additional metadata) to the LinkedData documents updated on Jan 1st, 2008 at 17:58:06, e.g. the following triples:


http://example.com/lod/updates/2008/Jan/01/17/58/06/123   r2rul:updatedResource   http://example.com/lod/users/JohnDoe .
http://example.com/lod/updates/2008/Jan/01/17/58/06/123   r2rul:updatedAt         "2008-01-01T17:58:06"^^xsd:dateTime .
http://example.com/lod/updates/2008/Jan/01/17/58/06/123   r2rul:updatedBy         http://example.com/lod/users/JohnDoe .
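
On the publishing side, the three triples above could be written out as N-Triples lines roughly as follows. This is a sketch, not a proposed implementation: the `R2RUL` namespace URI is hypothetical (this page does not define one), the function name `update_triples` is made up for illustration, and the timestamp is emitted without a timezone because that question is still open (see Issues).

```python
from datetime import datetime

# Hypothetical namespace URI for the r2rul vocabulary; not defined on this page.
R2RUL = "http://example.com/ns/r2rul#"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"

def update_triples(update, resource, updated_at, updated_by):
    """Return the N-Triples lines describing one logged update."""
    return [
        "<{}> <{}updatedResource> <{}> .".format(update, R2RUL, resource),
        '<{}> <{}updatedAt> "{}"^^<{}> .'.format(
            update, R2RUL, updated_at.isoformat(), XSD_DATETIME),
        "<{}> <{}updatedBy> <{}> .".format(update, R2RUL, updated_by),
    ]

lines = update_triples(
    "http://example.com/lod/updates/2008/Jan/01/17/58/06/123",
    "http://example.com/lod/users/JohnDoe",
    datetime(2008, 1, 1, 17, 58, 6),
    "http://example.com/lod/users/JohnDoe",
)
print("\n".join(lines))
```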


Individual updates are identified by a sequential number ("123" in the example). Arbitrary metadata can be attached to these updates, such as the time of the update (probably redundant, since it can be inferred from the URL) or the person who performed the update.
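
To illustrate how the update time can be inferred from the URL, here is a minimal sketch that maps between a timestamp plus sequence number and an update URL. The helper names `update_url` and `time_from_update_url` are hypothetical, the path layout is the one used in the example above, and the datetimes are deliberately naive (no timezone), since the timezone of timestamps is listed as an open issue.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def update_url(root, when, seq):
    """Build '<root>/YYYY/Mon/DD/HH/MM/SS/<seq>' for one update."""
    return "{}/{:04d}/{}/{:02d}/{:02d}/{:02d}/{:02d}/{}".format(
        root.rstrip("/"), when.year, MONTHS[when.month - 1],
        when.day, when.hour, when.minute, when.second, seq)

def time_from_update_url(root, url):
    """Recover the update time from the first six path segments below root."""
    base = root.rstrip("/")
    if not url.startswith(base + "/"):
        raise ValueError("URL is not below the update log root")
    parts = url[len(base):].strip("/").split("/")
    if len(parts) < 6:
        raise ValueError("URL does not reach per-second granularity")
    year, month, day, hour, minute, second = parts[:6]
    return datetime(int(year), MONTHS.index(month) + 1, int(day),
                    int(hour), int(minute), int(second))
```

The month names are taken from an explicit list rather than from locale-dependent formatting, so the generated URLs do not change with the server's language settings.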

This mechanism, as well as a basic update log vocabulary (i.e. the r2rul:updatedResource, r2rul:updatedAt and r2rul:updatedBy properties), could be standardized by this XG.

Issues

  • Relevance for this XG
  • Timezone of timestamps