Publishing LinkedData Update Logs
Problem
When relational database (RDB) data is published on the Web, e.g. as LinkedData, it is important to keep track of DB (and hence RDF) updates so that crawlers know what has changed since the last crawl and should be re-retrieved from that endpoint.
A centralized registry (as implemented, for example, by the PingTheSemanticWeb service) does not seem feasible once Linked Data becomes more popular: think of millions of Linked Data endpoints pinging such a registry.
Possible solution: Standardized LinkedData Update Logs
Each LinkedData endpoint provides information about updates performed in a certain timespan as a special/standardized LinkedData source.
Let's assume the Example.com company provides a Linked Data endpoint with information about their products, employees etc. The endpoint is reachable at http://example.com/lod/ .
The LOD endpoint contains a special LOD space below http://example.com/lod/updates that holds information about updates.
http://example.com/lod/updates could, for example, return the following RDF:
http://example.com/lod/updates/2007 rdf:type r2rul:UpdateCollection .
http://example.com/lod/updates/2008 rdf:type r2rul:UpdateCollection .
http://example.com/lod/updates/2008 could then return the following RDF:
http://example.com/lod/updates/2008/Jan rdf:type r2rul:UpdateCollection .
http://example.com/lod/updates/2008/Feb rdf:type r2rul:UpdateCollection .
This nesting could continue until we finally reach a URL that exposes all updates performed in a particular second. For very frequently updated LOD endpoints (e.g. Wikipedia), a one-second interval should be small enough that the related update information can still be retrieved easily. For rarely updated LOD endpoints (e.g. a personal weblog), links should point only to non-empty UpdateCollections in order to prevent crawlers from performing unnecessary HTTP requests.
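The following Python sketch illustrates how a crawler might walk this nesting and skip collections that cannot contain anything newer than its last crawl. It is only an illustration under stated assumptions: the `children` dictionary stands in for the links a crawler would obtain by dereferencing each UpdateCollection, the month segments are assumed to be English three-letter abbreviations as in the examples above, and the function names are hypothetical.

```python
from datetime import datetime

BASE = "http://example.com/lod/updates"
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def parse_prefix(url, base=BASE):
    """Turn .../2008/Jan/01 into the tuple (2008, 1, 1)."""
    segments = [s for s in url[len(base):].split("/") if s]
    # The second path segment is a month name; all others are numbers.
    return tuple(MONTHS.index(s) + 1 if i == 1 else int(s)
                 for i, s in enumerate(segments))

def may_contain_new(url, last_crawl):
    """True unless the whole collection lies before the last crawl."""
    prefix = parse_prefix(url)
    crawl_time = (last_crawl.year, last_crawl.month, last_crawl.day,
                  last_crawl.hour, last_crawl.minute, last_crawl.second)
    return prefix >= crawl_time[:len(prefix)]

def changed_seconds(url, children, last_crawl):
    """Yield the per-second collection URLs worth re-retrieving."""
    if not may_contain_new(url, last_crawl):
        return  # pruned: no HTTP request would be needed here
    if len(parse_prefix(url)) == 6:  # year/month/day/hour/minute/second
        yield url
        return
    for child in children.get(url, []):
        yield from changed_seconds(child, children, last_crawl)

# Assumed sample hierarchy; a real crawler would fetch these links over HTTP.
children = {
    BASE: [BASE + "/2007", BASE + "/2008"],
    BASE + "/2008": [BASE + "/2008/Jan"],
    BASE + "/2008/Jan": [BASE + "/2008/Jan/01"],
    BASE + "/2008/Jan/01": [BASE + "/2008/Jan/01/17"],
    BASE + "/2008/Jan/01/17": [BASE + "/2008/Jan/01/17/58"],
    BASE + "/2008/Jan/01/17/58": [BASE + "/2008/Jan/01/17/58/06"],
}
for url in changed_seconds(BASE, children, datetime(2008, 1, 1, 12, 0, 0)):
    print(url)
```

In this sketch the 2007 collection is never descended into, and the second in which the last crawl happened is deliberately re-checked, since further updates may have been logged in that same second after the crawl.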
http://example.com/lod/updates/2008/Jan/01/17/58/06 would then, for example, contain RDF links (and additional metadata) to the LinkedData documents updated on Jan 1st, 2008 at 17:58:06, e.g. the following triples:
http://example.com/lod/updates/2008/Jan/01/17/58/06/123 r2rul:updatedResource http://example.com/lod/users/JohnDoe .
http://example.com/lod/updates/2008/Jan/01/17/58/06/123 r2rul:updatedAt "2008-01-01T17:58:06"^^xsd:dateTime .
http://example.com/lod/updates/2008/Jan/01/17/58/06/123 r2rul:updatedBy http://example.com/lod/users/JohnDoe .
Individual updates are identified by a sequential number ("123" in the example). Arbitrary metadata can be attached to these updates, such as the time of the update (probably redundant, since it can be inferred from the URL) or the person who performed the update.
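On the publishing side, an endpoint could derive the update URL and its triples directly from the update time and sequence number. The Python sketch below shows one way this might look. The function names are hypothetical, the `r2rul:` and `xsd:` prefixes are assumed to be declared elsewhere, and the output uses Turtle-style angle brackets around URIs, which the examples above omit.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def update_url(base, when, seq):
    """Build the URL of one update entry from its time and sequence number."""
    return (f"{base}/{when.year}/{MONTHS[when.month - 1]}/{when.day:02d}"
            f"/{when.hour:02d}/{when.minute:02d}/{when.second:02d}/{seq}")

def update_triples(base, when, seq, resource, updated_by):
    """Return the three log triples for one update as Turtle-style lines."""
    entry = update_url(base, when, seq)
    return [
        f"<{entry}> r2rul:updatedResource <{resource}> .",
        # isoformat() yields e.g. 2008-01-01T17:58:06 (no timezone is added).
        f'<{entry}> r2rul:updatedAt "{when.isoformat()}"^^xsd:dateTime .',
        f"<{entry}> r2rul:updatedBy <{updated_by}> .",
    ]

john = "http://example.com/lod/users/JohnDoe"
for line in update_triples("http://example.com/lod/updates",
                           datetime(2008, 1, 1, 17, 58, 6), 123, john, john):
    print(line)
```

Because the timestamp is generated from the same value as the URL path, the two cannot drift apart; the timezone question listed under Issues is left open here, as the `datetime` value carries no timezone.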
This mechanism as well as some base update log vocabulary (i.e. the r2rul:updatedResource, r2rul:updatedAt, r2rul:updatedBy properties) could be standardized by this XG.
Issues
- Relevance for this XG
- Timezone of timestamps