Infrastructure/Memento

From W3C Wiki

W3C is currently evaluating the use of the Memento Protocol (RFC-7089), which is a standardized architecture for accessing different versions of a document. Memento is supported by the Internet Archive's Wayback Machine (the largest and oldest web archive), as well as most of the public web archives around the world, including the British Library, UK National Archives, archive.today, and the Icelandic archive; others are in the process of implementing it, such as Perma.cc.

Our primary goal is to support this for our technical specifications, which already follow a convention for including links to the canonical URL, the current date-stamped version, and previous versions. Initially, we intend to expose the specifications in our Technical Reports directory (known as TR documents, after their URL location at w3.org/TR/); later, we would like to expand this to include versions of specifications at other locations, such as Editor's Drafts.

Other goals include supporting Memento for our wikis and for documents stored in our CVS (Concurrent Versions System) repository. The wiki, in particular, should be easy to support because of an existing MediaWiki extension.

The Memento project team has already explored what this might look like for W3C's TR documents, in Memento Guide: Resource Versioning and Memento.

Status

This project is in the exploratory evaluation phase.

Timeline

There is currently no timeline for completion of this project.

Dependencies

This project depends on the W3C Data Platform project.

The Spec Annotation project depends on this project for annotation persistence across spec versions.

Partners

W3C is committed to incrementally exposing some of our data via the W3C Data Platform project. Most pertinent to this project is a data API for retrieving TR document information.

We are working with Los Alamos National Laboratory, a W3C member, where Herbert Van de Sompel runs the Memento project. They have volunteered some of their time to implement an open-source Memento TimeGate and TimeMap.

We are also working with Hypothes.is, also a W3C member, to enable annotations that persist through different versions of a document, with Memento as the mechanism.

Requirements

LANL has indicated willingness to create a generic, open-source Memento service that would use the W3C's document data API to create a TimeMap and TimeGate. This would still require some work from W3C: installing and maintaining the Memento software, and serving the proper Memento HTTP headers from W3C's document servers.

LANL is motivated to do this based on the increased attention and visibility that would come from W3C using the Memento protocol and installing the MediaWiki extension.

These are the requirements for a complete Memento service for TR documents.

Terminology

  • URI-R: the canonical document URL, or Original Resource
  • URI-M: a versioned document URL, i.e. a Memento
  • URI-G: TimeGate
  • URI-T: TimeMap

Data API Requirements

In order for the TimeMap/TimeGate server to work, W3C's document data API needs to provide the following inputs and outputs:

  • query input(s):
    • canonical document URL
    • versioned document URL (optional, but would save a trip by the TimeMap/TimeGate service)
  • query output:
    • all associated versioned document URLs, with the version datetime for each
    • canonical document URL (clearly indicated, such as by the absence of a datetime)
  • query output format: a common machine-parseable format (e.g. JSON, XML)
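As a rough illustration of the query output described above, the following Python sketch parses a hypothetical JSON response. The field names ("canonical", "versions", "url", "datetime") and the example URLs are assumptions made for illustration only; the actual data API format has not been defined.

```python
import json
from datetime import datetime, timezone

# Hypothetical response for one TR document. The field names and URLs are
# illustrative assumptions, not an existing W3C API.
SAMPLE_RESPONSE = """
{
  "canonical": "http://www.w3.org/TR/example-spec/",
  "versions": [
    {"url": "http://www.w3.org/TR/2014/CR-example-spec-20140601/",
     "datetime": "2014-06-01T00:00:00Z"},
    {"url": "http://www.w3.org/TR/2013/WD-example-spec-20130101/",
     "datetime": "2013-01-01T00:00:00Z"}
  ]
}
"""


def parse_versions(payload):
    """Return (canonical_url, [(datetime, versioned_url), ...]), oldest first."""
    data = json.loads(payload)
    versions = []
    for entry in data["versions"]:
        # Version datetimes are assumed to be ISO 8601 timestamps in UTC.
        when = datetime.strptime(entry["datetime"], "%Y-%m-%dT%H:%M:%SZ")
        versions.append((when.replace(tzinfo=timezone.utc), entry["url"]))
    versions.sort()
    return data["canonical"], versions
```

The canonical document URL is kept separate from the versioned URLs here, which is one way of making it "clearly indicated"; omitting the datetime for the canonical entry, as suggested above, would work equally well.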

Server Requirements

Environment

LANL needs to know which technologies (programming language, web server, caching database) W3C is willing to support. Commonly used technologies are very much preferred as that would allow creating a rather generic open source TimeMap/TimeGate tool.

LANL TimeMap/TimeGate Service

The open-source Memento service that LANL proposes to create would perform the following functions:

  • Exposes TimeGates at URLs of the form baseurl/TimeGate/canonical document URL (also supports baseurl/TimeGate/versioned document URL)
  • Exposes TimeMaps at URLs of the form baseurl/timemap/canonical document URL (also supports baseurl/timemap/versioned document URL)
  • Calls the W3C API using the canonical document URL (and, if supported by the API, also using the versioned document URL)
  • Transforms the response into TimeMap format
  • Stores/caches TimeMaps and supports a cache refreshing mechanism (TBD)
  • TimeGate redirects to the temporally appropriate versioned document URL based on the Accept-Datetime header value
  • TimeMap delivers TimeMaps in application/link-format (and possibly other formats via content negotiation)
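The two core functions listed above, datetime negotiation and TimeMap serialization, can be sketched in Python as follows. This is not LANL's implementation: the function names are hypothetical, and the selection policy (the latest version at or before the requested datetime, falling back to the earliest version) is only one plausible reading of "temporally appropriate".

```python
from bisect import bisect_right
from email.utils import format_datetime, parsedate_to_datetime


def select_memento(versions, accept_datetime):
    """Pick the URI-M a TimeGate would redirect to.

    `versions` is a list of (timezone-aware UTC datetime, url) pairs sorted
    oldest first. `accept_datetime` is the Accept-Datetime header value.
    Assumed policy: the latest version at or before the requested datetime,
    or the earliest version if the request predates all of them.
    """
    requested = parsedate_to_datetime(accept_datetime)
    datetimes = [when for when, _ in versions]
    index = bisect_right(datetimes, requested) - 1
    return versions[max(index, 0)][1]


def build_timemap(uri_r, uri_g, uri_t, versions):
    """Serialize a TimeMap in application/link-format."""
    lines = [
        '<%s>; rel="original"' % uri_r,
        '<%s>; rel="timegate"' % uri_g,
        '<%s>; rel="self"; type="application/link-format"' % uri_t,
    ]
    for when, url in versions:
        # Each memento carries its version datetime in HTTP date format.
        stamp = format_datetime(when, usegmt=True)
        lines.append('<%s>; rel="memento"; datetime="%s"' % (url, stamp))
    return ",\n".join(lines)
```

A TimeGate built on this would answer with a redirect whose Location header is the value returned by select_memento.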

HTTP Server

In addition to exposing the TimeMap and TimeGate, W3C must serve the Memento HTTP headers for TR documents.

  1. All canonical document URLs:
    • Must have an HTTP Link header pointing at their associated TimeGate ("timegate" rel type) and TimeMap ("timemap" rel type). In typical implementations:
  2. All versioned document URLs for a canonical document URL:
    • Must have an HTTP Link header pointing at the TimeGate and TimeMap associated with the canonical document URL. See above regarding rel types and URI syntax.
    • Must have a Memento-Datetime header whose value is the version datetime expressed in HTTP datetime format
    • Should (optionally) have an HTTP Link header pointing at the URI-M itself and also at the first, last, prev, and next URI-Ms for the URI-R ("memento" rel type). For each URI-M, these links also carry the version datetime ("datetime" attribute on the link).
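To make the header requirements for a versioned document URL concrete, here is a minimal Python sketch that assembles the Memento-Datetime and Link headers for one URI-M. The function name and arguments are hypothetical. The rel="original" link back to the URI-R is not listed above but is included because RFC 7089 requires it on Memento responses.

```python
from email.utils import format_datetime


def memento_headers(uri_r, uri_g, uri_t, versions, index):
    """Build response headers for the URI-M at versions[index].

    `versions` is a list of (timezone-aware UTC datetime, url) pairs sorted
    oldest first.
    """
    rels = {}  # version index -> relation types in addition to "memento"

    def tag(i, rel):
        if 0 <= i < len(versions):
            rels.setdefault(i, []).append(rel)

    tag(0, "first")
    tag(index - 1, "prev")
    tag(index + 1, "next")
    tag(len(versions) - 1, "last")
    rels.setdefault(index, [])  # the URI-M itself is always linked

    links = [
        '<%s>; rel="original"' % uri_r,
        '<%s>; rel="timegate"' % uri_g,
        '<%s>; rel="timemap"; type="application/link-format"' % uri_t,
    ]
    for i in sorted(rels):
        when, url = versions[i]
        rel = " ".join(rels[i] + ["memento"])
        stamp = format_datetime(when, usegmt=True)
        links.append('<%s>; rel="%s"; datetime="%s"' % (url, rel, stamp))

    return {
        "Memento-Datetime": format_datetime(versions[index][0], usegmt=True),
        "Link": ", ".join(links),
    }
```

When one version plays several roles (for example, it is both the first and the previous version), the relation types are combined into a single link, as in rel="first prev memento".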

TR Document Markup

It's desirable to have parity between the rel attributes of the version links in an HTML document's heading and the rel parameters of the HTTP Link headers, for a number of reasons:

  1. it's consistent, and less prone to human error
  2. it makes it easier to create and maintain a timemap by extracting the metadata from the specs
  3. thus, it helps justify enforcing a policy of including these 'rel' attributes in HTML links

To achieve this, W3C specification heading version links should include rel attributes with the following values:

  • This version: working-copy (RFC-5829); this is a URI-M, or versioned document URL
  • Latest version: canonical (RFC-6596) or latest-version (RFC-5829) (TBD); this is a URI-R, or canonical document URL
  • Previous version: predecessor-version (RFC-5829); this is a URI-M, or versioned document URL

TR documents would not include a link for the successor-version (RFC-5829), because this is not known at the time of publication, and W3C avoids changing documents after publication; this function is loosely served by the Latest version heading link. However, the HTTP headers should include the successor-version rel value.
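If heading links carried these rel values, the version metadata could be extracted from a spec with very little code. The following Python sketch uses the standard library HTML parser. The sample markup and URLs are invented for illustration, and real TR document headings may be structured differently.

```python
from html.parser import HTMLParser

# Relation types proposed above for the heading version links.
VERSION_RELS = {"working-copy", "canonical", "latest-version",
                "predecessor-version"}


class VersionLinkParser(HTMLParser):
    """Collect the href of the first <a> element seen for each version rel."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # The rel attribute is a space-separated list of relation types.
        for rel in (attributes.get("rel") or "").split():
            if rel in VERSION_RELS and href:
                self.links.setdefault(rel, href)


# Hypothetical spec heading, for illustration only.
SAMPLE_HEADING = """
<dl>
  <dt>This version:</dt>
  <dd><a rel="working-copy"
         href="http://www.w3.org/TR/2014/CR-example-spec-20140601/">CR</a></dd>
  <dt>Latest version:</dt>
  <dd><a rel="latest-version"
         href="http://www.w3.org/TR/example-spec/">latest</a></dd>
  <dt>Previous version:</dt>
  <dd><a rel="predecessor-version"
         href="http://www.w3.org/TR/2013/WD-example-spec-20130101/">WD</a></dd>
</dl>
"""


def extract_version_links(html):
    parser = VersionLinkParser()
    parser.feed(html)
    return parser.links
```

Following the predecessor-version link of each extracted document in turn would yield the full version chain needed for a TimeMap.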

Annotator Requirements

An annotation client should be able to:

  • handle targets for both canonical document URLs and versioned document URLs
  • associate annotations with their specific versioned document URL
  • establish whether the document server supports the Memento protocol, or whether there is an appropriate third-party Memento server (such as the Wayback Machine) that stores previous versions of the document
  • negotiate datetimes with the Memento server for retrieving the correct version of the document
  • in the case of a third-party Memento server, request that the service make a snapshot of the document at the time of annotation
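One way a client could establish whether a document server supports Memento is to look for a "timegate" link in the Link header of the document's HTTP response. The Python sketch below parses a Link header value with regular expressions. It is a simplified parser written for illustration, with hypothetical function names, and does not cover every corner of the Link header grammar.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Return a list of (target_url, params_dict) from a Link header value."""
    links = []
    for target, params in LINK_RE.findall(value):
        parsed = {}
        for name, quoted, bare in PARAM_RE.findall(params):
            parsed[name.lower()] = quoted or bare
        links.append((target, parsed))
    return links


def find_timegate(link_header):
    """Return the TimeGate URL advertised in a Link header, or None."""
    for target, params in parse_link_header(link_header):
        # rel may hold several space-separated relation types.
        if "timegate" in params.get("rel", "").lower().split():
            return target
    return None
```

If find_timegate returns None, the client would fall back to a third-party Memento server; otherwise it would send its Accept-Datetime request to the returned URL.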

MediaWiki Options

LANL has two versions of a Memento extension for MediaWiki:

  1. the full, stand-alone version, which serves Memento HTTP headers and also provides a TimeMap and TimeGate based on the MediaWiki version history
  2. an ultra-light version that only serves Memento HTTP headers, with the TimeMap and TimeGate hosted remotely on a third-party Memento Aggregator (operated by LANL)

Potential installation locations:

  • W3C wikis
  • WebPlatform.org