Data preservation

From Data on the Web Best Practices

Preserving Web data

Data preservation is a well-understood and commonly performed task for static, self-contained data. It typically includes the following steps:

  • Ingest the data and assign a persistent identifier to it
  • Ensure the data is correctly stored and prevent bit rot
  • Provide access to the data and perform format translation if needed
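
The first two steps can be sketched as follows. This is a minimal illustration, not a description of any particular archive: the in-memory `archive` dictionary, the `urn:uuid:` identifier scheme and the function names `ingest` and `verify` are all assumptions made for the example. Note that a checksum only detects bit rot; repairing it requires redundant copies, which the sketch does not cover.

```python
import hashlib
import uuid


def ingest(archive: dict, payload: bytes) -> str:
    """Store payload under a new identifier together with its SHA-256 digest."""
    # Hypothetical identifier scheme: a random UUID expressed as a URN.
    identifier = f"urn:uuid:{uuid.uuid4()}"
    archive[identifier] = {
        "payload": payload,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return identifier


def verify(archive: dict, identifier: str) -> bool:
    """Return True if the stored bytes still match the digest recorded at ingest."""
    record = archive[identifier]
    return hashlib.sha256(record["payload"]).hexdigest() == record["sha256"]
```

Running `verify` periodically over every identifier is one simple way to notice silent corruption before the last good copy is lost.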

The model most commonly referred to is the Open Archival Information System (OAIS). Many institutions in charge of digital preservation implement this model or some variant of it. Web pages can be preserved with the same strategies by treating a web site as a static, self-contained data set that can be snapshotted in its entirety and preserved at a fixed point in time.

When it comes to Web data, some new elements have to be taken into account, namely:

  • The persistent identifiers (IRIs) used across the Web refer to live data that can change
  • The meaning of a resource is contextualized by the other resources it is linked to
  • Documents fetched in HTML, RDF or JSON, for instance, are only one of the many possible serializations of the data they represent
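
The last point can be illustrated with a small sketch in which the same statement is written out in two different serializations. The example triple, the simplified N-Triples-style writer (plain literals only) and the JSON layout are assumptions chosen for illustration, not formats prescribed by this document.

```python
import json

# Hypothetical example data: (subject IRI, predicate IRI, literal value).
TRIPLES = [
    ("http://example.org/dataset/1", "http://purl.org/dc/terms/title", "Air quality 2014"),
]


def to_ntriples(triples):
    """Serialize triples as simplified N-Triples-style lines (plain literals only)."""
    lines = []
    for s, p, o in triples:
        # Escape backslashes and double quotes inside the literal.
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


def to_json(triples):
    """Serialize the same triples as a JSON array of objects (ad hoc layout)."""
    return json.dumps(
        [{"subject": s, "predicate": p, "object": o} for s, p, o in triples],
        sort_keys=True,
    )


def from_json(text):
    """Recover the triples from the JSON layout produced by to_json."""
    return [(d["subject"], d["predicate"], d["object"]) for d in json.loads(text)]
```

The two outputs differ byte for byte, yet both carry the same data, which is why an archive has to decide whether it preserves the bytes that were served or the data they represent.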

Looking at the OAIS model, new questions then arise, among which:

  • During the ingest phase, shall all the resources a given resource points to be preserved too?
  • Shall all the serializations of a resource be preserved, or only one?
  • Shall the reasoning process that led to the materialization of the facts be preserved too?
  • How should the IRI of a resource be linked to its preserved historical descriptions?
  • Shall such links be established in the first place?
  • etc.
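
To make the question about linking IRIs to historical descriptions more concrete, here is one possible shape such links could take. It is a sketch under stated assumptions and does not settle any of the questions above: the `SnapshotIndex` class, its method names and the `urn:example:` snapshot identifiers are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class SnapshotIndex:
    """Maps a live IRI to the identifiers of its dated, preserved snapshots."""

    def __init__(self):
        # iri -> list of (timestamp, archived identifier), kept sorted by time
        self._by_iri = {}

    def record(self, iri, when, archived_id):
        """Register that archived_id preserves the state of iri at time `when`."""
        entries = self._by_iri.setdefault(iri, [])
        entries.append((when, archived_id))
        entries.sort()

    def lookup(self, iri, when):
        """Return the latest snapshot taken at or before `when`, or None."""
        entries = self._by_iri.get(iri, [])
        timestamps = [t for t, _ in entries]
        pos = bisect_right(timestamps, when)
        return entries[pos - 1][1] if pos else None


# Usage: two snapshots of the same IRI, queried at a date between them.
index = SnapshotIndex()
iri = "http://example.org/dataset/1"
index.record(iri, datetime(2014, 1, 1, tzinfo=timezone.utc), "urn:example:snap-1")
index.record(iri, datetime(2015, 1, 1, tzinfo=timezone.utc), "urn:example:snap-2")
found = index.lookup(iri, datetime(2014, 6, 1, tzinfo=timezone.utc))
```

Keeping the index separate from the live resource means the IRI itself does not change, at the cost of requiring clients to know where the index lives.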

The research project Preserving Linked Data (PRELIDA) has provided insights into the specific issues related to preserving Web data.

Scope

This document does not aim to answer all of these questions. In particular, bit-level preservation of the data is considered out of scope for this best practices document, namely issues related to preventing and fixing bit rot, handling federated storage of data packets, controlling access rights, etc. The reader is referred to the literature on these issues (Long-Term Archive and Notary Services (ltans), Long-Term Archive Service Requirements, A System for Long-Term Document Preservation).