Preserving Web data
Data preservation is a well-understood and commonly performed task for static and self-contained data. It commonly includes the following steps:
- Ingest the data and assign a persistent identifier to it
- Ensure the data is correctly stored and prevent bit rot
- Provide access to the data and perform format translation if needed
The model most commonly referred to is the Open Archival Information System (OAIS). Many institutions responsible for digital preservation implement this model or some variant of it. Web pages can be preserved following the same strategies, by treating a web site as a static, self-contained data set that can be snapshotted in full and preserved as of a fixed time.
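For static, self-contained data, the steps listed above can be illustrated with a minimal sketch. It is not an OAIS implementation: the `urn:uuid:` identifier scheme, the record layout and the function names are assumptions made for this example only, and a checksum detects bit rot but does not repair it.

```python
import hashlib
import uuid


def ingest(data: bytes) -> dict:
    """Ingest a static payload: assign an identifier and record a checksum."""
    return {
        # Hypothetical identifier scheme; a real archive would mint its own.
        "id": f"urn:uuid:{uuid.uuid4()}",
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_intact(data: bytes, record: dict) -> bool:
    """Fixity check: do the stored bytes still match the checksum taken at ingest?"""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


# Example payload standing in for a snapshotted web page.
payload = b"<html><body>Snapshot of a page</body></html>"
record = ingest(payload)
```

Calling `is_intact(payload, record)` at regular intervals is one simple way to notice that a stored copy has been altered or corrupted since ingest.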
When it comes to Web data some new elements have to be taken into account, namely:
- The persistent identifiers (IRIs) used across the Web refer to live data that can change
- The meaning of a resource is contextualized by the other resources it is linked to
- Documents fetched in HTML, RDF or JSON, for instance, are only one of the many possible serializations of the data they represent
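The last point can be made concrete with a small sketch in which the same set of statements is written out in two different forms. The example data, the simplified N-Triples-style syntax (no escaping, plain string literals only) and the JSON layout are all assumptions chosen for illustration; they are not a complete implementation of any RDF serialization.

```python
import json

# Hypothetical example data: one resource described by two property/value pairs.
triples = {
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/nick", "al"),
}


def to_ntriples_like(ts):
    """One statement per line, in a simplified N-Triples-style syntax."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in sorted(ts))


def from_ntriples_like(text):
    """Parse the simplified syntax produced by to_ntriples_like."""
    out = set()
    for line in text.splitlines():
        s, p, rest = line.split(" ", 2)
        o = rest.rsplit(" ", 1)[0]  # drop the trailing "."
        out.add((s[1:-1], p[1:-1], o[1:-1]))  # strip <...> and "..."
    return out


def to_json(ts):
    """Group values by subject and property in an ad hoc JSON layout."""
    doc = {}
    for s, p, o in sorted(ts):
        doc.setdefault(s, {}).setdefault(p, []).append(o)
    return json.dumps(doc, indent=2)


def from_json(text):
    """Parse the ad hoc JSON layout produced by to_json."""
    return {
        (s, p, o)
        for s, props in json.loads(text).items()
        for p, values in props.items()
        for o in values
    }
```

The two outputs differ byte for byte, yet both parse back to the same set of statements, which is why preserving a single serialization does not necessarily lose information about the data itself.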
Looking at the OAIS model, new questions then arise, among which:
- During the ingest phase, shall all the resources that a given resource points to be preserved too?
- Shall all the serializations of a resource be preserved, or only one?
- Shall the reasoning process that led to the materialization of the facts be preserved too?
- How should the IRI of a resource be linked to its preserved historical descriptions?
- Shall such links be established in the first place?
The research project Preserving Linked Data (PRELIDA) has provided insights into the specific issues related to preserving Web data.
This document does not aim to answer all of these questions. In particular, bit-level preservation of the data is considered out of scope for this best practices document; this covers issues such as preventing and fixing bit rot, handling federated storage of data packets, and controlling access rights. The reader is referred to the literature on these issues (Long-Term Archive and Notary Services (ltans), Long-Term Archive Service Requirements, A System for Long-Term Document Preservation).
Preserving Web Data
Following the Open Archival Information System model, we propose an approach for the preservation of Web data organized around the three aspects of "Ingest", "Manage" and "Access". The data to be preserved is considered to be a document containing descriptions of uniquely identified resources. The resources are identified using IRIs, and each description is a list of property/value pairs serialised in some form.
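The data model described in this paragraph can be sketched as a pair of small data structures. The class names `Resource` and `Document` and the `describe` helper are hypothetical names introduced for this illustration; the text does not prescribe any particular representation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    """A uniquely identified resource and its description."""
    iri: str
    # The description is a list of (property, value) pairs.
    description: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class Document:
    """A document holding the descriptions of several resources."""
    resources: dict[str, Resource] = field(default_factory=dict)

    def describe(self, iri: str, prop: str, value: str) -> None:
        """Add one property/value pair to the description of a resource."""
        resource = self.resources.setdefault(iri, Resource(iri))
        resource.description.append((prop, value))


# Example with made-up IRIs.
doc = Document()
doc.describe("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice")
doc.describe("http://example.org/alice", "http://xmlns.com/foaf/0.1/nick", "al")
doc.describe("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob")
```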
- Data depositors submit a data dump to the archival service. The submission can go through any kind of process including, e.g., SWORD or form-based submission on a web interface. The metadata associated with the dump is used to describe the dataset (license, ingestion date, ...). If not already provided, this metadata is enhanced with a list of the resources described in the dataset. This list of resources is later used for access.
- A persistent identifier is assigned to the ingested dataset. It is recommended that this identifier is assigned as an HTTP IRI.
- The archival institution monitors the data dump to prevent common issues such as bit rot and to ensure the data stays in a readable format. For example, dumps submitted as RDF/XML may have to be converted to newer formats such as Turtle to ensure they stay readable by future versions of libraries. In any case, it is expected that the originally ingested dump will always be kept "as is" and that, when necessary, another version of it will be stored next to it.
- Data consumers get access to the metadata about a preserved dataset through its persistent identifier. When HTTP IRIs are used, de-referencing is one way of accessing this data.
- The list of resources found in the metadata is used to let users assess whether the content of the file is relevant for them. When necessary, the LDP recommendation can be used to provide paginated results for long lists of resources.
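The ingest and access steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the dump is taken to be in an N-Triples-like form with one statement per line, the metadata field names are invented for the example, and the `page` helper shows plain offset-based paging rather than an implementation of the LDP recommendation.

```python
from datetime import date


def build_metadata(dump: str, identifier: str, license_iri: str) -> dict:
    """Describe an ingested dump and list the resources it describes."""
    # Assumption: every statement starts with its subject IRI in angle brackets.
    subjects = sorted(
        {line.split(" ", 1)[0].strip("<>")
         for line in dump.splitlines()
         if line.startswith("<")}
    )
    return {
        "id": identifier,  # persistent identifier, ideally an HTTP IRI
        "license": license_iri,
        "ingested": date.today().isoformat(),
        "resources": subjects,
    }


def page(resources: list, number: int, size: int = 2) -> dict:
    """Return one page of the resource list and the next page number, if any."""
    start = (number - 1) * size
    has_next = start + size < len(resources)
    return {
        "items": resources[start:start + size],
        "next": number + 1 if has_next else None,
    }


# Example dump with made-up IRIs: three resources, four statements.
dump = "\n".join([
    '<http://example.org/a> <http://example.org/p> "1" .',
    '<http://example.org/a> <http://example.org/q> "2" .',
    '<http://example.org/b> <http://example.org/p> "3" .',
    '<http://example.org/c> <http://example.org/p> "4" .',
])
meta = build_metadata(
    dump,
    "http://archive.example.org/dataset/1",  # hypothetical identifier
    "http://creativecommons.org/licenses/by/4.0/",
)
```

A consumer can then walk through `page(meta["resources"], 1)`, `page(meta["resources"], 2)` and so on until `next` is `None`, to decide whether the dataset describes resources of interest before fetching the dump itself.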
Editors and Contributors
- Christophe Guéret