Warning:
This wiki has been archived and is now read-only.

BP Data Preservation

From Data on the Web Best Practices

NOTE : the content of this section is maintained on GitHub http://w3c.github.io/dwbp/bp.html#preservation

Data preservation

Data preservation is a well-understood and commonly performed task for static, self-contained data. It commonly includes the following steps:

  • Ingest the data and assign a persistent identifier to it
  • Ensure the data is correctly stored and prevent bit rot
  • Provide access to the data and perform format translation if needed
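The first two steps above can be sketched in a few lines: assign a persistent identifier at ingestion and record a fixity checksum that later lets the archive detect bit rot. This is a minimal illustration, not a full OAIS workflow; the function names and the `urn:uuid:` identifier scheme are assumptions for the example.

```python
import hashlib
import uuid

def ingest(data: bytes) -> dict:
    """Assign a persistent identifier and record a fixity checksum on ingest."""
    return {
        "id": f"urn:uuid:{uuid.uuid4()}",            # persistent identifier (example scheme)
        "sha256": hashlib.sha256(data).hexdigest(),  # fixity value for later verification
        "size": len(data),
    }

def verify(record: dict, data: bytes) -> bool:
    """Detect bit rot by re-computing the checksum and comparing to the stored value."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]
```

A real repository would also store the checksum alongside provenance metadata and re-verify it on a schedule, not only on access.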

The model most commonly referred to is the Open Archival Information System (OAIS). Many institutions in charge of digital preservation implement this model or some variant of it. Web pages can be preserved following the same strategies, by treating a web site as a self-contained, static data set that can be snapshotted in full and preserved at a fixed point in time.

When it comes to Web data, some new elements have to be taken into account, namely:

  • The persistent identifiers (IRIs) used across the web refer to live data that can change
  • The meaning of a resource is contextualized by the other resources it is linked to
  • Documents fetched in HTML, RDF or JSON, for instance, are only one of the many possible serializations of the data they represent

The following two sections address requirements for digital archives and for depositors of Web data.

Trusted digital repositories

Maintain a list of resources
Archives should maintain a list of resources being described in the preserved datasets
Why
Web data is about the description of resources identified by IRIs. It is to be expected that queries made by data consumers will revolve around finding a preserved description of a particular resource.
What
At ingestion time, a Web data dataset dump is scanned for all the subjects described. For RDF and JSON-LD, this corresponds to all the resources in the "subject" position of statements.
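As a possible approach, the subject scan can be sketched with a naive line-based pass over an N-Triples dump, where the subject is always the first token of each statement. This is a stdlib-only illustration; a production archive would use a proper RDF parser (e.g. rdflib) and would also handle blank-node subjects, which are skipped here.

```python
def list_subjects(nt_dump: str) -> set:
    """Collect the distinct subject IRIs of an N-Triples dump (naive line scan)."""
    subjects = set()
    for line in nt_dump.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        first = line.split(None, 1)[0]  # in N-Triples the subject is the first token
        if first.startswith("<") and first.endswith(">"):
            subjects.add(first[1:-1])   # keep the IRI without angle brackets
    return subjects
```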
Intended outcome
A list of resources per ingested dataset
Possible approach to implementation
.
How to test
.
Evidences


Assess preservation coverage
Archives should assess the preservation coverage of a particular dataset prior to ingestion
Why
A chunk of Web data is by definition dependent on the rest of the global graph. This global context influences the meaning of the descriptions of the resources found in the dataset. Ideally, preserving a particular dataset would involve preserving all of its context, that is, the entire Web of Data.
What
At ingestion time, the linkage of a Web data dataset dump to already preserved resources is evaluated. The presence of all the vocabularies and target resources in use is sought in a set of digital archives in charge of preserving Web data. Datasets for which very few of the vocabularies used and/or resources pointed to are already preserved somewhere should be flagged as being at risk.
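The flagging rule above can be expressed as a simple set computation: given the vocabularies and target resources a dump uses, and those known to be preserved somewhere, compute the covered fraction and flag the dataset when it falls below a threshold. The function name and the 0.5 default threshold are assumptions for illustration; an archive would pick its own policy.

```python
def preservation_coverage(used: set, preserved: set, risk_threshold: float = 0.5):
    """Return (covered fraction, at-risk flag) for a dataset's external dependencies.

    `used` holds the vocabularies and target resources the dump depends on;
    `preserved` holds those already held by some digital archive.
    """
    if not used:
        return 1.0, False  # nothing external to cover, nothing at risk
    covered = len(used & preserved) / len(used)
    return covered, covered < risk_threshold
```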
Intended outcome
An evaluation of the preservation coverage for a given dataset
Possible approach to implementation
How to test
Evidences


Depositors

Provide data using a recognized serialisation format
Data depositors should provide their data dump in a W3C standard serialisation
Why
What
Intended outcome
Possible approach to implementation
How to test
Evidences
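As one possible approach to implementation, a depositor (or an archive at ingestion) could check a dump's file extension against the media types of W3C standard RDF serialisations before deposit. The mapping below is an illustrative subset and the function name is hypothetical; a stricter check would sniff the content rather than trust the extension.

```python
# Media types of some W3C standard RDF serialisations (illustrative subset)
W3C_RDF_FORMATS = {
    ".ttl": "text/turtle",
    ".nt": "application/n-triples",
    ".nq": "application/n-quads",
    ".trig": "application/trig",
    ".rdf": "application/rdf+xml",
    ".jsonld": "application/ld+json",
}

def check_dump_format(filename: str):
    """Return the media type if the dump uses a recognised W3C serialisation, else None."""
    for ext, media_type in W3C_RDF_FORMATS.items():
        if filename.lower().endswith(ext):
            return media_type
    return None
```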