Best Practices Discussion Stability
From Government Linked Data (GLD) Working Group Wiki
I have to expand and add this somewhere -> Publish only URLs under your control, or adopt an approach such as PURLs from the outset of your effort.
The length of time information remains available is inherently connected to the value placed upon it. Value is determined by a cost-benefit relationship: the benefit derived from information is reduced by the cost(s) associated with using it.
Increasing stability requires adopting a strategy for allocating limited resources toward a goal. Adhere to selection criteria for what is best preserved. There are three possible goals:
- Conservation - From a historical perspective one could seek to preserve all information in the format and modality in which it was originally conveyed. The most demanding form is conservation of the full look and feel of the publication.
- Preservation of access - It might be important to have information available immediately at all times.
- Preservation of content - It might be important to have raw data available for analysis ad infinitum. In this case the overall objective is to preserve only the scientific content.
While it might be argued that RDF, as a machine-consumable form of metadata, does not have a fluctuating cost, the following discussion details the properties that affect cost, and therefore affect value, no matter which goal is sought.
- Complexity - All serializations are equal to a back-end system; providers should therefore serialize RDF in either Turtle (to minimize disk usage) or N-Triples (to increase integrity and manageability).
Consistency - Design and support should recognize that change is necessary and will happen.
- Data Consistency - As new data is produced, old data becomes legacy. Consumers of data will write programs to automate processing of legacy data, and the number of changes in format directly affects the cost incurred by processors. Carefully consider whether a change is worth the cost it imposes of modifying ingestion. Switching between serializations has a cost to consumers because they need to anticipate and provide for the change. When possible, publishers should convert all legacy data themselves, so that the data they provide is entirely consistent and each consumer does not need to perform exactly the same task.
- Contact Consistency - Any support contact information should be published under a data steward role rather than an individual, so that a transition of responsibility does not introduce inconsistency for consumers.
Organization - The minimum metadata for each data offering should include:
- Serialization type/format
- Date of Publication
- Steward contact email
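The three minimum metadata items above could be recorded in a small machine-readable file shipped next to each data offering. The following is only a sketch: the class name, the field names, and the choice of JSON are assumptions made for illustration, not something this page prescribes.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetMetadata:
    """Minimum metadata for one data offering (field names are hypothetical)."""

    serialization: str    # serialization type/format, e.g. "N-Triples" or "Turtle"
    date_published: date  # date of publication
    steward_email: str    # steward contact, ideally a role address

    def to_json(self) -> str:
        """Render the record as JSON with the date in ISO 8601 form."""
        record = asdict(self)
        record["date_published"] = self.date_published.isoformat()
        return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Example values only; the address below is a placeholder.
    meta = DatasetMetadata("N-Triples", date(2012, 1, 12), "data-steward@example.org")
    print(meta.to_json())
```

Using a role address for the steward field keeps the record valid when responsibility for the dataset changes hands.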
- Diskspace Resource - Different serializations represent the same semantics but require different numbers of characters (disk space). While Turtle provides the most concise serialization and is arguably the easiest for humans to read, it does not provide the integrity that N-Triples does. Because N-Triples places each triple on its own line, datasets can be split by size or line count without affecting the integrity of the dataset. In general N-Triples will provide the greatest overall stability for LOD. Compress data using either GZIP or ZIP; do not adopt other compression approaches just because they are "free". Choose the maximum compression level.
- Integrity - Provide checksums of downloads so that consumers can be assured that they have received the entire dataset.
- Discrete - It is best to have a greater number of small files rather than fewer larger files. Files should consist of meaningful discrete units such as a time period or locality.
- File names - Files should be meaningfully named, without non-printable or diacritic characters.
- Archive structure - Data archives should be nested in at least a single directory. The directory name should be unique, so that multiple archives can be uncompressed in the same location without renaming the directory.
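Several of the recommendations above (splitting N-Triples by line count, GZIP at maximum compression, and checksums for downloads) can be combined in one short script. What follows is a minimal sketch using only the Python standard library. The function names, the part-file naming scheme, and the choice of SHA-256 are assumptions for illustration; the page does not name a specific checksum algorithm.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path


def split_ntriples(source: Path, out_dir: Path, lines_per_part: int) -> list[Path]:
    """Split an N-Triples file into gzip-compressed parts of at most
    lines_per_part lines each. Splitting only at line boundaries keeps
    every triple intact, because N-Triples is a line-based format."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    buffer: list[str] = []

    def flush() -> None:
        # Hypothetical naming scheme: <stem>-part0001.nt.gz, <stem>-part0002.nt.gz, ...
        part = out_dir / f"{source.stem}-part{len(parts) + 1:04d}.nt.gz"
        # compresslevel=9 is the maximum compression gzip offers.
        with gzip.open(part, "wt", encoding="utf-8", compresslevel=9) as handle:
            handle.writelines(buffer)
        parts.append(part)
        buffer.clear()

    with source.open(encoding="utf-8") as handle:
        for line in handle:
            buffer.append(line)
            if len(buffer) == lines_per_part:
                flush()
    if buffer:  # write the final, possibly shorter, part
        flush()
    return parts


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Demonstration with made-up triples in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "example.nt"
        src.write_text(
            "".join(
                f'<http://example.org/s{i}> <http://example.org/p> "v{i}" .\n'
                for i in range(5)
            ),
            encoding="utf-8",
        )
        for part in split_ntriples(src, Path(tmp) / "example-2012-01", 2):
            # One "digest  filename" line per part, suitable for a checksum file.
            print(f"{sha256_of(part)}  {part.name}")
```

Publishing the resulting digest lines alongside the parts lets a consumer confirm that each download is complete. The same line-based split would not be safe for Turtle, where prefixes and multi-line statements span line boundaries.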
Ronald P. Reck - 2012-01-12