Best Practices Discussion Stability

From Government Linked Data (GLD) Working Group Wiki
Revision as of 15:14, 5 January 2012 by Awashing (Talk | contribs)


Some ideas to incorporate into our section:


Increasing stability requires adopting a strategy to allocate limited resources toward achieving any of the following goals:

Conservation - From a historical perspective, one could seek to preserve all information in the format and modality in which it was originally conveyed. The most demanding form is conservation of the full look and feel of the publication.

Preservation of access - It might be important to have information available immediately at all times.

Preservation of content - It might be important to have raw data available for analysis ad infinitum. This means the overall objective is to preserve only the scientific content.

Simplicity, organization and consistency increase longevity. Provide information in discrete, time-based units. Use only printable characters in file names. More files in smaller units are better. Publish only URLs under your control, or adopt an approach such as PURLs from the outset of your effort. Adopt and adhere to standards-based representations where applicable; simple text is best.
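As a rough illustration of the file-naming advice, the sketch below builds one file name per month of data and rejects names that contain anything outside a conservative character set. The whitelist, the monthly granularity and the function names are assumptions made for this example; the text above asks only for printable characters and time-based units.

```python
import string
from datetime import date

# Assumption: a conservative whitelist. The guidance above only says
# "printable characters"; spaces are excluded here because they often
# cause trouble in URLs and shell scripts.
SAFE_CHARS = set(string.ascii_letters + string.digits + "._-")


def is_safe_filename(name: str) -> bool:
    """Return True if name is non-empty and uses only whitelisted characters."""
    return bool(name) and all(ch in SAFE_CHARS for ch in name)


def monthly_filename(prefix: str, day: date, extension: str) -> str:
    """Build a name such as 'spending-2011-12.nt' for one month of data."""
    name = f"{prefix}-{day.year:04d}-{day.month:02d}.{extension}"
    if not is_safe_filename(name):
        raise ValueError(f"unsafe file name: {name!r}")
    return name
```

Calling `monthly_filename("spending", date(2011, 12, 15), "nt")` yields `spending-2011-12.nt`, so each publication period maps to one small, predictably named file.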

Adhere to selection criteria for what is best preserved, as resources are limited. Provide documentation of format, version and date of publication. Email addresses supplied as a contact should not be personalized, so that stewardship may be reassigned to new individuals without disruption. To compress information, use ZIP or GZIP only. Provide a checksum of your downloads.
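One possible way to follow the compression and checksum advice is sketched below using only the Python standard library. The choice of SHA-256, the `.sha256` file suffix and the function names are assumptions for illustration; the text above does not name a checksum algorithm.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a GZIP copy of source next to it and return the new path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(source: Path) -> Path:
    """Compress source and write a checksum file beside the archive."""
    archive = gzip_file(source)
    checksum_file = archive.with_name(archive.name + ".sha256")
    # Assumed layout: "<digest>  <file name>", one line per file.
    checksum_file.write_text(f"{sha256_of(archive)}  {archive.name}\n")
    return checksum_file
```

Publishing the checksum file alongside the archive lets a consumer recompute the digest after download and compare the two values.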

Adopt an approach that will not require changes over time.


New version:

The length of time information will be available is inherently connected to the value placed upon it. Value is determined by a cost-benefit relationship: the benefit anyone derives from information is reduced by the cost associated with using it. While it might be argued that RDF, as a machine-consumable form of metadata, does not have a fluctuating cost, the following discussion details the properties that affect cost and therefore affect value.

  • Complexity - All serializations are equal to a back-end system; therefore serialize RDF in either Turtle (to minimize disk expenditure) or N-Triples (to increase integrity and manageability).
  • Consistency - As new data is produced, old data becomes legacy. Consumers of data will write programs to automate processing of legacy data, and the number of changes in format directly affects the cost incurred by processors. Carefully consider whether making a change is worth the incurred cost of modifying ingestion. Changing between serializations has a cost to consumers because they need to anticipate and provide for the change. If possible, publishers should modify legacy data themselves so that the data they provide is entirely consistent and each consumer does not need to perform exactly the same task. Any support contact information should be published under a data steward role rather than an individual's name, so that a transition of responsibility does not introduce inconsistency for consumers.
  • Organization - At a minimum, provide documentation of:
    1. Serialization type
    2. Date of Publication
    3. Version
    4. Steward contact email
  • Disk Space Resource - Turtle provides the most concise serialization and is arguably the easiest for humans to read. It does not, however, provide the integrity that N-Triples does.
  • Integrity - Provide checksums of downloads so that consumers can know that they have received the entire dataset.
  • Manageability -
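The Organization and Integrity points above could be combined in a small machine-readable manifest published next to each download. The following is a minimal sketch under stated assumptions: the JSON layout, the field names, the use of SHA-256 and the example steward address are all hypothetical and not prescribed by this discussion.

```python
import hashlib
import json


def build_manifest(data: bytes, serialization: str, published: str,
                   version: str, steward_email: str) -> str:
    """Return a JSON manifest describing one dataset download."""
    manifest = {
        "serialization": serialization,        # e.g. "N-Triples" or "Turtle"
        "date_of_publication": published,      # ISO 8601 date string
        "version": version,
        "steward_contact": steward_email,      # role address, not a person
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


def verify_download(data: bytes, manifest_json: str) -> bool:
    """Return True if the received bytes match the published checksum."""
    expected = json.loads(manifest_json)["sha256"]
    return hashlib.sha256(data).hexdigest() == expected


# Example usage with a one-triple dataset (hypothetical values).
triples = b'<http://example.org/s> <http://example.org/p> "o" .\n'
manifest = build_manifest(triples, "N-Triples", "2011-12-15", "1.0",
                          "data-steward@example.org")
```

A consumer who receives a truncated or corrupted file would see `verify_download` return `False`, which is the signal that the entire dataset did not arrive.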

Ronald P. Reck - 2011-12-15


Readings

  • Smith, Abby. 1998. "Preservation in the Future Tense." CLIR Issues 3 (May/June). Washington, D.C.: Council on Library & Information Resources.

Anne Washington

Back to main page: Best_Practices_Discussion_Summary#2.2.5_Stability