Best Practices for Stability

From Government Linked Data (GLD) Working Group Wiki
Revision as of 23:44, 24 January 2012 by Awashing (Talk | contribs)


Best Practices: Stability



  • Dec 2011 - Initial revisions by Ron, AnneW
  • Anne Washington 16:47, 13 January 2012 (UTC), revisions
  • Ronald Reck 16:47, 13 January 2012 (UTC), revisions


The group will specify how to publish data so that others can rely on it being available in perpetuity, persistently archived if necessary.


This definition describes stability of LOD:

Stability - noun
Stable LOD is persistent, predictable and machine accessible from externally visible locations.
  • Persistent = Information is machine accessible for long periods of time.
  • Predictable = Names and information follow a logical format.
  • Stable location = Externally visible locations are consistent.
  • Other things that impact stability
    • legacy = earlier naming schemes, formats, data storage devices
    • steward = people who are committed to consistently maintain specific datasets, either individuals or roles in organizations
    • provenance = the sources that establish a context for the production and/or use of an artifact. See the W3C Provenance Working Group.


The goals influence the value of the data.

We believe that preservation of content is the main goal for stability; however, there are others.

The length of time information is available is inherently connected to the value placed upon it. Value is determined by a cost-benefit relationship: the benefit derived from information is reduced by the cost(s) associated with using it.

Increasing stability requires adopting a strategy for allocating limited resources toward a goal. Adhere to selection criteria for what is best preserved. There are three possible goals:

  • 1. Preservation of content - It might be important to have raw data available for analysis ad infinitum. This means the overall objective is to preserve only the scientific content.
  • 2. Preservation of access - It might be important to have information available immediately at all times.
  • 3. Conservation - From a historical perspective one could seek to preserve all information in the format and modality in which it was originally conveyed. The most demanding is conservation of the full look and feel of the publication.


ORGANIZATIONAL CONSIDERATIONS

Without internal stability from the data stewards, external technology stability is a challenge. These are some organizational characteristics that support stable data.

  • Consistent human skills
  • Consistent infrastructure
  • Data related to organizational values or business needs
  • An internal champion or a consistent business process
  • Internal politics over naming variations do not impact external locations

Mark metadata based on its intended audience

  • Internal-audience : management of the process
  • External-audience : final state, or no-update needed.


These are a few representative samples to generate discussion and comment. Additional suggestions are encouraged.

These examples were discussed on the public-gld email listserv

Technical examples

What existing examples can we point to? (Need international ones...)

  1. Internet Archive (http://www.archive.org)

Institutional examples

Who has the incentive to provide stable persistent data? Some real possibilities and some metaphors for discussion.

  1. Archives
    1. Third party entities that document provenance and provide access
  2. Estate Lawyer
    1. Someone responsible for tracking down heirs for an inheritance
  3. Private Foundation
    1. A philanthropic entity that is interested in the value proposition of stability and acts as an archive
  4. Government
    1. A government organization which has the funds to steward others' data
  5. Internet organization
    1. A global open organization like W3C or ICANN


These are characteristics of stability. These properties will influence data cost and therefore data value.

  • Integrity - Provide checksums of downloads so that consumers can be assured that they have received the entire dataset. Data that is unreliable should not be used for critical decisions and is therefore of less value than data that is deemed correct.
  • Consistency - The design of a data format should recognize that change is necessary and will happen.
    • Data Consistency - As new data is produced, old data becomes legacy. Consumers of data will write programs to automate processing of legacy data, and the number of changes in format directly affects the cost incurred by processors. Carefully consider whether making a change is worth the incurred cost of modifying ingestion. Changing between different serializations has a cost to consumers because they need to anticipate and provide for the change. When possible, publishers should modify all legacy data themselves so that the data they provide is entirely consistent and each consumer does not need to perform exactly the same task.
    • Contact Consistency - Any support contact information should be published under a data steward (for example, a role in the organization) so that a transition of responsibility does not introduce inconsistency for consumers.
  • Manageability -
    • Discrete - It is best to have a greater number of small files rather than fewer larger files. Files should be composed of meaningful discrete units such as a time period or locality.
    • File names - Files should be meaningfully named, without non-printable or diacritic characters.
    • Archive structure - Data archives should be nested in at least a single directory. The directory name should be unique so that multiple archives can be uncompressed side by side without having to rename the directory.
    • Organization - The minimum metadata for each data offering should include:
      1. Serialization type/format
      2. Date of Publication
      3. Version
      4. Steward contact email
    • Complexity - All serializations are equal to a back-end system; therefore providers should serialize RDF in either Turtle (to minimize disk expenditure) or N-Triples (to increase integrity and manageability).
    • Diskspace Resource - Different serializations represent the same semantics but require varying numbers of characters (diskspace). While Turtle provides the most concise serialization and is arguably the easiest for humans to read, it does not provide the integrity that N-Triples does. N-Triples allows datasets to be split up based on size or line count without affecting the integrity of the dataset. In general, N-Triples will provide the greatest overall stability for LOD. Compress data using either GZIP or ZIP; do not adopt other compression approaches just because they are "free". Choose the maximum compression level.
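
Several of the characteristics above (discrete files, a uniquely named directory, checksums, minimum metadata, maximum GZIP compression) can be combined in one publishing step. The following is a minimal sketch in Python, not a prescribed tool. The function names, the part-file naming pattern, the `manifest.json` file and its field names are all assumptions made for illustration. It relies on N-Triples holding one complete triple per line.

```python
import gzip
import hashlib
import json
from pathlib import Path


def split_ntriples(source: Path, out_dir: Path, lines_per_file: int) -> list[Path]:
    """Split an N-Triples file into gzip-compressed parts by line count.

    Each line of N-Triples is one complete triple, so cutting on line
    boundaries leaves every part a valid N-Triples file on its own.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    buffer: list[str] = []

    def flush() -> None:
        if not buffer:
            return
        # Hypothetical naming pattern: <source stem>-partNNNN.nt.gz
        part = out_dir / f"{source.stem}-part{len(parts) + 1:04d}.nt.gz"
        # compresslevel=9 is the maximum gzip compression level.
        with gzip.open(part, "wt", encoding="utf-8", compresslevel=9) as handle:
            handle.writelines(buffer)
        parts.append(part)
        buffer.clear()

    with source.open(encoding="utf-8") as handle:
        for line in handle:
            buffer.append(line)
            if len(buffer) >= lines_per_file:
                flush()
    flush()  # write any remaining lines as the final part
    return parts


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def write_manifest(parts: list[Path], out_dir: Path, *, published: str,
                   version: str, steward_email: str) -> Path:
    """Write the four minimum metadata items plus one checksum per file."""
    manifest = {
        "format": "N-Triples (gzip)",            # serialization type/format
        "date_of_publication": published,
        "version": version,
        "steward_contact_email": steward_email,
        "files": [{"name": p.name, "sha256": sha256_of(p)} for p in parts],
    }
    target = out_dir / "manifest.json"           # assumed file name
    target.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return target
```

A publisher might call `split_ntriples(Path("dataset.nt"), Path("dataset-2012-01-24"), 100000)` and then `write_manifest(...)` on the result, with a role-based steward address. Because the output directory name carries the dataset name and date, several archives can be unpacked side by side. A consumer can recompute each SHA-256 digest and compare it with the manifest to confirm that a download is complete.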


    Ways that this best practice is connected to others.


    1. PURLs (Persistent Uniform Resource Locators) purl.oclc.org
    2. Handle System http://www.handle.net/ and its commercial cousin Digital Object Identifier [1]


    For further reading.

    • Aging content on the web: Issues, Implications, and Potential Research Opportunities. (2009) Brent Furneaux, Timothy R. Hill, Wayne Smith, Shailaja Venkatsubramanyan, Jingguo Wang, Anne Washington, and Paul Witman. Communications of the Association for Information Systems. ISSN: 1529-3181. Volume 24, Number 1. http://aisel.aisnet.org/cais/vol24/iss1/8 available for download as a PDF

    Joint page Best_Practices_Discussion_Stability currently holds an example template for a best practices section.