Best Practices for Stability

From Government Linked Data (GLD) Working Group Wiki

Best Practices: Stability

Status

  • Dec 2011 - Initial revisions by Ron, AnneW
  • Anne Washington 16:47, 13 January 2012 (UTC), revisions
  • Ronald Reck Sun Feb 12 14:22:33 EST 2012, revisions

Overview

This section focuses on how to publish data so that others can rely on it being available in perpetuity, persistently archived if necessary.

Definition

The scope, limits and explanation of stability

This definition describes the stability of Linked Open Data (LOD).

Stability -
Stable LOD is persistent, predictable and machine accessible from externally visible locations.
  • Persistent = Information accessible for an unbounded period of time.
  • Predictable = Names and information follow a logical format.
  • Stable location = Externally visible locations are consistent in name and availability.
  • Other things that impact stability
    • legacy = earlier naming schemes, formats, data storage devices
    • steward = people who are committed to consistently maintain specific datasets, either individuals or roles in organizations
    • provenance = the sources that establish a context for the production and/or use of an artifact. See the W3C Provenance Working Group.

Goals

The purpose of having a best practice for stability

The length of time information remains available is inherently connected to the value placed upon it. If information is deemed valuable, it is likely to persist for a longer period of time. Value, which can change over time, is always determined by a cost-benefit relationship: any benefit derived from information is reduced by the cost(s) associated with using it. Increasing stability requires adopting a strategy for allocating limited resources toward a goal. Goals drive the criteria data providers use to select what is best preserved.

We believe that preservation of content is the main goal for stability. Possible goals include:

  • 1. Preservation of content - It might be important to have raw data available for analysis ad infinitum. This means the overall objective is to preserve only the scientific content.
  • 2. Preservation of access - It might be important to have information available immediately at all times.
  • 3. Conservation - From a historical perspective one could seek to preserve all information in the format and modality in which it was originally conveyed. This is the most demanding goal: conservation of the full look and feel of the publication.

Success Factors

ORGANIZATIONAL CONSIDERATIONS: Without internal stability from the data stewards, any external technology stability is a challenge. The following are some organizational characteristics that support stable data.

  • Consistent human skills
  • Consistent infrastructure
  • Data related to organizational values or business needs
  • Internal champion or consistent business process
  • Internal politics over name variations do not affect external locations


Mark metadata based on its intended audience

  • Internal-audience : management of the process
  • External-audience : final state, or no-update needed.

Examples

These are a few representative samples to generate discussion and comment. Additional suggestions are encouraged.

These examples were discussed on the public-gld email listserv


Technical examples: What existing examples can we point to? (Need international ones...)

  1. Internet Archive (http://www.archive.org)

Institutional examples: Who has the incentive to provide stable persistent data? Some real possibilities and some metaphors for discussion.

  1. Archives
    1. Third party entities that document provenance and provide access
  2. Estate Lawyer
    1. Someone responsible for tracking down heirs for an inheritance
  3. Private Foundation
    1. A philanthropic entity that is interested in the value proposition of stability and acts as an archive
  4. Government
    1. A government organization which has the funds to steward others' data
  5. Internet organization
    1. A global open organization like W3C or ICANN

Properties

These are characteristics that influence the stability or longevity. Many of these properties are not unique to LOD, yet they influence data cost and therefore data value.

  • Integrity - Provide checksums of downloads so that consumers can be assured that they have received the entire dataset. Data that is unreliable should not be used for critical decisions and is therefore of less value than data that is deemed complete. Possible checksum types include MD5 and SHA.
  • Consistency - Any design of a data format should recognize that change is necessary and will happen. Recognizing that change is inevitable, while providing a mechanism for embracing modification, increases continuity and longevity.

    The following types of changes can be anticipated. Therefore, data design should be made to accommodate them:

    1. The person who published the data changes jobs - For Contact Consistency - Any support contact information should be published as that of a data steward, rather than of the person who happened to publish the data, so that the inherent transition of responsibility does not introduce inconsistency for consumers.
    2. Departments, Agencies and Governments are reorganized - For File Naming and Data Consistency - Discourage the use of the originating source as a component in the name of the data file, or in the URIs it contains. The information can appropriately be contained within the file as metadata.
    3. IT infrastructure overhaul - For File Naming Consistency - Discourage the use of the server or system as a component in the name of the data file.
    4. Merger/acquisition - For Data Consistency - Discourage the use of branding, as it inherently and needlessly increases cost for new owners while providing no value at all to consumers.
    5. Primary stakeholder loses interest in the data - As above, for Data Consistency - Discourage the use of branding, as it inherently increases cost for new owners.
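The naming guidance above could be checked mechanically before a file is published. The following is a minimal sketch, not a prescribed tool: the function name and the lists of organization and server terms are hypothetical, and a real publisher would supply its own.

```python
# Hypothetical examples of terms a publisher would not want embedded in
# file names: originating organizations or brands, and servers or systems.
ORGANIZATION_TERMS = {"dept-of-example", "acme"}
SERVER_TERMS = {"srv01", "prod-db"}


def unstable_name_parts(file_name: str, discouraged: set[str]) -> list[str]:
    """Return the discouraged terms that appear in file_name, sorted.

    The match is a case-insensitive substring test, which is deliberately
    simple; it is an assumption of this sketch, not a rule from the text.
    """
    lowered = file_name.lower()
    return sorted(term for term in discouraged if term in lowered)
```

An empty result suggests the name does not tie the data to a particular organization or system. For example, `unstable_name_parts("budget-2011-v2.nt", ORGANIZATION_TERMS | SERVER_TERMS)` finds nothing to flag.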

    • Data Repository Consistency - As new data is produced, old data becomes legacy data. Consumers of data will write programs to automate processing of legacy data, and the number of changes in format directly affects the cost incurred by data consumers. Data providers should carefully consider whether the benefit of a change exceeds the cost of modifying ingestion procedures. Even changing between different serializations has a cost to consumers, as they need to anticipate and provide for the change. Data providers should consider lifecycle workflow and, whenever possible, modify legacy data themselves so that all provided data is consistent and each consumer is not required to perform exactly the same data conversion task to create a homogeneous data repository.
    • Manageability -
      • Discrete - It is best to have a greater number of small files rather than fewer larger files. Smaller files reduce the cost to consumers. Files should consist of meaningful discrete units based on a time period, locality or other logical unit.
      • File names - Files should be meaningfully named without using non-printable characters.
      • Archive structure - Data archives should be nested in at least a single directory. The directory name should be unique so that multiple archives can be uncompressed in the same location without introducing collisions.
    • Organization - The minimum metadata accompanying each data offering should include:
      1. Serialization type (such as NTriples or RDF/XML)
      2. Publisher
      3. Creation Date
      4. Modification Date
      5. Version
      6. Email address for data steward
    • Complexity - All serializations are equal to a back-end system; therefore providers should serialize RDF in one of the following:
      • Turtle - The Turtle serialization minimizes the disk space expenditure while also increasing human readability.
      • NTriples - The NTriples serialization increases integrity in that re-ordering will have no effect on semantics, and damaged lines only affect the assertions on those lines. NTriples also increases flexibility because files can be split into smaller files as long as the division happens at the end of a line.
    • Diskspace Resource - Different serializations represent the same semantics but require varying numbers of characters (disk space). Turtle provides the most concise serialization and is arguably the easiest for humans to read, but it does not provide the integrity that NTriples does, because NTriples can be reordered or split up based on size or line count without affecting the integrity of the dataset. In general NTriples will provide the greatest overall stability for LOD. Compression of data should be done using either GZIP or ZIP; do not adopt other compression approaches just because they are "free". The maximum level of data compression should be chosen.
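Several of the properties above (discrete files, line-based splitting of NTriples, GZIP at maximum compression, and checksums) can be combined in one publishing step. The following is a minimal sketch under stated assumptions: the function names and the part-naming scheme are hypothetical, and SHA-256 is chosen here as one member of the SHA family mentioned under Integrity.

```python
import gzip
import hashlib
from pathlib import Path


def split_ntriples(source: Path, out_dir: Path, lines_per_part: int) -> list[Path]:
    """Split an NTriples file into gzip-compressed parts.

    Cuts happen only at line ends, so every statement stays intact.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    buffer: list[bytes] = []

    def flush() -> None:
        if not buffer:
            return
        # Hypothetical naming scheme: <stem>-partNNNN.nt.gz
        part = out_dir / f"{source.stem}-part{len(parts) + 1:04d}.nt.gz"
        # compresslevel=9 is the maximum gzip compression level.
        with gzip.open(part, "wb", compresslevel=9) as handle:
            handle.writelines(buffer)
        parts.append(part)
        buffer.clear()

    # Binary mode keeps the bytes of each line exactly as published.
    with source.open("rb") as handle:
        for line in handle:
            buffer.append(line)
            if len(buffer) >= lines_per_part:
                flush()
    flush()  # write any remaining lines
    return parts


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A provider could publish the value of `sha256_of(part)` next to each part, so that consumers can confirm they received the complete file. Concatenating the decompressed parts in order reproduces the original file.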

    Interconnections

    Ways that this best practice is connected to others.


    STABILITY, URLs, and URIs: The identifiers used in LOD are a possible point of failure; therefore use URIs that dereference under DNS that you control or that have the greatest likelihood of persisting. Using URIs according to the best practices stated elsewhere in this document increases value. Other strategies for maximizing the longevity of URIs include:

    1. PURLs (Persistent Uniform Resource Locators) purl.oclc.org
    2. Handle System http://www.handle.net/ and its commercial cousin Digital Object Identifier [1]
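The advice to use URIs under DNS that you control can be checked before publication. The following is a minimal sketch, assuming the publisher keeps a list of domains it controls or trusts to persist; the function name and the example domains in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def uris_outside_controlled_domains(uris, controlled_domains):
    """Return the URIs whose host is not a controlled domain or a subdomain of one."""
    outside = []
    for uri in uris:
        # hostname is None for URIs without a network location.
        host = (urlsplit(uri).hostname or "").lower()
        controlled = any(
            host == domain or host.endswith("." + domain)
            for domain in controlled_domains
        )
        if not controlled:
            outside.append(uri)
    return outside
```

For instance, with `controlled_domains = {"example.gov"}`, a URI such as `http://data.example.gov/id/agency/1` passes, while a URI on any other host is returned for review. The check does not dereference anything; it only inspects host names.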


    Vocabulary Choices Affect Value: When LOD uses or references vocabularies or vocabulary items, it is a point of frailty, which can therefore affect cost. Using vocabularies according to the best practices stated elsewhere in this document increases value.

    References

    For further reading.

    • Aging content on the web: Issues, Implications, and Potential Research Opportunities. (2009) Brent Furneaux, Timothy R. Hill, Wayne Smith, Shailaja Venkatsubramanyan, Jingguo Wang, Anne Washington, and Paul Witman. Communications of the Association for Information Systems. ISSN: 1529-3181. Volume 24, Number 1. http://aisel.aisnet.org/cais/vol24/iss1/8 available for download as a PDF


    Joint page Best_Practices_Discussion_Stability currently holds an example template for a best practices section.