Best Practices Discussion Summary

From Government Linked Data (GLD) Working Group Wiki
Revision as of 16:19, 13 January 2012 by Awashing


Status

2011 Dec - Best Practices draft available on wiki.

29-30 June, 2011 - Preliminary discussion took place at the W3C Government Linked Data Working Group First F2F.

Best Practices Task Force

Note: The group will produce one or more Recommendations which address the following issues:

  Editors of Draft: Michael Hausenblas, Bernadette Hyland, Boris Villazón-Terrazas
Procurement
  • George Thomas (Health & Human Services, US)
  • Mike Pendleton (Environmental Protection Agency, US)
  • John Sheridan (OPSI, UK)
Vocabulary Selection
  • George Thomas (Health & Human Services, US)
  • Michael Hausenblas (DERI)
  • Ghislain (INSTITUT TELECOM)
  • Boris Villazón-Terrazas (UPM)
  • John Erickson (RPI)
  • Biplav Srivastava (IBM)
URI Construction
  • Ghislain (INSTITUT TELECOM)
  • Boris Villazón-Terrazas (UPM)
  • Daniel Vila (UPM)
  • John Erickson (RPI)
  • Martin Alvarez (CTIC)
  • Cory Casanove (OMG)
Linked Data Cookbook
  • Bernadette Hyland (3 Round Stones)
  • Sarven Capadisli (DERI)
Legacy Data
  • Biplav Srivastava (IBM)
Versioning
  • John Erickson (RPI)
  • Ghislain (INSTITUT TELECOM)
  • Hadley Beeman (versioning as related to the Data Cube)
  • Cory Casanove (OMG)
Stability
  • Anne Washington (GMU)
  • Ron Reck
Provenance Liaison
  • John Erickson (RPI)

Overview

Linked Data approaches address key requirements of open government by providing a family of international standards for the publication, dissemination and reuse of structured data. Further, Linked Data, unlike previous data formatting and publication approaches, provides a simple mechanism for combining data from multiple sources across the Web.

In an era of reduced local, state and federal budgets, there is strong economic motivation to reduce waste and duplication in data management and integration. Linked Open Data is a viable approach to publishing governmental data to the public, but only if it adheres to some basic principles.


Purpose of Best Practices Recommendation(s)

The following are some motivations for the need for publishing Recommendation(s) and Working Notes, identified in the GLD WG Charter.

  1. The overarching objective is to provide best practices and guidance for creating high-quality, re-usable Linked Open Data (LOD).

More specifically, the best practices are aimed at assisting government departments/agencies/bureaus, and their contractors, vendors and researchers, in publishing high-quality, consistent data sets using W3C standards to increase interoperability.

Best practices are intended to be a methodical approach for the creation, publication and dissemination of governmental Linked Data. Best practices from the GLD WG shall include:

  1. Description of the full life cycle of a Government Linked Data project, from identification of suitable data sets through procurement, modeling, vocabulary selection, publication and ongoing maintenance.
  2. Definition of known, proven steps to create and maintain government data sets using Linked Data principles.
  3. Guidance in explaining the value proposition for LOD to stakeholders, managers and executives.
  4. Assistance to the Working Group in later stages of the Standards Process, in order to solicit feedback, use cases, etc.

Organized by the Charter

From section 2.2 of the GLD Charter.

The Working Group, facilitated by the Best Practices Task Force, will produce Recommendation(s) (or a Working Group Note / website, where noted) for the following:

  1. Procurement.
  2. Vocabulary Selection.
  3. URI Construction.
  4. Versioning.
  5. Stability.
  6. Legacy Data.
  7. Cookbook. (Working Group Note or website rather than Recommendation).

GLD Life cycle

2.2.1 Best Practices for Procurement

Procurement. Specific products and services involved in governments publishing linked data will be defined, suitable for use during government procurement. Just as the Web Content Accessibility Guidelines allow governments to easily specify what they mean when they contract for an accessible Website, these definitions will simplify contracting for data sites and applications.

Update as of 21-Dec-2011 - Mike and George are reaching out to John Sheridan to discuss over the last two weeks of December. It is possible that John has already left for the holidays. This needs to be jump-started as soon as possible.

Draft as of 9-Jan-2012

Procurement Overview and Glossary

Linked Open Data (LOD) offers novel approaches for publishing and consuming data on the Web. This procurement overview and its companion glossary are intended to help contract officers and their technical representatives understand LOD activities and their associated products and services. It is hoped that this will aid government officials in procuring LOD-related products and services.

OVERVIEW

Recent Open Government initiatives call for more and better access to government data. To meet expanding consumer needs, many governments are now looking to go beyond traditional provisioning formats (e.g. CSV, XML), and are beginning to provision data using Linked Open Data (LOD) approaches.

In contrast to provisioning data on the Web, LOD provisions data into the Web so it can be interlinked with other linked data, making it easier to discover, and more useful and reusable. LOD leverages World Wide Web standards such as Hypertext Transfer Protocol (HTTP), Resource Description Framework (RDF), and Uniform Resource Identifiers (URIs), which make data self-describing so that it is both human and machine readable. Self-describing data is important because most government data comes from relational data systems that do not fully describe the source data schema needed for application development by third parties.

While LOD is a relatively new approach to data provisioning, its growth has been rapid. LOD has been adopted by national governments including the UK, Sweden, Germany, France, Spain, New Zealand and Australia.

Development and maintenance of linked data is supported by the Semantic Web/Semantic Technologies industry. Useful information about industry vendors/contractors, and their associated products and services, is available from the World Wide Web Consortium's Government Linked Data (W3C/GLD) working group Community Directory.

The following categorizes activities associated with LOD development and maintenance, and identifies the products and services associated with these activities:

1. LOD Preparation

Products:

Services: Services that support modeling relational or other data sources using URIs, and developing scripts used to generate linked open data. Overlap exists between LOD preparation and publishing.
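To make the preparation activity concrete, the following is a minimal sketch of such a generation script in Python with rdflib; the input file, column names, and example.gov URIs are illustrative assumptions rather than WG recommendations.

 # Minimal sketch: convert rows of a hypothetical CSV file into RDF triples.
 import csv
 from rdflib import Graph, Literal, Namespace
 from rdflib.namespace import RDF, RDFS
 
 EX = Namespace("http://data.example.gov/id/")      # assumed base URI
 VOCAB = Namespace("http://data.example.gov/def/")  # assumed vocabulary
 
 g = Graph()
 g.bind("ex", EX)
 g.bind("vocab", VOCAB)
 
 with open("facilities.csv", newline="") as f:      # hypothetical input
     for row in csv.DictReader(f):
         facility = EX["facility/" + row["facility_id"]]
         g.add((facility, RDF.type, VOCAB.Facility))
         g.add((facility, RDFS.label, Literal(row["name"])))
         g.add((facility, VOCAB.city, Literal(row["city"])))
 
 g.serialize(destination="facilities.ttl", format="turtle")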

2. LOD Publishing

Products: An RDF database (a.k.a. triple store) enables hosting of linked data.

Services: These are services that support creation, interlinking and deployment of linked data (see also linked data preparation). Hosting data via a triple store is a key aspect of publishing. LD publishing may include implementing a PURL strategy. During preparation for publishing linked data, data and publishing infrastructure may be tested and debugged to ensure it adheres to linked data principles and best practices. (Source: Linked Data: Evolving the Web into a Global Data Space, Heath and Bizer, Morgan and Claypool, 2011, Section 5.4, p. 53)
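As an illustration of the hosting step, this sketch pushes the Turtle file from the preparation example into a triple store, assuming the store implements the SPARQL 1.1 Graph Store HTTP Protocol; the endpoint URL and graph name are hypothetical.

 # Minimal sketch: load a Turtle file into a Graph Store Protocol endpoint.
 import requests
 
 STORE = "http://localhost:3030/gov/data"  # assumed Fuseki-style endpoint
 
 with open("facilities.ttl", "rb") as f:
     resp = requests.post(
         STORE,
         params={"graph": "http://data.example.gov/graph/facilities"},
         data=f,
         headers={"Content-Type": "text/turtle"},
     )
 resp.raise_for_status()  # publishing failed if this raises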

3. LOD Discovery and Consumption

Products: Linked Data Browsers allow users to navigate between data sources by following RDF links; Linked Data Search Engines crawl linked data by following RDF links, and provide query capabilities over aggregated data.

Services: Services that support describing, finding and using linked data on the Web, and that support the development of applications that use (i.e. consume) this data. Publication of linked data contributes to a global data space often referred to as the Linked Open Data Cloud or 'Web of Data.'
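A minimal consumption sketch, using the SPARQLWrapper Python library against DBpedia (a well-known public endpoint, used here purely as an example; the query relies on the endpoint's predefined dbo: and rdfs: prefixes):

 # Minimal sketch: query a public SPARQL endpoint and print the bindings.
 from SPARQLWrapper import SPARQLWrapper, JSON
 
 sparql = SPARQLWrapper("https://dbpedia.org/sparql")
 sparql.setQuery("""
     SELECT ?agency ?label WHERE {
         ?agency a dbo:GovernmentAgency ;
                 rdfs:label ?label .
         FILTER (lang(?label) = "en")
     } LIMIT 5
 """)
 sparql.setReturnFormat(JSON)
 for row in sparql.query().convert()["results"]["bindings"]:
     print(row["agency"]["value"], "-", row["label"]["value"])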

4. Management Consulting and Strategic Planning

Products: Not applicable

Services: There is a broad range of management-related services; examples include briefings intended for decision makers to provide a general understanding of the technology, the business case and ROI, and strategic planning support (e.g. enterprise linked data, implementation of PURLs, etc.).

5. Formal Education and Training

Products:

Services: Various private companies and universities offer training related to linked open data. These offerings vary widely, from high-level informational sessions intended to give managers/decision makers a general understanding, to in-depth, hands-on instruction for practitioners on how to prepare, publish and consume linked data.


GLOSSARY

Linked Open Data: A pattern for hyper-linking machine-readable data sets to each other using Semantic Web techniques, especially via the use of RDF and URIs. Enables distributed SPARQL queries of the data sets and a "browsing" or "discovery" approach to finding information (as compared to a search strategy). (Source: Linking Enterprise Data, David Wood, Springer, 2010, p. 286)

Linked Open Data Cloud: Linked Open Data that has been published is depicted in a LOD cloud diagram. The diagram shows connections between linked data sets and color codes them based on data type (e.g., government, media, life sciences, etc.). The diagram can be viewed at: http://richard.cyganiak.de/2007/10/lod/

RDF (Resource Description Framework): A language for representing information about resources in the World Wide Web. RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers, or URIs), and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, and their properties and values. (http://www.w3.org/TR/rdf-primer/)

Semantic Technologies: The broad set of technologies that relate to the extraction, representation, storage, retrieval and analysis of machine-readable information. The Semantic Web standards are a subset of semantic technologies and techniques. (Source: Linking Enterprise Data, David Wood, Springer, 2010, p. 286)

Semantic Web: An evolution or part of the World Wide Web that consists of machine-readable data in RDF and an ability to query that information in standard ways (e.g. via SPARQL).

Semantic Web Standards: Standards of the World Wide Web Consortium (W3C) relating to the Semantic Web, including RDF, RDFa, SKOS and OWL. (Source: Linking Enterprise Data, David Wood, Springer, 2010, p. 287)

SPARQL: The SPARQL Protocol and RDF Query Language (SPARQL) defines a standard query language and data access protocol for use with the Resource Description Framework (RDF) data model. (http://msdn.microsoft.com/en-us/library/aa303673.aspx) Just as SQL is used to query relational data, SPARQL is used to query graph, or linked, data.

Uniform Resource Identifiers (URIs): URIs play a key role in enabling linked data. To publish data on the Web, the items in a domain of interest must first be identified. These are the things whose properties and relationships will be described in the data, and may include Web documents as well as real-world entities and abstract concepts. Because Linked Data builds directly on Web architecture, the Web architecture term "resource" is used to refer to these things of interest, which are, in turn, identified by HTTP URIs.

World Wide Web Consortium's Government Linked Data (W3C/GLD) working group: http://www.w3.org/2011/gld/charter

2.2.2 Best Practices for Vocabulary Selection

The group will provide advice on how governments should select RDF vocabulary terms (URIs), including advice as to when they should mint their own. This advice will take into account issues of stability, security, and long-term maintenance commitment, as well as other factors that may arise during the group's work.


Ghislain

One of the most challenging tasks when publishing a data set is providing metadata describing the model used to capture it. The model, or ontology, gives the semantics of each term used within the data set, or in the LOD cloud once published. The importance of selecting the appropriate vocabulary is threefold:

  • It eases interoperability with existing vocabularies
  • It facilitates integration with data sources from other publishers
  • It speeds up the creation of new vocabularies, since they are based on existing ones rather than created from scratch


Publishers should take time to determine the application domain of their data set: finance, statistics, geography, weather, administrative divisions, organisations, etc. Based on the relevant concepts present in the data set, one of the following options can be pursued:

  • Searching vocabularies using Semantic Web search engines: The five most used engines are Swoogle and Watson (ontology-oriented Web engines); SWSE and Sindice (triple-oriented Web engines); and Falcons (a hybrid-oriented Web engine).

One of the difficult tasks in the ontology reuse process is deciding which semantic search engine to use to obtain good results when searching for ontologies. Five well-known SWSEs are typically used in the literature, yet the literature offers no guidelines to help ontology developers choose between them; the guidelines proposed here could help ontology designers make that decision. Semantic Web search engines can be divided into three groups:

  • "Ontology-oriented" Web engines, such as Swoogle and Watson.
  • "Triple-oriented" (or RDF-oriented) Web engines, such as SWSE and Sindice.
  • "Hybrid-oriented" Web engines, such as Falcons.

A quick observation from experimenting with the above-mentioned engines is that there is no clear separation between ontologies and RDF data coming from blogs and other sources like DBpedia.

In practice, using the search engines consists of querying them with the set of relevant concepts of the domain (e.g., tourism, point of interest, organization). The output of this exercise is a list of candidate ontologies to be assessed for reuse.

The Data Hub (formerly CKAN) maintains the list of shared data sets, which can be accessed via an API or a full JSON dump. The approach here is to look for data sets in a similar domain of interest and analyze the metadata describing them to find out which vocabularies they reuse. Another "data marketplace" worth mentioning is Kasabi.
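A sketch of that approach, querying the Data Hub's CKAN action API for data sets in a domain of interest; package_search is part of the standard CKAN API, but the host and path assumed here may differ or have changed.

 # Minimal sketch: search a CKAN catalogue for data sets about a topic.
 import requests
 
 resp = requests.get(
     "https://datahub.io/api/3/action/package_search",  # assumed CKAN host
     params={"q": "tourism", "rows": 5},
 )
 resp.raise_for_status()
 for pkg in resp.json()["result"]["results"]:
     print(pkg["name"], "-", pkg.get("notes", "")[:60])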

  • Searching vocabularies using LOV.

The Linked Open Vocabularies (a.k.a. LOV) is a data set, expressed in RDF, that inventories vocabularies for describing data sets as well as the semantic relations between those vocabularies. Although still in a preliminary state, it already identifies more than 100 vocabularies. It turns out that some vocabularies are commonly used, such as SKOS, FOAF, Dublin Core, Geo and Event.

  • Composition of the above-mentioned methods: combining the search process across the existing search engines, the data set catalogues, and LOV.

TODO

  • Assessment of and criteria for vocabulary selection


Boris

We need to determine the vocabulary to be used for modelling the domain of the government data sources. The most important recommendation in this context is to reuse available vocabularies as much as possible. This reuse-based approach speeds up vocabulary development, and therefore governments save time, effort and resources. This activity consists of the following tasks (a minimal reuse sketch follows the list):

  • Search for suitable vocabularies to reuse. Currently there are some useful repositories for finding available vocabularies, such as SchemaWeb, SchemaCache, Swoogle, and LOV (Linked Open Vocabularies).
  • If we do not find any vocabulary suitable for our purposes, we should create one, reusing as much as possible from existing resources, e.g., government catalogues, vocabularies available at sites like [1], etc.
  • Finally, if we find neither available vocabularies nor resources for building one, we have to create the vocabulary from scratch.
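As a minimal illustration of the reuse-based approach, the following sketch describes a government organization using the existing FOAF and Dublin Core vocabularies instead of minting new terms; the agency URI is an illustrative assumption.

 # Minimal sketch: reuse FOAF and Dublin Core terms for a new data set.
 from rdflib import Graph, Literal, URIRef
 from rdflib.namespace import DCTERMS, FOAF, RDF
 
 g = Graph()
 agency = URIRef("http://data.example.gov/id/agency/epa")  # assumed URI
 
 g.add((agency, RDF.type, FOAF.Organization))  # reused FOAF class
 g.add((agency, FOAF.name, Literal("Environmental Protection Agency")))
 g.add((agency, DCTERMS.description, Literal("US federal agency.")))
 
 print(g.serialize(format="turtle"))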

2.2.3 URI Construction

The group will specify how to create good URIs for use in government linked data. Inputs include Cool URIs for the Semantic Web, Designing URI Sets for the UK Public Sector (PDF), and Creating URIs (data.gov.uk). Guidance will be produced not only for minting URIs for governmental entities, such as schools or agencies, but also for vocabularies, concepts, and datasets.

Ghislain

  • URIs aim to identify any data, concept or object to be published, and should be dereferenceable. URIs published by the public sector must follow a consistent pattern.
  • Decisions about the patterns to use should also take into account basic criteria such as simplicity, stability and manageability.
  • First, one base URI structure should be identified, looking something like http://{sector}.yourdomain/ or http://data.{sector}.{yourdomain}/.
  • In URI scheme decisions, separate the vocabulary from the data. Many decisions can be made, with special care needed for spatial data; see Designing URI Sets for the UK Public Sector.
  • Two options exist for the vocabulary URI schemes:

1. Using the same base URI.

2. Using a different scheme for instances.


TODO : More specific examples ? @@ EXPAND MORE @@

Boris / Dani

It is necessary to identify resources on the Internet, and URIs are designed precisely for that. According to [2], URIs should be designed with simplicity, stability and manageability in mind, thinking of them as identifiers rather than as names for Web resources.

There are some existing guidelines for URI design, for example (1) Cool URIs for the Semantic Web, a W3C Interest Group Note that introduces useful guidance on how to use URIs to describe things that are not Web documents; (2) Designing URI Sets for the UK Public Sector, a document from the UK Cabinet Office that defines design considerations for how URIs can be used to publish public sector reference data; and (3) Style Guidelines for Naming and Labelling Ontologies in the Multilingual Web, which proposes guidelines for designing URIs in a multilingual scenario.

Based on the aforementioned guidelines and on our experience we propose the following design decisions regarding the assignment of URIs to the elements of the dataset:

TO DO ... propose the guidelines (Dani)

John Erickson(RPI)

TWC RPI drafted a set of URI Design Principles with an eye toward re-hostability: that is, we propose a syntactic design that can be modeled and demonstrated on one host (TWC's Instance Hub) but can be easily re-hosted on another, such as a government agency responsible for the named entities.

TWC RPI's URI Design goals

  • Able to easily be re-hosted (eg from demonstrator portal to agency host)
  • Concise URIs with as little cruft as possible
  • URIs that span many domains including:
    • National identifiers (e.g. govermental agencies, states, zip codes)
    • State-level identifiers (e.g. counties, congressional districts)
    • Agency-level identifiers (e.g. EPA facilities)

TWC RPI's URI Template

 'http://' BASE '/' 'id' '/' ORG '/' CATEGORY ( '/' TOKEN )+ 

In TWC RPI's case, BASE will be logd.tw.rpi.edu

Notes on the RPI Design
  • id
    • This is required, to avoid "polluting" the top namespace of BASE with identifiers.
    • Prefer id over other alternatives to keep token as short as possible. The id token doesn't add any semantics, it is just a syntactic way of distinguishing these URIs from others.
    • Also, consistency with data.gov.uk URIs here is a good thing.
  • ORG
    • This is a short token representing the agency, government, or organization that controls the identifier space.
    • For US identifiers, this token will start with 'us/', and be followed by a designation of either federal or state-level (e.g. 'us/fed', 'us/ny', 'us/ca').
    • Identifiers relating to data.gov will all fall under the federal 'us/fed' space.
    • For identifiers that aren't directly governmental, the ORG token should be suitably unique; for example, we use "usps-com" below for USPS controlled zip code URIs.
  • CATEGORY and TOKEN
    • These are ORG-specific values that identify the specific instance.
    • Use as many TOKENs as necessary to distinguish the instance.

Examples of TWC RPI URI Design

The URI Design Principles page provides examples of applying this template (a minimal code sketch follows the list) to:

  • US Government agencies:
    http://BASE/id/us/fed/agency/Department_of_Health_and_Human_Services/Centers_for_Disease_Control
  • States and Territories:
    http://logd.tw.rpi.edu/id/us/state/Vermont
  • Counties:
    http://BASE/id/us/state/Alaska/Bethel_Census_Area
  • US Postal Codes (Zip Codes):
    http://BASE/id/usps-com/zip/09510
  • Congressional Districts:
    http://BASE/id/us/ma/congressional-district/4
  • EPA Facilities:
    http://BASE/id/epa-gov/facility/110007995027
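A minimal sketch of the template as a Python function; the split of the Vermont example into ORG and CATEGORY is one reading of the template, and the function is illustrative rather than TWC's implementation.

 # Illustrative sketch of the TWC RPI URI template:
 #   'http://' BASE '/' 'id' '/' ORG '/' CATEGORY ( '/' TOKEN )+
 def mint_uri(base: str, org: str, category: str, *tokens: str) -> str:
     if not tokens:
         raise ValueError("the template requires at least one TOKEN")
     return f"http://{base}/id/{org}/{category}/" + "/".join(tokens)
 
 # Reproduces the States and Territories example above:
 print(mint_uri("logd.tw.rpi.edu", "us", "state", "Vermont"))
 # -> http://logd.tw.rpi.edu/id/us/state/Vermont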

2.2.4 Versioning

The group will specify how to publish data which has multiple versions, including variations such as:

  1. data covering different time periods
  2. corrected data about the same time period
  3. the same data published using different vocabularies, formats, and presentation styles
  4. retracting published data

John Erickson(RPI)

The Digital Library community has faced the problem of versions in digital repositories for more than a decade. One useful summary of thinking in this space can be found at the Version Identification Framework (VIF) Project site. See especially:

  1. Essential Versioning Information
  2. Embedding Versioning Information in an Object
  3. Recommendations for Repository Developers

The Resourcing IDentifier Interoperability for Repositories (RIDIR) project (2007-2008) considered in depth the relationship between identifiers and finding versions of objects. See RIDIR Final Report. In their words, RIDIR set out to investigate how the appropriate use of identifiers for digital objects might aid interoperability between repositories and to build a self-contained software demonstrator that would illustrate the findings. A number of related projects are listed at JISC's RIDIR information page.

In addition, at TWC we have adopted an ad hoc approach to denoting versions of published linked data:

  1. The URI for the "abstract" dataset has no version information, e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1017
  2. The URI for a particular version appends this, e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1017/version/1st-anniversary
  3. The version indicator (e.g. "1st-anniversary") is arbitrary; a date code may be used. We sometimes use non-ISO 8601 dates (e.g. "12-Jan-2012") to make it clear that the value is (in our case) not necessarily machine-produced. A short illustrative sketch follows.
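In this sketch, the version URI is derived by string composition; linking the two with dcterms:isVersionOf is an illustrative choice of predicate, not documented TWC practice.

 # Minimal sketch: derive a version URI and link it to the abstract dataset.
 from rdflib import Graph, URIRef
 from rdflib.namespace import DCTERMS
 
 abstract = URIRef("http://logd.tw.rpi.edu/source/data-gov/dataset/1017")
 version = URIRef(str(abstract) + "/version/1st-anniversary")
 
 g = Graph()
 g.add((version, DCTERMS.isVersionOf, abstract))  # illustrative predicate
 print(g.serialize(format="turtle"))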

2.2.5 Stability

The group will specify how to publish data so that others can rely on it being available in perpetuity, persistently archived if necessary.


Definition

A definition describes this element of best practices.

Stability
Persistent, predictable, machine-accessible data at stable, externally visible locations.
  • Persistent = machine-accessible for long periods of time
  • Predictable = names follow a logical format
  • Stable location = externally visible locations are consistent


  • Other things that impact stability
    • legacy = earlier naming schemes, formats, data storage devices
    • steward = people who are committed to consistently maintain specific datasets, either individuals or roles in organizations
    • provenance = the sources that establish a context for the production and/or use of an artifact; see the W3C Provenance Working Group
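One way to make "persistent" and "stable location" operational is to monitor published URIs; a minimal sketch follows, in which the URI list is an illustrative assumption.

 # Minimal sketch: check that published URIs still dereference.
 import requests
 
 PUBLISHED = [  # hypothetical URIs under a steward's care
     "http://data.example.gov/id/agency/epa",
     "http://data.example.gov/id/school/42",
 ]
 
 for uri in PUBLISHED:
     try:
         status = requests.head(uri, allow_redirects=True, timeout=10).status_code
     except requests.RequestException as exc:
         status = "unreachable ({})".format(exc.__class__.__name__)
     print(uri, "->", status)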

Goals

These goals influence the value of the data. We believe that preservation of content is the main goal for stability; however, there are others.


The length of time information is available is inherently connected to the value placed upon it. Value is determined by a cost-benefit relationship; the benefit derived from information is reduced by the cost(s) associated with using it.

Increasing stability requires adopting a strategy for allocating limited resources toward a goal. Adhere to selection criteria for what is best preserved. There are three possible goals:

  1. Preservation of content - It might be important to have raw data available for analysis ad infinitum. This means the overall objective is to preserve only the scientific content.
  2. Preservation of access - It might be important to have information available immediately at all times.
  3. Conservation - From a historical perspective, one could seek to preserve all information in the format and modality in which it was originally conveyed. The most demanding form is conservation of the full look and feel of the publication.



Mark metadata based on its intended audience:

  • Internal audience: management of the process
  • External audience: final state, or no update needed

Examples

These examples are intended to generate discussion and comment. The intent is to have a few representative samples that will encourage additional suggestions. These examples were discussed on the public-gld email list.


Technical examples: What existing examples can we point to? (Need international ones.)

  1. PURLs (Persistent Uniform Resource Locators) purl.oclc.org
  2. Handle System http://www.handle.net/ and its commercial cousin Digital Object Identifier [3]
  3. Internet Archive (http://www.archive.org)

Institutional examples: Who has the incentive to provide stable, persistent data? Some real possibilities and some metaphors for discussion.

  1. Archives
    • Third party entities that document provenance and provide access
  2. Estate Lawyer
    • Someone responsible for tracking down heirs for an inheritance
  3. Private Foundation
    • A philanthropic entity that is interested in the value proposition of stability and acts as an archive
  4. Government
    • A government organization which has the funds to steward others' data
  5. An internet organization
    • A global open organization like the W3C or ICANN

Anne Washington 04:21, 8 December 2011 (UTC)


Characteristics

SUCCESS FACTORS
Stable data may share similar technical characteristics. Others?
  • scalable = data formats can handle increases in size over time, and names can expand to accommodate growth
    • e.g. the W3C policy of including dates in URLs, which accommodates name changes that occur over time
  • granularity
  • compressed in ZIP or GZIP format
  • data archives nested in at least a single directory


COMMON PITFALLS TO AVOID
Stable data should avoid certain technical pitfalls. Others?
  • Data file names should not contain non-printable characters
  • Data archives should be kept to a manageable size
  • How many directories deep is acceptable?


ORGANIZATIONAL CONSIDERATIONS
Without internal stability from the data stewards, external technology stability is a challenge. These are some organizational characteristics for stable data.
  • Consistent human skills
  • Consistent infrastructure
  • data related to organizational values or business needs
  • internal champion or consistent business process
  • internal politics over naming variations do not impact external locations

-Ron Reck and Anne Washington 20:01, 29 November 2011 (UTC)

Possible new page to transclude?: Best_Practices_Discussion_Stability

2.2.6 Legacy Data

The group will produce specific advice concerning how to expose legacy data, i.e. data which is being maintained in pre-existing (non-linked-data) systems.

Subject: Roadmap for cities to adopt open data

Biplav: use-case

  • Suppose a city is considering opening up its data. It has certain concerns:
  1. Business and legal level
    1. What are the privacy considerations in publishing data? On one hand, the city will want to respect the privacy of citizens and businesses; on the other, it will want the data to be valuable enough to lead to positive change.
    2. How should the cost of opening up data be covered? Cities may or may not have legal obligations to open up data. Accordingly, they will look for guidance on how to account for the costs. Further, can they levy a license fee if they are not obligated to open data?
    3. Which data should be opened and when? Should it be by phases? What data should not be shared?
    4. What policies / laws are needed from the city so that businesses can collaborate on open data, while preserving their IP?
  2. Technical level
    1. What should the architecture be for sharing large-scale public data? How do we ensure performance and security?
    2. What visualization should be supported for different types of data?
    3. Can we provide a template implementation for the reference architecture?
  • We need to provide a roadmap to address these concerns; a minimal technical sketch of exposing legacy data follows.
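For the technical level, here is a minimal sketch of exposing one legacy relational table as RDF, in the spirit of the W3C direct mapping approach; the database, table, column names and example.city URIs are illustrative assumptions.

 # Minimal sketch: expose a legacy relational table as linked data.
 import sqlite3
 from rdflib import Graph, Literal, Namespace
 from rdflib.namespace import RDFS
 
 EX = Namespace("http://data.example.city/id/")  # assumed base URI
 
 g = Graph()
 conn = sqlite3.connect("legacy.db")             # hypothetical database
 for sid, name, borough in conn.execute(
         "SELECT id, name, borough FROM schools"):
     school = EX["school/{}".format(sid)]
     g.add((school, RDFS.label, Literal(name)))
     g.add((school, EX.borough, Literal(borough)))
 
 g.serialize(destination="schools.ttl", format="turtle")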

2.2.7 Linked Data Cookbook

The group will produce a collection of advice on smaller, more specific issues, where known solutions exist to problems collected for the Community Directory. This document is to be published as a Working Group Note or website, rather than a Recommendation. It may instead become part of the Community Directory site. See The Cookbook for Open Government Linked Data.

Website Organization

Basic features

Best Practices may leverage the SWEO Semantic Web Case Studies and Use Cases approach.


Design Goals

  1. Must be relevant for government (local, state, federal, international)
  2. Self-maintaining over time
  3. Data published in a W3C RDF serialization (or submitted W3C Standard)


Possible Technologies to Use in the Site

  • Website technology options suggested:
    • Callimachus-driven form-based site?
    • wiki (MediaWiki or otherwise)
    • Others? Please suggest more.