Best Practices Discussion Summary

From Government Linked Data (GLD) Working Group Wiki
Revision as of 04:21, 8 December 2011 by Awashing (Talk | contribs)


Discussion Date

Preliminary discussion took place at the W3C Government Linked Data Working Group First F2F June 29-30, 2011.

Interested in working on this

Note: The group will produce one or more Recommendations which address the following issues:

  • Michael Hausenblas (DERI), Bernadette Hyland, Boris Villazón-Terrazas as Editors (Editor's Draft)
  • Michael Hausenblas (DERI) - Vocab
  • Ghislain (INSTITUT TELECOM) - Vocab, URI Construction, Versioning
  • David Price (TopQuadrant) - Vocab, Legacy Data
  • Boris Villazón-Terrazas (UPM) - Vocab, URI Construction
  • Daniel Vila (UPM) - URI Construction
  • John Erickson (RPI) - URI Construction, Versioning, Provenance, Vocab
  • Dean Allemang (TopQuadrant) - Versioning, Training?
  • Cory Casanove - Versioning
  • Bernadette Hyland (3 Round Stones) - Linked Data Cookbook
  • Sarven Capadisli (DERI) - Linked Data Cookbook
  • Martin Alvarez (CTIC) - URI Construction
  • John Sheridan (OPSI, UK) - Procurement
  • George Thomas (Health & Human Services, US) - Procurement
  • Mike Pendleton (Environmental Protection Agency, US) - Procurement
  • Biplav Srivastava (IBM) - Vocab, Legacy data

Outstanding:

  • Procurement

Upon joining the WG:

  • Hadley Beeman - Versioning (related to Data "Cube")
  • Anne Washington (George Mason University) - Stability

Overview

Linked Data approaches address key requirements of open government by providing a family of international standards for the publication, dissemination and reuse of structured data. Further, Linked Data, unlike previous data formatting and publication approaches, provides a simple mechanism for combining data from multiple sources across the Web.

In an era of reduced local, state and federal budgets, there is strong economic motivation to reduce waste and duplication in data management and integration. Linked Open Data is a viable approach to publishing governmental data to the public, but only if it adheres to some basic principles.


Purpose of Best Practices Recommendation(s)

The following are some of the motivations for publishing Recommendation(s) and Working Group Notes, as identified in the GLD WG Charter.

  1. The overarching objective is to provide best practices and guidance for creating high-quality, re-usable Linked Open Data (LOD).

More specifically, best practices are aimed at assisting government departments/agencies/bureaus, and their contractors, vendors and researchers, to publish high quality, consistent data sets using W3C Standards to increase interoperability.

Best practices are intended to be a methodical approach for the creation, publication and dissemination of governmental Linked Data. Best practices from the GLD WG shall include:

  1. Description of the full life cycle of a Government Linked Data project, starting with identification of suitable data sets, procurement, modeling, vocabulary selection, through publication and ongoing maintenance.
  2. Definition of known, proven steps to create and maintain government data sets using Linked Data principles.
  3. Guidance in explaining the value proposition for LOD to stakeholders, managers and executives.
  4. Assistance to the Working Group in later stages of the Standards Process, for example in soliciting feedback and use cases.

Organized by the Charter

From section 2.2 of the GLD Charter.

The Working Group, facilitated by the Best Practices Task Force, will produce Recommendation(s) (or a Working Group Note / website, where noted) for the following:

  1. Procurement.
  2. Vocabulary Selection.
  3. URI Construction.
  4. Versioning.
  5. Stability.
  6. Legacy Data.
  7. Cookbook. (Working Group Note or website rather than Recommendation).

GLD Life cycle

2.2.1 Best Practices for Procurement

Procurement. Specific products and services involved in governments publishing linked data will be defined, suitable for use during government procurement. Just as the Web Content Accessibility Guidelines allow governments to easily specify what they mean when they contract for an accessible Website, these definitions will simplify contracting for data sites and applications.

2.2.2 Best Practices for Vocabulary Selection

The group will provide advice on how governments should select RDF vocabulary terms (URIs), including advice as to when they should mint their own. This advice will take into account issues of stability, security, and long-term maintenance commitment, as well as other factors that may arise during the group's work.


Ghislain

One of the most challenging tasks when publishing a data set is providing metadata that describes the model used to capture it. The model, or ontology, gives the semantics of each term used within the data set, or in the LOD cloud once published. The importance of selecting the appropriate vocabulary is threefold:

  • Ease interoperability with existing vocabularies
  • Facilitate integration with data sources from other publishers
  • Speed up the creation of new vocabularies, since they are not created from scratch but based on existing ones.


Publishers should first identify the application domain of their data set: finance, statistics, geography, weather, administrative divisions, organisation, etc. Based on the relevant concepts present in the data set, one of the following options can be pursued:

  • Searching vocabularies using semantic web search engines: the five most used are Swoogle and Watson (ontology-oriented engines); SWSE and Sindice (triple-oriented engines); and Falcons (a hybrid engine).

One difficulty in the ontology reuse process is deciding which semantic search engine to use to obtain good results when searching for ontologies. Five engines are well known and typically used in the literature, but what are the criteria for choosing one of them in a particular domain? The literature offers no guidelines to help ontology developers decide between them. The guidelines proposed here could potentially help ontology designers make that decision. As a starting point, semantic web search engines can be divided into three groups:

  • "Ontology-oriented" engines, such as Swoogle and Watson.
  • "Triple-oriented" (or RDF-oriented) engines, such as SWSE and Sindice.
  • "Hybrid-oriented" engines, as in the case of Falcons.

Also, a quick observation from experimenting with the above-mentioned engines is that there is no clear separation between ontologies and RDF data coming from blogs and other sources such as DBpedia.

In practice, using the search engines consists of querying them with the set of relevant concepts of the domain. The output of this exercise is a list of candidate ontologies to be assessed for reuse.

The datahub (previously CKAN) maintains a list of shared data sets that can be accessed through an API or a full JSON dump. The approach here could be to look for data sets in the same or a similar domain of interest, and analyze the metadata describing that data to find out which vocabularies are reused. Another data marketplace worth mentioning is Kasabi.

  1. Searching vocabularies using LOV (Linked Open Vocabularies)
  2. Combining the three methods mentioned above
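
The catalogue-based approach described above (inspecting the metadata of existing data sets to see which vocabularies they reuse) can be sketched as follows. This is only an illustration: the records and the field names "name" and "vocabularies" are hypothetical and do not reflect the actual datahub schema or API.

```python
from collections import Counter

# Hypothetical catalogue records. The field names and values are
# assumptions for illustration, not real datahub metadata.
SAMPLE_DATASETS = [
    {"name": "city-budget-2011", "vocabularies": ["dcterms", "qb", "foaf"]},
    {"name": "school-locations", "vocabularies": ["dcterms", "geo"]},
    {"name": "agency-directory", "vocabularies": ["foaf", "org", "dcterms"]},
]


def rank_vocabularies(datasets):
    """Return (vocabulary, dataset_count) pairs, most reused first."""
    counts = Counter()
    for record in datasets:
        # set() so a vocabulary listed twice in one record counts once
        counts.update(set(record.get("vocabularies", [])))
    # Sort by descending count, then alphabetically for a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for vocab, n in rank_vocabularies(SAMPLE_DATASETS):
        print(f"{vocab}: reused by {n} data set(s)")
```

The resulting ranking is only a list of candidates; each vocabulary would still need to be assessed against the selection criteria listed under TODO below.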

TODO

  • Assessment and criteria for vocabularies selection


Boris

We need to determine the vocabulary to be used for modelling the domain of the government data sources. The most important recommendation in this context is to reuse available vocabularies as much as possible. This reuse-based approach speeds up vocabulary development, and therefore saves governments time, effort and resources. This activity consists of the following tasks:

  • Search for suitable vocabularies to reuse. Currently there are some useful repositories for finding available vocabularies, such as SchemaWeb, SchemaCache, Swoogle, and LOV (Linked Open Vocabularies).
  • If we do not find any vocabulary that is suitable for our purposes, we should create one, reusing existing resources as much as possible, e.g., government catalogues, vocabularies available at sites like [1], etc.
  • Finally, if we find neither available vocabularies nor resources for building the vocabulary, we have to create the vocabulary from scratch.

2.2.3 URI Construction

The group will specify how to create good URIs for use in government linked data. Inputs include Cool URIs for the Semantic Web, Designing URI Sets for the UK Public Sector (PDF), and Creating URIs (data.gov.uk). Guidance will be produced not only for minting URIs for governmental entities, such as schools or agencies, but also for vocabularies, concepts, and datasets.

  • Starting points? Ghislain


Boris / Dani

Resources on the Internet need to be identified, and URIs are designed precisely for that purpose. According to [2], URIs should be designed with simplicity, stability and manageability in mind, treating them as identifiers rather than as names for Web resources.

There are existing guidelines for URI design, for example: (1) Cool URIs for the Semantic Web, a W3C Interest Group Note, which provides useful guidance on how to use URIs to describe things that are not Web documents; (2) Designing URI Sets for the UK Public Sector, a document from the UK Cabinet Office that defines design considerations for how URIs can be used to publish public sector reference data; and (3) Style Guidelines for Naming and Labelling Ontologies in the Multilingual Web, which proposes guidelines for designing URIs in a multilingual scenario.
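
As an illustration only, the sketch below mints simple, predictable URIs. It is loosely modelled on the {type}/{concept}/{reference} path shape that we understand the UK public sector guidance to describe; the exact segments, the function names mint_uri and slugify, and the domain data.example.org are assumptions, not a pattern adopted by this group.

```python
import re
import unicodedata

# Assumed path types: "id" for the real-world thing, "doc" for the
# document describing it. This choice is an assumption for illustration.
ALLOWED_TYPES = {"id", "doc"}


def slugify(text):
    """Lower-case, strip accents, and replace other character runs with '-'."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def mint_uri(domain, uri_type, concept, reference):
    """Build http://{domain}/{type}/{concept}/{reference} from plain labels."""
    if uri_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported URI type: {uri_type!r}")
    concept_slug = slugify(concept)
    reference_slug = slugify(reference)
    if not concept_slug or not reference_slug:
        raise ValueError("concept and reference must contain letters or digits")
    return f"http://{domain}/{uri_type}/{concept_slug}/{reference_slug}"


if __name__ == "__main__":
    print(mint_uri("data.example.org", "id", "School", "Colegio San José"))
```

Keeping the minting logic in one small function is one way to serve manageability: the same labels always yield the same URI, and no implementation detail (file extension, server technology) leaks into the identifier.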

Based on the aforementioned guidelines and on our experience we propose the following design decisions regarding the assignment of URIs to the elements of the dataset:

TO DO ... propose the guidelines (Dani)

2.2.4 Versioning

The group will specify how to publish data which has multiple versions, including variations such as:

  1. data covering different time periods
  2. corrected data about the same time period
  3. the same data published using different vocabularies, formats, and presentation styles
  4. retracting published data
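
To make the variations above more concrete, here is a minimal sketch covering three of them: different time periods, corrections to the same period, and retraction. The DatasetVersion fields, the example.org locations, and the rule that a retracted revision is simply skipped are all assumptions for discussion, not a decision of the group. Publishing the same data in different vocabularies or formats (item 3) is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    period: str              # time period covered, e.g. "2010"
    revision: int            # 1 = first publication, higher = correction
    location: str            # hypothetical URL of this version
    retracted: bool = False  # True once the publisher withdraws it


def current_version(versions, period):
    """Return the latest non-retracted revision for a period, or None."""
    candidates = [
        v for v in versions if v.period == period and not v.retracted
    ]
    return max(candidates, key=lambda v: v.revision, default=None)


if __name__ == "__main__":
    versions = [
        DatasetVersion("2010", 1, "http://data.example.org/budget/2010/1"),
        DatasetVersion("2010", 2, "http://data.example.org/budget/2010/2"),
        DatasetVersion("2011", 1, "http://data.example.org/budget/2011/1",
                       retracted=True),
    ]
    print(current_version(versions, "2010"))
    print(current_version(versions, "2011"))
```

Note that earlier revisions and retracted versions stay in the list rather than being deleted, so that consumers who relied on them can still find out what happened to them.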


2.2.5 Stability

The group will specify how to publish data so that others can rely on it being available in perpetuity, persistently archived if necessary.


Definition

Stability
persistent, predictable machine accessible data at stable externally visible locations.
  • Persistent = machine accessible for long periods of time
  • Predictable = names follow a logical format
  • Stable location = externally visible locations are consistent.


Characteristics

Stability can be described with the following characteristics:

  • legacy = earlier naming schemes, formats, data storage devices
  • steward = people who are committed to consistently maintain specific datasets, either individuals or roles in organizations
  • provenance = the sources that establish a context for the production and/or use of an artifact. see W3C Provenance working group


SUCCESS FACTORS

Stable data may share similar technical characteristics. Others?

  • scalable = data formats can handle increase in size over time.
  • granularity
  • compressed in ZIP or GZIP format
  • Data archives should be nested in at least a single directory

COMMON PITFALLS TO AVOID

Stable data may avoid similar technical characteristics. Others?

  • Data file names should not contain non-printable characters
  • Data archives should be kept to a manageable size.
  • Number of directories deep?
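
Some of the points above lend themselves to automated checks before an archive is published. The sketch below uses Python's standard zipfile module to flag file names with non-printable characters and files that are not nested in at least one directory. The function names and the reading of "nested" as "every file sits inside some directory" are assumptions; archive size and directory depth are left out because no limits have been agreed.

```python
import io
import zipfile


def check_archive(zip_bytes):
    """Return a list of human-readable problems found in a ZIP archive."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # directory entry, not a data file
            if not name.isprintable():
                problems.append(f"non-printable character in name: {name!r}")
            if "/" not in name:
                problems.append(f"not nested in a directory: {name!r}")
    return problems


def build_zip(names):
    """Helper for the demo: build an in-memory ZIP with the given names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "sample")
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_archive(build_zip(["budget/2011.csv"])))
    print(check_archive(build_zip(["2011.csv", "budget/bad\tname.csv"])))
```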

ORGANIZATIONAL CONSIDERATIONS

Without internal stability from the data stewards, external technology stability is a challenge. These are some organizational characteristics for stable data.

  • Skill sets (technical, infrastructure, conversion)
  • data related to organizational values or business needs
  • internal champion or consistent business process
  • internal variations do not impact external locations


-Ron Reck and Anne Washington 20:01, 29 November 2011 (UTC)

Examples

These examples are meant to generate discussion and comment. The intent is to have a few representative samples that will encourage additional suggestions. These examples were discussed on the public-gld mailing list.


Technical examples

  1. PURLs (Persistent Uniform Resource Locators), purl.oclc.org
  2. Handle System (http://www.handle.net/) and its commercial cousin, the Digital Object Identifier [3]
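
The idea these systems share is a level of indirection: the externally visible identifier stays fixed while the location it points to can change. The sketch below illustrates that idea only; it is not how PURL or the Handle System is implemented, and the class name, the paths, the example.org URLs and the use of status 302 are all assumptions.

```python
class PersistentResolver:
    """Maps stable public identifiers to their current location."""

    def __init__(self):
        self._targets = {}

    def register(self, persistent_path, target_url):
        """Create or update the target behind a persistent identifier."""
        self._targets[persistent_path] = target_url

    def resolve(self, persistent_path):
        """Return (status, location), in the manner of an HTTP redirect."""
        target = self._targets.get(persistent_path)
        if target is None:
            return 404, None
        return 302, target


if __name__ == "__main__":
    resolver = PersistentResolver()
    resolver.register("/gld/budget", "http://old.example.org/files/budget.csv")
    print(resolver.resolve("/gld/budget"))

    # The data moves internally; the public identifier does not change.
    resolver.register("/gld/budget", "http://new.example.org/data/budget.csv")
    print(resolver.resolve("/gld/budget"))
```

This is the technical counterpart of the organizational point made earlier, that internal variations should not impact external locations.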


Institutional examples

Who has the incentive to provide stable persistent data? Some real possibilities and some metaphors for discussion.

  1. Archives
    • Third party entities that document provenance and provide access
  2. Estate Lawyer
    • Someone responsible for tracking down heirs for an inheritance
  3. Private Foundation
    • A philanthropic entity that is interested in the value proposition of stability and acts as an archive
  4. Government
    • A government organization which has the funds to steward others' data
  5. An internet organization
    • A global open organization like W3C or ICANN

Anne Washington 04:21, 8 December 2011 (UTC)


Possible new page to transclude?: Best_Practices_Discussion_Stability

2.2.6 Legacy Data

The group will produce specific advice concerning how to expose legacy data, that is, data which is being maintained in pre-existing (non-linked-data) systems.

Subject: Roadmap for cities to adopt open data

Biplav: use-case

  • Suppose a city is considering opening up its data. It has certain concerns:
  1. Business and legal level
    1. What are the privacy considerations in publishing data? On one hand, a city will want to respect the privacy of citizens and businesses; on the other, it will want the data to be valuable enough to lead to positive change.
    2. How should the cost of opening up data be paid for? Cities may or may not have legal obligations to open up data, so they will look for guidance on how to account for the costs. Further, can they levy a license fee if they are not obligated to open data?
    3. Which data should be opened and when? Should it be by phases? What data should not be shared?
    4. What policies / laws are needed from the city so that businesses can collaborate on open data, while preserving their IP?
  2. Technical level
    1. What should the architecture be for sharing large-scale public data? How do we ensure performance and security?
    2. What visualizations should be supported for different types of data?
    3. Can we provide a template implementation for the reference architecture?
  • We need to provide a roadmap to address them.

2.2.7 Linked Data Cookbook

The group will produce a collection of advice on smaller, more specific issues, where known solutions exist to problems collected for the Community Directory. This document is to be published as a Working Group Note, or website, rather than a Recommendation. It may, instead, become part of the Community Directory site. See the Cookbook for Open Government Linked Data.

Website Organization

Basic features

Best Practices may leverage the SWEO Semantic Web Case Studies and Use Cases approach.


Design Goals

  1. Must be relevant for government (local, state, federal, international)
  2. Self-maintaining over time
  3. Data published in a W3C RDF serialization (or submitted W3C Standard)


Possible Technologies to Use in the Site

  • Website technology options suggested:
  • Callimachus-driven form-based site?
  • wiki (Mediawiki or otherwise)
  • Others? Please suggest others.