Best Practices for Procurement

From Government Linked Data (GLD) Working Group Wiki

Best Practices: Procurement

GLD Charter Description

Procurement. Specific products and services involved in governments publishing linked data will be defined, suitable for use during government procurement. Just as the Web Content Accessibility Guidelines allow governments to easily specify what they mean when they contract for an accessible Website, these definitions will simplify contracting for data sites and applications.

Status

21-Dec-2011 - MikeP and George reaching out to John Sheridan to discuss over the last two weeks of December. Possible that John has already taken leave for the holidays. Needs to get jump started asap.

23-Jan-2012 - Major revision by MikeP

10-Feb-2012 - Minor revision by MikeP

Linked Open Data (LOD) offers novel approaches for publishing and consuming data on the Web. This procurement overview, checklist, and companion glossary are intended to help contract officers and their technical representatives understand LOD activities and their associated products and services. Information security considerations are also offered. It is hoped that this will aid government officials in procuring LOD-related products and services.

Overview

Recent Open Government initiatives call for more and better access to government data. To meet expanding consumer needs, many governments are now looking to go beyond traditional provisioning formats (e.g. CSV, XML), and are beginning to provision data using Linked Open Data (LOD) approaches.

In contrast to provisioning data on the Web, LOD provisions data into the Web so it can be interlinked with other linked data, making it easier to discover, and more useful and reusable for third party data consumers. LOD leverages World Wide Web standards such as Hypertext Transfer Protocol (HTTP), Resource Description Framework (RDF), and Uniform Resource Identifiers (URIs), which make data self-describing so that it is both human and machine readable. Self-describing data is important because most government data comes from relational data systems that do not fully describe the source data schema needed for application development by third parties.

Linked data has several advantages over standard data provisioning approaches. Linked data leverages the architecture, standards and protocols of the Web, allowing data to be discovered and reused seamlessly. Linked data embeds meaning with the data. Links allow disparate data to be easily discovered, related and “surfed” much as one discovers and surfs text on the Web. With traditional means of publishing data, data remains siloed, and the linkages between data have to be established by users. Linked data includes the metadata and shared vocabularies necessary to make it self-describing, and linkages to other data are integral to the data itself.

The costs of integrating data are an important consideration for application developers. Integrating CSV files requires each application developer to identify and link common data elements across datasets. When separate application developers integrate the same datasets independently, this work is repeated. With linked data, the data publisher identifies links to other linked data, and those links are re-used by every application developer thereafter. While there is a one-time cost incurred by government to create these links, that investment lowers costs for third-party consumers (e.g. other government agencies, NGOs, private industry).
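As a rough illustration of the cost argument above, the following Python sketch models two published datasets as sets of (subject, predicate, object) triples. All example.org URIs, property names and values are hypothetical placeholders; the only real vocabulary term is owl:sameAs. Because the publisher of the first dataset has already stated the link, a consumer can combine the datasets with a plain set union and follow that link, rather than matching records by hand.

```python
# Hypothetical triples; the example.org URIs and values are placeholders.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

agency_a = {
    ("http://a.example.org/id/school/42",
     "http://a.example.org/def/name",
     "Lincoln Elementary"),
    # Link created once by the publisher of dataset A.
    ("http://a.example.org/id/school/42",
     OWL_SAME_AS,
     "http://b.example.org/id/facility/9001"),
}

agency_b = {
    ("http://b.example.org/id/facility/9001",
     "http://b.example.org/def/inspectionScore",
     "97"),
}


def describe(subject, triples):
    """Collect (property, value) pairs for a subject, following owl:sameAs one hop."""
    same = {o for s, p, o in triples if s == subject and p == OWL_SAME_AS}
    aliases = {subject} | same
    return {(p, o) for s, p, o in triples if s in aliases and p != OWL_SAME_AS}


# Merging linked datasets is a set union; no per-application matching step.
merged = agency_a | agency_b
facts = describe("http://a.example.org/id/school/42", merged)
for prop, value in sorted(facts):
    print(prop, value)
```

Every consumer that loads both datasets gets the same combined description without repeating the record-matching work.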

While LOD is a relatively new approach to data provisioning, adoption has grown rapidly. LOD has been adopted by national governments including the UK, US, Sweden, Germany, France, Spain, New Zealand and Australia.

Development and maintenance of linked data is supported by the Semantic Web/Semantic Technologies industry. Useful information about industry vendors/contractors, and their associated products and services, is available from the World Wide Web Consortium’s Government Linked Data (W3C/GLD) workgroup Community Directory.


LOD Production through Consumption Lifecycle

The following are categories of products/services in support of LOD development, maintenance, and consumption:

1. LOD Preparation/Modeling

These are products and services that support modeling relational or other data sources using URIs, and developing the scripts used to generate linked open data.
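A generation script of the kind described above might look like the following minimal Python sketch, which turns one relational row into N-Triples lines. The base namespace, the URI pattern (`id/<table>/<key>` for subjects, `def/<table>#<column>` for properties) and the sample row are assumptions made for illustration only; a real project would choose its own URI scheme and would usually map columns to shared vocabularies and typed literals.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not a real government URI


def escape_literal(value):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(table, key_column, row):
    """Turn one relational row (a dict) into a list of N-Triples lines."""
    key = quote(str(row[key_column]), safe="")
    subject = f"<{BASE}id/{table}/{key}>"
    lines = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject URI; NULLs yield no triple
        predicate = f"<{BASE}def/{table}#{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


row = {"contract_id": 17, "vendor": 'Acme "Data" Ltd', "amount": 25000}
for line in row_to_ntriples("contract", "contract_id", row):
    print(line)
```

The sketch emits every value as a plain string literal, which keeps it short but loses datatype information such as numbers and dates.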

2. LOD Publishing

These are products and services that support the creation, interlinking and deployment of linked data (see also LOD Preparation/Modeling). Hosting data via a triple store is a key aspect of publishing. Linked data publishing may include implementing a persistent identifier/PURL strategy. During preparation for publishing linked data, the data and publishing infrastructure may be tested and debugged to ensure they adhere to linked data principles and best practices. (Source: Linked Data: Evolving the Web into a Global Data Space, Heath and Bizer, Morgan and Claypool, 2011, Section 5.4, p. 53)
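The idea behind a persistent identifier/PURL strategy can be sketched in a few lines of Python: consumers only ever see the persistent URI, and a resolver redirects it to wherever the data currently lives. The URIs, the lookup table and the `resolve` function below are hypothetical, and the use of an HTTP 303 See Other redirect is an assumption (one common choice in linked data deployments), not a description of any particular PURL product.

```python
# Hypothetical resolver table: persistent URI -> current hosting location.
REDIRECTS = {
    "http://purl.example.org/agency/dataset/budget":
        "http://data.example.org/2012/budget.rdf",
}


def resolve(persistent_uri):
    """Return (status, headers) roughly as a PURL-style resolver might respond."""
    target = REDIRECTS.get(persistent_uri)
    if target is None:
        return 404, {}
    return 303, {"Location": target}


print(resolve("http://purl.example.org/agency/dataset/budget"))

# If hosting moves, only the table changes; the published identifier stays the same.
REDIRECTS["http://purl.example.org/agency/dataset/budget"] = (
    "http://newhost.example.org/budget.rdf"
)
print(resolve("http://purl.example.org/agency/dataset/budget"))
```

From a procurement point of view, this is why the identifier scheme can be contracted separately from the hosting service.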

3. LOD Discovery

These are services that support describing, finding and using linked data on the Web. The discovery of data in the Web of Data can be done through Web crawlers, search engines, browsing an index or looking for data embedded in Web documents.


4. LOD Consumption/Application

Publication of linked data contributes to a global data space often referred to as the Linked Open Data Cloud or ‘Web of Data.’ These are services that support the development of applications that use (i.e. consume) this ‘Web of Data.’

5. Management Consulting and Strategic Planning

There is a broad range of management-related services. Examples include briefings that give decision makers a general understanding of the technology, business case and ROI, and strategic planning support (e.g. enterprise linked data, implementation of persistent identifiers, etc.).

6. Formal Education and Training

Various private companies and universities offer training related to linked open data. These offerings vary widely, from high-level informational sessions intended to give managers and decision makers a general understanding, to in-depth, hands-on instruction for technical staff on how to prepare, publish and consume linked data.

Procurement Checklist

Credit: This section of Procurement Best Practices was taken from the Linked Data Cookbook

The following is an outline of questions a department/agency should consider as part of its decision to choose a service provider:

  1. Is the infrastructure accessible and usable from developers’ environment?
  2. Is the documentation aimed at developers comprehensive and usable?
  3. Is the software supported and under active development?
  4. Is there an interface to load data and “follow your nose” through a Web interface?
  5. Can the data be queried programmatically via a SPARQL endpoint?
  6. Does the vendor have reference sites? Are they similar to what you are considering in production?
  7. What is the vendor’s past performance with government agencies or authorities?
  8. Does the vendor provide training for the products or services?
  9. What is the vendor’s Service Level Agreement?
  10. Is there a government approved contract vehicle to obtain this service or product?
  11. Is the vendor or provider an active contributor to Open Source Software, standards groups, activities associated with data.gov, and Linked Open Data projects at the enterprise and/or government level?
  12. Does the department/agency have a published Open Source Policy?
    1. If so, does the vendor or provider comply with the department/agency’s published Open Source Policy?
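For checklist item 5, a technical representative can verify programmatic access with a short script. The following Python sketch builds a query-via-GET request as defined by the SPARQL Protocol (the query is passed in a `query` parameter, and JSON results are requested with the `application/sparql-results+json` media type). The endpoint URL is a hypothetical placeholder, and the sketch assumes the endpoint URL has no query string of its own; it only builds the request, and the lines that would send it are left as comments.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# A generic query that returns any ten triples from the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol query-via-GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# To run it against a real endpoint:
#   import json
#   from urllib.request import urlopen
#   with urlopen(request, timeout=30) as response:
#       results = json.load(response)
```

If a vendor's service cannot answer a request of this shape, that is worth raising during evaluation.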


Security Planning for LOD

Within government agencies, hosting linked data may require submitting a security plan to a security officer for review. While security plan specifics will vary widely based on factors such as hosting environment and software configuration, the process of developing a security plan and getting it approved can be streamlined if the following guidelines and best practices are considered:

Notify your security official of your intent to host linked data (earlier is better)

       - Provide an overview of linked data
       - Describe how you plan to host the data (e.g., cloud, agency data center), implementation timelines
       - Consider including your hosting service/software vendor in discussion(s)

Solicit assistance from the security official:

       - Identify guidance that should be used (e.g. for US Federal Agencies this typically would entail compliance with security control recommendations from NIST Special Publication 800-53)
       - Request clarification regarding specific content/areas that the plan should address
       - Request a system security plan template to ensure the plan is organized to facilitate the review process (if a vendor is contributing information on controls related to their service/software, the vendor needs to adhere to the template in their response)

Security plans typically comprise a set of security controls that describe the physical, procedural, technical and other processes and controls in place to protect information access, availability and integrity, and to avoid, counteract and minimize security risks. These controls typically span several layers, from physical facility security and network and communications to the operating system, software, integration and many other elements. As a result, some common security controls are typically inherited rather than specific or unique to the linked data implementation, such as controls inherited from the hosting environment (cloud hosting provider, agency data center, etc.). Additionally, some security controls will be inherited from the software vendors.

As such, opportunities may exist to streamline the development of a security plan, or conversely, to identify potential project security vulnerabilities and risks, through early engagement with hosting providers, software vendors and others who may be responsible for those common, inherited controls. If the inherited controls meet the recommendations, they can then be assembled following the requisite templates, and the system security plan can be completed through addition of any applicable controls specific or unique to the linked data application's configuration, implementation, processes or other elements described in the security control and security plan guidance.


GLOSSARY

Linked Open Data: A pattern for hyper-linking machine-readable data sets to each other using Semantic Web techniques, especially via the use of RDF and URIs. Enables distributed SPARQL queries of the data sets and a “browsing” or “discovery” approach to finding information (as compared to a search strategy). (Source: Linking Enterprise Data, David Wood, Springer, 2010, p. 286)

Linked Open Data Cloud: Linked Open Data that has been published is depicted in a LOD cloud diagram. The diagram shows connections between linked data sets and color codes them based on data type (e.g., government, media, life sciences, etc.). The diagram can be viewed at: http://richard.cyganiak.de/2007/10/lod/

RDF (Resource Description Framework): A language for representing information about resources in the World Wide Web. RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers, or URIs), and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, and their properties and values. (http://www.w3.org/TR/rdf-primer/)

Semantic Technologies: The broad set of technologies that relate to the extraction, representation, storage, retrieval and analysis of machine-readable information. The Semantic Web standards are a subset of semantic technologies and techniques. (Source: Linking Enterprise Data, David Wood, Springer, 2010, p. 286)

Semantic Web: An evolution or part of the World Wide Web that consists of machine-readable data in RDF and an ability to query that information in standard ways (e.g. via SPARQL).

Semantic Web Standards: Standards of the World Wide Web Consortium (W3C) relating to the Semantic Web, including RDF, RDFa, SKOS and OWL. (Source: Linking Enterprise Data, David Wood, Springer, 2010, p. 287)

SPARQL: SPARQL Protocol and RDF Query Language (SPARQL) defines a standard query language and data access protocol for use with the Resource Description Framework (RDF) data model. (http://msdn.microsoft.com/en-us/library/aa303673.aspx) Just as SQL is used to query relational data, SPARQL is used to query graph, or linked, data.

Uniform Resource Identifiers (URIs): URIs play a key role in enabling linked data. To publish data on the Web, the items in a domain of interest must first be identified. These are the things whose properties and relationships will be described in the data, and may include Web documents as well as real-world entities and abstract concepts. As Linked Data builds directly on Web architecture, the Web architecture term resource is used to refer to these things of interest, which are, in turn, identified by HTTP URIs.

World Wide Web Consortium’s Government Linked Data (W3C/GLD) workgroup: http://www.w3.org/2011/gld/charter

Production Use of Academic and Open Source Software

Linked Data is a relatively recent technology, and has emerged from research in academia and the work of individual “hackers” and enthusiasts, rather than being driven by large IT vendors. Thus, users of the technology may encounter free and open source software components from small-scale, non-commercial or academic developers. Using such components can be very convenient since they are readily available and free of charge. But their use in production systems has a number of risks:

  • The software may not be well-tested and stable
  • The software may not be sufficiently flexible for easy integration into the existing IT environment
  • The developer may lose interest (e.g., a PhD student finishes their degree and moves on), leaving users stranded without improvements or support
  • Development of the software may have been funded from a particular project that has ended, leading to much reduced levels of ongoing development and support
  • Arranging for commercial support of such software can be difficult even if funds are available

The following questions should be answered as part of the evaluation of such components.

  • How long has the software been around? Have there been ongoing releases over a long period of time?
  • How was development of the software funded? Did contributions come from multiple sources/projects?
  • Have different people contributed to the software's development over time? The activity in mailing list archives, code repositories, and bug/issue trackers can be a useful indication.
  • Is there an active user community that relies on the software? Mailing list activity can be a good indication.
  • Is the software being improved? Are features added and bugs fixed? A release version history and the issue tracker can be good indications.
  • Is the software being used? Did other organizations evaluate it and find it acceptable for their purposes?
  • Has the developer made improvements to the software on a contractual/commercial basis before, and is this option still available?