From Government Linked Data (GLD) Working Group Wiki

Latest revision as of 01:35, 15 March 2013

Cookbook for Open Government Linked Data

  Editors: 
  Bernadette Hyland, (3 Round Stones) 
  Boris Villazón Terrazas  (iSOCO, Intelligent Software Components S.A.)

Revision Dec 2011

Purpose of this Document

This wiki page is a W3C Note supporting the Working Group's deliverable for the related Recommendation Track document, outlined in the W3C Government Linked Data Working Group Charter.

Audience

Readers of this document are expected to be familiar with the creation of Web applications, and to have a general familiarity with the technologies involved, but are not expected to have a background in semantic technologies or previous experience with Linked Data Best Practices.

The document is not targeted solely at Web developers; others, such as data curators and publishers, government contractors and staff involved in Open Government initiatives, and tool developers are encouraged to read it.

Scope

The approach in writing this document has been to collate and present the most relevant engineering practices prevalent in the Linked Data development community today and identify those that: a) facilitate the exploitation of Linked Data to enable better search, access and re-use of open government information; or b) are considered harmful and can have non-obvious detrimental effects on the overall quality of data publishing on the Web.

The goal of this document is not to invent or endorse future technologies. However, there are a number of cases where explicitly omitting a Best Practice that referred to an emerging technology on the grounds that it was too recent to have received wide adoption would have unnecessarily excluded a valuable recommendation. As such, some Best Practices have been included on the grounds that the Working Group believes that they will soon become fully qualified Best Practices (i.e., in prevalent use within the development community).

In building a Web application, it is not necessary to implement all Best Practices. Instead, each Best Practice should be considered as a possible measure that might be implemented towards the goal of providing as rich and dynamic an experience as possible via a Web browser and Linked Data client.

Abstract

The World Wide Web of 2011 is a mature and trusted information system, allowing its broad adoption even by laggards. As an information system owned by no one and yet open to vendors, governments and private citizens, the Web has become a natural place to publish information for public dissemination. The wide availability of Web clients, be they on mobile phones, laptop or desktop computers, tablets or game consoles, and the provision of public access services (especially by libraries) has made publication on the Web a preferred way for governments to empower their citizenry, if done in a standards-compliant manner.

The goal of the W3C Linked Data Cookbook is to provide practical guidance to developers and technology managers who are embarking on the process of publishing open government content.

We begin with an overview that outlines the entire process from start to finish, starting with identifying data sets. We then introduce the Linked Data "star" scheme and describe the trade-offs of publishing quickly versus taking the time to model more complex data sets with higher re-use potential. Data in the RDF family of standards is well on its way to becoming Linked Data, but it is not ubiquitous yet. Linked Data principles still need to be applied. This guide is intended to assist in describing how to produce high quality "4 and 5 star Linked Data."

Overview

Many governments have mandated publication of open government data to the public via the Web. The intention is to facilitate the maintenance of open societies and support governmental accountability and transparency initiatives. However, publication of unstructured data on the World Wide Web is in itself insufficient. To realize the goals of efficiency, transparency and accountability, published data must be re-usable: members of the public must be able to find it readily via search, visualize it and consume it programmatically.

This guidance is intended to help data curators and publishers better understand how to best use their time and resources to achieve the noble goals of Open Government. Linked Data principles address many of the data description and data format requirements for realizing the goals of Open Government. Linked Data uses a family of international standards and best practices for the publication, dissemination and reuse of structured data. Linked Data, unlike previous data formatting and publication approaches, provides a simple mechanism for combining data from multiple sources across the Web.

This is a living website that is being updated with best practices to model, create, publish and announce new Open Government Linked Open Data around the world.

A Brief History on Open Government Linked Data

The term "Linked Data" was used by Tim Berners-Lee in a 2006 Linked Data Design note. In 2008, a Linked Open Data Workshop was organized at the 17th WWW conference in Beijing, China. By then, researchers and early adopters were sharing advice on publishing large data sets as Linked Data, searching structured content on the Web, and what was needed in terms of Linked Data applications.

By 2011 there was sufficient momentum behind this approach that over 300 data sets had been published on the Web as Linked Data, peer-reviewed and mainstream developer books had been written on Linked Data, and a W3C Working Group focusing on Government Linked Data had been launched.

In that short time, Linked Data has grown into an almost mainstream activity for many governments around the world. The US Government’s open data site listed twenty-one countries whose governments publish open data regarding the operations of their public sector, with roughly one third of them publishing Linked Data using the data standards of the World Wide Web Consortium. Those numbers show every sign of increasing in the immediate future.

As of September 2011, according to the State of the LOD Cloud analysis published by Chris Bizer (Freie Universität Berlin), Anja Jentzsch (Freie Universität Berlin) and Richard Cyganiak (DERI, NUI Galway), Open Government Linked Data represents 42% of the total volume of statements, yet it contains less than 4% out-links. This means that, compared to other domains such as Life Sciences and Publications with over 27% and 38% out-links, respectively, government data is not highly linked with other data. As of 2011, published Open Government Linked Data is usually "4 star Linked Data". This is our motivation: to help raise that out-link percentage from less than 4% to, say, 30% or greater for government data sets.

These data sets are primarily found through catalog sites. Data formats available via open government data sites fall into the following general categories:

  1. Raw data (e.g., CSV, JSON, PDF, XML, RSS, XLS, XLSX)
  2. Geospatial data (e.g., SHP, KML)
  3. RDF (Turtle, N3, RDF/XML)

Many published government data sets, including those in the ubiquitous comma-separated-value (CSV) format, are not immediately useful because column headings or schema are not defined. While someone looking at a CSV file may be able to derive some value from the column headings, there is no guarantee the meaning of the data has been communicated. That is why the modeling effort described below is so vital.

Where are we heading? At a minimum, we would like to see the vast majority of Open Government Linked Data sets contain the following information: authority (name of organization), title, description, publication dates, terms of license, vocabularies used, and time period/duration of the data, to name some of the key metadata every publisher should associate with their data set.
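
As an illustration only, the metadata checklist above can be expressed as a small check that reports which items a data set description still lacks. The field names and the sample record are assumptions made for this sketch; the text lists the concepts but does not prescribe a schema.

```python
# Hypothetical field names; the text names the concepts, not a schema.
REQUIRED_FIELDS = (
    "authority",
    "title",
    "description",
    "publication_date",
    "license",
    "vocabularies",
    "temporal_coverage",
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Made-up example record with an empty license and no time period.
example = {
    "authority": "Example Agency",
    "title": "Bakery inspections",
    "description": "Inspection results for licensed bakeries.",
    "publication_date": "2011-12-01",
    "license": "",
    "vocabularies": ["http://purl.org/dc/terms/"],
}

print(missing_metadata(example))
```

A publisher could run a check like this before announcing a data set, so that gaps such as a missing license are caught early.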

Ingredients for High Quality Linked Data

The 7 Best Practices for Producing Linked Data

The following best practices are described in this document and listed here for your convenience:

  1. Model the Data
  2. Name things with URIs
  3. Re-use vocabularies whenever possible
  4. Publish human and machine readable descriptions
  5. Convert data to RDF
  6. Specify an appropriate license
  7. Host the Linked Data Set Publicly and Announce it!


Diagrams of the life cycle for data modeling: Linked Data life cycle

@@TODO@@ (Bernadette) Describe the Five Stars of Linked Open Government Data


Step 1 Model the Data

Linked Data domain experts model data without context, whereas traditional modelers typically organize data for specified Web services or applications. Modeling without context better enables data reuse and easier merging of data sets by third parties. Linked Data application logic does not drive the data schema. Below we summarize the modeling process, which can take anywhere from several hours for a simple data set to several weeks for experienced subject matter experts and Linked Data experts working together iteratively. That is a comparatively modest investment given that most organizations spend months or years carefully compiling and curating their data.

A criticism voiced by detractors of Linked Data suggests that Linked Data modeling is too hard or time consuming. The effort of modeling Linked Data should be viewed as the way forward to unlock data and make it more widely available within an organization or on the public Web.

The following is an outline of what you can expect the modeling process to be where the original data resides in a relational database:

  1. Identify:
    1. Obtain a copy of the logical and physical model of the database(s).
    2. Obtain data extracts (e.g. CSV table extracts, data dictionaries) or create data in a way that can be replicated.
    3. Look for real world objects of interest such as people, places, things and locations.
  2. Model:
    1. Sketch or draw the objects on a white board (or similar) and draw lines to express how they are related to each other.
    2. Investigate how others are already describing similar or related data. Reuse common vocabularies to facilitate data merging and reuse.
    3. If you’re using existing data, look for duplication and (de)normalize the data as necessary.
    4. Use common sense to decide whether or not to make a link.
    5. Put aside immediate needs of any application.
  3. Name:
    1. Use URIs as names for your objects.
    2. Think about time and how the data will change over time.
  4. Test:
    1. Be sure to test the assumptions in the schema with subject matter experts familiar with the data.
    2. Re-factor the schema and reflect changes in both human and machine readable formats. Diagrams of the objects and relationships are useful and recommended: they quickly communicate which items are reflected in the model as first class objects and show how those objects are related.

Linked Data domain experts typically model two or three exemplar objects to begin the process. During this process, domain experts figure out the relationships and identify how each object relates to the real world, initially drawing on a large white board or collaborative wiki site.

As you iterate, use a graphing tool to organize the objects and relationships and produce an electronic copy for others to review. It bears repeating: during the modeling process, do not contemplate how an application will use your data. Instead, focus on modeling real world things that are known about the data and how they are related to other objects. Take the time to understand the data and how the objects represented in the data are related to each other.
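
To make the whiteboard exercise concrete, the objects and the lines between them can be written down as subject, predicate, object statements. The following is a minimal sketch; the objects, relationship names and the helper function are invented for illustration and are not part of any published model.

```python
from collections import defaultdict

# Hypothetical objects and relationships, as sketched on a whiteboard.
triples = [
    ("bakery/acme", "type", "Bakery"),
    ("bakery/acme", "locatedIn", "city/springfield"),
    ("bakery/acme", "inspectedBy", "person/j-doe"),
    ("person/j-doe", "type", "Inspector"),
    ("city/springfield", "type", "City"),
]


def relationships(statements):
    """Group outgoing links by subject, ignoring type statements."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        if predicate != "type":
            graph[subject].append((predicate, obj))
    return dict(graph)


for subject, links in relationships(triples).items():
    print(subject, "->", links)
```

Nothing in this structure refers to a particular application; it records only what the things are and how they relate, which is the point of modeling without context.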

Step 2 Name Things with URIs

Next, name the objects in the data modeled in Step 1 above. The URI naming strategy deserves careful consideration, just like any form of public communication from your organization.

The reader is encouraged to review the guidance on Designing URI Sets provided by the UK Cabinet Office. This guide provides principles for choosing the right domain for URI sets, the path structure, coping with change, and machine and human-readable formats. Additionally, the Linked Open Government Data (LOGD) team at Rensselaer Polytechnic Institute in New York documented URI Guidance for URI naming schemes.

The following is intended as a primer for designing public sector URIs:

 Use HTTP URIs.

URIs provide names for resources. You can say things about resources. Everyone knows what to do with HTTP URIs. They are a quick, easy and scalable look-up mechanism.

  Use clean, stable URIs.

Try to abstract away from implementation details.

  Use a Domain that You Control.

It is important to select a DNS domain that your department or agency controls. In this way, you can commit to its permanence and provide data at this address. It is bad etiquette to use someone else’s domain that you do not own or control.

Plan to coordinate with the group that handles the government organization's top level DNS. You may be required to prepare and present a security plan if you're hosting government content. In the US Federal Government, these guidelines are defined by the National Institute of Standards and Technology (NIST) under the Federal Information Security Management Act (FISMA).

FISMA and similar acts require production Web servers and applications to comply with the relevant standards for:

  1. Categorizing information
  2. Security requirements for information
  3. Security controls for information systems
  4. Assessing security controls

  Use Natural Keys.

Use natural keys to make your URIs readable by people. This is not required, but is a very useful courtesy to those wishing to reuse your data. Take some care in defining these. Don’t be cryptic. For example, nobody can guess what http://.../984d6a means.

Use containers in a URI path to help keep natural keys separate. Containers provide a logical place to put lists. For example, http://.../baked_goods/bread/rye-12, http://.../baked_goods/bread/rye-13
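
The following sketch shows one way to mint such URIs from container names and a natural key. The base domain, the helper names and the choice of hyphens as separators are assumptions for illustration (the example above uses an underscore in "baked_goods"; either convention works if applied consistently).

```python
import re

# Hypothetical base; in practice this is a domain your agency controls.
BASE = "http://data.example.gov/id"


def slug(text):
    """Lowercase a label and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def mint_uri(containers, natural_key):
    """Join container segments and a natural key into a slash URI."""
    path = "/".join(slug(part) for part in containers)
    return f"{BASE}/{path}/{slug(natural_key)}"


print(mint_uri(["Baked Goods", "Bread"], "Rye 12"))
```

Keeping the minting rule in one small function makes it easier to apply the same naming policy across every data set an agency publishes.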

  Use Neutral URIs.

A neutral URI contains meaningful, natural or system-neutral keys. One can route these URIs to any implementation, meaning they can live forever. Therefore, don’t include version numbers or technology names. Neutral URIs are also a wise choice as you’re not advertising a specific technology choice that might change or where security vulnerabilities may exist.
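
A rough check for non-neutral URIs might look like the sketch below. The patterns are illustrative assumptions, not an exhaustive or authoritative list, and the function name is invented for this example.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only; extend them to suit your own environment.
TECH_HINT = re.compile(r"\.(php|aspx?|jsp|cgi)$|/cgi-bin/", re.IGNORECASE)
VERSION_HINT = re.compile(r"(^|/)v\d+(\.\d+)*(/|$)", re.IGNORECASE)


def neutrality_problems(uri):
    """List reasons why a URI path may not be implementation-neutral."""
    path = urlparse(uri).path
    problems = []
    if TECH_HINT.search(path):
        problems.append("technology name")
    if VERSION_HINT.search(path):
        problems.append("version number")
    return problems


print(neutrality_problems("http://data.example.gov/v2/bread.php"))
print(neutrality_problems("http://data.example.gov/id/bread/rye-12"))
```

A check like this is best treated as a prompt for human review rather than a gate, since a path segment such as "v2" may occasionally be a legitimate natural key.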

  Use of hash URIs should be done cautiously.

The hash URI strategy builds on the characteristic that URIs may contain a special part that is separated from the base part of the URI by a hash symbol (#). This special part is called the fragment identifier. Fragment identifiers are not sent to a server. This limits server side decision making, and limits granularity of the response. Fragment identifiers enable such URIs to be used to identify real-world objects and abstract concepts, without creating ambiguity. Use fragment identifiers with caution.

Hash URIs are most useful for Linked Data schemas, where a URI for a schema term might resolve to a human-readable Web page (e.g. http://example.com/id/vocabulary#linked_data). Most Linked Data should use so-called Slash URIs that avoid fragment identifiers (e.g. http://example.com/id/vocabulary/linked_data/).
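
The difference can be seen by splitting each URI the way an HTTP client does before making a request. This sketch uses the two example URIs above and Python's standard urllib.parse.urldefrag function; it makes no network requests.

```python
from urllib.parse import urldefrag

hash_uri = "http://example.com/id/vocabulary#linked_data"
slash_uri = "http://example.com/id/vocabulary/linked_data/"

for uri in (hash_uri, slash_uri):
    # urldefrag separates the part sent to the server from the fragment.
    requested, fragment = urldefrag(uri)
    print(uri)
    print("  sent to server:", requested)
    print("  kept by client:", fragment or "(none)")
```

With the hash URI, the server only ever sees the vocabulary document as a whole, which is why it cannot tailor its response to the individual term.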

  Use dates sparingly in URIs.

Dates provide a way to show data from a period in time. They are most useful for things like statistics, regulations, specifications, samples or readings from a given period.

The W3C is a well known exception to this rule of thumb. It uses a convention for URIs related to the dates on which working groups are established or documents are standardized. For example, the W3C RDF Recommendation was published 10 February 2004, so by convention its path includes the date, as follows: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/. This approach is acceptable; however, it should be used with careful consideration of how things may change over time.

Step 3 Re-use Vocabularies Whenever Possible

A relatively small number of vocabularies are routinely used to describe people, places, things and locations. Any given Linked Data set may include terms from an existing and widely used vocabulary. This could include using terms from Dublin Core, which describes metadata about published works, or Friend-of-a-Friend (FOAF), used to describe people and their relationship to other people, or GeoNames, a geographical database that covers all countries and contains over ten million geographical names.

  Use existing authoritative vocabularies that are in widespread usage 
  to describe common types of data.

In many cases, the authoritative vocabulary is maintained by someone else, allowing many to benefit from the labor of a few. Knowing how to use the commonly used vocabularies properly will help your organization find natural reuse potential and help identify areas for cross-linking with other Linked Data sets.

Remember, the benefits ascribed to Linked Data are realized precisely because the data curator and publisher take the time to identify existing authoritative vocabularies and to link their data to other data on the Web to provide context.

  Presume re-use.  

In the traditional data modeling world, documentation is often not kept current after the system is launched, nor is it routinely published on the Web. However, in the Linked Data community, reuse is presumed. It is through use of the URI and the use of authoritative vocabularies that data curators and publishers are able to publish information more quickly and reduce costs of data integration.

Some vocabulary guidelines for consideration in your project:

  1. To name things, use rdfs:label, foaf:name, skos:prefLabel
  2. To describe people, use FOAF, vCard
  3. To describe projects, use DOAP
  4. To describe Web pages and other publications, use dc:creator and dc:description
  5. To describe an RDF schema/vocabulary/ontology, use a VoID description
  6. To describe addresses, use vCard
  7. To model simple data, use RDF, RDFS, custom vocabularies
  8. To model existing taxonomies, use SKOS
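
The prefixed names in the guidelines above are shorthand for full URIs. As a small sketch, the mapping below expands them using the namespace URIs commonly published for these vocabularies; the function name and the error handling are choices made for this example.

```python
# Namespace URIs as commonly published for each vocabulary.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "doap": "http://usefulinc.com/ns/doap#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "void": "http://rdfs.org/ns/void#",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'foaf:name' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


for term in ("rdfs:label", "foaf:name", "skos:prefLabel", "dc:creator"):
    print(term, "=", expand(term))
```

Because the expanded URIs are shared across publishers, two data sets that both use foaf:name can be merged without any agreement beyond the vocabulary itself.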

The following summary of vocabularies is relevant for government agencies. It is not exhaustive. The typical practice is to combine a few terms from several different vocabularies, much like a smorgasbord. You may still need to define a few more customized terms depending upon the specific needs of your organization.

  1. The WGS84 Geo Positioning vocabulary defines terms for lat(itude), long(itude) and other information about spatially-located things, using WGS84 as a reference datum.
  2. The Bibliographic Ontology (BIBO) provides main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc).
  3. The Creative Commons Rights Expression Language defines terms for describing copyright licenses in RDF. Although data created by government agencies is generally under legislated ownership, the CC licenses are often used by government contractors to ensure that government agencies retain a right to use the material.
  4. The Description of a Project vocabulary (DOAP, pronounced “dope”) describes software projects, with particular emphasis on Open Source projects.
  5. The Dublin Core Metadata Initiative (DCMI) Metadata Terms defines general metadata attributes for published works including title, creator, date, subject and publisher.
  6. The Friend-of-a-Friend (FOAF) vocabulary defines terms for describing people, their activities (collaboration) and their relations to other people and objects. There are extensions to FOAF for the Social Web. These help describe how one relates to Facebook, Flickr, LinkedIn, etc.
  7. The GeoNames Ontology is a geographical database containing over 10 million geographical names.
  8. GoodRelations is an ontology for e-commerce that defines terms for describing products, price, and company data. The goal is to increase the visibility of products and services in search engines, recommender systems, and mobile or social applications.
  9. The Object Reuse and Exchange vocabulary defines standards for the description and exchange of aggregations of Web resources. These aggregations, called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video. Used by libraries and media publishers.
  10. The Semantically-Interlinked Online Communities vocabulary (SIOC, pronounced “shock”) is designed for developers to describe information about online community sites, such as users, posts and forums.
  11. The vCard vocabulary is a file format for address books. It is an older but popular address book format that has since been ported to RDF and includes the basics of what is needed for representing addresses internationally.
  12. The Vocabulary of Interlinked Datasets (VoID) defines key metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data, with applications ranging from data discovery to cataloging and archiving of datasets. One should always publish a VoID description of your vocabulary so others can reuse it.

  If no existing or authoritative vocabulary for your subject exists, 
  follow basic conventions.

If you determine that there is no existing or authoritative vocabulary for your subject domain, create one or more, following some basic conventions. There are several good books that discuss effective modeling in RDFS and OWL, and we encourage you to refer to them for guidance.

Boris

Some links

Step 4 Publish Human and Machine Readable Descriptions

  Make data "self-describing"

Consumers of Linked Data do not have the luxury of talking to a database administrator who could help them understand a schema. Therefore, a best practice for publishing a Linked Data set is to make it “self-describing.”

Self-describing data means that information about the encodings used for each representation is provided explicitly within the representation.

Reusability is provided to others by modeling data outside of any one application’s context. Validation of the data is specific to an application’s context. Said another way, the application using a data set is responsible for validating the data. The ability for Linked Data to describe itself, to place itself in context, contributes to the usefulness of the underlying data.

  Include human-readable descriptions of your data as a Web page, 
  in addition to your RDF data files.

By making available both human-readable and machine-readable formats that are self-contained, you will have moved your agency closer to achieving the goals of Open Government and making the data truly available for reuse.

  Publish a VoID description of your RDF dataset. 

VoID is the de facto standard vocabulary for describing Linked Data sets because it helps users find the right data for their tasks. A useful guide with examples for using VoID may be found on http://code.google.com/p/void-impl/wiki/ExampleVoidQueries
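
As a minimal sketch of what such a description can contain, the function below renders a few VoID and Dublin Core statements as Turtle text. The function names, the data set URI, the dump location and the license URI are placeholders assumed for this example; a real description would normally carry more properties.

```python
def turtle_literal(text):
    # Quote a string as a Turtle literal, escaping backslashes and quotes.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def void_description(dataset_uri, title, triples, dump_url, license_uri):
    """Render a small VoID description as a Turtle string."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f"    dcterms:title {turtle_literal(title)} ;",
        f"    dcterms:license <{license_uri}> ;",
        f"    void:dataDump <{dump_url}> ;",
        f"    void:triples {int(triples)} .",
    ])


# All URIs below are hypothetical placeholders.
print(void_description(
    dataset_uri="http://data.example.gov/dataset/bakeries",
    title="Bakery inspections",
    triples=1500,
    dump_url="http://data.example.gov/dumps/bakeries.nt",
    license_uri="http://data.example.gov/license",
))
```

Publishing the resulting file alongside the data gives both people and crawlers a single place to learn the size, license and download location of the data set.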

Step 5 Convert Data to RDF

Once you have a schema that you are satisfied with, the next step is to convert the source data into a Linked Data representation or serialization.

There are several RDF serializations, including Turtle, Notation-3 (N3), N-Triples, XHTML with embedded RDFa, and RDF/XML.

One RDF serialization is not universally "better" than any other; they are different representations of the same standard. Some serializations may be better than others for specific purposes; e.g., Turtle can be parsed more quickly and is considered by many to be more readable by humans than RDF/XML.

A best practice is to validate a representative sample of your data after converting it into one or more of the RDF serialization formats. RDF validation helps to avoid unnecessary errors when the data is loaded into an RDF database.
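
For a quick first pass over an N-Triples sample, a line-by-line check such as the following can catch obvious mistakes like unquoted literals. This is a deliberately simplified sketch and not a full implementation of the N-Triples grammar; a dedicated RDF parser or validator should be used for real validation. All names and sample data are invented for illustration.

```python
import re

# Simplified patterns; they approximate, but do not fully implement,
# the N-Triples grammar.
IRI = r"<[^<>\"{}|^`\\\s]*>"
BNODE = r"_:[A-Za-z0-9]+"
LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI + r"|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?"
)
TRIPLE = re.compile(
    rf"^\s*(?:{IRI}|{BNODE})\s+{IRI}\s+(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*$"
)


def invalid_lines(text):
    """Return (line_number, line) pairs that do not look like N-Triples."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are allowed
        if not TRIPLE.match(line):
            bad.append((number, line))
    return bad


sample = (
    "<http://data.example.gov/id/bread/rye-12> "
    '<http://www.w3.org/2000/01/rdf-schema#label> "Rye loaf"@en .\n'
    "<http://data.example.gov/id/bread/rye-12> "
    "<http://purl.org/dc/terms/title> Rye loaf .\n"
)

for number, line in invalid_lines(sample):
    print(f"line {number} looks invalid: {line}")
```

In the sample, the second statement is reported because its object is neither a URI nor a quoted literal.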

Conversion: Triplification vs. Modeling

Conversion approaches fall into three categories:

  1. Automatic conversion, sometimes called triplification
  2. Partial scripted conversion
  3. Modeling by human and subject matter experts, followed by scripted conversion

When converting content to RDF, it is not considered good practice to convert a hierarchical data set into RDF/XML with few or no links. Per Tim Berners-Lee, “it is bad etiquette not to link to related external material. The value of your own information is very much a function of what it links to, as well as the inherent value of the information within the web page. So it is also in the Semantic Web.”

In specific cases, automatic conversion by script, sometimes called “triplification” is a valid strategy to help break the back of large conversions, for example large amounts of sensor data or geospatial information. However, automatic triplification often does not produce high quality results. Skipping the important modeling step and converting solely by script may technically produce RDF content but without offering benefit in terms of re-use.

  Converting hierarchical data by producing one triple per record misses
  the key premise of Linked Data re-use.

The preferable approach is to include one or more subject matter experts and domain experts to review the data, logical and relational schemas. This is no different than what data modeling professionals have done for decades, with the exception that Linked Data experts name objects using URIs and openly publish human and machine-readable schemas.

Linked Data experts can model subjects, predicates and objects for the data set, identify existing vocabularies and define custom requirements to develop a reasonable object modeling guide. The modeling guide should be documented and reviewed with the subject matter experts and business stakeholders. It will form part of the human-readable documentation later produced and published online as part of the Linked Data.

The collaboration need not be complex or particularly technical. Ideally, business owners will participate in the process and contribute to the discussion on cross linking content. The focus should be on the data and what it represents.

  Avoid the temptation to structure the data for a specific use or application.

Collaboratively identify the objects and how they relate to other objects. There is plenty of time in the future to do complex ontology development; “walk before you run.”
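
The sketch below illustrates a scripted conversion that follows a modeling decision rather than mechanically copying columns: each row gets a URI, a label, and a link to an external resource instead of a plain-text city name. The base URI, the custom "locatedIn" term, the column names and the external URI are all hypothetical and stand in for whatever your modeling guide specifies.

```python
import csv
import io

# Hypothetical URIs; substitute those defined in your modeling guide.
BASE = "http://data.example.gov/id/bakery/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
LOCATED_IN = "http://data.example.gov/def/locatedIn"


def escape(text):
    # Escape backslashes and double quotes for an N-Triples literal.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text):
    """Convert CSV rows into N-Triples lines using the modeled mapping."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['id']}>"
        label = escape(row["name"])
        lines.append(f'{subject} <{RDFS_LABEL}> "{label}" .')
        # Link to an external URI rather than repeating the city as text.
        lines.append(f"{subject} <{LOCATED_IN}> <{row['city_uri']}> .")
    return lines


sample_csv = (
    "id,name,city_uri\n"
    "acme,Acme Bakery,http://external.example.org/place/springfield\n"
)

for line in rows_to_ntriples(sample_csv):
    print(line)
```

The link to an externally maintained URI is what distinguishes this output from a purely automatic triplification of the same row.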

Step 6 Specify an Appropriate License

Publishing existing content as Linked Data still requires specification of an appropriate license. Licensing of data is a complex issue, and organizations dedicated to licensing, such as Creative Commons, are a valuable resource for learning about the issues and choosing the appropriate license.

Summary of key Open Government License sites:

  • The UK Open Government License was created to enable any public sector information holder to make their information available for use and reuse under its terms.
  • The Open Database License (ODbL) is an open license for databases and data that includes explicit attribution and share-alike requirements.
  • The Public Domain Dedication and License (PDDL) is a document intended to allow you to freely share, modify, and use particular data for any purpose and without any restrictions.
  • The Open Data Commons Attribution License is a database-specific license requiring attribution for databases.
  • The Creative Commons Licenses are several copyright licenses that allow the distribution of copyrighted works.
  • The Open License is compatible with the following licenses:
    • Open Government License (OGL) of the UK
    • Creative Commons Attribution 2.0 (CC-BY 2.0)
    • Open Data Commons Attribution (ODC-BY)

Step 7 Host Linked Data Publicly and Announce it!

To publish Linked Data means the data set must be physically copied onto a publicly accessible Web server. There are options ranging from in-house hosting to vendors offering hosting and support as a managed service.

  1. Search for “linked data hosting” to get a sense of who provides commercial hosting for Linked Data.
  2. Government Procurement officers may be able to provide a list of commercial suppliers who offer “Linked Data services”.
  3. Check the open, non-member affiliated W3C Community Directory of vendors providing Linked Data services.

Once you have created data or converted it into a Linked Data format, it is time to serve it. Publication on the Web is your way to say, “Dinner is ready. Come and get it!” From a communications strategy perspective, publishing Linked Data is not unlike publishing to the Web of documents. It should be considered a form of public communication from your agency.

Starting in 2012, government agencies are beginning to define Web data publication policies. Data policies should be in human-readable form and reference privacy, data quality and retention, frequency of updates, treatment of data through secondary sources, citation and reference, public participation, and applicability of the data policy. For example, a data policy for government content might say something like, “All data sets accessed through this site are confined to public information and must not contain National Security information as defined by statute and/or Executive Order, or other information/data that is protected by other statute, practice, or legal precedent. The supplying Department/Agency is required to maintain currency with public disclosure requirements.” This data policy happens to be for the US data.gov site. In due course, this data policy will no doubt be extended to address the burgeoning growth of Open Government data being published on the Web.


Criteria for being added to the Linked Data Cloud

The best practice is to test your data and confirm that it complies with the Linked Data Principles. Next, confirm that your data meets the criteria to join the Linked Open Data cloud. Richard Cyganiak maintains a site that outlines those criteria in the following check-list:

  1. There must be resolvable http:// (or https://) URIs.
  2. They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples).
  3. The data set must contain at least 1000 triples.
  4. The data set must be connected via RDF links to a data set in the LOD diagram. This means, either your data set must use URIs from the other data set, or vice versa. An arbitrary recommendation is to have at least 50 such inter-links.
  5. Access to the entire data set must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
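
The countable items in this check-list can be tested before you apply. The following is a minimal sketch, not an official validator: the function name, its arguments and the host names used to call it are assumptions made for illustration. It only checks the URI scheme, the triple count and the number of outbound links. Whether the URIs actually resolve to RDF still has to be tested against the live server.

```python
from urllib.parse import urlparse

MIN_TRIPLES = 1000     # criterion 3
MIN_INTERLINKS = 50    # the "arbitrary recommendation" in criterion 4


def check_lod_criteria(triples, own_host, lod_hosts):
    """Return pass/fail flags for the criteria that can be counted offline.

    triples   -- iterable of (subject, predicate, object) URI strings
    own_host  -- host name of the data set being checked (hypothetical)
    lod_hosts -- host names of data sets already in the LOD diagram
    """
    triples = list(triples)

    # Criterion 1 (partial): subjects must at least be http:// or https:// URIs.
    # This does not prove that they resolve.
    http_uris = all(
        urlparse(subject).scheme in ("http", "https")
        for subject, _, _ in triples
    )

    # Criterion 4: count triples that link from our data set into another one.
    interlinks = sum(
        1
        for subject, _, obj in triples
        if urlparse(subject).netloc == own_host
        and urlparse(obj).netloc in lod_hosts
    )

    return {
        "http_uris": http_uris,
        "enough_triples": len(triples) >= MIN_TRIPLES,
        "enough_interlinks": interlinks >= MIN_INTERLINKS,
    }
```

The function counts only links that point outward from your data set. Criterion 4 is also met when another data set links to yours, which cannot be seen from your own triples.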

Announcing a New Linked Data Set

With the hard work of modeling, vocabulary selection, minting URIs, converting the data and validating it now done, meet with the communications staff and managers in your organization who support Open Government initiatives. Consider publishing a press release and blog posts announcing your new data set’s public availability.

This is a rapidly evolving area and the reader is encouraged to review the latest recommendations from the W3C Government Linked Data Working Group as one current source of information on applicable best practices for government Linked Data sets. The following is general advice:

  1. Publish a human-readable description of the data;
  2. Publish a machine-readable description of the data set using VoID (the Vocabulary of Interlinked Datasets);
  3. List your data set on CKAN, an open registry of data and content packages. See the Guidelines for Collecting Metadata on Linked Datasets in CKAN for further details. Once listed, the data set will be reviewed, added to the CKAN lodcloud group, and included in the next version of the LOD diagram;
  4. Submit your data set to semantic search engines such as PingTheSemanticWeb (PTSW), Sindice, and Swoogle to help people find your published Linked Data;
  5. Inform the Linked Data developer community mailing list of the existence of the data set;
  6. Announce your data set to search engines by opting in where required, adding RDFa hints for improved layout in search results; and
  7. Include a SPARQL endpoint for all or some of your data, if possible. Making RDF dumps available is strongly recommended to minimize the burden of web crawlers on the SPARQL endpoint.
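
As an illustration of the VoID description in item 2, the sketch below writes a small description in Turtle. It is one possible shape, not a required one. The function names and the example URIs are assumptions; the terms void:Dataset, void:triples, void:sparqlEndpoint, void:dataDump, dcterms:title, dcterms:description and dcterms:license come from the VoID and Dublin Core vocabularies.

```python
def _turtle_string(text):
    """Quote a Python string as a Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"%s"' % escaped


def void_description(dataset_uri, title, description, triple_count,
                     sparql_endpoint=None, data_dump=None, license_uri=None):
    """Return a VoID description of one data set as a Turtle document."""
    properties = [
        ("a", "void:Dataset"),
        ("dcterms:title", _turtle_string(title)),
        ("dcterms:description", _turtle_string(description)),
        ("void:triples", str(int(triple_count))),
    ]
    # Optional access points and license; include only what you publish.
    if sparql_endpoint:
        properties.append(("void:sparqlEndpoint", "<%s>" % sparql_endpoint))
    if data_dump:
        properties.append(("void:dataDump", "<%s>" % data_dump))
    if license_uri:
        properties.append(("dcterms:license", "<%s>" % license_uri))

    prefixes = (
        "@prefix void: <http://rdfs.org/ns/void#> .\n"
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
    )
    body = " ;\n".join("    %s %s" % pair for pair in properties)
    return prefixes + "<%s>\n%s .\n" % (dataset_uri, body)
```

Calling void_description("http://example.gov/dataset/budget", "Budget", "Annual budget figures", 1200) returns a Turtle document that can be saved next to the data set and linked from its human-readable description.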

Sharing Linked Data via SPARQL

In many cases, government agencies or authorities may wish to provide a programmatic interface to their published data. Controlled access to the RDF data sets is achieved by providing a SPARQL endpoint.

Note the very important words “controlled access”. Few sites allow unfettered access via a SPARQL endpoint because a poorly constructed SPARQL query could take down a server, much as poorly constructed SQL queries can crash a relational database. An endpoint may be available only to authenticated users or, if it is publicly available, limits may be put on the query syntax or the size of the result set. A SPARQL endpoint allows Linked Data clients to issue queries against a published URL and get back a SPARQL results document in an XML format.
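
One way to limit the size of the result set is to cap the LIMIT clause of each incoming query before it runs. The sketch below is deliberately naive and is offered only to show the idea: the function name and the cap of 1000 rows are assumptions, and the regular expression inspects only the first LIMIT it finds, so it can be fooled by subqueries or by the word LIMIT inside a string literal. A production endpoint would rely on the limits built into its SPARQL server or on a real query parser.

```python
import re

MAX_ROWS = 1000  # hypothetical cap chosen by the publisher

_LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)


def cap_result_size(query, max_rows=MAX_ROWS):
    """Return the query with its LIMIT added or lowered to max_rows."""
    match = _LIMIT_RE.search(query)
    if match is None:
        # No LIMIT given: append one so the result set is bounded.
        return query.rstrip() + "\nLIMIT %d" % max_rows
    if int(match.group(1)) > max_rows:
        # LIMIT is too generous: replace it with the cap.
        return query[:match.start()] + "LIMIT %d" % max_rows + query[match.end():]
    # LIMIT is already within the cap: leave the query untouched.
    return query
```

Capping the result size does not bound the cost of evaluating the query itself, so a query timeout is usually needed as well.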

Some government agencies have one or more SPARQL endpoints, allowing people to perform searches across their data. For example, the UK Government allows access via data.gov.uk/sparql. It provides reference data that covers the central working of government, including organizational structures, all available as RDF. Many sites host a Web-based text entry form that allows you to enter a query and immediately get results back. The US Government updated data.gov to include increased support for visualizations, in addition to allowing for downloads in various formats. The recently updated data.gov site does not appear to have a SPARQL endpoint as of this writing.

As with any form of database, there are performance considerations associated with running queries. Seek the advice of a Linked Data expert as you prepare your ’go live’ data publishing strategy. Together, work through use cases and the audience for your data. There are decisions around utilization of servers, access, backup and fail-over that are important to consider as part of the organization’s “social contract”, as well as production support commitment.


International Standards Compliance

  Serve “5-star” Linked Data, whenever possible.

It should be no surprise that we emphasize the importance of supporting relevant standards. These include the RDF family of standards, including the SPARQL Query Language, as well as compliance with the Linked Data principles discussed above. Of particular importance is provision of a SPARQL endpoint and support for the SPARQL query language specification (currently SPARQL v1.1). If a vendor provides variations to the standard, adopting its non-standard features raises the risk of vendor lock-in.

RESTful APIs are very important, but are not sufficient alone.


Serving Linked Data Correctly

When serving RDF data, it is very important to generate the correct MIME type. Some less experienced service providers do not properly configure their Web servers to correctly serve RDF data. This becomes an important criterion when choosing a service provider. The Web’s HTTP protocol uses Multipurpose Internet Mail Extensions (MIME), originally developed for multimedia attachments to electronic mail messages, to identify content types. Linked Data uses MIME content types to identify whether a response is intended to be read by humans or machines. The server looks at the MIME types that the client lists in its request (the HTTP Accept header) in order to provide the correct content type in its response.
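
The selection step can be sketched as follows. This is a simplified illustration and not a complete implementation of HTTP content negotiation: the function name and the list of supported types are assumptions, and media-type parameters other than q are ignored. The MIME types shown (text/html, text/turtle, application/rdf+xml, application/n-triples) are the registered types for HTML, Turtle, RDF/XML and N-Triples.

```python
# Types this hypothetical server can produce, in order of server preference.
SUPPORTED = (
    "text/html",
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
)


def choose_content_type(accept_header, supported=SUPPORTED):
    """Pick a response MIME type for an Accept header, or None if none fits."""
    if not accept_header.strip():
        # No Accept header means the client accepts anything.
        return supported[0]

    ranges = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by highest quality first, then by order of appearance.
        ranges.append((-quality, position, media_type))

    for negative_quality, _, media_type in sorted(ranges):
        if negative_quality == 0:
            continue  # q=0 marks a type the client refuses
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "application/"
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
    return None
```

A browser that asks for text/html receives the human-readable page, while a Linked Data client that asks for text/turtle receives RDF from the same URI. When the function returns None, the server would normally answer with HTTP 406 (Not Acceptable).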

Mechanism for Updating Linked Data sets

A knowledgeable vendor should be able to explain their ability to serve “5-star” Linked Data and knowledgeably discuss any variations or limitations. The cost and ease of managing infrastructure will be a factor in deciding between local deployment and software-as-a-service, as discussed above. In addition, a platform’s ease of use is of critical importance. If it isn’t easy to refresh Linked Data, it will become stale.

Your Social Responsibility as a Data Publisher

Publishers of Linked Data enter into an implicit social contract with users of their data. A problem on the Web is that it can be difficult to determine how much your information may matter to users. Publishers should feel a responsibility to maintain their data, to keep it fresh and up to date, to ensure its accuracy to the greatest degree possible and to repair reported problems. Publishers should assign a contact person or people to respond to enquiries via some common mechanism such as electronic mail or even telephone. If reuse is a priority, then following best practices such as modeling your data as high quality Linked Data, carefully considering your URI strategy and publishing VoID descriptions will form the foundation of your Open Government initiatives. Ensuring that your Linked Open Data set remains available where you say it will be is critical.

If you move or remove data that is published to the Web, you may break third party applications or mashups without knowing. This is considered rude, and avoiding it is the basis for the social contract. A good way to avoid causing HTTP 404 (Not Found) errors is for your organization to implement a persistence strategy.
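
One small part of a persistence strategy is to keep a record of every URI that has moved and to answer requests for the old address with a redirect instead of an error. The sketch below illustrates that idea only. The function name, its arguments and the example paths are assumptions, and a real deployment would normally express the same table in the Web server’s own redirect configuration.

```python
def resolve(path, live_paths, moved_to, withdrawn=frozenset()):
    """Return (HTTP status, location) for a requested path.

    live_paths -- set of paths that are currently served
    moved_to   -- dict mapping each old path to the path that replaced it
    withdrawn  -- set of paths that were deliberately removed
    """
    if path in live_paths:
        return 200, path

    # Follow the chain of moves, guarding against accidental loops.
    target, seen = path, set()
    while target in moved_to and target not in seen:
        seen.add(target)
        target = moved_to[target]

    if target != path and target in live_paths:
        return 301, target  # Moved Permanently: clients can update their links
    if path in withdrawn:
        return 410, None    # Gone: removal was intentional
    return 404, None        # Not Found: the path was never known
```

With moved_to set to {"/agency/42": "/id/agency/42"} and "/id/agency/42" in live_paths, a request for "/agency/42" returns (301, "/id/agency/42"), so existing mashups keep working after the data is reorganized.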

It is beyond the charter of this working group to describe and recommend appropriate licenses for Open Government content published as Linked Data. However, it is best practice to explicitly attach a license statement to each data set. Governments typically define ownership of works produced by government employees or contractors in legislation.

For example, the US Government designates information produced by civil servants as a U.S. Government Work, whereas contractors may produce works under a variety of licenses and copyright assignments. U.S. Government Works are not subject to copyright restrictions in the United States. It is critical for US government officials to know their rights and responsibilities under the Federal Acquisition Regulations (especially FAR Subpart 27.4, the Contract Clauses in 52.227-14, -17 and -20 and any agency-specific FAR Supplements) and copyright assignments if data is produced by a government contractor.

Similarly, the UK and many former Commonwealth countries maintain the concept of Crown Copyright. It is important to know who owns your data and to say so. Additional work around the recording of legal implications and licensing may be undertaken by the W3C Government Linked Data Working Group in coming years. It is recommended that governmental agencies publishing Linked Data review the Recommendations produced by the W3C.