223 Best Practices URI Construction

From Government Linked Data (GLD) Working Group Wiki
Revision as of 20:50, 14 March 2012 by Mpendlet (Talk | contribs)


Best Practices: URI Construction


Purpose of this wiki: This page provides a collaboration space for drafting guidance on creating URIs for use in government linked data.

Status

  • 21st Feb 2012 - Rewritten principles and IRIs note (Dani)
  • Feb 2012 - Preparation for inclusion in Editors Draft Best Practices FPWD
  • Dec 2011 - Initial revisions by Ghislain, Boris, Dani, JohnE

See Also

Guidance will be produced not only for minting URIs for governmental entities, such as schools or agencies, but also for vocabularies, concepts, and datasets.

Design principles

The Web uses URIs (Uniform Resource Identifiers) as a single global identification system. The global scope of URIs promotes large-scale "network effects". To benefit from the value of Linked Data, governments and governmental agencies need to identify their resources using URIs. This section provides a set of general principles aimed at helping government stakeholders to define and manage URIs for their resources.


Use HTTP URIs
What it means: To benefit from and increase the value of the World Wide Web, governments and agencies SHOULD provide HTTP URIs as identifiers for their resources. There are many benefits to participating in the existing network of URIs, including linking, caching, and indexing by search engines. As stated in [LDPrinciples], HTTP URIs enable people to "look-up" or "dereference" a URI in order to access a representation of the resource identified by that URI.


Provide at least one machine-readable representation of the resource identified by the URI
What it means: In order to enable HTTP URIs to be "dereferenced", data publishers have to set up the necessary infrastructure elements (e.g. TCP-based HTTP servers) to serve representations of the resources they want to make available (e.g. a human-readable HTML representation or a machine-readable RDF/XML representation). A publisher may supply zero or more representations of the resource identified by a URI. However, there is a clear benefit to data users in providing at least one machine-readable representation. More information about serving different representations of a resource can be found in Cool URIs for the Semantic Web.


A URI structure will not contain anything that could change
What it means: It is good practice that URIs do not contain anything that could easily change or that is expected to change, such as session tokens or other state information. URIs should be stable and reliable in order to maximize the possibilities of reuse that Linked Data brings to users. There must be a balance between making URIs readable and keeping them more stable by removing descriptive information that will likely change. For more information on this, see [MDinURI] and Architecture of the World Wide Web: URI Opacity.
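
As a rough illustration of this principle, the sketch below flags query parameters that usually carry session state. The parameter names in VOLATILE_PARAMS and the example.gov URIs are assumptions made for illustration only; they are not part of any specification, and session state embedded in the path is not detected.

```python
from typing import List
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of query parameter names that usually carry session state.
VOLATILE_PARAMS = {"sessionid", "jsessionid", "phpsessid", "sid", "token"}


def volatile_parts(uri: str) -> List[str]:
    """Return the query parameter names in `uri` that look like session state."""
    query = urlsplit(uri).query
    return [
        name
        for name, _value in parse_qsl(query, keep_blank_values=True)
        if name.lower() in VOLATILE_PARAMS
    ]


def looks_stable(uri: str) -> bool:
    """True when no volatile query parameter was found in `uri`."""
    return not volatile_parts(uri)
```

A publisher could run a check like this over candidate URIs before minting them, treating any hit as a prompt for manual review rather than as a definitive verdict.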

Best Practices Checklist

High-level Considerations for Constructing URIs

The purpose of URIs is to uniquely and reliably name resources on the Web. According to Cool URIs for the Semantic Web (W3C IG Note), URIs should be designed with simplicity, stability and manageability in mind, thinking about them as identifiers rather than as names for Web resources.

Many general-purpose guidelines exist for the URI designer to consider, including (1) Cool URIs for the Semantic Web, which provides guidance on how to use URIs to describe things that are not Web documents; (2) Designing URI Sets for the UK Public Sector, a document from the UK Cabinet Office that sets out design considerations for how URIs can be used to publish public sector reference data; and (3) Style Guidelines for Naming and Labelling Ontologies in the Multilingual Web (PDF), which proposes guidelines for designing URIs in a multilingual scenario.

The purpose of this subsection is to provide specific, practical guidance to government stakeholders who are planning to create systems for publishing government Linked Data and therefore must create sensible, sustainable URI designs that fit their specific requirements.

A "Checklist" for Constructing Government URIs

The following checklist is based in part on Creating URIs (short; on the Web) and Designing URI Sets for the UK Public Sector (long; in PDF).

  1. What will your proposed URIs name? Will they:
    • Point to something downloadable? (e.g. PDF, CSV, RDF, TTL or ZIP files)
    • Identify some real world thing? (e.g. school, department, agency)
    • Point to information about a real world thing?
    • Identify some abstract thing? (e.g. a position, a service, a relationship)
    • Define a concept? (e.g. a vocabulary term or metadata element)
  2. Do you already have (non-URI) names for those things? (e.g. using other information systems)
  3. Do URIs already exist for naming these things?
    • Are you sure that the existing URIs refer to the same thing as you intend?
  4. Will you or some other organization have control over the new URIs?
  5. Do you have any strong syntax preferences or requirements?
    • Will your stakeholders need to easily write the chosen URI on a piece of paper, or remember it easily?
    • Will you spell URIs on the phone?
    • Will the URIs need to give hints about the content of the resource?
    • Is it necessary for the URI structure to make guessing of related URIs easier?
  6. What are the long-term persistence requirements of your URIs?
    • Should the URIs you create still make sense if the named resource evolves?
    • How far into the future must your resolvable URIs lead to results (e.g. data, documents, definitions)?
  7. Will you need to move the URI-named resources in the future?
    • Will such moves be related to organizational changes that may need to be reflected in the URIs?
    • Will these moves be purely technical, so that they need not be reflected in the URIs?
  8. Should the government sector (e.g. "Health," "Energy," "Defense") be included in the domain of the URI?
    • Have these sectors been defined formally (e.g. by statute)?
    • Will informal or equivalent sector names also be used?
  9. Is sensible resolution of partial/incomplete URIs necessary or anticipated?

URI Persistence

@@TODO@@ Expand this section (Bernadette)

Advice, info related to persistent URIs

As is the case with many human interactions, confidence in interactions via the Web depends on stability and predictability. For an information resource, persistence depends on the consistency of representations. The representation provider decides when representations are sufficiently consistent (although that determination generally takes user expectations into account).

Although persistence in this case is observable as a result of representation retrieval, the term URI persistence is used to describe the desirable property that, once associated with a resource, a URI should continue indefinitely to refer to that resource.

Consistent representation

A URI owner SHOULD provide representations of the identified resource consistently and predictably.

URI persistence is a matter of policy and commitment on the part of the URI owner. The choice of a particular URI scheme provides no guarantee that those URIs will be persistent or that they will not be persistent.

HTTP [RFC2616] has been designed to help manage URI persistence. For example, HTTP redirection (using the 3xx response codes) permits servers to tell an agent that further action needs to be taken by the agent in order to fulfill the request (for example, a new URI is associated with the resource).
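
A minimal sketch of how redirection can keep an old URI working after a resource moves. The MOVED table, the respond function and the example.gov URIs are hypothetical; a real deployment would normally configure this in the HTTP server rather than in application code.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical table mapping retired URIs to their replacements.
MOVED: Dict[str, str] = {
    "http://example.gov/agency/old-name": "http://example.gov/id/agency/new-name",
}


def respond(uri: str, known: Set[str]) -> Tuple[int, Optional[str]]:
    """Return (status code, Location header value or None) for a requested URI."""
    if uri in MOVED:
        # 301 Moved Permanently: the agent should use the new URI from now on.
        return 301, MOVED[uri]
    if uri in known:
        return 200, None
    return 404, None
```

The point of the design is that the retired URI never starts returning 404: clients holding the old identifier are still led to the resource.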

In addition, content negotiation also promotes consistency, as a site manager is not required to define new URIs when adding support for a new format specification. Protocols that do not support content negotiation (such as FTP) require a new identifier when a new data format is introduced. Improper use of content negotiation can lead to inconsistent representations.
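
The sketch below shows the idea behind content negotiation: one URI, several representations, selected from the Accept request header. It is a simplified illustration, not a complete implementation of the HTTP rules (for example, it ignores the precedence of more specific media ranges), and the function names are assumptions.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media range, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media, q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def matches(media_range: str, media_type: str) -> bool:
    """True when `media_type` falls inside `media_range` (supports */* and type/*)."""
    if media_range == "*/*":
        return True
    if media_range.endswith("/*"):
        return media_type.startswith(media_range[:-1])
    return media_range == media_type


def negotiate(header: str, available: List[str]) -> Optional[str]:
    """Pick the first available media type acceptable to the client, or None."""
    for media_range, q in parse_accept(header):
        if q <= 0:
            continue
        for media_type in available:
            if matches(media_range, media_type):
                return media_type
    return None
```

Adding support for a new format then means adding an entry to the `available` list served for the same URI, rather than minting a new URI.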

For more discussion about URI persistence, see [Cool].

Internationalized Resource Identifiers: Using non-ASCII characters in URIs

Guidelines for those interested in minting URIs in their own languages (German, Dutch, Spanish, Chinese, etc.)

The URI syntax defined in RFC 3986 STD 66 (Uniform Resource Identifier (URI): Generic Syntax) restricts URIs to a small number of characters: basically, just upper and lower case letters of the English alphabet, European numerals and a small number of symbols. There is now a growing need to enable use of characters from any language in URIs.

The purpose of this section is to provide guidance to government stakeholders who are planning to create URIs using characters that go beyond the subset defined in RFC 3986.

First we provide two important definitions:

An IRI (RFC 3987) is a protocol element that complements the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646) and can therefore be used to mint identifiers that use a wider set of characters than the one defined in RFC 3986.

An Internationalized Domain Name (IDN) is a domain name that may contain non-ASCII characters. A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003.

Although some standards enable the use of international characters in Web identifiers, government stakeholders need to take several issues into account before constructing such internationalized identifiers. This section is not meant to be exhaustive, and we point interested readers to An Introduction to Multilingual Web Addresses. Some of the most relevant issues are the following:

  • Domain Name lookup: Numerous domain name authorities already offer registration of internationalized domain names. These include providers for top level country domains such as .cn, .jp, .kr, etc., and global top level domains such as .info, .org and .museum.
  • Domain names and phishing: One of the problems associated with IDN support in browsers is that it can facilitate phishing through what are called 'homograph attacks'. Consequently, most browsers that support IDN also put in place some safeguards to protect users from such fraud.
  • Encoding problems: IRI provides a standard way of creating and handling international identifiers; however, support for IRIs across Semantic Web technology stacks and libraries is not uniform, which may cause difficulties for applications working with this kind of identifier. A good reference on this subject can be found in "I18n of Semantic Web Applications" by Auer et al.
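
To make the relationship between IRIs and URIs concrete, the following sketch converts an IRI into an ASCII-only URI using the Python standard library. It is a simplified illustration under stated assumptions: the function name iri_to_uri is hypothetical, user information in the authority is ignored, and Python's built-in "idna" codec follows the 2003 IDNA rules, so other tools may encode some host names differently.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified sketch, see assumptions above)."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Host name: encode each label as Punycode via the built-in IDNA codec.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    # Other components: percent-encode the UTF-8 bytes of non-ASCII characters.
    # '%' is kept safe so existing percent-escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

Publishers who mint IRIs may want to test both forms against the tools their data consumers use, since the two spellings are not always treated as the same identifier.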

Working Notes

TWC RPI Draft

@@TODO@@ Format/Update URI Design Principles per TWC RPI Draft (JohnE)

TWC RPI has drafted URI Design Principles: Creating Unique URIs for Government Linked Data with an eye toward instance identifier URIs that may be easily re-hosted --- a syntactic design that can be modeled and demonstrated on one host (e.g. TWC's Instance Hub demonstrator) but can be easily re-hosted on another, such as a government agency responsible for a set of named entities.

URI Design Goals

The design principles should produce...

  • URIs that are easily re-hosted (e.g. from a demonstrator portal to an agency host)
  • Concise URIs with as little cruft as possible
  • URIs that span many domains including:
    • National identifiers (e.g. governmental agencies, states, zip codes)
    • State-level identifiers (e.g. counties, congressional districts)
    • Agency-level identifiers (e.g. EPA facilities)

URI patterns (Ghislain)

  • URIs aim to identify any data, concept or object to be published, and should be dereferenceable. URIs should follow a common pattern across public sector bodies that publish data.
  • Decisions about which patterns to use should also take into account some basic criteria such as simplicity, stability and manageability.
  • First, identify one base URI structure, for example of the form
 http://{sector}.{yourdomain}/ or http://data.{sector}.{yourdomain}/.
  • Treat vocabulary and data separately when deciding on URI schemes. Several designs are possible, including special treatment for spatial data; see Designing URI Sets for the UK Public Sector.
  • Two options for vocabulary URI schemes:

1- Using the same base URI

  • 1.1- Add the path /ontology after your domain name, then append the concept's local name or local ID.
e.g. http://{baseURI}/ontology/{ontoName}/{Aclass} (for a concept) and
http://{baseURI}/ontology/{ontoName}/{aProperty} (for a property)
  • 1.2- Along the same lines, URIs for instances can be formed by adding /resource/{Aresource}/{individualResource} to the base URI.

2- Using a different scheme for instances

  • 2.1- If the base URI is of the type http://data.{sector}.{yourdomain}/, you may consider the following pattern:
http://data.{sector}.{yourdomain}/def/{ontoDomain}. Note the keyword *def* for the vocabulary; {ontoDomain} can be the scope of the ontology (geo, stat, service, transport, etc.).
  • 2.2- With this solution, instances are formed using the following scheme:
http://id.{sector}.{yourdomain}/{ontoDomain}. Note the presence of the keyword *id* (individuals) at the beginning of the URI pattern.
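
The two options can be sketched as small helper functions that fill in the patterns. The function names and the example values ("health", "example.gov", "geo") are assumptions used for illustration only.

```python
def option1_vocabulary(base_uri: str, onto_name: str, term: str) -> str:
    """Option 1.1: vocabulary terms under /ontology on the shared base URI."""
    return f"http://{base_uri}/ontology/{onto_name}/{term}"


def option1_instance(base_uri: str, resource_type: str, individual: str) -> str:
    """Option 1.2: instances under /resource on the same base URI."""
    return f"http://{base_uri}/resource/{resource_type}/{individual}"


def option2_vocabulary(sector: str, domain: str, onto_domain: str) -> str:
    """Option 2.1: vocabulary on the data. host, marked by the 'def' keyword."""
    return f"http://data.{sector}.{domain}/def/{onto_domain}"


def option2_instance(sector: str, domain: str, onto_domain: str) -> str:
    """Option 2.2: instances on a separate id. host."""
    return f"http://id.{sector}.{domain}/{onto_domain}"
```

For example, option2_vocabulary("health", "example.gov", "geo") yields http://data.health.example.gov/def/geo, while option2_instance with the same arguments yields http://id.health.example.gov/geo.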

URI Design Template

 'http://' BASE '/' 'id' '/' ORG '/' CATEGORY ( '/' TOKEN )+ 

In the example of TWC RPI's Instance Hub demonstration, BASE is logd.tw.rpi.edu

Notes on the RPI Design

  • id
    • This is required, to avoid polluting the top namespace of BASE with identifiers.
    • id is preferred over other alternatives to keep the token as short as possible.
    • The id token adds no semantics; it is merely a syntactic way of distinguishing instance identifier URIs from others.
    • Some consistency with data.gov.uk URIs is considered A Good Thing.
  • ORG
    • This is a short token representing the agency, government, or organization that has authority over the identifier space.
    • For US identifiers, this token will start with 'us/', and be followed by a designation of either federal or state-level (e.g. 'us/fed', 'us/ny', 'us/ca').
    • Identifiers relating to data.gov will all fall under the federal 'us/fed' space.
    • For identifiers that aren't directly governmental, the ORG token should be suitably unique; for example, we use "usps-com" below for USPS controlled zip code URIs.
  • CATEGORY and TOKEN
    • These are ORG-specific values that identify the specific instance.
    • Use as many TOKENs as necessary to distinguish the instance.

Examples

(Per TWC RPI URI Design Draft)

The URI Design Principles page provides examples of applying this template to:

  • US Government agencies:
    http://BASE/id/us/fed/agency/Department_of_Health_and_Human_Services/Centers_for_Disease_Control
  • States and Territories:
    http://logd.tw.rpi.edu/id/us/state/Vermont
  • Counties:
    http://BASE/id/us/state/Alaska/Bethel_Census_Area
  • US Postal Codes (Zip Codes):
    http://BASE/id/usps-com/zip/09510
  • Congressional Districts:
    http://BASE/id/us/ma/congressional-district/4
  • EPA Facilities:
    http://BASE/id/epa-gov/facility/110007995027
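
The template can be sketched as a small function that assembles a URI from its parts. The function name mint_instance_uri is hypothetical, and replacing spaces with underscores is an assumption inferred from the examples above rather than a rule stated in the draft.

```python
from urllib.parse import quote


def mint_instance_uri(base: str, org: str, category: str, *tokens: str) -> str:
    """Build 'http://' BASE '/id/' ORG '/' CATEGORY ( '/' TOKEN )+ ."""
    if not tokens:
        raise ValueError("at least one TOKEN is required by the template")
    segments = ["id", org, category, *tokens]
    # '/' stays unescaped because ORG may have several parts, e.g. 'us/fed'.
    # Assumption: spaces become underscores, as in the examples above.
    path = "/".join(quote(s.replace(" ", "_"), safe="/") for s in segments)
    return f"http://{base}/{path}"
```

Because BASE is a parameter, the same ORG, CATEGORY and TOKEN values can be re-hosted by changing a single argument, which is the stated goal of the design.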

OData Protocol URI Conventions

Government linked data providers using the Windows Azure Platform and implementing the Open Data Protocol (OData) specification may also wish to consider the OData: URI Conventions recommendation. That document "...defines a set of recommended (but not required) rules for constructing URIs to identify the data and metadata exposed by an OData server as well as a set of reserved URI query string operators, which if accepted by an OData server, MUST be implemented as required by (that document)..."


References