
BP Data Identification

From Data on the Web Best Practices

Identifiers are labels, assigned by convention, that distinguish what is being identified from anything else. Identifiers are used extensively in every information system, making it possible to refer to any particular element. The Web is predicated on a uniform system of identifiers that are globally unique and can be looked up by dereferencing them over the Internet. There are three terms in common use for these identifiers and, although they are often used interchangeably, they differ in meaning.

  • URL (Uniform Resource Locator) is the location of a resource on the Web. URLs are a subset of URIs.
  • URI (Uniform Resource Identifier) is an identifier for anything, including things not available over the Internet such as people, buildings and mountains. There are many URI schemes, not all of which can be looked up over the Internet. For example, doi:10.1103/PhysRevD.89.032002 is a URI but cannot be looked up (directly) on the Internet. For data on the Web, only HTTP(S) URIs are relevant. HTTP URIs are a subset of URIs which are a subset of IRIs.
  • IRI (Internationalized Resource Identifier) is conceptually identical to a URI but allows the use of non-ASCII characters, whereas URIs are limited to ASCII.
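The relationship between IRIs and URIs can be illustrated with a minimal Python sketch that maps an IRI to a URI by percent-encoding the UTF-8 bytes of non-ASCII characters. The function name iri_to_uri and the example IRI are hypothetical, and the sketch assumes the host name is already ASCII (internationalized host names need separate handling that is not shown here).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters in the path, query and fragment.

    Assumption: the host part is plain ASCII and is passed through unchanged.
    Existing percent-escapes are preserved by treating '%' as safe.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://example.com/données/température.csv"))
```

An identifier that is already a plain ASCII URI passes through this function unchanged, which reflects the point that every URI is also an IRI.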

Of these three, the term URL is by far the most commonly used. The terms URI and, even more so, IRI may cause confusion among some audiences. However, in the context of data on the Web, URI is more appropriate since data points and datasets very often refer to real world objects and phenomena. The term IRI is used where necessary.

Data discovery, usage and citation on the Web depend fundamentally on the use of HTTP (or HTTPS) URIs.

It is perhaps worth emphasizing some key points about URIs in the current context.

  1. URIs are 'dumb strings', that is, they carry no semantics. Their function is purely to identify a resource.
  2. Although the previous point is accurate, it would be perverse for a URI such as http://example.com/dataset.csv to return anything other than a CSV file. Human readability is helpful.
  3. When de-referenced (looked up), a single URI may offer the same resource in more than one format. http://example.com/dataset may offer the same data in, say, CSV, JSON and XML. The server returns the most appropriate format based on content negotiation.
  4. One URI may redirect to another.
  5. De-referencing a URI triggers a computer program to run on a server so that the URI acts as a call to an API. The server may therefore do something as simple as return a single, static file, or it may carry out complex processing. Precisely what processing is carried out, i.e. the software on the server, is completely independent of the URI itself.
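Points 3 and 5 can be made concrete with a minimal sketch of server-side content negotiation in Python. It is an illustration under stated assumptions, not a description of any particular server: the AVAILABLE table, the file names, the function names and the choice of CSV as the default are all hypothetical, and the parser handles only exact media types, q-values and the */* wildcard (not partial wildcards such as text/*).

```python
# Hypothetical representations offered for http://example.com/dataset
AVAILABLE = {
    "text/csv": "dataset.csv",
    "application/json": "dataset.json",
    "application/xml": "dataset.xml",
}


def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending q-value."""
    prefs = []
    for item in header.split(","):
        fields = [f.strip() for f in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal weights keep the order of the header
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, available=AVAILABLE, default="text/csv"):
    """Pick the media type to serve, or None if nothing is acceptable."""
    if not accept_header.strip():
        return default  # no stated preference: any format is acceptable
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    return None  # a real server would typically answer 406 Not Acceptable
```

The same URI is used in every case; only the Accept header sent by the client changes which representation is selected.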

Title

Use persistent URIs as identifiers

Summary

Datasets must be identified by a persistent URI.

Why

Adopting a common identification system enables any stakeholder to identify and compare data in a reliable way. Persistent identifiers are an essential pre-condition for proper data management and re-use.

Intended Outcome

Datasets, or information about datasets, must be discoverable and citable through time, regardless of the status, availability or format of the data.

Possible Approach to Implementation

Whether a URI is persistent or not is a matter of policy and intention, not of technology.

To be persistent, URIs must be designed as such, backed up by organizational commitments. There have been a number of articles written on this topic and the following summarizes many of the key points made.

Recreate this list

Where a data publisher is unable or unwilling to manage its URI space directly for persistence, an alternative approach is to use a redirection service such as purl.org. This provides persistent URIs that can be redirected as required so that the eventual location can be ephemeral. The software behind such services is freely available so that it can be installed and managed locally if required.
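The idea behind a redirection service can be sketched in a few lines of Python. This is a toy model rather than the software behind purl.org: the paths, target URLs, function names and the use of HTTP status 302 are assumptions made for illustration.

```python
# Hypothetical redirect table: persistent path -> current (ephemeral) location
REDIRECTS = {
    "/dataset/london-temperature": "http://data.example.org/2015/files/london-temp.csv",
}


def resolve(persistent_path, table=REDIRECTS):
    """Return (status, location) for a request to a persistent path."""
    target = table.get(persistent_path)
    if target is None:
        return 404, None
    return 302, target  # assumption: a temporary redirect to the current location


def move(persistent_path, new_target, table=REDIRECTS):
    """Repoint a persistent path when the underlying data is relocated."""
    table[persistent_path] = new_target
```

When the data moves, only the table entry changes; the URI that has been published and cited stays the same.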

Digital Object Identifiers (DOIs) offer a similar alternative. These identifiers are defined independently of any Web technology but can be appended to a 'URI stub.' DOIs are an important part of the digital infrastructure for research data and libraries.
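As a small illustration, the following Python sketch appends a DOI to a URI stub to obtain an HTTP URI. The stub https://doi.org/ is assumed here as the commonly used DOI resolver; the function name doi_to_uri and the light validation are hypothetical.

```python
from urllib.parse import quote

DOI_STUB = "https://doi.org/"  # assumption: the commonly used DOI resolver


def doi_to_uri(doi: str) -> str:
    """Turn 'doi:10.1103/...' or '10.1103/...' into an HTTP URI."""
    name = doi.strip()
    if name.lower().startswith("doi:"):
        name = name[4:]
    # Light sanity check only: DOI names start with '10.' and have a prefix/suffix
    if not name.startswith("10.") or "/" not in name:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are not safe in a URI path
    return DOI_STUB + quote(name, safe="/")


if __name__ == "__main__":
    print(doi_to_uri("doi:10.1103/PhysRevD.89.032002"))
```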

How to Test

Check that each dataset in question is identified using a URI that has been assigned under a controlled process as set out in the previous section. Ideally, the relevant Web site includes a description of the process and a credible pledge of persistence should the publisher no longer be able to maintain the URI space themselves.

Evidence

Relevant requirements: R-UniqueIdentifier, R-Citable

Title

Assign URIs to dataset versions and series

Summary

URIs should be assigned to individual versions of datasets as well as the overall series.

Why

Like documents, many datasets fall into natural series or groups. For example:

  • noon temperature readings in central London 1850 to the present day;
  • today's noon temperature in London;
  • the temperature in London at noon on 3rd June 2015.

In different circumstances, it will be appropriate to refer separately to each of these examples (and many like them).

Intended Outcome

It should be possible to refer to a specific version of a dataset and to concepts such as a 'dataset series' and 'the latest version.'

Possible Approach to Implementation

The W3C provides a good example of how to do this. The (persistent) URI for this document is http://www.w3.org/TR/2015/WD-dwbp-20150224/. The URI for the 'latest version' of this document is http://www.w3.org/TR/dwbp. At the time of publication, these two URIs both resolve to this document. However, when the next version of this document is published, the 'latest version' URI will be changed to point to that.

To complete the London temperature example, one might imagine URIs as follows:
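One hypothetical layout, sketched in Python, gives the series, the latest reading and each dated reading a URI of its own. The example.org base, the path structure and the function names are assumptions made for illustration only.

```python
from datetime import date

# Hypothetical base URI identifying the whole series of noon readings
BASE = "http://example.org/temperature/london/noon"


def series_uri() -> str:
    """URI for the overall series (1850 to the present day)."""
    return BASE


def latest_uri() -> str:
    """URI that always refers to the most recent reading."""
    return BASE + "/latest"


def version_uri(day: date) -> str:
    """URI for the reading taken at noon on one specific day."""
    return f"{BASE}/{day.isoformat()}"


def resolve_latest(published_days) -> str:
    """Dated URI that the 'latest' URI currently stands for."""
    return version_uri(max(published_days))


if __name__ == "__main__":
    print(series_uri())
    print(latest_uri())
    print(version_uri(date(2015, 6, 3)))
```

As with the W3C example, the 'latest' URI stays fixed while the dated URI it corresponds to changes each time a new reading is published.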

How to Test

Check that each version of a dataset has its own URI, and that logical groups of datasets are also identifiable.

Evidence

Relevant requirements: R-UniqueIdentifier, R-Citable ??