URI Usage Primer

It is useful to use URIs to name things both on the web (web pages, images and so on) and in the wider world (people, organisations and so on), because it means that we can use those URIs to find out more information about those things by resolving the URI. Data often includes URIs as names for entities, and then associates properties with them. When data describes properties such as license or nationality for entities named with a URI, applications need to be able to distinguish between properties that apply to the result of resolving the URI (eg an HTML page) and those that apply to something described by that document.

This document describes the processing that applications should perform when they encounter URIs in data. It describes how to define data formats and publish information at URIs to enable applications to understand how URIs within data should be interpreted.

Introduction

Applications operate based on data that they receive or collect. For example, an application that works as an HTTP server might be sent data through an HTTP POST or PUT request. A mobile app might collect data by requesting it through GET requests on a web API.

The data that an application receives is a sequence of bits. The application interprets those bits through a series of processes — decoding, parsing, transforming and so on — to create an internal model based on which it can act. When the data includes URIs, those URIs may be used to inform the processing that builds the internal model, and the internal model may eventually include things that are labelled using the URIs. Most importantly, the internal model may include content retrieved by resolving the URIs in the original data, and associate properties with that content based on the information associated with the URI in that data.

For example, Paul Downey has created an image of his poster The URI Is The Thing and made it available on a photo sharing site. Let us imagine that the photo sharing site exposes information about the poster in a number of ways, including through a JSON API. The JSON might look something like:

{
  "@id": "http://photo.example.com/psd/12345/original.jpeg",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/"
}

In this case, say the URI http://photo.example.com/psd/12345/original.jpeg resolves to a sequence of bits that encodes a JPEG image, and the JSON provided by the photo sharing site is intended to inform applications that that JPEG image was created by Paul Downey and can be reused elsewhere as long as it is attributed. Knowing this, an application that accessed JSON from the site that included the above data could retrieve, store and process the bits provided at the JPEG's URI (for example to extract EXIF data).

In other cases, as described in , URIs used within data might point to landing pages which describe the thing that has the properties specified in the data rather than being the thing that has those properties. To communicate effectively, data providers and applications need to have an agreed understanding about whether a given property provided in some data applies directly to the content at the given URI or to the thing that content describes. This document provides terminology and best practices to facilitate that shared understanding.

Landing Pages

A landing page is any page which contains a description of something else. Landing pages often provide summaries or additional information about the thing that they describe. Examples are landing pages for images on Flickr or videos on YouTube, which are HTML pages that embed the media that they describe and provide access to comments and other metadata about it. Landing pages for documents are often tables of contents or abstracts.

For example, say that the photo sharing site from the earlier example published an HTML page about The URI Is The Thing at http://photo.example.com/psd/12345 which acts as a landing page for the photo, enabling people to add comments about it and providing links to other pictures by Paul Downey and so on. In this scenario, the site might publish the JSON:

{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/"
}

Unlike the previous example, here it is not the case that the content an application gets when it resolves the value of the @id property (http://photo.example.com/psd/12345) is an image — it is an HTML page. Similarly, the content of the HTML page is not created by Paul Downey — it is created by the photo sharing site. The HTML page is not available under the CC-by license — the photo sharing site holds the copyright. Thus the properties that are associated with the URI http://photo.example.com/psd/12345 within the data do not apply to the content provided at that URI, but to the image for which the HTML page is the landing page.

Landing pages are not necessarily HTML: APIs that provide data in JSON, XML or RDF usually use URIs within that data which provide locations from which further information about the entities associated with the URIs can be discovered, again in JSON or XML or RDF. These JSON, XML or RDF landing pages are exactly the same as HTML landing pages: they describe the image, video or other thing rather than being a sequence of bits that is that thing.

Thus the same considerations would apply if the photo sharing site published the JSON above at the URI http://photo.example.com/psd/12345. The JSON that's published at that URI is not an image, so it is a landing page (albeit one that can't be read easily by a human). The site could alternatively use content negotiation to determine whether a given application receives the JSON or the HTML or some other format.

If the URI http://photo.example.com/psd/12345 supported content negotiation such that a request with Accept: text/html provided an HTML page but a request with Accept: image/jpeg returned the image, the URI is being used to identify two distinct resources: the image and the landing page. As discussed in The Architecture of the World Wide Web [[WEBARCH]], this pattern should be avoided: different resources should be given different URIs.

The photo sharing site may add information to the JSON that it publishes that is about the HTML page at the URI. For example they might add a last-modified date that indicates the date and time that the page was last modified:

{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "last-modified": "2012-06-20T08:54:32Z"
}

Doing this is potentially confusing because a developer simply looking at the output of the API and trying to make sense of it might assume that because the rest of the properties associated with http://photo.example.com/psd/12345 (such as creator or license) apply to the image described by the landing page at that URI, the last-modified property must apply to that image as well, when in fact it applies to the HTML landing page.

While the above example is of a landing page for an image where the image itself is available elsewhere on the web, publishers also provide landing pages for things that aren't available on the web, such as people or pieces of furniture. For example, the photo sharing site might publish a landing page for Paul Downey:

{
  "@id": "http://photo.example.com/psd",
  "type": "person",
  "name": "Paul Downey",
  "nickname": "psd"
}

When data is about something like a person or piece of furniture, it is usually obvious (to developers, who understand the world) that a given property, such as nickname or dimensions, doesn't apply to the landing page but to the person or piece of furniture that it describes. On the other hand, when the data is about something whose content could exist as data on the web, such as a photograph or a book or a film, that thing will often have properties that could equally apply to the landing page itself, such as creator or last-modified.

Data Normalization

When publishing data on the web, you have to make choices about how best to group data together and associate it with URIs. You can think of data published on the web as being similar to data stored within tables of a relational database, in which each page on the web corresponds to a row that is identified by a primary key that is a URI. One row can reference another by using the URI as the value for a particular property.

In relational databases, it proves useful to normalize data by splitting the information available into separate tables with links between them, such that within a single table any properties are solely dependent on the primary key for that table. This helps to avoid the redundancy that occurs when the same information is stored in several places: after normalization, the information is stored in one place and referenced multiple times. It also helps to match user expectations about the way in which data is organised.

The same principle can be applied to information published on the web. If the photo sharing site were to apply this principle, instead of publishing information about a single entity, it might instead publish JSON in which the HTML landing page and the image it describes were separate objects with distinct URIs:

{
  "@id": "http://photo.example.com/psd/12345",
  "last-modified": "2012-06-20T08:54:32Z",
  "describes": {
    "@id": "http://photo.example.com/psd/12345/original.jpeg",
    "type": "image",
    "creator": "Paul Downey",
    "license": {
    	"@id": "http://creativecommons.org/licenses/by/3.0/"
    }
  }
}

Graphically, this looks like:

On websites that support write operations, particularly PUT requests, data normalization — providing separate URIs for separate entities — can help to isolate modifications and minimise the size of update messages, because the updates can be targetted to the resource that needs to be updated rather than a related entity that happens to incorporate its description. Even on read-only websites, data normalization facilitates the reuse of data from elsewhere on the web through linking and embedding, because the separate resources are given separate URIs. In the example above, the fact that the URI for the image itself (http://photo.example.com/psd/12345/original.jpeg) is provided within the data means that other data publishers on the web can refer directly to the image (to annotate it, embed it and so on) without going through the landing page.

When data normalization introduces entities which don't exist on the web, such as people or pieces of furniture, it can still be helpful to provide URIs for those entities separate from those for the landing pages that describe those things. Typically these URIs should either use fragment identifiers (also known as hash URIs) that don't resolve to fragments of content within the landing page, or URIs that give a 303 See Other response when they are requested, providing a redirection to the landing page that describes the entity. These methods are described in Cool URIs for the Semantic Web [[COOLURIS]].

Documenting Properties

Having unnormalized data has implications for how the properties used within data need to be documented. It isn't always feasible to fully normalize data published on the web. Splitting data up over several URIs can lead to more network requests (depending on the amount of data that is provided embedded within the published view). Supporting more URIs requires extra server-side development effort. In addition, current web APIs do not tend to reuse URIs from other APIs, which means there is little practical benefit to normalizing (though there are exceptions particularly around user identity within social networks and the identification of entities using the Open Graph Protocol, schema.org or Twitter Cards).

A data format that mixes properties about landing pages and properties about the things those landing pages describe is not necessarily ambiguous: all that's required for developers to understand what the properties actually apply to is for the meaning of the property to be documented.

We recommend the use of the following terms to describe properties within such documentation:

URI property: a property that holds a URI for an entity to which the properties in the data are associated, often named something like @id or url
direct property: a property that applies directly to an entity (which may or may not be a landing page, depending on the amount of normalization that has been done)
shorthand property: a property that indicates that the entity is a landing page that describes another entity which has a particular implied property with that value
implied property: a property whose presence on an entity is implied through a shorthand property on a landing page
parallel property: a implied property where the shorthand property implies the presence of a landing page for another entity (see )

The following diagram shows how these properties interact:

The term shorthand property can be used in a variety of cases, and documentation about shorthand properties needs to be particularly explicit about how they should be interpreted, as described in the following sub-sections.

For example, in the JSON

{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "last-modified": "2012-06-20T08:54:32Z"
}

the properties might be documented as:

@id: a URI property for an HTML landing page
type: a shorthand property that implies the thing the landing page describes has the specified type
creator: a shorthand property that implies the thing the landing page describes has the specified creator
license: a shorthand property that implies the thing the landing page describes has the license whose content is found at the location given
last-modified: a direct property which indicates when the landing page was last modified

In this cases, "has type", "has creator" and "has license" are implied properties which might not be described explicitly in the documentation. Graphically, we have:

URI Values

Properties may have values that are themselves URIs. In these cases, the property documentation should make clear whether the entity URI (provided by the URI property such as @id) points to a landing page, or the value URI (given in the value of the individual property) points to a landing page, or both. For example, in a case such as:

{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "http://photo.example.com/psd",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "modified": "2012-06-20T08:54:32Z"
}

both the creator and the license properties are shorthand properties that relate to the image described by the landing page at the entity URI http://photo.example.com/psd/12345. However, the value of the creator property is also a landing page, this time for Paul Downey, whereas the value of the license property actually points to the content of the license.

Properties between entities that are implied due to a property asserted between two landing pages are called parallel properties because in a diagram that shows the relationships between the landing pages and between the entities, these kinds of implied properties will appear parallel to the shorthand property.

The following diagram shows the creator shorthand property, whose value is a URI that points to a landing page, and how this property implies the existence of two entities — an image and a person — and a "has creator" relationship between those entities.

Multi-Faceted Landing Pages

Sometimes landing pages are about more than one thing, or the thing that they describe is functionally related to other things. In the example we've been using, the image http://photo.example.com/psd/12345/original.jpeg is actually a photograph of a poster which is about the web. What if the photograph of his poster had been taken by someone other than Paul Downey? The JSON about its landing page might be:

{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "photographer": "Nadia",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "last-modified": "2012-06-20T08:54:32Z"
}

In this case, the photographer property relates to the photograph described by the landing page at http://photo.example.com/psd/12345 whereas the creator property relates to the artwork that was photographed.

As this example shows, it is helpful to document the kind of the thing described by a landing page that a given property relates to. This enables an application, if it chooses to, to build an internal model of the data that includes separate entities for the landing page, each of the things that are described by the landing page, and the ways in which they are related.

In the example above, the documentation might include:

photographer: a shorthand property that implies the landing page is about a photograph that was taken by a photographer with the given name
creator: a shorthand property that implies the landing page is about a creative work that was created by someone with the given name

Combining Data

Data associated with the same URI may come from different sources. For example, a social networking site may provide JSON that states that someone likes the image described by http://photo.example.com/psd/12345 (here we assume that the likes property is defined as a shorthand property that implies that the content of the page at http://photo.example.com/psd/12345 describes the thing that is liked):

{
  "@id": "http://social.example.com/dirk",
  "type": "person",
  "likes": [
  	"http://photo.example.com/psd/12345",
  	...
  ]
}

A review site might similarly provide JSON that describes a review of the image at http://photo.example.com/psd/12345 (again we assume here that the documentation describes this):

{
  "@id": "http://review.example.com/jane/12345",
  "type": "review",
  "subject": "http://photo.example.com/psd/12345",
  "rating": 5
}

As discussed in , the landing page at http://photo.example.com/psd/12345 may describe many things. If a search engine or other application were to merge the information from the three sites, it would need to associate both the "like" and the review to the same entity — the image.

The publishers of the image could help applications to combine information about the image across the sites accurately. The publisher could supply a separate URI for the image itself, linked to from the landing page with a specific relationship (such as describesImage) through a Link: HTTP header or a <link> element within the page. The social media site and the review site could either reference that image directly, or describe their shorthand properties in terms of the describesImage property of the landing page.

Locating Property Documentation

Documentation about how to interpret data, which includes how URIs within the data should be interpreted and how any properties within the data apply to entities named by those URIs, should be published somewhere such that it's possible for those developers to find that documentation. Possible routes for doing this explicitly include:

if the data is provided through a protocol that supports it, such as through HTTP, by explicitly indicating the media type of the data, and registering that media type such that documentation can be found for it through the IANA media type registry
if the media type is generic (such as application/json), by providing supplementary documentation through a profile link relationship, for example within a HTTP Link: header
embedding links to the documentation within the data itself, for example through a resolvable XML namespace or @xsi:schemaLocation attribute in XML or resolvable URIs for classes and properties in RDF

Developers must be able to locate this documentation through a mechanism that isn't a search against the Internet. For example, if the property documentation should be accessed through resolving URIs within the data (the last of the options above), this should be specified within the media type definition or the documentation provided through the profile link relationship.

What if the data isn't made available by HTTP and you therefore don't have a media type: how does follow-your-nose work in that case? For example, if the data is provided via FTP or embedded within a textual email message.

Recommendations

This section makes concrete recommendations for data consumers, data publishers and the authors of specifications that use URIs, based on the discussion above.

Authoring Specifications

Data formats that include references to URIs should specify what properties are associated with the entities named by those URIs, based on how the URIs are used within the document and on the other data within the document. They should also specify what applications can expect to find at the end of these URIs: in particular whether the URI is being used to reference the content found at that URI, or something described by that content.

For example, the XML Recommendation [[XML10]] specifies that URIs used within the <!DOCTYPE> declaration must resolve to documents that are well-formed external subsets: this places some clear expectations on what publishers should publish at these URIs, and on how applications should process them.

By contrast, the Namespaces in XML Recommendation [[XML-NAMES]] does not specify how the URIs used within XML namespace declarations should be processed, over and above how to compare them. It does not say whether applications can resolve them, or what should be found if they are resolved. The Architecture of the World Wide Web states that these URIs should resolve to "namespace documents", but leaves open whether these should be machine-readable schemas themselves, or landing pages from which the schemas can be located, as described in the TAG Finding Associating Resources with Namespaces.

Specifying Metaformats

Metaformats such as RDF that incorporate URIs as part of their core information model should document the default interpretation of those URIs: whether properties for which no other information is available should be interpreted as applying to the content available at those URIs or the things those pages describe.

Metaformats that are designed to be used to provide metadata about HTML pages, images, video and other information on the web should default to an interpretation in which URIs are used to locate content. Those that are designed to encode data about things that are not found on the web should default to an interpretation in which URIs are used to locate data that describe the entity.

Metaformats may delegate how URIs are interpreted to individual vocabularies that use the metaformat, such that different properties within a vocabulary fall into different categories, as described in .

Specifying Vocabularies

Authors of vocabularies used with metaformats such as XML, JSON or RDF should define how data expressed in those vocabularies should be interpreted. The vocabulary should be documented in terms of the entities that data using that vocabulary describes, and how any URIs used within the vocabulary should be interpreted, whether as locating content on the web or locating landing pages that describe things. This interpretation may vary on a property-by-property basis, in which case the properties should be documented using the terminology given in .

Specifying Schema Languages

Schema languages should include mechanisms for indicating the category of a property as described in . This encourages vocabulary authors to be explicit in their property documentation, and it enables applications to automatically normalize the data they encounter into separate entities, as described in , without prior knowledge of the vocabulary.

In some cases, vocabulary authors may wish to provide names within the vocabulary for the implied properties in order to express the relationship between them and shorthand properties. Alternatively, shorthand properties within a vocabulary may be mapped on to implied properties in a different vocabulary. Schemas may provide the facility to specify the implications of the presence of each shorthand property in terms of implied entities and their properties.

Consuming Data

Applications that consume data on the web should determine what properties can be associated directly with an entity. This determination should be based on the media type of the data. Media types for structured syntaxes such as JSON, XML or Turtle may delegate information about how to interpret data to a vocabulary, defined in a schema or in separate documentation.

Applications should be wary, in the absence of explicit indications within specifications or vocabularies, about associating properties with the content located at URI used within a URI property for an entity. Some publishers may intend the properties to be associated with the content, while others intend them to be associated with an entity described by the content. Applications should be particularly careful in interpreting properties that could be associated with things on the web, such as "like" or "creator".

HTTP Responses

TODO. Jonathan suggests including a section on handling hash URIs and 303 responses; he actually suggested an appendix, but I think it fits here better.

Publishing Data

Publishers can help enable more accurate merging of data from different sites if they supply separate URIs for the different entities that other sites may wish to reference, and link to these from any landing pages using a link relationship such as describes. These URIs can simply link back (through a 303 redirection if the entity does not have any web content, or with a describedby link relationship if it does), to the landing page.