Hash URIs

Note: This was initially posted at http://www.jenitennison.com/blog/node/154.

There’s been quite a bit of discussion recently about the use of hash-bang URIs following their adoption by Gawker, and the ensuing downtime of that site. Gawker have redesigned their sites, including lifehacker and various others, such that all URIs look like http://{domain}#!{path-to-content} — the #! is the hash-bang. The home page on the domain serves up a static HTML page that pulls in Javascript that interprets the path-to-content and requests that content through AJAX, which it then slots into the page. The sites all suffered an outage when, for whatever reason, the Javascript couldn’t load: without working Javascript you couldn’t actually view any of the content on the site. This provoked a massive cry of #FAIL (or perhaps that should be #!FAIL) and a lot of puns along the lines of making a hash of a website and it going bang.

For analysis and opinions on both sides, see:

  • Breaking the Web with hash-bangs by Mike Davies
  • Broken Links by Tim Bray
  • Hash, Bang, Wallop by Ben Ward
  • Hash-bang boom by Tom Gibara
  • Thoughts on the Hashbang by Ben Cherry
  • Nathan’s comments on www-tag

While all this has been going on, the TAG at the W3C have been drafting a document on Repurposing the Hash Sign for the New Web (originally named Usage Patterns For Client-Side URI parameters in April 2009) which takes a rather wider view than just the hash-bang issue, and on which they are seeking comments.

All matters of design involve weighing different choices against some criteria that you decide on implicitly or explicitly: there is no single right way of doing things on the web. Here, I explore the choices that are available to web developers around hash URIs and discuss how to mitigate the negative aspects of adopting the hash-bang pattern.

Background

The semantics of hash URIs have changed over time. Look back at RFC 1738: Uniform Resource Locators (URL) from December 1994 and fragments are hardly mentioned; when they are, they are termed “fragment/anchor identifiers”, reflecting their original use, which was to jump to an anchor within an HTML page (indicated by an `a` element with a `name` attribute; those were the days). Skip to RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax from August 1998 and fragment identifiers have their own section, where it says:

When a URI reference is used to perform a retrieval action on the identified resource, the optional fragment identifier, separated from the URI by a crosshatch (“#”) character, consists of additional reference information to be interpreted by the user agent after the retrieval action has been successfully completed. As such, it is not part of a URI, but is often used in conjunction with a URI.

At this point, the fragment identifier:

  • is not part of the URI
  • should be interpreted in different ways based on the mime type of the representation you get when you retrieve the URI
  • is only meaningful when the URI is actually retrieved and you know the mime type of the representation

Forward to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax from January 2005 and fragment identifiers are defined as part of the URI itself:

The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The identified secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations.

This breaks away from the tight coupling between a fragment identifier and a representation retrieved from the web and purposefully allows the use of hash URIs to define abstract or real-world things, addressing TAG Issue 37: Definition of abstract components with namespace names and frag ids and supporting the use of hash URIs in the semantic web.

Around the same time, we have the growth of AJAX, where a single-page interface is used to access a wide set of content which is dynamically retrieved using Javascript. The AJAX experience could be frustrating for end users, because the back button no longer worked (to let them go back to previous states of their interface) and they couldn’t bookmark or share state. And so applications started to use hash URIs to track AJAX state (that article is from June 2005, if you’re following the timeline).

And so we get to hash-bangs. These were proposed by Google in October 2009 as a mechanism to distinguish between cases where hash URIs are being used as anchor identifiers, to describe views, or to identify real-world things, and those cases where they are being used to capture important AJAX state. What Google proposed is for pages where the content of the page is determined by a fragment identifier and some Javascript to also be accessible by combining the base URI with a query parameter (_escaped_fragment_={fragment}). To distinguish this use of hash URIs from the more mundane kinds, Google proposed starting the fragment identifier with #! (hash-bang). Hash-bang URIs are therefore associated with the practice of transcluding content into a wrapper page.

To summarise, hash URIs are now being used in three distinct ways:

  1. to identify parts of a retrieved document
  2. to identify an abstract or real-world thing (that the document says something about)
  3. to capture the state of client-side web applications

Hash-bang URIs are a particular form of the third of these. By using them, the website indicates that the page uses client-side transclusion to give the true content of the page. If it follows Google’s proposal, the website also commits to making that content available through an equivalent base URI with an _escaped_fragment_ parameter.
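
To make Google’s mapping concrete, here is a minimal sketch (in TypeScript; this is not Google’s reference code, and the escaping shown is deliberately simplified) of how a hash-bang URI corresponds to its _escaped_fragment_ equivalent:

// The scheme escapes a small set of special characters (%, #, &, +) in the
// fragment value; slashes are left alone, which matches the lifehacker
// example in the next section.
const escapeFragment = (s: string): string =>
  s.replace(/[%#&+]/g, (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase());

// http://example.com/page#!state -> http://example.com/page?_escaped_fragment_=state
function toEscapedFragmentUri(hashBangUri: string): string | null {
  const bang = hashBangUri.indexOf("#!");
  if (bang === -1) return null;                   // not a hash-bang URI
  const base = hashBangUri.slice(0, bang);
  const state = hashBangUri.slice(bang + 2);      // everything after "#!"
  const separator = base.includes("?") ? "&" : "?";
  return `${base}${separator}_escaped_fragment_=${escapeFragment(state)}`;
}

// The reverse mapping, which a crawler or a server might apply.
function toHashBangUri(escapedUri: string): string | null {
  const url = new URL(escapedUri);
  const state = url.searchParams.get("_escaped_fragment_");  // already unescaped
  if (state === null) return null;
  url.searchParams.delete("_escaped_fragment_");
  return `${url.origin}${url.pathname}${url.search}#!${state}`;
}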

Hash-bang URIs in practice

Let’s have a look at how hash-bang URIs are used in a couple of sites.

Lifehacker

First, we’ll look at lifehacker, which is one of Gawker’s sites whose switch to hash-bangs triggered the recent spate of comments. What happens if I link to the article http://lifehacker.com/#!5770791/top-10-tips-and-tricks-for-making-your-work-life-better?

The exact response to this request seems to depend on some cookies (it didn’t work the first time I accessed it in Firefox, having pasted the link from another browser). If it works as expected, in a browser that supports Javascript, the browser gets the page at the base URI http://lifehacker.com/, which includes (amongst a lot of other things) a script that POSTs to http://lifehacker.com/index.php?_actn_=ajax_post a request with the data:

  op=ajax_post
  refId=5770791
  formToken=d26bd943151005152e6e0991764e6c09

The response to this POST is a 53kB JSON document that contains a bit of metadata about the post and then its escaped HTML content. This gets inserted into the page by the script, to display the post. As this isn’t a GETtable resource, I’ve attached this file to this post so you can see what it looks like. (Honestly, I could hardly bring myself to describe this: a POST to get some data? a .php URL? a query parameter set to ajax_post? massive amounts of escaped HTML in a JSON response? Geesh. Anyway, focus… hash-bang URIs…)

A browser that doesn’t support Javascript simply gets the base URI and is none the wiser about the actual content that was linked to.

What about the _escaped_fragment_ equivalent URI, http://lifehacker.com/?_escaped_fragment_=5770791/top-10-tips-and-tricks-for-making-your-work-life-better? If you request this, you get back a 200 OK response which is an HTML page with the content embedded in it. It looks just the same as the original page with the embedded content.

What if you make up some rubbish URI, which in normal circumstances you would expect to give a 404 Not Found response? Naturally, a request to the base URI of http://lifehacker.com/ is always going to give a 200 OK response, although if you try http://lifehacker.com/#!1234/made-up-page you get page furniture with no content in the page. A request to http://lifehacker.com/?_escaped_fragment_=1234/made-up-page results in a 301 Moved Permanently to the hash-bang URI http://lifehacker.com/#!1234 rather than the 404 Not Found that we’d want.

Twitter

Now let’s look at Twitter. What happens if I link to the tweet http://twitter.com/#!/JeniT/status/35634274132561921?

Although it’s not indicated in the Vary header, Twitter determines what to do about any requests to this hashless URI based on whether I’m logged in or not (based on a cookie). If I am logged in, I get the new home page. This home page GETs (through various iframes and Javascript obfuscation) several small JSON files through Twitter’s API:

  • http://api.twitter.com/1/statuses/show.json?include_entities=true&contributor_details=true&id=35634274132561921: the details of the tweet
  • http://api.twitter.com/1/statuses/35634274132561921/retweeted_by.json?count=15: details about retweets
  • http://api.twitter.com/1/users/lookup.json?user_id=&screen_name=unhosted: details about the twitter user @unhosted, who was mentioned in the tweet

This JSON gets converted into HTML and embedded within the page using Javascript. All the links within the page are to hash-bang URIs and there is no way of identifying the hashless URI (unless you know the very simple pattern: you can simply remove the #! to get a static page).

If I’m not logged in but am using a browser that understands Javascript, the browser GETs http://twitter.com/; the script in the returned page picks out the fragment identifier and redirects (using Javascript) to http://twitter.com/JeniT/status/35634274132561921. If, on the other hand, I’m using curl or a browser without Javascript activated, I just get the home page and have no idea that the original hash-bang URI was supposed to give me anything different.

The response to the hashless URI http://twitter.com/JeniT/status/35634274132561921 also varies based on whether I’m logged in or not. If I am, the response is a 302 Found to the hash-bang URI http://twitter.com/#!/JeniT/status/35634274132561921. If I’m not, for example using curl, Twitter just returns a normal HTML page that contains information about the tweet that I’ve just requested.

Finally, if I request the _escaped_fragment_ version of the hash-bang URI, http://twitter.com/?_escaped_fragment_=/JeniT/status/35634274132561921, the result is a 301 Moved Permanently redirection to the hashless URI http://twitter.com/JeniT/status/35634274132561921, which can be retrieved as above.

Requesting a status that doesn’t exist, such as http://twitter.com/#!/JeniT/status/1, in the browser results in a page that at least tells you the content doesn’t exist. Requesting the equivalent _escaped_fragment_ URI redirects to the hashless URI http://twitter.com/JeniT/status/1. Requesting this results in a 404 Not Found result as you would expect.

Advantages of Hash URIs

Why are these sites using hash-bang URIs? Well, hash URIs in general have four features which make them useful to client-side applications: they provide addresses for application states; they give caching (and therefore performance) boosts; they enable web applications to draw data from separate servers; and they may have SEO benefits.

Addressing

Interacting with the web is all about moving from one state to another, through clicking on links, submitting forms, and otherwise taking action on a page. Backend databases on web servers, cookies, and other forms of local storage provide methods of capturing application state, but on the web we’ve found that having addresses for states is essential for a whole bunch of things that we find useful:

  • being able to use the back button to return to previous states
  • being able to bookmark states that we want to return to in the future
  • being able to share states with other people by linking to them

On the web, the only addressing method that meets these goals is the URI. Addresses that involve more than a URI, such as “search http://example.com/ with the keyword X and click on the third link” or “access http://example.org/ with cookie X set to Y” or “access http://example.net with the HTTP header X set to Y”, simply don’t work. You can’t bookmark them or link to them or put them on the side of a bus.

Application state is complex and multi-faceted. As a web developer, you have to work out which parts of the application state need to be addressable through URIs, which can be stored on the client, and which on a server. States can be classified into four rough categories; they are associated with:

  1. having particular content in the page, such as having a particular thread open in a webmail application
  2. viewing a particular part of the content, such as a particular message within a thread that is being shown in the page
  3. having a particular view of the content, such as which folders in a navigational folder list are collapsed or expanded
  4. a user-interface feature, such as whether a drop-down menu is open or closed

States that have different content almost certainly need to have different URIs so that it’s possible to link to that content (the web being nothing without links). At the other extreme, it’s very unlikely that the state of a drop-down menu would need to be captured at all. In between is a large grey area, where a web developer might decide not to capture state at all, to capture it in the client or the server, or to make it addressable by giving it a URI.

If a web developer chooses to make a state addressable through a URI, they again have choices to make about which part of the URI to use: should different states have different domains? different paths? different query parameters? different fragment identifiers? Hash URIs make states addressable that developers might otherwise leave unaddressable.

To give some examples, on legislation.gov.uk we have decided to:

  • use the path to indicate a particular piece of content (eg which section of an item of legislation you want to look at), for example /ukpga/1985/67/section/6
  • use query parameters for particular views on that content (eg whether you want to see the timeline associated with the section or not), for example /ukpga/1985/67/section/6?view=timeline&timeline=true
  • use fragment identifiers to jump to subsections, for example /ukpga/1985/67/section/6#section-6-2
  • also use fragment identifiers for enhanced views (eg when viewing a section after a text search), for example /ukpga/1985/67/section/6#text%3Dschool%20bus

The last of these states would probably have gone un-addressed if we couldn’t use a hash URI for it.
The only changes that this enhanced view makes to the normal page are currently to the links to other legislation content, so that you can go (back) to a highlighted table of contents (though we hope to expand it to provide in-section highlighting). Given that we rely heavily on caching to provide the performance that we want, and that there’s an infinite variety of free-text search terms, it’s simply not worth the performance cost of having a separate base URI for those views.

Caching and Parallelisation

Fragment identifiers are currently the only part of a URI that can be changed without causing a browser to refresh the page (though see the note below). Moving to a different base URI — changing its domain, path or query — means making a new request on the server. Having a new request for a small change in state makes for greater load on the server and a worse user experience due both to the latency inherent in making new requests and the large amount of repeated material that has to be sent across the wire.

Note: HTML5 introduces pushState() and replaceState() methods in its history API that enable a script to add new URIs to the browser’s history without the browser actually navigating to that page. This is new functionality, at the time of writing only supported in Chrome, Safari and Firefox (and not completely in any of them) and unlikely to be included in IE9. When this functionality is more widely adopted, it will be possible to change state to a new base URI without causing a page load.

When a change of state involves simply viewing a different part of existing content, or viewing it in a different way, a hash URI is often a reasonable solution. It supports addressability without requiring an extra request. Things become fuzzier when the same base URI is used to support different content, where transclusion is used. In these cases, the page that you get when you request the base URI itself gets content from the server as one or more separate AJAX requests based on the fragment identifier. Whether this ends up giving better performance depends on a variety of factors, such as:

  • How large are the static portions of the page (served directly) compared to the dynamic parts (served using AJAX)? If the majority of the content is static as a user moves through the site, you’re going to benefit from only loading the dynamic parts as state changes.
  • Can different portions of the page be requested in parallel? These days, making many small requests may lead to better performance than one large one.
  • Can the different portions of the page be cached locally or in a CDN? You can make best use of caches if the rapidly changing parts of a page are requested separately from the slowly changing parts.
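
To make the note above about the history API concrete, here is a minimal sketch (in TypeScript; loadSectionViaAjax is a hypothetical stand-in for whatever code fetches and displays the content) of changing state to a hashless URI where pushState() exists, and falling back to a hash URI where it does not:

// Hypothetical content loader; assumed to exist elsewhere in the application.
declare function loadSectionViaAjax(path: string): void;

function showSection(path: string): void {
  loadSectionViaAjax(path);
  if (typeof history.pushState === "function") {
    history.pushState({ path }, "", path);   // the address bar shows the hashless URI
  } else {
    location.hash = "#!" + path;             // fallback: capture the state in the fragment
  }
}

// Restore the state when the user navigates with the back/forward buttons.
window.addEventListener("popstate", (event: PopStateEvent) => {
  if (event.state && event.state.path) {
    loadSectionViaAjax(event.state.path);
  }
});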

Distributed Applications

Hash URIs can also be very useful in distributed web applications, where the code that is used to provide an interface pulls in data from a separate, unconnected source. Simple examples are mashups that use data provided by different sources, requested using AJAX, and combine that data to create a new visualisation. But more advanced applications are beginning to emerge, particularly as a reaction to silo sites such as Google and Facebook, which lock us in to their applications by controlling our data. From the unhosted manifesto:

To be unhosted, a website’s code will need to be very ajaxy first, so that all the servers do is store and serve json data. No server-side processing. This is because we need to switch from transport-layer encryption to client-side payload encryption (we no longer necessarily trust the server we’re talking to). From within the app’s source code, that should run entirely in JavaScript and HTML5, json-objects can be stored, retrieved, sent, and received. The user will have the same experience (we even managed to avoid needing a plugin), but the website is unhosted in the sense that the servers you talk to only see encrypted data and don’t even know which application you are running.

The aim of unhosted is to separate application code from user data. This divides servers (at least functionally) into those that store and make available user data, and those that host applications and any supporting code, images and so on. The important feature of these sites is that user data never passes through the web application’s server. This frees users to move to different applications without losing their data. This doesn’t necessarily stop the application server from doing any processing, including URI-based processing; it is only that the processing cannot be based on user data — the content of the site. Since this content is going to be accessed through AJAX anyway, there’s little motivation for unhosted applications to use anything other than local storage and hash URIs to encode state.

SEO

A final reason for using hash URIs that I’ve seen cited is that it increases the page rank for the base URI, because as far as a search engine is concerned, more links will point to the same base URI (even if in fact they are pointing to a different hash URI). Of course this doesn’t apply to hash-bang URIs, since the point of them is precisely to enable search engines to distinguish between (and access content from) URIs whose base URI is the same.

Disadvantages of Hash URIs

So hash-bangs can give a performance improvement (and hence a usability improvement), and enable us to build new kinds of web applications. So what are the arguments against using them?

Restricted Access

The main disadvantages of using hash URIs generally to support AJAX state arise because they have to be interpreted by Javascript. This immediately causes problems for:

  • users who have chosen to turn off Javascript because:
    • they have bandwidth limitations
    • they have security concerns
    • they want a calmer browser experience
  • clients that don’t support Javascript at all, such as:
    • search engines
    • screen scrapers
  • clients that have buggy Javascript implementations that you might not have accounted for, such as:
    • older browsers
    • some mobile clients

The most recent statistic I could find, about access to the Yahoo home page, indicates that up to 2% of access is from users without Javascript (they excluded search engines). According to a recent survey, about the same percentage of screen reader users have Javascript turned off. This is a low percentage, but if you have large numbers of visitors it adds up. The site that I care most about, legislation.gov.uk, has over 60,000 human visitors a day, which means that about 1,200 of them will be visiting without Javascript. If our content were completely inaccessible to them we’d be inconveniencing a large number of users.

Brittleness

Depending on hash-bang URIs to serve content is also brittle, as Gawker found. If the Javascript that interprets the fragment identifier is temporarily inaccessible or unable to run in a particular browser, any portions of a page that rely on Javascript also become inaccessible.

Replacing HTTP

There are other, less obvious, impacts which occur when you use a hash-bang URI. The URI held in the HTTP Referer header “MUST NOT include a fragment”. As Mike Davies noted, this prevents such URIs from showing up in server logs, and stops people from working out which of your pages are linking to theirs. (Of course, this might be a good thing in some circumstances; there might be aspects of the state of a page that you’d rather a referenced server not know about.)

You should also consider the impact on the future-proofing of your site. When a server knows the entirety of a URI, it can use HTTP mechanisms to indicate when pages have moved, gone, or never existed. With hash URIs, if you change the URIs you use on your site, the Javascript that interprets the fragment identifier needs to be able to recognise and support any redirections, missing, or never-existing pages. The HTTP status code for the wrapper page will always be 200 OK, but will be meaningless.

Even if your site structure doesn’t change, if you use hash-bang URIs as your primary set of URIs, you’re likely to find it harder to change back to using hashless URIs in the future. Again, you will be reliant in perpetuity on Javascript routing to decipher the hash-bang URI and redirect it to a hashless URI.

Lack of Differentiation

A final factor is that fragment identifiers can become overcrowded with state information. In a purely hash-URI-based site, what if you wanted to jump to a particular place within particular content, shown with a particular view? The hash URI has to encode all three of these pieces of information. Once you start using hash-bang URIs, there is no way to indicate within the URI (for search engines, for example) that a particular piece of the URI can be ignored when checking for equivalence. With normal hash URIs, there is an assumption that the fragment identifier can basically be ignored; with hash-bang URIs that is no longer true.

Good Practice

Having looked at the advantages and disadvantages, I would echo what seems to be the general sentiment around traditional server-based websites that use hash-bang URIs: pages that give different content should have different base URIs, not just different fragment identifiers. In particular, if you’re serving large amounts of document-oriented content through hash-bang URIs, consider swapping things around and having hashless URIs for the content, which then transclude the large headers, footers and side bars that form the static part of your site.

However, if you are running a server-based, data-driven web application and your primary goal is a smooth user experience, it’s understandable why you might want to offer hash URIs for your pages to the 98% of people who can benefit from them, even for transcluded content. In these cases I’d argue that you should practise progressive enhancement:

  1. support hashless URIs which do not simply redirect to a hash URI, and design your site around those
  2. use hash-bang URIs as suggested by Google rather than simple hash URIs
  3. provide an easy way to get the sharable, hashless URI for a particular page when it is accessed with a hash-bang URI
  4. use hashless URIs within links; these can be overridden with onclick listeners for those people with Javascript; using the hashless URI ensures that ‘Copy Link Location’ will give a sharable URI (steps 4 to 6 are sketched in code at the end of this section)
  5. use the HTML5 history API where you can to add or replace the relevant hashless URI in the browser history as state changes
  6. ensure that only those visitors that both have Javascript enabled and do not have support for HTML5’s history API have access to the hash-bang URIs, by using Javascript to, for example:
    • redirect to a hash-bang URI
    • rewrite URIs within pages to hash-bang URIs
    • attach onclick listeners to links
  7. support the _escaped_fragment_ query parameter, the result of which should be a redirection to the appropriate hashless URI

This is roughly what Twitter has done, except that it doesn’t make it easy to get the hashless URI from a page or from links within the page. Of course the mapping in Twitter’s case is the straightforward removal of the #! from the URI, but as a human it’s frustrating to have to do this by hand.

The above measures ensure that your site will remain as accessible as possible to all users and provide a clear migration path as the HTML5 history API gains acceptance. The slight disadvantage is that encouraging people to use hashless URIs for links means that you can no longer depend quite so much on caching, as the first page that people access in a session might be any page (whereas with a pure hash-bang scheme everyone goes to the same initial page).

Distributed, client-based websites can take the same measures — the application’s server can send back the same HTML page regardless of the URI used to access it; Javascript can pull information from a URI’s path as easily as it can from a fragment identifier. The biggest difficulty is supporting the static page through the _escaped_fragment_ convention without passing user data through the application server. I suspect we might find a third class of service arise: trusted third-party proxies using headless browsers to construct static versions of pages without storing either data or application logic. Time will tell.
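
To make steps 4 to 6 of the list above concrete, here is a minimal sketch (in TypeScript; fetchAndTransclude is a hypothetical stand-in for the site’s own AJAX loading code) in which links carry hashless URIs, script intercepts clicks to transclude the content, and only browsers without the history API are redirected to a hash-bang URI. The _escaped_fragment_ handling from step 7 lives on the server, so it is not shown here.

// Hypothetical function that fetches the content for a path and slots it
// into the wrapper page; assumed to exist elsewhere in the application.
declare function fetchAndTransclude(path: string): void;

document.addEventListener("click", (event: MouseEvent) => {
  const link = (event.target as Element).closest("a");
  if (!link || link.origin !== location.origin) return;  // leave other links alone
  event.preventDefault();

  const path = link.pathname + link.search;  // the hashless URI held in the markup
  if (typeof history.pushState === "function") {
    fetchAndTransclude(path);
    history.pushState({ path }, "", path);   // sharable, hashless URI in the address bar
  } else {
    // Only browsers without the history API ever see a hash-bang URI; this
    // loads the wrapper page once and lets its script interpret the fragment.
    location.href = "/#!" + path;
  }
});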

The Deeper Questions

There are some deeper issues here regarding web architecture. In the traditional web, there is a one-to-one correspondence between the representation of a resource that you get in response to a request from a server, and the content that you see on the page (or a search engine retrieves). With a traditional hash URI for a fragment, the HTTP headers you retrieve for the page are applicable to the hash URI as well. In a web application that uses transclusion, this is not the case.

Note: It’s also impossible to get metadata about hash URIs used for real-world or abstract things using HTTP; in these cases, the metadata about the thing can only be retrieved by interpreting the data within the page (eg an RDF document). Whereas with the 303 See Other pattern for publishing linked data it’s possible to use a 404 Not Found response to indicate a thing that does not exist, there is no equivalent with hash URIs.

Perhaps this is what lies at the root of my feeling of unease about them. With hash-bang URIs, there are in fact three (or more) URIs in play: the hash-bang URI (which identifies a wrapper page with particular content transcluded within it), a base URI (which identifies the wrapper HTML page) and one or more content URIs (against which AJAX requests are made to retrieve the relevant content). Requests to the base URI and the content URIs provide us with HTTP status codes and headers that describe those particular representations. The only way of discovering similar metadata about the hash-bang URI itself is through the _escaped_fragment_ query parameter convention, which maps the hash-bang URI into a hashless URI that can be requested.

Does this matter? Do hash-bang URIs “break the web”? Well, to me, “breaking the web” is about breaking the implicit socio-technical contract that we enter into when we publish websites. At the social level, sites break the web when they withdraw support for URIs that are widely referenced elsewhere, hide content behind register- or pay-walls, or discriminate against those who suffer from disabilities or low bandwidth. At the technical level, it’s when sites lie in HTTP. It’s when they serve up pages with the title “Not Found” with the HTTP status code 200 OK. It’s when they serve non-well-formed HTML as application/xhtml+xml.

These things matter because we base our own behaviour on the contract being kept. If we cannot trust major websites to continue to support the URIs that they have coined, how can we link to them? If we cannot trust websites to provide accurate metadata about the content that they serve, how can we write applications that cache or display or otherwise use that information?

On their own, pages that use Javascript-based transclusion break both the social side (in that they limit access to those without Javascript) and the technical side (in that they cannot properly use HTTP) of the contract. But contracts do get rewritten over time. The web is constantly evolving and we have to revise the contract as new behaviours and new technologies gain adoption. The _escaped_fragment_ convention gives a lifeline: a method of programmatically discovering how to access the version of a page without Javascript, and of discovering metadata about it through HTTP. It is not a pretty pattern (I would much prefer that the server returned a header containing a URI template that described how to create the hashless equivalent of a hash-bang URI, and to have some rules about the parsing of a hash-bang fragment identifier so that it could include other fragment identifiers) but it has the benefit of adoption.

In short, hash-bang URIs are an important pattern that will be around for several years because they offer many benefits compared to their alternatives, and because HTML5’s history API is still a little way off general support.
Rather than banging the drum against hash-bang URIs, we need to try to make them work as well as they can by:

  • berating sites that use plain hash URIs for transcluded content
  • encouraging sites that use hash-bang URIs to follow some good practices such as those I outlined above
  • encouraging applications, such as browsers and search engines, to automatically map hash-bang URIs into the _escaped_fragment_ pattern when they do not have Javascript available

We also need to keep a close eye on emerging patterns in distributed web applications to ensure that these efforts are supported in the standards on which the web is built.

New opportunities for linked data nose-following

For those of you interested in deploying RDF on the Web,
I’d like to draw your attention to three new proposed standards from IETF,
“Web Linking”,
“Defining Well-Known URIs”,
and “Web Host Metadata”,
that create new follow-your-nose tricks that could be used by semantic web clients to obtain RDF connected to a URI – RDF that presumably defines what the URI ‘means’ and/or describes the thing that the URI is supposed to refer to.

Most semantic web application developers are probably familiar with three ways to nose-follow from a URI:

  1. For # URIs – for X#F, the document X tells you about <X#F>
  2. When the response to GET X is a 303 – the redirect target tells you about <X>
  3. When the response to GET X is a 200 – the content may tell you about <X>

In case 3, X refers to what I’ll call a “web page” (a more technical term is used in the TAG’s httpRange-14 resolution). One of the new RFCs extends case 3 to situations where the RDF can’t be embedded in the content, either because the content-type doesn’t provide a place to put it (e.g. text/plain) or because for administrative reasons the content can’t be modified to include it (e.g. a web archive that has to deliver the original bytes faithfully). The others cover this case as well as offering improved performance in case 2.

Web pages as RDF subjects

Before getting into the new nose-following protocols, I’ll amplify case 3 above by listing a few applications of RDF in which a web page occurs as a subject. I’ll rather imprecisely call such RDF “metadata”.

  1. Bibliographic metadata – tools such as Zotero might be interested in obtaining Dublin Core, BIBO, or other citation data for the web page.
  2. Stability metadata – for annotation and archiving purposes it may be useful to know whether the page’s content is committed to be stable over time (e.g. this has changing content versus this has unchanging content). See TimBL’s Generic Resources note.
  3. Historical and archival metadata – it is useful to have links to other versions of a document – including future versions.

All sorts of other statements can be made about a web page, such as a type (wiki page, blog post, etc.), SKOS concepts, links to comments and reviews, duration of a recording, how to edit, who controls it administratively, etc. Anything you might want to say about a web page can be said in RDF.

Embedded metadata is easy to deploy and to access, and should be used when possible. But while embedded metadata has the advantages of traveling around with the content, a protocol that allows the server responsible for the URI to provide metadata over a separate “channel” has two advantages over embedded metadata: First, the metadata doesn’t have to be put into the content; and second, it doesn’t have to be parsed out of the content. And it’s not either/or: There is no reason not to provide metadata through both channels when possible.

Link: header

The ‘Web Linking’ proposed standard defines the HTTP Link: header, which provides a way to communicate links rooted at the requested resource. These links can either encode interesting information directly in the HTTP response, or provide a link to a document that packages metadata relevant to the resource.

In the former case, one might have:

Link: <http://xmlns.com/foaf/0.1/Document>;
  rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

meaning that the request URI refers to something of type foaf:Document. In the latter case one might have:

Link: <http://example.com/about/foo.rdf>;
  rel="describedby"; type=application/rdf+xml

meaning that metadata can be found in
<http://example.com/about/foo.rdf>, and hinting that the
latter resource might have a ‘representation’ with media type
application/rdf+xml.
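
As a rough illustration of how a client might use this, here is a minimal present-day sketch (in TypeScript, using the fetch API, which post-dates this post; the Link-header parsing is deliberately naive and a real client would use a proper RFC 5988 parser):

// Discover a "describedby" link for a URI via the Link: header, returning
// the absolute URI of the metadata document, or null if none is advertised.
async function findDescribedBy(uri: string): Promise<string | null> {
  const response = await fetch(uri, { method: "HEAD" });
  const linkHeader = response.headers.get("Link");
  if (!linkHeader) return null;

  // Loose parsing: split on commas and look for rel="describedby".
  for (const field of linkHeader.split(",")) {
    if (/rel="?describedby"?/i.test(field)) {
      const target = field.match(/<([^>]+)>/);
      if (target) return new URL(target[1], uri).toString();  // resolve relative targets
    }
  }
  return null;
}

// const metadataUri = await findDescribedBy("http://example.com/foo");
// if (metadataUri) { /* fetch it and parse the RDF it contains */ }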

Host-wide nose-following rules

The motivation for the “well-known URIs” RFC is to collect all “well-known URIs” (analogous to “robots.txt”) in a single place, a root-level “.well-known” directory, and create a registry of them to avoid collisions. The most pressing need comes from protocols such as webfinger and OpenID; see Eran Hammer-Lahav’s blog post for the whole story.

For linked data, .well-known provides an opportunity for providing metadata for web pages, as well as improving the efficiency of obtaining RDF associated with other “slash URIs”, which is currently done using 303 responses.

Ever since the TAG’s httpRange-14 decision in 2005, there have been concerns that it takes two round trips to collect RDF associated with a slash URI. While some might question why those complaining aren’t using hash URIs, in any case the “well-known URIs” mechanism gives a way to reduce the number of round trips in many cases, eliminating many GET/303 exchanges.

The trick is to obtain, for each host, a generic rule that will transform the URI at that host that you want RDF for into the URI of a document that carries that RDF. This generic rule is stored in a file residing in the .well-known space at a path that is fixed across all hosts. That is: to find RDF for http://example.com/foo, follow these steps:

  1. obtain the host name, “example.com”
  2. form the URI with that host name and the path “/.well-known/host-meta”, i.e. “http://example.com/.well-known/host-meta”
  3. if not already cached, fetch the document at that URI
  4. in that document, find a rule generically transforming original-URI -> about-URI
  5. apply the rule to “http://example.com/foo”, obtaining (say) “http://example.com/about/foo”
  6. find RDF about “http://example.com/foo” in the document “http://example.com/about/foo”

The form of the about-URI is chosen by the particular host, e.g. “http://example.com/foo,about” or “http://about.example.com/foo” or whatever works best.

Why is this fewer round trips than using 303? Because you can fetch and cache the generic rule once per site. The first use of the rule still costs an extra round trip, but subsequent URIs for a given site can be nose-followed without any extra web accesses.
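
Here is a minimal sketch (in TypeScript) of those steps. It assumes the host-meta document is an XRD file whose generic rule takes the form of a Link element with rel="describedby" and a template attribute containing a {uri} placeholder; the exact rule format is a matter for the relevant specs, so treat the details as illustrative only.

// Cache of host -> about-URI template, so the host-meta document is fetched
// at most once per host (this is where the round trips are saved).
const hostMetaCache = new Map<string, string | null>();

async function aboutUriFor(subjectUri: string): Promise<string | null> {
  const host = new URL(subjectUri).host;                        // step 1
  if (!hostMetaCache.has(host)) {
    const hostMetaUri = `http://${host}/.well-known/host-meta`; // step 2
    const xrd = await (await fetch(hostMetaUri)).text();        // step 3
    const doc = new DOMParser().parseFromString(xrd, "application/xml");
    const link = Array.from(doc.getElementsByTagName("Link"))
      .find((l) => l.getAttribute("rel") === "describedby");    // step 4
    hostMetaCache.set(host, link?.getAttribute("template") ?? null);
  }
  const template = hostMetaCache.get(host);
  if (!template) return null;
  return template.replace("{uri}", subjectUri);                 // step 5
}

// The resulting about-URI is then fetched (step 6) to obtain RDF about the
// subject URI, e.g. RDF about <http://example.com/foo>.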

A worked example can be found here.

Next steps

As with any new protocol, figuring out exactly how to apply the new proposed standards will require coordination and consensus-building. For example, the choice of the “describedby” link relation and the “host-meta” well-known URI need to be confirmed for linked data, and agreement reached on whether the use of multiple Link: headers is in good taste or poor taste. (Link: and .well-known put interesting content in a peculiarly obscure place and it might be a good idea to limit their use.) Consideration should be given to Larry Masinter’s suggestion to use multiple relations reflecting different attitudes the server might have regarding the various metadata sources: for example, the server may choose to announce that it wants the Link: metadata to override any embedded metadata, or vice versa. Agreement should be reached on the use of Link: and host-meta with redirects (302 and so on) – personally I think it would be a great thing, as you could then use a value-added forwarding service to provide metadata that the target host doesn’t or can’t provide.

This is not a particularly heavy coordination burden; the design odds-and-ends and implementations are all simple. The impetus might come from inside W3C (e.g. via SWIG) or bottom-up. All we really need to get this going are a bit of community discussion, a server, and a cooperating client, and if the protocols actually fill a need, they will take off.

For past TAG work on this topic, please see TAG issue 62
and the “Uniform Access to Metadata”
memo.

Thanks for a great 15 years at W3C

After 15 years working with all of you all around the world on Web technologies and standards, I’m taking a position as a Biomedical Informatics Software Engineer in the Department of Biostatistics at the University of Kansas Medical Center.

The new job starts in just another week or two; I’ll update the contact information and such on my home page before I’m done here.
While my new position is likely to keep me particularly busy for a few months, I hope to surface in Mad Mode from time to time; it’s a blog where I’m consolidating writing on free software, semantic web research, and other things I’m mad-passionate about.

Thanks to all of you who contribute to the work at W3C; I’m proud of a lot of things that we built together. And thanks to all my mentors and collaborators who taught me, helped me, and challenged me.

The Web is an incredibly important part of so many parts of life these days, and
W3C plays an important role in ensuring that it will work for everyone over the long haul. Although it’s hard to leave an organization with a mission I support, I am excited to get into bioinformatics, and I look forward to what W3C and the Web community come up with next as well.

The Mission of W3C

I’ve now been with W3C for almost three months. My first priority was to meet with the global stakeholders of the organization.

I began with W3C membership. Through meetings, phone calls, technical conferences, and informal sessions I’ve met upwards of one hundred members and have had profound conversations with many of them.

I also made a point of meeting with organizations that are part of the ecosystem within which W3C works. This includes other standards organizations, government ministers, students, researchers in Web science, and thought leaders in the industry.

I also reached out to organizations that “should” be in W3C. Often this includes presenting our activities and roadmaps. I’ve reached over one thousand people in this way.

And it was important to do this on a global basis. During these two and a half months I travelled to eight countries, but have spoken to participants from many other locations.

The primary purpose of all of these meetings was to listen. W3C has been an effective organization, but any organization can do better. What are the stakeholders of W3C asking from us?

Four primary requests

W3C has established principles including Web for All and Web on Everything. We’ve established a technical vision as well. There is broad agreement on these principles and technical vision.

People are asking us to be more tangible and specific in how we achieve this.

There are many ways of summarizing the requests, but four recurring themes best capture the idea. W3C needs to:

  1. Drive a Global and accessible Web. There is little dispute that we should work towards a Web for All. But so many are deprived of sufficient access – for reasons of handicap, language, poverty, and illiteracy – that we need a stronger technical program to improve the situation.
  2. Provide a Better Value Proposition for Users. Everyone is a consumer and everyone is an author. Yet our focus has been on vendors that build products. We need to complement that with a better user focus.
  3. Make W3C the best place for new standards work. I blogged last month about the expanding Web platform. There is so much new innovation and we must encourage the community to bring their work rapidly to W3C.
  4. Strengthen our core mission. With the expansion of innovation on the Web, we cannot do it all. We must be very crisp about what we achieve in W3C, what companion organizations achieve, and how we relate.

Having identified clear imperatives, we are building teams that will look at each of these topics. Typically a team involves W3C staff, participating members, and outside experts. I expect to update you from time to time as this work gets underway.

One more focus area

As we try to improve the global accessible Web, the Web of Users, new standards work, and strong delivery of a core mission, there is a legitimate danger that we will find more work to do without the resources to do it. So we will also make sure that this clearer exposition of our mission is aligned with the resources required to complete that mission!

Why does the address bar show the tempolink instead of the permalink?

An important feature of HTTP is the temporary redirect, where a resource can have a “permanent” URI while its content moves from place to place over time. For example,
http://purl.org/syndication/history/1.0 remains a constant name for that resource even though its location (as specified by a second URI) changes from time to time.

If this is such a useful feature, then why does the browser address
bar show the temporary URI instead of the permanent one? After all,
the permanent one is the one you want to copy and paste to email, to
bookmark, to place in HTML documents, and so on. The HTTP
specification says to hang on to the permanent link (“since the
redirection MAY be altered on occasion, the client SHOULD continue to
use the Request-URI for future requests.”). Tim Berners-Lee says the
same thing in User Agent watch points
(1998): “It is
important that when a user agent follows a “Found” [302] link that the
user does not refer to the second (less persistent) URI. Whether
copying down the URI from a window at the top of a document, or making
a link to the document, or bookmarking it, the reference should
(except in very special cases) be to the original URI.”
Karl Dubost amplifies this in his 2001-2003 W3C Note Common User
Agent Problems: “Do not
treat HTTP temporary redirects as permanent redirects…. Wrong: User
agents usually show the user (in the user interface) the URI that is
the result of a temporary (302 or 307) redirect, as they would do for
a permanent (301) redirect.”
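
In code, the behaviour these admonitions describe might look something like the following minimal sketch (in TypeScript, using the present-day fetch API, which follows redirects itself); the point is simply that the requested URI, not the post-redirect URI, is what gets kept for bookmarking and linking:

interface FetchedPage {
  permalink: string;  // the URI that was requested; use this to bookmark or link
  location: string;   // where the content actually came from (the tempolink)
  body: string;
}

async function fetchKeepingPermalink(requestUri: string): Promise<FetchedPage> {
  const response = await fetch(requestUri);  // any 3xx redirects are followed here
  return {
    permalink: requestUri,                   // "continue to use the Request-URI"
    location: response.url,                  // the temporary home of the content
    body: await response.text(),
  };
}

A real user agent would also want to distinguish permanent (301) redirects, where updating the stored URI is appropriate, from temporary (302 or 307) ones; plain fetch does not expose the intermediate status codes, so a fuller implementation would need lower-level access to the redirect chain.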

So why do browsers ignore the RFC and these repeated admonitions?
Possibly due to lack of awareness of the issue, but more likely
because the status quo is seen as protecting the user. If
the original URI (the permalink) were shown we might have the following scenario:

  1. an attacker discovers a way
    to establish a 3xx redirect from
    http://w3.org/resources/looksgood to
    http://phishingsite.org/pretendtobew3 – either because w3.org
    is being careless, or because of a conscious decision to deed part
    of its URI space to other parties

  2. user sees address bar = http://w3.org/resources/looksgood with
    content X, and concludes that the X is attributable
    to the resource http://w3.org/resources/looksgood

  3. user treats the http://w3.org/ prefix as an informal credential
    and treats the http://w3.org/resources/looksgood content as
    coming from W3C (without any normative justification; they just
    do) when in fact it’s a phishing site pretending to be W3C

  4. user enters their W3C password into phishing form, etc.

Were the user to observe address bar = http://phishingsite.org/pretendtobew3 with the same content, she
might suspect an attack and decline to enter a password.

An attacker might make use of an explicit redirection service on a site similar to that provided by purl.org, or it might exploit a redirect script that takes a URL as part of the
query string, e.g.
http://w3.org/redirect?uri=http://phishingsite.org/pretendtobew3 .

This line of reasoning is documented in the Wikipedia article URL redirection and its references and
in Mozilla bug 68423.

There are two possible objections. One is that the server in these
cases is in error – it shouldn’t have allowed the redirects if it
didn’t really mean for the content source to speak on behalf of the
original resource (similar to an iframe or img element). The other is
that the user is in error – s/he shouldn’t be making authorization
decisions based on the displayed URI; other evidence such as a
certificate should be demanded. Unfortunately, while correct in
theory, neither of these considerations is very compelling.

If browser projects are unwilling to change address bar behavior – and
it seems unlikely that they will – is there any other remedy?

Perhaps some creative UI design might help. Displaying the permalink
in addition to the tempolink might be nice, so that it could be
selected (somehow) for bookmarking, but that might be confusing and
take too much screen real estate. One possible partial solution would
be an enhancement to the bookmark creation dialog. In Firefox on
selecting “Bookmark This Page” one sees a little panel with text
fields “name” and “tags” and pull-down “folder”. What if, in the case
of a redirection, there were an additional control that gave the
option of bookmarking the permalink URI in place of the substitute
URI? With further thought I bet someone could devise a solution that would work for URI copy/paste as well.

(Thanks to Dan Connolly, other TAG members, and David Wood for their
help with this note.)

Default Prefix Declaration

1. Disclaimer

The ideas behind the proposal presented here are neither
particularly new nor particularly mine. I’ve made the effort to
write this down so anyone wishing to refer to ideas in this space
can say “Something along the lines of [this posting]” rather than
“Something, you know, like, uhm, what we talked about, prefix
binding, media-type-based defaulting, that stuff”.

2. Introduction

Criticism of XML
namespaces
as an appropriate mechanism for enabling distributed
extensibility for the Web typically targets two issues:

  1. Syntactic complexity
  2. API complexity

Of these, the first is arguably the more significant, because
the number of authors exceeds the number of developers by a large
margin. Accordingly, this proposal attempts to address the first
problem, by providing a defaulting mechanism for namespace prefix
bindings which covers the 99% case.

3. The proposal

Binding
  Define a trivial XML language which provides a means to associate prefixes with namespace names (URIs);
Invoking from HTML
  Define a link relation dpd for use in the (X)HTML header;
Invoking from XML
  Define a processing instruction xml-dpd and/or an attribute xml:dpd for use at the top of XML documents;
Defaulting by Media Type
  Implement a registry which maps from media types to a published dpd file;
Semantics
  Define a precedence, which operates on a per-prefix basis, namely xmlns: >> explicit invocation >> application built-in default >> media-type-based default, and a semantics in terms of namespace information items or an appropriate data-model equivalent on the document element.
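
A minimal sketch (in TypeScript) of that precedence rule, treating each source of bindings as a simple prefix-to-namespace-URI map:

// Resolve a prefix against the binding sources in the order the proposal
// gives: xmlns: declarations win, then an explicitly invoked dpd file, then
// the application's built-in defaults, then the media-type-based defaults.
type PrefixBindings = Map<string, string>;

function resolvePrefix(
  prefix: string,
  xmlnsDeclarations: PrefixBindings,    // from xmlns:p="..." attributes in scope
  invokedDpd: PrefixBindings,           // from a dpd file invoked by the document
  applicationDefaults: PrefixBindings,  // built into the processing application
  mediaTypeDefaults: PrefixBindings     // from the media-type registry
): string | undefined {
  return (
    xmlnsDeclarations.get(prefix) ??
    invokedDpd.get(prefix) ??
    applicationDefaults.get(prefix) ??
    mediaTypeDefaults.get(prefix)
  );
}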

4. Why prefixes?

XML namespaces provide two essentially distinct mechanisms for
‘owning’ names, that is, preventing what would otherwise be a name
collision by associating names in some way with some additional
distinguishing characteristic:

  1. By prefixing the name, and binding the prefix to a particular
    URI;
  2. By declaring that within a particular subtree,
    unprefixed names are associated with a particular URI.

In XML namespaces as they stand today, the association with a
URI is done via a namespace declaration
which takes the form of an attribute, and whose impact is scoped to
the subtree rooted at the owner element of that attribute.

Liam Quin
has proposed
an additional, out-of-band and defaultable,
approach to the association for unprefixed names, using
patterns to identify the subtrees where particular URIs apply. I’ve
borrowed some of his ideas about how to connect documents to prefix
binding definitions.

The approach presented here is similar-but-different, in that its primary
goal is to enable out-of-band and defaultable associations of namespaces
to names with prefixes, with whole-document scope. The
advantages of focussing on prefixed names in this way are:

  • Ad-hoc extensibility mechanisms typically use prefixes.
    The HTML5 specification already has at least two of these:
    aria- and data-;
  • Prefixed names are more robust in the face of arbitrary
    cut-and-paste operations;
  • Authors are used to them: For example XSLT stylesheets and W3C
    XML Schema documents almost always use explicit prefixes
    extensively;
  • Prefix binding information can be very simple: just a set of
    pairs of prefix and URI.

Provision is also made for optionally specifying a binding for the default namespace at the document element, primarily for the media type registry case, where it makes sense to associate a primary namespace with a media type.

5. Example

If this proposal were adopted, and a dpd document such as this one, intended for use in HTML 4.01 or XHTML1:

<dpd ns="http://www.w3.org/1999/xhtml">
<pd p="xf" ns="http://www.w3.org/2002/xforms"/>
<pd p="svg" ns="http://www.w3.org/2000/svg"/>
<pd p="ml" ns="http://www.w3.org/1998/Math/MathML"/>
</dpd>

was registered against the text/html media type, the following would result in a DOM with html and body elements in the XHTML namespace and an input element in the XForms namespace:

<html>
<body>
<xf:input ref="xyzzy">...</xf:input>
</body>
</html>

Orthogonality of Specifications


The general principle of platform design is that platforms consist of a set of standard interfaces. Standard interfaces allow substitution of components across the interface boundary, while independence of interfaces allows evolution of the interfaces themselves. In a PC, for example, the disk bus interface allows many different disk vendors to offer disk products independent of the model of display or keyboard, and the orthogonality of interfaces allows evolution of the interfaces themselves. If the display interface were linked to the disk interface too tightly, it wouldn’t be possible to evolve ISA to SATA without updating VGA.

In the web platform, the three important interfaces are transport, format and reference, and the current definitions of those interfaces are HTTP, HTML and URI. The interfaces are standard, allowing many different implementations: the HTTP standard lets you use HTTP servers from many vendors, the HTML standard lets you use many different HTML authoring tools or template systems, and the URI specification allows identification of many different components.

While HTTP is the current “common denominator” protocol that all web agents are expected to speak, the web should continue to work if web content is delivered by other protocols — FTP, shared file systems, email, instant messaging, and so forth. HTTP as it has evolved has severe difficulties, and designing a Web that only works with HTTP as it is currently implemented and deployed would be unfortunate. We should work harder to reduce the dependencies and isolate them.

HTML is the ‘lingua franca’, the common language that all agents are currently expected to be able to produce, process, read and interpret (or at least a well-defined subset of it). Having a common language is important for interoperability, but the web should also work for other formats — extensions to HTML including scripting and DOM APIs, but also other formats and application environments such as XHTML, Java, PDF, Flash, Silverlight, XForms, 3D objects, SVG, other XML languages and so forth. Certainly HTML as it has evolved is overly complex for the purposes for which it is designed.

The URI is the fundamental element of reference, but the URI itself is evolving to deal with internationalization, reference to session state, IRIs, LEIRIs, HREFs and so forth. Many applications use URIs and IRIs, not just the formats described above but other protocols and locations, including databases, directories, messaging, archiving, peer-to-peer sharing and so forth.

The web is just one of many communication applications on the global Internet; for web browsing to integrate well with the rest of distributed networking, web components should be independent of the application, and work well with messaging, instant messaging, news feeds, and so forth.

A sign of a breakdown of this architectural principle would be for a specification of a format (say HTML) to attempt to redefine, for its purposes, the protocol (say HTTP) or the method of reference (URI). The specifications should be independent, or at least have their dependencies isolated, minimized and reduced. If those other elements of the web architecture are incorrect, need to evolve to meet current practice or have flaws in their definitions, they need to evolve independently, so that orthogonality of the specifications and reusability of the components are promoted.

There may well be reasons to link some features of HTML to the fact that it is delivered over an interactive protocol, but linking HTML directly to HTTP in such a way that features would work only for HTTP and not for any other protocol with similar features would be unfortunate. It might not matter in the short term (that’s all we have right now) but it is harmful to the long-term evolution of the web.

(Should go without saying, but just in case: this is a personal post, not reviewed by the TAG)

Language semantics and operational meaning

W3C and other standards organizations are in the business of defining languages — conventions that organizations can choose to follow — and not in mandating operational behavior — telling organizations and participants in the network how they are supposed to behave. Organizations (implementors, operators, administrators, software developers) are free to choose which standards they adopt, and what their operational behavior will be.

In some posts on the www-tag mailing list, I was trying to point out the risks in defining languages such that the "meaning" of the language depends on operational behavior. In some ways, of course, this is a fallacy: in general, what an utterance "means" in some operational way depends on what the speaker intends and how the listener will interpret the utterance.

However, as an organization, W3C can, and should, define languages in which the meaning is defined in the document, in terms of abstractions rather than in terms of operational behavior. The result is more robust standards, those that have wider applicability, that can be used for more purposes, and that create a more vibrant and extensible web.

Search Engines take on Structured Data

Structured data on the web got a boost this week, with Google’s announcement of Rich Snippets and Rich Snippets in Custom Search. Structured data at such a large scale raises at least three issues:

  1. Syntax
  2. Vocabulary
  3. Policy

Google’s documentation shows support for both microformats and RDFa. It follows the hReview microformat syntax with small vocabulary changes (name vs fn). Support for RDFa syntax, in theory, means support for vocabularies that anyone makes; but in practice, Google is starting with a clean slate: data-vocabulary.org. That’s a place to start, though it doesn’t provide synergy with anyone who uses FOAF or Dublin Core or the like to share their data.

The policy questions are perhaps the most difficult. Structured data is a pointy instrument; if anyone can say anything about anything, surely the system will be gamed and defrauded. Google’s rollout is one step at a time, starting with some trusted sites and an application process to get your site added. The O’Reilly interview with Guha and Hansson is an interesting look at where they hope to go after this first step; if you’re curious about how this fits in to HTML standards, see Sam Ruby’s microdata.

While issues remain (there are syntactic i’s to dot and t’s to cross, and even larger policy issues to work out), between Google’s rollout, Yahoo’s searchmonkey, and the UK Central Office of Information rollout, it seems that the industry is ready to take on the challenges of using structured data in search engines.