Use Case 10 - Persistent URIs
Use Case: Persistent URIs
Tim Berners-Lee famously said that “cool URIs don’t change”. Unfortunately, the structures of government can shift almost as rapidly as the Web itself is changing. Departments are split, merged and renamed. New domain names are created and Websites reconfigured, making it very difficult to maintain persistent URIs.
The Web provides a means for Governments to disseminate transient and ephemeral information in a cost-effective way to millions. Conversely, the Web is increasingly used as an interface to repositories of information that enable access over long periods. For many government departments and agencies a tension can arise between optimising websites for effective communication, with only topical and relevant information available, and ensuring that all information published is maintained and available for future reference.
Government departments often seek to optimise topicality and relevance to their audiences by regularly moving or removing content, while other parts of government treat their websites as a document store and expect to be able to refer to Web content published years earlier.
This may not matter in a commercial context, but when it comes to mandatory guidance and legislative matters, it becomes critical that there is a reliable historical record. It is important that people reach the latest mandatory guidance rather than earlier copies; on the other hand, earlier versions may have guided critical decisions and there needs to be access to them, e.g. in a court of law.
There is considerable evidence that URIs are not persisted by governments and consequently there is an extensive loss of information from the web. The prevalence of broken Web links impacts negatively on the reputation of government because it is perceived that government is managing its information poorly. A frustrating user experience also has the potential to reduce public confidence in the services the state provides online. Scrutiny of government is impaired by the inability to reliably refer to key government documents published on the Web.
There are a number of reasons for poor maintenance of URIs by government departments:
- changes of political administration or of policy, where the government department no longer wishes to be associated with previous statements or seeks to make it clear that there has been a change;
- rebuilding of Websites on different technical infrastructures (e.g. changing content management system, without taking care of the URIs);
- changes in the structure of government, for example when departments are split or merged, or have their responsibilities or names changed.
These are major events in the life of a government department's Website and, unless URI persistence is planned for, links will be broken, in some cases universally. The trend towards publishing official publications only electronically, with no print edition, makes the integrity of URIs crucial to the business of government.
Identified problems or limitations
Lack of standards compliance by many popular content management systems.
The UK government is implementing a cross government solution for long term URL persistence, in a project entitled “Web Continuity”.
In November 2008 The [UK] National Archives began the comprehensive archiving of the UK Government Web Estate. This involves the harvesting of content from around 1,500 Websites three times a year, with additional crawls by request. Government departments need to introduce XML sitemaps as a supplementary means of directing Website crawls. This enables more comprehensive capture of Website content, by providing information on the location of ‘hidden’ (unlinked-to) Web pages or ‘virtual’ pages generated by dynamic (CMS or database-driven) applications, to avoid missing content.
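As a rough illustration of the sitemap requirement, the sketch below builds a minimal XML sitemap in the sitemaps.org 0.9 format. The function name `build_sitemap`, the page list and the `hidden/report.html` address are hypothetical and not part of the Web Continuity project; a real department would generate the list from its CMS or database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    last_modified is an ISO date string such as "2009-01-15", or None
    when the modification date is unknown.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages: one linked page and one unlinked ("hidden") page.
    print(build_sitemap([
        ("http://www.mydepartment.gov.uk/page1.html", "2009-01-15"),
        ("http://www.mydepartment.gov.uk/hidden/report.html", None),
    ]))
```

Listing unlinked and dynamically generated pages explicitly is what lets a crawler reach content it could not discover by following links alone.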
Associated with comprehensive collection by The National Archives is the use of a software component on each government department’s web server. This component effects the desired redirection behaviour: it serves the resource from the web server if that resource still exists and, if not, initiates a check with the Web Archive to see whether the resource exists there.
The components, based on open source software, configured, tested and supplied to departments by The National Archives, have been designed to work with Microsoft Internet Information Server versions 5 and 6, and Apache versions 1.3 and 2.01. The IIS component is produced by Ionics (www.codeplex.com/IIRF). The Apache component is the mod_rewrite module: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html.
If the user requests a resource: http://www.mydepartment.gov.uk/page1.html, then:
1. If the request to the URL can be resolved, the resource is served back to the user in the normal way;
2. If the request cannot be resolved, the Web Archive is checked to see if the resource exists there. If it does, the user is served with the latest version of the resource held there, e.g. http://webarchive.nationalarchives.gov.uk/*/http://www.mydepartment.gov.uk/page1.html;
3. If the resource does not exist in the Web Archive, the user is served a “custom 404” from the original department Website, which states that the page was not found on the original site, or in the Web Archive.
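The three steps above can be sketched as a small decision function. This is only an illustration of the logic, not the component supplied by The National Archives, which is implemented as web server rewrite rules. The names `resolve`, `archive_url`, `on_live_site` and `in_web_archive` are hypothetical, the archive prefix is modelled on the example address in step 2, and the use of a 302 status for the archive redirect is an assumption.

```python
from typing import Callable, Tuple

# Assumed prefix for the latest archived capture of a page.
ARCHIVE_PREFIX = "http://webarchive.nationalarchives.gov.uk/*/"


def archive_url(original_url: str) -> str:
    """Build the Web Archive address for the given original URL."""
    return ARCHIVE_PREFIX + original_url


def resolve(
    url: str,
    on_live_site: Callable[[str], bool],
    in_web_archive: Callable[[str], bool],
) -> Tuple[int, str]:
    """Return (HTTP status, location or message) following steps 1 to 3."""
    if on_live_site(url):
        # Step 1: the live site can still serve the resource.
        return 200, url
    if in_web_archive(url):
        # Step 2: redirect to the archived copy (302 is an assumption).
        return 302, archive_url(url)
    # Step 3: custom 404 from the original department Website.
    return 404, "Page not found on the original site or in the Web Archive"


if __name__ == "__main__":
    # Hypothetical lookup tables standing in for the real existence checks.
    live = {"http://www.mydepartment.gov.uk/index.html"}
    archived = {"http://www.mydepartment.gov.uk/page1.html"}
    for page in ("index.html", "page1.html", "missing.html"):
        target = "http://www.mydepartment.gov.uk/" + page
        print(resolve(target, live.__contains__, archived.__contains__))
```

Passing the two existence checks in as functions keeps the sketch free of network access; in practice the first check is the web server's own lookup and the second is a request to the Web Archive.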
There are several benefits of this approach:
- government departments can focus on topicality with their websites (governments are never going to follow the W3C's own approach of an ever-growing website);
- there is a way of comprehensively archiving all content no longer considered topical or relevant;
- people can continue to link to content whether live or archived and all the URIs are persisted;
- links are maintained between the content, whether live or archived – important given the extent of interlinking between government websites.