For the October 2010 TAG F2F
Jonathan Rees
13 October 2010
15 and 25 October 2010: deletions, {additions}
This document is {in very rough form and is} likely to be revised. When citing please look for the latest revision.
There has been considerable debate and confusion on the topic of what constitutes an adequate persistent reference on the Web. This memo attempts a first-principles analysis of the question in order to establish a framework in which solutions can be compared rationally.
To ensure an inclusive treatment, "reference" here means both traditional "pre-Web" references such as
Mary-Claire van Leunen.
A Handbook for Scholars.
Knopf, New York, 1985.
and Web references such as
<a href="http://www.w3.org/TR/REC-html40/">HTML 4.01 Specification</a>
I'll call the document referenced by a reference its target.
A reference can (but needn't) indirect through a specific catalog, as in any of
K. 626
PMID:16899496
doi:10.1093/bib/bbl025
The targets of these references are: a composition by Mozart that is indexed in the Köchel catalog; a scholarly article that is indexed in the US National Library of Medicine bibliographic database; and the same article, which happens to be known to the handle system. Each of these forms is well recognized inside a substantial community (musicians, biologists, librarians[?]). Similarly we have
http://www.w3.org/TR/REC-html40/
which refers to a technical specification by indirecting through a well-known catalog system known as the Web.
The scenario under discussion is that of a general user or robot with a reference in hand, using a well-known apparatus or method to "chase" the reference and obtain the target document. The reference was not created with that particular user in mind; the reference might be seen by anyone (or anyone in some substantial community), so no special knowledge can be assumed. Of course some knowledge must be assumed, such as how to read the Latin alphabet or ASCII or use a browser; but not special knowledge peculiar to the user or reference.
References vary greatly in the efficiency with which they can be chased using well-known methods. To illustrate this, here are some points on this spectrum:
I'll define persistence as survival beyond events that one would expect to imply extinction, such as the survival of an author's works beyond his or her death, or the survival of a product beyond the life of the company making it. As properties go, persistence is somewhat peculiar: because it refers to the future, there is no test for it, so any assessment of persistence is speculative.
By persistent reference I mean the persistent ability to chase a reference. In order for a reference to be persistent, therefore, the reference itself and the target must be persistent, and during its lifetime there has to be some well-known apparatus, perhaps different ones at different times, that is competent to chase the reference.
Time scales of 10 to 100 years are typically cited in discussions of document and reference persistence. Usually what's of interest is survival beyond the particular projects or people that created the document, or the apparatus that initially enabled the reference.
The ideal reference is both fast (can be chased quickly and automatically) and persistent (can be chased over the long run). This section surveys failure modes and ways that failures can be either prevented or remedied.
Persistent reference requires that there be a working reference-chasing apparatus through time. In today's architecture the apparatus might consist of server(s), network(s), and client(s). To maintain function the apparatus needs either not to break, or to be replaced or repaired when it does break. Since any apparatus typically has many parts, prevention and repair can be effected in many different places.
The following are major failure modes for persistent reference, each with a grab-bag of techniques, proposed or in use, intended as illustration. No endorsement of any technique is intended.
See Masinter and Welch 2006 (see the reading list below) for a thorough treatment of document persistence.
{That is, it exists somewhere, but the party seeking it doesn't know where.}
For example, search engines are in many ways the ideal apparatus for reference-chasing; the problem is that the results delivered are not necessarily either unique or correct, so a manual postpass is required to locate the desired document among the many 'hits' returned.
Server-side reference maintenance and mapping may be the dominant persistent-reference strategy on the Web today. A dead reference (404) is often fixed at the source, not in the apparatus connecting the source to the target. We send email to the webmaster, or the webmaster does an audit, and often the desired document can be found somewhere, and the reference is updated.
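To make "fixed at the source" concrete, here is a small illustrative sketch (in Python; the paths and the shape of the table are mine, not any particular server's): a webmaster-maintained table of dead paths and their current locations, consulted before answering 404.

    # Illustrative only: server-side reference maintenance as a mapping table.
    # Paths that have gone dead are pointed at the documents' new homes.
    MOVED = {
        "/reports/2004/draft-spec.html": "/archive/2004/draft-spec.html",   # hypothetical
    }

    def respond(path):
        """Return an (HTTP status, Location) pair for a requested path."""
        if path in MOVED:
            return (301, MOVED[path])   # permanent redirect to the new location
        return (404, None)              # still dead; someone emails the webmaster

    print(respond("/reports/2004/draft-spec.html"))   # (301, '/archive/2004/draft-spec.html')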
Similarly, generic tactics such as adding a prefix to convert a handle to an http: URI are widely deployed. This is fine if either the http: reference itself is durable, or if the source-side mapping can change over time to track changes in available services. (handle.net works now, but a different service may have to take over in the future.)
Unfortunately a reference that is stored in such a way that it can't be "improved" before presentation to a user is a common case. For example, the reference may be stored in a private file that does not enjoy constant maintenance, or it may be stored in an archive that by policy or license terms must not be modified.
{File formats go extinct, and with them the documents that use them. - TBD.}
In choosing one form of reference over another - traditional vs. electronic, urn: vs. http:, and so on - one is placing a bet that the chosen form will be adequately persistent, or at least more persistent than the form not chosen. Communicating the reference is an act of faith, and an assessment of persistence includes an assessment of the interests and prospects of all institutions that would be involved in stewardship of the reference and its target.
If responsibility for different parts of a reference is spread among multiple institutions - as it is, at present, in the case of URIs - then persistence is only as good as the weakest participant.
Some common considerations when assessing potential persistence bets:
Ubiquity
Competence
Values
Safety net
Let's look at some examples of persistent reference schemes to see how they size up as persistence risks.
Köchel listing: The listing itself is small and well-known among musicians (it's even in Wikipedia), and Mozart is so popular that all of his works are well indexed and highly replicated. This system requires no particular institutional support. But the number itself (K. 626) is not automatically actionable.
MIT Dspace repository: (example: hdl:1721.1/36048) Chasing these references relies on the handle system and on MIT, both of which seem pretty good bets (the details would take us too far afield, I think). The competence to chase bare handle strings is not widespread, but this deficit is remedied by mapping the handle to an elaborated reference that contains a corresponding http: URI (http://hdl.handle.net/1721.1/36048). If http: falls into disfavor in the future, or handle.net is threatened, a different server-side mapping can be substituted.
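As a sketch of the prefixing tactic (in Python; the resolver prefix is today's handle.net service, and nothing prevents substituting another resolver later):

    # Sketch only: a bare handle becomes an actionable http: URI by prepending a
    # resolver prefix. The prefix is configuration, not part of the reference,
    # so a different mapping can be substituted in the future.
    HANDLE_RESOLVER = "http://hdl.handle.net/"

    def handle_to_http(handle):
        """Map 'hdl:1721.1/36048' (or a bare '1721.1/36048') to an http: URI."""
        if handle.startswith("hdl:"):
            handle = handle[len("hdl:"):]
        return HANDLE_RESOLVER + handle

    print(handle_to_http("hdl:1721.1/36048"))
    # -> http://hdl.handle.net/1721.1/36048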
Crossref DOIs: {The Digital Object Identifier (DOI) system has wide penetration in academic publishing and is trusted as a source of reliable identifiers. The primary force behind DOIs is Crossref, which is funded by the publishing industry. (Traditionally the publishing industry has left responsibility for persistence up to libraries and archives, so this arrangement is an experiment.) Because of their success, a careful threat analysis of Crossref DOIs is warranted, examining the same issues that affect any putatively persistent identifier system. In particular, provisioning depends on Crossref's publisher members, so we should ask what happens when they withdraw from the system; and Crossref itself may have organizational vulnerabilities - related to, say, database replica licensing or its organizational succession plan - that threaten longevity. Of course, as with any such system, as these identifiers become increasingly important to the community, they gain resistance to failure, since even if Crossref suddenly disappeared, another organization could step in to recover the index and provide the needed services.}
{from earlier draft:} Crossref is a service provided to the publishing industry, which by tradition, market forces, and appropriate division of labor is not expected to be involved in persistence. While the Crossref management very likely considers persistence to be very important, this is not the primary purpose or function of Crossref, so DOIs should be handled carefully as references. However, the DOI has become so important that it probably falls under the category of "too big to fail" - if something went wrong with the system, there would be a scramble among its customers to repair it.
HTTP URIs at w3.org: http://www.w3.org/... is similar to the Dspace case. IETF, which nominally controls the URI scheme namespace (http:), has persistence and stability in general as core values and has gained widespread trust; anyhow it wouldn't be respected if it tried to change the meaning of 'http:' in an unfriendly way. The weak link is probably ICANN, which does not have persistence as a core value; there is a nagging feeling that W3C might lose its domain name registration, although if anyone can keep a domain name registered indefinitely, one would think W3C can. If ICANN were to unilaterally end access to W3C's documents via the domain name www.w3.org, it is likely that the community of users of www.w3.org URIs would rally to recover this access by bypassing ICANN. w3.org is probably "too big to fail".
Webcite: http://webcitation.org/... has earned the favor of the publishing industry as an archive of selected Web documents and a source of stable URIs for them. webcitation.org likely has many of the same properties as www.w3.org, such as appropriate core values (sorry, I do not have many details at present). If its current steward fails, it will probably be taken over by its customers.
OCLC PURL server: http://purl.org/... Of course the "p" in the acronym is, as usual, wishful thinking; most purl.org URIs have proven ephemeral. That does not mean that some of them aren't good bets. While OCLC is a stable, central institution, purl.org's vulnerabilities come from the uncertain business model for the service (distraction), from the usual discomfort of the OCLC/ICANN relationship, from the fragility of purl.org's authorization framework (what if authority to repair a redirection is lost through death or bankruptcy?), and of course from all the vulnerabilities of the organizations responsible for the secondary reference that is the target of the redirect.
Those not comfortable with my "too big to fail" analysis of ICANN dependence {i.e. w3.org references are not vulnerable to w3.org losing its domain name registration because the safety net would arrange for ICANN to be bypassed in that event} might be motivated to attempt to carve out a new part of the domain name system that really has institutional backing from ICANN for persistence, not just leases requiring perpetual re-registration. For example - I am not really proposing this, just conducting a thought experiment - one might convince ICANN to create a top-level 'urn' domain that has multilateral community backing to operate according to the URN scheme registrations on deposit with IETF, i.e. http://foo.urn/123 would by universal agreement have the same meaning as urn:foo:123, whether the author of the URN registration (or whoever) has paid their registration fees or not.
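Purely as part of that thought experiment, the proposed convention reduces to a trivial rewrite; here is a sketch in Python (the 'foo' namespace is a placeholder, as above, no such top-level domain exists, and URN escaping details are ignored):

    # Thought experiment only: rewrite between a URN and the imagined 'urn'
    # top-level-domain form. Percent-encoding and other URN syntax details ignored.
    def urn_to_http(urn):
        """'urn:foo:123' -> 'http://foo.urn/123' under the imagined convention."""
        scheme, namespace, nss = urn.split(":", 2)
        assert scheme == "urn"
        return "http://" + namespace + ".urn/" + nss

    def http_to_urn(uri):
        """'http://foo.urn/123' -> 'urn:foo:123' under the imagined convention."""
        host, _, path = uri[len("http://"):].partition("/")
        return "urn:" + host[:-len(".urn")] + ":" + path

    print(urn_to_http("urn:foo:123"))         # http://foo.urn/123
    print(http_to_urn("http://foo.urn/123"))  # urn:foo:123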
Conservative practitioners will probably not want to rely on a URI or any other short "identifier" string alone for reference - they would continue to provide references as a combination of conventional metadata (slow to act on) and machine-actionable forms such as some http: URIs. It might be nice if this practice of hybrid references were codified a little bit, perhaps using RDFa; at present the metadata is rarely parseable by machine and each publisher has its own way to present it, and you can't even tell automatically where one reference ends and the next starts.
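To sketch what a codified hybrid reference might carry at the data level (the field names here are invented for illustration; an actual proposal would pick a vocabulary and express the same thing in RDFa within the document):

    # Sketch only: a hybrid reference pairs conventional metadata (slow to act
    # on, but durable) with a machine-actionable form. Field names are made up.
    hybrid_reference = {
        "author": "Mary-Claire van Leunen",
        "title": "A Handbook for Scholars",
        "publisher": "Knopf, New York",
        "year": 1985,
        "actionable": None,                  # no http: URI available for this one
    }

    web_reference = {
        "title": "HTML 4.01 Specification",
        "actionable": "http://www.w3.org/TR/REC-html40/",
    }

    def chase(reference):
        """Prefer the fast, actionable form; otherwise hand the metadata to a human."""
        return reference["actionable"] or reference

    print(chase(web_reference))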
The notion that URNs and handles are persistent references and http: URIs aren't is easily seen as simplistic when it is recognized that persistence is a function not of technology but of the intentions and competence of the institutions involved in access. Legions of URNs have already failed as persistent references, and many http: URIs are likely to succeed.
On the other hand, http: is the recommended way to "identify" something according to Web architecture, and has the obvious advantage over other kinds of references of being directly actionable using today's infrastructure. http: therefore deserves careful consideration as a persistent reference mechanism.
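To illustrate "directly actionable": a minimal sketch (in Python, using only the standard library; the URI is the one used as an example throughout this memo) of chasing an http: reference with stock infrastructure. Any redirects along the way, such as a purl.org or hdl.handle.net hop, are followed automatically.

    # Minimal sketch: dereference an http: reference with nothing but a standard
    # library. Redirects encountered along the way are followed for us.
    import urllib.request

    def chase(uri):
        with urllib.request.urlopen(uri) as response:
            return response.geturl(), response.read()

    final_uri, body = chase("http://www.w3.org/TR/REC-html40/")
    print(final_uri, len(body), "bytes")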
Today's http: apparatus relies on the ICANN-based domain name system, and therefore seems to have an inherent weakness due to ICANN's lack of commitment to stability and persistence. I've mentioned several workarounds for this, including taking ICANN out of the loop, negotiating with ICANN, and vigilance around registrations.
I've suggested that domain names that are "too big to fail", such as w3.org, ought to be good persistence bets. But what about smaller operators, such as minor institutional archives that are competently managed and replicated, but which in the event of an accessibility compromise (domain name registration loss) would not be able to rally any of the remedies given above? These organizations (and their successors) are the primary market of the handle system and URNs.
Can they advocate simple http: URIs as adequate references?
Well, obviously they can rely on http://hdl.handle.net/..., which is probably another "too big to fail" domain. Another solution would be reliance on a new "persistent" DNS zone as sketched above.
{Remaining questions: Is persistent reference an appropriate use of bare http: URIs? If not, what is the recommended alternative? If so, what advice do we give to the community regarding assessment of the persistence of any particular http: URI?}
TBD: Expand into a longer 'discussion' section
TAG issue 50 (URNs and registries) is not primarily about persistence, but understanding persistence seems to remain the most challenging barrier to the formulation of any kind of policy recommendation for persistent references. I hope this memo sheds some light on ISSUE-50, or at least helps to organize the problem space.
Larry Masinter and Michael Welch.
A system for long-term document preservation.
IS&T Archiving 2006 Conference.
http://larry.masinter.net/0603-archiving.pdf
Larry Masinter.
Problems URIs Don't Solve.
Presentation at TWIST 99, The Workshop on Internet-scale Software
Technologies, Internet Scale Naming.
http://larry.masinter.net/9909-twist.pdf
TBD: expand reading list
Thanks to MacKenzie Smith for insight into the minds of archivists and librarians, to Henry Thompson for his prior work on ISSUE-50, to Larry Masinter for general advice, and to Alan Ruttenberg for draft comments.