Persistent Reference on the Web

For the October 2010 TAG F2F
Jonathan Rees
13 October 2010
15 and 25 October 2010: deletions, {additions}

This document is {in very rough form and is} likely to be revised. When citing please look for the latest revision.

There has been considerable debate and confusion on the topic of what constitutes an adequate persistent reference on the Web. This memo attempts a first-principles analysis of the question in order to establish a framework in which solutions can be compared rationally.

What is a reference?

To ensure an inclusive treatment, "reference" here means both traditional "pre-Web" references such as

Mary-Claire van Leunen.
A Handbook for Scholars.
Knopf, New York, 1985.

and Web references such as

<a href="http://www.w3.org/TR/REC-html40/">HTML 4.01 Specification</a>

I'll call the document referenced by a reference its target.

A reference can (but needn't) indirect through a specific catalog, as in any of

K. 626
PMID:16899496
doi:10.1093/bib/bbl025

{15 Oct added 'doi:'}

The targets of these references are: a composition by Mozart that is indexed in the Köchel catalog; a scholarly article that is indexed in the US National Library of Medicine bibliographic database; and the same article, which happens to be known to the handle system. Each of these forms is well recognized inside a substantial community (musicians, biologists, librarians[?]). Similarly we have

http://www.w3.org/TR/REC-html40/

which refers to a technical specification by indirecting through a well-known catalog system known as the Web.

What do you do with a reference?

The scenario under discussion is that of a general user or robot with a reference in hand, using a well-known apparatus or method to "chase" the reference and obtain the target document. The reference was not created with that particular user in mind; the reference might be seen by anyone (or anyone in some substantial community), so no special knowledge can be assumed. Of course some knowledge must be assumed, such as how to read the Latin alphabet or ASCII or use a browser; but not special knowledge peculiar to the user or reference.

References vary greatly in the efficiency with which they can be chased using well-known methods. To illustrate, here are some points along this spectrum:

  1. A traditional citation (such as the van Leunen example above), which must be chased by hand via a library catalog or bookseller.
  2. A search-engine query on title and author, which is fast but requires a manual post-pass to pick the intended document out of the 'hits'.
  3. A handle or DOI, which becomes actionable once it is mapped to an http: URI understood by a resolver.
  4. An http: URI, which is directly actionable by any browser using today's infrastructure.

What is persistent reference?

I'll define persistence as survival beyond events that one would expect to imply extinction, such as the survival of an author's works beyond his/her death, or the survival of a product beyond the life of the company making it. As properties go, persistence is somewhat peculiar: because it refers to the future, there is no test for it, so any assessment of persistence is speculative.

By persistent reference I mean the persistent ability to chase a reference. In order for a reference to be persistent, therefore, the reference itself and the target must be persistent, and during its lifetime there has to be some well-known apparatus, perhaps different ones at different times, that is competent to chase the reference.

Time scales of 10 to 100 years are typically cited in discussions of document and reference persistence. Usually what's of interest is survival beyond the particular projects or people that created the document, or the apparatus that initially enabled the reference.

Failure modes, preventions, and remediations

The ideal reference is both fast (can be chased quickly and automatically) and persistent (can be chased over the long run). This section surveys failure modes and ways that failures can be either prevented or remedied.

Persistent reference requires that there be a working reference-chasing apparatus through time. In today's architecture the apparatus might consist of server(s), network(s), and client(s). To maintain function the apparatus needs either to not break, or to be replaced or repaired when it does break. Since any apparatus typically has many parts, prevention and repair can be effected in many different places.

Following are major failure modes for persistent reference, and with each a grab-bag of techniques proposed or in use, intended as illustration. No endorsement of any technique is intended.

Failure: The target doesn't exist.

  1. Prevention: Get copies of the target into credible repositories.
  2. Prevention: Make the target so popular or important that many parties are looking after it. Documents whose value is recognized are less likely to be lost.

See Masinter 2006 [tbd: link, see below] for a thorough treatment of document persistence.

Failure: The target can't be found.

{That is, it exists somewhere, but the party seeking it doesn't know where.}

  1. Prevention: The reference should include information that all hosting repositories can understand.
  2. Prevention: Provide redundant information in the reference: include metadata, physical location, "identifier" strings, etc.

Failure: The retrieved document is ambiguous or wrong.

For example, search engines are in many ways the ideal apparatus for reference-chasing; the problem is that the results delivered are not necessarily either unique or correct, so a manual postpass is required to locate the desired document among the many 'hits' returned.

  1. Prevention: Make the reference be very specific and clear (cf. traditional references).
  2. Prevention: Incorporate elements that are unlikely to be reused (e.g. dates, UUIDs, tag: URIs).
  3. Prevention: Provide document checksum in the reference (cf. Dataverse).
  4. Prevention: Provide UUID or other "unique" string in reference and in or near document, so they can be compared (cf. Rich Pyle [TBD: find reference])
  5. Prevention: Rely on a persistent "authority" infrastructure responsible for collision avoidance (e.g. the IETF URN scheme registry, or DNS to the extent it is persistent).
  6. Prevention: Be careful who/what you trust.
  7. Remediation: Compare document metadata against the reference: title, date, checksum, etc. (a checksum comparison is sketched just below this list).
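
To make the checksum idea concrete, here is a minimal sketch in Python. It is illustrative only, not the scheme used by Dataverse or any other particular system; the URL is taken from this memo and the hash value is a placeholder that a real reference would have recorded when the reference was created.

# Minimal sketch: a reference that carries a locator plus fixity information,
# and a check that the retrieved bytes are the intended document.
import hashlib
import urllib.request

def fetch(url):
    """Retrieve the raw bytes of the document at the given URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def matches_reference(url, expected_sha256):
    """Chase the reference, then confirm the target is the intended document."""
    actual = hashlib.sha256(fetch(url)).hexdigest()
    return actual == expected_sha256

# Hypothetical reference carrying metadata, a locator, and a checksum.
# The checksum would have been recorded when the reference was created.
reference = {
    "title": "HTML 4.01 Specification",
    "url": "http://www.w3.org/TR/REC-html40/",
    "sha256": "0e1f...placeholder...",
}

if matches_reference(reference["url"], reference["sha256"]):
    print("retrieved document matches the reference")
else:
    print("mismatch: wrong document, altered document, or broken reference")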

Failure: The reference is actionable, but action requires human intervention (is "slow").

  1. Prevention: Also include http: URI(s) in the reference.
  2. Prevention: Place trust wisely (who is able to ensure that this particular reference "works" for a long time?).
  3. Prevention or remediation: Map slow reference forms to fast forms on the server (e.g. add a prefix to convert non-http: to http:; cf. the http://handle.net/ prefix, or the idea in the proposed new urn: URI scheme registration); a sketch of this tactic follows the list.
  4. Remediation: Map slow reference forms to fast forms on the client.
  5. Remediation: Teach clients how to expedite new kinds of reference directly (e.g. the LSID Firefox plugin).
  6. Remediation: Replace the ICANN DNS root with another DNS root that behaves "better" (thought experiment only).
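
To make the prefix-mapping tactic in item 3 concrete, here is a minimal sketch in Python. The hdl.handle.net, dx.doi.org, and ncbi.nlm.nih.gov/pubmed resolver prefixes are real services at the time of writing, but the table itself is illustrative, and maintaining it over time is exactly the burden the tactic takes on.

# Minimal sketch of prefix mapping: rewrite a "slow" reference form (handle,
# DOI, PMID) as an http: URI that today's infrastructure can act on directly.
# The point of the tactic is that this table can be updated later if a
# different resolver has to take over.
RESOLVER_PREFIXES = {
    "hdl:":  "http://hdl.handle.net/",
    "doi:":  "http://dx.doi.org/",
    "pmid:": "http://www.ncbi.nlm.nih.gov/pubmed/",
}

def expedite(reference):
    """Map a slow reference form to a fast (http:) form, if we know how."""
    lowered = reference.lower()
    for prefix, resolver in RESOLVER_PREFIXES.items():
        if lowered.startswith(prefix):
            return resolver + reference[len(prefix):]
    return reference  # already fast, or a form we don't know how to expedite

# Examples drawn from this memo:
print(expedite("hdl:1721.1/36048"))        # http://hdl.handle.net/1721.1/36048
print(expedite("doi:10.1093/bib/bbl025"))  # http://dx.doi.org/10.1093/bib/bbl025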

Server-side reference maintenance and mapping may be the dominant persistent-reference strategy on the Web today. A dead reference (404) is often fixed at the source, not in the apparatus connecting the source to the target. We send email to the webmaster, or the webmaster does an audit, and often the desired document can be found somewhere, and the reference is updated.

Similarly, generic tactics such as adding a prefix to convert a handle to an http: URI are widely deployed. This is fine if either the http: reference itself is durable, or if the source-side mapping can change over time to track changes in available services. (handle.net works now, but a different service may have to take over in the future.)

Unfortunately it is common for a reference to be stored in such a way that it can't be "improved" before presentation to a user. For example, the reference may be stored in a private file that does not enjoy constant maintenance, or it may be stored in an archive that by policy or license terms must not be modified.

{Failure: The retrieved document is incomprehensible (a.k.a. "bit rot").

File formats go extinct, and with them the documents that use them. - TBD.}

Placing bets

In choosing one form of reference over another - traditional vs. electronic, urn: vs. http:, and so on - one is placing a bet that the chosen form will be adequately persistent, or at least more persistent than the form not chosen. Communicating the reference is an act of faith, and an assessment of persistence includes an assessment of the interests and prospects of all institutions that would be involved in stewardship of the reference and its target.

If responsibility for different parts of a reference is spread among multiple institutions - as it is, at present, in the case of URIs - then persistence is only as good as the weakest participant.

Some common considerations when assessing potential persistence bets:

  1. Ubiquity
  2. Competence
  3. Values
  4. Safety net

Let's look at some examples of persistent reference schemes to see how they size up as persistence risks.

Köchel listing: The listing itself is small and well-known among musicians (it's even in Wikipedia), and Mozart is so popular that all of his works are well indexed and highly replicated. This system requires no particular institutional support. But the number itself (K. 626) is not automatically actionable.

MIT DSpace repository: (example: hdl:1721.1/36048) {hdl: added} Chasing these references relies on the handle system and on MIT, both of which seem pretty good bets (the details would take us too far afield I think). The competence to chase bare handle strings is not widespread, but this deficit is remedied by mapping the handle to an elaborated reference that contains a corresponding http: URI (http://hdl.handle.net/1721.1/36048). If http: falls into disfavor in the future, or handle.net is threatened, a different server-side mapping can be substituted.

Crossref DOIs: {The Digital Object Identifier (DOI) system has wide penetration in academic publishing and is trusted as a source of reliable identifiers. The primary force behind DOIs is Crossref, which is funded by the publishing industry. (Traditionally the publishing industry has left responsibility for persistence up to libraries and archives, so this arrangement is an experiment.) Because of their success, a careful threat analysis of Crossref DOIs is warranted, examining the same issues that affect any putatively persistent identifier system. In particular, provisioning depends on Crossref's publisher/members, so we should ask what happens when they withdraw from the system; and Crossref itself may have organizational vulnerabilities related to, say, database replica licensing or its organizational succession plan, that threaten longevity. Of course, as with any such system, as these identifiers become increasingly important to the community, they gain resistance to failure, since even if Crossref suddenly disappeared, another organization could step in to recover the index and provide the needed services.}

{from earlier draft:} Crossref is a service provided to the publishing industry, which by tradition, market forces, and appropriate division of labor is not expected to be involved in persistence. While the Crossref management very likely considers persistence to be very important, this is not the primary purpose or function of Crossref, so DOIs should be handled carefully as references. However, the DOI has become so important that it probably falls under the category of "too big to fail" - if something went wrong with the system, there would be a scramble among its customers to repair it.

HTTP URIs at w3.org: http://www.w3.org/... is similar to the DSpace case. IETF, which nominally controls the URI scheme namespace (http:), has persistence and stability in general as a core value and has gained widespread trust; anyhow they wouldn't be respected if they tried to change the meaning of 'http:' in an unfriendly way. The weak link is probably ICANN, which does not have persistence as a core value; there is also a nagging feeling that W3C might lose its domain name registration, although if anyone can keep a domain name registered indefinitely then one would think W3C can. If ICANN were to unilaterally end access to W3C's documents via the domain name www.w3.org, it is likely that the community of users of www.w3.org URIs would rally to recover this access by bypassing ICANN. w3.org is probably "too big to fail".

WebCite: http://webcitation.org/... has earned the favor of the publishing industry as an archive of selected Web documents and a source of stable URIs for them. webcitation.org likely has many of the same properties as www.w3.org, such as appropriate core values (sorry, I do not have many details at present). If its current steward fails, it will probably be taken over by its customers.

OCLC PURL server: http://purl.org/... Of course the "p" in the acronym is as usual wishful thinking; most purl.org URIs have proven ephemeral. That does not mean that some of them aren't good bets. While OCLC is a stable, central institution, purl.org's vulnerabilities come from the uncertain business model for the service (distraction), from the usual discomfort of the OCLC/ICANN relationship, from the fragility of purl.org's authorization framework (what if authority to repair redirection is lost through death or bankruptcy?), and of course from all the vulnerabilities of the organizations responsible for the secondary reference that is the target of the redirect.

Those not comfortable with my "too big to fail" analysis of ICANN dependence {i.e. w3.org references are not vulnerable to w3.org losing its domain name registration because the safety net would arrange for ICANN to be bypassed in that event} might be motivated to attempt to carve out a new part of the domain name system that really has institutional backing from ICANN for persistence, not just leases requiring perpetual re-registration. For example - I am not really proposing this, just conducting a thought experiment - one might convince ICANN to create a top-level 'urn' domain that has multilateral community backing to operate according to the URN scheme registrations on deposit with IETF, i.e. http://foo.urn/123 would by universal agreement have the same meaning as urn:foo:123, whether the author of the URN registration (or whoever) has paid their registration fees or not.

Conservative practitioners will probably not want to rely on a URI or any other short "identifier" string alone for reference - they would continue to provide references as a combination of conventional metadata (slow to act on) and machine-actionable forms such as some http: URIs. It might be nice if this practice of hybrid references were codified a little bit, perhaps using RDFa; at present the metadata is rarely parseable by machine and each publisher has its own way to present it, and you can't even tell automatically where one reference ends and the next starts.
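
As an illustration of what such a hybrid reference might look like as data, here is a minimal sketch in Python. The field names are ad hoc inventions for this memo, not a proposed vocabulary; RDFa or some other format would be the natural carrier in a published page.

# Minimal sketch of a hybrid reference: conventional metadata (slow to act on,
# but robust) bundled with machine-actionable forms.
html4_reference = {
    "title":   "HTML 4.01 Specification",
    "creator": "W3C",
    "date":    "1999-12-24",
    # machine-actionable forms, to be tried first:
    "locators": ["http://www.w3.org/TR/REC-html40/"],
}

def chase(reference):
    """Prefer the fast, machine-actionable forms; fall back on the metadata
    (catalogs, search engines, a librarian) when none of them work."""
    for uri in reference.get("locators", []):
        return uri                 # hand to an HTTP client
    return reference["title"]      # hand to a search engine or a human

print(chase(html4_reference))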

Persistence and http:

The notion that URNs and handles are persistent references and http: URIs aren't is easily seen as simplistic when it is recognized that persistence is a function not of technology but of the intentions and competence of the institutions involved in access. Legions of URNs have already failed as persistent references, and many http: URIs are likely to succeed.

On the other hand, http: is the recommended way to "identify" something according to Web architecture, and has the obvious advantage over other kinds of references of being directly actionable using today's infrastructure. http: therefore deserves careful consideration as a persistent reference mechanism.

Today's http: apparatus relies on the ICANN-based domain name system, and therefore seems to have an inherent weakness due to ICANN's lack of commitment to stability and persistence. I've mentioned several workarounds for this, including taking ICANN out of the loop, negotiating with ICANN, and vigilance around registrations.

I've suggested that domain names that are "too big to fail", such as w3.org, ought to be good persistence bets. But what about smaller operators, such as minor institutional archives that are competently managed and replicated, but which in the event of an accessibility compromise (domain name registration loss) would not be able to rally any of the remedies given above? These organizations (and their successors) are the primary market of the handle system and URNs. Can they advocate simple http: URIs as adequate references?

Well, obviously they can rely on http://hdl.handle.net/..., which is probably another "too big to fail" domain. Another solution would be reliance on a new "persistent" DNS zone as sketched above.

{Remaining questions: Is persistent reference an appropriate use of bare http: URIs? If not, what is the recommended alternative? If so, what advice do we give to the community regarding assessment of the persistence of any particular http: URI?}

Tangential subjects

TBD: Expand into a longer 'discussion' section

  1. Only-persistently-unique names (not necessarily chaseable, but never repurposed)
  2. References to non-documents ("names" in general)
  3. Uniqueness: When can we use differences/samenesses in references to detect differences/samenesses in documents?
  4. Memento time-travel protocol
  5. Document vs. changing document (draft series, news feed, log file, etc.)

Relation to ISSUE-50

TAG issue 50 (URNs and registries) is not primarily about persistence, but understanding persistence seems to remain the most challenging barrier to the formulation of any kind of policy recommendation for persistent references. I hope this memo sheds some light on ISSUE-50, or at least helps to organize the problem space.

Further reading

Larry Masinter and Michael Welch.
A system for long-term document preservation.
IS&T Archiving 2006 Conference.
http://larry.masinter.net/0603-archiving.pdf

Larry Masinter.
Problems URIs Don't Solve.
Presentation at TWIST 99, The Workshop on Internet-scale Software Technologies, Internet Scale Naming.
http://larry.masinter.net/9909-twist.pdf

TBD: expand reading list

Acknowledgements

Thanks to MacKenzie Smith for insight into the minds of archivists and librarians, to Henry Thompson for his prior work on ISSUE-50, to Larry Masinter for general advice, and to Alan Ruttenberg for draft comments.