W3C


URI-based Naming Systems for Science

Authors' Draft 23 May 2008

This version:
2008-05-23 16:26:06 -0400
Current version:
http://purl.org/science/report/uris-for-science
Authors:
Jonathan Rees, Science Commons
Alan Ruttenberg, Science Commons

Abstract

Formal nomenclature is essential to scientific communication because of the importance of clearly specifying the entities under discussion. It takes on additional importance as computational agents are brought in to help marshal and process the enormous amount of information available to scientists, as computational agents are unable to resolve most of the ambiguities that human readers tolerate. In this note we will examine some examples of naming systems used in science, and extrapolate from this analysis to consider the suitability and effective use of Uniform Resource Identifiers (URIs) in computationally mediated communication about science.

Status of this document

This is an authors' draft with no official standing. It was composed in response to needs expressed by members of the W3C Semantic Web Health Care and Life Sciences Interest Group. The authors have been advised that this note will not be published under the group's charter, which ends at the end of May 2008, and therefore plan to bring it to completion for publication in a different venue. Comments are welcome and may be sent to the authors at uri-note@mumble.net.

Brackets [like this] indicate changes and additions that we intend to make before final publication.

[Draft comments are in this style and for the sake of reducing clutter will be removed from versions to be made available for public review. They will be kept for the authors to consult.]

[To be done: Make a pass through entire document to add citations where needed]

This document was produced by group members operating under the disclosure obligations of the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information to public-semweb-lifesci@w3.org [public archive] in accordance with section 6 of the W3C Patent Policy.

Scope

This report is intended for those considering the use of URI-based naming, as well as those who are committed to using URI-based naming but are looking for some guidance in how to do so. It attempts to set expectations for URI-based naming systems and to answer concerns about its limitations and reliability. Some technical guidance is provided.

This report is not intended to cover the entire spectrum of uses of names and URIs, which is very broad. The ideas presented here are not intended to apply to names that have very short lifetimes, such as names used only within a private conversation; to names with limited scope, such as within a single database or organization; to names limited to a particular application or purpose; or to names that are used only as navigation links in hypermedia browsing. Instead we are concerned with URIs used as names of any kind of entity important to scientific investigations and clinical activities. Our goals are: (1) to share and use names globally via the Semantic Web, [what is that? the Internet?] (2) to encourage reuse of names for arbitrary purposes (including as yet unknown purposes), and (3) to enable effective use by computational agents. Together, these requirements are aimed at building Web scale systems that make data and knowledge integration as easy as possible. Such systems should allow more time to be spent exploiting knowledge and less time managing it.


1. Advice

For the impatient reader, we summarize our recommendations here. The remainder of the note provides our rationale for these recommendations. While this may not be the only way to do it, we feel confident that if you follow this advice you will avoid many of the pitfalls into which we have seen people fall.

  1. Use HTTP URIs as names
  2. Document names well
  3. Think about ontology. When naming, understand the difference between individuals, classes, and relations. Understand the difference between documents and the things they are about. Make clear, in your documentation, which sort of thing you are naming.
  4. Don't create a new name when there already exists a well supported name for what you want to refer to
  5. Don't use existing names if they don't mean what you do.
  6. Become familiar with existing domain ontologies and naming efforts so that it becomes easier to determine whether the name you want already exists. If you need to name a biological process, work with the Gene Ontology. [citation]
  7. Organize specialist communities that are responsible for the association of names to entities. Don't rely on publishers to establish names that are well documented and stable.
  8. A community should purchase a domain name, direct it to purl.org (CNAME), and base their names on their registered domain name. This gives them the flexibility of purl.org without the responsibility of maintaining infrastructure. Having the registered domain name allows them to migrate to a different redirect service should PURL go out of business.
  9. To ensure ongoing correct forwarding, communities should select several backup administrators for PURLs and PURL domains.
  10. Place useful information at the URI - A user who places the URI in the location bar of a web browser should get back useful documentation for the URI's referent. See section ??? that describes the redirection apparatus used by OBI.
  11. Use URIs without fragment identifiers (i.e. without "#") - they don't isolate the documentation for a URI and they don't scale to large numbers of URIs. Arrange for servers to respond with 303 to names, reply with RDF, and optionally HTML.
  12. Get other people to review your documentation.
  13. When documentation is confusing or otherwise inadequate, submit a bug report. Documentation authors should include their contact information in the documentation they write.

2. Introduction

The most important characteristic of a scientific naming system is that names are introduced through a deliberate act such as formal publication and subsequently maintained through use. The act of introducing the name, usually a written communication (either physical or electronic), has the purpose of causing the name to be used to denote some phenomenon, category, method, or other entity that arises in the process of doing scientific research. Through the initial and ongoing communication of the name's meaning, the meaning comes to permeate the literature. [JR: consider flushing following sentence] An attempt to fundamentally change what a formal name is supposed to mean is disruptive, and therefore not generally respected.

A formal name may be established through a naming system, by which we mean an articulated set of practices and techniques that lead to the establishment, for some community, of a name as having a certain meaning. For example, the Linnaean system of binomial species names [cite the wikipedia article version that was consulted] is still in use 250 years after its invention and is arguably the most successful formal naming system in history. The success of this naming system rests on two pillars. First, there was an advance in the form and manner of association of names with species introduced by Linnaeus, whose work included simplifying the form of names, which earlier had consisted of long phrases, to one in which only two words were used in a specified way. Linnaeus tested his naming system by using it to organize a vast number of plant and animal species, and its utility was proved when the publication of his work became the basis of the modern taxonomy of species.

After Linnaeus, a decentralized system emerged based on the publication of careful documentation as the manner of introducing a new name. Although there exist international organizations that act to set rules for the naming of species, to introduce a new name it is considered sufficient to publish the description in a journal of record. For example, a 1962 article in the journal Psyche [Brown62] introduces "Epitritus laticeps" as the name of an ant species that was newly discovered as of that writing. The article formalizes the association between name and species through designation of a canonical representative ("type") specimen and by careful description of that specimen. Any questions about what the name means can be resolved by consulting the specimen, or if not available, its description.

Another example of a naming system predating the computer age is that for naming minor planets. A minor planet name consists of a number and a proper name, e.g. "(3402) Wisdom". The number is issued sequentially by the Minor Planet Center (MPC). The proper name is proposed by the discoverer, vetted by the Committee on Small-Body Nomenclature (CSBN), and approved at periodic meetings of the General Assembly of the International Astronomical Union (IAU). [IAU08] The publication introducing the name records the set of observations from which one infers the existence of the minor planet.

Naming systems usefully apply to any entity that has a role in scientific discourse. To give just a few examples:

As an example of a naming system for database records, GenBank [genbank] associates a name, in the form of an accession number for a specific DNA or amino acid sequence, with the record describing the sequence.

For a naming system to function well, meanings must be made clear and consistent through documentation. Names and meanings should correspond in a way that avoids both polysemy (multiple meanings for one name) and synonyms (multiple names for one meaning). Controls for establishing novelty and unambiguity may come from various sources. A controlling organization, when there is one, can impose a review process that enforces good documentation, and oversees revisions to a name's documentation so that meaning is preserved. Limits on a system's scope may be helpful in that each name introduced into the system can only name certain kinds of things, simplifying documentation and setting expectations accordingly. However, ultimately the quality of the naming system comes from the individual efforts of users of the system.

Not only must documentation be clear and consistent, it must also be easy to discover if the naming system is to come into general use. Prior to the computer age, new names were propagated through printed communication. A name would be introduced through print publication and dissemination to libraries and individuals, and its documentation found through bibliographic research. Collections of names might be gathered into catalogs and indexes to make the process easier. Computer networks appear to be an ideal match for naming systems, as the network and associated software and servers may be used to look up a name.

However, the move from print-based discovery to network-based discovery is radical. Network-based discovery increases spontaneity by lowering the cost of dissemination, removes the need for replication (to multiple libraries), and enables rapid revision of documentation. These benefits do not come without costs, however. Spontaneity can lead to neglect of quality, lack of redundancy introduces vulnerability to infrastructure failures (server down or gone), and dynamic update introduces vulnerability to instability (new editions of documentation that change a name's meaning). Much of the challenge of using network-based discovery comes from the need to adopt the discipline necessary to prevent or mitigate these problems.

The system of naming which we will call URI-based naming is based on the Internet standards for Uniform Resource Identifiers [cite RFCs]. URI-based naming extends the original scope of URIs by extending their use from communications protocols, which use them to identify network-based resources, to applications such as knowledge representation [citation?] that require names for arbitrary things. URI-based naming has been proposed as a universal formal naming system - one that can give an arbitrary meaning to a name, subsume existing naming systems, and promote interoperability between disciplines.

Several innovations of technique come with URI-based naming: systems of schemes [define] and registries, network protocols such as the domain name system (DNS) and their associated oversight organizations, techniques for assigning globally unique names, and the behaviour of network-based protocols for communication keyed by URIs. As in the case of binomial species names, successful use of these techniques depends not only on technique but also on additional factors such as clear documentation and how well naming and documenting fit into the practice of doing science. [explain]

It is the purpose of this note to discuss the factors that may lead to successful establishment of URIs as names for use in scientific communication. Some of the factors that lead to the successful use of URIs apply to any kind of naming system, while others are specific to network-based naming or to URIs in particular. We will offer some suggestions to those who are investing in what we hope will be a powerful naming practice for the future.

Our advice around using URIs as names has two parts. One part is the same as one would give for using any kind of formal name. The goal is effective communication. We want to make sure we're understood. The second part speaks to the specific use of http URIs for naming.

3. How to use a naming system

If names be not correct, language is not in accordance with the truth of things. If language be not in accordance with the truth of things, affairs cannot be carried on to success.

Confucius, Analects, Part 13 (c. 300 BC)

[This section needs the most work. Moved material has not been integrated, and subsections are not representative of the final form.]

The purpose of this section is to orient those who think naming is easy, or who expect naming to be as casual as tagging, or who think of the semantic web as being as informal as hypertext linking, to the idea that what one asserts using names can have substantial consequences.

3.1. We name things, classes, relationships

[explicate our earlier list in these terms]

[AR: talk about different names for format variants as used in the demo - we have been specifically asked about this by different people and receive general approval for the approach. / That was a tbd]

[A more direct way to be clear about variability is to give one name to the abstraction that is common to a set of variants, and separate names to each individual variant. The common name should be well documented and should link appropriately to the individual variants. For example, consider the following three related URIs] [Use an example of records to point out the different things]

[use http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_000546.2 versus entrez gene because referent is clearer]

http://purl.org/commons/record/ncbi_gene/24866
denotes an Entrez Gene record "without commitment as to representation" - that is, the record's declarative content independent of whether the record is rendered as XML, ASN, or RDF. The information in the record may change over time as annotations are added and corrections are made, but it will always be about the same "gene" (a term that unfortunately is not defined).
http://purl.org/commons/xml/ncbi_gene/24866
denotes the XML version of the record.
http://purl.org/commons/html/ncbi_gene/24866
denotes a web page presenting information from the record in human-readable form.
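
The practice of giving one name to the abstraction and separate names to each variant can be sketched in a few lines of Python. This is only an illustration: the URI template is inferred from the three examples above and is an assumption, and the helper name variant_uris and the VARIANT_KINDS table are hypothetical rather than part of any published interface.

```python
# Hypothetical helper: the URI template is inferred from the three examples
# above and is an assumption, not a documented rule of purl.org/commons.
BASE = "http://purl.org/commons"

# Each kind names a different thing, so each gets its own URI.
VARIANT_KINDS = {
    "record": "the record, without commitment as to representation",
    "xml": "the XML version of the record",
    "html": "a web page presenting the record in human-readable form",
}

def variant_uris(database, identifier):
    """Return one URI per kind of thing for a single database entry."""
    return {kind: f"{BASE}/{kind}/{database}/{identifier}" for kind in VARIANT_KINDS}

uris = variant_uris("ncbi_gene", "24866")
for kind, uri in uris.items():
    print(f"{uri}\n    denotes {VARIANT_KINDS[kind]}")
```

The point of the sketch is that a statement such as "was last modified on" can be attached to exactly one of the three URIs, so a claim about the XML file is never mistaken for a claim about the record's content.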

[Cite: http://sw.neurocommons.org/2007/uri-explanation.html ]

3.2. Mistakes in naming

[Sharpen this story about typical scenario of choosing names]

When we learn what a new name means, we are learning how to make certain predictions about what happens when we use the name.

Consider the scenario in which we are designing a laboratory information management system (LIMS) to record information about DNA array hybridizations. Among other things, we have been asked to record information about scanners used to measure the intensity of the reporter that is a proxy for how much DNA has hybridized. We take a walk around the lab and notice two labels on one scanner: the first says "SXCM 2100", the second says "SXC11231". The first label names a class of scanners; five of the twenty scanners on the floor have the same label. The second is a serial number; among all the SXCM 2100 scanners no two have the same serial number. On the rest of the floor, ten scanners are model GD1000, and another five are model BD1050-K.

We interpret our instructions to record scanner information as recording the model number of the scanner.

When [strike: computational] biologists later analyze the results of a large series of hybridizations including technical and biological replicates, they find that there is more variance than would be expected. In an effort to understand [was: reduce] the source of variance, they try to use the scanner as one of the variables in their model. Unfortunately, despite a literature that suggests scanner variability as a source of variance, the results do not bear this out. Upon review they realize that rather than having recorded information that distinguishes each scanner, they have only enough information to distinguish different models of scanner. Since the principal source of variation is voltage to the photomultiplier tubes, and this is different in each scanner, they don't have the information they need to correct for this source of systematic error. Without this correction, the data collected from the experiment doesn't yield the level of confidence needed to definitively answer the questions for which the experiment was designed.

There are two lessons to be learned here. The first is that being able to talk effectively about names matters. Had it been clear to the LIMS implementor that what was needed was a name for each individual scanner, rather than each class (identified [danger] by model number) of scanner, the LIMS would have recorded the variable that mattered. The second lesson is that how we name things matters. Choose the wrong way of identifying things, [danger] and the consequences can be as bad as using the wrong reagent.
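
The difference between naming an individual scanner and naming a class of scanners can be made concrete in a data model. The following Python sketch is illustrative only: the Hybridization record and its field names are hypothetical, and every serial number other than "SXC11231" is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Hybridization:
    sample: str
    scanner_model: str   # names a class of scanners, e.g. "SXCM 2100"
    scanner_serial: str  # names one individual scanner, e.g. "SXC11231"

# Invented records; only "SXC11231" appears in the scenario above.
runs = [
    Hybridization("s1", "SXCM 2100", "SXC11231"),
    Hybridization("s2", "SXCM 2100", "SXC11232"),
    Hybridization("s3", "SXCM 2100", "SXC11231"),
]

def group_by(records, key):
    """Group sample names by the value of one attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, key)].append(record.sample)
    return dict(groups)

by_model = group_by(runs, "scanner_model")    # one group: scanners are indistinguishable
by_serial = group_by(runs, "scanner_serial")  # two groups: per-scanner effects can be modeled
print(by_model)
print(by_serial)
```

A LIMS that stores only scanner_model can never recover the second grouping, which is the information the analysts in the scenario needed.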

The scenario described here involved humans who were able to determine where the process broke down. When humans are not as closely involved, as will be the case as we rely on more and more computation to analyze the vast amount of data coming from scientific and clinical investigations, the consequences of mistakes such as these will be more serious.

For those who will build computational systems upon which science depends, two skills involving names are important to learn. The first is how to think carefully about what names denote; that is the domain of ontology. The second is how to ensure that what is understood survives the process of being communicated to another system; this is the subject of documentation.

3.3. Polysemy - one name, many meanings

In order for communication to occur, the sender and receiver of a message must understand names occurring in the message in the same way. If a name is used to mean more than one thing, there is a strong risk that they won't, especially if sender and receiver are automated agents.

Polysemy can arise in many different ways:

Most of these risks lead to simple quality control goals that are more or less straightforward to implement: write documentation and keep it available, don't publish your own documentation that specifies a different meaning, don't alter documentation so radically that the name's meaning changes, and use names according to their documentation. Internal documentation inconsistency is more subtle than the other risks, and therefore warrants further discussion.

One way in which documentation can be ambiguous is the well-intentioned but incorrect use of a name to designate something related to the proper referent of the name, rather than the referent itself. For example, biodiversity databases contain records about the use of species names in the scientific literature. Formal names might be assigned to the record, to the species name, or to the species itself. The temptation to reuse the name for all three purposes is natural and strong, due to the overhead of documenting names and to the complexity of articulating and observing the necessary relationships among the named entities. However, using a name for multiple purposes can lead to nonsensical inferences. Suppose 'S' is a formal name, and suppose we say 'S was created by Ed Wilson' (with 'S' denoting the species record) and 'S was collected at Plum Island' (meaning 'S' the species). Then one would incorrectly conclude that Ed Wilson created something that was collected at Plum Island.

[Footnote maybe: An indirect and somewhat awkward way to avoid creating multiple names is to be careful about the statements one makes. Instead of saying "S was collected at Plum Island," which would be nonsense if "S" denoted a record, one might say "an individual of {the species denoted by {the species name that is the subject of record S}} was collected at Plum Island", using a single complicated relationship between the record S and Plum Island in order to avoid creating a separate name for the species.]

Checkable logical relationships are valuable in detecting polysemies [Zucker07]. In the above example, if we declared that 'S' was a species record, that 'was collected at' had domain 'species', and that 'species record' and 'species' were disjoint, then the error would be detectable automatically.
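
The kind of automatic detection described here can be sketched with a toy checker. The Python below is a minimal illustration, not an OWL reasoner; all identifiers (SpeciesRecord, Species, was_collected_at, find_clashes) are hypothetical stand-ins for the declarations in the example.

```python
# Toy consistency check, not an OWL reasoner. All identifiers are
# hypothetical stand-ins for the example in the text.
types = {"S": {"SpeciesRecord"}}                      # declared: S is a species record
domains = {"was_collected_at": "Species"}             # subjects of this relation are species
disjoint = {frozenset({"SpeciesRecord", "Species"})}  # nothing is both

statements = [
    ("S", "was_created_by", "Ed Wilson"),
    ("S", "was_collected_at", "Plum Island"),
]

def find_clashes(statements, types, domains, disjoint):
    """Report statements whose subject would need two disjoint types."""
    clashes = []
    for subject, relation, obj in statements:
        required = domains.get(relation)
        if required is None:
            continue  # no domain declared, nothing to check
        for declared in types.get(subject, set()):
            if frozenset({declared, required}) in disjoint:
                clashes.append((subject, relation, obj, declared, required))
    return clashes

clashes = find_clashes(statements, types, domains, disjoint)
for subject, relation, obj, declared, required in clashes:
    print(f"{subject} is declared a {declared}, but '{relation}' requires a {required}")
```

Only the second statement is flagged. Without the three declarations the checker has nothing to test against, which is why publishing such relationships alongside a name is valuable.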

3.4. Synonymy - one meaning, many names

Synonyms dilute the effectiveness of names as well. Although the knowledge that two names are synonymous can be made known, there is no guarantee that such equivalences will be communicated and understood. Agents (computational and otherwise) unaware of a synonymy fail to make deductions that they ought to make, and even agents with access to a statement of synonymy need to be clever in order to make good use of it. Avoiding synonymy in names used for biomedicine is of particular importance given the desire to facilitate translational research. Just as specialty-specific jargon is an impediment to effective communication among the variety of disciplines needed to bridge research to clinical practice, so alternative formal names for the same entity become a barrier against effective integration of knowledge about the entity. A typical disconnect would be a case of one system using "hepatocyte" while another used "liver cell". This mismatch could translate into a failure to recognize that a drug's metabolites are toxic.
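
The failed deduction can be shown with a small sketch. The Python below is a hedged illustration: the two "systems", the statements they hold, and the synonym table are all invented placeholders, not real data.

```python
# Two hypothetical systems describing the same cell type under different names.
system_a = {("drug X metabolite", "is_toxic_to", "hepatocyte")}
system_b = {("patient 17 biopsy", "contains", "liver cell")}

def cell_types_at_risk(toxicity, observations, synonyms=None):
    """Return observed cell types that some toxicity statement mentions."""
    synonyms = synonyms or {}

    def canonical(name):
        # Map a name to its preferred form if a synonymy is known.
        return synonyms.get(name, name)

    toxic_targets = {canonical(obj) for (_, _, obj) in toxicity}
    observed = {canonical(obj) for (_, _, obj) in observations}
    return observed & toxic_targets

# Without knowledge of the synonymy the two statements never meet.
missed = cell_types_at_risk(system_a, system_b)
# With an explicit (assumed) synonym table the connection is found.
found = cell_types_at_risk(system_a, system_b, {"liver cell": "hepatocyte"})
print(missed, found)
```

The second call succeeds only because someone supplied and maintained the synonym table. Had both systems used the same name from the start, no such table would be needed.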

3.5. Better names through shared ontologies

The process of developing and sharing ontologies is, in part, aimed at avoiding such naming problems. Such efforts can help stave off polysemy, as a community process involving diverse viewpoints is likely to identify cases of polysemy. Most commonly, in such a process, there is potential for polysemy when people disagree about the correct meaning of a term. By identifying the competing meanings over which the participants are in conflict, then adding terms and documentation that let each be used, polysemy is avoided.

Synonymy is avoided by publishers doing careful searches for existing terms and ontologies before coining new names themselves, and by allying themselves with community ontology building efforts in their domain. This can sometimes be arduous work, but pays off well when both publishers and clients can trivially integrate information based simply on mention of the same name. In the next section, we discuss the question of how to document names and publish that documentation so as to increase the likelihood that others will discover and understand them. [JR: What is the defense?] [ AR: We don't talk about ontology yet, but perhaps this is the place to discuss it, as shared ontology is the principal mechanism for avoiding creating duplicate names.] [Maybe: Ontology is also one way to stave off polysemy, as community process is likely to have better quality controls than a private naming practice.]

4. Documenting names

As an example of documentation for a URI, consider figure 1, which documents the URI http://purl.obofoundry.org/obo/OBI_0000225. [581: was: The documentation links to (in http://purl.obofoundry.org/obo/obi.owl) that specifies what the name denotes (a particular class). etc] This documentation is available via HTTP and specifies that the name denotes a particular class whose members are instances that represent the role of a particular site as a place where investigations take place. [581 JR: new] The documentation gives a brief definition, a descriptive label, examples, curation information, and its relations to other terms in a larger ontology, a link to which is given in a list of resources at the top of the documentation. The ontology has a policy to use the same ID for this class as long as the class is one that is considered to make sense. Documentation can be expected to be updated as improvements are made. The documentation is in a machine-readable form (OWL). The server has arranged that the OWL form is the only form served from the site.

[Appendix with setup instructions?]

Investigation Site

Class: http://purl.obofoundry.org/obo/OBI_0000225
(and role
   (disjoint-with nutrient·role study·personnel·role patient·role regulatory·role drug·role
       study·participant·role vector·role reference·role))
definition: Investigation site is a role borne by a site realized in an investigation which is located at the site
curation status: metadata·incomplete ['?']
preferred term: Investigation site
example of usage: A field, a laboratory, a medical institute, a pharmaceutical company
definition source: source pending
editor note: solution2: site is related to trial used_in relation - site can bear the role
editor note: solution1: is a physical location, should maybe go under processual context, and then be used in conjunction with the located relation
editor note: site is a material (building) having the role site
definition editor: Jennifer Fostel
Subject of: location_of, participates_in, is_proxy_for, proper_part_of, derives_from, relationship, transformed_into, has_proper_part, is_output_of, is_realized_as, part_of, has_integral_part, derived_into, is_input_of, agent_in, has_part, integral_part_of, contains, transformation_of, has_improper_part, improper_part_of, located_in, contained_in, adjacent_to
Object of: location_of, is_proxy_for, proper_part_of, derives_from, relationship, transformed_into, has_proper_part, part_of, has_participant, has_integral_part, derived_into, has_output, has_part, has_input, integral_part_of, contains, transformation_of, is_realization_of, has_improper_part, improper_part_of, located_in, contained_in, has_role, adjacent_to, has_agent
Figure 1: Documentation, as viewed in a Web browser, of OBI_0000225: Investigation Site, as downloaded in May, 2008

The server responds 303 to the request for the URL, making it clear that it cannot serve the actual class [Fielding05], and it is upon request of the redirected-to URI that the documentation is served. The HTML that one sees in the browser is produced from the OWL via an XSLT transformation that is applied by the browser, avoiding the need for content negotiation, which might otherwise lead to polysemy.
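
The 303 behaviour described above can be observed with a small, self-contained Python sketch that runs a server on the local machine and queries it without following redirects. This is an illustration under stated assumptions, not the OBI or purl.org configuration: the two paths, the plain-text body, and the fetch helper are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths on a local test server; not the real OBI configuration.
NAME_PATH = "/obo/OBI_0000225"  # the name: denotes a class, which cannot be served
DOC_PATH = "/doc/OBI_0000225"   # a document that describes the class

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == NAME_PATH:
            # 303 See Other: the thing itself cannot be sent, but a description can.
            self.send_response(303)
            self.send_header("Location", DOC_PATH)
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == DOC_PATH:
            body = b"Documentation for the class Investigation Site"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch(port, path):
    """GET a path without following redirects; return (status, location, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", path)
    resp = conn.getresponse()
    result = (resp.status, resp.getheader("Location"), resp.read())
    conn.close()
    return result

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status, location, _ = fetch(port, NAME_PATH)     # ask for the name
doc_status, _, doc_body = fetch(port, location)  # then ask for the documentation
server.shutdown()
server.server_close()
print(status, location, doc_status, doc_body.decode())
```

A client that sees the 303 learns two things: the first URI names something other than a document, and the second URI locates a document about it.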

The domain name purl.obofoundry.org is maintained by the OBO Foundry [Smith07], a community of biomedical ontology developers and users that has committed to maintaining the domain. The domain is configured to use a CNAME record to make it a synonym for the PURL server purl.org. The PURL server exists in order to enable persistent access, via fixed URIs, to documents and documentation by providing a service that redirects the purl.org URIs to a document's actual location. There are three administrators of the PURL redirects, one from the OBO Foundry, and two from the OBI project.

By using this mechanism there is a high probability of being able to continue serving documentation for the URI as long as it is needed. purl.org and the OBI-maintained server to which its requests are forwarded have strong community support. However, should purl.org become unreliable, purl.obofoundry.org could be redirected to another service to do the redirection. Should the OBI-maintained server need to move to a different host, purl.org can be used to redirect to the new server.

[Comment that commonly access will be via SPARQL]

There is nothing about this documentation that is special to URIs, except that it can be accessed on the Web using the URI. Its role in associating the URI with the class is not affected by the manner in which the documentation is delivered.

[We have used "documentation" to describe what we publish about a name. We might here elaborate on what kinds of things might be included in documentation, such as definitions, examples of usage, references, change policy, etc]

[Two distinct points here: (1) what constitutes good documentation, (2) what aspects of documentation must be stable and which can change over time.]

[ Even when apparently obvious, meaning is difficult to pin down. No amount of documentation can force someone else to use a name only in ways that one considers correct. We have to provide precise and persuasive documentation, and enable automated validation of uses relative to intended meaning (to the extent it is possible to do so).] [AR: while correct usage is hard to enforce, we have some control over whether incorrect usage is detectable, based on how well we write ... definitions. Or identifications, in the case of instances]

There are many factors that contribute to the effectiveness of URIs as names. Both publishers, when designing their documentation policy, and clients, when evaluating whether to use URIs, should consider these questions:

  1. Is there available documentation about the use of the URI (its introduction and subsequently published information)? Is the documentation free of ambiguity that might lead to inconsistent use?
  2. Is available documentation about the use of the URI sufficiently clear and unambiguous to guide effective use?
  3. Is it likely that documentation will remain faithful to the meaning of the URI?
  4. Is it likely that the documentation will be available when needed?
  5. Is documentation available to computational agents via a well-known protocol and in a form that is useful to them?

There is no way to guarantee that all URIs will be well specified, or that protocol mediated documentation will be well supported. Therefore, before relying on a name or naming system for communication, it is wise to review the name and/or naming system for the clarity of available documentation and the long-term prospects for having documentation be available over time. Again, this is true of any kind of name, not just URIs. While the authors do not propose that the specific documentation and manner of publishing it described in this section is necessarily appropriate for all projects, we feel confident that if you emulate this example you will avoid many of the pitfalls into which we have seen people fall.

[definitions, examples of usage, relationships, references]

5. URI-based naming

Internet protocols and web standards use URIs as names. The use of URIs in scientific nomenclature is attractive because it helps to bring scientific information to the Web, connects disciplines by providing a universal common namespace, and enables computational agents to use any of the growing number of tools that use URIs as names, including processors for the RDF [Beckett04], OWL [Dean04], and SPARQL [Prud08] languages.

Syntactically, a URI consists of a URI scheme name (http:, news:, and so on) followed by other information that is interpreted relative to the URI scheme. This is a particular application of the strategy of two-part global names described above. The method for assigning meanings to URIs varies from one URI scheme to the next, and within each scheme for different sets of URIs. Each scheme's specification (see [Schemes]) describes how its URIs are intended to be used in certain contexts. For example, the tel: URI scheme specifies that tel: URIs are intended to be used to identify telephone communication endpoints.
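As an illustration of this two-part structure, the following minimal Python sketch splits a URI into its scheme name and the remainder that is interpreted relative to the scheme. It relies only on urlsplit from the standard library; the helper name split_scheme and the example URIs in the usage note are assumptions made for illustration, not part of any specification.

```python
from urllib.parse import urlsplit

def split_scheme(uri):
    """Return (scheme, rest): the scheme name and the scheme-specific remainder.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    if not parts.scheme:
        raise ValueError("not an absolute URI: %r" % uri)
    # Everything after "scheme:" is interpreted by the rules of that scheme.
    rest = uri[len(parts.scheme) + 1:]
    return parts.scheme, rest
```

For instance, split_scheme("tel:+1-201-555-0123") separates the scheme name tel from the telephone number, while an http: URI yields the scheme name http and a remainder that begins with the DNS domain of the party able to publish documentation at that URI.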

In order that names be comprehensible to the largest possible audience, a widely deployed method for locating documentation is essential. Some URI schemes have associated network protocols that can provide information via common Web client software using the URI as an "identifier" for accessing the information. URI-based naming exploits this by using such protocols to transmit human- and machine-readable documentation that says what the URI is supposed to name. Some URI schemes lack any protocol association, while others that have a protocol association either limit the scope of what can be named or are not widely deployed. While these URI schemes may be suitable for use in naming, we will not further consider them here. It happens that the best-supported URI scheme that combines naming of arbitrary entities with wide deployment is http:, which is specified in [Fielding99]. [and further interpreted by [Jacobs04] and [Fielding05]]

Any naming framework must provide mechanisms to enable the creation of new names while avoiding conflicts with existing ones. Under URI-based naming, new URIs are introduced by making documentation available via the associated Internet protocol applied to the URI. Collisions are avoided because at most one party, either the owner of the DNS domain named in the URI or someone empowered by them, has the ability to make documentation available in this way (i.e. to "publish" documentation on the Web at the new URI).

5.1. The myth of location independence

URI schemes that lack protocol association, or that explicitly make protocol association advisory instead of central, might be seen as preferable for the purpose of naming to those that have it. There are two reasons that one might want to avoid protocol association: (1) protocol association presents the risk that the network infrastructure may fail, at some time, to support the expected naming system, and therefore avoiding protocol association relieves the community of the burdens of assessing and taking on this risk; and (2) eliminating the privilege of some particular documentation set (that accessed via standard protocols) encourages the creation of multiple documentation sets, or of multiple copies of documentation sets. On this view, removing the domain name and server as points of failure, along with the appearance that the validity of the naming system relies on infrastructure, makes the overall scientific infrastructure more robust.

These benefits are largely illusory. First, any naming system will require review for quality in any case; adding review for infrastructure reliability merely raises the bar by a modest increment. [can be stronger here: infrastructure is necessary these days, so will always be part of any review or risk profile.] Second, any naming system that is to be useful to computational agents (and to web-dependent humans) will require online documentation, and agents will require knowledge of how to obtain it. It is no more difficult to support alternative and replicated documentation when there is an associated protocol than when there is not. The only difficulty is convincing application developers that users need names to be understood according to what they should mean, not processed slavishly by observing any particular apparatus such as that determined by a network protocol. But again, this difficulty exists independently of whether there is protocol association.

The greatest cost of protocol association is overcoming the frequent belief, at odds with the sometimes subtle technical specifications, that a protocol-associated URI can only be understood by using that protocol - that is, that the protocol association is so strong as to preclude the use of associated URIs as names. This is reinforced by the observation that for many URIs, different responses can be obtained from the Web (or other protocol) at different times. The conclusion drawn is that the meaning of the URI has changed, making the correct interpretation of prior uses dependent on which meaning is intended. It has been argued that because the HTTP protocol permits this kind of instability, all http: URIs carry an unacceptable risk of being unreliable.

But there are explanations for changing responses other than deficits in HTTP: [talking both 303 and 200 now]

  1. Maybe the URI is not intended for use in URI-based naming at all, and one should substitute a URI that is designed to mean something.
  2. Maybe there is some kind of mistake or technical malfunction, as discussed previously.
  3. Maybe responses all document the same meaning for the URI, but later responses are enriched in some way, for example by the addition of examples or bibliographic citations.
  4. Maybe the URI denotes a changing document-like thing (draft series, blog) and has its meaning documented in some other way.
  5. Perhaps the name is new and the documentation is still under development. [change policy?]

In any case, just because some URIs aren't well documented and supported doesn't mean that none of them will serve well as a name.

Building on existing widely deployed protocols benefits science, because by doing so all scientific disciplines and segments of society will gain access to scientific knowledge on the web without any change in their technology base. Because the protocol associated with http: URIs is so widely available, and because of the lack of any clear advantage of another kind of URI, we recommend the use of well-documented and well-supported http: URIs as names. As our example in section 4 shows, with deliberate design an http-based scheme can both address needs that are sometimes claimed to be addressable only with new schemes and integrate well with existing practice on the Web.

6. Naming versus a dynamic Web

Things change, and publishing on the Web is so easy that change is quickly visible. Managing change in life, in organizations, and on the Web will never be easy, and this note does not offer a solution to such problems. Rather, in this section we discuss several issues of change from the perspective of how careful naming can reduce unnecessary change, and how to keep real change from harming the utility of names.

6.1. Documentation and change

It may be safe to revise documentation to correct mistakes and remove unforeseen ambiguities as long as it preserves the intended meaning of the term. One way to characterize allowed changes is to attempt to distinguish those parts of the documentation that are intended to define or specify the term from the rest of it, which may be based only on current understanding of the science and are subject to revision. Documentation may also include examples, cross-references, and bibliographic information, which are all meant to be helpful, not "normative". [More discussion of what sorts of documentation changes make sense, with mention of change policy, and perhaps insight on what is definitional versus informative] [ (e.g. correction, clarification, addition of references and other links) Documenting a name can be difficult and mistakes may occur.]

6.2. Polysemy in http

A similar confusion, of thing with description of thing, arises from the dual use of the HTTP protocol, sometimes for accessing documentation for the URI (via 303) and sometimes for accessing information denoted by or associated with the URI (via 200). A common mistake would be to see some documentation and take the URI to name the documentation itself instead of the URI's referent. Another would be on the server side, with the use of a 200 response where a 303 was needed, which would suggest the same thing. The mechanism of documentation URIs (303 or fragment truncation) automatically provides separate names for the two related entities, so a careful choice of one or the other URI and knowledge of the protocol is all that's needed to avoid this problem.
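The two mechanisms for documentation URIs mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete account of HTTP behavior: the function name documentation_uri is hypothetical, the status code and Location header are passed in as plain arguments instead of being fetched from the network, and only the 303 and fragment cases are handled.

```python
from urllib.parse import urldefrag, urljoin

def documentation_uri(uri, status=None, location=None):
    """Return the URI of the documentation for `uri`, or None if no separate one is implied.

    status and location are the HTTP status code and Location header that
    were observed when `uri` (minus any fragment) was requested.
    Hypothetical helper for illustration only.
    """
    base, fragment = urldefrag(uri)
    if fragment:
        # Fragment truncation: the document at the truncated URI documents the name.
        return base
    if status == 303 and location:
        # 303 redirect: the target is documentation, distinct from what `uri` names.
        return urljoin(uri, location)
    # A 200 response suggests that `uri` denotes the retrieved document itself.
    return None
```

Under these assumptions a client always has two distinct names in hand, one for the referent and one for its documentation, and can choose deliberately which of the two to use in a statement.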

The use of URIs to name document-like things - loosely speaking, web pages - can lead to polysemy. Servers, and the web architecture that constrains their behavior, generally give little guidance about the meaning of URIs for such entities. It is tempting to assume that a successful retrieval of a document via an HTTP "200 OK" provides sufficient documentation for the URI. In good circumstances this is true, because the document contains information that helps us to understand what the URI means; at the least, the response might tell us that the URI denotes what was received.

[Refer back to discussion of records earlier. Discuss how some uses of CN to serve record variants give rise to polysemy]

However, making a similar assumption about the URI following a change to the document on the server [tbd: allude to evils of CN] would result in a polysemy because then the URI would seem to denote two different things. [AR wants explanation to say what the two different things are.] [jar check] If the HTTP responses vary over time, the URI, assuming no server error, denotes not a single unchanging document, but rather a draft series, the changing output of an instrument, a blog, the changing bylaws of an organization, or an otherwise evolving entity. [thanks dbooth] To avoid misunderstandings, a publisher should clearly state the change policy that applies to its documents. [FOAF, OBO]

[ (formerly h2) Section: http: is not a single naming system]

[AR: Could drop this next section, or substantially shorten to quickly review utility of http. / JR: shortening rewrite follows]

[This relates to URNsAndRegistries. Review this section to ensure that desired message is clear]

[

One way to ensure quality of a naming system is to put it under the protection of a well established organization. The organization can ensure that names [flushed: are catalogued and] have associated documentation that is well formed, well supported, and sensible. In the case of http: URIs, the organization can also make sure that protocol responses help readers use the name in accordance with the meaning the community has come to give it.

The http: URI scheme as a whole is not subject to quality assurance and many of its URIs do not act like names in the sense discussed above. [JR: 581 new sentence] However, some useful naming systems with quality control are to be found embedded in the http: space. HTTP has the property that anyone may acquire a domain name and set up a new naming system; whether the community can use the naming system without becoming confused must be assessed by the community.

To suppose that uncertainty over quality is a defect of the http: URI scheme that can be fixed by switching to a different scheme fails to recognize that use of a name coming from any new naming system will [JR: changed: require review of that naming system] pose infrastructure risks and require review. http: is not a single naming system but a large number of independent systems, each of which has to be evaluated for the quality and sustainability of its names.

]

The best way to reinforce practices of clear naming is to provide checkable logical relationships, such as domain and range constraints, that software systems can use to alert authors to inconsistencies. [give an example?]
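As one possible illustration of such a check, the following Python sketch flags statements whose subject or object does not belong to the class declared for the property. Everything named here is hypothetical: the ex: vocabulary, the CONSTRAINTS table, and the function check_triples are assumptions made for the example. The sketch also simplifies by treating domain and range as closed-world checks on asserted types, whereas RDF Schema treats them as rules for inferring types.

```python
# Hypothetical vocabulary: each property maps to (domain class, range class).
CONSTRAINTS = {
    "ex:encodes": ("ex:Gene", "ex:Protein"),
}

def check_triples(triples, types):
    """Return warnings for triples that violate a declared domain or range.

    triples: iterable of (subject, property, object) names
    types:   dict mapping each name to its asserted class
    """
    warnings = []
    for s, p, o in triples:
        if p not in CONSTRAINTS:
            continue  # no constraint declared for this property
        domain, range_ = CONSTRAINTS[p]
        if types.get(s) != domain:
            warnings.append("%s: subject %s is not a %s" % (p, s, domain))
        if types.get(o) != range_:
            warnings.append("%s: object %s is not a %s" % (p, o, range_))
    return warnings
```

A check of this kind would, for example, catch an author who used the URI of a gene's documentation as the subject of ex:encodes where the URI of the gene itself was intended, which is the confusion of thing with description of thing discussed in section 6.2.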

6.3. How is documentation delivery kept consistent with meaning?

[HOW IS CONSISTENCY MAINTAINED?] [Previous: The principle that URI namespace ownership gives the right to introduce a URI ends once the community takes up the URI with the meaning intended by the owner. If the URI owner can change his/her mind about the meaning of the URI then any statements they make that muddle an established meaning have the potential to be highly disruptive.] [move] [fallacy of authority] Over time web site administrators change their minds about what information should be served at any particular URI, and DNS domain ownership changes, leading to more radical changes. The ability to publish documentation to be accessed over the network via a URI does not mean that one may publish whatever one likes without consequence. Any statement published via the network that muddles an established meaning has the potential to be highly disruptive to the community. [added in response to dbooth; AR dislikes: The binomial species naming system has the published record to refer to in resolving any uncertainties that may arise, but this is not generally available with URI-based naming. ] The community relies on domain owners and site administrators to make documentation available, and because of this it should make sure its trust is well placed.

On the other hand, no system is completely reliable. One should be aware of the possibility that documentation may become unavailable, or that it may change disruptively, perhaps with the best of intentions. Server-provided information should be treated with skepticism and independent sources of documentation for URIs should be sought when necessary in order to understand a use of a URI. [this repeats the eggs-and-baskets passage]

[flush: Even if the URI owner has not made any clear statement about the URI's meaning, a community may still establish a meaning for a URI through use. As a participant in such communities, it would be wise for a URI owner to respect that meaning, as contradictory statements would probably be ignored.] [The more general point is: Disruption may come from anyone - the original author, web site owner, or the community (e.g. an encyclopedia). No one has any special authority.]

[flushed because JAR felt it was either covered elsewhere or not important. AR will review for salvage.

would define a universal virtual catalog for the entire global namespace. Assuming such a method leads to correct answers, it ought to be preferred to ad hoc maintenance of a variety of specialized methods [flushed: appropriate] for different regions of the global namespace.

In the context of the World Wide Web, where control is distributed and publishers are fallible, there is no way to guarantee that all names will be well defined, or that protocol-mediated information about the names will be well supported. [redundant] Given an arbitrary URI, the best we can do is to review it. (... AR text follows: doesn't make sense to JAR) For more predictability the scientific community will establish sets of URIs and standards for how to find out what these URIs name, and for what behaviour under protocol should be. Such behaviour need not conflict with or redefine existing protocol behaviour - but may refine it. Indeed building upon existing behaviour will benefit science as different segments of society using the protocol will gain access to scientific knowledge on the web without necessarily changing the technology they use.

]
[jar writes, but AR thinks not ready for prime time

Some URI schemes, such as http:, have associated network protocols, and it is necessary that the behavior observed when using their associated protocol be somehow consistent with the meaning of the URI.

[new] For example, the http: URI scheme is associated with the HTTP protocol. Correct behavior under protocol follows a rather complex formula [Fielding99][Fielding05], but two important cases are 200 responses, used only for URIs that denote document-like things (the "information resources" of [cite Jacobs04]), and 303 redirects, which are correct for any kind of thing. The document retrieved in either case should at the very least not say something inconsistent with the URI denoting what it does; ideally it is documentation for the URI that tells you what it should denote.

]

7. Technical failure and how to moderate it

7.1. Meaning is not hostage to the network

A naming system that has an associated protocol relates to the protocol only in that the protocol provides what can be construed as a standard catalog or dictionary that aids in the understanding of the names. Regardless of whether or how the naming system exploits a technical apparatus such as the Web, meanings of names are not hostage to mistakes or technical or administrative failures, because the meaning of a name is infused in all communication that uses the name, and a retrieval of the name's documentation is only one such communication. This is easy to see in the case of binomial species names, which are universally understood to be based on primary literature, not catalogs. [alan wants to dissent that Linnaeus's system started with a catalog.] Only recently has the system had comprehensive catalogs at all, and these are considered secondary sources subject to verification. However, even a naming system such as GenBank [citation] that is very closely associated with a web-accessible source of primary documentation is ultimately based on what its names (accession numbers) are believed to mean, not on what the web responds. If GenBank were to become corrupt or drop off the face of the earth, the community would scramble to create an alternative source for the retrieval of sequence information associated with the accession numbers, because so many scientific communications depend on the accession numbers to name the information that the records carry. As with any naming system, GenBank's [5/5 alan wants to emphasize the following] technical infrastructure is a community trust, not a final authority.

7.2. Insurance against technical failure

[582 JR: changed 'protocol' to 'technical'] [AR: Keep this section and give some more practical guidance] [Expect some expansion and further discussion here]

How can the community protect itself against technical and administrative failures, such as a server outage, a bug in server software, or the abandonment of a domain name, that make documentation unavailable or otherwise make it difficult to understand a name? [was: How can the community protect itself against the divergence of behavior under protocol from the way a URI is used in scientific communication and elsewhere?] There are two ways to do this (with apologies to Henry Thompson [Thompson08]):

  1. Put your eggs in one basket, and watch the basket: Be cautious in one's choice of URIs for use as names, admitting only URIs controlled by organizations having the highest standards and technical competence - those that appear able to provide, in perpetuity, service that supports the meaning of the name. [we need more than behavior that is consistent with x, we need consistent behavior / changed "is consistent" with to "supports" / 582 also changed "behavior" to "service"]
  2. Put your eggs in multiple baskets: Be prepared, if necessary, to bypass direct use of the network protocol, reserving the right to use mappings to different URIs and/or alternative protocols (such as SPARQL) [JAR added the "such as"] to interact with alternate documentation for [was: an ... catalog that knows about] the name.

Either of these would work on its own. Method 1 places a burden on namespace administrators, while method 2 moves the burden to individual users. Users have little ability to watch the basket and do not control how the network is going to respond, while altering software to route requests to alternative documentation sets [was: catalogs] is in principle within everyone's reach. Over time one might expect a certain amount of "documentation decay" even in the best of circumstances, in which case there would have to be a shift from reliance on method 1 to reliance on method 2.
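The multiple-baskets approach (method 2) can be sketched as a client that consults several documentation sources in order and falls back when one fails. This is a minimal sketch under assumptions: the function find_documentation and the idea of passing in lookup callables are ours, and each lookup stands for whatever mechanism a community chooses (the protocol associated with the URI, a mapping to a different URI, a local mirror, a SPARQL endpoint).

```python
def find_documentation(uri, sources):
    """Try each documentation source in order; return (source_name, document).

    sources: list of (name, lookup) pairs, where lookup(uri) returns the
    documentation text, or raises LookupError or OSError on failure.
    Hypothetical helper for illustration only.
    """
    failures = []
    for name, lookup in sources:
        try:
            return name, lookup(uri)
        except (LookupError, OSError) as err:
            # OSError covers network failures such as urllib.error.URLError.
            failures.append("%s: %s" % (name, err))
    raise LookupError("no documentation for %s (%s)" % (uri, "; ".join(failures)))
```

Placing the protocol-associated source first in the list and independent copies after it gives a rough picture of relying on method 1 while it works, with method 2 already in place.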

[ (An example of a naming system that explicitly encourages the multiple-baskets approach is that of Archival Resource Keys. Cite ARK article.) ]

The most robust approach would be to combine both tactics: Evaluate protocol infrastructure carefully and be prepared to use other documentation in the unlikely event it becomes necessary to do so. [581 JR: the following is not necessary given other new text above: Take the time during which method 1 works to put method 2 into place.]

[563 Para moved here from appendix, first sentence new, some tweaks] Of interest here is a service called purl.org, which solves the problem of long-term domain name ownership and provides for collision avoidance. purl.org gives out limited control of URIs in its namespace on demand. The individual (the "owner") to whom the URI is assigned gets to determine HTTP forwarding for the URI. This is no guarantee of accessibility of the redirection target, but the guarantee that the URI will never be recycled for another purpose is valuable (it prevents polysemy), and the forwarding service allows any document connected to the URI to be moved easily by changing the forwarding rule. Another way to achieve these ends would be to register a new DNS domain on which to base the new names, but purl.org relieves the owner of the responsibility of having to keep a domain registration up to date forever. It is likely that OCLC, a nonprofit acting for the benefit of the library community, will maintain purl.org's registration and infrastructure for as long as it is needed, saving us all the bother.

What happens if the owner of a purl.org URI (or any other URI for that matter) becomes incompetent or unresponsive, and the URI fails to forward to a useful document? This is a betrayal of the community, which might depend on access to that document. To protect against this we recommend that an assigned purl.org URI (or purl.org "domain") have multiple administrators, each of whom has demonstrated the ability and desire to act on the community's behalf should something go wrong. Should the URI break (404, etc.) any administrator can change protocol-associated behavior to lead to a copy of the correct document.
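The recommendation of multiple administrators can be pictured with a toy model of a single forwarded URI. The class below is purely illustrative and does not describe purl.org's actual interface or data model; the class name, method name, and example URIs in the usage note are all assumptions.

```python
class ForwardingEntry:
    """Toy model of one forwarded URI: a target plus the people allowed to change it."""

    def __init__(self, uri, target, administrators):
        self.uri = uri
        self.target = target
        self.administrators = set(administrators)

    def retarget(self, who, new_target):
        """Point the URI at a new copy of the document; only administrators may do so."""
        if who not in self.administrators:
            raise PermissionError("%s may not administer %s" % (who, self.uri))
        self.target = new_target
```

With two administrators registered, either one can call retarget to restore access to a copy of the correct document if the other becomes unresponsive, while the URI used as a name stays the same.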

8. Conclusion

[Conclusion is out of sync with many changes since it was first written. General conclusion re: http will stay but other conclusions will be added]

The authors feel that the prospect of accelerated communication enabled by the widely deployed HTTP protocol outweighs the risk that readers will be misled by information transmitted by the protocol as a result of a missing, inconsistent, or incorrect [563: changed catalog to documentation base] documentation base. On balance we advise that some http: URIs should be enabled and promoted for use as names in online science - with caution.

http: must be used with the same care one would exercise with any naming scheme:

[?is this ok to say / worth saying?] Although the use of URIs on the web and the semantic web has been the subject of much other discussion, we feel that the above exposition is a contribution to the discussion because it approaches the issue of URIs as names from the perspective of scientific nomenclature, with the needs of scientists and science in mind.

Acknowledgments

Thanks to the following people, who commented on drafts: Alan Bawden, Jake Beal, Dan Corwin, Michel Dumontier, Chimezie Ogbuji, Bijan Parsia, Eric Prud'hommeaux, Matthias Samwald, Gerald Jay Sussman, Kaitlin Thaney, Mark Tobenken, Sankar Virdhagriswaran, John Wilbanks, and Stuart Williams. Thanks as well to Peter Ansell, Dan Connolly, Eric Jain, Marc-Alexandre Nolin, Henry Thompson, and Mark Wilkinson, all of whom provided valuable insights.

Special thanks to David Booth for his repeated thoughtful and meticulous readings.

References

[AR: Fewer web references, and more science references / JR: We can only reference what we cite in the text.]
[Beckett04]
Dave Beckett.
RDF/XML syntax specification (revised).
W3C recommendation, 2004.
http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[Brown62]
William L. Brown, Jr.
A new ant of the genus Epitritus from south of the Sahara.
Psyche 69:77-80, 1962.
http://hdl.handle.net/10.1155/1962/16824
[Dean04]
Mike Dean and Guus Schreiber (editors).
OWL Web ontology language reference.
W3C recommendation, 2004.
http://www.w3.org/TR/2004/REC-owl-ref-20040210/
[Fielding99]
RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1.
IETF, 1999.
http://www.ietf.org/rfc/rfc2616.txt
[Fielding05]
Roy Fielding.
[httpRange-14] Resolved.
Email to www-tag@w3.org list, 2005.
http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039
[IAU08]
IAU Minor Planet Center.
How are minor planets named?
http://cfa-www.harvard.edu/iau/info/HowNamed.html as downloaded on 1 April 2008.
[Prud08]
Eric Prud'hommeaux and Andy Seaborne.
SPARQL query language for RDF.
W3C recommendation, 2008.
http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
[Schemes]
IANA Uniform Resource Identifer (URI) Schemes.
Undated registry; updated as new schemes are added.
http://www.iana.org/assignments/uri-schemes.html
[Smith07]
Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, The OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H Scheuermann, Nigam Shah, Patricia L Whetzel & Suzanna Lewis.
The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration.
Nature Biotechnology 25:1251-1255 (2007).
http://www.nature.com/nbt/journal/v25/n11/abs/nbt1346.html
[Thompson08]
Henry Thompson.
Persistence, Delegation and URIs.
Unpublished memo, 2008. [
http://lists.w3.org/Archives/Member/tag/2008Feb/att-0101/persistence.html - unfortunately this is W3C members only. Have asked the author to make it public.]
[Zucker07]
Jeremy Zucker and Alan Ruttenberg.
Debugging the bug.
http://bio.freelogy.org/wiki/Debugging_the_bug as downloaded on 16 May 2008.