Notes toward HCLS Recommendations for Choosing and Using URIs

To feed into the recommendations draft as it is being written.

This text is intended to be provocative. It reflects the editor's opinions and observations, not those of HCLS or BioRDF.

Sorry if the prose is sketchy or hard to follow; I am focusing on coverage here and still working out ideas. Crafting a good document will come later.

Goals

Not an exhaustive or particularly clear list:

Focus exclusively on URIs that occur in RDF triples
Nonrepurposing is most important (do no harm)
Facilitation of joins (i.e. single name for each resource) is almost as important
Access to descriptions and access to data are both important
Community inclusiveness and technology compatibility are important

(Eric: maybe "persistence of meaning" instead of "nonrepurposing")

(Bill B: I agree - persistence of meaning is more to the point in the context of URIs used in RDF)

Use cases

There ought to be some here, but I don't feel like working on this yet.

The main case I have in mind right now is the investigator who attaches supplementary materials to a published paper. The materials consist of information that is expressed in RDF. Ten years later, a reader of the paper wants to understand that RDF, perhaps even to reproduce results. Each URI in the materials needs to be understood by the reader. Which URIs does the original investigator choose, and how is the reader supposed to understand what was meant by them?

(Bill B: This particular topic was taken up at a recent meeting between publishers, funders, researchers, library scientists, and SemWeb experts - the PubMed Plus meeting held in mid-June and sponsored by the ScienceCommons, NIH, and Society for Neuroscience)

Definitions

Resource: something that can be identified.

URI: an identifier, as defined in RFC 2396. (Eric: probably should switch to IRIs here.)

OK URI: a URI that meets basic standards of quality as specified below.

Piece of data: a sequence of bits; something that might be the payload of an HTTP response. ("data" is an awful word for this, but it has some currency and I can't think of a better term.) N.b. pieces of data don't change. (Eric suggests that "representation" might be better; a piece of data might play the role of a representation of something, but a priori is not a representation of anything.)

Document: One or more pieces of data that are considered to be the document's "versions". (The intent is that the document forms a coherent concept, so that the versions are somehow related to one another logically or historically; but this cannot be formalized. I also want something that's compatible with foaf:Document, and also that captures current practice on the web.)

Version (of a document): A piece of data that is related to (belongs to) a document. (Compare AWWW/REST "representation".)

Variant (of a document): A subset of a document's versions with a fixed language and format. The versions of a variant form a single temporal sequence.

Access (a document): retrieve a version of the document.

Dereference (a URI): retrieve a version of the document identified by the URI. (Sometimes this is called "resolution.")

Statement: an RDF triple that is meant by someone or something.

Description (of a resource): a set of statements that help to define, specify, and/or describe a resource. (If the resource is a document, its descriptions are generally called "metadata".)

Dynamic document: "content may change." (NLM definition)

Stable document: "content is subject only to minor corrections or additions." (NLM)

Unchanging document: "content will not change." (NLM) (Can an unchanging document have variants?)

OK URIs

An OK URI is a globally unique identifier for something in particular.

To be an "OK URI" for present purposes, a URI must satisfy the following criteria:

An OK URI identifies a clearly defined single resource. URIs are cheap, so a horse should get one OK URI, while a document that describes the horse should get a different OK URI (assuming both have URIs).
An OK URI is stable: once its definition is established (through whatever means), the URI is never repurposed. This doesn't mean that the definition can't be clarified, corrected, or otherwise improved, or that the named resource cannot change, but the URI's intended meaning must remain consistent.
The resource identified by an OK URI possesses at least one description, including at least type or class information. (Do we want to specify other properties as well?)
Methods (nonexclusive) for accessing a description are mechanizable and widely known (perhaps via publication in the HCLS URI recommendations document).

An OK URI is meant to be similar to a bibliographic reference. It should clearly identify what is denoted in a way that will avoid confusion for the indefinite future. Perhaps its form will give hints as to how to track it down, but institutions and addresses come and go, and it may be necessary to consult multiple sources (e.g. libraries) in order to find someone who has cared enough about it over time to remember something about it.

As soon as a URI becomes circulated beyond its publisher's walls, it becomes community property. Thus, prevention of repurposing is in the community's interest, not just the publisher's.

Just because an OK URI has the potential to have durably accessible descriptions doesn't mean that it will. You can be sure an OK URI won't ever be repurposed, but no guarantee is made about description availability. (By comparison, DOIs have a stronger metadata accessibility contract.) Similarly, if the OK URI identifies a document, there is no guarantee that any version of the document will be available in the long run. These guarantees may be made (in appropriate RDF), but they are not part of the definition of "OK URI".

(My aim in introducing this concept is to lay out requirements that force us to a particular solution. As laid out above, DOIs and LSIDs fit the description, and we could certainly arrange for HTTP URIs to do so as well. Whatever choice we eventually make should be forced by the addition of further requirements.)

(Question: do we need a convention whereby we can determine, either by syntactic form or by making some kind of query, whether a URI is an OK URI? E.g. info:doi/nnnn URIs are OK URIs, and this is obvious just by looking at the URI and knowing how the info: scheme works.)

Minting URIs

When minting a URI, make it an OK URI.

Preventing repurposing and maintaining accessibility (of descriptions and data) are both challenging. One answer: use OCLC's purl.org server - then at least no one else will repurpose it. Another answer: coordinate with a long-lived institution such as a library or university.

When minting a URI that happens to be an HTTP URI, try to make sure that meaningful server responses are provided, in perpetuity if possible. See below under "publisher recommendations".

(Wilkinson claims that PURLs are "centralized" and "unsustainable." Alan R proposes the establishment of a backup plan so that we know ahead of time what to do in case purl.org has trouble - how to establish a backup server if needed. By the time purl.org goes away, I hope we have a much better way to do all of this!)

(Do we have anything to say about the rest of the URI?)

How to access descriptions and data

Very important to distinguish two problems:

How to get descriptions of a resource
For a resource that is a document, how to get a version of it

Nonspeculation principle: machines don't like to use heuristics, which are inefficient and error prone. Provide or obtain sufficient information from the outset to allow the machine to find descriptions or versions directly and not have to guess or explore.

You're going to want to get the information you need through whatever means are available to you. Not all URIs are OK URIs, so you need to be prepared to dereference defensively.

Lacking independent knowledge of the resource's availability, the form of the URI may give hints as to where to find the resource's descriptions or versions (e.g. it may name a publisher's web site). However, it does not dictate any particular method for doing so. If the publisher goes out of business, or demands too high a price, the resource may still be available from another provider.

(This statement is asserted to be true of HTTP URIs, in spite of their dual use as locators. It is interesting that some people assert the opposite, that HTTP URIs can only designate access to a resource via a particular server. This seems to ignore the reality of browser caches, caching proxy servers, the wayback machine, Akamai, Google's web cache, etc., as well as Semantic Web experience to date with the non-document resources defined in ontologies such as FOAF and DC, and experience in XML and other domains with use of HTTP URIs as identifiers.)

Descriptions can be retrieved using a variety of retrieval methods. Here are some methods that have been proposed:

From a SPARQL endpoint
If the resource is a self-describing document: From the resource itself
If the resource is not a document and the URI is an http URI: By dereferencing the URI, and treating a 303 redirect target as a description carrier (this then reduces to data access, see below)
By transforming the original URI into another URI (e.g. fragment removal, or referral to a document-based description repository: cf. Alan's wiki idea) which is then taken to identify a description carrier
From a web services based resolver

Data (technically speaking, versions of documents) can be retrieved using a variety of retrieval methods as well:

By consulting a mirror or cache, local or otherwise (e.g. SQUID, Google cache, wayback machine)
By performing ordinary web access (e.g. DNS + HTTP) treating the URI as a URL
By transforming the original URI into another URI
From a web services based resolver

I believe it would be a very good idea to codify recommended methods for doing these retrievals for OK URIs, restricting the definition of OK URI if necessary to match or simplify the codifications.

(Eric is beginning to pick nits)

Example of use of a rewrite rule for accessing data: a URI with the prefix info:pmid/ can be rewritten to a CGI referring to NLM by replacing this prefix with another. Rewrite rules can also be used to set up a local mirror or adjust for publishers' changes in their URLs (e.g. PubMed Central moved within the NIH URL hierarchy a year or two ago, and this broke internal URLs).
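The prefix-rewriting idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the replacement URLs (resolver.example.org, mirror.example.org) are placeholders, since these notes do not name the actual NLM CGI address, and the function name rewrite is made up for the example.

```python
# Ordered prefix rewrite rules: (URI prefix, replacement prefix).
# Both replacement URLs are hypothetical placeholders, not real services.
REWRITE_RULES = [
    ("info:pmid/", "http://resolver.example.org/pubmed?id="),
    ("http://old.example.org/articles/", "http://mirror.example.org/articles/"),
]


def rewrite(uri, rules=REWRITE_RULES):
    """Return the rewritten URI for the first matching prefix, or None."""
    for prefix, replacement in rules:
        if uri.startswith(prefix):
            return replacement + uri[len(prefix):]
    return None
```

The second rule shows the same mechanism used to point at a local mirror or to follow a publisher's change of URL layout. Returning None for an unmatched URI lets the caller fall back to some other retrieval method rather than guess.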

The best way to access descriptions and data is to use knowledge of retrieval methods specific to the resource or to its URI. One way to represent such knowledge is using RDF statements. For example, a very naive approach is to have two string-valued properties, one giving a URL for a document that carries a description of the resource, and the other giving a URL for access to the resource's data (if the resource is a document). Dan Brickley suggested something like this once. A comprehensive solution would include access to descriptions and to versions using CGI, POST, SPARQL, etc.
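A rough sketch of the naive two-property approach, with statements held as plain tuples. The two property URIs and all example.org addresses are invented for illustration; no such vocabulary is defined in these notes.

```python
# Hypothetical property names; these notes do not define a vocabulary for them.
DESCRIPTION_URL = "http://example.org/access#descriptionURL"
DATA_URL = "http://example.org/access#dataURL"

# Statements as (subject, predicate, object) tuples; objects are plain strings.
statements = [
    ("http://example.org/paper/42", DESCRIPTION_URL,
     "http://example.org/paper/42/about.rdf"),
    ("http://example.org/paper/42", DATA_URL,
     "http://mirror.example.org/paper/42.pdf"),
]


def lookup(resource, prop, graph=statements):
    """Return every URL recorded for the given resource and property."""
    return [o for s, p, o in graph if s == resource and p == prop]
```

Keeping the two properties distinct means a client never has to guess whether a URL leads to a description of the resource or to a version of it.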

A resource may have multiple sources of descriptions, and it may be desirable to consult more than one of these.

How does one implement and configure a client for the purpose of getting descriptions and data? (Distinct methods/rules will be needed for accessing data and descriptions.)

(Insert something similar to Alan's ontology and reference implementation here. How does it deal with data vs. descriptions? Anticipate objections to using OWL. Note question as to whether rules are associated with resources or with URIs. Resources may have multiple URIs, or none at all, and after all it's the resource we care about, not its URI; on the other hand, applicability of a rewrite rule is based on the URI, not on other aspects of the resource.)

How to publish descriptions and data

When there is a choice, publishers of RDF are encouraged to identify resources using OK URIs in preference to non-OK URIs.

As noted above, a description must exist for every extant resource named by an OK URI.

When multiple variants are available for a document, these should be treated as resources in their own right and each given its own OK URI. A description should be provided that relates the variants to one another (how?).

(From here down, just ramblings about things publishers might need to do in various scenarios)

purl.org URIs obviously require that redirection be configured with the purl.org server.

For HTTP URIs, servers should respond with a 200 for documents, and a 303 redirect for non-documents. (This would be after following 301/302/307 redirects.) The payload of a 200 response should be a version of the document. (As a special case, if the URI identifies a piece of data, that piece of data should be delivered.) A 303 response should redirect to a document that carries a description of the resource. (what formats are acceptable? RDF/XML, RDFa, turtle?)
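From the client side, the 200/303 convention might be interpreted along these lines. This is a sketch, not a definitive implementation: fetch is a caller-supplied function (an assumption of this example) that returns a status code and a headers dict with a "Location" key, and does not follow redirects itself. As the footnote on httpRange-14 cautions, a server that ignores the convention will make the "document" answer unreliable.

```python
from urllib.parse import urljoin

# Redirects that are followed before the response is interpreted.
TRANSPARENT_REDIRECTS = {301, 302, 307}


def classify(uri, fetch, max_hops=5):
    """Follow 301/302/307 redirects, then interpret the final response.

    `fetch` takes a URI and returns (status_code, headers_dict) without
    following redirects. Returns ("document", uri) for a 200,
    ("description", location) for a 303, and ("unknown", uri) otherwise.
    """
    for _ in range(max_hops):
        status, headers = fetch(uri)
        if status in TRANSPARENT_REDIRECTS and "Location" in headers:
            uri = urljoin(uri, headers["Location"])
            continue
        if status == 200:
            return ("document", uri)
        if status == 303 and "Location" in headers:
            return ("description", urljoin(uri, headers["Location"]))
        return ("unknown", uri)
    # Too many redirects: give up rather than guess.
    return ("unknown", uri)
```

Passing fetch in as a parameter keeps the interpretation logic separate from the transport, so the same function could sit in front of a cache or mirror instead of a live server.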

If for any reason it is not possible to perform a 303 redirect (e.g. access to this level of the server configuration is denied the publisher), make the response's relationship to the resource clear in the response, somehow (good luck). Or better, get someone else to provide redirection for you.

(Why are 303's important? If you dereference a URI for a non-document and get a 200, you may be misled into believing that the piece of data you get is a version of the resource, with the implication that the resource is a document. For example, you might save the URI so that you can access the description later, or you might compose RDF that talks about the description carrier but names it using the non-document's URI. This would be very confusing.)

It is not clear how descriptions for documents should be provided. We (HCLS) have to decide. If the document is an HTML file, there is an impendingly standard way to put RDF in it (RDFa), and some other formats, such as JPEG and PDF, have places to keep descriptions (metadata). But this is not true in general. A way to assert that resource2 carries a description for resource1 is desirable; something a bit stronger than rdfs:seeAlso. Such assertions would have to be carried out of band (conveying it in an HTTP header doesn't fly).

If an OK URI identifies a piece of data, its description should entail the statement that the resource has rdf:type piece-of-data. (We will need a URI that identifies the piece-of-data type.)

Alan sez: [How shall we] deal with descriptions added by others, after the fact, a big win according to Mark W. My solution, reserve a piece of /commons, e.g. /commons/about/<url> that has ability to register new providers of information about, e.g. wiki.

What about content negotiation?

HTTP content negotiation is for browsers, not semweb applications. CN complicates life for alternative dereferencing mechanisms. A document that has variants corresponding to different formats or languages should be related to those variants through a description.

Variants per se aren't evil, it's just reliance on CN which is evil. Publishers either shouldn't use CN, or they should combine it with exposure of their variant interrelations as RDF. (See the rec that Alan found recently.)

Which URIs to use for public database records?

When the publisher of the database designates credible OK URIs for its records, use them.

Otherwise, if another credible source provides OK URIs, use them.

Otherwise, ask the HCLS URI Board to mint the OK URIs that you need. (We will need to create such a board! Perhaps it will use something like the http://purl.org/commons/ URIs prototyped for the Banff demo.)

Otherwise, try to mint OK URIs yourself. This is not recommended, since the "OK URI" standard is difficult to meet, but you may be forced to do this for reasons beyond the imagining of the authors of this document. Publicize them as best you can so that other people will be likely to discover them. If you mint PURLs, be sure to give administrative access for your PURL domain to some worthy party (such as the HCLS URI Board) who can maintain the PURL redirects after you go out of business.

Document stability properties

Definitions for the National Library of Medicine's Permanence Levels (http://www.nlm.nih.gov/permlevels.html)

(JAR's note: stability and availability need to be decoupled.) (JAR's note: another nice property we might want to capture is monotonicity.)

Permanence Not Guaranteed: The National Library of Medicine has made no commitment to retain this resource. It could become unavailable at any time. Its identifier could be changed.

Permanent: Dynamic Content: The National Library of Medicine has made a commitment to keep this resource permanently available. Its identifier will always provide access to the resource. Its content could be revised, replaced or recaptured.

Permanent: Stable Content: The National Library of Medicine has made a commitment to keep this resource permanently available. Its identifier will always provide access to the resource. Its content is subject only to minor corrections or additions.

Permanent: Unchanging Content: The National Library of Medicine has made a commitment to keep this resource permanently available. Its identifier will always provide access to the resource. Its content will not change.

Ontology description

 foaf:Document
 PieceOfData

 x carries description of y

 y is a version of x
 y precedes z (as a version of x)
 y is the Dutch variant of x

 rewrite_rule
    regexp   (follows Apache syntax)
    template

 m is_a_description_access_method_for ir
 m is_a_data_access_method_for ir

 AccessMethod
   RewriteAccessMethod
   SPARQLetc.
   LSIDetc.
   etc.
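One way the rewrite_rule and access-method parts of this outline could be modeled is sketched below. Several things here are assumptions for illustration only: the class and function names, the example.org URLs, and the use of Python's re syntax (with \1 backreferences) as a stand-in for the Apache syntax the outline calls for.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteAccessMethod:
    """Maps a URI to a retrieval URL by matching a regexp and filling a template."""
    regexp: str    # pattern matched against the whole URI (Python re syntax)
    template: str  # replacement; \1, \2 ... refer to groups in regexp

    def apply(self, uri):
        match = re.fullmatch(self.regexp, uri)
        return match.expand(self.template) if match else None


# Description access and data access are registered separately.
# All URLs below are hypothetical placeholders.
description_methods = [
    RewriteAccessMethod(r"info:pmid/(\d+)",
                        r"http://descriptions.example.org/pmid/\1.rdf"),
]
data_methods = [
    RewriteAccessMethod(r"info:pmid/(\d+)",
                        r"http://resolver.example.org/pubmed?id=\1"),
]


def candidates(uri, methods):
    """Return the retrieval URLs produced by every method that applies to uri."""
    urls = []
    for method in methods:
        url = method.apply(uri)
        if url is not None:
            urls.append(url)
    return urls
```

In this sketch applicability is decided from the URI alone, which matches the observation earlier in these notes that a rewrite rule is keyed on the URI rather than on other aspects of the resource.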

What is this recommendation's attitude toward LSIDs?

Some useful RDF will contain LSIDs and DOIs. There needs to be a story about how to make use of them. I hope that someone who understands "LSIDs in the wild" will provide a HOWTO that can be cited.

Whether we recommend minting LSIDs (or DOIs, etc.) will mainly depend on whether LSIDs can meet the requirements that we end up posing.

FOOTNOTES

httpRange-14 (200 vs. 303) is not rigorously implemented on the web, and many servers will get semantic web architecture wrong for the foreseeable future. Therefore, be prepared to justify (by appeal to contract and/or published description) any assumptions you make about what a response code or redirect location means.

The phrase "piece of data" is from page 8 of the LSID spec, which also uses "set of bytes" and various other things for the same idea.

This document ought to talk about the concept of "nonsense detection" and prior checking (nonspeculation) of meaningfulness of dereference (i.e. don't even think about talking about a piece-of-data for a non-document).

Differentiating Between Web Resources and What They Describe [ericP]

There are two conventions for labeling non-documents (proteins, people, cities, ...): 303s and fragment identifiers.

303s

With this method, a GET of a non-document's identifier will return a 303 which points to a document describing said non-document. The 303 redirection may do conneg, as may the resolution of the document it points to.

[JAR: As has been pointed out many times, a 303 response doesn't imply that the resource is not a document.]

Fragment Identifiers

These URIs include a fragment identifier whose interpretation is dependent on the mime-type of the returned document. For instance, dereferencing the identifier for rdf:type yields a document with a mime-type of application/rdf+xml:

HTTP/1.1 200 OK
Content-Type: application/rdf+xml

The fragment identifier is grounded by the text <rdf:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">, which does not identify any part of the document, but instead the concept of types in RDF.

Grouping Non-Document Descriptions

Both 303s and fragment identifiers allow groups of non-documents to be described by the same document. For 303s, the grouping is done by having multiple identifiers respond with 303s to the same document. For example, foaf:name and foaf:homepage respond with

HTTP/1.1 303 See Other
Location: http://xmlns.com/foaf/spec/

HTTP fragment URIs have similar grouping, though the grouping is known by the resolver. rdf:type and rdf:label are known by inspection to be fragment identifiers into the same document without using the network.
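The "by inspection" grouping can be sketched with nothing more than fragment removal; no network access is involved. The function name group_by_document is made up for this example.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_document(uris):
    """Group URIs by the document URI that remains after removing '#fragment'."""
    groups = defaultdict(list)
    for uri in uris:
        document, _fragment = urldefrag(uri)
        groups[document].append(uri)
    return dict(groups)
```

For instance, the URIs for rdf:type and rdf:Property both reduce to http://www.w3.org/1999/02/22-rdf-syntax-ns, so they fall into the same group, while a URI without a fragment forms a group of its own.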

It is sensible to group identifiers into reasonably-sized clouds, providing enough context to make use of the identifier. Protein records are numerous, so it is not practical to describe all of them in a single document. Taxonomies are, however, of limited size and atomicity; the consumer benefits from getting all of the terms in a taxonomy at once.

[JAR: I hate to interrupt, Eric, but there's no way to tell ahead of time whether either technique will work, and in any case they don't help at all with providing descriptions for non-self-describing documents.]

Recommendations 2007-09-10

This is a mockup, inspired by http://wiki.tdwg.org/twiki/pub/GUID/LsidApplicabilityStatementRfC2007Sep/TDWG_LSID_Applicability_Statement_2007_08_30.pdf

Every term (URI) should be given a clear definition by its owner.
Definitions should be published through standard protocol-specific methods, such as 303 and #-truncation for http:, as permitted.
You decide whether to use # in your URIs, based on other W3C recommendations.
Definitions should be made available for the lifetime of the term.
Publishers are encouraged to mint URIs using the http: scheme in preference to other schemes (with the possible exception of data:).
Owners should not change definitions inconsistently.
Definitions should specify single and particular usage. For example, a term should be used for a document describing something, or the thing, but never both (unless the document is self-describing).
Definitions must be in natural language using rdfs:comment (or a subproperty), or in rigorous and specific RDF.
Separate definitions (which should be small and extremely stable) from other RDF related to the term (such as statements that describe the denoted resources). [But how, exactly? Consult D Booth's memo.]
Do your best to ensure that your URIs are fresh. Remember that a previous domain name owner or project at your organization may have minted semantic web URIs.
Don't mint a new term when an existing one will do. "Will do" means both consistent in definition with what you want, and of high enough durability. Know your community. Find out what terms are in use.
Definitions should be sought close to the point of use: in the same RDF graph, or in a cited document.
In non-ephemeral applications, endeavor to mint names that can be assigned and served durably.
For archival applications, use only impeccable URIs.
Relate versions of documents to one another using [what relationship? and where to put the assertions?]
It is recommended that the community develop a scheme-independent protocol for semantic web caches that can provide definitions and (where appropriate) 200 responses for arbitrary terms.
It is recommended that the community figure out what terms should be used for public database records (such as those in Entrez Gene), and come up with a versioning story for them. The terms should be impeccably hosted - that is, made available durably and consistently - by an organization that the community can trust.

Can we give any recommendations at all relating to 200-responders?

Link tag inside HTML

Has anyone written anything about using the <link rel="alternate" type="application/rdf+xml" href="blah.html"> tag inside html as a way of finding the RDF description for a page? PA
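For what it is worth, a client-side sketch of such link discovery might look like the following. It assumes the page author used rel="alternate" with type="application/rdf+xml"; whether that convention is documented or widely honored is exactly the open question. The class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RdfLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate" type="application/rdf+xml">."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        is_rdf = attributes.get("type") == "application/rdf+xml"
        if "alternate" in rels and is_rdf and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_rdf_links(html_text):
    """Return the hrefs of all RDF alternate links found in the HTML text."""
    finder = RdfLinkFinder()
    finder.feed(html_text)
    return finder.hrefs
```

The returned hrefs may be relative, so a real client would still need to resolve them against the page's own URI before retrieving anything.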