HCLSIG BioRDF Subgroup/Tasks/URI Best Practices/Recommendations/SputnikDraft

DEPRECATED

Please see URI Note main page for link to current version.


Note on Minting, Defining, and Using URIs (Sputnik draft)

Jonathan Rees, W3C Semantic Web Health Care and Life Sciences Interest Group, 2007-10-07

Abstract

[Abstract, to be written based on what ends up going into this document.]

How to comment on this draft

There is a shorter version: [[/../ShorterSputnikDraft]] - you might want to look at that first.

For now, please put your comments on the /DraftTalk page. I am editing this file off line, and keeping comments on a separate page helps ensure that your comments are heard and tracked. I will attempt to address all concerns and record dissenting views fairly.

There is obviously still a lot of work to be done. Right now I'm most interested in hearing about problems with the organization of the document, sections that should be cut, sections that should be expanded (except for the obvious ones), and, especially, claims I make that you disagree with.

[Brackets] usually indicate work to be done, information to be integrated, and questions to be answered. Please answer questions if you have answers. I will process or, if necessary, remove all remaining bracketed sections before final publication.

Status of this document

This is an editor's draft with no official standing.

I am hoping to ask HCLS at its October 25 teleconference for approval to publish this note (in whatever state it's in at the time) on the W3C web site. The note will still be just an editor's draft at that time.

It is my intent to publish preprints of this note on my employer's web site under a Creative Commons Attribution 3.0 license in advance of publication on W3C's web site. W3C's version will be published under the more restrictive W3C license. The non-W3C version will be clearly marked as being a preprint. I promise to do my utmost not to misrepresent anyone or anything.

Introduction

This note is about problems surrounding the choice and use of URIs, including choice of new URIs ("minting") and publication of definitions - part of the larger problem of how to use RDF well. URI choice is a problem for a variety of reasons, including definition quality, stability of meaning, and accessibility of defining documents.

The problems around URIs have been so vexing, and the argument so heated, that it is worthwhile to step back and consider what we're trying to accomplish. This is what the next section attempts to do. With this background, it becomes possible to formulate strategies rationally, by reference to the goal.

[intro needs to be fleshed out, continuing with summarization of document structure: goals, comment on scope, threats, advice, future work.]

The first section needn't be read if you just want to know what we recommend.

What we're trying to do

Technical disciplines such as life sciences research and health care are awash in information, most of which is locked up in a combination of written reports and formal objects such as tables, databases, and structured files written in a variety of notations. The transition from natural language and diverse formal notations to a uniform declarative language, rendering large amounts of this information in a common formal notation, will enable these multiple sources to be combined and processed together, enabling a broad range of computational uses, including

  • summarization and display - reviewing large amounts of selected information
  • query - obtaining precise answers to questions
  • validation - determining whether a body of information makes sense or is consistent
  • discovery via statistical methods - locating information that is unusual by some measure

To this end, RDF [cite] has been proposed as a common representation language. RDF is a simple formal declarative language intended to augment and substitute, in certain applications, for other information-carrying languages, including natural language and other formal notations. Its key features are its blandness, which helps to ensure its generality and neutrality, and its ability to gracefully combine information coming from multiple sources.

Technical disciplines place particular demands on the use of RDF and related [cite] technologies. The activities of practitioners range from the highly exploratory, which place a premium on spontaneity, flexibility, and inclusiveness, to the highly rigorous, with premiums on chains of inference, repeatability, and durable documentation. Inaccuracies in transcribing information into RDF and mismatches in combined sources may have serious consequences when information is being used -- for example, consider the cases where RDF-encoded information is used as part of a grant application or in deciding on the correct treatment for a medical condition.

RDF-encoded information is organized into graphs. Each graph is a set of statements employing a vocabulary of terms relevant to the graph's subject matter. [Footnote explaining why I say "term" and not "URI": "term" is evocative of requirements; much of what needs to be said sounds ridiculous when you say "URI"; URIs do not "identify" according to Pat Hayes; "resource" is undefined and its ordinary meaning is too restrictive.] Statements are supposed to declare something about the world -- that is, they are supposed to have the capacity to be true or false. [Footnote: definitions of terms are true "by definition" - as in mathematics, a definition can be meaningless or inconsistent, but it cannot be false.] The meaning of a graph -- what it says and what it implies, both logically and socially -- ultimately depends, in large part, on how an agent will understand the terms occurring in the graph's statements. It is therefore important to have, at the very least, an understanding of how terms are coined and used in RDF in our domain.

We have no direct control over how an unknown agent will interpret a graph. W3C recommendations and other guidelines will be used to determine the interpretation of a graph, but other social and political processes will also influence interpretation, just as for natural language. The best a document such as the current one can do is to inspire practice that will increase efficiencies and reduce confusion among those who take its advice.

The most essential characteristic of scientific endeavor is skepticism. This does not mean that we restrict ourselves to uttering only well-established truths. Instead, skepticism is reflected in careful attention to the logical and bibliographic support for assertions. The chain of support helps an agent processing some information to form its own judgments of the validity and usefulness of the information for the application at hand. Inference and citation are therefore at the heart of any use of natural or formal language in science.

Scope of a graph

One axis along which to classify communications is according to scope - basically, the standards and expectations surrounding a message's applicability, and in particular the expected separation in time between the writer and the reader. In RDF this question has bearing on choice of terms and the manner in which definitions and other graphs are published.

For example, the following situations imply different scope:

  • Message passing - assertions valid only in the context of a conversation - writing and reading separated by minutes
  • Time-sensitive communication - assertions that might not be true tomorrow - separation by hours to months
  • Knowledge curation - true to the best of one's knowledge, today - separation by months to years
  • Archiving - presented with the hope that it still will make sense a long time from now - separation by years to decades

Scope determines, in part, how much effort one needs to invest in careful preparation and use of RDF graphs. A context requiring only short-lived RDF graphs may be more forgiving than an archival context. For example, a short-lived graph may refer unambiguously to states of affairs that change infrequently ("the president"), while a long-lived RDF graph needs to avoid such context-sensitive reference.

(more of my blabbing on this subject [1])

Threats to the successful use of RDF

[or of any kind of scientific information, really]

Given the goals of efficient communication and integration using a common language as stated above, we can set out potential problems that need to be addressed. The advice given here is aimed at mitigating these particular threats.

Threats to common interpretation

[This is just a summary of the advice that follows. I think this should be condensed or flushed in favor of better organization of the advice section.]

The intended use of a term can be established in several ways:

  • by explicit published definition ([cite Booth on declarations])
    • written in natural language
    • written in RDF
  • implicitly according to statements made in what is supposed to be an authoritative document
  • according to how it is used (reverse engineering or inference required)

For brevity I'll say "defining document" to mean any of the above. (Not quite the same as a Boothian "declaration", but close.)

When a common understanding of a term fails to be established, the reasons for failure include

  • No defining document
  • Defining document difficult to locate
  • Poor quality definition - vague, ambiguous, or unclear
  • Definition/use inconsistency - the term is used differently from how it's defined
  • There are multiple credible defining documents
    • change in concept over time (example from natural language: 'transient ischemic attack')
    • accidental collisions
    • disagreement over formulation of definition

Threats to integration (graph combination)

  • Two terms for same thing (missed opportunities for unification)
  • Failure to recognize or state relationships (e.g. missing subclass/subproperty assertion)
  • Incompatible logical systems (inconsistent entailments). This is an important subject but not in the scope of this note

Advice

[Alan says: "Some of the advise seems different from the other... Ensure unique definition, find the best definition, are something like what to do when something isn't perfect and rest are how to make things perfect" -- maybe this idea should be turned into an introductory paragraph]

Publish defining descriptions

When you mint a new term, make sure you write a defining description (DD) and publish it somehow. [I don't like the acronym DD, but got tired of writing the two words out so many times. Please advise.] Publish the DD in a venue that will last at least as long as all uses of the term. Ideally, publish it archivally, e.g. in a journal article or persistent web archive.

A special case is that of network resources [cite RFC 2616]. By convention advocated in AWWW and elsewhere, a network resource [JAR's coinage] is automatically named by its URI (see [[/../StatusOfHttpScheme]]). If a term is an http: URI, and the following hold:

  1. the responsible web server yields a 2xx response when responding to any GET request using the URI
  2. the server handling the request observes the httpRange-14 recommendation

then you may take the term to be defined to mean the network resource defined by the server's responses to requests using the URI.

Make the defining description easy to locate

Making a DD available is of primary importance. Manual research of the old-fashioned kind is effective in scholarly work using natural language, and will be the most general and robust way to track down the author's or community's intended use of a term. But making a DD easy to find, and in particular making it easy for machines to find, is also important (not that an automated agent will necessarily know what to do with the DD, but hey). This can be done either by the DD's original publisher, by choosing URIs for terms that enable sufficiently durable "follow your nose" access, or by subsequent users of terms, who can facilitate interpretation either by repetition or by citation.

The conveyor of some RDF might:

  • Include the DD in a graph that uses the term
  • Or, cite a document containing a DD
    • (!!! We need to agree on a way to do citation. compare owl:imports.)

The publisher of a term's DD might:

  • Publish the DD in a private or public location [check MeaningOfaTerm] that readers are likely to know about
  • Mint terms that are locators (understood by a wide variety of browser-like things) and publish DD's at related locations (see below).
  • Make sure the chosen URI is sufficiently durable as a locator, or at least has a good shot at being durable. Since it is often difficult to predict the lifetime of a term, that means being conservative (talk about purls/handles here?).

In the http: scheme, two methods have been advocated for publishing DD's.

  1. Use a 303 redirect to send the agent to a DD for the term
  2. Mint # URI's and place a DD at the #-racine (note that # relinquishes server control) (avoid sharing the same #-racine among multiple terms as this breeds confusion over when one definition stops and the next begins, overcommitting and risking versioning headaches)
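
The advice against sharing a #-racine among several terms can be checked mechanically. The following is a minimal sketch, not a prescribed tool: it computes the racine of each # URI and reports any racine used by more than one term. The function names and the example.org terms are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def racine(term):
    """Return the term with the '#' and everything after it removed."""
    return urldefrag(term).url


def shared_racines(terms):
    """Map each racine to its terms, keeping only racines shared by 2+ terms."""
    groups = defaultdict(list)
    for term in terms:
        if "#" in term:  # only # URIs have a racine distinct from the term
            groups[racine(term)].append(term)
    return {r: ts for r, ts in groups.items() if len(ts) > 1}


# Hypothetical terms, for illustration only.
terms = [
    "http://example.org/terms/protein#this",
    "http://example.org/vocab#gene",
    "http://example.org/vocab#transcript",
]
```

Here `shared_racines(terms)` would flag `http://example.org/vocab`, since two terms would have their definitions placed in the same document.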

(Must acknowledge the LSID criticisms of http: URIs here. Most agents don't know better and will foolishly follow the URI, and stop there. The argument is that it's better to use a non-http URI, so that you will get no answer (go to no location) rather than what might be the wrong answer (or the right answer from the wrong location).)

(Finding a DD is not the same as finding all information relating to what the term names. Address Mark W's LSID use case here.)

Compose clear, unambiguous definitions

No magic here; good definitions are difficult to craft.

Definitions should specify a single, particular usage. For example, a term should be used for a document, or a thing described by the document, but never both (except in the unlikely event that the document is intentionally self-describing). This applies even if the document is a database record: Some statements are true of the record but not the thing, and vice versa. If both have names, they need to have different names.

[Of course you may be able to avoid making a name for one or the other, using blank node notation specifying the relation between the two.]

[Insert advice on writing clear and effective definitions. Talk about examples, counterexamples, "type specimens," comparisons, etc.]

Although any DD is better than no DD, it is better if a DD is expressed in RDF. This is obviously not a guarantee of quality, nor a guarantee that any particular automated agent can do anything useful with it. Certain formal aspects of a definition, such as a subclass relationship, can be expressed well in RDF. It is recommended that DD's contain either an rdfs:comment, OWL constraints adequate to uniquely determine the referent, or some well-justified alternative.
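
As a rough illustration of what a small DD expressed in RDF might contain, the sketch below represents a DD as a list of triples, checks that it carries an rdfs:comment for the term, and serializes it as N-Triples. The term, its superclass, and the comment text are all hypothetical; only the RDF and RDFS namespace URIs are real. This is one possible shape for a DD, not a required one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical term and definition, for illustration only.
TERM = "http://example.org/terms/Mountain"
dd = [
    (TERM, RDF_TYPE, RDFS + "Class"),
    (TERM, RDFS + "subClassOf", "http://example.org/terms/Landform"),
    (TERM, RDFS + "comment", '"A large natural elevation of the earth\'s surface."'),
]


def has_comment(triples, term):
    """True if the triples include an rdfs:comment whose subject is the term."""
    return any(s == term and p == RDFS + "comment" for s, p, _ in triples)


def to_ntriples(triples):
    """Serialize triples as N-Triples; objects starting with '"' are literals."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)
```

Note that the DD deliberately says nothing about any particular mountain's height; only statements meant to hold indefinitely belong in it.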

Don't issue a 2xx response unless you intend for the term to denote the network resource defined by responses to requests that use the term. [httpRange-14]

[The "Banff Manifesto" [cite] insists that a certain set of properties be specified. Alan suggests that at the very least there should be an rdf:type.]

Definitions (which should be small and extremely stable) should be separated from other RDF related to the term (such as statements that describe the denoted resources). The non-definitional RDF is likely to be less stable than the definition, and those attempting to understand the commitment associated with the term may be confused as to which information constitutes part of the definition and which does not. [But how to delimit, exactly? Consult D Booth's memo. Put them in separate documents, yes? What a pain.]

Use a term in a manner consistent with definition

There is no magic solution. You must be willing to do some research to make sure the way you're using a term is consistent with how the community is using it. If a definition is unclear, figure out how the term is used in practice and attempt agreement with that.

Ensure unique definition

The existence of multiple credible DD's forces an unpleasant choice on those who would use a term. Choosing the most recent or most "authoritative" DD may lead to misunderstanding of a graph since the graph may have been composed using an earlier or different DD.

Multiple definitions can arise in various ways:

  1. Accidental collisions
    • Make sure, when you mint a new term, that it's not already in use. Collisions are addressed by the "URI owner" or "naming authority" mechanism of web architecture [cite]. Not covered by this convention is the problem of URI reassignment as a result of domain name loss and capture, but as this possibility seems rather unlikely at the present time, there is little call right now for solving this problem.
  2. Change in concept over time (example from natural language: 'transient ischemic attack')
    • Be clear on expectations. Some of what you say involving the term may be intended to be defining (true indefinitely) while other statements are observations related to the term's referent and subject to correction or other kinds of revision. E.g. a definition of a term for a particular mountain should not include a statement of the mountain's height, which would be subject to change or correction.
    • Don't change a definition - mint a new term instead.
  3. Disagreement over "correct" interpretation leading to multiple "clarifying" DD's
    • Another remote problem for now, but likely to arise as it has in the technical literature. This can only be solved through community process, just as disagreement over a popular term in natural language would be. Statements by the "URI owner" should be given special weight but if the owner is not available for consultation or is as fallible as the rest of us then they should not necessarily be considered an "authority" on the term's meaning.
  4. Disagreement over methods of definition accessibility

If for some reason a term must get a new DD, at least publish the new DD under a distinct document name (URL, etc.), so that the correct DD can be cited unambiguously. Relate versions of DDs to one another using a suitable ontology [which one?] [where to put such assertions - effectively a citation of a previous document version by a new version? in the later DD?].

Find the best definition

It is important for a consumer of RDF to obtain the best definition of a term. Ideally, there is only one definition, but one must defend against instability and disagreement.

As the goal is communication, "best" means the definition closest to the intent of the author(s) of the RDF you're trying to use. In the absence of inline definition or citation, this may be difficult to track down, so a heuristic search may be required.

In the event one chooses to consult the web (don't skip the steps of seeking definitions closer at hand or more definitely cited), do so carefully [see ../MeaningOfaTerm]. The following heuristics may be helpful [per TAG/Cool URIs]:

  • Some servers will observe the httpRange-14 recommendation. If so, then a 2xx response implies that the term refers to a network resource.
  • Some servers will provide defining RDF in a document reached by following a 303 redirect.
  • If the spelling of the term contains a #, the servers may provide defining RDF via a network resource named by the URI that is the "racine" of the term (the truncation of everything starting with the #).
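
The three heuristics above can be sketched as a single decision function. This is only an illustration under strong assumptions: the caller has already issued a GET and passes in the observed status code and any Location header, and the server is assumed (without any way to verify it) to follow these publication conventions. The function name and return labels are hypothetical.

```python
def where_to_look(term, status=None, location=None):
    """Guess where a defining description for an http: term might be found.

    status and location are what a GET on the term returned (None if not tried).
    Returns a (kind, url) pair; the kinds are labels invented for this sketch.
    """
    if "#" in term:
        # # URI: look at the racine, i.e. the term truncated at the '#'.
        return ("racine", term.split("#", 1)[0])
    if status == 303 and location:
        # 303 redirect: the target may hold defining RDF for the term.
        return ("see-other", location)
    if status is not None and 200 <= status < 300:
        # 2xx under httpRange-14: the term names a network resource.
        return ("network-resource", term)
    # No convention applies; fall back on citation or manual research.
    return ("unknown", None)
```

For example, `where_to_look("http://example.org/id/gene", status=303, location="http://example.org/doc/gene")` points at the redirect target, while a term in a non-http scheme yields `("unknown", None)`. A result from this function is a hint about where to search, not a verdict on what the term means.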

It would certainly be nice to know which servers obey which rules, yes? Certainly TAG/AWWW recommendations are not part of the HTTP protocol, so there is no obligation to follow them. On the other hand, if a term occurs in RDF, there is a better than even chance that the named server is aware of this architecture and is using these publication conventions.

But so far the only methods for determining conformance are informal [future work].

It is important to be skeptical of definitions, as they can fail in a variety of ways. For example, an author may use terms in contradiction to definitions supplied in the same graph, a web server may provide a definition that differs from the one consulted by the author of a graph you're trying to understand, a definition found in a standard location may be unclear while community practice around the term's use is not, etc.

Make an effort to re-use terms

It is undesirable to have two distinct terms in use for the same thing.

  • There is no magical solution. Please try to be aware of what others in your field are doing terminologically and replace terms if necessary in order to build community consensus.
  • But be careful: The value of term reuse is so high that one may be tempted to use an existing term when not completely appropriate. This leads to overloading and confusion.
  • There may sometimes be an awful tradeoff between stability and popularity: a popular term may be unstable as a locator, while the application at hand may require durability. In this case make sure that there are other ways to locate a DD, as described above (sort-of example: BMC's new practice of caching web pages).
  • Documents and database records whose publishers have not provided a URI, or who have provided a URI that is unstable or difficult to use, present an important special case.
  • [Talk about ontology version: when to re-use and when to mint?]
  • Existing well-documented terms may be rejected because either (a) the corresponding definitions are difficult to locate by certain applications, or (b) because they are not "browser friendly". This presents an as yet unsolved quandary for the community; see below. My advice is to assess suitability of a term based on criteria other than the infrastructure's ability to deal with it. I will give dissenting views in future versions of this document.
  • Similarly, existing well-documented terms may be rejected because the access infrastructure that they imply in other contexts (such as web browsing) appears to collide with the goal of establishing the terms' credentials as "identifiers". In particular, you can't tell whether an http: URI means something other than a network resource without locating a definite statement to the contrary (e.g. via a 303 redirect). It is probably impossible to reverse this practice, especially given that it is used at the heart of RDF practice (e.g. rdf:type) and has an influential constituency (the TAG), so it is not clear what is gained by avoiding these terms and replacing them with redundant terms in a different region of URI space.

Seek out and state relationships

Failure to recognize or state relationships (e.g. missing subclass/subproperty assertion) can lead to incomplete answers to questions. For example, a graph containing mother assertions combined with a graph containing parent assertions is less connected than it should be if the mother-parent subproperty relationship is unstated.
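
The mother/parent case can be made concrete with a small sketch. It assumes triples are plain tuples and uses hypothetical `ex:` names; the only inference applied is the RDFS subproperty rule (if p is a subproperty of q and "s p o" holds, then "s q o" holds), repeated until nothing new appears.

```python
SUBPROP = "rdfs:subPropertyOf"


def infer_subproperty(triples):
    """Close a set of triples under the RDFS subproperty rule."""
    closed = set(triples)
    while True:
        pairs = {(s, o) for s, p, o in closed if p == SUBPROP}
        new = {(s, parent, o)
               for s, p, o in closed
               for child, parent in pairs
               if p == child}
        if new <= closed:  # nothing new was derived
            return closed
        closed |= new


def subjects_with(triples, prop, obj):
    """Sorted subjects s such that (s, prop, obj) is in the triples."""
    return sorted(s for s, p, o in triples if p == prop and o == obj)


# Hypothetical graphs: one uses ex:mother, the other ex:parent.
graph_a = {("ex:alice", "ex:mother", "ex:carol")}
graph_b = {("ex:bob", "ex:parent", "ex:carol")}
bridge = {("ex:mother", SUBPROP, "ex:parent")}
```

Asking the merged graphs for everyone with `ex:carol` as `ex:parent` finds only `ex:bob`. Once the `bridge` statement is published and the rule applied, the same question also finds `ex:alice`.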

So, if after publishing a graph you discover a related ontology, make an attempt to establish relationships between your terms and theirs, and publish the relationships.

But be careful: The value of such relationships is so high that one may be too eager to relate. The correct relationship should be sought and if necessary defined; there is no need to latch on to "loose fits" that are less than accurate, and correct relationships can themselves often be related to the obvious ones via subproperty assertions or using OWL. The effort will pay off in query accuracy.

Use of owl:sameAs, even when legitimate, should be avoided except as a way to bridge independently created graphs neither of which can be modified. Simply using the best term of the two (the one judged most likely to rally consensus; usually the one published first) is preferable since it allows linking through the term even when inferences using owl:sameAs assertions are not made (e.g. when inference is limited to RDF entailment [cite]).
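
A small sketch may help show why a shared term links graphs even without owl:sameAs inference. It assumes triples are plain tuples and uses hypothetical names throughout; `canonicalize` stands in for the editorial step of adopting the preferred term, which is only possible when the graph can be modified.

```python
SAME_AS = "owl:sameAs"


def statements_about(triples, term):
    """Plain lookup by subject, with no owl:sameAs inference applied."""
    return {(p, o) for s, p, o in triples if s == term}


def canonicalize(triples, preferred):
    """Rewrite subjects and objects to preferred terms; drop sameAs triples."""
    def pick(t):
        return preferred.get(t, t)
    return {(pick(s), p, pick(o)) for s, p, o in triples if p != SAME_AS}


# Hypothetical graphs using two different terms for the same gene.
graph_a = {("a:gene1", "ex:locatedOn", "ex:chr7")}
graph_b = {("b:geneX", "ex:encodes", "ex:proteinP")}
bridge = {("b:geneX", SAME_AS, "a:gene1")}

merged = graph_a | graph_b | bridge
rewritten = canonicalize(merged, {"b:geneX": "a:gene1"})
```

A plain lookup of `a:gene1` in `merged` sees only the `ex:locatedOn` statement, because nothing interprets the bridge. In `rewritten`, where both graphs use the one term, the same lookup also sees `ex:encodes`.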

Future work

Here is what we as a community need to do in order to make the above advice easier to follow.

(Alan: Points in this section have to link back to the previous arguments. The motivation needs to be clear.)

  1. Citation is central to scientific discourse. We need to develop a theory - an ontology - of document reference that is both principled (ontologically web-independent) and as harmonious as possible with current practice.
  2. Terms should have the potential to outlive any domain mentioned in their spelling. We need to figure out how. BMC has taken one small step in this direction [cite]. URI resolution ontology is another approach.
  3. It is recommended that systems be established, similar to journals, for quality control of RDF graphs. If a graph meets requirements for documentation, consistency, citation, and coordination with other graphs, it should be recognized as such.
  4. Develop an ontology for stating server policy regarding 2xx-responders
  5. It is recommended that the community figure out what terms should be used for public database records (such as those in Entrez Gene), and come up with a versioning story for them. The terms should be impeccably hosted - that is, made available durably and consistently - by an organization that the community can trust.
  6. Develop ontology for describing site policy and network resource specifications (relations between temporal versions, language variants, format variants, etc.).
  7. Figure out whether to use published terms that are not browser-friendly or HTTP-friendly (e.g. belong to non-"locator" URI schemes). Personally I sympathize with both sides of the debate, but I don't see this issue as being important enough to warrant the creation of a second set of terms when terms that work perfectly well already have published definitions. (e.g.: the new info:inchi/ URIs - although one might want to avoid these just because they're not very well specified.)
  8. An independent document repository would be nice, as a way to adopt abandoned projects' documents and otherwise do durably-named and durably-served publication (a la Genbank), especially of defining documents.

Acknowledgments

David Booth, Michel Dumontier, Pat Hayes, Eric Jain, Sean Martin, Eric Neumann, Marc-Alexandre Nolin, Chimezie Ogbuji, Roderic Page, Bijan Parsia, Eric Prud'hommeaux, Daniel Rubin, Alan Ruttenberg, Matthias Samwald, Mark Wilkinson, and others (please remind me).

Change log

  • 2007-10-08 Added remarks attempting to explain how the locator/identifier debate fits into this framework.

Rough notes

Only the truly committed should read beyond this point.

TBD:

  • Most advice is to application developers; say so
  • Talk about genbank and/or NIH permanence policy
  • Distinguish between *potentially* durable names (e.g. purls) and credibly durable service (e.g. libraries)? Persistence of a *name* is not the same as persistence of the *information*.

Controversies

2xx-responders (network resources): do their URIs denote actual server behavior (as TimBL suggests), or ideal server behavior? If the former, the LSID proponents and librarians will not want to use them to refer to document-like things (e.g. for citation). If the latter, new mechanisms will need to be developed to direct agents to definitions (specifications, promises) so that readers and writers of RDF know exactly what's meant by the URI.

The LSID/HTTP battle: It is a goal of mine to get LSID and http: users to collaborate and start worrying about issues more important than what naming scheme to use. Both naming schemes are in use, so all diligent clients will need to be able to deal with both kinds of terms. Minting new URIs for things that have perfectly good ones already is a bad idea (although "perfectly good" is a high bar). The question is whether, for new terms, there is any reason to prefer one scheme over the other. The LSID spec has plausible versioning and metadata stories; but these can be replicated in http-land (using 303s or #-racines) once we have a versioning ontology. LSID has a plausible location independence story; I have tried to show that the HTTP space does too -- no sensible agent should limit itself to what it finds at the http: URI. I know this is a tough pill to swallow but as I say it's already been swallowed, e.g. any time you use a non-LSID ontology such as RDFS. What http: has to offer is the ability to get at definitions using ubiquitous software (HTTP GET) via the follow-your-nose heuristics. Yes, this is unreliable, but so is consulting the DNS record for the LSID authority -- and so far there is no central registry of LSID authorities that might deal with loss of the authority's LSID DNS record, while at least purl.org provides for forwarding.

SW FAQ

Following is a list of questions about SW, some of which we might try to answer in this document or in a companion document.

Questions that HCLS members might have:

  1. What is the semantic web? (collective RDF, or an access apparatus?)
  2. What can I use it for?
  3. How do I use the semantic web?
    • How do I figure out what a term means?
    • How do I browse the semantic web (esp. for HCLS content)?
    • How do I search the semantic web?
  4. What is a resource? document? information resource? representation? (TimBL has ontologies)
  5. What naming scheme should I use - http:, LSID, handle, other urn:, or info: ?
    • If HTTP - Should I use # or / URIs in ontologies?
  6. What host should I name in my URIs (LSID or otherwise) - that of my own project, or one that can provide persistence?
    • How do I find a persistent definition provider?
      • How do I use purl.org?
      • Where should my stuff be hosted?
      • Why should I trust it with my stuff?
  7. What are my civic responsibilities as author/publisher?
    • [Provide definitions somewhere - anywhere.
    • Definition should include at least an rdf:type, and either an rdfs:comment or a rigorous formal definition.
    • Provide definitions at the location implied by the URI, using the protocol implied by the URI.
    • Don't confuse definitions with (other) discourse.
    • For HTTP: Follow httpRange-14, if you can figure out what it means.
    • If you inherit naming authority to a namespace (e.g. domain name), don't reuse previously circulated URIs for new purposes. (but how do I know if there were any, and what these were?)]
  8. This is hard. Where do I get help?

Definition quality hierarchy

(The following is an idea I'm trying out - an analysis inspired by Latour's book Laboratory Life.) Quality of definition/description (of term) hierarchy.

  • Consensus definition. (everyone knows, goes without saying)
  • High-quality citation. (positively identified document)
  • Adequate information in same graph.
  • Ad hoc citation (web reference). (unreproducible)
  • Reverse engineered.
    • via protocol specified by spelling (303)
    • spelling = hint
  • Unconstrained / undefined...

Etc

... central authority for technical terms not possible, but particular namespaces will have authorities (e.g. CAS numbers, genbank ids, pubmed ids...). Different kinds of authority: authority over the namespace (what a term means) vs. authority w.r.t. a subject matter (to coin a term of a certain kind, coordinate with the authority)...

... automated agents will vary in their expertise at locating DD's... some will only be as smart as a web browser, while others will consult a wide variety of sources and speak a wide variety of protocols (LSID, info, wayback machine & other 3rd-party archives, etc)... we encourage the latter of course.

... relate "decision to use or not use RDF" to "trust" layer in semweb layer cake (Sandro)

... would the document be improved with presentation of detailed use cases, as opposed to examples in the running text? e.g. web browser, web application, SPARQL (RDF cache), computation. I think not, but the text needs many more examples than it has.

... talk about publisher conflict of interest ... no interest in stability or consistency... example: out of business, mergers, ISBN recycling.

(from outline.html)

  1. Establishing connections with other works requires vigilance in the selection of terms. Before a new term is coined, a literature search should be conducted to find candidate terms that can satisfy the need.
  2. Because of the possibility of creating fresh terms at almost no cost, RDF has the potential to eliminate confusions that ordinarily plague scientific discourse. Any time a new meaning is needed that is at variance with meanings of known terms, a new term can be created.

... talk about the problem of using 2xx-responders for citation. 2xx-responder is defined to be web behavior. Tempting to use them for citation, but such a term does not have the intended meaning (according to TAG) - it refers to the way the server actually behaves, rather than the way it is supposed to behave. Server policy statements are important so that the latter is transparent. Work in progress.