
HCLS URI Recommendations

The Meaning of a Term

Jonathan Rees, September 2007

This memo is intended to cover, in part, one of the subjects that the HCLS URI recommendations document (in progress) is supposed to address: how to use a URI. Topics to be covered separately include how to mint URIs, how to publish declarations, and how to choose URIs to identify public database records (such as GenBank and PubMed).

I wrote this to help my own understanding of the issues. My goals were: to compare LSIDs with W3C recommendations (httpRange-14 and so on) without being too disrespectful of either; to highlight problems with received (W3C) semantic web architecture; and to take steps toward an HCLS URI recommendations document.

Introduction

RDF is a language for communication, and the terms of this language are URIs (or perhaps IRIs). In order for a language to be effective for communication, terms need to have the same 'meaning' for sender and receiver. Therefore the problem of determining the sender's intended meaning for a term is central to the effectiveness of RDF. (For simplicity I'm assuming that 'sender' and 'author' are synonymous.)

Definitions:

  • Term - any kind of URI or IRI as used in RDF. Michel: A URI is a name that identifies something, but in itself has no (explicit) meaning. Implicit meaning might be derived from a human-readable definition. Explicit semantics are provided by axioms associated with the name.
  • Meaning (of a term) - constraint on use (see note)
  • Definition - a piece of writing that specifies a term's meaning
  • Declaration - a particular manner of definition (cf. Booth)

So how do we determine the meaning of a term? I will review the state of the art. I do mean to be provocative. If you find that I'm distorting the truth for my own ends (and this is likely) please call me on it. Feel free to insert commentary prefixed by your initials. DBooth: Names please? I find initials hard to decipher.

Consistency of meaning

The "right" meaning for this analysis is the one intended by the sender, since the point of the exercise is to get the receiver to understand what the sender meant.

According to web architecture, putative meanings for a term, such as the one intended by the sender in our scenario, are ultimately supposed to be based on the behavior of one of the term's naming authorities or "owners" - that is, an owner of the URI that is the term. Ownership is determined by the spelling of the term according to a set of rules outlined in AWWW, and may change over time.

The consistency of a term's meaning is the degree to which putative meanings agree over time and space. A term is consistent if all putative meanings are consistent with one another.

There is no test for consistency. We can gather evidence for it, such as past URI owner behavior, owner's promise of integrity, consistency ethic associated with URI syntax (e.g. urn:), testimonials, organizational charters, ability to pay DNS registration fees, etc., but decisions that depend on consistency, such as whether or not to use a particular term, ultimately have to be made on an assessment of relative risk.

Michel: One can check the consistency of information described using formal knowledge representation languages like RDF and OWL. The number of consistency checks will depend on the expressiveness of the language.

If a term's meaning is consistent over the duration of a transaction (publication of putative definition, sender learns of definition, sender composes and sends message, receiver receives and processes it, receiver learns of definition) then use of the term will be successful. If the sender and receiver have some other way to agree on the meaning of the term, such as use of a citation, then there will also be success. Otherwise, the sender and receiver may have different understandings of the term without even knowing that they do, and confusion can result.

For this reason, inconsistency is considered antisocial. That is, if a communication fails as a result of inconsistent published definitions, then the sender might be faulted for relying on a term whose owner does not have impeccable consistency credentials (but note that even an agent with impeccable credentials may fail), and the receiver may be faulted for attempting to make use of a message containing such a term, but primary responsibility falls on the owner for not following web architecture. Of the three, the owner was the only one in a position to prevent confusion.

Michel: Inconsistency is the result of a logical violation leading to an unsatisfiable logical position, such that no model exists for a given ontology. It would be difficult to ascertain whether one meaning is different from another without a language expressive enough to evaluate whether the two are inconsistent with one another. RDF does not provide the necessary semantics to determine whether two things are i) the same, ii) different, iii) consistent, iv) inconsistent.

Accessibility of definitions

One may obtain defining documents (such as declarations) for a term from a variety of sources. In order to have the best shot at recovering the sender's meaning in spite of threats to consistency and accessibility, I would encourage agents to try the following:

  1. The message itself Michel: Indeed, OWL reasoners require that users provide the knowledge base for reasoning. The omission of data that might be obtained from a 'follow your nose' approach may in fact be intentional - so as to avoid inconsistencies, etc.
  2. A document cited by the message (e.g. an OWL "imports" clause) DBooth: Or an rdfs:isDefinedBy, as suggested by Chimezie? Michel: I agree with the use of rdfs:isDefinedBy to indicate any number of documents containing statements about the resource.
  3. A repository or cache provided by the sender
  4. Another repository or cache DBooth: Is this receiver-supplied?
  5. A past or present owner of the URI

The ordering here approximately reflects both reliability and performance. Even if the term's meaning is inconsistent, the message or a cited document is likely to give the meaning intended by the sender; and a repository maintained by someone who cares about the meanings of terms is more likely to reflect the sender's intent than a URI owner, who is not necessarily in a good position to maintain either accessibility or consistency.

DBooth: I think the main point is that all of these techniques may be needed or desirable at one time or another, so a flexible solution should enable them all.
Michel: I think that the first two options are the real options from which others may be derived.
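A minimal sketch of trying these sources in the order listed above, using rdflib; the cache objects and their lookup method are hypothetical placeholders for whatever repository access an agent actually has.

    # Sketch of the source ordering above, using rdflib. The cache interface
    # (lookup method) is hypothetical; adapt to whatever store is used.
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL, RDFS

    def candidate_definition_graphs(term, message_graph, caches=()):
        """Yield graphs that might define `term`, most reliable source first."""
        term = URIRef(term)

        # 1. The message itself.
        yield message_graph

        # 2. Documents cited by the message (owl:imports, rdfs:isDefinedBy).
        cited = set(message_graph.objects(None, OWL.imports))
        cited |= set(message_graph.objects(term, RDFS.isDefinedBy))
        for document in cited:
            graph = Graph()
            try:
                graph.parse(document)      # fetch and parse the cited document
                yield graph
            except Exception:
                pass                       # citation inaccessible; keep going

        # 3./4. Sender-provided or other repositories and caches.
        for cache in caches:
            graph = cache.lookup(term)     # hypothetical repository interface
            if graph is not None:
                yield graph

        # 5. Last resort: dereference via the (current) URI owner.
        graph = Graph()
        try:
            graph.parse(str(term))
            yield graph
        except Exception:
            pass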

These sources should be consulted for any term, regardless of its URI scheme. Unfortunately, for terms that are URLs, many semantic web agents will probably go straight to the current URI owner. Michel: That's not necessarily true. Web architecture (AWWW) does not require such behavior, however, and W3C is urging everyone to treat http: URIs as identifiers first and locators second. Whether the community can be convinced of this has yet to be determined, especially given that conflicts of interest can easily arise between a term's role as locator and its role as identifier (and we know which interest will win). Michel: The latter comment is argumentative. However, the semantic web already is committed to using some http: URIs as identifiers, and this implies that all agents ought to be prepared to obtain declarations (and other related information) from the most reliable source -- which is not necessarily the URI owner.

Few, if any, current web applications consult repositories (or caches). (Applications that consult local LSID resolvers would be a notable exception.) Such applications may (a) fail to find declarations that would otherwise be available, and (b) incur unnecessary network traffic, with damage to performance and privacy. Probably the main reason for this failure is the absence of any widely understood and deployed protocol for talking to such repositories. It has been proposed to use an ordinary web proxy such as Apache or Squid for this purpose. However, a proxy can't overcome the limitations of "follow your nose", can't give declarations for 200-responders, and will be of no use for URIs (such as LSIDs) that user agents won't send to it.

DBooth: Out-of-the-box proxies may not do this, but these are all things that it would be straightforward to extend a proxy such as Squid to do.  Actually, Squid already has a redirect_program option that can be used to specify a program to rewrite URIs.  We would still need a standard way to express such URI rewrite rules though, so that they can be easily shared between users.  I don't know if either the Apache mod_rewrite format or the W3C Rules Interchange Format (RIF) would be useful for this.  I suspect RIF would be overkill.  I would guess that simple regex rewriting would suffice in many (most?) cases.
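To make the idea concrete, here is a minimal sketch of a regex-based redirector of the kind Squid's redirect_program can invoke. The rewrite rule and local repository URL are made up, and the one-request-per-line input format assumed here is the classic redirector convention; consult the proxy's documentation before relying on it.

    #!/usr/bin/env python
    # Sketch of a Squid-style redirector that rewrites selected term URIs to a
    # local declaration repository. The rewrite rule and repository URL below
    # are made up; the assumed input is one request per line, URL first.
    import re
    import sys

    REWRITE_RULES = [
        # (pattern, replacement) pairs -- the shared, regex-based rules above
        (re.compile(r"^http://purl\.example\.org/terms/(.*)$"),
         r"http://localhost/declarations/\1"),
    ]

    def rewrite(url):
        for pattern, replacement in REWRITE_RULES:
            if pattern.match(url):
                return pattern.sub(replacement, url)
        return None

    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue
        rewritten = rewrite(fields[0])
        # Emit the rewritten URL, or a blank line to leave the request alone.
        sys.stdout.write((rewritten or "") + "\n")
        sys.stdout.flush()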
DBooth: Another question: How can an agent know whether it has the declaration for a given URI?  If it received a repository or cache along with the message, how can it know whether that repository merely contains additional assertions about the URI, which should be combined with the URI declaration, or the URI declaration itself?  If follow-your-nose URI dereferencing is always used by the agent to get the URI declaration, but an intervening proxy redirects the request elsewhere (perhaps to a local cache) if follow-your-nose dereferencing should not be used for that URI, then the agent's algorithm for getting the URI declaration will always be clear and simple.

Are inconsistency and inaccessibility ever OK?

The information we're recording in RDF is of such potential value that we may want to use others' terms even when those terms have potentially inconsistent meanings, as with an ontology still in draft form, or RDF graphs still under development by those not skilled in composing definitions. So it seems plausible that we will want to be able to live with some amount of inconsistency.

DBooth: Yes, inconsistency will be a part of life.
Michel: Again, there can be no inconsistency in RDF graphs. The challenge with working with inconsistent OWL ontologies is that any conclusion will follow, so reasoners tell you there's a problem, and stop. I suspect that there will be methods developed in the future to work out which parts of an ontology are compatible, thus enabling reasoning across multiple ontologies that diverge in their meaning.

Accessibility has similar properties: we may want to use a term even though the URI owner doesn't guarantee that its declaration will be accessible; we may be willing to keep our own copy, or to gamble that somehow accessibility will be recovered in some other way.

Certain projects with which we would like to communicate may not know about the need to obtain terms from impeccable URI owners, and may have followed received semantic web wisdom and used their own domain names in minting terms. Even those that know the value of consistency and accessibility in principle may find the overhead of obtaining high-quality terms to be too heavy a burden.

In principle, inaccessibility is not as bad as inconsistency. At worst, inaccessibility only leads to a message being not understood, while inconsistency can lead to incorrect understanding. Accessibility is relative - a definition that is unavailable via one access method (such as HTTP GET) may, with more work, be available using another method (such as http://web.archive.org/collections/web.html or a backup tape).

In the future I believe there may be a need for a consensus definition to force a consistency that is threatened by a URI owner. Is it reasonable to think that we will all change all of our RDF documents that use a term whose owner was thought to be impeccable and whose meaning was thought to be consistent, just because some unexpected change led to loss of consistency? There ought to be a social convention whereby a URI owner's misbehavior relating to its own popular terms can lead to its losing its moral right to publish definitions of those terms.

DBooth: It is possible that such conventions will be needed (though we do have courts of law already), but I think it would be better to wait and see how big a problem this is before we try to solve it.
Michel: I would argue that the web is a natural democracy... users will pick and choose those documents whose semantics they agree with.

Certain applications may be less vulnerable to inconsistency threats than others. Scholarly publication is extremely demanding of consistency and accessibility since transmission and reception might be separated by many years. On the other hand, communication transactions that happen over short time periods will usually be reliable if definitions change much less frequently than typical transaction length. This may be the use case intended by the bulk of existing semantic web applications and tutorials, which seem to not worry very much about these issues.

What kind of thing specifies a term's meaning?

When looking for evidence of a sender's meaning, at least three distinct kinds of artifacts are relevant:

  • Declarations (see http://dbooth.org/2007/uri-decl/) - definitional prose and/or RDF, perhaps written by the agent that minted the term, either prescribing how the term is meant to be used, or describing how it is used in practice
  • Sample usage - RDF from various sources that uses the term (perhaps in describing the denoted resource); potentially useful in determining meaning from context
  • Related artifacts - e.g. when a term means a biological species, the type specimen for the species serves to define the term ("the species to which this specimen belongs"); or when a term names a document, then a version of the document itself may help to define the term (although note that a term for a document often wants to denote some abstraction that subsumes multiple distinct bit sequences)

Of these, a declaration is most useful (if it is correct). Lacking that, reverse engineering from usage may be necessary. A related artifact may also help for reverse engineering, especially if the relationship of the artifact to the meaning is known, inferable, or unimportant.

DBooth: If there is no declaration, then I think the best option is to make one up (by reverse engineering) and publish it as a stake in the ground, so that others can at least use the same one if they choose.

The notion of declaration is not widely deployed at present. For the time being we will need to settle for RDF graphs that use the term, since that is generally all that is available. Most of what I say here about finding declarations should apply to such graphs as well.

Protocols for finding a declaration

As described above, the URI owner may be the least reliable source of definitions for a term. However, the URI owner is the ultimate origin of a term's meanings (consistent or otherwise), and is expected to publish definitions of it. Supposing all other sources of definitions fail, a definition may be sought in various ways according to scheme:

data:

Shouldn't be a problem, right? You would expect these terms to identify particular documents. I am not aware of any use of these URIs in the semantic web, but it is easy to think they would be useful. (I would be very interested to hear of any instances.) On the other hand, it is not too difficult to imagine mismatches between a sender's use of such a term and a receiver's interpretation, so some amount of caution is in order.
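As a minimal sketch (the example content is made up), recovering the document identified by a data: URI requires no server, and so no URI owner whose behavior could drift:

    # A data: URI carries the identified document inside the URI itself, so
    # recovering it involves no server whose behavior could change over time.
    # (The example content is made up.)
    from urllib.parse import unquote

    term = "data:text/plain;charset=utf-8,An%20example%20definition."

    media_type, _, payload = term[len("data:"):].partition(",")
    print(media_type)        # -> text/plain;charset=utf-8
    print(unquote(payload))  # -> An example definition.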

info:

info: URIs are all documented at the main web site for the info: scheme, and in principle it's possible to figure out correct usage for all of them. But their consistency is not always clear (e.g. some ISBNs identify multiple publications), and the documentation is bound to be rife with vagueness and problems that only show up in a semantic web context.

Handle (including DOI)

The handle system provides stable identifiers, so you can be pretty sure that intervening events will not cause a rift between the sender's intent and the receiver's interpretation. However, although the handle protocol certainly allows for the attachment of declarations and citations to handles, as deployed they do not generally provide any defining information. For handles that name documents, the handle data often include a URL field, but following this URL does not necessarily result in the document, much less a citation or declaration, even when the GET is not thwarted by access restrictions.

(It is not clear what the best URI form of a handle is. Suggested prefixes include http://hdl.handle.net/, info:hdl/, urn:hdl:, and hdl:.)
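For illustration only, here is a sketch that probes the URL field of a handle through the http://hdl.handle.net/ proxy, which answers with an HTTP redirect when the handle carries a URL value. The handle used is a placeholder, and, as noted above, the redirect target need not be the document, much less a citation or declaration.

    # Probe a handle's URL field via the hdl.handle.net HTTP proxy, without
    # following the redirect. The handle below is a placeholder.
    import http.client

    handle = "10.1000/example"      # hypothetical handle/DOI
    connection = http.client.HTTPConnection("hdl.handle.net")
    connection.request("GET", "/" + handle)
    response = connection.getresponse()
    print(response.status, response.getheader("Location"))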

urn:lsid:

If the authority named in the LSID and the resolver that the authority designates are in business, or if you're able to discover another resolver, then you may be able to use a getMetadata web service call specifying an RDF MIME type to get RDF that defines the term. There's no guarantee that the result will be RDF, much less a declaration, but it's worth a speculative try.
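A sketch of such a speculative call against a resolver's HTTP binding; the endpoint URL and ?lsid= query form below are placeholders (in practice the endpoint would be discovered from the authority's service description), and the response may turn out not to be RDF at all.

    # Speculative getMetadata call against a hypothetical LSID resolver
    # endpoint. The endpoint URL and the ?lsid= query form are placeholders.
    import urllib.parse
    import urllib.request

    lsid = "urn:lsid:example.org:names:12345"                         # hypothetical
    endpoint = "http://lsid-resolver.example.org/authority/metadata"  # placeholder

    request = urllib.request.Request(
        endpoint + "?lsid=" + urllib.parse.quote(lsid),
        headers={"Accept": "application/rdf+xml"})   # ask for RDF, speculatively
    with urllib.request.urlopen(request) as response:
        print(response.headers.get_content_type())
        print(response.read(500))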

LSIDs that possess version components have the virtue that they are specified to identify particular "pieces of data," so the piece of data may be taken to be a definition of such an LSID.

According to Liefeld et al [cite], LSID authorities are supposed to register themselves centrally to facilitate location of resolvers and to ensure persistence of authority name definition - that is, to decouple authority naming (persistent) from the domain name system (ephemeral). I have no idea whether this in fact happens or whether it makes a difference, but it is related to the claim that LSIDs are location-independent. Authorities are urged to make sure that all LSIDs are defined persistently, which would imply consistently.

http: and https:

Again assuming that the URI owner is in business and can be trusted, you can attempt to use the recently articulated "follow your nose" approach, which is something like the following:

  • If the URI contains a #, truncate the URI by dropping the # and everything following, dereference the truncated URI, and look for RDF in the resulting HTTP response.
  • Otherwise, do a GET of the URI, following non-303 redirects. If the response is a 303, treat the Location: URI the same way as a #-truncation, as above.
  • Otherwise, if the response is not a 303, you're out of luck (see below).

This is a fragile and wasteful protocol, and can fail in a number of ways.

  • The document retrieved after #-truncation and 303-following may be anything at all, not necessarily a declaration or even RDF.
  • If you end up with RDF, there is no guarantee that the RDF is a declaration; at best it is likely to be a mix of a declaration and non-definitional assertions.

Worse, there is no way to tell ahead of time whether this protocol is likely to work at all.
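For concreteness, a rough sketch of the procedure using only the Python standard library; it asks for RDF up front (the content-negotiation question discussed next), and it checks nothing beyond the advertised media type, which is precisely the fragility just described.

    # Rough sketch of the follow-your-nose procedure described above.
    import urllib.error
    import urllib.parse
    import urllib.request

    RDF_TYPES = ("application/rdf+xml", "text/turtle")

    class SeeOther(Exception):
        """Signals that the server answered with a 303 redirect."""
        def __init__(self, location):
            self.location = location

    class StopAt303(urllib.request.HTTPRedirectHandler):
        """Follow ordinary redirects but surface 303s to the caller."""
        def http_error_303(self, req, fp, code, msg, headers):
            raise SeeOther(headers.get("Location"))

    def fetch_rdf(url):
        """GET url asking for RDF; return the body only if it looks like RDF."""
        request = urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})
        with urllib.request.urlopen(request) as response:
            if response.headers.get_content_type() in RDF_TYPES:
                return response.read()
        return None

    def follow_your_nose(term):
        url, fragment = urllib.parse.urldefrag(term)
        if fragment:                       # hash URI: truncate and dereference
            return fetch_rdf(url)
        opener = urllib.request.build_opener(StopAt303())
        request = urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})
        try:
            opener.open(request).close()   # follows non-303 redirects only
        except SeeOther as redirect:       # 303: treat Location like a truncation
            return fetch_rdf(urllib.parse.urljoin(url, redirect.location))
        return None                        # plain 200 (or anything else): out of luck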

It has been suggested that one might try asking for RDF in content negotiation - this might give a declaration. Unfortunately such server behavior would appear to be at odds with AWWW, because the RDF is a description of the resource, not a representation of it as AWWW requires of GET. (Here I'm assuming that "representation" actually means something. If it is tautologous and simply means whatever you get back from an HTTP request, then there are no "best practice" constraints on what GET can do and this idea works perfectly well.)

DBooth: I don't understand what you are trying to say in the above paragraph.  Semantic Web agents wanting RDF SHOULD ask for it.  Asking for RDF in the HTTP request is not at odds with AWWW.  If the URI denotes a non-information resource, then the server should return a 303 rather than a 200, but that is independent of the content type requested.

Others

Not clear that HCLS cares about any other kind of URI. Candidates for treatment include ftp: and mailto:. Let me know of others.

200 responses are not declarations

A prevalent but dangerous assumption is that a 200 response to an HTTP request somehow helps to define the correct usage of an http: URI. The relationship between 200 response payloads and the meaning of the term used to GET them is effectively unspecified by AWWW and is vague in practice. Even in the simplest case, in which the same payload is delivered with utter consistency and the term ought to identify that payload, there is no way for an observer to know that a different payload won't be delivered tomorrow, so the payload is of no help in determining the resource.

DBooth: I think you need to clarify whether you are talking about a 200 response after a 303 redirect (or after truncating the fragment identifier) versus a 200 response when the original URI is dereferenced.  I.e., are you talking about URIs for information resources or non-information resources?  If you are talking about a 200 response after a 303 redirect (or after truncating the fragment identifier), then I think existing convention is to treat the returned representation as a URI declaration, even if people have not necessarily called it a URI declaration.  OTOH, if you are talking about the significance of the 200 response itself, then the httpRange-14 decision says that the URI identifies an information resource, and I think the convention is to take it as identifying the particular information resource that just responded to the request.  :)  

An obvious and simple position would be to take the payload, or some abstraction of it ("essence"), to be the term's meaning, so that a change in payload or "essence" would be a change in meaning. However, AWWW specifies that the meaning of a URI (i.e. its identification of a particular resource) can be constant across variation in HTTP responses ("representations"). That is, meaning can be consistent even when server behavior isn't. As frequent changes in meaning would imply enormous meaning instability on the web, and might be at odds with current practice, I prefer AWWW's interpretation.

In the case of dynamic resources such as http://news.google.com/ or http://boingboing.net/, defining statements that characterize the class of 200 payloads would be very interesting.

Difficulties are not limited to changes to payloads over time. Documents that have different payloads for different combinations of HTTP headers (especially cookies and content-negotiation headers such as Accept: and Accept-Language:) pose a similar problem. A statement that the length in bytes of the document identified by a term T is 5531 might not be true of all language variants; but how is someone reading or writing T supposed to know whether such a statement can be made?

DBooth: I don't understand the relevance of these questions.  Can you explain?
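A small sketch of the question being asked: issue the same GET for a term under different negotiation headers and compare what comes back. The URL is a placeholder; any server that varies its representations will do.

    # Compare representations of the same term under different Accept-Language
    # headers. The URL below is a placeholder for a language-negotiated resource.
    import urllib.request

    T = "http://example.org/some-negotiated-document"   # hypothetical

    for language in ("en", "fr", "de"):
        request = urllib.request.Request(T, headers={"Accept-Language": language})
        with urllib.request.urlopen(request) as response:
            body = response.read()
            print(language, len(body), response.headers.get("Content-Language"))

    # If the lengths differ, a statement like "the document identified by T is
    # 5531 bytes long" is true of at most one variant, and nothing about T
    # itself says which variant, if any, the statement is about.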

The term means web behavior?

DBooth: I don't understand the relevance of this section.

It has been suggested that terms that are http: URIs for which 200 responses are given ("200-responders") are to be defined operationally: The term simply means the mapping from (HTTP request for T, time) to responses. That is, the traditional web has no semantics independent of server behavior (ordinarily one would just say 'no semantics'). The only statements one can make with certainty are historical, and statements that involve future behavior can only be believed if there is trust in future server behavior.

This operational semantics, while trivial to specify and implement, would be very unsatisfying to most people, I think, including TimBL. Writers of RDF want to talk about documents and blogs, not "network sources of representations". Documents and blogs have meaning independent of their embedding in the web: If the web suddenly disappeared, the document would still exist.

The cost of abandoning operational semantics is potentially high, however. Should server behavior ever vary from the published definition of a term, senders and receivers would face a difficult choice between adhering to the original, consistent definition (if they know it) and inventing a new definition that matches current server behavior.

Sometimes the payload does define the term?

It has also been suggested that a 200 payload may have metadata embedded in it that can help to define the term. Among content types that admit embedded metadata (PDF, HTML, JPEG, etc.), each has its own way of doing so, so finding embedded metadata can be exceedingly difficult. Even if RDF metadata is located, or the payload itself is RDF, assertions therein are likely to be descriptive only of that payload, not of the use of the term (URI) in general. I would not advise anyone to trust embedded metadata to define a term without independent assurances of consistency.
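As one example of what "finding embedded metadata" can mean in the HTML case, here is a sketch that scans a payload for link elements advertising an RDF document (a convention some sites use, e.g. rel="meta"); whatever it finds describes that payload at best, not the term in general.

    # Scan an HTML payload for <link> elements that advertise RDF metadata.
    # The rel value and the example page are illustrative only.
    from html.parser import HTMLParser

    class RDFLinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag != "link":
                return
            attributes = dict(attrs)
            if attributes.get("type") == "application/rdf+xml":
                self.links.append(attributes.get("href"))

    page = """<html><head>
      <link rel="meta" type="application/rdf+xml" href="about.rdf">
    </head><body>...</body></html>"""

    finder = RDFLinkFinder()
    finder.feed(page)
    print(finder.links)    # -> ['about.rdf']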

What should be done?

Encourage publication of declarations separate from other information

David Booth has already explained why. I completely agree. In the present context, the major reason to distinguish a declaration from other RDF that uses the term is that only the declaration need be stable - the other information, which is falsifiable and perhaps even variable, can change without changing the meaning of the term (and the usability of documents that use it).

Michel: The meaning of the term is defined by the document that makes statements about it. Under the open world assumption, there could be any number of such statements.

Establish a standard way to represent and publish declarations

There are many reasons for wanting a new protocol for publishing and accessing declarations.

  1. Currently there is no standard way to publish declarations (or other associated RDF) of terms that are 200-responders.
  2. Such a protocol is needed for semantic web proxies, which answer a variety of questions about resources that aren't their own.
  3. A deterministic protocol would be far preferable to the current 303 and #-truncation heuristics, which are unreliable and inefficient.

The LSID specification provides something close, but unfortunately it only works for LSIDs, and gives the appearance of being difficult to implement. Michel: This is too subjective. There are resources to implement this protocol. Since we must cope with RDF that contains URIs that are not LSIDs, a protocol that handles arbitrary URIs would be very helpful. Michel: Each then requires an implementation for resolution, if one is defined. (Thought experiment: could we piggyback arbitrary URI definition access on available LSID software by defining a conversion of URIs to LSIDs, vaguely analogous to the way that LSIDs can be converted to http: URLs?) DBooth: Ouch! Michel: Unnecessary. Use the resolution method as per the protocol.

Establish reputation systems

If reputation is important, then we will want to represent reputation information, and trade it with people we trust:

  1. Which URIs have consistent meaning and which ones don't
  2. Which URIs implement 303 and #-truncation rules reliably
  3. Which semweb proxies have various kinds of friendly behavior

(Sets of URIs, such as those based on regular expressions, may be expressible using POWDER (http://www.w3.org/2007/powder/) - although there is a problem in that POWDER deals with resource sets, not URI sets. I have not explored this.)

DBooth: Again, though reputation systems may eventually be necessary, I think we're putting the cart before the horse to spend time on solving that problem now.  I think a simple proxy-like mechanism for mapping or rewriting URIs to obtain the desired URI declarations will get us a long way.
Michel: Reputation systems will evolve out of the semantic web, and so will the choice of which definitions to use for a given URI. As you mentioned, even the owner's definition may not in fact be useful to anybody, but people will still want to reuse URIs rather than mint their own (which then requires a mapping between URIs).

Create islands of sanity

To reduce the frequency with which we need to employ heuristics, we can create and use well-behaved outposts, and encourage others to do so. These outposts would clearly separate declarations from other RDF, employ 303 and #-stripping to reliably redirect to definitional RDF, provide declarations for 200-responders, commit to consistency and accessibility, and so on. Of course a list of such outposts should be compiled and circulated.

DBooth: Yes, any group publishing efforts would be good candidates for this.
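A minimal sketch of the redirect discipline such an outpost might adopt, written as a WSGI application; the /id/ and /doc/ path conventions, the declaration store, and the RDF snippet are all made up. (Hash URIs need no special server-side handling, since clients strip the fragment before the request is sent.)

    # Minimal sketch of a well-behaved outpost as a WSGI application: URIs
    # under /id/ name things and answer 303 See Other, while URIs under /doc/
    # serve declaration documents as RDF. All names here are placeholders.
    from wsgiref.simple_server import make_server

    DECLARATIONS = {
        "example-term": b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
    }

    def outpost(environ, start_response):
        path = environ["PATH_INFO"]
        if path.startswith("/id/"):
            # A term URI: redirect to the corresponding declaration document.
            start_response("303 See Other", [("Location", "/doc/" + path[4:])])
            return [b""]
        if path.startswith("/doc/") and path[5:] in DECLARATIONS:
            start_response("200 OK", [("Content-Type", "application/rdf+xml")])
            return [DECLARATIONS[path[5:]]]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    if __name__ == "__main__":
        make_server("localhost", 8000, outpost).serve_forever()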

Capture lookup tactics in a reusable module

Even using current web architecture, much less extensions that might solve some of its problems, declaration lookup (and related problems of finding related RDF and documents) is such an art that ordinary programmers will not want to take on the job. The logic (which could include intelligence for local caching and SPARQL endpoint access) should be captured in some combination of a reusable program library and/or replicable services (proxies), as has been done for LSIDs. For example, http://semweb.name/getdeclaration?term=http://news.google.com/ might return the URL of a document containing a declaration (represented as an RDF graph?) of the term http://news.google.com/ . Of course, lookup agents would need to be transparent, because we all need to be skeptical of any entity that claims to know the right definition of a term. For purposes of democracy and performance any service would also need to be replicated - perhaps to the level of institutions and ISPs, just as domain name servers are.

DBooth: I agree.  And I don't think it has to be very complex to be very useful.  I think something simple can go a long way.
DBooth: One thing that wasn't mentioned: A proxy approach can work well for HTTP URIs.  But presumably a platform-specific URI resolution module would have to be embedded in an agent that wishes to handle other kinds of URIs, such as LSIDs, DOIs, etc.  However, such a module could be very simple, merely mapping these other URIs to HTTP URIs which in turn may be further mapped or rewritten by a separate proxy.
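A sketch of a client-side wrapper for the hypothetical lookup service mentioned above; the service URL comes from the example in the text, and its response format (assumed here to be the URL of a declaration document, as plain text) is made up.

    # Client-side wrapper for the hypothetical lookup service named above.
    # The service URL and its plain-text response format are assumptions.
    import urllib.parse
    import urllib.request

    LOOKUP_SERVICE = "http://semweb.name/getdeclaration"   # hypothetical

    def declaration_url(term):
        """Ask the lookup service where a declaration of `term` may be found."""
        query = urllib.parse.urlencode({"term": term})
        with urllib.request.urlopen(LOOKUP_SERVICE + "?" + query) as response:
            answer = response.read().decode("utf-8").strip()
        return answer or None

    # For example, declaration_url("http://news.google.com/") might return the
    # URL of a document containing a declaration of that term.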

Terminology note

I don't want to be pedantic about what is meant by "meaning"; I think what I'm saying will apply regardless of whether you think that terms identify things / resources, or simply have acceptable patterns of use (Wittgenstein). I say "meaning" instead of "denotation" in order to permit, but not require, a reference-based theory of meaning. The uncommitted meaning of "meaning" should also allow for models in which a term identifies different things at different times.

Similarly, I prefer to say "term" instead of "identifier". I think that "term" is more likely than "identifier" to evoke use cases and requirements that are appropriate to our application domain, since terminology is a classic problem in the scientific literature.


Thanks to Chimezie Ogbuji, Hal Abelson, Sandro Hawke, David Booth, and others.