Best Practices for Resource Identification
		    and Access to Representations
      Jonathan Rees (with Alan Ruttenberg and Matthias Samwald)
			   12 February 2007


Abstract

The Semantic Web Health Care and Life Sciences interest group
[citation] would like to agree on conventions for identifying and
accessing web resources.  This document seeks to surface the problems
and possible solutions in this area, and (in a future draft) to make
recommendations targeted at the HCLS community.


Status of this document

Version status: Let's call this draft '0.5', derived from version
'0.1' which was posted on the HCLS wiki on 2006-02-01.

Still very rough.  Not yet adequately reviewed by AR or MS.  I will
convert this to HTML and format it as a W3C "Interest Group Technical
Note" when it is closer to being done.

I've written much of this for myself, and once a few more people have
seen it and there is some more progress, I should be able to excise
much of the text.

Parts of this document are based on Alan Ruttenberg's
presentation at the HCLS face-to-face meeting in October 2006:
http://tinyurl.com/y5tmud


Terminology

I rely on definitions from "Architecture of the WWW, Volume One"
[http://www.w3.org/TR/webarch/], in particular:

- Uniform Resource Identifier (URI):
  A global identifier in the context of the World Wide Web.

- Resource: Anything that might be identified by a URI.

- Representation: Data that encodes information about resource state.

- Dereference a URI: Access a representation of the resource
  identified by the URI.

I introduce the following terms, since they will be needed:

- Observe a resource: Access a representation of the resource.
  (The term is somewhat pretentious, and I will change or remove it
  if there are objections.  It would be nice to use 'access' but
  unfortunately one accesses representations, not resources.)

- Locator: a specification (usually a URI) that is expected to be used
  successfully to access a representation of some resource.

I'll introduce these terms, but won't use them:

- Observable resource: A resource that may be observed.  For example,
  http://news.google.com/ identifies an observable resource.  [Alan
  and most of the world would probably like to call this an
  "information resource", but this would disagree with the
  web architecture document's definition, which requires conveyance of
  the resource's "essential characteristics".]

- Nonnegotiable resource: An observable resource that has a consistent
  content type, and is therefore the outcome of, or otherwise not
  subject to, content negotiation.  
  Example: http://xmlns.com/foaf/0.1/index.rdf, but not
  http://xmlns.com/foaf/0.1/

- Stable resource: A resource that possesses only one representation;
  that is, every observation of the resource accesses the same
  representation, independent of time and other variables.
  Example: data:text/plain,example  but not http://news.google.com/
  It is safe to say in this case that the resource *is* the
  representation.
  Whether for this concept we require bit-for-bit identity, 
  identity relative to the representation's MIME type (whatever that
  means), or some looser notion of stability, is open for debate.
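
To make the stable-resource example concrete, here is a minimal
sketch.  It assumes Python and its standard urllib module, neither of
which this document prescribes, and the function name observe is only
illustrative.  It observes data:text/plain,example twice and keeps
both representations for comparison.

```python
from urllib.request import urlopen


def observe(uri):
    """Access a representation of the resource identified by uri."""
    with urlopen(uri) as response:
        return response.read()


# A data: URI carries its representation inside the URI itself, so no
# network access is involved and every observation yields the same bytes.
first = observe("data:text/plain,example")
second = observe("data:text/plain,example")
```

Here the comparison is bit-for-bit; a looser notion of stability
would need a different comparison.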

I will avoid the following terms:

- Information resource: A resource which has the property that all of
  its essential characteristics can be conveyed in a message.
  -- There is vagueness in, and disagreement on, the definition
  (e.g. the architecture document disagrees with TAG resolution
  httpRange-14 http://www.w3.org/2001/tag/issues.html#httpRange-14).

- URL: A URI that provides a means of locating a resource by
  describing its primary access mechanism [definition from RFC 3986].
  Not used in "Architecture of the WWW, Volume One".


Overview

HCLS members are in the process of creating semantic web documents and
building a variety of semantic web applications.  These applications
identify, reason about, and use resources.  A resource is ordinarily
identified by a URI.  URI's are transmitted between applications and
stored in various locations.  If two semantic web applications
identify a resource by the same URI, then they have the opportunity to
exchange information about the resource.

In a distributed setting, we will often need to locate or discover
another application (a server or service) that can provide information
and services related to the resource.  Although there is no end to the
ways an application can relate to a service, two important cases are
the following:

  1. Observation: We use something resembling HTTP GET to access a
     representation of the resource.  For example, if the resource is
     the one with URI http://news.google.com/, we might want to obtain
     an HTML page giving the day's news.

  2. Query: A server (a SPARQL endpoint or something similar) may
     respond to an application-supplied query about the resource with
     an answer providing the information that the application needs.

Although in principle query subsumes observation [footnote: GET
requests can in principle be encoded as SPARQL queries] and is
philosophically better aligned to semantic web applications, we will
focus on the problem of observation for a number of reasons: It is
more concrete, we have more experience with it, and the need to make
it work well is more pressing.  In addition, query reduces to
observation in some cases: sometimes we can observe a resource (the
same one or a different one) to obtain the information we need about
some resource.

For the most part we want our applications and documents to be
long-lasting and efficient, and to respect the communication policies
of the environment in which they're deployed (e.g. privacy rules).
While general solutions to these problems do not exist, we should be
able to formulate policies and methods that minimize them and help us
to adapt as difficulties arise.  The policies and methods discussed
here will apply to the minting of new URI's, server administration and
behavior, and tactics to be employed by applications to ensure that
observation can be carried out efficiently and in keeping with policy,
over the long run.

The problems discussed here have nothing specifically to do with HCLS,
except that we seem to be suffering badly from them, and have been for
a number of years (the OMG LSID spec and resolver infrastructure
attempt to meet similar goals).  They also have little to do with the
semantic web or RDF per se.  Luckily, we are only addressing the
problems for ourselves: Solutions that we come up with need only work
for us.


Use scenarios

 1. Publishing - RDF is written to a static document that is then
    distributed and saved.  The RDF is meant to be used by an open
    ended set of applications, and is meant to be meaningful for a
    long time.

 2. Web application - a web application uses RDF triples internally
    somehow to generate web pages dynamically.  The RDF describes
    resources to which the generated pages will link, using URI's that
    can be used by any web browser.

 3. Web client - an application finds a resource mentioned in some
    triples and wants to access a representation for internal use
    (e.g. image processing); access to be accomplished using an
    available HTTP API such as java.net.URL

 4. Application with slaved browser - an application finds a GETtable
    resource mentioned in some triples and wants to transmit its URL
    to a nearby web browser for observation and display


Problem: Observing a resource

Given the definition of observation above, the situation we will
consider is this: An application has identified a resource, and
desires to observe the resource.

In RDF, resources are often identified by URI's.  There is no a priori
expectation on the web that a URI can be used as a locator, but
sometimes it can.  So an obvious idea is to use one of the resource's
URI strings [Footnote: Thanks to owl:sameAs, it is possible for a
resource to have more than one URI.] with a GET module (something that
can use HTTP and perhaps other protocols to do GET-like operations),
in the obvious way.  This direct approach can fail in a number of
ways:

 1. It may be that none of the resource's URI's can be directly used
    as a locator.  For example, the URI might be an info: or urn: URI,
    and the GET module used by the application may not know what to do
    with it.

 2. The URI may be a "broken link":
    . server gone, unavailable, or renamed
    . resource gone, unavailable, or renamed

 3. Use of the indicated GET module with that URI may not meet
    communication requirements:
    . communication link too slow
    . communication would not respect necessary local policy (don't go
      there)

 4. The response to the GET request may be "wrong" - a representation
    that is not a representation of the intended resource (e.g. a
    service might deliver an incorrect page instead of an HTTP
    redirect; or the domain name registration may have lapsed and the
    name claimed by an unrelated, possibly hostile, entity)

 5. It may be difficult to choose from among the resource's many
    URI's, or it may have no known URI at all (blank node)

We can guard against most of these problems by minting URI's carefully
in the first place or by arranging for a server or servers to maintain
accessibility of representations; and if neither of these is
sufficient, we can attempt to observe the resource by dereferencing a
new URI derived from the resource's original URIs and/or other
information that we have about the resource.  (The new URI does not
necessarily identify the resource; it is simply what we are using to
observe it.)
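
The fallback described above can be sketched as follows.  This is an
illustration under assumptions, not a recommended implementation: the
language (Python), the function names, and the example URIs are all
mine.  The GET module is passed in as a function, so the same logic
works with any protocol library.

```python
from typing import Callable, Iterable, Optional


def observe_with_fallback(
    candidates: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Try each candidate locator in order; return the first representation."""
    for locator in candidates:
        try:
            return fetch(locator)
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError,
            # so a fetch built on urllib reports its failures this way.
            continue
    return None


def fake_fetch(uri: str) -> bytes:
    """Stand-in GET module for the demonstration; only one URI 'works'."""
    if uri == "http://example.org/new/doc":
        return b"representation"
    raise OSError("cannot dereference " + uri)


# The original URI fails, so the derived locator is tried next.
result = observe_with_fallback(
    ["urn:example:doc", "http://example.org/new/doc"], fake_fetch
)
missing = observe_with_fallback(["urn:example:doc"], fake_fetch)
```

Note that this only helps when failure is detectable.  It does nothing
about failure mode 4, where a GET succeeds but delivers the wrong
representation.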

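Following is some of the received wisdom.  Numbers in parentheses
refer to the failure modes listed above that each piece of advice
addresses: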

  - Never mint a new URI for a resource when one already is in use. (5)

  - Tim B-L says don't mint URI's that are not locators (URL's). (1)
    Always endeavor to arrange for the named server to provide
    either adequate access or valid redirection at each newly minted
    URI. (2)

  - Mint URI's whose hostname specifies a long-lived server that will
    maintain observability of the resource at the given URI in
    perpetuity.  Publishers, libraries, and universities are in good
    positions to do this.  Note that thanks to HTTP redirects the
    servers need not actually hold the resource (cf. purl.org); but
    they must be committed to ensuring that they direct clients to the
    resource wherever it is currently hosted. (2)
    
    [This is also good advice and should be followed when possible.
    Unfortunately it puts a heavy burden on those who would publish
    material containing URI's.  Because of this burden, and the near
    impossibility of guaranteeing permanent access, one can predict
    that some important links will become broken, no matter what.]

  - Non-locator URI's can sometimes be rewritten as locators.
    David Booth refers us to his article "Converting New URI Schemes
    or URN Sub-Schemes to HTTP", http://dbooth.org/2006/urn2http/.
    (1)

  - Mint LSID URI's, and configure each instance of each application
    to refer to an LSID resolver that knows about the LSID's being
    used.  Convert LSID's to HTTP URI's if necessary according to
    any effective method. (2)  For performance and/or locality, use in
    conjunction with a cache. (3)
    
    [This advice flatly contradicts Tim B-L's advice, which says
    don't use any URI that's not a URL.

    HCLS wiki page on LSID pros and cons:
    http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Tasks/URI_Best_Practices/LSID_Pros_%26_Cons]

  - Use a web cache such as Apache or Squid, and a proxy configuration
    on the client, to provide representations of otherwise unreachable
    resources, or representation access that is efficient and/or
    policy-abiding.  (Dan Connolly) (2, 3, 4, possibly 1)
    
    [This is a possible solution, for HTTP URI's at least.  See
    below.]
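
As an illustration of the client half of the web cache advice in the
last item above, the sketch below configures a client to send its
http: requests through a proxy.  It assumes Python's standard urllib;
the proxy address is hypothetical and would be whatever your site's
Apache or Squid instance listens on.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical address of a local caching proxy (for example, Squid).
PROXY = "http://localhost:3128"


def make_opener(proxy_url=PROXY):
    """Build an opener that sends http: requests through the given proxy."""
    return build_opener(ProxyHandler({"http": proxy_url}))


opener = make_opener()
# opener.open("http://news.google.com/") would now go through the proxy,
# which can answer from its cache or apply local policy.
```

This affects only URI schemes that the proxy mapping names, which is
why the advice applies to HTTP URI's at least.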


What would a good solution be like?

Since we cannot always control the way in which URI's are minted, and
we cannot in the long run protect against broken links, we will have
to either modulate the choice of which URI is the subject of the GET,
or modulate the apparatus that handles the GET.  In the latter case
the apparatus (a proxy, cache, or mapping service) usually has to
modulate the presented URI to obtain another URI for a subsequent GET
(or past GET, in the case of a cache), so the two approaches are
functionally equivalent, differing only in where we draw the line
around what we consider "the application".

Our problem - that of locating a resource, addressing the failure
modes of a direct GET - therefore can be solved if we can describe a
mapping from a resource to the "right" locator URI.  The locator can
be one of the resource's own URI's or one belonging to another
resource, and should be presented to the apparatus that will attempt a
GET.

The central observation illuminating what follows is that such
resource-to-locator mappings can and should be described.  An example
of information contributing to such a mapping is the assertion that a
resource that has a URI beginning with http://wrl.dec.com/titan/ can
be observed using a locator obtained by replacing this prefix with
http://research.hp.com/projects/titan/.
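
The prefix replacement just described can be sketched in a few lines.
The language (Python), the names PREFIX_RULES and to_locator, and the
path overview.html are assumptions made for illustration; only the two
prefixes come from the example above.

```python
# Each rule: (prefix of the resource's URI, prefix of the locator to use).
PREFIX_RULES = [
    ("http://wrl.dec.com/titan/", "http://research.hp.com/projects/titan/"),
]


def to_locator(uri, rules=PREFIX_RULES):
    """Return a locator for uri: the first matching rule's rewrite, else uri."""
    for old_prefix, new_prefix in rules:
        if uri.startswith(old_prefix):
            return new_prefix + uri[len(old_prefix):]
    return uri


# Hypothetical page under the old prefix.
locator = to_locator("http://wrl.dec.com/titan/overview.html")
unchanged = to_locator("http://news.google.com/")
```

A URI that matches no rule is returned unchanged, which corresponds to
attempting a direct GET.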

Further observations:

  - Knowledge about how to map resources to locators will often be
    idiosyncratic to place and time, and will usually be found in the
    hands of the individual users who care most about it.

  - The people who have mapping information aren't necessarily server
    or web cache administrators and they may not have access to
    administrators (supposing they exist at all).  Being able to map
    to a locator in a client application is important.

  - Mapping information changes all the time.  Submitting a work
    request to a server administrator for every change in the mapping
    is not practical, even when there is a server and an
    administrator.

  - Users will want to reuse mapping information between applications.

  - Users will want to share mapping information with one another in
    various ways (email, inclusion in documents / systems, archiving,
    etc).

All of these considerations point to the desirability of agreeing on
some way of writing down the information that configures the process
of mapping resources to locators.

It is important to agree on this "way of writing down" (notation) and
what it means, while being agnostic on its implementation.  Agreement
on implementation will be impossible given the diversity of platforms
in use: some of us will desire tight application integration, others
won't; some will prefer client side, some a proxy or service.  And we
are using a variety of operating systems and programming languages
with very little in common.

Let's assume that agreeing on a notation to describe the
resource-to-locator mapping is a goal.  There are many potential
approaches to notation design.  We could use an existing notation
(such as some subset of the Apache configuration file language), or
make up yet another "little language" [what do wild-type LSID
resolvers use internally?]; or we can employ some existing notation
that is general enough to represent information of this kind.

Fortunately, we have a general information-representation notation
readily at hand: RDF.  If we use RDF to represent access information,
we leverage the only language we already share, one that's easily
parsed and translated (e.g. into Apache configuration files).  In
addition, triple-represented information about a resource, such as its
RDF/OWL types, publisher, etc., may be of use in mapping a resource to
a locator, addressing problems (1) and (5) in ways that are not
possible for a facility that is not semantic-web-aware.  Finally,
using RDF opens the way to employing inference in the selection of a
locator and in describing contracts that might hold between the
application and the server that will deliver the representation.
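
To show what access information carried as triples might look like to
an application, here is a minimal sketch.  Everything in it is an
assumption: the property names under http://example.org/observation#
are invented for illustration and are not part of any ontology
discussed here, and the triples are plain Python tuples rather than
the output of an RDF parser.

```python
# Hypothetical vocabulary; this document does not define these names.
OBS = "http://example.org/observation#"
URI_PREFIX = OBS + "uriPrefix"
LOCATOR_PREFIX = OBS + "locatorPrefix"

# Access information as (subject, predicate, object) triples.
# "_:rule1" stands for a blank node naming one rewrite rule.
TRIPLES = [
    ("_:rule1", URI_PREFIX, "http://wrl.dec.com/titan/"),
    ("_:rule1", LOCATOR_PREFIX, "http://research.hp.com/projects/titan/"),
]


def rules_from_triples(triples):
    """Collect (uri_prefix, locator_prefix) pairs, one per complete rule node."""
    old, new = {}, {}
    for subject, predicate, obj in triples:
        if predicate == URI_PREFIX:
            old[subject] = obj
        elif predicate == LOCATOR_PREFIX:
            new[subject] = obj
    return [(old[node], new[node]) for node in old if node in new]


def locate(uri, triples):
    """Apply the first matching rule found in the triples, else return uri."""
    for old_prefix, new_prefix in rules_from_triples(triples):
        if uri.startswith(old_prefix):
            return new_prefix + uri[len(old_prefix):]
    return uri
```

Because the rules are data rather than code, the same triples could be
mailed to a colleague, embedded in a document, or translated into some
other configuration format.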


Proposal: A resource observation ontology.

Alan has described a sketch of a resource observation ontology in his
portion of the presentation "Ontology-based URI Resolution"
http://tinyurl.com/y5tmud [file name
RuttenbergURLResolutionInOntology.ppt, Oct 6 2006].  In the above
introduction I have tried to lead the reader toward this solution by
considering the problem from first principles.

The presentation uses "InformationResource" for what I call
"observable resource", "UnchangingInformationResource" for what I
call "stable resource", and "retrieve" or "get" for what I call
"observe".

[TBD: Present the ideas in stages to show status of each additional
bit of complexity: essential, optimization, serendipitous,
speculative.  Provide developers with a migration path, with easy
first steps.]

[TBD: Document the OWL ontology.  Figure out what to recommend now.]

I will summarize the idea here, but for details please see the .ppt
file.

  - Represent access information (information that helps you to access
    representations of a resource) in RDF according to an ontology

  - Allow it to interact with application-level information

  - Kinds of information that could be represented using such an ontology:

    . Retrieval [observation] methods: direct; URI rewrite (perhaps
      prefix-based a la D Booth, or regexp-based a la Apache
      RewriteRule); SPARQL; web service

    . Contracts with servers, e.g. that a resource is stable, or that
      change will not happen soon / frequently

    . Representation type information - so that you can predict what
      you'll receive should you do a GET

    . Authentication information

    . [Extra] Relations among resources: e.g. relate a resource (or
      class of resources) to a GETtable resource (or class of same)
      that contains a description of it

  - Represent resource metadata such as version, DC, etc. in
    RDF and use it somehow in resolution

  - Don't share bare URI's; provide mapping information when you
    communicate a document containing URI's.  You get to choose
    whether the mapping information resides inside the document that
    mentions the URI, or is carried independently of that document in
    an application or site configuration.

  - [Extra] Client-side content-type awareness can be used as a more tasteful
    alternative to content negotiation (choice among variants)

  - [Extra] We have good ways to talk about versioning
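
The regexp-based URI rewrite listed among the retrieval methods above
might look like the following sketch.  It assumes Python's re module
in place of Apache's RewriteRule syntax, and the urn:example: pattern
and example.org target are hypothetical.

```python
import re

# Hypothetical rule: (pattern over the resource's URI, replacement template).
REGEXP_RULES = [
    (re.compile(r"^urn:example:doc:(\d+)$"), r"http://example.org/docs/\1"),
]


def rewrite(uri, rules=REGEXP_RULES):
    """Return the locator from the first matching rule, or None if none match."""
    for pattern, template in rules:
        if pattern.match(uri):
            return pattern.sub(template, uri)
    return None


locator = rewrite("urn:example:doc:42")
no_rule = rewrite("http://news.google.com/")
```

Unlike a prefix rule, a regexp rule can rearrange parts of the URI,
which is what makes it suitable for turning non-locator URI's into
locators.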

Why OWL?

  - OWL can express rich properties and relations, e.g. resolution
    policies that apply to all objects of a given type.

  - OWL makes application of resolution tactics automatic,
    predictable, uniform (across applications), and error-free.

One might argue that you need an OWL engine to interpret resolution
information represented in this way, and not all applications have
an OWL engine.  However, reasoning over the ontology requires only a
small fragment of OWL - certainly not all of OWL DL.  [TBD: articulate
which fragment, and implement it in the reference implementation.]
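
As a rough indication of how little reasoning a type-based policy
might need, the sketch below attaches policies to classes and finds
the policy for a resource by walking up a subclass chain.  This is not
the fragment referred to above, which is still to be articulated; the
class names, policy names, and the restriction to a single superclass
per class are all simplifying assumptions.

```python
# Hypothetical class hierarchy: each class maps to its one superclass.
SUBCLASS_OF = {
    "ex:GeneRecord": "ex:DatabaseRecord",
    "ex:DatabaseRecord": "ex:ObservableResource",
}

# Hypothetical resolution policies attached to classes.
POLICY_FOR_CLASS = {
    "ex:DatabaseRecord": "rewrite-to-local-mirror",
    "ex:ObservableResource": "direct-get",
}


def superclasses(cls):
    """Yield cls and then each ancestor, following subclass links upward."""
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)
        yield cls
        cls = SUBCLASS_OF.get(cls)


def policy_for(resource_type):
    """Return the policy of the most specific class that has one, else None."""
    for cls in superclasses(resource_type):
        if cls in POLICY_FOR_CLASS:
            return POLICY_FOR_CLASS[cls]
    return None
```

With these tables, a resource typed ex:GeneRecord inherits the policy
of ex:DatabaseRecord without any rule mentioning ex:GeneRecord itself.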

[To be written: How the reference implementation works; how to develop
other implementations.]

[Relate to biozen, BFO, FOAF, etc.]


Discussion

[Issue: short-term vs. long-term locators: for immediate presentation
to GET or a browser vs. for storage.  Short-term locators might even
be http://localhost/ URI's (well, these are not really URI's since
they're not global); a long-term locator might be an http: URI stored
on a web page, but different from a preferred or canonical URI, which
might be e.g. a URN.]

See Tim's slides and other documents for his take on URI's, e.g.
http://dig.csail.mit.edu/2007/Talks/0108-swuri-tbl/

[Explain why the ontology satisfies the stated requirements.]

[LSID's are OK but not adequate and not even all that helpful]

[This paragraph is out of place everywhere]
As mentioned above, access to representations could be deployed
outside the application, inside a mapping service or web cache.  The
application would still need to decide which URI is to be presented to
the service, and the quality of service could suffer for not having
information known only to the application.  There is thus a tradeoff
between integration (exploitation of all available information) and
modularity (ignorance and/or replication of resolution-related
information).

Observing a resource is just one aspect of the communication and
coordination problems faced by semantic web applications.  In general,
one has a resource and seeks to know more about it.  If you observe a
resource, you learn what one of its representations was at the time of
the observation, and you may be able to use that information to reason
about it.  But other information about the resource, such as its
change history, stability, authorship, and so on, may not reside in
any representation; or else the information in the representation may
be suspect; or the resource may not be observable at all.  In these
cases one needs to obtain information of a different kind from
different sources - information about it, not from it.

Information about a resource may be available from many different
sources - for example, unspecified information about the
non-information-resource foaf:name can be found by consulting the
resource foaf: [spell out URI]; some non-information-resources have
URL's that return 303 "see other" redirects leading to information,
and so on.  The most direct source of information in general is an RDF
endpoint, which allows a client to ask what it wants to find out,
rather than accepting an uncharacterized wad of potentially irrelevant
information.  Different RDF endpoints may provide different kinds of
information about the same resource.  Applications are therefore faced
with the problem of choosing RDF endpoints and the queries to perform
on them as a function of which resources and properties are of
interest to the application.

Current practice for locating query endpoints may include manual
configuration of locations of "triple warehouses".  What else are
people doing, or wanting to do?

It seems likely that a resource ontology and standard representation
for information about how to use RDF endpoints would be beneficial.
This problem is beyond the scope of this report, but we look forward
to any developments along these lines.


Recommendations

- Use this ontology (or a fragment of it) to represent information
  helpful in observing resources, including resource-to-locator
  mapping rules and contractual expectations

- Convey mapping information where appropriate (make 'closures')

- Transmit type information too [Alan: justify]

- "Execute" mapping information to obtain a locator, and use that locator
  instead of attempting a direct GET

  [describe exactly what execution entails]

  [there may be more than one way to execute depending on what's going
  to be using the locator]

- It doesn't matter what URI you use (LSID, info:, etc), as long as
  adequate information is available to allow mapping to a good
  location

- It doesn't matter whether the mapping occurs "in the application" or
  "in the web cache / mapping service" - your choice


Acknowledgments: Chris Hanson, Tim Berners-Lee, Dan Connolly


See also:

http://www.w3.org/2001/tag/issues.html#httpRange-14