Best Practices for Resource Identification and Access to Representations

Jonathan Rees (with Alan Ruttenberg and Matthias Samwald)
12 February 2007

Abstract

The Semantic Web Health Care and Life Sciences interest group [citation] would like to agree among itself on conventions for identifying and accessing web resources. This document seeks to surface the problems and possible solutions in this area, and (in a future draft) to make recommendations targeted for the HCLS community.

Status of this document

Version status: Let's call this draft '0.5', derived from version '0.1' which was posted on the HCLS wiki on 2006-02-01. Still very rough. Not yet adequately reviewed by AR or MS.

I will convert this to HTML and format it as a W3C "Interest Group Technical Note" when it is closer to being done.

I've written much of this for myself, and once a few more people have seen it and there is some more progress, I should be able to excise much of the text.

Parts of this document are based on Alan Ruttenberg's presentation at the HCLS face-to-face meeting in October 2006: http://tinyurl.com/y5tmud

Terminology

I rely on definitions from "Architecture of the WWW, Volume One" [http://www.w3.org/TR/webarch/], in particular:

- Uniform Resource Identifier (URI): A global identifier in the context of the World Wide Web.
- Resource: Anything that might be identified by a URI.
- Representation: Data that encodes information about resource state.
- Dereference a URI: Access a representation of the resource identified by the URI.

I introduce the following terms, since they will be needed:

- Observe a resource: Access a representation of the resource. (I'm sorry that the term is somewhat pretentious, so I will change or remove it if there are objections. It would be nice to use 'access' but unfortunately one accesses representations, not resources.)
- Locator: a specification (usually a URI) that is expected to be used successfully to access a representation of some resource.
I'll introduce these terms, but won't use them:

- Observable resource: A resource that may be observed. For example, http://news.google.com/ identifies an observable resource. [Alan and most of the world would probably like to call this an "information resource", but this would disagree with the web architecture document's definition, which requires conveyance of the resource's "essential characteristics".]
- Nonnegotiable resource: An observable resource that has a consistent content type, and is therefore the outcome of, or otherwise not subject to, content negotiation.
  Example: http://xmlns.com/foaf/0.1/index.rdf, but not http://xmlns.com/foaf/0.1/
- Stable resource: A resource that possesses only one representation; that is, every observation of the resource accesses the same representation, independent of time and other variables.
  Example: data:text/plain,example but not http://news.google.com/
  It is safe to say in this case that the resource *is* the representation. Whether for this concept we require bit-for-bit identity, identity relative to the representation's MIME type (whatever that means), or some looser notion of stability, is open for debate.

I will avoid the following terms:

- Information resource: A resource which has the property that all of its essential characteristics can be conveyed in a message.
  -- There is vagueness in, and disagreement on, the definition (e.g. the architecture document disagrees with TAG resolution httpRange-14 http://www.w3.org/2001/tag/issues.html#httpRange-14).
- URL: A URI that provides a means of locating a resource by describing its primary access mechanism [definition from RFC 3986]. Not used in "Architecture of the WWW, Volume One".

Overview

HCLS members are in the process of creating semantic web documents and building a variety of semantic web applications. These applications identify, reason about, and use resources. A resource is ordinarily identified by a URI.
URI's are transmitted between applications and stored in various locations. If two semantic web applications identify a resource by the same URI, then they have the opportunity to exchange information about the resource.

In a distributed setting, we will often need to locate or discover another application (a server or service) that can provide information and services related to the resource. Although there is no end to the ways an application can relate to a service, two important cases are the following:

1. Observation: We use something resembling HTTP GET to access a representation of the resource. For example, if the resource is the one with URI http://news.google.com/, we might want to obtain an HTML page giving the day's news.
2. Query: A server (a SPARQL endpoint or something similar) may respond to an application-supplied query about the resource by returning an answer that provides the information the application needs.

Although in principle query subsumes observation [footnote: GET requests can in principle be encoded as SPARQL queries] and is philosophically better aligned to semantic web applications, we will focus on the problem of observation for a number of reasons: it is more concrete, we have more experience with it, and the need to make it work well is more pressing. In addition, query reduces to observation in some cases: sometimes we can observe a resource (the same one or a different one) to obtain the information we need about some resource.

For the most part we desire for our applications and documents to be long-lasting and efficient, and to respect the communication policies of the environment in which they're deployed (e.g. privacy rules). While general solutions to these problems do not exist, we should be able to formulate policies and methods that minimize them and help us to adapt as difficulties arise.
The policies and methods discussed here will apply to the minting of new URI's, server administration and behavior, and tactics to be employed by applications to ensure that observation can be carried out efficiently and in compliance with policy, over the long run.

The problems discussed here have nothing specifically to do with HCLS, except that we seem to be suffering badly from them, and have been for a number of years (the OMG LSID spec and resolver infrastructure attempt to meet similar goals). They also have little to do with the semantic web or RDF per se. Luckily, we are only addressing the problems for ourselves: solutions that we come up with need only work for us.

Use scenarios

1. Publishing - RDF is written to a static document that is then distributed and saved. The RDF is meant to be used by an open ended set of applications, and is meant to be meaningful for a long time.
2. Web application - a web application uses RDF triples internally somehow to generate web pages dynamically. The RDF describes resources to which the generated pages will link, using URI's that can be used by any web browser.
3. Web client - an application finds a resource mentioned in some triples and wants to access a representation for internal use (e.g. image processing); access to be accomplished using an available HTTP API such as java.net.URL
4. Application with slaved browser - an application finds a GETtable resource mentioned in some triples and wants to transmit its URL to a nearby web browser for observation and display

Problem: Observing a resource

Given the definition of "observe" above, the situation we will consider is this: An application has identified a resource, and desires to observe the resource. In RDF, resources are often identified by URI's. There is no a priori expectation on the web that a URI can be used as a locator, but sometimes it can.
So an obvious idea is to use one of the resource's URI strings [Footnote: Thanks to owl:sameAs, it is possible for a resource to have more than one URI.] with a GET module (something that can use HTTP and perhaps other protocols to do GET-like operations), in the obvious way.

This direct approach can fail in a number of ways:

1. It may be that none of the resource's URI's can be directly used as a locator. For example, the URI might be an info: or urn: URI, and the GET module used by the application may not know what to do with it.
2. The URI may be a "broken link":
   . server gone, unavailable, or renamed
   . resource gone, unavailable, or renamed
3. Use of the indicated GET module with that URI may not meet communication requirements:
   . communication link too slow
   . communication would not respect necessary local policy (don't go there)
4. The response to the GET request may be "wrong" - a representation that is not a representation of the intended resource (e.g. a service might deliver an incorrect page instead of an HTTP redirect; or the domain name registration may have lapsed and the name claimed by an unrelated, possibly hostile, entity)
5. It may be difficult to choose from among the resource's many URI's, or it may have no known URI at all (blank node)

We can guard against most of these problems by minting URI's carefully in the first place or by arranging for a server or servers to maintain accessibility of representations; and if neither of these is sufficient, we can attempt to observe the resource by dereferencing a new URI derived from the resource's original URI's and/or other information that we have about the resource. (The new URI does not necessarily identify the resource; it is simply what we are using to observe it.)

Following is some of the received wisdom. The numbers in parentheses refer to the failure modes listed above.

- Never mint a new URI for a resource when one already is in use. (5)
- Tim B-L says don't mint URI's that are not locators (URL's).
(1) Always endeavor to arrange for the named server to provide either adequate access or valid redirection at each newly minted URI. (2)
- Mint URI's whose hostname specifies a long-lived server that will maintain observability of the resource at the given URI in perpetuity. Publishers, libraries, and universities are in good positions to do this. Note that thanks to HTTP redirects the servers need not actually hold the resource (cf. purl.org); but they must be committed to ensuring that they direct clients to the resource wherever it is currently hosted. (2) [This is also good advice and should be followed when possible. Unfortunately it puts a heavy burden on those who would publish material containing URI's. Because of this burden, and the near impossibility of guaranteeing permanent access, one can predict that some important links will become broken, no matter what.]
- Non-locator URI's can sometimes be rewritten as locators. David Booth refers us to his article "Converting New URI Schemes or URN Sub-Schemes to HTTP", http://dbooth.org/2006/urn2http/. (1)
- Mint LSID URI's, and configure each instance of each application to refer to an LSID resolver that knows about the LSID's being used. Convert LSID's to HTTP URI's if necessary according to any effective method. (2) For performance and/or locality use in conjunction with a cache. (3) [This advice flatly contradicts Tim B-L's advice, which says don't use any URI that's not a URL. HCLS wiki page on LSID pros and cons: http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Tasks/URI_Best_Practices/LSID_Pros_%26_Cons]
- Use a web cache such as Apache or Squid, and a proxy configuration on the client, to provide representations of otherwise unreachable resources, or representation access that is efficient and/or policy-abiding. (Dan Connolly) (2, 3, 4, possibly 1) [This is a possible solution, for HTTP URI's at least. See below.]

What would a good solution be like?
Since we cannot always control the way in which URI's are minted, and we cannot in the long run protect against broken links, we will have to either modulate the choice of which URI is the subject of the GET, or modulate the apparatus that handles the GET. In the latter case the apparatus (a proxy, cache, or mapping service) usually has to modulate the presented URI to obtain another URI for a subsequent GET (or past GET, in the case of a cache), so the two approaches are functionally equivalent, differing only in where we draw the line around what we consider "the application".

Our problem - that of locating a resource, addressing the failure modes of a direct GET - therefore can be solved if we can describe a mapping from a resource to the "right" locator URI. The locator can be one of the resource's own URI's or one belonging to another resource, and should be presented to the apparatus that will attempt a GET.

The central observation illuminating what follows is that such resource-to-locator mappings can and should be described. An example of information contributing to such a mapping is the assertion that a resource that has a URI beginning with http://wrl.dec.com/titan/ can be observed using a locator obtained by replacing this prefix with http://research.hp.com/projects/titan/.

Further observations:

- Knowledge about how to map resources to locators will often be idiosyncratic to place and time, and will usually be found in the hands of the individual users who care most about it.
- The people who have mapping information aren't necessarily server or web cache administrators and they may not have access to administrators (supposing they exist at all). Being able to map to a locator in a client application is important.
- Mapping information changes all the time. Submitting a work request to a server administrator for every change in the mapping is not practical, even when there is a server and an administrator.
- Users will want to reuse mapping information between applications.
- Users will want to share mapping information with one another in various ways (email, inclusion in documents / systems, archiving, etc).

All of these considerations point to the desirability of agreeing on some way of writing down the information that configures the process of mapping resources to locators. It is important to agree on this "way of writing down" (notation) and what it means, while being agnostic on its implementation. Agreement on implementation will be impossible given the diversity of platforms in use: some of us will desire tight application integration, others won't; some will prefer client side, some a proxy or service. And we are using a variety of operating systems and programming languages with very little in common.

Let's assume that agreeing on a notation to describe the resource-to-locator mapping is a goal. There are many potential approaches to notation design. We could use an existing special-purpose notation (such as some subset of the Apache configuration file language), or make up yet another "little language" [what do wild-type LSID resolvers use internally?]; or we can employ some existing notation that is general enough to represent information of this kind.

Fortunately, we have a general information-representation notation readily at hand: RDF. If we use RDF to represent access information, we leverage the only language we already share, one that's easily parsed and translated (e.g. into Apache configuration files). In addition, triple-represented information about a resource, such as its RDF/OWL types, publisher, etc., may be of use in mapping a resource to a locator, addressing problems (1) and (5) in ways that are not possible for a facility that is not semantic-web-aware.
Finally, using RDF opens the way to employing inference in the selection of locator and in describing contracts that might hold between the application and the server that will deliver the representation.

Proposal: A resource observation ontology.

Alan has described a sketch of a resource observation ontology in his portion of the presentation "Ontology-based URI Resolution" http://tinyurl.com/y5tmud [file name RuttenbergURLResolutionInOntology.ppt, Oct 6 2006]. In the above introduction I have tried to lead the reader toward this solution by considering the problem from first principles. The presentation uses "InformationResource" for what I call "observable resource", "UnchangingInformationResource" for what I call "stable resource", and "retrieve" or "get" for what I call "observe".

[TBD: Present the ideas in stages to show status of each additional bit of complexity: essential, optimization, serendipitous, speculative. Provide developers with a migration path, with easy first steps.]

[TBD: Document the OWL ontology. Figure out what to recommend now.]

I will summarize the idea here, but for details please see the .ppt file.

- Represent access information (information that helps you to access representations of a resource) in RDF according to an ontology
- Allow it to interact with application-level information
- Kinds of information that could be represented using such an ontology:
   . Retrieval [observation] methods: direct; URI rewrite (perhaps prefix-based a la D Booth, or regexp-based a la Apache RewriteRule); SPARQL; web service
   . Contracts with servers, e.g. that a resource is stable, or that change will not happen soon / frequently
   . Representation type information - so that you can predict what you'll receive should you do a GET
   . Authentication information
   . [Extra] Relations among resources: e.g.
relate a resource (or class of resources) to a GETtable resource (or class of same) that contains a description of it
- Represent resource metadata such as version, DC, etc. in RDF and use it somehow in resolution
- Don't share bare URI's; provide mapping information when you communicate a document containing URI's. You get to choose whether the mapping information resides inside the document that mentions the URI, or is carried independently of that document in an application or site configuration.
- [Extra] Client-side content-type awareness can be used as a more tasteful alternative to content negotiation (choice among variants)
- [Extra] We have good ways to talk about versioning

Why OWL?

- OWL can express rich properties and relations, e.g. resolution policies that apply to all objects of a given type.
- OWL makes application of resolution tactics automatic, predictable, uniform (across applications), and error-free.

One might argue that you need an OWL engine to interpret resolution information represented in this way, and not all applications have an OWL engine. However, reasoning over the ontology requires only a small fragment of OWL - certainly not all of OWL DL. [TBD: articulate which fragment, and implement it in the reference implementation.]

[To be written: How the reference implementation works; how to develop other implementations.]

[Relate to biozen, BFO, FOAF, etc.]

Discussion

[Issue: short-term vs. long-term locators: for immediate presentation to GET or a browser vs. for storage. Short-term locators might even be http://localhost/ URI's (well, these are not really URI's since they're not global); a long-term locator might be an http: URI stored on a web page, but different from a preferred or canonical URI, which might be e.g. a URN.]

See Tim's slides and other documents for his take on URI's, e.g. http://dig.csail.mit.edu/2007/Talks/0108-swuri-tbl/

[Explain why the ontology satisfies the stated requirements.]
[LSID's are OK but not adequate and not even all that helpful]

[This paragraph is out of place everywhere] As mentioned above, access to representations could be deployed outside the application, inside a mapping service or web cache. The application would still need to decide which URI is to be presented to the service, and the quality of service could suffer for not having information known only to the application. There is thus a tradeoff between integration (exploitation of all available information) and modularity (ignorance and/or replication of resolution-related information).

Observing a resource is just one aspect of the communication and coordination problems faced by semantic web applications. In general, one has a resource and seeks to know more about it. If you observe a resource, you learn what one of its representations was at the time of the observation, and you may be able to use that information to reason about it. But other information about the resource, such as its change history, stability, authorship, and so on, may not reside in any representation; or else the information in the representation may be suspect; or the resource may not be observable at all. In this case one needs to obtain information of a different kind from different sources - information about it, not from it.

Information about a resource may be available from many different sources. For example, unspecified information about the non-information-resource foaf:name can be found by consulting the resource foaf: [spell out URI]; some non-information-resources have URL's that return 303 "see other" redirects leading to information; and so on. The most direct source of information in general is an RDF endpoint, which allows a client to ask what it wants to find out, rather than accepting an uncharacterized wad of potentially irrelevant information. Different RDF endpoints may provide different kinds of information about the same resource.
Applications are therefore faced with the problem of choosing RDF endpoints and the queries to perform on them as a function of which resources and properties are of interest to the application. Current practice for locating query endpoints may include manual configuration of locations of "triple warehouses". What else are people doing, or wanting to do? It seems likely that a resource ontology and standard representation for information about how to use RDF endpoints would be beneficial. This problem is beyond the scope of this report, but we look forward to any developments along these lines.

Recommendations

- Use this ontology (or a fragment of it) to represent information helpful in observing resources, including resource-to-locator mapping rules and contractual expectations
- Convey mapping information where appropriate (make 'closures')
- Transmit type information too [Alan: justify]
- "Execute" mapping information to obtain a locator, and use that locator instead of attempting a direct GET [describe exactly what execution entails] [there may be more than one way to execute depending on what's going to be using the locator]
- It doesn't matter what URI you use (LSID, info:, etc), as long as adequate information is available to allow mapping to a good location
- It doesn't matter whether the mapping occurs "in the application" or "in the web cache / mapping service" - your choice

Acknowledgments: Chris Hanson, Tim Berners-Lee, Dan Connolly

See also: http://www.w3.org/2001/tag/issues.html#httpRange-14
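
Illustration (not normative): the prefix-replacement example given earlier (http://wrl.dec.com/titan/ mapped to http://research.hp.com/projects/titan/) suggests what "executing" a very simple piece of mapping information might look like inside a client application. The following is only a minimal sketch under stated assumptions. The rule table format, the function names, the sample file name arch.html, the longest-prefix tie-break, and the fallback to the original URI are all hypothetical choices made for illustration; none of them comes from the ontology or from any reference implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical rule table: (URI prefix, locator prefix).
# The single entry is the titan example from the text; the table
# format itself is an assumption made for this sketch.
PREFIX_RULES: List[Tuple[str, str]] = [
    ("http://wrl.dec.com/titan/", "http://research.hp.com/projects/titan/"),
]


def map_to_locator(uri: str,
                   rules: List[Tuple[str, str]] = PREFIX_RULES) -> Optional[str]:
    """Apply the longest matching prefix rule to uri.

    Returns the rewritten locator, or None when no rule matches.
    Preferring the longest prefix is an assumption, not something
    the text specifies.
    """
    best: Optional[Tuple[str, str]] = None
    for old_prefix, new_prefix in rules:
        if uri.startswith(old_prefix):
            if best is None or len(old_prefix) > len(best[0]):
                best = (old_prefix, new_prefix)
    if best is None:
        return None
    old_prefix, new_prefix = best
    # Keep everything after the matched prefix unchanged.
    return new_prefix + uri[len(old_prefix):]


def choose_locator(uri: str) -> str:
    """Use a mapped locator when one exists, else try the URI itself."""
    mapped = map_to_locator(uri)
    return mapped if mapped is not None else uri


if __name__ == "__main__":
    # arch.html is a made-up file name used only to exercise the rule.
    print(choose_locator("http://wrl.dec.com/titan/arch.html"))
```

The locator returned by choose_locator would then be handed to whatever GET module the application uses. A proxy or mapping service could apply the same table on the application's behalf, which is the sense in which the two deployment choices are functionally equivalent.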