URI Resolution: Finding Information About a Resource

Jonathan Rees, Alan Ruttenberg, Matthias Samwald

Version status: Very rough, still somewhat in outline form. Not yet reviewed by AR or MS. I will convert this to HTML and format it as a W3C "Interest Group Technical Note" when it is closer to being done.

Problem statement

Problem: An application has its hands on a URI, and needs to learn more about the resource named by the URI. Two kinds of information are important: particular representations of the resource, if the resource is an information resource; and RDF-encoded information about the resource, whether or not the resource is an information resource. (Recall that according to HTTP dogma, "resource" is an abstract notion; a GET request returns a representation of an information resource, not the resource itself.)

For a given resource R, often an information resource Q is available that holds the information about R that the application needs (possibly along with other information). A common case is a self-describing resource, i.e. R = Q. Without much loss of generality, we can take the finding-information problem to be that of going from R's URI to the resolvable URL of an information resource Q that either describes, or is, R. Other methods of obtaining information about R (such as a SOAP call or SPARQL query) either can be cast as HTTP requests or are so similar to HTTP requests that they do not introduce significant new issues.

As the process of finding and using information (including resource representations) is automated, performance is often a serious issue; some URL's for appropriate information resources will be served quickly enough, and others won't. There may be other constraints dictating whether a URL is suitable for use, such as security properties of the network link used to fetch the representation.
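
The reduction described above, from R's URI to a suitable URL for Q, can be pictured as a function from a URI to an optional URL. The following is a minimal sketch in Python, not a proposed implementation. The rule format (pairs of URI prefix and replacement URL prefix), the function names, and the example.org addresses are all assumptions made for illustration. No network access is attempted; whether a returned URL actually resolves is a separate question.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical resolution information: (URI prefix, replacement URL prefix).
Rule = Tuple[str, str]


def candidates(uri: str, rules: Iterable[Rule]) -> Iterator[str]:
    """Yield candidate URLs for Q, user-supplied rules first."""
    for old_prefix, new_prefix in rules:
        if uri.startswith(old_prefix):
            yield new_prefix + uri[len(old_prefix):]
    # Self-describing case (R = Q): the URI itself, if it is an HTTP(S) URL.
    if urlsplit(uri).scheme in ("http", "https"):
        yield uri


def resolve(
    uri: str,
    rules: Iterable[Rule],
    suitable: Callable[[str], bool] = lambda url: True,
) -> Optional[str]:
    """Return the first candidate URL that passes the suitability check."""
    for url in candidates(uri, rules):
        if suitable(url):
            return url
    return None


def is_secure(url: str) -> bool:
    """One possible suitability constraint: require an https link."""
    return urlsplit(url).scheme == "https"


# Illustrative rule only; the prefixes are made up.
RULES = [("info:example/", "https://mirror.example.org/info/")]

print(resolve("info:example/123", RULES))
print(resolve("http://example.org/doc", RULES, suitable=is_secure))
```

The `suitable` parameter stands in for whatever performance or privacy constraints apply at a particular point of use; here the only check shown is the URL scheme.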

So here is the concise statement of what I'll call (somewhat misleadingly, but for reasons of inertia) the URI Resolution Problem:

Given a URI for a resource R, obtain if possible a URL for an information resource Q that provides desired information about R and/or a representation of R, such that Q may be accessed using a communication link that has adequate performance and privacy.

This problem has nothing specifically to do with HCLS, except that we seem to be the ones suffering the most pain from it. It also has little to do with the semantic web per se, except to the extent that the information we want to use is encoded in RDF.

Example: An RDF file is composed using URL's that all resolve nicely. When years later someone tries to use the file, some of these same URL's are broken due to acquisitions, web site reorganizations, and changes of administration. All the linked resources are still available, just under different URL's. How can we make the user's application work without having to rewrite the RDF?

Why is this hard?

- Non-URL URI (scheme not understood by applications, e.g. info:, mailto:)
- Broken link:
  . server gone or renamed
  . resource gone or renamed
- Not-so-good URL:
  . communication link too slow
  . communication link not secure
- Not-so-good content behind the URL:
  . resource R has no usable representation (e.g. not RDF)
  . R is too big
  . response to request is not a representation of the intended resource (e.g. http://www.ihmc.us/users/phayes/PatHayes)
  . R doesn't contain the information about R that's needed by the application ("metadata" exists but is elsewhere)

What is the received wisdom?

- Don't mint non-URL URI's. (TimBL)
  [good as far as it goes, but we may not be in a position to choose]
- Mint URL's whose hostname specifies a long-lived server that will maintain the resource at the given URL in perpetuity. (Publishers, libraries, and universities are in good positions to do this.)
  [good as far as it goes, but the user may not be in control, or may find quality name management to be beyond his/her grasp]
- Use a web cache such as Apache or Squid, and a proxy configuration on the client, to deliver the correct content when a URL is presented that can't or shouldn't be used directly. (Dan Connolly)
  [this is a possible solution... see below]
- Use LSID's. LSID resolvers are very similar to web caches in that an intermediate server is deployed to map URIs.
  [requires maintenance of an LSID resolver; not all problematic URI's are LSID's]
- If the type of the representation is unusable, use content negotiation and/or GRDDL to get the right type of resource.
  [can Alan say more about why he dislikes content negotiation?]
- If the server replies 303 See Other, follow the link in the response to get information about the resource.
  [obscure hack but worth a try] (see http://www.w3.org/2001/tag/issues.html#httpRange-14)
- To relate a non-information-resource to information about it, mint URI's of the form http://example.org/foo#bar to name the resource, with the convention that the URI http://example.org/foo will name an information resource that describes it.
  [obscure hack, probably too late to take hold, e.g. ontology http://xmlns.com/foaf/0.1/ doesn't use #]

What would a good solution be like?

Observation: We need information in order to find information.

- Knowledge about how to resolve a URI ('resolution information') will often be idiosyncratic to a particular point of use, and will usually be found in the hands of the individual users who care most about it.
- The people who have resolution information aren't necessarily server or web cache administrators.
  [Client side is important. LSID resolvers and web caches are not very appropriate, and reliance on them will hinder advancement of SW.]
- Resolution information changes all the time.
  [Submitting a work request to a server administrator is not practical, even when there is a server and an administrator.]
- There will inevitably be a way (or some ways) to express resolution information to the software that's able to use it.
- Users will want to use the same resolution information with multiple applications.
- Users will want to share resolution information with one another in various ways (email, inclusion in documents / systems, etc).

Existing languages for configuring URI mappers/resolvers include Apache configuration files (e.g. the RewriteRule directive), Squid configuration files, and LSID resolver configuration files [need to research these].

Proposal: A URI resolution ontology.

The premise is that we're dealing with semantic web applications here, and we think RDF is a good knowledge representation language, so let's use RDF to represent resolution information.

- Kinds of information that could be represented using such an ontology:
  . InformationResource vs. NotAnInformationResource
  . Lifetime expectation information, e.g. doesn't change, expires; cf. HTTP Cache-Control: and other headers
  . Retrieval methods: direct; URI transformation; SPARQL; web service
  . Client-side content-type awareness and "content negotiation" (choice among variants)
  . Properties: Version description, DC, MD5, ...
  . Relations among resources: e.g. relate resource to information resource that describes it
- Don't share bare URI's; provide resolution information. You get to choose whether the resolution information resides inside the document that uses the URI, or is carried independently of that document.
- OWL can express rich properties and relations, e.g. resolution policies that apply to all objects of a given type.
- OWL makes application of resolution tactics automatic, predictable, uniform (across applications), and error-free.
- OWL-based resolution information could be used directly by an application, by a client-side web cache (e.g. local Squid installation), or by a shared web cache.
- Disadvantage: you need an OWL engine to interpret resolution information represented in this way, and not all applications have an OWL engine.
  [so why not get one and link it in?]

[Discussion - how would we develop and deploy such a thing?]

[Related issue: versioning.]

See Tim's slides and other documents for his take on URI's, e.g. http://dig.csail.mit.edu/2007/Talks/0108-swuri-tbl/

Acknowledgments: Chris Hanson, Tim Berners-Lee, Dan Connolly