HTTPURIUseCases

From W3C Wiki
Revision as of 17:00, 3 June 2012 by Jrees (Talk | contribs)

See also TagIssue57Home

These are use cases for consideration when looking at [proposed changes] to JAR's draft.

In each use case there is a "sender" writing a "message" (usually something written in RDF, for inclusion in email, a triple store, an HTML or RDF document, etc.) that they want to be understood by a "receiver". The sender wants the receiver to be able to use content provisioned using some URI in understanding the message in a way that the sender would like. The use cases differ in what meaning is to be conveyed by the message (i.e. what the sender intends) and how it relates to the provisioned content.

In all of the use cases we assume that the meaning - that is, what the sender wants to convey to the receiver - is fixed. The choice to be determined by particular proposals is either in how this meaning is to be expressed or in how the URI is to be selected and provisioned.

In one set of use cases, the content provisioning is fixed, and what is variable across proposals is how the sender is to express the intended meaning. Depending on the use case and proposal, the intended meaning can be expressed using the original URI U, by using a second URI that can be discovered by examining the retrieval response (headers, content, etc.), or through some "inconvenient" mode of expression such as a blank node or a sender-defined URI (maybe a hash URI or tag: URI).

In another set of use cases, the mode of expression is fixed ahead of time, and what is variable across proposals is how the content is to be provisioned so that the receiver can discover the sender's meaning.

  • There is an important question regarding the scope of the issue. The pain so far is only felt by people writing RDF. The question simply doesn't come up in HTML (minus RDFa, meta, and rel=) because semantics comes mainly from what browsers do and browsers don't do much reasoning other than for cache control. People who care about non-RDF URI-based structured data formats have not really been part of the discussion and so we don't have much data about these formats.

So the method question is, do we start with the general question, and take RDF as an example, or do we say the scope is RDF, and maybe we can generalize later? I have taken the latter approach, which suggests this is not a TAG or URI issue at all, but rather an RDF issue.

Legacy content

L) Legacy content

Any proposal that does not explain current use of <U> in the following

  • owl:imports <U>
  • rdfs:isDefinedBy <U>
  • rdfs:seeAlso <U>
  • xhv:license <U>

deprecates a large amount of existing content.

That is: We need to consider cases where the manner of expression (i.e. choice of URI and context) and the provisioned content are fixed by what we find today on the Web, and try to make sure that the proposal helps explain how the receiver is to understand what they currently understand.

Obviously current processors that understand these statements as intended will not break. And probably future processors and content will use these properties according to current tradition. But any proposal that does not account for current practice will be opaque to future developers outside an "inner circle" and does not meet accepted standards of specification transparency.

Fixed provisioning, variable expression

A) Refer to naive document

In this use case the sender wants to refer to the content so that they can say something about it. The sender does not want to incur the obligation to scrape RDFa from the content, either because they are not expert in doing so or because they really care about the content as observed (by the sender and receiver), not about what any RDFa (if present) says.

What is variable in this case is the way the sender expresses this: the choice of URI or other term.

It should be possible, in RDF, to make comments about an existing document which contains no special metadata apart from a normal HTTP response. For example,

Homework: how long is
<http://www.gutenberg.org/files/2701/2701-h/2701-h.htm> ?

Someone might be tempted to write the following if they wanted to communicate the title and author of content retrieved using the given URI:

 <http://www.gutenberg.org/files/2701/2701-h/2701-h.htm>
 foaf:maker [ foaf:name "Herman Melville"];
 dc:title "Moby Dick";
 vagueontology:length 208855.

[Warning to 'curl' or 'wget' users! "The Project Gutenberg Web Site is for human (non-automated) users only.

   Any perceived use of automated tools to access our web site
   will result in a temporary or permanent block of your IP address or subnet."]

(Contrast with a different use case, where they would want to talk about the mass and length of the eponymous fictional whale:

<http://www.gutenberg.org/files/2701/2701-h/2701-h.htm>
a db:Sperm_whale; phys:massKg 50000;
vagueontology:length 25.

)

Question for proposals comparison: The mutual goal for the sender and receiver is for the sender's meaning to be understood by the receiver. Given this it is important what they write. What they "have to do" to achieve correct communication depends on their prior agreement (the proposal details). So, before the sender can write this (the above dc:title RDF snippet) or whatever is prescribed by the proposal, do they have to do a fetch and look at the response? (Or, what parts of the response do they have to look at, and how hard?) (The receiver would presumably have to do the same kind of thing.)

Example 2 (probably better than example 1):

<http://www.jenitennison.com/blog/node/167>
 a sioc:Post;
 foaf:maker [ foaf:name "Jeni Tennison"];
 dc:title "Precious Snowflakes".

where the intent is to talk about the content retrieved using that URI (the blog post), not about its subject, which might or might not have a foaf:maker and/or dc:title.

Under a 'Content opt-in' proposal where there is no opt-in signal (or 'content opt-out' where there is an opt-out signal), the sender who would have otherwise liked to write the above RDF would have to express their meaning differently ("inconveniently"):

@prefix w: <https://www.w3.org/2001/tag/2012/04/issue57>.
[]
  w:contentUri "http://www.jenitennison.com/blog/node/167" ;
  a sioc:Post ;
  foaf:maker [ foaf:name "Jeni Tennison"];
  dc:title "Precious Snowflakes".

Example 3: Referring to the naive document in spite of what any embedded RDFa might say:

<http://www.flickr.com/photos/70365734@N00/6141289487/> foaf:maker [foaf:name "Dan Appelquist"].

The sender in this example wants to say something about the comments found in the HTML page retrieved using that URI. They do not wish to say something about the image, which does not have that person as its foaf:maker.

(This is in conflict with the embedded RDFa found on that page, which seems to use the URI to refer to the image (which was not made by Dan Appelquist), not to the web page containing the comments. It is useful to be able to talk about the content without having to look for, scrape out, and understand the embedded RDFa.)

! This use case highlights the problem with any opt-out mechanism, including 303. It boils down to a question of how hard the sender has to work to figure out whether they can use the URI to refer to the content (and how hard the receiver has to work to figure out whether the sender has used the URI to refer to the content).

  • The lowest bar is to say the URI always refers to the content, even if it's found by following a 303
  • Next, we ask the sender and receiver to check for 2xx, but not to look at the headers (httpRange-14)
  • Next, we ask them to look at the headers, but not the content (Proposal 25)
  • Next, we ask them to read RDF in the content to see what it says (Look for Contradiction, No Longer Implies)
  • Next, we say it's never understood to be the content, so you should give up before you start (Always Description, Primary Topic, Retract, Don't Use http:)
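
The ladder above can be read as an ordering of proposals by how much of a retrieval the sender and receiver must inspect. The following Python sketch encodes that ordering. The `Effort` names and the label "always content" for the first rung are chosen here for illustration, and the grouping only mirrors the list above, not any proposal's normative text.

```python
from enum import IntEnum


class Effort(IntEnum):
    """How much of a retrieval must be inspected (illustrative labels)."""
    NOTHING = 0   # URI always refers to the content, even via a 303
    STATUS = 1    # check for a 2xx status, ignore headers
    HEADERS = 2   # look at headers, not content
    CONTENT = 3   # read RDF in the content
    NEVER = 4     # URI is never understood to be the content


# Grouping copied from the list above; "always content" is a hypothetical label.
PROPOSAL_EFFORT = {
    "always content": Effort.NOTHING,
    "httpRange-14": Effort.STATUS,
    "Proposal 25": Effort.HEADERS,
    "Look for Contradiction": Effort.CONTENT,
    "No Longer Implies": Effort.CONTENT,
    "Always Description": Effort.NEVER,
    "Primary Topic": Effort.NEVER,
    "Retract": Effort.NEVER,
    "Don't Use http:": Effort.NEVER,
}


def usable_within(budget: Effort) -> list:
    """Proposals under which the URI can refer to the content and
    `budget` effort is enough to check that it does."""
    return sorted(
        name for name, effort in PROPOSAL_EFFORT.items()
        if effort <= budget and effort != Effort.NEVER
    )
```

For instance, `usable_within(Effort.STATUS)` lists only the proposals that a sender unwilling to read headers or content could rely on.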

G) Refer to primary topic of a page

There is a page with an obvious primary topic, and the sender wants to refer to that primary topic (in a way that the receiver will understand, according to the proposal). For example, the sender might want to refer to Pat Winston, whose home page is http://people.csail.mit.edu/phw/ . The temptation is to write

<http://people.csail.mit.edu/phw/> foaf:name "Patrick Winston".

This use case is not directly supported by any of the change proposals received and has not been raised recently in discussion, so it will not be considered further.

Variable expression and provisioning

B) Refer to a document from RDFa within

This is similar to the A) use case, but now the RDF that the sender writes is being included in the content ("linkee"). This is similar to a "same document reference". Note that the content may be composed using a content management system and therefore there may be no point of coordination between all the various "senders" contributing RDF to the content (i.e. various CMS modules may use the URI in different ways; there is no single "URI owner").

Example expressed using content-always or content opt-out (from [RDFa]):

This document is licensed under a 
<a xmlns:xhv="http://www.w3.org/1999/xhtml/vocab#"
 rel="xhv:license"
 href="http://creativecommons.org/licenses/by/3.0/">
 Creative Commons License
</a>.

To specify the license on a document from within it, as in the Creative Commons documentation:

<> xhv:license <http://creativecommons.org/licenses/by/3.0/>.

The above is how we write things now, but other ways to write this (another URI, or other manner of expression) are easy to imagine.

Under any proposal that is not "always content", CC's documentation would have to change, either to specify that "opt in" is required (or "opt out" precluded) or to say that the subject of the copyright statement has to be expressed "inconveniently". In either case the CMS would have to change what it does.

Fixed meaning and expression, variable provisioning

The assumption in this set of use cases is that the way the sender must express the intended meaning is fixed ahead of time as being a hashless http: URI. Where this assumption comes from (i.e. why a hash URI is not being used) is not clear, but it may have to do with backward compatibility with deployed content, sunk cost, historical accident, or a certain predetermined philosophy of how linked data "must work" (i.e. that hash URIs cannot advance some particular goal either for technical reasons or for social reasons).

The goal in this case is for the "linkee" to provision discovery of the specification and/or some RDF relating to the intended meaning from the URI, so that the receiver who doesn't already know what the URI means (i.e. what meaning the sender intends to convey, per prior agreement, with this use of the URI) can figure it out. The receiver requiring discovery might be either a human (using a browser) or an RDF-aware web client (using some proposed discovery protocol). The choice between human and machine discovery client leads to distinct use cases.

C) Refer to the subject of a document from within it using RDFa

This for example is what the Open Graph Protocol does. Typical use in [[1]] violates HR14a by using syntax which is designed for talking about the document.

<html prefix="og: http://ogp.me/ns#">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="video.movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="rock.jpg" />
...
</head>
...
</html>

(To see the corresponding RDF, apply an RDFa distiller to this page: http://www.imdb.com/title/tt0117500/ )

  • Under "no change" a 303 would have to be used.
  • Under Proposal 25 it could be made to work with the addition of a Document: header.
  • Under "look for contradiction" it would work as is as long as the properties were understood as being incompatible with the URI referring to the retrieved content.
  • Under "no longer implies" a describedby link would be required.
  • Some of the other proposals provide other approaches (.well-known, etc.).
  • Under "retract" or "don't use http:" there is no understanding of what <http://www.imdb.com/title/tt0117500/> means; the sender would have to write [:descriptionUri "http://www.imdb.com/title/tt0117500/"].

Example 2: Flickr

http://www.flickr.com/photos/70365734@N00/6141289487/in/photostream contains

<http://www.flickr.com/photos/70365734@N00/6141289487>
    xhv:license <http://creativecommons.org/licenses/by/2.0/deed.en> .

where the license applies to the image (linked from the landing page), not the landing page itself (which contains material to which the copyright notice doesn't apply).

D) Use a hashless URI to refer to the single subject not the page

A need expressed by Ian Davis. (also Harry Halpin "death of linked data if we don't do this"?) Again, for some reason it is a foregone conclusion that the sender's meaning must be expressed as a hashless http: URI.

Example 1: the City of London. Senders are instructed to use the following URI to refer to the City of London:

<http://statistics.data.gov.uk/id/local-authority-district/00AA>

The choice of URI is a given in this use case, and we are looking at what a server providing "discovery service" for the URI would have to do under various proposals.

Under the 2005 TAG advice (HR14) discovery has to be deployed using 303:

$ curl -I http://statistics.data.gov.uk/id/local-authority-district/00AA
HTTP/1.1 303 See Other
Content-Type: text/html; charset=UTF-8
Date: Wed, 04 Apr 2012 13:58:03 GMT
Location: http://statistics.data.gov.uk/doc/local-authority-district/00AA
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.4
Connection: keep-alive
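
As a rough illustration of what a receiver does with such a response, the Python sketch below parses a raw response head and, for a 303, returns the Location value as the place to look for a description. The function names `parse_head` and `description_uri` are hypothetical, and the parsing is deliberately minimal (no folded headers, no repeated header names).

```python
from typing import Dict, Optional, Tuple


def parse_head(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split a raw HTTP response head into (status code, lower-cased headers)."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    status = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def description_uri(raw: str) -> Optional[str]:
    """Return the Location target of a 303 response, or None for any other status."""
    status, headers = parse_head(raw)
    if status == 303:
        return headers.get("location")
    return None


# The response head shown above, copied verbatim.
RESPONSE = """\
HTTP/1.1 303 See Other
Content-Type: text/html; charset=UTF-8
Date: Wed, 04 Apr 2012 13:58:03 GMT
Location: http://statistics.data.gov.uk/doc/local-authority-district/00AA
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.4
Connection: keep-alive
"""

print(description_uri(RESPONSE))
```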

Example 2 (imaginary)

<EiffelTower> a ex:Tower, ex:Building; foaf:name "Eiffel Tower".

Under "content opt-in" without any opt-in, a 200 response can be used instead.

Alternatively, the URIs to be deployed could be hash URIs instead of hashless URIs.

Example 3 (from Hugh Glaser):

"Please don't come back with something that does not allow, or even encourage, sites like Facebook to offer RDF back in return for:

curl -L -H Accept:application/rdf+xml https://www.facebook.com/hugh.glaser

Challenge: Try telling me what to put in sameAs.org for the LD URI for you on Facebook."

(Facebook could use hash URIs.)

E) An ontology of many URIs all hashless

Similar to D. Documentation or extant practice fixes the URI and the meaning the URI is supposed to express (usually an RDF property or class), so we assume that the sender would like to express this meaning and have the receiver understand the URI as expressing it. The linkee would like to support some discovery protocol that would help the sender to have this intended meaning be understood.

Example: Dublin Core

The Dublin Core ontology has many terms, none of which use the normal #

@prefix dc: <http://purl.org/dc/elements/1.1/>.

The current situation uses a 302 that is not in compliance with RFC 2616 (which forbids a hash URI as a Location: target):

$ curl -I http://purl.org/dc/elements/1.1/title
HTTP/1.1 302 Moved Temporarily
Date: Wed, 04 Apr 2012 12:42:08 GMT
Server: 1060 NetKernel v3.3 - Powered by Jetty
Location: http://dublincore.org/2010/10/11/dcelements.rdf#title
Content-Type: text/html; charset=iso-8859-1
X-Purl: 2.0; http://localhost:8080
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Length: 286
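
The compliance point can be checked mechanically: the Python sketch below tests whether a Location value carries a fragment. The helper name `location_has_fragment` is hypothetical; it relies only on `urllib.parse.urldefrag` from the standard library.

```python
from urllib.parse import urldefrag


def location_has_fragment(location: str) -> bool:
    """True if the redirect target contains a non-empty fragment (a '#...' part)."""
    return urldefrag(location).fragment != ""


# Location value from the Dublin Core response above.
print(location_has_fragment("http://dublincore.org/2010/10/11/dcelements.rdf#title"))
```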

Possible alternatives:

  • Use 303 (2005 TAG recommendation)
  • Proposal 25 with "content opt out"
  • Content opt-in with no "opt in"

Example: FOAF

@prefix foaf:     <http://xmlns.com/foaf/0.1/> .

This uses the TAG recommended 303:

$ curl -I http://xmlns.com/foaf/0.1/knows
HTTP/1.1 303 See Other
Date: Wed, 04 Apr 2012 12:45:43 GMT
Server: Apache/2.2.14 (Ubuntu)
Access-Control-Allow-Origin: *
Location: http://xmlns.com/foaf/spec/
Vary: Accept-Encoding
Content-Type: text/html; charset=iso-8859-1

The issue with 303 is that, in the absence of caching, an agent looking up a set of terms in the ontology makes two round trips per term, even though only one document actually needs to be loaded.
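
To make the cost concrete, here is a small Python sketch that counts requests for a set of term lookups. It is a simplified model under stated assumptions: every term URI answers with a 303, the agent may cache only the redirect target documents (not the 303 responses themselves), and the function name `count_requests` is hypothetical.

```python
def count_requests(terms, redirect_target, cache_documents):
    """Count GETs needed to look up each term behind a 303 redirect."""
    fetched = set()
    requests = 0
    for term in terms:
        requests += 1  # GET on the term URI, answered with 303
        target = redirect_target(term)
        if cache_documents and target in fetched:
            continue  # target document already held locally
        requests += 1  # GET on the Location target
        fetched.add(target)
    return requests


FOAF_TERMS = [
    "http://xmlns.com/foaf/0.1/knows",
    "http://xmlns.com/foaf/0.1/name",
    "http://xmlns.com/foaf/0.1/maker",
]


def foaf_target(term):
    # All FOAF terms are assumed to redirect to the same document,
    # as in the response shown above.
    return "http://xmlns.com/foaf/spec/"


print(count_requests(FOAF_TERMS, foaf_target, cache_documents=False))
print(count_requests(FOAF_TERMS, foaf_target, cache_documents=True))
```

Under these assumptions, n terms cost 2n requests without caching and n + 1 with document caching, so the redirect round trip per term remains even in the cached case.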

J) Naive linked data on hosting service

This use case is sketched here and here.

The assumption is that it is important to allow provisioners to deploy new linked data URIs

  • without using hash (because some provisioners won't appreciate any reason to use hash)
  • without using special status codes such as 303 (because many provisioners will be using hosting services that don't support this)
  • without requiring a second GET as with 303 or Link: (because the performance hit is intolerable)
  • without requiring deployment of a second URI (because this will be perceived as unnecessary and inconvenient)
  • without special headers (because many provisioners will be using hosting services that don't support this)

Compatibility and sunk cost are not arguments here, as the URIs in question are all new.

Under this special combination of circumstances, probably either URIs never refer to content, or any opt-in or opt-out signal must reside in the content.

  • httpRange-14 and Proposal 25 fail here
  • No Longer Implies works - the provisioner can opt in by including a statement <U> describedby [] in the content.
  • Look For Contradiction would work - the provisioner can opt out of representation-is-content by including a statement recognizably incompatible with that.
  • Nobody submitted a discovery proposal that relies on a mapping from the original URI http://host/path to another URI, say http://host/.well-known/meta/path, where http://host/path yields a non-2xx response (e.g. 404?), but this is a possible approach.
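
The mapping mentioned in the last item could be sketched as follows in Python. This is only an illustration of the http://host/path to http://host/.well-known/meta/path rewrite named above: no such proposal was submitted, the function name `well_known_meta_uri` is hypothetical, and dropping the fragment while keeping the query is an assumption made here.

```python
from urllib.parse import urlsplit, urlunsplit


def well_known_meta_uri(uri: str) -> str:
    """Map http://host/path to http://host/.well-known/meta/path."""
    parts = urlsplit(uri)
    # Assumption: keep scheme, host and query; drop any fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, "/.well-known/meta" + parts.path, parts.query, "")
    )


print(well_known_meta_uri("http://host/path"))
```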

Other use cases

F) Ontology of many terms, with hash

The use of hash and 303 is largely uncontroversial so this use case doesn't discriminate between proposals (except for "don't use http: URIs" proposals).

Examples are RDFS, OWL, sioc, etc.

H) Changing Information (monotonicity)

This use case tries to expose what happens when a linkee initially provides no information and then later changes their pages to adopt whatever convention is recommended. The examples are about the URI http://www.amazon.com/gp/product/B004TRXX7C which is a page that contains information about the novel Moby Dick, specifically a Kindle edition of that novel.

1) A sender has a crawler that performs a specialist parse of Amazon pages and makes that information available within a quadstore which is then available for users to query. The sender wants to ensure that the information within the quadstore is traceable back to the original page from which the information was gleaned, and wants to include information about both the edition and the novel in general.

2) After (1) has happened, Amazon decide to start to publish RDF about the products that they offer. They decide that URIs like http://www.amazon.com/gp/product/B004TRXX7C are best used to represent the web pages that they offer, and therefore want to publish statements like:

 <http://www.amazon.com/gp/product/B004TRXX7C>
   a schema:WebPage ;
   xhv:license </license> ;
   .

as well as other information using separate URIs about particular editions and the novel in general. They want to do this simply through adding RDFa within their existing web pages.

The data source described at (1) still contains more information than that published by Amazon. An application writer now wants to combine the information from the repository at (1) with the information Amazon have published.

3) After (1) has happened (ignoring (2)), Amazon decide to start to publish RDF about the products that they offer. They decide that URIs like http://www.amazon.com/gp/product/B004TRXX7C are best used to represent the particular editions of the books that they sell, and therefore want to publish statements like:

 <http://www.amazon.com/gp/product/B004TRXX7C>
   a schema:Book ;
   schema:numberOfPages 398 ;
   schema:datePublished "2011-03-24"^^xsd:date ;
   .

as well as other information using separate URIs about particular web pages and the novel in general. They want to do this simply through adding RDFa within their existing web pages.

The data source described at (1) still contains more information than that published by Amazon. An application writer now wants to combine the information from the repository at (1) with the information Amazon have published.

4) After (1) has happened (ignoring (2) and (3)), Amazon decide to start to publish RDF about the products that they offer. They decide that URIs like http://www.amazon.com/gp/product/B004TRXX7C are best used to represent the novels, with owl:sameAs relationships between any such URIs that provide information about the same novel, and therefore want to publish statements like:

 <http://www.amazon.com/gp/product/B004TRXX7C>
   a schema:Book ;
   schema:creator <http://www.amazon.com/Herman-Melville/e/B000AQ29JY> ;
   schema:datePublished "1851-10-18"^^xsd:date ; 
   .

as well as other information using separate URIs about particular web pages and editions. They want to do this simply through adding RDFa within their existing web pages.

The data source described at (1) still contains more information than that published by Amazon. An application writer now wants to combine the information from the repository at (1) with the information Amazon have published.

K) Aggregation of RDF harvested from diverse sources

A client may gather RDF from multiple sources and combine them into a single place (RDF graph, triple store). If a URI is not used consistently across the sources, the resulting aggregation may not be internally consistent.
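
A minimal Python sketch of how an aggregator might notice such an inconsistency is given below. It treats triples as plain string tuples and flags any URI that ends up typed with two classes the aggregator considers disjoint. The `DISJOINT` set is an assumption made for illustration (no vocabulary cited here declares it), and the function name `conflicting_uris` is hypothetical.

```python
# Assumption for illustration only: the aggregator treats these classes as disjoint.
DISJOINT = [frozenset({"schema:WebPage", "schema:Book"})]


def conflicting_uris(sources):
    """Return URIs that, across all sources, are typed with two disjoint classes."""
    types = {}
    for triples in sources.values():
        for subject, predicate, obj in triples:
            if predicate == "rdf:type":
                types.setdefault(subject, set()).add(obj)
    return {
        subject
        for subject, classes in types.items()
        if any(pair <= classes for pair in DISJOINT)
    }


URI = "http://www.amazon.com/gp/product/B004TRXX7C"
SOURCES = {
    "source-1": [(URI, "rdf:type", "schema:WebPage")],
    "source-2": [(URI, "rdf:type", "schema:Book")],
}

print(conflicting_uris(SOURCES))
```

Each source is consistent on its own; the conflict appears only after the graphs are merged.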

M) HTTP consistency

This use case attempts to raise the issue of whether conclusions drawn according to the consequences of a particular proposal would impinge on the correct operation of an HTTP client.

Here we assume that the mode of expression is a hashless URI of the sort that one might use with the HTTP protocol (typically an http: URI). If it is something else, then the question of HTTP consistency doesn't come up. The meaning that the sender wants to express is not of much importance, but in the example it is supposed that at least one of the URIs is meant to refer to Chicago.

Suppose that a web browser (or other HTTP agent) is seen as being responsible for displaying correct representations of resources. That is, navigation to a URI U must lead to a correct representation of the resource identified by U being displayed. (This seems like a plausible interpretation of the HTTP specification.)

Suppose that we have two URIs yielding inequivalent representations describing Chicago, e.g.

 <http://en.wikipedia.org/wiki/Chicago> :displayableRepresentation _:rep1.
 <http://www.cityofchicago.org/city/en.html> :displayableRepresentation _:rep2.

with _:rep1 inequivalent to _:rep2 (different content), i.e. not the following:

 <http://en.wikipedia.org/wiki/Chicago> :displayableRepresentation _:rep2.
 <http://www.cityofchicago.org/city/en.html> :displayableRepresentation _:rep1.

Now suppose that a browser has navigated to the first URI (resource) and displayed and cached a representation _:rep1. Now it is requested that the browser navigate to the second URI, i.e. to display a representation of the other resource.

It would definitely not be OK for the browser to display _:rep1 in response to a request to navigate to the second URI, since _:rep1 is not a displayable representation of <http://www.cityofchicago.org/city/en.html>.

That is, HTTP compatibility requires that the two URIs identify distinct resources.

A proposal that would admit the possibility that (according to RDF found in one or both representations) they identify the same resource, which might be expressed in RDF as

<http://en.wikipedia.org/wiki/Chicago> owl:sameAs
   <http://www.cityofchicago.org/city/en.html>.

would fail this criterion.
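
The criterion can be sketched as a simple check in Python: given what each URI is known to display and a list of owl:sameAs assertions, report the pairs that equate URIs with different representations. The names `inconsistent_same_as`, `REPRESENTATIONS` and `SAME_AS` are hypothetical, and representations are reduced to opaque labels for the purpose of the sketch.

```python
def inconsistent_same_as(representations, same_as):
    """Return sameAs pairs whose URIs have known, differing representations."""
    return [
        (a, b)
        for a, b in same_as
        if a in representations
        and b in representations
        and representations[a] != representations[b]
    ]


WIKIPEDIA = "http://en.wikipedia.org/wiki/Chicago"
CITY = "http://www.cityofchicago.org/city/en.html"

# Opaque labels standing in for _:rep1 and _:rep2 above.
REPRESENTATIONS = {WIKIPEDIA: "rep1", CITY: "rep2"}
SAME_AS = [(WIKIPEDIA, CITY)]

print(inconsistent_same_as(REPRESENTATIONS, SAME_AS))
```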

N) Reconciling incompatible uses (polysemy)

A use case from Larry Masinter. A receiver, B, receives messages M1 and M2 from senders A1 and A2. A hashless retrieval-enabled URI U occurs in both messages, but is used with different meanings in the two messages. What does B do to understand M1 and M2?

The differences in meaning could have to do with the messages being composed at different times, due to different information or specifications being observed or available to A1 and A2, different interpretations of the same information, and so on.

In thinking about this, remember that we are comparing candidate community agreements, so that in this scenario, if the agreement is P, then A1, A2, and B all agree to P.

This use case challenges the way in which the question is framed, and is difficult to use as an evaluation criterion.

Missing use cases

  • A use case that would be a critique of "punning" - maybe having to do with consistency with upper ontologies such as SUMO, or type-based analysis such as Tabulator?
  • A use case that surfaces difficulties arising from the ephemeral nature of http: URI meaning
  • Something relating to read/write web, see http://www.w3.org/2012/ldp/charter