Editors DRAFT This document has been developed for discussion by the The content of this document is intended for discussion and does NOT necessarily represent a consensus position of the TAG.
The
An The terms MUST, MUST NOT, SHOULD, and SHOULD NOT are used in this document in accordance with Publication of this finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.
Please send comments on this finding to the publicly archived TAG mailing list
Web-based software uses URIs to designate resources.
The authority who creates a URI is responsible for assuring that it is associated with the intended resource,
and that operations targeted to the URI manipulate or return the appropriate data.
Many URI schemes offer a flexible structure that can also be used to carry additional information,
called metadata, about the resource.
Such metadata might include
the title of a document,
the creation date of the resource, the MIME media type that is likely
to be returned by an HTTP GET, a digital signature usable to
verify the integrity or authorship of the resource content, or hints
about URI assignment policies that would allow one to guess the URIs
for related resources.
This finding addresses several questions regarding such metadata in URIs: What information about a resource can or should be embedded
in its URI? What metadata can be reliably determined from a URI? In what circumstances is it appropriate to use information from a URI as a hint as to the nature of a resource or its representations?
The TAG has earlier published a finding
This section uses simple examples to illustrate some
issues that arise when encoding metadata in URIs, or
when relying on information gleaned from such URIs.
Consider Martin, who is using a Web-based bug tracking system to
investigate some software problems. He sees a bug report which says:
The bug tracking system is built to show examples just as they are entered
into the system, so for http://example.org/bugdata/brokenfile.xml it returns
a stream of (poorly formed) XML with Content-Type
Unfortunately, Martin uses a browser that incorrectly attempts to infer the
format of the returned data from the URI suffix.
Keying on the ".xml" in the URI, it launches
an XML renderer for what should have been plain text.
When Martin attempts to view the faulty file,
he sees instead a browser error saying that the
erroneous XML could not be displayed.
Constraint:
Web software MUST NOT depend on the correctness of metadata inferred
from a URI, except when the encoding of such metadata is documented
by applicable standards and specifications.
Such standards and specifications include pertinent Web and Internet
RFCs and Recommendations such as
Martin's browser is in error because its inference that the URI suffix provides file type metadata is supported neither by normative Web specifications nor, we may assume, by documentation from the assignment authority.
A correctly written browser would have shown the faulty XML
as text, or might conceivably have shown a
warning about the apparent mismatch between the type inferred from
the URI and
the returned Content-Type.
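The behavior of a correctly written browser can be sketched roughly as follows. This is a minimal illustration, not an account of how any real browser is implemented: the function name `choose_handling` and the small `SUFFIX_HINTS` table are assumptions introduced here. The point it shows is that the server's Content-Type decides how the data is handled, while the URI suffix is used at most to produce a warning.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical hint table; these suffix conventions are not normative.
SUFFIX_HINTS = {".xml": "application/xml", ".jpeg": "image/jpeg"}

def choose_handling(uri, content_type):
    """Return (media_type, warning). The Content-Type header always wins."""
    # Drop parameters such as "; charset=utf-8" from the header value.
    declared = content_type.split(";", 1)[0].strip().lower()
    # Look at the URI suffix only to detect an apparent mismatch.
    _, suffix = posixpath.splitext(urlsplit(uri).path)
    hinted = SUFFIX_HINTS.get(suffix.lower())
    warning = None
    if hinted is not None and hinted != declared:
        warning = f"URI suffix suggests {hinted}, but the server declared {declared}"
    return declared, warning

media_type, warning = choose_handling(
    "http://example.org/bugdata/brokenfile.xml", "text/plain; charset=utf-8"
)
```

With Martin's URI, the sketch selects text/plain for display and merely records a warning about the mismatch.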
(Martin's browser is also ignoring TAG finding
"Authoritative Metadata"
Note that the constraint refers to There is certain metadata that Martin or his browser
can reliably determine from the URI.
For example, the URI conveys that the http scheme has been used,
and that attempts to access the resource should be directed to the IP address
returned from the DNS resolution of the string "example.org".
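As a small illustration of what can be read reliably from the URI, the following sketch uses Python's standard `urllib.parse` module to split out the scheme and host. The helper name `reliable_parts` is an assumption made for this example; the path is returned too, but nothing about its internal structure is interpreted.

```python
from urllib.parse import urlsplit

def reliable_parts(uri):
    """Extract only the components whose meaning is fixed by URI syntax."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,   # which protocol rules apply
        "host": parts.hostname,   # the name to resolve through DNS
        "path": parts.path,       # treated as opaque by the client
    }

info = reliable_parts("http://example.org/bugdata/brokenfile.xml")
```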
These conclusions are supported by normative specifications.
Bob is walking down a street, and he sees an advertisement
on the side of a bus:
Bob goes home and types the URI into his browser, which does indeed
display for him a Chicago weather forecast.
Bob then realizes that he'll be visiting Boston, and he guesses that a Boston weather page might be available at a similar URI:
He types that into his browser and reads the response that comes back.
Bob is using the original URI for more than its intended purpose,
which is to identify the Chicago weather page.
Instead, he's inferring from it information about the structure of a Web site
that, he guesses, might use a uniform naming convention for the
weather in lots of cities.
So, when Bob tries the Boston URI, he has to be prepared
for the possibility that his
guess will prove wrong:
Web architecture does not guarantee that the retrieved page, if there is one,
has the weather for Boston, or indeed that it contains
any weather report at all.
Even if it does,
there is no assurance that it is current weather, that it
is intended for reliable use by consumers, etc.
Bob has seen an advertisement listing just the Chicago URI, and that is the only one that the URI authority has
warranted will be a useful weather report.
Still, the ability to explore the Web informally and experimentally is very
valuable, and Web users act on such guesses about URIs all the time.
Many authorities
facilitate
such flexible use of the Web by assigning URIs in an orderly and predictable
manner.
Nonetheless, in the example above,
Bob is responsible for determining whether
the information returned is indeed what he needs.
Good Practice:
Guess information from URIs only when the consequences of
an incorrect guess are acceptable.
Bob would not have had to guess the Boston weather URI if the
authority had documented its URI assignment policy.
Assignment authorities have no obligation to provide such documentation,
but it can be a useful way of advertising in bulk the URIs for
a collection of related resources. For example, an advertisement might
read:
Reading that advertisement, Bob can reasonably assume that
weather reports are available by substituting specific
city names into the URI pattern
HTML forms
A browser receiving this form, or Bob if he
views the source of the form, is assured that
the assigning authority is supporting an entire class of URIs of the form:
The same HTML Form is also a computer program, executable by the
browser, that prompts for and retrieves representations for all such URIs,
and the English text in the form assures Bob that these are indeed
for weather reports.
Bob is not guessing the encoding of the URI
or the nature of the resources referenced —
he is acting on authoritative information provided by
the assignor of the URIs.
He can assume not just that he will get weather reports for certain cities,
but that no URIs in the class correspond to anything other than
weather reports (though some may correspond to no resource at all).
Bob could, with this assurance, write his own software to construct
and use such URIs to retrieve weather reports.
Of course, the typical Web user would neither directly inspect
the URIs nor write software to build them,
but would instead type in city names and push the handy "Get the weather"
button on his or her browser screen.
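Software of the kind Bob might write could look something like the sketch below. It assumes the form's ACTION URI `http://example.org/cityweather` and its single `city` field, and encodes the query the way a browser would for a GET form. The function name `weather_uri` is hypothetical.

```python
from urllib.parse import urlencode

# ACTION attribute taken from the authority's own HTML form.
FORM_ACTION = "http://example.org/cityweather"

def weather_uri(city):
    """Build the URI a browser would request when the form is submitted."""
    # urlencode applies form encoding, e.g. spaces become "+".
    return FORM_ACTION + "?" + urlencode({"city": city})

boston = weather_uri("Boston")
new_york = weather_uri("New York")
```

Because the pattern comes from the assigning authority's own form, constructing these URIs is not guesswork, although a given city may still correspond to no resource at all.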
Note that the example carefully specifies that the
HTML form is sourced from the same authority as the
individual weather URIs that the form queries.
In fact, it is also common for
the
In the examples in
Often, metadata is encoded into a URI not primarily for the benefit of users,
but to facilitate management of the resources themselves. For example,
assume that the administrators at example.org have established
a policy of assigning URIs based on the media types
of representations:
all GIF images are named with URIs ending in ".gif", and all JPEG images
are named with URIs ending in ".jpeg", and so on.
Although text/plain
.
That Content-Type should cause a properly configured browser
to show Martin the erroneous text just as it was recorded:
<?xml version="1.0">
<PetList>
<Dog>Rover</Dog>
http://example.org/weather/
.
Moreover, the advertisement claims that the weather information
obtainable at those URIs is "the best", so Bob can
assume that the weather reports are trustworthy and current.
http://example.org/weatherfinder
might offer
a city lookup page containing the following HTML form fragment:
<FORM ACTION="http://example.org/cityweather" METHOD="GET">
For what city would you like a weather report:
<INPUT TYPE="TEXT" NAME="city">?
<INPUT TYPE="SUBMIT" VALUE="Get the weather">
</FORM>
http://example.org/cityweather?city=
ACTION
attributes in HTML forms to refer to URIs
from other authorities.
In such cases, it is the provider of the form rather than the
assigning authority for the queried URIs who is responsible for the claims
made in the form.
In particular, users (and software)
should check the origin of HTML forms before depending on the URI assignment
patterns that they appear to imply.
Of course, you can always use such a form to perform a query and see
what comes back;
what you can't do is blame the assignment authority if the generated
URIs either don't
resolve (status code 404)
or return representations that don't
match the expectations established when reading the
form (you got a football score instead of a weather report).
# Match only names that end in the suffix, per the policy described in the text.
<Files ~ "\.gif$">
ForceType 'image/gif'
</Files>
<Files ~ "\.jpeg$">
ForceType 'image/jpeg'
</Files>
Even if it does not document this policy publicly, example.org's own Web servers can safely depend on it.
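The same dependence can be sketched in application code running on the authority's own server. This is only an illustration under assumed names (`SUFFIX_POLICY`, `media_type_by_policy`, and the sample URIs are invented here); it is appropriate only for software operated by the authority that set the policy, not for clients.

```python
import posixpath
from urllib.parse import urlsplit

# The authority's private naming policy: URI suffix -> media type.
SUFFIX_POLICY = {".gif": "image/gif", ".jpeg": "image/jpeg"}

def media_type_by_policy(uri):
    """Return the media type implied by the local policy, or None."""
    _, suffix = posixpath.splitext(urlsplit(uri).path)
    return SUFFIX_POLICY.get(suffix)

gif_type = media_type_by_policy("http://example.org/images/logo.gif")
```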
Good Practice: URI assignment authorities and the Web servers deployed for them may benefit from an orderly mapping from resource metadata into URIs.
In addition to filename-based conventions, authorities may choose to base URIs on database keys, customer identifiers, or other information that makes it easy to associate a URI with information pertinent to the corresponding resource. Such encodings are both useful and common on the Web, but there can also be drawbacks to including such information in URIs. Some of those problems are discussed in the three sections immediately below.
URIs optimized for use by the assignment authority may sometimes be inconvenient for resource users. Consider Mary, who is walking down the street and sees the same weather advertisement as Bob:
Like Bob,
Mary is pleased to learn about a valuable Web site, and she finds
that the URI itself is quite easy both to remember and to
type into her browser.
This is because,
in addition to the required scheme and authority components, the URI
is based on the word
The next day, Mary sees another advertisement reading:
Mary is annoyed, because the URI is both
difficult to remember and hard to transcribe accurately.
She guesses that the authority has assigned this URI for its own
convenience (see
Good Practice: URIs intended for direct use by people should be easy to understand, and should be suggestive of the resource actually named.
Note that the second
URI might be based on a database key
that facilitates efficient access to the weather data at the server
(see
URIs should generally not encode metadata that will change, regardless of whether the encoding policy is established to benefit URI assignment authorities, resource users, or both. Consider a web site that organizes document URIs according to the documents' lead author or editor. Thus, the documents:
http://example.org/documents/editor/BobSmith/document1
http://example.org/documents/editor/BobSmith/document2
are named for their editor, Bob Smith. Bob retires, and Mary Jones takes over as editor for document1. If the URI is changed to encode her name, then existing links break, but if the URI is not changed, the naming policy is violated. By encoding into the URI metadata that will change, the authority has put itself in a difficult position.
Good Practice: Resource metadata that will change SHOULD NOT be encoded in a URI.
Indeed, RDF statements about the resource, headers returned with representations (e.g. Content-Type) or metadata embedded in the representations themselves (e.g. HTML <META> tags) are all better alternatives for conveying such volatile metadata about the resource.
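The difference between the two approaches can be sketched as follows. The editor-based URI form is the one from the example above; the `stable_uri` form and the in-memory `metadata` table are hypothetical stand-ins for whatever identifier scheme and metadata store an authority might actually use.

```python
BASE = "http://example.org/documents/"

def editor_based_uri(editor, doc_id):
    # Policy from the example: the URI embeds the current editor's name.
    return f"{BASE}editor/{editor}/{doc_id}"

def stable_uri(doc_id):
    # Hypothetical alternative: the URI carries only a stable identifier.
    return f"{BASE}{doc_id}"

# Volatile metadata is kept beside the resource, not in its name.
metadata = {stable_uri("document1"): {"editor": "BobSmith"}}

def change_editor(doc_id, new_editor):
    """Record a new editor; the document's URI is unaffected."""
    uri = stable_uri(doc_id)
    metadata[uri]["editor"] = new_editor
    return uri

before = stable_uri("document1")
after = change_editor("document1", "MaryJones")
```

Under the editor-based policy, the same change of editor yields a different URI, which is what breaks existing links.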
A bank establishes a URI assignment policy in which account numbers are encoded directly in the URI. For example, the URI http://example.org/customeraccounts/456123 accesses information for account number 456123. A malicious worker at an Internet Service Provider notices these URIs in his traffic logs, and determines the bank account numbers for his Internet customers. Furthermore, if access controls are not properly in place, he might be able to guess the URIs for other accounts, and to attempt to access them.
Good Practice: URI assignment authorities SHOULD NOT put into URIs metadata that is to be kept confidential.
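One way an authority might follow this practice is to put an opaque, unguessable token in the URI and keep the mapping to the account number on the server. The sketch below is an assumption-laden illustration, not a recommendation of a specific design: the `AccountLinks` class and its methods are invented for this example, and such tokens do not replace proper access controls.

```python
import secrets

class AccountLinks:
    """Maps opaque URI tokens to account numbers held only on the server."""

    def __init__(self, base="http://example.org/customeraccounts/"):
        self._base = base
        self._accounts = {}  # token -> account number, never sent to clients

    def uri_for(self, account_number):
        # A random token reveals nothing about the account number.
        token = secrets.token_urlsafe(16)
        self._accounts[token] = account_number
        return self._base + token

    def account_for(self, uri):
        # Resolve a URI back to its account, or None if it is unknown.
        if not uri.startswith(self._base):
            return None
        return self._accounts.get(uri[len(self._base):])

links = AccountLinks()
uri = links.uri_for("456123")
```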
Bob wants to find pictures of his favorite movie star, and he goes looking on the Web. He is worried, though, because he has heard that some web sites may try to damage his computer, so Bob tries to be careful. In particular, he inspects each URI in his browser's status bar before following links to the pictures. One URI he sees is:
http://example.org/BigStar.jpeg
Knowing that JPEG images are usually safe to view, Bob follows the link.
Unfortunately, the Web site is indeed malicious, and the file served for that URI is a program of media type application/octet-stream, not the expected image/jpeg. Bob's browser runs the
program, which installs a variety of undesirable software on his machine.
The malicious Web site has not in fact violated any normative specifications,
but destructive behavior is always a misuse of the Web.
Thus, the primary fault in this scenario rests with the web site
administrators who served an executable that was intended to damage Bob's machine,
and which was named with a URI designed to mislead users.
In practice, suffixes of URIs are often used to encode the type of a resource or its representation, and users know that. Users therefore have expectations for the content associated with such URIs, whether or not those expectations are supported
by normative specifications.
URIs ending in the four characters ".jpeg" are indeed
conventionally used to serve representations of media type image/jpeg, even though no normative specification provides for such a correspondence, and even though there may over time be reasons to serve alternative representations for those same URIs.
Such correspondences are convenient for a variety of reasons, and users
(and, unfortunately, much Web software that's been deployed in error) are
tempted to rely on them.
A user like Bob might wish for a browser that would quietly reject
content that did not conform to his expectations for URI assignment.
Unfortunately, for the reasons explained in
Note that in the particular case of executable content, as in this example,
many browsers do issue warnings regardless of the particular URI used for retrieval.
This is because even executables named with obvious URIs such as http://example.org/myprogram.exe
can be malicious.
Such warnings regarding executables are useful, and if Bob had employed such a browser he might have avoided damaging his machine.
Such URI-independent warnings are, however, beyond the scope of this finding, which is concerned specifically with
inferences that Bob or his browser make from the URI itself.
The principal conclusions of this finding are:
It is legitimate for assignment authorities to encode static identifying properties of a resource, e.g. author, version, or creation date, within the URIs they assign. This may contribute to the unique assignment of URIs. It may also contribute to the use of efficient mechanisms for dereferencing resources within origin servers, e.g. the use of database keys within URIs.
Assignment authorities may publish specifications detailing the structure and semantics of the URIs they assign. Other users of those URIs may use such specifications to infer information about resources identified by URIs assigned by that authority.
The ability to explore and experiment is important to Web users. Users therefore benefit from the ability to infer either the nature of the named resource, or the likely identity of other resources, from inspection of a URI. Such inferences are reliable only when supported by normative specifications or by documentation from the assignment authorities. In other cases, users are responsible for the consequences of any incorrect inferences.
People and software using URIs assigned outside of their own authority should make as few inferences as possible about a resource based on its identity. The more dependencies a piece of software has on particular constraints and inferences, the more fragile it becomes in the face of change and the lower its generic utility.