<?xml version='1.0'?>

<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.6//EN" "http://www.w3.org/2002/xmlspec/dtd/2.6/xmlspec.dtd"
[
  <!-- ================================================================ -->
  <!ENTITY draft.day "08">
  <!ENTITY draft.month "02">
  <!ENTITY draft.monthname "February">
  <!ENTITY draft.year "2008">
  <!ENTITY iso6.doc.date "&draft.year;-&draft.month;-&draft.day;">
  <!ENTITY http-ident "http://www.w3.org/2001/tag/doc/selfDescribingDocuments">
]>



<spec w3c-doctype='wd' role='editors-copy'>
<header>
<title>The Self-Describing Web</title>
<w3c-designation>&http-ident;-&iso6.doc.date;</w3c-designation>
<w3c-doctype>Draft TAG Finding</w3c-doctype>
<pubdate><day>&draft.day;</day>
<month>&draft.monthname;</month>
<year>&draft.year;</year>
</pubdate>
<publoc>
<loc href='&http-ident;-&iso6.doc.date;.html'>&http-ident;-&iso6.doc.date;</loc>
</publoc>
<altlocs>
<loc href='&http-ident;-&iso6.doc.date;.xml'>XML</loc>
</altlocs>
<latestloc>
<loc href='&http-ident;.html'>&http-ident;</loc>
</latestloc>
<prevlocs>
<loc href="http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2007-05-24.html">http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2007-05-24</loc>, <loc href="http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2007-02-25.html">http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2007-02-25</loc>
</prevlocs>
<authlist>
<author><name>Noah Mendelsohn</name>
<affiliation>IBM Corp.</affiliation>
<email href='mailto:Noah_Mendelsohn@us.ibm.com'>Noah_Mendelsohn@us.ibm.com</email></author>
</authlist>
<copyright>
<p>
<loc href='http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright'>Copyright</loc> &#xA9; 2006, 2007, 2008
<loc href='http://www.w3.org/'>W3C</loc><sup>&#xAE;</sup>
(<loc href='http://www.lcs.mit.edu/'>MIT</loc>,
<loc href='http://www.inria.fr/'>INRIA</loc>,
<loc href='http://www.keio.ac.jp/'>Keio</loc>),
All Rights Reserved. W3C
<loc href='http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer'>liability</loc>,
<loc href='http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks'>trademark</loc>,
<loc href='http://www.w3.org/Consortium/Legal/copyright-documents-19990405'>document use</loc>, and
<loc href='http://www.w3.org/Consortium/Legal/copyright-software-19980720'>software licensing</loc>
rules apply.
</p></copyright>

<abstract>
<p>
The Web is designed to support flexible exploration of information, by human users and by automated agents.
For such exploration to be productive, 
information published by many different sources and for a variety of
purposes must be comprehensible to a wide range of Web client software.
HTTP and other Web technologies can be used to deploy resources that are 
<emph>self-describing</emph>, in the sense that only widely available information is necessary for understanding them.
Starting with a URI, there is a standard algorithm that a user agent
can apply to retrieve and interpret a representation of such resources.
Furthermore, when such self-describing resources are linked together, the Web as a whole can support reliable,
ad hoc discovery of information.
This finding describes how document formats, markup conventions, attribute values, and other data formats can be designed to facilitate the deployment of self-describing Web content.</p>
</abstract>

<status>


<p>This document has been produced by the <loc href='/2001/tag/'>W3C
Technical Architecture Group (TAG)</loc>. It is an editor's draft that has not been approved by the TAG, and it includes revisions motivated by <loc href="http://www.w3.org/2001/tag/2007/06/01-minutes#item02">discussions</loc> held at the <loc href="http://www.w3.org/2001/tag/2007/05/29-agenda">June 2007 Face to Face Meeting of the TAG</loc>.</p>

<p><loc href='/2001/tag/findings'>Additional TAG findings</loc>, both
accepted and in draft state, may also be available. The TAG may 
incorporate this and other findings into 
future versions of the  <bibref ref='AWWW'/>.</p>

<p>The capitalized terms
<rfc2119>MUST</rfc2119>, <rfc2119>SHOULD</rfc2119>, and
<rfc2119>SHOULD NOT</rfc2119> are used in this document
in accordance with <bibref ref='rfc2119'/>.</p>

<p>Please send comments on this finding to the publicly archived TAG
mailing list <loc href='mailto:www-tag@w3.org'>www-tag@w3.org</loc>
(<loc href='http://lists.w3.org/Archives/Public/www-tag/'>archive</loc>).</p>

</status>
<pubstmt>
<p>World-Wide Web Consortium,
Draft TAG Finding, 2005.</p>
</pubstmt>
<sourcedesc>
<p>Created in electronic form.</p>
</sourcedesc>
<langusage>
<language id='EN'>English</language>
</langusage>
<revisiondesc>
<slist>
<sitem>2002-04-30: Published draft</sitem>
</slist>
</revisiondesc>
</header>
<body>


<!-- *********************************************** -->
<!--                  INTRODUCTION                   -->
<!-- *********************************************** -->
  
<div1 id='Introduction'>
<head>Introduction</head>
<p>
The World Wide Web has at least two characteristics that distinguish it from many other shared information spaces:
<olist>
<item><p>The Web is global: the documents on the Web are contributed by and accessed by a very large number of users.</p></item>
<item><p>Supporting ad hoc exploration is a goal of the Web.  Users must therefore be able to get
useful information from documents prepared by people whom they don't know, and with whom they have not coordinated in advance.</p></item>
</olist>
The sections below explain in more detail how the following techniques
can be used to create, deploy, and access <emph>self-describing Web resource representations</emph> that can be correctly interpreted using only widely available information:

<ulist>

<item><p>Documents used as Web resource representations should be encoded using
standard formats such as <code>application/xhtml+xml</code> and <code>image/jpeg</code>, and deployed using HTTP.</p></item>

<item><p>Each representation should include standard machine-readable indications, such as HTTP <code>Content-Type</code> headers, XML encoding declarations, etc., of the
standards and conventions used to encode it.
</p></item>

<item><p>
Machine-processable specifications for interpreting new
formats should be provided on the Web,
and linked from representations that use the formats.
Examples of linkable specifications include OWL ontologies, RDDL documents, GRDDL transformations, etc.
By following links to such specifications, user agents can dynamically obtain information needed
to process new representation formats.
</p></item>

<item><p>A standard HTTP-based algorithm is used to deploy, retrieve and interpret self-describing Web resource representations.</p></item>

</ulist>

Furthermore, when self-describing representations are linked together,
the Web as a whole can support reliable,
ad hoc discovery of information.
</p>
<p role="practice"><a name="GPNSelfDesc" id="GPNSelfDesc"></a>
<em>Good Practice:</em> 
Web resource representations SHOULD be <emph>self-describing</emph>.
</p>
<p>
The sections below discuss in more detail the
techniques needed to create self-describing
content for the Web, how to extend the Web with
new formats that are themselves self-describing,
and how a standard HTTP-based algorithm enables
users to retrieve and interpret such
self-describing representations.
</p>
</div1>

<!-- *********************************************** -->
<!--               STANDARD HTTP ALGORITHM           -->
<!-- *********************************************** -->
  
<div1 id='algorithm'>
<head>The Web's Standard Retrieval Algorithm</head>
<p>HTTP is the most widely deployed protocol on the Web, and it is designed
to facilitate the deployment of self-describing Web resource representations.
Indeed, there is a standard algorithm that a user agent can employ
to obtain and interpret the representation of any Web resource
that's accessible using the HTTP protocol.
Consider the following example, which is representative of
many simple Web interactions:
</p>
<p>
Bob is reading a Web page using his browser.  On the page is a link, and Bob is
interested in seeing what the link points to, so he clicks on it.
Bob has had no previous contact with the owner of the referenced resource, and
his browser has not been specially configured for access to it.
The steps taken by Bob's browser when he clicks the link illustrate
a typical path through the standard retrieval algorithm of the Web
(readers unfamiliar with the HTTP protocol may find it useful to consult
either <bibref ref="HTTP"/>, or one of the many HTTP introductions available on the Web).
Assume that the link Bob clicks is to <code>http://example.com/todaysnews</code>.  When he clicks it, his browser:
<ul>
<li><p>from the <code>http:</code> at the beginning of the URI, determines that the http scheme has been used &#8212; this tells the browser that a representation retrieved using the HTTP protocol is authoritative
</p></li>
<li><p>looks up DNS name <code>example.com</code> to determine the associated IP address</p></li>
<li><p>opens a TCP stream to port 80 at the IP address determined above</p></li>
<li><p>formats an HTTP GET request for resource <code>/todaysnews</code>, and sends that to the server:
<pre>
GET /todaysnews HTTP/1.1
Host: example.com
User-Agent: TAG Sample HttpClient v1.0
Accept: */*
Accept-Language: en-us
</pre>
</p></li>
<li><p>reads this response from the server:
<pre>
HTTP/1.1 200 OK
Date: Tue, 28 Aug 2007 01:49:33 GMT
Server: Apache
Content-Type: application/xhtml+xml

&lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
&lt;html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
&lt;head>
&lt;title>Today's news&lt;/title>
&lt;/head>
&lt;body>
&lt;h1>Today's News: Oh boy!!&lt;/h1>
[HTML FOR NEWS REPORT HERE]
&lt;/body>
&lt;/html>
</pre>
</p></li>
<li><p>from the status code (200) determines that the request has been successfully processed</p></li>
<li><p>inspects the returned <code>Content-Type</code> and determines that it is <code>application/xhtml+xml</code>, a standard media type that the browser supports</p></li>
<li><p>passes the entity-body to its HTML rendering engine, which uses
the explicit markup in the HTML to determine the title of the page (Today's news), the rest of the document's structure, and so on; the browser then presents the page to Bob</p></li>
</ul>
</p>
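<p>
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming HTTP client: the function names <code>build_get_request</code> and <code>interpret_response</code> and the small table of supported media types are invented for this sketch, and the response is a canned string rather than data read from a TCP connection.
</p>
<pre><![CDATA[
```python
from urllib.parse import urlsplit

# Media types this hypothetical user agent knows how to process.
SUPPORTED_MEDIA_TYPES = {
    "application/xhtml+xml": "HTML rendering engine",
    "image/jpeg": "JPEG decoder",
}


def build_get_request(uri):
    """Derive the host, port, and GET request text from an http URI."""
    parts = urlsplit(uri)
    if parts.scheme != "http":
        raise ValueError("this sketch only handles the http scheme")
    host = parts.hostname
    port = parts.port or 80          # default port for the http scheme
    path = parts.path or "/"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Accept: */*\r\n"
        "\r\n"
    )
    return host, port, request


def interpret_response(raw):
    """Split a raw HTTP response and choose a handler from Content-Type."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status = int(status_line.split()[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if status != 200:
        return status, None, None, body
    # Drop any parameters such as "; charset=utf-8".
    content_type = headers.get("content-type", "")
    media_type = content_type.split(";")[0].strip().lower()
    handler = SUPPORTED_MEDIA_TYPES.get(media_type)
    return status, media_type, handler, body


host, port, request = build_get_request("http://example.com/todaysnews")
canned_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/xhtml+xml\r\n"
    "\r\n"
    "<html xmlns=\"http://www.w3.org/1999/xhtml\">...</html>"
)
status, media_type, handler, body = interpret_response(canned_response)
print(host, port, status, media_type, handler)
```
]]></pre>
<p>
The handler is selected only from the <code>Content-Type</code> header of the response; the sketch never inspects the URI's path or the entity-body to decide what kind of data it has received.
</p>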
<p>
Neither Bob nor his browser has any advance knowledge of the nature of the
resource, yet the browser successfully retrieves a representation,
determines its format, and interprets it for him.
The link could have been to an <code>image/jpeg</code> picture, an <code>application/xml</code> file, or a document containing RDF triples in <code>application/rdf+xml</code>.  Bob's browser could in each case
determine the format.
Indeed, as Bob continues to browse the Web,
his browser is able to determine the format of each representation that's retrieved, and can determine how to present it to him.
</p>

<p>
The example above shows how HTTP enables the deployment of self-describing Web resources.
Imagine if, instead, the link had been to <code>ftp://example.com/todaysnews</code>.  
Although Bob's browser could easily open an FTP connection to retrieve a file,
there would be no
way for it to reliably determine the nature of the information received.
Even if the URI were <code>ftp://example.com/todaysnews.html</code> the browser
would be guessing if it assumed that the file's contents were HTML,
since no normative specification
ensures that data from ftp URIs ending
in <code>.html</code> is in any particular format.
</p>
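<p>
To illustrate what such guessing amounts to, the following sketch uses the <code>mimetypes</code> module from Python's standard library, which maps file name extensions to media types. It is offered only as an example of an extension-based heuristic; nothing in it reflects information supplied by the FTP server.
</p>
<pre><![CDATA[
```python
import mimetypes

# The guess is derived purely from the extension at the end of the URI.
for uri in ("ftp://example.com/todaysnews",
            "ftp://example.com/todaysnews.html"):
    guessed_type, _encoding = mimetypes.guess_type(uri)
    print(uri, "->", guessed_type)
```
]]></pre>
<p>
With no extension there is nothing to guess from, and with <code>.html</code> the result is a local convention rather than a statement made by the resource owner.
</p>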

<p role="practice"><a name="GPNUseHTTP" id="GPNUseHTTP"></a>
<em>Good Practice:</em> 
Use the HTTP protocol to deploy self-describing resources.
</p>
<p>
<!-- empty paragraph to keep good practice box from messing up the indentation of the heading to follow -->
</p>
</div1>

<!-- *********************************************** -->
<!--                     USE STANDARDS               -->
<!-- *********************************************** -->

<div1 id='standards'>
<head>Use of widely deployed standards and formats</head>
<p>
Successful communication depends on the supplier and the consumer(s) of a document having a
shared understanding of the information conveyed, and that in turn requires at least some
shared assumptions about the form in which the information is represented.
The simplest way to achieve this is if the media type,
the document encoding, and any other conventions used for the
representation are standards and are widely
deployed.

</p>

<p>
Consider Susan, who buys a new digital camera.
The software supplied with her camera uploads photos to the Web
using the widely deployed <code>image/jpeg</code> media type,
and her Web server correctly labels served representations
with that Content-Type.
Millions of user agents deployed around the world are preconfigured to
display Susan's photographs and to extract metadata
such as camera settings from them.
Search engines are likely to index them in helpful ways too.
</p>
<p>
Now consider instead Mary, who buys a different camera,
with software that does <emph>not</emph> use
widely deployed Web formats.
Indeed, the camera's manufacturer has invented a
new "raw" file format that takes advantage of the camera's special features.
The provided photo management software not only uses that format
locally, it also uploads photos to Mary's Web server in that same form.
Indeed, it even uploads a <loc href="http://httpd.apache.org/docs/2.2/configuring.html#htaccess">.htaccess file</loc>,
configuring the server to label served representations with the
proprietary Content-Type <code>image/x-fancyrawphotoformat</code>.
In this example, there are no outright violations of Web architecture, but
the decision to use an uncommon and proprietary media type is unfortunate.
No existing Web user agents recognize the <code>image/x-fancyrawphotoformat</code> media
type, search engine spiders are unlikely to extract useful information from
pictures in that format, and so on.
Unlike Susan's, which can be viewed by almost anyone, Mary's photos are
at best useful to a few people who have the proprietary software needed
to decode them.
</p>


<p role="practice"><a name="GPNWidelyDep" id="GPNWidelyDep"></a>
<em>Good Practice:</em> 
Web resource representations SHOULD be published using widely deployed standards.
</p>
<p>
<!-- empty paragraph to keep good practice box from messing up the indentation of the heading to follow -->
</p>
</div1>

<!-- *********************************************** -->
<!--         CREATING NEW FORMATS & STANDARDS        -->
<!-- *********************************************** -->

<div1 id='extensible'>
<head>Creating new formats and standards</head>
<p>The techniques described above apply in the many cases where widely
deployed media types such as <code>image/jpeg</code> are sufficient, but
the Web is used for a broad and continually growing
range of information.
No fixed set of formats and standards can fully meet the need to
encode all such information for machine processing.
Of course, ways can be found to convey almost any information using standard
media types.
An employment record, for example, can be transmitted as either <code>text/plain</code> or <code>application/xhtml+xml</code>.
The resulting document may be quite suitable for browsing, but it
might not facilitate automated discovery of the
employee's name, his or her date of hire, and so on.
To meet such needs, new standards must be created, e.g. for 
marking up the names and dates.
Similarly, the need may arise to use new values for
individual fields such as <code>rel</code> attributes on HTML <code>link</code> elements.   
</p>
<p>So, although the Web requires self-describing documents that 
can be understood using only widely deployed standards,
there is also
a continual need for new formats and encoding conventions.
How can new formats and encodings be deployed in a manner that
is self-describing?
The following sections explore ways of creating new formats and
encoding conventions that maximize interoperability with existing
Web infrastructure, and that can be used to create self-describing documents.
</p>

<!-- *********************************************** -->
<!--                  USE BASE LAYERS                -->
<!-- *********************************************** -->

<div2 id='stablelayers'>
<head>Use existing URI Schemes, Protocols, and Media Types</head>
<p>
Innovations can be introduced to the Web at many different architectural layers.  For example:
<ul>
<li><p>New URI schemes can be introduced</p></li>
<li><p>New transfer protocols can be deployed</p></li>
<li><p>New media types can be introduced</p></li>
<li><p>New namespace-qualified markup can be defined for XML</p></li>
<li><p>New RDF properties and ontologies can be defined for the Semantic Web</p></li>
</ul>
</p>
<p>
Often, a given capability could in principle be deployed at any of several different layers.
For example, new sorts of content, such as movies, could be made available using new URI schemes and/or with new protocols, but doing so would require updating hundreds of millions of user agents, servers, proxies, and so on to understand these changes to the core mechanisms of the Web.
Usually it is preferable to leverage the existing core mechanisms of the Web, such as http-scheme URIs and the HTTP protocol, as these are widely deployed.
Indeed, one should usually leverage as many existing layers of the Web's architecture as is practical when introducing new function.
</p>
<p role="practice"><a name="GPNUseExisting" id="GPNUseExisting"></a>
<em>Good Practice:</em> 
When extending the Web with new formats and functions, use existing URI schemes, protocols, and media types wherever practical.
</p>
<p>
One way to do this is to use URI-based extensibility within existing media types, as described in the sections below.  
</p>
</div2>

<!-- *********************************************** -->
<!--                URI-BASED EXTENSION              -->
<!-- *********************************************** -->

<div2 id='URIbasedextension'>
<head>URI-based Extensibility</head>
<p>
Many documents, particularly those that convey machine-readable data or messages, encode
information using specifications that are
specialized to particular purposes.
Such specifications may cover details of particular data formats 
such as lists of customers or inventory records,
results of scientific
experiments, listings for television shows,
details of university course offerings, information about
molecular structures or drug tests, etc.
Because of the great variety and number of such formats and their specifications, it's not practical
to assume that even most of them will be directly implemented by typical Web user agents.
Instead, the Web provides means by which the necessary specifications
can be discovered, and to a significant degree implemented, dynamically
and automatically.
This is done by:

<ulist>
<item><p>ensuring that every specification, and
in many cases each markup tag or data value used, 
is identified with a URI</p></item>

<item><p>ensuring that such URIs are used in the instance either directly as data values or tag names, or else to identify the encodings used</p></item>

<item><p>making available as Web representations
of each such URI the information needed to dynamically interpret
the information in the instance
</p></item>
</ulist>

In other words, it should be possible to discover from each Web
representation the conventions used to encode it, and particularly 
in cases where
those conventions are not widely deployed, to find within the
representations links to specifications, ontologies and/or
programs necessary for
interpreting the representation.
So, just as the Web may be used to 
dynamically discover a great wealth of resources, it can
also be used to dynamically discover the specifications,
ontologies, or programs
needed to interpret the representations of those resources.
</p>
<p>Example: The Atom Syndication Format <bibref ref="ATOM"/> is 
an XML-based format for syndicating information about blogs
and other Web resources.
Atom entries can include <code>&lt;atom:link></code> elements such as
the following:
</p> 
<pre>
&lt;entry>
  &lt;title>An interesting picture&lt;/title>
  <strong>&lt;link <emph>rel="enclosure"</emph> type="image/jpeg" length="12345"
        href="http://example.org/interestingPic"/></strong>
  &lt;content type="xhtml" xml:lang="en"
           xml:base="http://example.org/">
    &lt;div xmlns="http://www.w3.org/1999/xhtml">
      &lt;p>[Update: Here's an interesting picture.]&lt;/p>
    &lt;/div>
  &lt;/content>
&lt;/entry>
</pre>
<p>The link elements identify external resources, in this case an
<code>image/jpeg</code> photograph.
Furthermore, each link can carry a <code>rel</code> attribute that
specifies the relationship between the linked resource and the Atom entry that
links to it.
In the example above, the relationship is specified as <code>enclosure</code>
which, according to the Atom specification, indicates that the linked
photograph may be too large for inline processing with the rest
of the feed. </p>
<p>What's of interest for this finding is the fact that values of the <code>rel</code> attribute are URIs (actually <bibref ref="IRI"/>s, which are the internationalized form of URIs), or else the values can be mapped to URIs.
This means that anyone, anywhere can invent a new sort of link relationship,
can assign a URI to identify that relationship, and can use that value
in the <code>rel</code> attribute.  For example:
</p>
<pre>
&lt;entry>
  &lt;title>An interesting picture&lt;/title>
  &lt;link <strong>rel="http://example.org/SomeNewATOMRelationship"</strong>
        type="image/jpeg" length="12345"
        href="http://example.org/interestingPic"/>
  &lt;content type="xhtml" xml:lang="en"
           xml:base="http://example.org/">
    &lt;div xmlns="http://www.w3.org/1999/xhtml">
      &lt;p>[Update: Here's an interesting picture.]&lt;/p>
    &lt;/div>
  &lt;/content>
&lt;/entry>
</pre>
<p>
Furthermore, anyone doing this can (and indeed should) provide information
about that new relationship via HTTP from the assigned URI.
For convenience, the Atom specification also provides that short form names such as <code>enclosure</code> in the first example can be registered with IANA, 
and Atom provides a deterministic mapping to a URI for each of these.
These URIs are formed by prepending the fixed base URI <code>http://www.iana.org/assignments/relation/</code> to the short form.
Thus, the first example above is in fact using the relationship <code>http://www.iana.org/assignments/relation/enclosure</code>.
</p>
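<p>
The mapping just described can be sketched as a small Python function. The function name <code>link_relation_uri</code> is invented for this illustration, and the test for an absolute IRI (the presence of a colon) is a simplifying assumption rather than a full IRI parser.
</p>
<pre><![CDATA[
```python
IANA_RELATION_BASE = "http://www.iana.org/assignments/relation/"


def link_relation_uri(rel):
    """Return the URI that identifies an Atom link relation value."""
    # Assumption for this sketch: a value containing a colon is already
    # an absolute IRI; anything else is a registered short form name.
    if ":" in rel:
        return rel
    return IANA_RELATION_BASE + rel


print(link_relation_uri("enclosure"))
print(link_relation_uri("http://example.org/SomeNewATOMRelationship"))
```
]]></pre>
<p>
Either way the result is a URI, so a user agent that does not recognize a relationship has somewhere to look for more information about it.
</p>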
<p>These examples show how URIs used as data values allow for
distributed assignment of new values.
More importantly for this finding, the use of URIs for such values provides
the opportunity for information about those values to be discovered
dynamically on the Web.</p>
<p role="practice"><a name="GPNDynamicDesc" id="GPNDynamicDesc"></a>
<em>Good Practice:</em> 
Web representations SHOULD link to 
the information needed to support automatic processing of 
those representations.
</p>
<p>
The following sections explain how a number of Web technologies can be applied to achieve such
dynamic integration of new Web representation formats.  
</p>
</div2>
<!-- *********************************************** -->
<!--		          RDF                        -->
<!-- *********************************************** -->

<div2 id="RDF">
<head>RDF and the Self-Describing Web</head>
<p>
RDF <bibref ref="RDF"/> plays an important and distinguished role as the preferred technology for
creating self-describing Web data resources, and for integrating representations rendered using
other technologies such as XML.
The result is a single, global self-describing Semantic Web that integrates not only resources
that are themselves built or represented using RDF, but also the other Web resources to which
that RDF links, as well as those that can be mapped to RDF using technologies such as <bibref ref="GRDDL"/>.
Readers unfamiliar with RDF should consult the RDF primer <bibref ref="RDFPrimer"/> as a prerequisite to understanding the discussion below.
</p>

<p>
Each RDF statement is a triple consisting of a subject, a predicate (typically the identifier for a property, or for a relationship between two Web resources), and an object (the value of the property or the referent of the relationship).
The subject, the predicate, and often the object as well,
are themselves identified by URIs, 
enabling the dynamic discovery introduced in <specref ref="URIbasedextension"/> above &#8212; if 
a user agent has no built-in knowledge of some particular RDF subject,
relationship, or object,
it can often use the URI to retrieve the
information necessary for processing.
Indeed, RDF Schema <bibref ref="RDFSchema"/> and the OWL Web Ontology Language <bibref ref="OWL"/>
together offer
a standard, machine-processable means of describing particular uses of RDF.
They provide the standard means by which
software can discover the relationships between RDF statements (e.g. that two seemingly
different predicates are declared <code>owl:sameAs</code> each other), or other information needed for
processing the RDF.
</p>

<p>
Consider Amy, who uses an RDF-enabled user agent to retrieve an RDF/XML
document containing the following element:
</p>
<pre>
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:employeeData="http://example.org/EmployeeInformation#">

  &lt;employeeData:employee rdf:about="http://example.org/Employees#BobSmith">
    &lt;employeeData:name>Bob Smith&lt;/employeeData:name>
    &lt;employeeData:email rdf:resource="mailto:BobSmith@example.org"/>
  &lt;/employeeData:employee>

&lt;/rdf:RDF>
</pre>

<p>
The user agent is general purpose, and although it has
rules for certain commonly used ontologies, it
has no built-in code to handle the <code>employeeData</code> properties
in the above example.
To dynamically acquire the necessary function, the agent does an HTTP
GET for <code>http://example.org/EmployeeInformation</code>.  
The GET returns an OWL ontology, from which the agent discovers
that <code>http://example.org/EmployeeInformation#email</code>
is <code>rdfs:subPropertyOf</code> the 
<code>http://www.w3.org/2001/vcard-rdf/3.0#email</code> property,
one that the agent recognizes as designating a person's e-mail address.
The agent offers Amy the option to send e-mail to Bob Smith.
Amy's browser has, in an important sense, automatically extended
itself for processing the employee data.
</p>
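<p>
The discovery step in this scenario can be sketched in Python as follows. This is an illustration only: the tables <code>KNOWN_PROPERTIES</code> and <code>ONTOLOGIES</code> and the function names are assumptions made for the sketch, and the ontology is held in memory as a stand-in for the <code>rdfs:subPropertyOf</code> statements that a real agent would obtain with an HTTP GET and an RDF parser.
</p>
<pre><![CDATA[
```python
from urllib.parse import urldefrag

# Properties this hypothetical agent already knows how to act on.
KNOWN_PROPERTIES = {
    "http://www.w3.org/2001/vcard-rdf/3.0#email": "offer to send e-mail",
}

# Stand-in for rdfs:subPropertyOf statements found in ontologies on the
# Web, keyed by the URI of the ontology document.
ONTOLOGIES = {
    "http://example.org/EmployeeInformation": {
        "http://example.org/EmployeeInformation#email":
            "http://www.w3.org/2001/vcard-rdf/3.0#email",
    },
}


def ontology_uri(property_uri):
    """Strip the fragment to get the URI that would be retrieved with GET."""
    document_uri, _fragment = urldefrag(property_uri)
    return document_uri


def action_for(property_uri):
    """Follow rdfs:subPropertyOf links until a known property is reached."""
    seen = set()
    while property_uri not in KNOWN_PROPERTIES:
        if property_uri in seen:
            return None              # cycle in the ontology data
        seen.add(property_uri)
        super_properties = ONTOLOGIES.get(ontology_uri(property_uri), {})
        if property_uri not in super_properties:
            return None              # nothing more can be discovered
        property_uri = super_properties[property_uri]
    return KNOWN_PROPERTIES[property_uri]


print(ontology_uri("http://example.org/EmployeeInformation#email"))
print(action_for("http://example.org/EmployeeInformation#email"))
```
]]></pre>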
<p role="practice"><a name="RDFGPN" id="RDFGPN"></a>
<em>Good Practice:</em> 
Information provided directly in RDF, or information for which automated means can be used to
discover corresponding RDF, contributes to the self-describing Semantic Web.
</p>
<p>Because its model is uniform, because all of its
self-description is provided in the
same model as the data itself, and because all RDF information is linked
into the Web as a whole, RDF provides uniquely
powerful facilities for dynamic
integration of a self-describing Web.
Therefore,
it's particularly important that information not originally
supplied in an RDF-specific
format be convertible into RDF.  
The sections below discuss two means of doing this:  the first
shows how RDFa can integrate HTML documents into the Semantic Web,
and the second illustrates the use of GRDDL to extract
RDF from XML documents.
</p>

<div3 id="UsingRDFa">
<head>Using RDFa to produce self-describing HTML</head>
<p>
<bibref ref="RDFa"/> is a W3C draft Recommendation for embedding Semantic Web statements into ordinary HTML Web pages.
This example illustrates how RDFa can integrate HTML into the self-describing Semantic Web:
</p>
<p>
Mary is exploring the Web using a browser that has been
enhanced with capabilities for interpreting RDFa.
Her browser knows to look through each Web page that she browses, picking out information
from the RDFa, and helping her to use it.  For example, the page might contain the following HTML,
which represents an <bibref ref="RDFVCard"/>-style contact listing (this example is adapted from one
in <bibref ref="RDFa"/>):
</p>
<pre id="vCardExample">
    &lt;p class="contactinfo" 
          xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#"
          about="http://example.org/staff/joseph">
        My name is
        &lt;span property="contact:fn">
            Joseph Smith
        &lt;/span>
        I'm a
        &lt;span property="contact:title">
            distinguished web engineer
        &lt;/span>
        at
        &lt;a rel="contact:org" href="http://example.org">
            Example.org
        &lt;/a>.
        You can contact me
        &lt;a rel="contact:email" href="mailto:joe@example.org">
            via email
        &lt;/a>.
    &lt;/p>
</pre>

<p>
Even though this document is of media type <code>application/xhtml+xml</code>,
which is not a member of the RDF family
of media types, an RDFa-enabled user agent can extract RDF from this document.
This document conveys as RDF a set of Semantic Web statements about the Web resource
<code>http://example.org/staff/joseph</code>.  The predicates are all named with the
same base URI <code>http://www.w3.org/2001/vcard-rdf/3.0#</code>, for which the
shorthand prefix <code>contact</code> is established in the HTML.
Using this syntax, the RDFa carries triples for relationships such as the
full name of the contact
(<code>http://www.w3.org/2001/vcard-rdf/3.0#fn</code>), which is <code>Joseph Smith</code>,
the e-mail address (<code>http://www.w3.org/2001/vcard-rdf/3.0#email</code>) which is
<code>mailto:joe@example.org</code>,
and so on.
</p>
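<p>
The extraction described above can be approximated with Python's standard XML parser. The sketch below is a deliberate simplification and not an implementation of the RDFa processing rules: it assumes a single subject given by the <code>about</code> attribute on the root element, handles only the <code>property</code> and <code>rel</code> attributes, and repeats the <code>contact</code> prefix mapping by hand. The names <code>PREFIXES</code>, <code>expand_curie</code>, and <code>extract_triples</code> are invented for the illustration, and the markup is a shortened form of the example.
</p>
<pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# Prefix mapping declared by xmlns:contact in the example. ElementTree does
# not expose namespace declarations, so it is repeated here by hand.
PREFIXES = {"contact": "http://www.w3.org/2001/vcard-rdf/3.0#"}

FRAGMENT = """
<p class="contactinfo"
   xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#"
   about="http://example.org/staff/joseph">
  My name is <span property="contact:fn">Joseph Smith</span>.
  You can contact me
  <a rel="contact:email" href="mailto:joe@example.org">via email</a>.
</p>
"""


def expand_curie(curie):
    """Expand a prefix:name value into a full URI."""
    prefix, _, name = curie.partition(":")
    return PREFIXES[prefix] + name


def extract_triples(markup):
    """Collect (subject, predicate, object) triples from the fragment."""
    root = ET.fromstring(markup)
    subject = root.get("about")
    triples = []
    for element in root.iter():
        if element.get("property"):
            # Literal object: the whitespace-normalized text content.
            value = " ".join("".join(element.itertext()).split())
            predicate = expand_curie(element.get("property"))
            triples.append((subject, predicate, value))
        if element.get("rel"):
            # Resource object: the link target.
            predicate = expand_curie(element.get("rel"))
            triples.append((subject, predicate, element.get("href")))
    return triples


for triple in extract_triples(FRAGMENT):
    print(triple)
```
]]></pre>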
<p>
An RDFa-enabled user agent can extract these triples and use them
to help Mary work with the data they contain,
or to integrate with 
other Semantic Web information.  
Indeed RDF is designed for such use because,
as discussed above in <specref ref="RDF"/>,
Semantic Web triples are inherently self-describing.
If a user agent needs more information about the processing of
the email triple it can, like Amy's user agent, do an HTTP GET for
<code>http://www.w3.org/2001/vcard-rdf/3.0</code>
and use the results to
get more information.
With luck, that information will lead the agent
to automatically discover that,
in the example,
<code>mailto:joe@example.org</code> can indeed
be used to send mail to the person
named <code>Joseph Smith</code>.
The browser can then offer Mary the option to send e-mail to Joe,
or to add Joe to
her address book.
</p>
<p role="practice"><a name="RDFaGPN" id="RDFaGPN"></a>
<em>Good Practice:</em> 
RDFa SHOULD be used to make information conveyed in HTML self-describing.
</p>
<p>
Note: at this time, drafts of the <bibref ref="RDFa" /> specification
are available, but the media-type registration for HTML itself has not
been updated to reflect RDFa.
As described in TAG Finding <bibref ref="AuthoritativeMetadata"/>,
conventions like RDFa are normative only if provided for in the
applicable specification for the media-type in which they are used.
Thus, for RDFa to be fully integrated with <specref ref="algorithm"/>, the HTML
and/or XHTML media-type registrations must be updated.
Use of RDFa is in any case encouraged in the interim until that
happens.
</p>
<!-- empty para helps formatting after GPN -->
<p/>
</div3>

<div3 id="GRDDLchap">
<head>Using GRDDL to bridge from XML to RDF</head>
<p>RDFa provides a standard means of encoding RDF information in XHTML
documents,
but many other XML variants lack that capability.
Furthermore, RDFa requires explicit encoding of each triple in the XHTML
instance, and that may in some cases be impractical.
<bibref ref="GRDDL"/> provides a standard means of extracting useful
RDF statements (triples) from a broad range of XML document formats.
Each GRDDL-enabled XML document links to a transformation that,
when applied to the document, produces RDF triples.
Typically, the same GRDDL transformation can be used
on entire families of similar XML documents.
</p>
<p>
For example, assume that Albert uses a GRDDL-enabled
user agent to retrieve an XML document containing the following fragment:</p>

<pre>
&lt;employees xmlns="http://example.org/employeeNS">
  &lt;employee name="Bob Smith">
    &lt;email>BobSmith@example.org&lt;/email>
  &lt;/employee>
&lt;/employees>
</pre>

<p>Note that, unlike the earlier examples, this is neither HTML nor RDF;
we can assume that <code>http://example.org/employeeNS</code> is a namespace
created by some particular business for use in its own business documents.
Albert's agent has no built-in knowledge of this namespace, and so cannot
do much with it.
Now assume that Albert instead retrieves a different document.
Most of the markup and data in it is identical to the first,
but this document is
GRDDL enabled:
</p>

<pre>
&lt;employees xmlns="http://example.org/employeeNS"
              <strong>xmlns:grddl="http://www.w3.org/2003/g/data-view#"</strong>
              <strong>grddl:transformation=</strong>
                    <strong>"http://example.org/GRDDL_For_employeeNS.xsl"></strong>
  &lt;employee name="Bob Smith">
    &lt;email>BobSmith@example.org&lt;/email>
  &lt;/employee>
&lt;/employees>
</pre>

<p>
Albert's user agent is GRDDL aware, so
it transforms the <code>&lt;employees></code> information
to RDF using the supplied <code>GRDDL_For_employeeNS.xsl</code>
transformation.
If Albert is lucky, that transformation produces RDF triples
that the agent understands, or that the agent can dynamically discover how to
process using the techniques described above in <specref ref="RDF"/>.
As in the earlier examples, Albert's user agent offers to send mail
to Bob Smith.
</p>
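<p>
This finding does not specify the contents of
<code>GRDDL_For_employeeNS.xsl</code>. Purely as a hypothetical sketch, such a
transformation might be an XSLT 1.0 stylesheet along the following lines.
The reuse of the <code>contact</code> predicates from the RDFa example, the use
of an anonymous resource for each employee, and the construction of a
<code>mailto:</code> URI from the <code>&lt;email></code> element are all
assumptions made for illustration.
</p>
<pre>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:emp="http://example.org/employeeNS"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#"
    exclude-result-prefixes="emp">

  &lt;!-- Hypothetical sketch: wrap all generated statements in one rdf:RDF element -->
  &lt;xsl:template match="/emp:employees">
    &lt;rdf:RDF>
      &lt;xsl:apply-templates select="emp:employee"/>
    &lt;/rdf:RDF>
  &lt;/xsl:template>

  &lt;!-- Each employee becomes one anonymous resource with a name and an e-mail address -->
  &lt;xsl:template match="emp:employee">
    &lt;rdf:Description>
      &lt;contact:fn>&lt;xsl:value-of select="@name"/>&lt;/contact:fn>
      &lt;!-- assumption: the e-mail address is exposed as a mailto: resource -->
      &lt;contact:email rdf:resource="mailto:{normalize-space(emp:email)}"/>
    &lt;/rdf:Description>
  &lt;/xsl:template>
&lt;/xsl:stylesheet>
</pre>
<p>
Applied to the GRDDL-enabled document above, a stylesheet of this kind would
yield one anonymous resource whose <code>fn</code> is <code>Bob Smith</code> and
whose <code>email</code> is <code>mailto:BobSmith@example.org</code>.
</p>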

<p role="practice"><a name="GRDDLGPN" id="GRDDLGPN"></a>
<em>Good Practice:</em> 
GRDDL SHOULD be used to integrate XML documents into the
self-describing Semantic Web.
</p>
<!-- empty para helps formatting after GPN -->
<p/>
</div3>

</div2>


<div2 id="XMLSpecs">
<head>Self-describing XML documents</head>
<p>
Section <specref ref="GRDDLchap"/> described how GRDDL can be used
to integrate XML documents with the RDF-based Semantic Web.
This section describes a related technique for
creating new, self-describing XML formats.
</p>
<p>
Given that a Web document is of media type
<code>application/xml</code>, or in the family of
media types <code>application/____+xml</code>,
recursive processing from the root element down may be applied to
determine not just the overall nature of the document,
but also the meaning in context
of its sub-elements.
Doing this, however, requires understanding of the semantics of
each named element.
Although a few specific XML variants such as
<code>application/xhtml+xml</code> may be directly supported by some
user agents, no user agent can build in support for the ever growing
set of XML languages used for Web representations.
This section describes how namespace documents, discoverable from the
XML tag names in the markup, can be used to make such languages 
self-describing, and to enable automated processing of them.
</p>
<p>
When XML namespaces are used <bibref ref="XMLNamespaces"/>, each XML element is named with a <loc href="http://www.w3.org/TR/xml-names11/#ns-qualnames">qualified name</loc>, consisting of an optional prefix and a local name.  In the following example, the root element has the qualified name <code>inventory:inventoryItem</code>:
</p>
<pre id="xmlex2">
   &lt;inventory:inventoryItem 
        xmlns:inventory="http://example.org/inventoryNamespace">
     &lt;inventory:itemNumber>
         87354
     &lt;/inventory:itemNumber>
     &lt;inventory:quantityAvailable>
         152
     &lt;/inventory:quantityAvailable>
   &lt;/inventory:inventoryItem>
</pre>
<p>
Qualified names map to <loc href="http://www.w3.org/TR/xml-names11/#dt-expname">expanded names</loc> such as <code>{http://example.org/inventoryNamespace,inventoryItem}</code>, composed of a namespace name URI (<code>http://example.org/inventoryNamespace</code>) and a local name (<code>inventoryItem</code>).
The namespace name URI serves at least two roles:  the most obvious and the most widely understood is to distinguish expanded names in one namespace from those in another;  the other role, and the one that's most important for purposes of this finding, is that it provides Web identification for the namespace itself.
The namespace is a Web resource, and like any other resource, it can and should provide representations using HTTP.
<emph>A user agent processing an XML document can retrieve descriptions of the namespaces used in that document, and
can use that retrieved information to determine how to correctly process the XML markup.</emph>
The W3C TAG is currently working on a finding that will describe best practices for creating such representations of namespaces.
Drafts of the finding, which are available at
<bibref ref="NamespaceDocuments"/>,
recommend the use of <bibref ref="RDDL"/> as a preferred means of
documenting namespaces.
RDDL is itself extensible, but it is commonly used to suggest XML Schemas (in any of several languages), XSLT Stylesheets, etc. that are usable with markup from the namespace being described.
</p>
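<p>
Returning to expanded names: because an expanded name is determined by the
namespace name and the local name, and not by the prefix, the choice of prefix
is not significant. As an illustration (this variant is not part of the
original example), the following document uses a default namespace declaration
instead of the <code>inventory</code> prefix, yet its elements have the same
expanded names as those in the example above, for instance
<code>{http://example.org/inventoryNamespace,inventoryItem}</code>:
</p>
<pre>
   &lt;inventoryItem xmlns="http://example.org/inventoryNamespace">
     &lt;itemNumber>
         87354
     &lt;/itemNumber>
     &lt;quantityAvailable>
         152
     &lt;/quantityAvailable>
   &lt;/inventoryItem>
</pre>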
<p>
Example: assume that user Bob is browsing the Web, and that he follows a link to a resource that returns the XML above as its representation.
Bob's browser uses <specref ref="algorithm"/> to retrieve the representation,
to determine its character encoding, and to discover that its Content-type
is <code>application/inventory+xml</code>.
Of course, it's very unlikely that Bob's browser has built-in knowledge of the inventory XML language, but the Content-type makes clear <bibref ref="XMLMediaType"/> that
the representation can be interpreted as XML with Namespaces.
The root element tag is from namespace <code>http://example.org/inventoryNamespace</code>, which uses the http scheme,
so Bob's browser does an HTTP GET from that URI.
What comes back is an
RDDL document containing the following <code>&lt;rddl:resource></code> element:
</p>
<pre id="RDDLexample" >
&lt;rddl:resource
   xlink:role="http://www.w3.org/1999/XSL/Transform"
   xlink:arcrole="http://www.w3.org/1999/xhtml"
   xlink:href="http://example.org/InventoryToBrowsableHTML.xslt"
   xlink:title="Transform Inventory XML to HTML for Browsing">
&lt;/rddl:resource>
</pre>
<p>
This designates a stylesheet (<code>http://example.org/InventoryToBrowsableHTML.xslt</code>) that can be
applied to format the inventory XML as HTML &#8212; the
browser automatically retrieves and applies
the stylesheet, producing HTML that is
rendered on the screen.
Without any manual intervention from Bob, his browser automatically displays the inventory record in a format that's convenient to read and print.
Bob's browser may also be enabled for XML validation, in which case it can look in the RDDL for a link to a schema to be used for validating inventory markup, and can use that to check the document.
</p>
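<p>
The contents of <code>InventoryToBrowsableHTML.xslt</code> are not given in
this finding. As a minimal sketch, assuming that a simple two-row table is an
acceptable rendering, such a stylesheet might look like the following; the
page title, table layout and labels are illustrative assumptions.
</p>
<pre>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:inventory="http://example.org/inventoryNamespace"
    exclude-result-prefixes="inventory">

  &lt;xsl:output method="html"/>

  &lt;!-- Hypothetical sketch: render one inventory record as a small HTML page -->
  &lt;xsl:template match="/inventory:inventoryItem">
    &lt;html>
      &lt;head>&lt;title>Inventory item&lt;/title>&lt;/head>
      &lt;body>
        &lt;table>
          &lt;tr>
            &lt;th>Item number&lt;/th>
            &lt;!-- normalize-space trims the whitespace around the value -->
            &lt;td>&lt;xsl:value-of select="normalize-space(inventory:itemNumber)"/>&lt;/td>
          &lt;/tr>
          &lt;tr>
            &lt;th>Quantity available&lt;/th>
            &lt;td>&lt;xsl:value-of select="normalize-space(inventory:quantityAvailable)"/>&lt;/td>
          &lt;/tr>
        &lt;/table>
      &lt;/body>
    &lt;/html>
  &lt;/xsl:template>
&lt;/xsl:stylesheet>
</pre>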

<p>
Bob's browser has, like Amy's in the RDF example shown earlier,
extended
itself for processing of the inventory markup language.
Unless the RDDL provides a link to one or more executable
programs that process inventory records,
it's unlikely that Bob's browser can automatically
discover <emph>everything</emph> that one might reasonably
want to know about processing inventory
markup.
Still, even the limited automatic function described above is very useful,
and RDDL is an extensible framework that can
be easily adapted to provide new kinds of information about namespaces.
Note that because RDDL documents are themselves XML, GRDDL can be applied
to derive RDF statements from them (see <specref ref="GRDDLchap"/>).
In this way, self-describing XML documents can be integrated with
the self-describing Semantic Web.
<bibref ref="NamespaceDocuments"/> describes this technique in more detail.
</p>
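<p>
As one illustration of the additional information an RDDL document can carry,
the schema link mentioned earlier might be expressed as a second
<code>&lt;rddl:resource></code> element such as the one sketched below.
The schema location <code>http://example.org/inventory.xsd</code> is
hypothetical, and the <code>xlink:role</code> and <code>xlink:arcrole</code>
values are, to the best of our reading of <bibref ref="RDDL"/>, the URIs
conventionally used there for W3C XML Schema and for schema validation; they
should be checked against that specification before being relied upon.
</p>
<pre>
&lt;rddl:resource
   xlink:role="http://www.w3.org/2001/XMLSchema"
   xlink:arcrole="http://www.rddl.org/purposes#schema-validation"
   xlink:href="http://example.org/inventory.xsd"
   xlink:title="Schema for validating Inventory XML">
&lt;/rddl:resource>
</pre>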

</div2>

</div1>

<div1 id='conclusions'>
<head>Conclusions</head>
<p>
Ad hoc exploration of the Web is possible only if
resource representations are self-describing.
Using the techniques described above and starting with an http- or https-scheme
URI, a user agent can proceed step by step to retrieve a representation,
reliably discover the conventions that have been used to encode it,
and if necessary, dynamically find instructions for processing it.
Those who invent new document formats, new markup tags, or new conventions
for encoding particular data values should use the techniques described
above to make those formats self-describing.
When these techniques are used, and when self-describing
representations are linked together, the Web as a whole
can support reliable, ad hoc discovery of information.
</p>

</div1>

<div1 id='references'>
<head>References</head>

<blist>
<bibl id="ATOM" href="http://tools.ietf.org/html/rfc4287">M. Nottingham, R. Sayre (Eds.) <titleref>RFC 4287: The Atom Syndication Format</titleref>. December 2005</bibl>
<bibl id="AuthoritativeMetadata" href="http://www.w3.org/2001/tag/doc/mime-respect">R. Fielding, I. Jacobs, <titleref>Authoritative Metadata</titleref>. W3C Technical Architecture Group Finding, April, 2006.</bibl>
<bibl id='AWWW' href='http://www.w3.org/TR/webarch/'>I. Jacobs,
N. Walsh, <titleref>Architecture of the World Wide Web</titleref>.
W3C. December, 2004.</bibl>
<bibl id="GRDDL" href="http://www.w3.org/TR/grddl/">D. Connolly, <titleref>Gleaning Resource Descriptions from Dialects of Languages (GRDDL)</titleref>. W3C Candidate Recommendation, May, 2007</bibl>
<bibl id='HTTP' href='http://www.ietf.org/rfc/rfc2616.txt'>R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee <titleref>RFC 2616: Hypertext Transfer Protocol - HTTP/1.1</titleref>. June 1999</bibl>
<bibl id="IRI" href="http://www.ietf.org/rfc/rfc3987.txt">M. Duerst, M. Suignard <titleref>RFC 3987: Internationalized Resource Identifiers (IRIs)</titleref>. January 2005</bibl>
<bibl id='LeastPower' href='http://www.w3.org/2001/tag/doc/leastPower'>T. Berners-Lee, N. Mendelsohn <titleref>The Rule of Least Power</titleref>. W3C Technical Architecture Group Finding, February, 2006</bibl>
<bibl id="MetadataInURI" href="http://www.w3.org/2001/tag/doc/metaDataInURI-31">N. Mendelsohn, S. Williams, <titleref>The use of Metadata in URIs</titleref>. W3C Technical Architecture Group Finding, January, 2007.</bibl>
<bibl id="NamespaceDocuments" href="http://www.w3.org/2001/tag/doc/nsDocuments/">N. Walsh, <titleref>Associating Resources with Namespaces</titleref>. W3C Technical Architecture Group Draft Finding, December, 2005.</bibl>
<bibl id='OWL' href='http://www.w3.org/TR/owl-features/'>D. McGuinness, F. van Harmelen (Eds.) <titleref>OWL Web Ontology Language
Overview </titleref>. W3C Recommendation, February 2004.</bibl>
<bibl id='RDDL' href='http://www.rddl.org/'>J. Borden, T. Bray, <titleref>Resource Directory Description Language (RDDL)</titleref>. W3C. February, 2002.</bibl>
<bibl id='RDF' href='http://www.w3.org/TR/rdf-concepts/'>G. Klyne, J. Carroll (Eds.) <titleref>Resource Description Framework (RDF):
Concepts and Abstract Syntax</titleref>. W3C Recommendation, February 2004.</bibl>
<bibl id='RDFVCard' href='http://www.w3.org/TR/vcard-rdf'>R. Iannella <titleref>Representing vCard Objects in RDF/XML</titleref>. W3C Note, February 2001.</bibl>
<bibl id='RDFPrimer' href='http://www.w3.org/TR/rdf-primer/'>F. Manola, E. Miller (Eds.) <titleref>RDF Primer</titleref>.  W3C Recommendation, February 2004.</bibl>
<bibl id='RDFSchema' href='http://www.w3.org/TR/rdf-schema/'>D. Brickley, R.V. Guha (Eds.) <titleref>RDF Vocabulary Description Language 1.0: RDF Schema</titleref>. W3C Recommendation, February 2004.</bibl>
<bibl id='RDFa' href='http://www.w3.org/TR/xhtml-rdfa-primer/'>B. Adida, M. Birbeck <titleref>RDFa Primer 1.0: Embedding RDF in XHTML</titleref>. W3C. (working draft) March, 2007.</bibl>

<bibl id='XMLMediaType' href='http://www.ietf.org/rfc/rfc3023.txt'>M. Murata, S. St. Laurent, D. Kohn <titleref>RFC 3023: XML Media Types</titleref>. January 2001</bibl>

<bibl id="XMLNamespaces" href="http://www.w3.org/TR/xml-names11/">T. Bray, D. Hollander, A. Layman, R. Tobin, <titleref>Namespaces in XML 1.1</titleref>. W3C, August, 2006 (2nd Edition).</bibl>

</blist>
</div1>
</body>

 <back>
    <div1>
      <head>Change log (will be deleted before final publication)</head>
      <slist>
         <sitem>6-Dec-2005 [NRM]: initial version</sitem>
      </slist>
      <slist>
         <sitem>25-Feb-2007 [NRM]: trying to get it good enough to circulate</sitem>
      </slist>
<div2 id="ChangeMay242007">
<head>Changes in 24 May 2007 Edition</head>
<ulist>
<item><p>Changed title to "Self-describing Web"</p></item>
<item><p>New discussion of discovery of specs, role of RDF, etc.</p></item>
<item><p>Extensive editorial work.</p></item>
</ulist>
</div2>
<div2 id="ChangeFeb2008">
<head>Changes in February 2008 Edition</head>
<ulist>
<item><p>Major rewrite to take account of formal review from June 2007 (Google) F2F, and also informal comments made during hallway discussions at Sept. 2007 (Southampton) F2F. Changes include:</p></item>
<item><p>Story about "you reading this document" is gone.</p></item>
<item><p>Standard retrieval algorithm for Web added</p></item>
<item><p>Rearranged TOC and heading structure</p></item>
</ulist>
</div2>
    </div1>
 </back>

</spec>
