<?xml version='1.0'?>
<!-- PUT THIS BACK AS THE DOCTYPE WHEN ONLINE AGAIN -->

<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.6//EN" "http://www.w3.org/2002/xmlspec/dtd/2.6/xmlspec.dtd"
[
  <!-- ================================================================ -->
  <!ENTITY draft.day "25">
  <!ENTITY draft.month "02">
  <!ENTITY draft.monthname "February">
  <!ENTITY draft.year "2007">
  <!ENTITY iso6.doc.date "&draft.year;-&draft.month;-&draft.day;">
  <!ENTITY http-ident "http://www.w3.org/2001/tag/doc/selfDescribingDocuments">
]>



<spec w3c-doctype='wd' role='editors-copy'>
<header>
<title>The Importance of Self-Describing Documents</title>
<w3c-designation>&http-ident;-&iso6.doc.date;</w3c-designation>
<w3c-doctype>Draft TAG Finding</w3c-doctype>
<pubdate><day>&draft.day;</day>
<month>&draft.monthname;</month>
<year>&draft.year;</year>
</pubdate>
<publoc>
<loc href='&http-ident;-&iso6.doc.date;.html'>&http-ident;-&iso6.doc.date;</loc>
</publoc>
<altlocs>
<loc href='&http-ident;-&iso6.doc.date;.xml'>XML</loc>
</altlocs>
<latestloc>
<loc href='&http-ident;.html'>&http-ident;</loc>
</latestloc>
<prevlocs>
</prevlocs>
<authlist>
<author><name>Noah Mendelsohn</name>
<affiliation>IBM Corp.</affiliation>
<email href='mailto:Noah_Mendelsohn@us.ibm.com'>Noah_Mendelsohn@us.ibm.com</email></author>
</authlist>
<copyright>
<p>
<loc href='http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright'>Copyright</loc> &#xA9; 2005
<loc href='http://www.w3.org/'>W3C</loc><sup>&#xAE;</sup>
(<loc href='http://www.lcs.mit.edu/'>MIT</loc>,
<loc href='http://www.inria.fr/'>INRIA</loc>,
<loc href='http://www.keio.ac.jp/'>Keio</loc>),
All Rights Reserved. W3C
<loc href='http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer'>liability</loc>,
<loc href='http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks'>trademark</loc>,
<loc href='http://www.w3.org/Consortium/Legal/copyright-documents-19990405'>document use</loc>, and
<loc href='http://www.w3.org/Consortium/Legal/copyright-software-19980720'>software licensing</loc>
rules apply.
</p></copyright>

<abstract>
<p>
The use of self-describing document and data formats has proven valuable in many computing systems, but self-description is particularly important on the World Wide Web.  
This finding describes the characteristics of self-describing Web documents, presents techniques for creating them, and sets out the reasons that they are of particular value to the Web.
</p>
</abstract>

<status>


<p>This document has been produced by the <loc href='/2001/tag/'>W3C
Technical Architecture Group (TAG)</loc>.
This finding addresses TAG issue XXXX (to be opened).
</p>
<p>This version of the document is a very preliminary sketch of a 
possible finding.
Basically, I got interested in this issue when we discussed it in 
Edinburgh in 2005, and wanted a place to set down some ideas.
That turned into this very rough sketch of a finding.
</p>

<p><loc href='/2001/tag/findings'>Additional TAG findings</loc>, both
accepted and in draft state, may also be available. The TAG may 
incorporate this and other findings into 
future versions of the  <bibref ref='AWWW'/>.</p>

<p>The terms <rfc2119>MUST</rfc2119>, <rfc2119>SHOULD</rfc2119>, and
<rfc2119>SHOULD NOT</rfc2119> are used in this document
in accordance with <bibref ref='rfc2119'/>.</p>

<p>Please send comments on this finding to the publicly archived TAG
mailing list <loc href='mailto:www-tag@w3.org'>www-tag@w3.org</loc>
(<loc href='http://lists.w3.org/Archives/Public/www-tag/'>archive</loc>).</p>

</status>
<pubstmt>
<p>World-Wide Web Consortium,
Draft TAG Finding, 2005.</p>
</pubstmt>
<sourcedesc>
<p>Created in electronic form.</p>
</sourcedesc>
<langusage>
<language id='EN'>English</language>
</langusage>
<revisiondesc>
<slist>
<sitem>2002-04-30: Published draft</sitem>
</slist>
</revisiondesc>
</header>
<body>

<div1 id='Introduction'>
<head>Introduction: Why are self-describing documents important?</head>

<p>
Electronic documents are used on the World Wide Web as a means of communication. 
Successful communication depends on the creator and the consumer(s) of a document having a shared understanding of the information conveyed, and that in turn requires at least some shared assumptions about the form in which the information is represented.
Consider this finding, which you are now reading.  If you have a printed copy, then you and the author have implicitly agreed to communicate in English.  You have agreed that the English is set down using traditional typographical conventions, with the usual 26-letter alphabet and other symbols used to represent the words, punctuation, and so on.
You are also depending on some shared assumptions about document structure, such as the use of a title to set an overall theme for the document, hierarchical sections used to reflect semantic structure, white space to set off paragraphs and so on.
In other respects, the document is self-describing.  Given the simple and widely shared assumptions about alphabet, typography and so on, it is possible for a reader with no additional knowledge to discover essentially
the full intended content of this finding.
   </p>
<p>
The World Wide Web has at least two characteristics that distinguish it from many other shared information spaces:
<olist>
<item><p>The Web is global.</p></item>
<item><p>Web architecture dictates that <emph>any</emph> user agent may at any time GET and attempt to interpret representations for <emph>any</emph> resource.</p></item>
</olist>
The second point is often misunderstood;
while it is true that certain resources are intended primarily for a narrow audience, the correct operation of search engine spiders, optimistic web caches and much other Web software depends on the ability to retrieve and work with even those seemingly more private sources of information.   
Not only must retrieval be safe; it is also essential that consumers of such documents be able to interpret them unambiguously and correctly or, failing that, to determine reliably that the document cannot in fact be understood.
</p>
<p>
As we'll see in the next section, this implies
that the correct and complete interpretation of Web documents should, to the extent practical,
depend only on widely used standards, conventions and languages (including both natural languages and computer languages).
Certain other characteristics also contribute not only to the self-description of individual documents,
but also to the ability of software to dynamically discover the information necessary for
interpretation of those documents.
The remainder of this finding explores some more detailed issues relating to the creation and sharing of self-describing documents on the Web.</p>
<p>
<emph>GOOD PRACTICE:</emph> Resource representations should, to the extent practical, be self-describing.
</p>
</div1>
<div1 id='technical'>
<head>Technical characteristics of self-describing Web documents</head>
<p>Just as certain shared assumptions were required for a reader to correctly understand
the markings comprising the printed form of this finding, 
the sender and receiver of a Web document must share some assumptions
if the bit streams representing the document are to be correctly interpreted.
Such assumptions may be set down in the form of W3C Recommendations, IETF Requests For Comments (RFCs),
standards for particular industries, and so on.
They may also be embodied in private agreements, or may in fact
not be formally set down at all. 
Insofar as the necessary specifications are widely understood, the chances
greatly increase that a document will be interpretable by a wide range of software and human consumers.
</p>
<p>
Again using this document as an example: it is usually served on the 
Web as a sequence of bits (octets) using the HTTP protocol,
labeled with
the media type application/xhtml+xml, and encoded using one
of the common Unicode encodings (UTF-8).
An XML document type declaration allows one to reliably determine that the
document is marked up using HTML 4.0 (Transitional),
the lang="EN" attribute indicates that prose in the document is in English,
and so on. (If you're reading this document online,
you may wish to use your browser's View Source feature to examine some of these declarations, although this draft is still served as text/html.)
Accordingly, software which is written to these widely understood conventions can discover
the overall structure of this document, the location of links, the characters comprising the
prose, etc.  Both search engines and human readers know to interpret the characters as English,
and indeed user agents can automatically signal the availability of an English-language version.
In these respects, the electronic form of the document is also self-describing.</p>
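<p>
As a rough illustration of the kinds of declarations discussed above, the following sketch shows how the start of such a document might look.
It assumes an XHTML 1.0 Transitional serialization, which is what an XML document served as application/xhtml+xml would commonly declare; the declarations in the published form of this finding may differ, and the content shown is illustrative only.
</p>
<eg xml:space='preserve'><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!-- The encoding declaration above names the character encoding. -->
<!-- The document type declaration below names the markup vocabulary. -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- The namespace identifies the vocabulary; the language attributes
     identify the natural language of the prose. -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>The Importance of Self-Describing Documents</title>
  </head>
  <body>
    <p>...</p>
  </body>
</html>]]></eg>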
<p>
More compact encodings of this document are possible,
but they might well depend on assumptions that are less widely shared.
For example, instead of all the detailed information on the title page above, one might have written:
"Usual title stuff for TAG finding on self-description written by Noah in February."
For another member of the TAG, this sentence might have sufficed
to convey most of the information in the title page. 
He or she might have known that only one person named Noah had ever served on the TAG, and correctly guessed him to be the author.
The copyright
might have been inferred, the links to various W3C sites are well-known,
and the overall structure of title pages is common to most TAG findings.
The resulting title page would indeed be much more compact.
Unfortunately, it would not reliably convey the full intended information to most readers on the Web, only to those with very specialized information.
Thus, the compact form is not sufficiently self-describing to be widely useful;  its correct interpretation depends on assumptions that are not broadly shared.
</p>

</div1>

<div1 id='dynamic'>
<head>Dynamic discovery of specifications</head>
<p>THIS IS A PLACEHOLDER FOR A MORE SUBSTANTIVE SECTION TO BE WRITTEN</p>
<p>The sections above motivate the need for Web documents to depend, to the extent possible,
on widely deployed specifications.
Many documents, particularly those that convey machine-readable data or messages, encode detailed
information using specifications that may be specialized to particular purposes.
These may cover details of particular data formats (how a phone number is represented), 
how a message is to be processed (perhaps as an atomic transaction), secured, etc.
Because of the great variety and number of such formats and specifications,
and because new versions of such specifications are deployed
often (e.g. a new phone number format), it's not practical
to assume that even most of them will be directly implemented by typical Web user agents.
A variety of Web technologies are available that allow for unambiguous labeling of the
specifications being used.
Furthermore, when such labels are URIs (or when, as with many XML Qualified Names, they can be mapped to URIs), it may be possible to dynamically discover on the Web the logic or code needed
to understand the content in question.
</p>
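<p>
The mapping from Qualified Names to URIs can be sketched with a small RDF/XML fragment.
The choice of the Dublin Core vocabulary is an assumption made for illustration; it is not prescribed by this finding.
Concatenating the namespace name bound to the dc prefix with the local name title yields the predicate URI http://purl.org/dc/elements/1.1/title, which software may attempt to dereference to learn more about the predicate.
</p>
<eg xml:space='preserve'><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- The subject is the resource identified by the rdf:about URI. -->
  <rdf:Description rdf:about="http://www.w3.org/2001/tag/doc/selfDescribingDocuments">
    <!-- Each property element names a predicate URI:
         namespace name followed by local name. -->
    <dc:title>The Importance of Self-Describing Documents</dc:title>
    <dc:creator>Noah Mendelsohn</dc:creator>
  </rdf:Description>
</rdf:RDF>]]></eg>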
<p>Examples to be supplied:</p>
<ulist>
<item><p>SOAP headers identified with QNames: the software to be used in processing those headers can be determined unambiguously, and mustUnderstand="false" lets you know when the rest of the message
can be trusted even if the specification for the header itself is not known.</p></item>
<item><p>RDF, in which predicates are URIs, and so information needed for dealing with
a predicate can be discovered dynamically on the Web.</p></item>
</ulist>
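<p>
The SOAP case might look like the following sketch, which uses the SOAP 1.2 envelope namespace.
The header blocks, their namespace names (under example.org) and the body content are hypothetical, invented only for this illustration.
A receiver targeted by the first header block must either understand it or generate a fault; the second may be ignored by a receiver that does not recognize its QName, and the rest of the message can still be processed.
</p>
<eg xml:space='preserve'><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <!-- Hypothetical header block: processing is mandatory for the receiver. -->
    <t:transaction xmlns:t="http://example.org/ns/transaction"
                   env:mustUnderstand="true">5</t:transaction>
    <!-- Hypothetical header block: may safely be ignored if not recognized. -->
    <p:priority xmlns:p="http://example.org/ns/priority"
                env:mustUnderstand="false">low</p:priority>
  </env:Header>
  <env:Body>
    <!-- Hypothetical payload. -->
    <o:order xmlns:o="http://example.org/ns/orders">...</o:order>
  </env:Body>
</env:Envelope>]]></eg>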
</div1>
<div1 id='xmlFunctions'>
<head>Self-describing XML Documents</head>
<p>
XML documents with namespace-qualified elements are a widely used means of creating self-describing
Web documents.
Given that a Web document is of media type application/xml, standard rules may be applied to
determine not just the overall nature of the document, but also the meaning in context
of its sub-elements.
The TAG has opened an issue <loc href="http://www.w3.org/2001/tag/issues.html?type=1#xmlFunctions-34" xlink:actuate="onRequest" xlink:type="simple" xlink:show="replace">xmlFunctions-34</loc> and
is preparing an associated finding on the recursive interpretation of XML documents.
</p>
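<p>
The value of namespace qualification can be sketched with three renderings of the same record.
The element names, the data and the namespace name http://example.org/ns/contacts are all hypothetical.
Only the third form associates each element with a URI that a consumer with no prior agreement could use to identify the vocabulary.
</p>
<eg xml:space='preserve'><![CDATA[<!-- 1. Cryptic names: interpretable only by prior private agreement. -->
<c><n>Jane Doe</n><t>+1-555-0100</t></c>

<!-- 2. Spelled-out names: readable by people, but still ambiguous to software. -->
<contact><name>Jane Doe</name><telephone>+1-555-0100</telephone></contact>

<!-- 3. Namespace-qualified names: each element is tied to a URI. -->
<contact xmlns="http://example.org/ns/contacts">
  <name>Jane Doe</name>
  <telephone>+1-555-0100</telephone>
</contact>]]></eg>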

</div1>
<div1 id="ToDo">
<head>TO DO</head>
<p>Things to do to clean up this finding.</p>
<ulist>
<item><p>Dirk and Nadia stories?</p></item>
<item><p>Explain how self description allows one to detect erroneous retrieval of the wrong resource.</p></item>
<item><p>Examples of XML documents with cryptic element names, spelled-out element names, and namespace-qualified names</p></item>
<item><p>Must understand and partial understanding</p></item>
<item><p>Role of metadata in bootstrapping.</p></item>
</ulist>

</div1>


<div1 id='references'>
<head>References</head>

<blist>
<bibl id='AWWW' href='http://www.w3.org/TR/webarch/'>I. Jacobs,
N. Walsh, <titleref>Architecture of the World Wide Web</titleref>.
W3C. December, 2004.</bibl>
<bibl id='rfc2119' href='http://www.ietf.org/rfc/rfc2119.txt'>S. Bradner,
<titleref>Key words for use in RFCs to Indicate Requirement Levels</titleref>.
IETF RFC 2119. March, 1997.</bibl>

</blist>
</div1>
</body>

 <back>
    <div1>
      <head>Change log</head>
      <slist>
         <sitem>6-Dec-2005 [NRM]: initial version</sitem>
         <sitem>25-Feb-2007 [NRM]: trying to get it good enough to circulate</sitem>
      </slist>
    </div1>
 </back>

</spec>
