The Importance of Self-Describing Documents

Draft Tag Finding

Noah Mendelsohn, IBM Corp., Noah_Mendelsohn@us.ibm.com

The use of self-describing document and data formats has proven valuable in many computing systems, but self-description is particularly important on the World Wide Web. This finding describes the characteristics of self-describing Web documents and techniques for creating them, and sets out the reasons that they are of particular value to the Web.

This document has been produced by the W3C Technical Architecture Group (TAG). This finding addresses TAG issue XXXX (to be opened).

This version of the document is a very preliminary sketch of a possible finding. Basically, I got interested in this issue when we discussed it in Edinburgh in 2005, and wanted a place to set down some ideas. That turned into this very rough sketch of a finding.

Additional TAG findings, both accepted and in draft state, may also be available. The TAG may incorporate this and other findings into future versions of the Architecture of the World Wide Web.

The terms MUST, SHOULD, and SHOULD NOT are used in this document in accordance with RFC 2119.

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).

World-Wide Web Consortium, Draft TAG Finding, 2005.

Created in electronic form.

English 2002-04-30: Published draft

Introduction: Why are self-describing documents important?

Electronic documents are used on the World Wide Web as a means of communication. Successful communication depends on the creator and the consumer(s) of a document having a shared understanding of the information conveyed, and that in turn requires at least some shared assumptions about the form in which the information is represented. Consider this finding, which you are now reading. If you have a printed copy, then you and the author have implicitly agreed to communicate in English. You have agreed that the English is set down using traditional typographical conventions, with the usual 26-letter alphabet and other symbols used to represent the words, punctuation, and so on. You are also depending on some shared assumptions about document structure, such as the use of a title to set an overall theme for the document, hierarchical sections used to reflect semantic structure, white space to set off paragraphs, and so on. In other respects, the document is self-describing. Given the simple and widely shared assumptions about alphabet, typography and so on, it is possible for a reader with no additional knowledge to discover essentially the full intended content of this finding.

The World Wide Web has at least two characteristics that distinguish it from many other shared information spaces:

The Web is global.

Web architecture dictates that any user agent may at any time GET and attempt to interpret representations for any resource.

The second point is often misunderstood; while it is true that certain resources are intended primarily for a narrow audience, the correct operation of search engine spiders, optimistic web caches and much other Web software depends on the ability to retrieve and work with even those seemingly more private sources of information. Not only must retrieval be safe, it is essential that consumers of such documents be able to unambiguously and correctly interpret them, or failing that, to reliably determine that the document is one that cannot in fact be understood.

As we'll see in the next section, this implies that the correct and complete interpretation of Web documents should, to the extent practical, depend only on widely used standards, conventions and languages (including both natural languages and computer languages). Certain other characteristics also contribute not only to the self-description of individual documents, but also to the ability of software to dynamically discover the information necessary for interpretation of those documents. The remainder of this finding explores some more detailed issues relating to the creation and sharing of self-describing documents on the Web.

GOOD PRACTICE: Resource representations should, to the extent practical, be self-describing.

Technical characteristics of self-describing Web documents

Just as certain shared assumptions were required for a reader to correctly understand the markings comprising the printed form of this finding, the sender and receiver of a Web document must share some assumptions if the bit streams representing the document are to be correctly interpreted. Such assumptions may be set down in the form of W3C Recommendations, IETF Requests For Comments (RFCs), standards for particular industries, and so on. They may also be embodied in private agreements, or may in fact not be formally set down at all. Insofar as the necessary specifications are widely understood, the chances greatly increase that the document will be interpretable by a wide range of software and human consumers.

Again using this document as an example: it is usually served on the Web as a sequence of bits (octets) using the HTTP protocol, labeled with the media type application/xhtml+xml, and encoded using one of the common Unicode encodings (UTF-8). An XML document type declaration allows one to reliably determine that the document is marked up using HTML 4.0 (Transitional), the lang="EN" attribute indicates that prose in the document is in English, and so on (if you're reading this document online, you may wish to use your browser's View Source feature to examine some of these declarations -- except that this version is still text/html -- argh!!). Accordingly, software that is written to these widely understood conventions can discover the overall structure of this document, the location of links, the characters comprising the prose, etc. Both search engines and human readers know to interpret the characters as English, and indeed user agents can automatically signal the availability of an English-language version. In these respects, the electronic form of the document is also self-describing.
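The checks described in this paragraph can be sketched in a few lines of code. The Python below is only an illustration and is not part of the finding: the function name describe_representation, the sample header value and the sample body are all assumptions made for the example, and only the Python standard library is used. It reads the media type and character encoding from a Content-Type header value, decodes the octets accordingly, and then recovers the namespace, root element name and language declared in the markup.

```python
from email.message import Message
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def describe_representation(content_type: str, body: bytes) -> dict:
    """Collect the self-describing labels of an XML-based representation.

    Hypothetical helper, written for illustration only.
    """
    # Let the standard library parse the Content-Type header parameters.
    headers = Message()
    headers["Content-Type"] = content_type
    media_type = headers.get_content_type()
    # Assumption for this sketch: fall back to UTF-8 if no charset is given.
    charset = headers.get_content_charset() or "utf-8"

    # Decode the octets with the declared encoding, then parse the markup.
    root = ET.fromstring(body.decode(charset))

    # ElementTree spells qualified names as "{namespace}local-name".
    if root.tag.startswith("{"):
        namespace, _, local_name = root.tag[1:].partition("}")
    else:
        namespace, local_name = "", root.tag

    # Accept either a plain lang attribute or xml:lang.
    language = root.get("lang") or root.get(f"{{{XML_NS}}}lang")

    return {
        "media_type": media_type,
        "charset": charset,
        "namespace": namespace,
        "root": local_name,
        "language": language,
    }


# Made-up sample standing in for a retrieved representation.
sample_body = (
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="EN">'
    "<head><title>The Importance of Self-Describing Documents</title></head>"
    "<body><p>...</p></body></html>"
).encode("utf-8")

info = describe_representation(
    "application/xhtml+xml; charset=UTF-8", sample_body
)
print(info)
```

Nothing in this sketch relies on private knowledge about the document: every value it reports comes from a label carried by the representation itself and defined by a widely deployed specification.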

More compact encodings of this document are possible, but they might well depend on assumptions that are less widely shared. For example, instead of all the detailed information on the title page above, one might have written: "Usual title stuff for TAG finding on self-description written by Noah in February." For another member of the TAG, this sentence might have sufficed to convey most of the information in the title page. He or she might have known that only one person named Noah had ever served on the TAG, and correctly guessed him to be the author. The copyright might have been inferred, the links to various W3C sites are well-known, and the overall structure of title pages is common to most TAG findings. The resulting title page would indeed be much more compact. Unfortunately, it would not reliably convey the full intended information to most readers on the Web, only to those with very specialized information. Thus, the compact form is not sufficiently self-describing to be widely useful; its correct interpretation depends on assumptions that are not broadly shared.

Dynamic discovery of specifications

THIS IS A PLACEHOLDER FOR A MORE SUBSTANTIVE SECTION TO BE WRITTEN

The sections above motivate the need for Web documents to depend, to the extent possible, on widely deployed specifications. Many documents, particularly those that convey machine-readable data or messages, encode detailed information using specifications that may be specialized to particular purposes. These may cover details of particular data formats (how a phone number is represented), how a message is to be processed (perhaps as an atomic transaction), secured, etc. Because of the great variety and number of such formats and specifications, and because new versions of such specifications are deployed often (e.g. a new phone number format), it's not practical to assume that even most of them will be directly implemented by typical Web user agents. A variety of Web technologies are available that allow for unambiguous labeling of the specifications being used. Furthermore, when such labels are URIs (or when, as with many XML Qualified Names, they can be mapped to URIs), it may be possible to dynamically discover on the Web the logic or code needed to understand the content in question.

Examples to be supplied:

SOAP headers identified with QNames: the software to be used in processing those headers can be determined unambiguously, and mustUnderstand="false" lets you know when the rest of the message can be trusted even if the specification for the header itself is not known.
RDF, in which predicates are URIs, so that the information needed for dealing with a predicate can be discovered dynamically on the Web.
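As a rough sketch of the SOAP case, the Python below sorts the header blocks of a SOAP 1.2 envelope into those a receiver may safely ignore and those that block processing. It is an illustration under stated assumptions, not a SOAP implementation: the function name classify_headers, the example.org namespaces and the sample envelope are hypothetical, and a real SOAP node would also have to consider roles and generate faults.

```python
import xml.etree.ElementTree as ET

# Namespace of the SOAP 1.2 envelope vocabulary.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"


def classify_headers(envelope_xml, understood):
    """Split unrecognized header blocks into (ignorable, blocking) lists.

    `understood` is the set of header QNames, in "{namespace}local" form,
    that this hypothetical receiver knows how to process.
    """
    root = ET.fromstring(envelope_xml)
    header = root.find(f"{{{SOAP_ENV}}}Header")
    ignorable, blocking = [], []
    if header is None:
        return ignorable, blocking
    for block in header:
        if block.tag in understood:
            continue  # A handler for this QName is available.
        must = block.get(f"{{{SOAP_ENV}}}mustUnderstand", "false")
        if must in ("true", "1"):
            blocking.append(block.tag)   # Cannot safely process the message.
        else:
            ignorable.append(block.tag)  # Safe to skip this header.
    return ignorable, blocking


# Made-up envelope; the tx and tr vocabularies are invented for the example.
sample_envelope = """
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <tx:Transaction xmlns:tx="http://example.org/tx"
                    env:mustUnderstand="true">5</tx:Transaction>
    <tr:Trace xmlns:tr="http://example.org/trace"
              env:mustUnderstand="false">abc</tr:Trace>
  </env:Header>
  <env:Body/>
</env:Envelope>
""".strip()

ignorable, blocking = classify_headers(sample_envelope, understood=set())
print("ignorable:", ignorable)
print("blocking:", blocking)
```

Because each header is named by a namespace-qualified QName, the receiver can tell exactly which specification it is missing, and the namespace URI gives it a place to start looking for that specification.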

Self-describing XML Documents

XML documents with namespace-qualified elements are a widely used means of creating self-describing Web documents. Given that a Web document is of media type application/xml, standard rules may be applied to determine not just the overall nature of the document, but also the meaning in context of its sub-elements. The TAG has opened an issue xmlFunctions-34 and is preparing an associated finding on the recursive interpretation of XML documents.
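To illustrate how namespace qualification supports interpretation of sub-elements, the following Python sketch walks an XML document and reports which vocabularies its elements belong to. It is not a statement of the rules the TAG is developing under xmlFunctions-34: the function name vocabularies_used, the small table of known namespaces and the sample document (including the example.org namespace) are assumptions made for the example.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# A deliberately tiny table of well-known namespace names.
KNOWN_VOCABULARIES = {
    "http://www.w3.org/1999/xhtml": "XHTML",
    "http://www.w3.org/2000/svg": "SVG",
    "http://www.w3.org/1998/Math/MathML": "MathML",
}


def vocabularies_used(xml_text):
    """Count the elements of a document by the vocabulary they belong to."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as "{namespace}local-name".
        if element.tag.startswith("{"):
            namespace = element.tag[1:].partition("}")[0]
        else:
            namespace = ""
        label = KNOWN_VOCABULARIES.get(
            namespace, f"unknown: {namespace or 'no namespace'}"
        )
        counts[label] += 1
    return dict(counts)


# Made-up compound document mixing three vocabularies.
sample_document = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A circle:</p>
    <svg xmlns="http://www.w3.org/2000/svg"><circle r="10"/></svg>
    <w:gadget xmlns:w="http://example.org/widgets"/>
  </body>
</html>
""".strip()

usage = vocabularies_used(sample_document)
print(usage)
```

Even for the element it does not recognize, the software can report precisely which namespace it failed to understand, rather than guessing at the meaning of a bare element name.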

TO DO

Things to do to clean up this finding.

Dirk and Nadia stories?

Explain how self-description allows one to detect erroneous retrieval of the wrong resource.

Examples of XML documents with cryptic element names, spelled-out element names, and namespace-qualified names

Must understand and partial understanding

Role of metadata in bootstrapping.

References

I. Jacobs, N. Walsh, Architecture of the World Wide Web. W3C. December 2004.

Change log

6-Dec-2005 [NRM]: initial version
25-Feb-2007 [NRM]: trying to get it good enough to circulate