The Self-Describing Web

Draft TAG Finding 24 May 2007

This version:
http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2007-05-24
Latest version:
http://www.w3.org/2001/tag/doc/selfDescribingDocuments
Previous version:
http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2007-02-25
Editor:
Noah Mendelsohn, IBM Corp. <Noah_Mendelsohn@us.ibm.com>

This document is also available in these non-normative formats: XML.


Abstract

The Web is designed to support flexible exploration of information, by human users and by automated agents. For such exploration to be productive, information published by many different sources and for a wide variety of purposes must be comprehensible to a wide variety of Web client software. This finding suggests that there are three strategies that, used in combination, can ensure such flexible interoperability: 1) where practical, resource representations should be encoded using widely deployed standards; 2) where such widely deployed standards are not sufficient, the encodings used should themselves be described in machine readable form on the Web, using RDF, RDDL, or other standard description systems; and 3) in all cases, each representation should carry information such as media-types, character encoding labels, RDFa, links to specifications, etc. sufficient to support automatic determination of the standards and other specifications necessary for correct interpretation. To the extent that these guidelines are observed, individual documents become self-describing, in the sense that only widely available information is necessary for understanding them. Furthermore, when such documents are linked together, the Web as a whole can support reliable, ad hoc discovery of information. This finding discusses in more detail the techniques needed to create such a self-describing Web.

Status of this Document

This document is an editors' copy that has no official standing.

This document has been produced by the W3C Technical Architecture Group (TAG). This finding addresses TAG issue XXXX (to be opened).

This version is an editor's draft and has not been approved by the TAG. It has been prepared for discussion at the June 2007 Face to Face Meeting of the TAG, and it is intended in part to address comments made at the March 2006 Face to Face Meeting of the TAG.

Additional TAG findings, both accepted and in draft state, may also be available. The TAG may incorporate this and other findings into future versions of the [AWWW].

The terms MUST, SHOULD, and SHOULD NOT are used in this document in accordance with [rfc2119].

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).

Table of Contents

1 Introduction
2 Use of widely deployed standards and formats
3 Determining the format of a representation
4 Dynamic discovery using extensible specifications
    4.1 Self-describing XML documents
    4.2 RDF and the Self-Describing Web
    4.3 Using RDFa to produce self-describing HTML
    4.4 Using GRDDL to bridge from XML to RDF
5 Conclusion
6 Change Log
    6.1 Changes in 24 May 2007 Edition
7 References

Appendix

A Change log


1 Introduction

The World Wide Web has at least three characteristics that distinguish it from many other shared information spaces:

  1. The Web is global: the documents on the Web are contributed by and accessed by a very large number of users.

  2. Supporting ad-hoc exploration is a goal of the Web. Users must therefore be able to get useful information from documents prepared by people whom they don't know, and with whom they have not coordinated in advance.

  3. Web architecture dictates that any user agent may at any time issue a GET and attempt to interpret representations for any HTTP resource.

It seems fairly obvious that documents intended for a broad audience should be encoded using standard formats, because user agents, such as Web browsers, can provide built in support for such standards. What may be less clear are the importance of having each resource representation unambiguously indicate the conventions used to encode it, the possibilities for extending fixed representation standards by providing machine readable specifications on the Web, and the importance of using such approaches even for documents that are primarily targeted to a limited audience. Applying these approaches results in documents that are self-describing, in the sense that only widely available information is necessary for understanding them. Furthermore, when such documents are linked together, the Web as a whole can support reliable, ad hoc discovery of information. This finding discusses in more detail the techniques needed to create such a self-describing Web.

2 Use of widely deployed standards and formats

Electronic documents are used on the World Wide Web as a means of communication. Successful communication depends on the supplier and the consumer(s) of a document having a shared understanding of the information conveyed, and that in turn requires at least some shared assumptions about the form in which the information is represented. The simplest way to achieve this is if the document is encoded using widely deployed standards and conventions.

As an example, consider the document you are reading now. If you have a printed copy, then you and the author have implicitly agreed to communicate in English. You have agreed that the English is set down using traditional typographical conventions, with the usual 26 letter alphabet and other symbols used to represent the words, punctuation, and so on. You are also depending on some shared assumptions about document structure, such as the use of a title to set an overall theme for the document, hierarchical sections used to reflect semantic structure, white space to set off paragraphs and so on. In other respects, the document is self-describing. Given the simple and widely shared assumptions about alphabet, typography and so on, it is possible for a reader with no additional knowledge to discover essentially the full intended content of this finding.

If you are reading this document online using a Web browser, then you are benefiting from the fact that its electronic representation is also based on widely deployed standards: it is written in HTML, using the UTF-8 Unicode encoding, is served using the widely deployed HTTP protocol, and so on. Because so many agents on the Web are compatible with that representation, this document can be viewed in Web browsers, both on desktop machines and on mobile devices, it can be parsed and decoded by search engine crawlers, and so on. (See also the TAG Finding "The Rule of Least Power" [LeastPower] for a discussion of some other document characteristics that facilitate use of the information in this document.)

More compact encodings of this document are possible, but they might well depend on assumptions that are less widely shared. For example, instead of all the detailed information on the title page above, one might have written: "Usual title stuff for TAG finding on self-description written by Noah in May." For another member of the TAG, this sentence might have sufficed to convey most of the information in the title page. He or she might have known that only one person named Noah has ever served on the TAG, and correctly guessed him to be the author. The copyright might have been inferred, the links to various W3C sites are well-known, and the overall structure of title pages is common to most TAG findings. The resulting encoding would indeed be much more compact. Unfortunately, it would not reliably convey the full intended information to most readers on the Web, only to those with very specialized information. Thus, the compact form is not sufficiently self-describing to be widely useful; its correct interpretation depends on assumptions that are not broadly shared.

Good Practice

Good Practice: Web resource representations SHOULD, to the extent practical, be encoded using widely deployed standards.

3 Determining the format of a representation

Just as certain shared assumptions were required for a reader to correctly understand the markings comprising the printed form of this finding, the sender and receiver of a Web document must share some assumptions if the bit streams representing the document are to be correctly interpreted. It's not enough for the sender to know that standard formats or encodings were used; the receiver must be able to reliably discover which ones were chosen. The HTTP protocol and associated standards are designed to facilitate discovery of the encodings that have been used for each Web resource representation.

Again using this finding as an example: it is usually served on the Web as a sequence of bits (octets) using the HTTP protocol, labeled with the media type text/html and the associated character set (UTF-8). Indeed, if you're reading this document online, you may wish to use your browser's View Source or View Page Information (or similar) feature to examine some of these declarations. Here is a representative portion of the HTTP response returned for one of the early drafts of this document (a few headers not pertinent to this discussion have been removed, and carriage returns have been added to the HTML to make it easier to read):

HTTP/1.1 200 OK
Date: Mon, 21 May 2007 22:55:45 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.7
Last-Modified: Mon, 26 Feb 2007 14:44:58 GMT
Content-Type: text/html; charset=utf-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="EN">
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>The Self-Describing Web</title>
...

Typically, such encoding or format information is applied in a layered manner after the representation is received. So, for example, the knowledge that UTF-8 has been used is necessary to interpret the octet stream as characters, and the discovery that media type text/html is used gives the receiver license to interpret the first of those characters as an HTML DOCTYPE declaration. (Note that if the same entity-body were served as text/plain, a user agent would be guessing if it "sniffed" the document to determine that the content could be processed as HTML, or if it tried to infer the type of such a resource from an ".html" suffix in the URI; as discussed in TAG Findings [AuthoritativeMetadata] and [MetadataInURI] such guessing is contrary to Web Architecture and is strongly discouraged. The Content-Type header is generally the appropriate means to determine the character encoding and media type of a representation retrieved using HTTP.) With knowledge that UTF-8 and media type text/html have been used, a receiving user agent can inspect the DOCTYPE to determine that in fact the document uses the 4.01 Transitional variant of HTML, and can parse the rest of the document to determine its tag structure. From the lang="EN" attribute on the HTML element it can reliably determine that the text was intended to be read as English, from the various HTML heading tags (e.g. <h1> and <h2>) it can determine the structuring of the document into sections, and from the <a> tags it can discover the links and the anchors in the document, and so on.
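The layered interpretation described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any particular user agent works: the function names parse_content_type and interpret are hypothetical, the DOCTYPE is located with a simple regular expression rather than a real HTML parser, and the sketch declines to guess when no charset label is present.

```python
import re
from email.message import Message


def parse_content_type(header_value):
    """Return (media_type, charset) from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


def interpret(content_type, body):
    """Apply the format information layer by layer, never by sniffing."""
    media_type, charset = parse_content_type(content_type)
    result = {"media_type": media_type, "charset": charset, "doctype": None}
    if charset is None:
        return result  # no charset label: this sketch does not guess
    # Layer 1: the charset label turns octets into characters.
    text = body.decode(charset)
    # Layer 2: only the text/html label licenses looking for an HTML DOCTYPE.
    if media_type == "text/html":
        match = re.match(r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"',
                         text, re.IGNORECASE)
        if match:
            result["doctype"] = match.group(1)
    return result


BODY = (b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
        b'<html lang="EN">')
as_html = interpret("text/html; charset=utf-8", BODY)
as_plain = interpret("text/plain; charset=utf-8", BODY)
```

Note that the same octets served as text/plain yield no DOCTYPE in this sketch, because the media type label, not the content, determines which interpretation is applied.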

In short, a user agent can work step by step, starting with knowledge of the HTTP protocol and its headers, to determine the full intended interpretation of this example representation. This representation not only conforms to standards, it advertises the standards it uses so that a receiver can discover them. The Web representation of this document is in that sense self-describing. For the reasons discussed above, providing self-describing resource representations is essential if the Web is to be an information space that users and software agents can freely explore.

Good Practice

Good Practice: Web resource representations SHOULD, to the extent practical, be self-describing.

Note that the above wording takes account of the fact that HTTP headers, such as Content-Type, are considered to be part of the resource representation, even though they are not part of the HTML entity-body content. Indeed, because the information needed to begin determining the encodings used is found in the same way for all representations returned using HTTP, i.e. in standard headers such as Content-Type, HTTP facilitates the creation and deployment of self-describing resources.

In many cases, such as in the example above, a small, bounded set of such standards is sufficient for representing the information in a document, but in others, more extensible conventions are needed. The following sections discuss how technologies such as XML, RDF, RDDL, GRDDL and others can be used to support the use of extensible, application-specific representation formats, and how user agents can dynamically discover information about the formats that have been used.

4 Dynamic discovery using extensible specifications

Dynamic discovery of specifications is necessary because of the ever changing nature of the information on the Web. Indeed, many documents, particularly those that convey machine-readable data or messages, encode detailed information using specifications that may be specialized to particular purposes. These may cover details of particular data formats such as lists of customers or inventory records, results of scientific experiments, listings for television shows, descriptions of universities or their course offerings, information about molecular structures or drug tests, etc. They may also provide new ways of representing document structure, graphical images or message control structures such as SOAP headers. Because of the great variety and number of such formats and their specifications, and because new versions of such specifications are deployed often, it's not practical to assume that even most of them will be directly implemented by typical Web user agents.

A variety of Web technologies are available that allow for unambiguous labeling of the specifications being used. Furthermore, when such labels are URIs (or when, as with many XML Qualified Names, they can be mapped to URIs), it may be possible to dynamically discover on the Web the logic or code needed to understand, or at least to do partial processing of the content in question. So, just as the Web may be used to dynamically discover a great wealth of resources, it can also be used to dynamically discover the specifications, ontologies, or programs needed to interpret the representations of those resources. Web representations that use such domain- or application-specific formats should link to the information needed to interpret them.

Good Practice

Good Practice: Representations that use application- or domain-specific formats SHOULD link to the information needed to support automatic processing of those formats. [Need to find a less clumsy wording for this one...Noah]

Of course, when the standards used in a representation are widely deployed, as with HTTP, ASCII, Unicode, XML and so on, there may be no need for a client to dynamically integrate support for those standards; as described above most Web user agents come with built in support for widely used protocols and formats, including HTTP and Unicode, media-types such as text/plain, text/html, image/jpeg, and so on. Indeed, even when extensibility is desired, it is generally necessary that each user agent provide built in support for at least some standards, which should in turn be usable to discover information about others. The following sections explain how a number of Web technologies can be applied to achieve such dynamic integration of new Web representation formats. First, we consider the automatic discovery of information needed to process namespace-qualified XML markup.

4.1 Self-describing XML documents

XML documents with namespace-qualified elements are a widely used means of creating self-describing Web documents. Given that a Web document is of media type application/xml, or in the family of media types application/____+xml, recursive processing from the root element down may be applied to determine not just the overall nature of the document, but also the meaning in context of its sub-elements. Doing this, however, requires understanding of the semantics of each named element. Here we discuss one specific aspect of creating self-describing XML: the use of namespace documents that can be discovered automatically from the tag names used in the markup. Later sections of this finding describe some additional techniques for creating self-describing XML.
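As a small illustration of the media type test mentioned above, the following sketch checks whether a Content-Type value names application/xml or a member of the application/...+xml family. The function name is_xml_media_type is hypothetical, and the check deliberately covers only the two cases named in the text.

```python
def is_xml_media_type(media_type):
    """True for application/xml and for application/<something>+xml."""
    # Drop any parameters (e.g. "; charset=utf-8") and normalize case.
    essence = media_type.split(";", 1)[0].strip().lower()
    if essence == "application/xml":
        return True
    type_, _, subtype = essence.partition("/")
    return type_ == "application" and subtype.endswith("+xml")
```

For example, application/xhtml+xml passes this test, while text/html does not.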

When XML namespaces are used [XMLNamespaces], each XML element is named with what is called a "Qualified Name", which consists of a prefix and a local name. For example:

   <inventory:itemNumber>87354</inventory:itemNumber>
   <inventory:quantityAvailable>152</inventory:quantityAvailable>

Here inventory is a prefix, and we see that it is used in the names of two elements, both of which presumably have to do with describing items in some business's inventory. The first element name has a local name itemNumber and the second has local name quantityAvailable. Not shown above, but necessary for these to be well-formed XML, is that each prefix be bound to a URI, for which it is a shorthand. These bindings can be repeated on each element, or more conveniently, declared on a shared ancestor element such as the document's root:

   <inventory:inventoryItem 
        xmlns:inventory="http://example.org/inventoryNamespace">
     <inventory:itemNumber>
         87354
     </inventory:itemNumber>
     <inventory:quantityAvailable>
         152
     </inventory:quantityAvailable>
   </inventory:inventoryItem>

Although the element names are written using the prefix shorthand, the logical name of each element is a pair consisting of the namespace name URI, and the local name. The Namespaces in XML recommendation calls these pairs expanded names; for the example elements above, the namespace name is http://example.org/inventoryNamespace and the expanded names are {http://example.org/inventoryNamespace,inventoryItem}, {http://example.org/inventoryNamespace,itemNumber} and {http://example.org/inventoryNamespace,quantityAvailable}.
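The mapping from prefixed names to expanded names can be observed with a namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree module, which reports each element name in the form {namespace}local; the helper expanded_name is a hypothetical name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<inventory:inventoryItem
     xmlns:inventory="http://example.org/inventoryNamespace">
  <inventory:itemNumber>87354</inventory:itemNumber>
  <inventory:quantityAvailable>152</inventory:quantityAvailable>
</inventory:inventoryItem>"""


def expanded_name(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return None, tag  # element is not in any namespace


root = ET.fromstring(DOC)
# root.iter() visits the root first, then its descendants in document order.
names = [expanded_name(elem.tag) for elem in root.iter()]
for namespace, local in names:
    print(namespace, local)
```

The prefix inventory does not appear in the result at all: only the namespace name and the local name are significant.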

The namespace name URI serves at least two roles: the most obvious and the most widely understood is that it serves to distinguish expanded names in one namespace from those in another; the other role, and the one that's most important for purposes of this finding, is that it provides Web identification for the namespace itself. The namespace is a Web resource, and like any other resource, it can and should provide representations of itself using HTTP. A user agent processing an XML document can retrieve representations of the namespaces used in that document, and can use that retrieved information to determine how to correctly process the XML markup. The W3C TAG is currently working on a finding that will describe best practices for creating such representations of namespaces. Drafts of the finding are available at [NamespaceDocuments]. Most likely, the finding will recommend the use of [RDDL] as a preferred means of providing machine readable documentation of namespaces. RDDL is itself extensible, but it is commonly used to suggest XML Schemas (in any of several languages including the W3C XML Schema Language [Refs to be supplied]), XSLT Stylesheets, etc. that are usable with markup from the namespace being described.

Using the example above, let's assume that user Bob is browsing the Web, and that he follows a link to a resource that returns the XML above as its representation, using media type application/xml. Of course, it's very unlikely that Bob's browser has built in knowledge of the inventory XML language, but his browser probably can parse XML, and we assume that it also is aware of RDDL. When the inventory description comes back, the browser uses the techniques already described to determine the character encoding, the media type application/xml, and it discovers that the root element tag is from namespace http://example.org/inventoryNamespace. That namespace is identified by an http-scheme URI, so the browser does an HTTP GET and retrieves from the namespace resource a RDDL document.
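The step in which the browser decides to retrieve the namespace document might be sketched as follows. This is an illustration only: the function name namespace_request is hypothetical, the request is constructed but not sent (example.org does not actually serve a RDDL document), and treating https as well as http as retrievable is an assumption of this sketch.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def namespace_request(xml_bytes):
    """Build a GET request for the root element's namespace, if retrievable."""
    root = ET.fromstring(xml_bytes)
    if not root.tag.startswith("{"):
        return None  # root element is not namespace-qualified
    namespace = root.tag[1:].split("}", 1)[0]
    # Only namespace names that are HTTP URIs can be fetched this way.
    if urlsplit(namespace).scheme not in ("http", "https"):
        return None
    return urllib.request.Request(namespace, method="GET")


ITEM = (b'<inventory:inventoryItem '
        b'xmlns:inventory="http://example.org/inventoryNamespace"/>')
request = namespace_request(ITEM)
```

A real user agent would pass the request to urllib.request.urlopen (or its own HTTP stack) and then interpret the returned representation according to its own Content-Type, exactly as for any other resource.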

...Need to put sample fragment of RDDL document here...

The RDDL document in turn suggests a stylesheet that can be applied to format the inventory XML as HTML; the browser automatically retrieves and applies the stylesheet, producing HTML that is rendered on the screen. Without any manual intervention from Bob, his browser automatically displays the inventory record in a format that's convenient to read and print. Bob's browser may also be enabled for XML validation, in which case it can look in the RDDL for a link to a schema to be used for validating inventory markup, and can use it to check the document that Bob has received. Bob's browser has, in an important sense, automatically extended itself for processing of the inventory markup language.

Unless the RDDL provides a link to one or more executable programs that process inventory records, it's unlikely that Bob's browser can automatically discover everything that one might reasonably want to know about processing inventory markup. Still, even the limited automatic function described above is very useful, and RDDL is an extensible framework that can be easily adapted to provide new kinds of information about namespaces. The document Bob retrieved was self-describing: even information needed to correctly process markup specific to inventory management was available by following links that were provided in the document itself.

Typically a TAG finding would at this point include a good practice note, suggesting the use of RDDL or similar technologies to make XML documents on the Web self-describing; in this case, the details of such recommendations are likely to be provided in the TAG finding [ref to namespaces documents finding], and so they are not formally restated here. Note also that the TAG has opened an issue xmlFunctions-34 and is preparing an associated finding on the recursive interpretation of XML documents.

4.2 RDF and the Self-Describing Web

RDF [ref to RDF] plays an important and distinguished role as the preferred technology for creating self-describing Web data resources, and for integrating representations rendered using other technologies. The result is a single, global self-describing Semantic Web that integrates not only resources that are themselves built or represented using RDF, but also the other Web resources to which that RDF links. Readers unfamiliar with RDF should consult the [ref to RDF primer] as a prerequisite to understanding the discussion below.

Each RDF statement is a triple consisting of a subject, a predicate (typically the identifier for a property, or for a relationship between two Web resources), and an object (typically the value of the property or the referent of the relationship). Crucially, the subject and the predicate are themselves identified by URIs, enabling the same sort of dynamic discovery that we've already seen with namespace names — if a user agent has no built in knowledge of some particular RDF subject or relationship (or object if it's a URI), it can often use the URI to retrieve the information necessary for processing.
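The shape of a triple, and the observation that its URI-valued terms are candidates for retrieval, can be sketched as follows. The class and function names are hypothetical, as are the item and property URIs used in the example, which merely extend the inventory example from section 4.1; this is not the data model of any particular RDF library.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class Literal(str):
    """A plain literal value, as opposed to a URI."""


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # either a URI string or a Literal


def dereferenceable_terms(triple):
    """Return the terms of a triple that are HTTP URIs and so can be fetched."""
    return [
        term for term in triple
        if not isinstance(term, Literal)
        and urlsplit(term).scheme in ("http", "https")
    ]


# Hypothetical statement: item 87354 has 152 units available.
statement = Triple(
    "http://example.org/inventory/item/87354",
    "http://example.org/inventoryNamespace#quantityAvailable",
    Literal("152"),
)
terms = dereferenceable_terms(statement)
```

Here the subject and the predicate can both be looked up on the Web, while the literal object cannot.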

Indeed, RDF's Schema [ref to RDF schema] and OWL Ontology technologies [ref to OWL ontology] together offer a standard, machine-processable means of describing particular uses of RDF. Just as RDDL allowed Bob's browser to automatically discover the information needed to process the XML inventory vocabulary, RDF and OWL provide the standard means by which software can discover the relationships between RDF statements (e.g. that two seemingly differing predicates are "owl:sameAs" each other), or other information needed for processing the RDF.

RDF and its companion Semantic Web technologies ultimately provide much richer facilities for self-description than the combination of XML and RDDL. Because its model is uniform, because all of its self-description is provided in the same model as the data itself, and because all RDF information is linked into the Web as a whole, RDF provides uniquely powerful facilities for dynamic integration of a self-describing Web.

[[Need to add a Dirk/Nadia example here of why RDF is cooler than anything anyone's ever seen :-) ]]

Good Practice

Good Practice: Information provided directly in RDF, or information for which automated means can be used to discover corresponding RDF, contributes to the self-describing Semantic Web.

Because of RDF's unique role as the glue that binds the Web into a single, global self-describing framework, it's particularly important that information not originally supplied in RDF can be selectively made available in RDF. The two sections below discuss two examples: the first shows how RDFa can integrate HTML documents into the Semantic Web, and the second illustrates the use of GRDDL to extract RDF from XML documents.

4.3 Using RDFa to produce self-describing HTML

[RDFa] is a W3C draft Recommendation for embedding Semantic Web statements into ordinary HTML Web pages. This example illustrates how RDFa can integrate HTML into the self-describing Semantic Web:

Mary is exploring the Web using a browser that has been enhanced with capabilities for interpreting RDFa. Her browser knows to look through each Web page that she browses, picking out useful information from the RDFa, and helping her to use it. For example, the page might contain the following, which represents a VCard-style contact listing. (This example is adapted from one in [RDFa]):

    <p class="contactinfo" 
          xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#"
          about="http://example.org/staff/joseph">
        My name is
        <span property="contact:fn">
            Joseph Smith
        </span>
        I'm a
        <span property="contact:title">
            distinguished web engineer
        </span>
        at
        <a rel="contact:org" href="http://example.org">
            Example.org
        </a>.
        You can contact me
        <a rel="contact:email" href="mailto:joe@example.org">
            via email
        </a>.
    </p>

Even though this document is of media type application/xhtml+xml, which is not a member of the RDF family of media types, an RDFa-enabled user agent can extract RDF from this document. This document conveys as RDF a set of semantic Web statements about the Web resource http://example.org/staff/joseph. The predicates are all named with the same base URI http://www.w3.org/2001/vcard-rdf/3.0#, for which the shorthand prefix contact is established in the HTML. Using this syntax, the RDFa carries triples for relationships such as the full name of the contact (http://www.w3.org/2001/vcard-rdf/3.0#fn), which is Joseph Smith, the e-mail address (http://www.w3.org/2001/vcard-rdf/3.0#email) which is mailto:joe@example.org, and so on.
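The extraction described above can be approximated with a short sketch. It handles only the constructs used in this one example (about, property, rel and href, with prefixes declared through xmlns attributes) and is far from a conforming RDFa processor; the function names are hypothetical, and the markup is the example above with whitespace condensed.

```python
import io
import xml.etree.ElementTree as ET

SNIPPET = b"""<p class="contactinfo"
      xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#"
      about="http://example.org/staff/joseph">
    My name is
    <span property="contact:fn">Joseph Smith</span>
    I'm a
    <span property="contact:title">distinguished web engineer</span>
    at
    <a rel="contact:org" href="http://example.org">Example.org</a>.
    You can contact me
    <a rel="contact:email" href="mailto:joe@example.org">via email</a>.
</p>"""


def parse_with_prefixes(data):
    """Parse XML, returning the root element and its prefix-to-URI bindings."""
    prefixes, root = {}, None
    events = ET.iterparse(io.BytesIO(data), events=("start-ns", "start"))
    for event, item in events:
        if event == "start-ns":
            prefix, uri = item
            prefixes[prefix] = uri
        elif root is None:
            root = item  # first element started is the root
    return root, prefixes


def expand(name, prefixes):
    """Expand a prefixed name such as contact:fn into a full URI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


def walk(elem, subject, prefixes, triples):
    subject = elem.get("about", subject)  # 'about' sets the current subject
    if "property" in elem.attrib:
        # Literal object: the element's text, with whitespace normalized.
        literal = " ".join("".join(elem.itertext()).split())
        triples.append((subject, expand(elem.get("property"), prefixes), literal))
    if "rel" in elem.attrib and "href" in elem.attrib:
        # URI object: the link target.
        triples.append((subject, expand(elem.get("rel"), prefixes), elem.get("href")))
    for child in elem:
        walk(child, subject, prefixes, triples)


def extract_triples(data):
    root, prefixes = parse_with_prefixes(data)
    triples = []
    walk(root, None, prefixes, triples)
    return triples


triples = extract_triples(SNIPPET)
for triple in triples:
    print(triple)
```

Each resulting triple has http://example.org/staff/joseph as its subject and a predicate formed by appending the local name to the URI bound to the contact prefix.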

An RDFa-enabled user agent can extract these triples and integrate them with other Semantic Web information. As discussed above in 4.2 RDF and the Self-Describing Web, such Semantic Web triples are inherently self-describing. If the user agent needs more information about the processing of the email triple, for example, it can do an HTTP GET to http://www.w3.org/2001/vcard-rdf/3.0#email and use the results to get more information. With luck, that information will lead it to automatically discover that, for example, mailto:joe@example.org can indeed be used to send mail to the person named Joseph Smith. The browser can then offer Mary the option to send e-mail to Joe, or to add Joe to her address book.
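One detail of such a lookup is worth making concrete: HTTP requests do not carry fragment identifiers, so a GET for a predicate URI containing "#" is issued against the URI with the fragment removed, and the fragment is then interpreted against the representation that comes back. A minimal sketch, using the hypothetical function name retrieval_target:

```python
from urllib.parse import urldefrag


def retrieval_target(term_uri):
    """Split a URI into the part sent in the GET and the fragment kept locally."""
    url, fragment = urldefrag(term_uri)
    return url, fragment


target, fragment = retrieval_target("http://www.w3.org/2001/vcard-rdf/3.0#email")
```

In this example the request goes to http://www.w3.org/2001/vcard-rdf/3.0, and the user agent then looks for information about email within the retrieved description.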

Good Practice

Good Practice: RDFa SHOULD be used to make information conveyed in HTML self-describing.

4.4 Using GRDDL to bridge from XML to RDF

To be supplied in next version of this finding: just as RDFa lets us get triples from HTML, GRDDL lets us get triples from XML variants.

Good Practice

Good Practice: GRDDL SHOULD be used to make information conveyed in XML self-describing.

5 Conclusion

The next draft of the finding will include a brief conclusion section summarizing the highlights of the points made above.

6 Change Log

6.1 Changes in 24 May 2007 Edition

  • Changed title to "Self-describing Web"

  • New discussion of discovery of specs, role of RDF, etc.

  • Extensive editorial work.

7 References

AuthoritativeMetadata
R. Fielding, I. Jacobs, Authoritative Metadata. W3C Technical Architecture Group Finding, April, 2006. (See http://www.w3.org/2001/tag/doc/mime-respect.)
AWWW
I. Jacobs, N. Walsh, Architecture of the World Wide Web. W3C. December, 2004. (See http://www.w3.org/TR/webarch/.)
GRDDL
D. Connolly, Gleaning Resource Descriptions from Dialects of Languages (GRDDL). W3C Candidate Recommendation, May, 2007. (See http://www.w3.org/TR/grddl/.)
LeastPower
T. Berners-Lee, N. Mendelsohn, B. Adida, M. Birbeck, The Rule of Least Power. W3C Technical Architecture Group Finding, February, 2006. (See http://www.w3.org/2001/tag/doc/leastPower.)
MetadataInURI
N. Mendelsohn, S. Williams, The use of Metadata in URIs. W3C Technical Architecture Group Finding, January, 2007. (See http://www.w3.org/2001/tag/doc/metaDataInURI-31.)
NamespaceDocuments
N. Walsh, Associating Resources with Namespaces. W3C Technical Architecture Group Draft Finding, December, 2005. (See http://www.w3.org/2001/tag/doc/nsDocuments/.)
RDDL
J. Borden, T. Bray, Resource Directory Description Language (RDDL). W3C. February, 2002. (See http://www.rddl.org/.)
RDFa
B. Adida, M. Birbeck, RDFa Primer 1.0: Embedding RDF in XHTML. W3C Working Draft, March, 2007. (See http://www.w3.org/TR/xhtml-rdfa-primer/.)
XMLNamespaces
T. Bray, D. Hollander, A. Layman, R. Tobin, Namespaces in XML 1.1. W3C, August, 2006 (2nd Edition). (See http://www.w3.org/TR/xml-names11/.)

A Change log

6-Dec-2005 [NRM]: initial version

25-Feb-2007 [NRM]: trying to get it good enough to circulate