This document is also available in these non-normative formats: XML .
Copyright © 2006, 2007, 2008 W3C ® ( MIT , INRIA , Keio ), All Rights Reserved. W3C liability , trademark , document use , and software licensing rules apply.
The Web is designed to support flexible exploration of information by human users and by automated agents. For such exploration to be productive, information published by many different sources and for a variety of purposes must be comprehensible to a wide range of Web client software, and to users of that software.
HTTP and other Web technologies can be used to deploy resource
representations that are self-describing: information about the
encodings used for each representation is provided explicitly
within the representation. Starting with a URI, there is a standard
algorithm that a user agent can apply to retrieve and interpret
such representations. Furthermore, representations can be grounded
in the Web, by ensuring that specifications required to
interpret them are determined unambiguously based on the URI, and
that explicit references connect the pertinent specifications to
each other. Web-grounding reduces ambiguity as to what has been
published in the Web, and by whom. When such self-describing,
Web-grounded resources are linked together, the Web as a whole can
support reliable, ad hoc discovery of information.
This finding describes how document formats, markup conventions,
attribute values, and other data formats can be designed to
facilitate the deployment of self-describing, Web-grounded Web content.
This document is an editors' copy that has no official standing.
This document has been produced for the W3C Technical Architecture Group (TAG) . It is an editor's draft that has not been approved by the TAG, and it includes revisions motivated by discussions held at the May 2008 Face to Face Meeting of the TAG .
Additional TAG findings , both accepted and in draft state, may also be available. The TAG may incorporate this and other findings into future versions of the [AWWW] .
Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org ( archive ).
1 Introduction
2 The Web's Standard Retrieval Algorithm
3 Use of Widely Deployed Standards and Formats
4 Creating New Formats and Standards
4.1 Use Existing URI Schemes, Protocols, and Media Types
4.2 URI-based Extensibility
4.2.1 Example: The Atom Syndication Format
4.2.2 Example: Microformats
4.2.3 Self-describing XML Documents
5 RDF and the Self-Describing Semantic Web
5.1 Using RDFa To Produce Self-describing HTML
5.2 Using GRDDL to Bridge From XML To RDF
6 Accountability and Grounding Information in the Web
6.1 Grounding New Specifications in the Web
7 Conclusions
8 References
The World Wide Web has at least two characteristics that distinguish it from many other shared information spaces:
The Web is global: the documents on the Web are contributed by and accessed by a very large number of users.
Supporting ad-hoc exploration is a goal of the Web. Users must therefore be able to get useful information from documents prepared by people whom they don't know, and with whom they have not coordinated in advance.
The chapters below explain in more detail how the following
techniques can be used to create, deploy and access
self-describing Web resource representations that can be
correctly interpreted using only widely available information:
Each representation should include standard machine-readable indications, such as HTTP Content-type headers, XML encoding declarations, etc., of the standards and conventions used to encode it.
Documents used as Web resource
representations should, when practical, be encoded using widely
deployed formats such as text/html and
image/jpeg , and deployed using HTTP.
Machine-processable specifications for interpreting new formats should be provided on the Web, and linked from representations that use the formats. Examples of linkable specifications include OWL ontologies, RDDL documents, GRDDL transformations, etc. By following links to such specifications, user agents can dynamically obtain information needed to process new representation formats.
Web resource representations should be grounded in the Web: i.e., the specifications required for their interpretation should be discoverable by recursively following links starting with the specification for URIs [URI] .
For integration with the Semantic Web, self-describing
representations should convey RDF triples, either directly in the
representation, by linking to the triples (perhaps using
<link> elements in HTML or the
link: header in HTTP), or by linking to
transformations using technologies such as GRDDL.
A standard HTTP-based algorithm is used to deploy, retrieve and interpret self-describing Web resource representations.
Furthermore, when self-describing representations are linked together, the Web as a whole can support reliable, ad hoc discovery of information.
Principle
Self-describing resources promote ad hoc discovery of information.
Good Practice
Web resource representations should be self-describing.
The sections below discuss in more detail the techniques needed to create self-describing content for the Web, how to extend the Web with new formats that are themselves self-describing, how to publish self-describing Semantic Web data, why it's important that interpretation of Web representations be grounded unambiguously in the core specifications of the Web, and how a standard HTTP-based algorithm enables users to retrieve and interpret self-describing resource representations.
HTTP is the most widely deployed protocol on the Web, and it is designed to facilitate the deployment of self-describing Web resource representations. Indeed, there is a standard algorithm that a user agent can employ to attempt to obtain and interpret the representation of any Web resource that is accessible using the HTTP protocol. Consider the following example, which is representative of many simple Web interactions:
Bob is reading a Web page which includes a link to
http://example.com/todaysnews . Bob has had no
previous contact with the owner of the referenced resource, and his
browser has not been specially configured for access to it. The
steps taken by Bob's browser when he clicks the link illustrate a
typical path through the standard retrieval algorithm of the Web
(readers unfamiliar with the HTTP protocol may find it useful to
consult either [HTTP] , or one of the many HTTP
introductions available on the Web). Bob's browser...
parses the URI and, from the http: at the
beginning, determines that the http scheme has been used — this
tells the browser that a representation retrieved using the HTTP
protocol is authoritative.
looks up the DNS name [DNS]
example.com to determine the associated IP address
opens a TCP stream to port 80 at the IP address determined above
formats an HTTP GET request for resource
/todaysnews , and sends that to the server:
GET /todaysnews HTTP/1.1
Host: example.com
User-Agent: TAG Sample HttpClient v1.0
Accept: */*
Accept-language: en-us
reads this response from the server:
HTTP/1.1 200 OK
Date: Tue, 28 Aug 2007 01:49:33 GMT
Server: Apache
Content-Type: text/html; charset=utf-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Today's news</title>
</head>
<body>
<h1>Today's News: Oh boy!!</h1>
[HTML FOR NEWS REPORT HERE]
</body>
</html>
from the status code (200) determines that the request has been
successfully processed, and that a representation of the resource
is available, described by the Content-Type header and carried in
the entity-body
inspects the returned Content-Type and determines
that it is UTF-8 encoded
text/html , a standard media type that the browser
supports
passes the entity-body to its HTML rendering
engine, which uses the markup in the HTML to determine the title of
the page (Today's News), the rest of the document's structure, and
so on — the browser presents the page to Bob
Neither Bob nor his browser has any advance knowledge of the
nature of the resource or the fact that its representation is
provided in HTML, yet the browser successfully retrieves the
representation, determines its format, and renders it for him. The
link could have been to an image/jpeg picture, an
application/atom+xml feed, or to a document containing
application/rdf+xml data. Bob's browser could in each
case determine the format. Indeed, as Bob continues to browse the
Web, his browser is able to determine the format of each
representation that is retrieved, and can determine how to present
it to him. This example shows how HTTP enables the deployment of
self-describing Web resources.
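The steps taken by Bob's browser can be sketched in a few lines of Python. This is an illustrative sketch only, not a conforming HTTP client: the function names plan_request and interpret_response are hypothetical, the DNS lookup and TCP connection are left out, and the response is assumed to be available already as text.

```python
from email.parser import Parser
from urllib.parse import urlsplit


def plan_request(uri):
    """Derive host, port and GET request text from an http-scheme URI."""
    parts = urlsplit(uri)
    if parts.scheme != "http":
        # Only the http scheme licenses the steps sketched here.
        raise ValueError("not an http URI: " + uri)
    host = parts.hostname
    port = parts.port or 80  # default HTTP port when none is given
    path = parts.path or "/"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Accept: */*\r\n"
        "\r\n"
    )
    return host, port, request


def interpret_response(raw):
    """Return (status code, media type, charset, body) from response text."""
    text = raw.replace("\r\n", "\n")
    head, _, body = text.partition("\n\n")
    status_line, _, header_text = head.partition("\n")
    code = int(status_line.split(" ", 2)[1])
    headers = Parser().parsestr(header_text, headersonly=True)
    if headers["Content-Type"] is None:
        # No label was sent, so the representation does not describe itself.
        return code, None, None, body
    return code, headers.get_content_type(), headers.get_content_charset(), body
```

Given the response shown in the example, interpret_response reports status 200, media type text/html and charset utf-8, which is all the browser needs in order to select its HTML rendering engine.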
Consider instead a different example, in which Bob clicks on a
link to ftp://example.com/todaysnews . Although Bob's
browser can easily open an FTP connection to retrieve a file, there
is no way for the browser to reliably determine the nature of the
information received. Even if the URI were
ftp://example.com/todaysnews.html the browser would be
guessing if it assumed that the file's contents were HTML, since no
normative specification ensures that data from ftp URIs ending in
.html is in any particular format.
The Web's retrieval algorithm works best when used with the core suite of protocols and formats that are most widely deployed, and that are capable of supporting retrieval of self-describing representations. These core technologies include: DNS, HTTP 1.1, HTML 4, XML, as well as widely deployed image formats such as image/jpeg and image/gif. As discussed in 5 RDF and the Self-Describing Semantic Web , RDF, OWL and GRDDL are among the additional core technologies that enable self-describing Semantic Web content. A flow diagram illustrating more details of the Web's standard retrieval algorithm is provided in A Diagram of the Web's Retrieval Algorithm .
Successful communication depends on the supplier and the consumer(s) of a document having a shared understanding of the information conveyed, and that in turn requires at least some shared assumptions about the form in which the information is represented. The simplest way to achieve this is if the media type, the document encoding, and any other conventions used for the representation are standards and are widely deployed.
Consider Susan, who buys a new digital camera. The software
supplied with her camera uploads photos to the Web using the
widely-deployed image/jpeg media type, and her Web
server correctly labels served representations with that
Content-Type . Millions of user agents deployed around
the world are preconfigured to display Susan's photographs and to
extract metadata such as camera settings from them. Search engines
are likely to index them in helpful ways too.
Now consider instead Mary, who buys a different camera with
software that does not use widely deployed Web formats.
Indeed, the camera's manufacturer has invented a new "raw" file
format that takes advantage of the camera's special features. The
provided photo management software not only uses that format
locally, it also uploads photos to Mary's Web server in that same
form. Indeed, it even uploads a .htaccess
file , configuring the server to label served representations
with the proprietary Content-Type
image/x-fancyrawphotoformat . In this example, there
are no outright violations of Web architecture, but the decision to
use an uncommon, proprietary, unregistered and apparently
experimental media type is unfortunate. No existing Web user agents
recognize the image/x-fancyrawphotoformat media type,
search engine spiders are unlikely to extract useful information
from pictures in that format, and so on. Unlike Susan's, which can
be viewed by almost anyone, Mary's photos are at best useful to a
few people who have the proprietary software needed to decode
them.
Good Practice
Web resource representations should be published using widely deployed standards and formats.
The techniques described above apply in the many cases where
widely deployed media types such as image/jpeg are
sufficient, but the Web is used for a broad and continually growing
range of information. No fixed set of formats and standards can
fully meet the need to encode all such information for machine
processing. Of course, ways can be found to convey almost any
information using standard media types. An employment record, for
example, can be transmitted as either text/plain or
text/html . The resulting document may be quite
suitable for browsing, but it might not facilitate automated
discovery of the employee's name, his or her date of hire, and so
on. To meet such needs, new standards must be created, e.g. for
marking up the names and dates. Similarly, the need may arise to
use new values for individual fields such as rel
attributes on HTML link elements (see [TAGIssue51] ).
So, although the Web requires self-describing documents that can be understood using only widely deployed standards, there is also a continual need for new formats and encoding conventions. How can new formats and encodings be deployed in a manner that is self-describing? The following sections explore ways of creating new formats and encoding conventions that maximize interoperability with existing Web infrastructure, and that can be used to create self-describing documents.
Innovations can be introduced to the Web at many different architectural layers. For example:
New URI schemes can be introduced
New transfer protocols can be deployed
New media types can be introduced
New namespace-qualified markup can be defined for XML
New RDF properties and ontologies can be defined for the Semantic Web
Often, a given capability could in principle be deployed at any of several different layers. For example, new sorts of content, such as movies, could be made available using new URI schemes and/or with new protocols, but doing so would require updating hundreds of millions of user agents, servers, proxies, and so on to understand these changes to the core mechanisms of the Web. Usually it is preferable to leverage the existing core mechanisms of the Web, such as http-scheme URIs and the HTTP protocol, as these are widely deployed. Indeed, one should usually leverage as many existing layers of the Web's architecture as is practical when introducing new function.
Good Practice
When extending the Web with new formats and functions, use existing URI schemes, protocols, and media types wherever practical.
One way to do this is to use URI-based extensibility within existing media types, as described in the sections below.
Many documents, particularly those that convey machine-readable data or messages, encode information using specifications that are specialized to particular purposes. Such specifications may cover details of particular data formats such as lists of customers or inventory records, results of scientific experiments, listings for television shows, details of university course offerings, information about molecular structures or drug tests, etc. Because of the great variety and number of such formats and their specifications, it's not practical to assume that even most of them will be directly implemented by typical Web user agents. Instead, the Web provides means by which the necessary specifications can be discovered, and to a significant degree implemented, dynamically and automatically. This is done by:
ensuring that every specification, and in many cases each markup tag name or data value used, is identified with a URI
ensuring that such URIs are used in the instance either directly as data values or tag names, or else to identify the encodings used
including in Web representations URIs that identify the specifications needed to interpret those representations
In other words, it should be possible to discover from each Web representation the conventions used to encode it, and particularly in cases where those conventions are not widely deployed, to find within the representations links to specifications, ontologies and/or programs necessary for interpreting the representation. So, just as the Web may be used to dynamically discover a great wealth of resources, it can also be used to dynamically discover the specifications, ontologies, or programs needed to interpret the representations of those resources.
Good Practice
Web representations should link to the information needed to support automatic processing of those representations.
The Atom Syndication Format [ATOM] is an
XML-based format for syndicating information about blogs and other
Web resources. ATOM entries can include
<atom:link> elements such as the following:
<entry>
<title>An interesting picture</title>
<link rel="enclosure" type="image/jpeg" length="12345"
href="http://example.org/interestingPic"/>
<content type="xhtml" xml:lang="en"
xml:base="http://example.org/">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>[Update: Here's an interesting picture.]</p>
</div>
</content>
</entry>
The link elements identify external resources, in this case an
image/jpeg photograph. Furthermore, each link can
carry a rel attribute which specifies the relationship
between the linked resource and the ATOM entry that links it. In
the example above, the relationship is specified as
enclosure which, according to the ATOM specification,
indicates that the linked photograph may have been too large for
inline processing with the rest of the feed.
What's of interest for this finding is the fact that values of
the rel attribute are URIs (actually [IRI]s, which are the internationalized form of URIs),
or else the values can be mapped to URIs. This means that anyone,
anywhere can invent a new sort of link relationship, can assign a
URI to identify that relationship, and can use that value in the
rel attribute. For example:
<entry>
<title>An interesting picture</title>
<link rel="http://example.org/SomeNewATOMRelationship"
type="image/jpeg" length="12345"
href="http://example.org/interestingPic"/>
<content type="xhtml" xml:lang="en"
xml:base="http://example.org/">
<div xmlns="http://www.w3.org/1999/xhtml">
<p>[Update: Here's an interesting picture.]</p>
</div>
</content>
</entry>
Furthermore, anyone doing this can (and indeed should) provide
information about that new relationship via HTTP from the assigned
URI. For convenience, the ATOM specification also provides that
short form names such as enclosure in the first
example can be registered with IANA, and ATOM provides a
deterministic mapping to a URI for each of these. These URIs are
formed by prepending the fixed base URI
http://www.iana.org/assignments/relation/ to the short
form. Thus, the first example above is in fact using the
relationship http://www.iana.org/assignments/relation/enclosure.
This example shows how use of URIs for data values enables distributed assignment of new values. More importantly for this finding, the use of URIs for such values provides the opportunity for information about those values to be discovered dynamically on the Web.
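The mapping from rel values to URIs can be sketched as follows. This is a minimal illustration, assuming that any value with a URI scheme is already an IRI and that everything else is a registered short form; the function name link_relation_uri is hypothetical and the check is deliberately simpler than a full IRI parser.

```python
from urllib.parse import urlsplit

# Fixed base URI that is prepended to registered short-form names.
IANA_RELATION_BASE = "http://www.iana.org/assignments/relation/"


def link_relation_uri(rel):
    """Map the value of a rel attribute to the URI that identifies it."""
    if urlsplit(rel).scheme:
        # The value is already a full URI, e.g. one minted by a third party.
        return rel
    # Short forms such as "enclosure" map deterministically to a URI.
    return IANA_RELATION_BASE + rel
```

With this mapping both examples above end up identified by a URI, so a user agent that does not recognize a relationship can attempt to retrieve information about it.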
[Microformats] provide a simple
means of marking up data in HTML Web pages. The presence of a
microformat is typically indicated by the appearance of an
identifying value such as vcard
in an HTML class attribute, and particular data items are usually
marked with other class values. For example, this hCard provides
contact information for the North American office of the W3C:
<div class="vcard">
<a class="fn org url" href="http://www.w3.org/">World Wide Web Consortium</a>
<div class="adr">
<span class="type">Work</span>:
<div class="street-address">32 Vassar Street</div>
<span class="locality">Cambridge</span>,
<abbr class="region" title="Massachusetts">MA</abbr>
<span class="postal-code">02139</span>
<div class="country-name">USA</div>
</div>
<div class="tel">
<span class="type">Work</span> +1-617-253-2613
</div>
<div class="tel">
<span class="type">Fax</span> +1-617-258-5999
</div>
</div>
In general, microformats such as hCard are not self-describing,
because there is no requirement in the HTML media type
specifications that class attribute values such as
vcard or type be interpreted per the
hCard specification. Indeed, lacking any specific indication that
the resource owner has intended this interpretation, it is
dangerous for clients to assume hCard semantics — there is a real
risk that some HTML Web pages use values like type ,
value or even in principle vcard for
other purposes.
Unlike some other microformats, hCard does provide an option for
deploying in a way that is self-describing. The hCard profile
specifies a value for the profile
attribute of the HTML 4.01 [HTML401]
<HEAD> element:
<head profile='http://www.w3.org/2006/03/hcard'>
and presence of this profile value indicates that class attributes can be reliably interpreted per the hCard specification. (Note, however, that there is ongoing discussion as to whether the profile attribute will be included as part of HTML 5, and if not, whether some other mechanism will be provided for signaling the use of extensions such as microformats.)
So, microformats are self-describing only when profiles or other
means licensed by a pertinent media type specification are used to
enable them. Unfortunately, few microformats have such profiles,
and even when profiles are available, evidence suggests that they
are not universally applied. User agents that infer the presence of
microformats without reliable indicators such as
<HEAD> element profiles are at risk of
extracting incorrect data from Web pages.
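A cautious user agent could apply a check along the following lines before extracting hCard data. This is a sketch under stated assumptions: the class HCardDetector and the function has_licensed_hcard are hypothetical names, and only the profile URI shown above is treated as a reliable indicator.

```python
from html.parser import HTMLParser

HCARD_PROFILE = "http://www.w3.org/2006/03/hcard"


class HCardDetector(HTMLParser):
    """Record the head profile URIs and whether class="vcard" appears."""

    def __init__(self):
        super().__init__()
        self.profiles = set()
        self.vcard_seen = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "head":
            # The profile attribute holds a space-separated list of URIs.
            self.profiles.update((attributes.get("profile") or "").split())
        if "vcard" in (attributes.get("class") or "").split():
            self.vcard_seen = True


def has_licensed_hcard(html_text):
    """True only if vcard markup is present and the hCard profile is declared."""
    detector = HCardDetector()
    detector.feed(html_text)
    detector.close()
    return detector.vcard_seen and HCARD_PROFILE in detector.profiles
```

A page that uses class="vcard" without the profile is reported as not licensed, which reflects the risk described above: the class value alone does not show that the author intended hCard semantics.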
XML Namespaces [XMLNamespaces]
facilitate the creation of self-describing XML documents. Given
that a Web document is of media type application/xml ,
or in the family of media types application/____+xml ,
recursive processing from the root element may be applied to
determine not just the overall nature of the document, but also the
meaning in context of all sub-elements. Doing this, however,
requires understanding of the semantics of each named element.
Although a few specific XML variants such as
application/xhtml+xml may be directly supported by
some user agents, no user agent can build in support for the ever
growing set of XML languages used for Web representations. This
section describes how namespace documents, discoverable from the
XML tag names in the markup, can be used to make such languages
self-describing, and to enable automated processing of them.
When XML namespaces are used, each XML element is named with a
qualified
name , consisting of a prefix and a local name. In the
following example, the root element has the qualified name
<inventory:inventoryItem> :
<inventory:inventoryItem
xmlns:inventory="http://example.org/inventoryNamespace">
<inventory:itemNumber>
87354
</inventory:itemNumber>
<inventory:quantityAvailable>
152
</inventory:quantityAvailable>
</inventory:inventoryItem>
Qualified names map to expanded names
such as
{http://example.org/inventoryNamespace,inventoryItem}
, comprised of a namespace name URI (
http://example.org/inventoryNamespace ) and a local
name ( inventoryItem ). The namespace name URI serves
at least two roles: the most obvious and the most widely understood
is to distinguish expanded names in one namespace from those in
another; the other role, and the one that is most important for
purposes of this finding, is that it provides Web identification
for the namespace itself. The namespace is a Web resource, and like
any other resource, it can and should provide representations using
HTTP. A user agent processing an XML document can retrieve
descriptions of the namespaces used in that document, and can use
that retrieved information to determine how to correctly process
the XML markup. The TAG Finding "Associating Resources with
Namespaces" [NamespaceDocuments]
, recommends the use of [RDDL] as a preferred
means of documenting namespaces. RDDL is itself extensible, but it
is commonly used to suggest XML Schemas (in any of several
languages), XSLT Stylesheets, etc. that are usable with markup from
the namespace being described.
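The mapping from qualified names to expanded names can be observed with a namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports each tag as '{namespace}local'; the helper name expanded_name is hypothetical.

```python
import xml.etree.ElementTree as ET

INVENTORY_XML = """
<inventory:inventoryItem
    xmlns:inventory="http://example.org/inventoryNamespace">
  <inventory:itemNumber>87354</inventory:itemNumber>
  <inventory:quantityAvailable>152</inventory:quantityAvailable>
</inventory:inventoryItem>
"""


def expanded_name(element):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if element.tag.startswith("{"):
        namespace, _, local = element.tag[1:].partition("}")
        return namespace, local
    return None, element.tag  # element is in no namespace


root = ET.fromstring(INVENTORY_XML.strip())
namespace_uri, local_name = expanded_name(root)
# namespace_uri is the URI a user agent could retrieve with an HTTP GET
# to look for a description of the inventory language.
```

The prefix itself plays no part in the result: only the namespace name URI and the local name identify the element.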
Example: assume that user Bob is browsing the Web, and that he
follows a link to a resource that returns the XML above as its
representation. Specifically, Bob's browser uses 2 The Web's Standard Retrieval Algorithm to
retrieve the representation, to determine its character encoding,
and to discover that its Content-type is
application/inventory+xml . Of course, it's very
unlikely that Bob's browser has built in knowledge of the inventory
XML language, but the Content-type makes clear [XMLMediaType] that the representation can be
interpreted as XML with Namespaces. The root element tag is from
namespace http://example.org/inventoryNamespace ,
which uses the http scheme, so Bob's browser does an HTTP GET from
that URI. What comes back is a RDDL document containing the
following <rddl:resource> element:
<rddl:resource
xlink:role="http://www.w3.org/1999/XSL/Transform"
xlink:arcrole="http://www.w3.org/1999/xhtml"
xlink:href="http://example.org/InventoryToBrowsableHTML.xslt"
xlink:title="Transform Inventory XML to HTML for Browsing">
</rddl:resource>
This designates a stylesheet (
http://example.org/InventoryToBrowsableHTML.xslt )
that can be applied to format the inventory XML as HTML — the
browser automatically retrieves and applies the stylesheet,
producing HTML that is rendered on the screen. Without any manual
intervention from Bob, his browser automatically displays the
inventory record in a format that is convenient to read and print.
Bob's browser may also be enabled for XML validation, in which case
it can look in the RDDL for a link to a schema to validate the
inventory markup.
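The lookup performed by Bob's browser can be sketched as a search of the RDDL document for resources with a given role. The sketch makes several assumptions: the RDDL and XLink namespace URIs are taken to be http://www.rddl.org/ and http://www.w3.org/1999/xlink, the enclosing div is a stand-in for a complete RDDL document, and the function name find_related is hypothetical.

```python
import xml.etree.ElementTree as ET

RDDL_NS = "http://www.rddl.org/"            # assumed RDDL namespace
XLINK_NS = "http://www.w3.org/1999/xlink"   # assumed XLink namespace
XSLT_ROLE = "http://www.w3.org/1999/XSL/Transform"

SAMPLE_RDDL = f"""
<div xmlns="http://www.w3.org/1999/xhtml"
     xmlns:rddl="{RDDL_NS}" xmlns:xlink="{XLINK_NS}">
  <rddl:resource
      xlink:role="{XSLT_ROLE}"
      xlink:arcrole="http://www.w3.org/1999/xhtml"
      xlink:href="http://example.org/InventoryToBrowsableHTML.xslt"
      xlink:title="Transform Inventory XML to HTML for Browsing"/>
</div>
"""


def find_related(rddl_text, role):
    """Return the href of every rddl:resource whose xlink:role matches."""
    root = ET.fromstring(rddl_text.strip())
    return [
        resource.get(f"{{{XLINK_NS}}}href")
        for resource in root.iter(f"{{{RDDL_NS}}}resource")
        if resource.get(f"{{{XLINK_NS}}}role") == role
    ]
```

Calling find_related(SAMPLE_RDDL, XSLT_ROLE) yields the stylesheet URI from the example; a validating agent could call the same function with the role that identifies the schema language it supports.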
Bob's browser has, in an important sense, extended itself for processing of the inventory markup language. Unless the RDDL provides a link to one or more executable programs that process inventory records, it's unlikely that Bob's browser can automatically discover everything that one might reasonably want to know about processing inventory markup. Still, even the limited automatic function described above is very useful, and RDDL is an extensible framework that can be easily adapted to provide new kinds of information about namespaces. Note that because RDDL documents are themselves XML, GRDDL can be applied to derive RDF statements from them (see 5.2 Using GRDDL to Bridge From XML To RDF ). In this way, self-describing XML documents can be integrated with the self-describing Semantic web. [NamespaceDocuments] describes this technique in more detail.
RDF [RDF] provides an interoperable means of publishing and linking self-describing Web data resources, and for integrating representations rendered using other technologies such as XML. The result is a single, global self-describing Semantic Web that integrates not only resources that are themselves built or represented using RDF, but also the other Web resources to which that RDF links, as well as those that can be mapped to RDF using technologies such as [GRDDL] . Readers unfamiliar with RDF should consult the RDF primer [RDFPrimer] as a prerequisite to understanding the discussion below.
Each RDF statement is a triple consisting of a subject, a
predicate (typically the identifier for a property, or for a
relationship between two Web resources), and an object (the value
of the property or the referent of the relationship). The subject,
the predicate, and often the object as well, are themselves
identified by URIs, enabling the dynamic discovery introduced in
4.2 URI-based Extensibility
above — if a user agent has no built in knowledge of some
particular RDF subject, relationship, or object, it can often use
the URI to retrieve the information necessary for processing.
Indeed, RDF's Schema [RDFSchema] and OWL
Ontology technologies [OWL] together offer a
standard, machine-processable means of describing relationships
between RDF statements, e.g. that two seemingly different
predicates are related to each other by owl:sameAs.
As described in 2 The Web's Standard Retrieval Algorithm , the principal purpose of the Web's core retrieval algorithm is to obtain self-describing representations of Web resources. For the self-describing Semantic Web, the algorithm is extended to achieve a more particular goal: to directly obtain RDF triples that represent or indirectly obtain RDF triples that describe the referenced resource.
Consider Amy, who uses an RDF-enabled user agent to retrieve an RDF/XML document containing the following element:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:employeeData="http://example.org/EmployeeInformation#">
<employeeData:employee rdf:about="http://example.org/Employees#BobSmith">
<employeeData:name>Bob Smith</employeeData:name>
<employeeData:email rdf:resource="mailto:BobSmith@example.org"/>
</employeeData:employee>
</rdf:RDF>
The user agent is general purpose, and although it has rules for
certain commonly used ontologies, it has no built in code to handle
the employeeData properties in the above example. To
dynamically acquire the necessary function, the agent does an HTTP
GET for http://example.org/EmployeeInformation . The
GET returns an OWL ontology, from which the agent discovers that
http://example.org/EmployeeInformation#email is
rdfs:subPropertyOf the
http://www.w3.org/2001/vcard-rdf/3.0#email property,
one that the agent recognizes as designating a person's e-mail
address. The agent offers Amy the option to send e-mail to Bob
Smith. Amy's browser has, like Bob's in the example above,
automatically extended itself for processing the employee data.
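The reasoning followed by Amy's user agent can be sketched without an RDF library. In this illustration the dictionary SUB_PROPERTY_OF is a hand-written stand-in for the rdfs:subPropertyOf statements that a real agent would read from the retrieved ontology, and the function names are hypothetical.

```python
from urllib.parse import urldefrag

VCARD_EMAIL = "http://www.w3.org/2001/vcard-rdf/3.0#email"

# Stand-in for rdfs:subPropertyOf statements found in the ontology.
SUB_PROPERTY_OF = {
    "http://example.org/EmployeeInformation#email": VCARD_EMAIL,
}

# Properties this agent has built-in rules for.
KNOWN_PROPERTIES = {VCARD_EMAIL: "e-mail address"}


def ontology_document(property_uri):
    """URI to retrieve with HTTP GET: the property URI minus its fragment."""
    return urldefrag(property_uri).url


def interpret(property_uri):
    """Follow subPropertyOf links until a recognized property is reached."""
    seen = set()
    while property_uri not in KNOWN_PROPERTIES:
        if property_uri in seen or property_uri not in SUB_PROPERTY_OF:
            return None  # nothing the agent understands was found
        seen.add(property_uri)
        property_uri = SUB_PROPERTY_OF[property_uri]
    return KNOWN_PROPERTIES[property_uri]
```

Here ontology_document gives the URI used for the GET in the example, and interpret maps the unfamiliar employeeData e-mail property onto one the agent already knows how to act on.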
Good Practice
Representations provided directly in RDF, or those for which automated means can be used to discover corresponding RDF, contribute to the self-describing Semantic Web.
Because its model is uniform, because all of its self-description is provided in the same model as the data itself, and because all RDF information is linked into the Web as a whole, RDF provides uniquely powerful facilities for dynamic integration of a self-describing Web. Therefore, it's particularly important that information not originally supplied in an RDF-specific format be convertible into RDF. The sections below discuss two means of doing this: the first shows how RDFa can integrate HTML documents into the Semantic Web, and the second illustrates the use of GRDDL to extract RDF from XML documents.
[RDFa] is a W3C draft Recommendation for embedding Semantic Web
statements into XHTML Web pages (see also [RDFaSyntaxandProcessing] ). The following
example illustrates how RDFa can integrate HTML into the
self-describing Semantic Web:
Mary is exploring the Web using a browser that has been enhanced
with capabilities for interpreting RDFa. Her browser knows to look
through each XHTML Web page that she browses, picking out
information from the RDFa, and helping her to use it. For example,
the page might contain the following HTML, which represents
a FOAF-based contact listing. (This example is
adapted from one in [RDFa] ):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0" xml:lang="en">
<head>
<title>FOAF/RDFa Demo</title>
</head>
<body>
<div typeof="foaf:Person" xmlns:foaf="http://xmlns.com/foaf/0.1/" about="http://example.org/staff/Alice">
<p property="foaf:name"> Alice Birpemswick </p>
<p> Email: <a rel="foaf:mbox" href="mailto:alice@example.com">alice@example.com</a> </p>
<p> Phone: <a rel="foaf:phone" href="tel:+1-617-555-7332">+1 617.555.7332</a> </p>
</div>
</body>
</html>
Even though this document is of media type
application/xhtml+xml [XHTMLMediaType], which is not a member of
the RDF family of media types, an RDFa-enabled user agent can
extract RDF from this document. This document conveys as RDF a set
of Semantic Web statements about the Web resource
http://example.org/staff/Alice. The predicates are all named with
the same base URI http://xmlns.com/foaf/0.1/, for which the
shorthand prefix foaf is established in the HTML. Using this
syntax, the RDFa carries triples for relationships such as the
full name of the contact (http://xmlns.com/foaf/0.1/name), which
is Alice Birpemswick, the e-mail address
(http://xmlns.com/foaf/0.1/mbox), which is
mailto:alice@example.com, and so on.
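The extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not a conforming RDFa processor: it assumes only the about, typeof, property, rel and href attributes matter, it trims whitespace from literals, and the function names (parse_with_prefixes, expand_curie, extract_triples) are hypothetical names chosen for this sketch.

```python
import io
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The element from the example page that carries the RDFa.
XHTML = """<div xmlns="http://www.w3.org/1999/xhtml"
     xmlns:foaf="http://xmlns.com/foaf/0.1/"
     typeof="foaf:Person" about="http://example.org/staff/Alice">
  <p property="foaf:name">Alice Birpemswick</p>
  <p>Email: <a rel="foaf:mbox" href="mailto:alice@example.com">alice@example.com</a></p>
  <p>Phone: <a rel="foaf:phone" href="tel:+1-617-555-7332">+1 617.555.7332</a></p>
</div>"""


def parse_with_prefixes(xml_text):
    """Parse XML and also collect its prefix-to-namespace declarations."""
    prefixes = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif root is None:
            root = payload  # the first "start" event is the document element
    return root, prefixes


def expand_curie(curie, prefixes):
    """Turn a prefixed name such as foaf:name into a full URI."""
    prefix, local = curie.split(":", 1)
    return prefixes[prefix] + local


def extract_triples(elem, prefixes, subject=None, out=None):
    """Walk the tree, inheriting the subject set by the nearest 'about'."""
    out = [] if out is None else out
    subject = elem.get("about", subject)
    if "typeof" in elem.attrib:
        out.append((subject, RDF_TYPE, expand_curie(elem.get("typeof"), prefixes)))
    if "property" in elem.attrib:
        literal = "".join(elem.itertext()).strip()  # simplification: trim whitespace
        out.append((subject, expand_curie(elem.get("property"), prefixes), literal))
    if "rel" in elem.attrib and "href" in elem.attrib:
        out.append((subject, expand_curie(elem.get("rel"), prefixes), elem.get("href")))
    for child in elem:
        extract_triples(child, prefixes, subject, out)
    return out


root, prefixes = parse_with_prefixes(XHTML)
for triple in extract_triples(root, prefixes):
    print(triple)
```

Note that the predicate URIs are produced by concatenating the namespace bound to the foaf prefix with the local name, which is why the prefix declaration in the markup is essential to the self-description.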
An RDFa-enabled user agent can extract these triples and use
them to help Mary work with the data they contain, or to integrate
with other Semantic Web information. Indeed, RDF is designed for
such use because, as discussed above in 5 RDF and the
Self-Describing Semantic Web, Semantic Web triples are inherently
self-describing. If a user agent needs more information about the
processing of the e-mail triple it can, like Amy's user agent, do
an HTTP GET for URI http://xmlns.com/foaf/0.1/, and use the
results to get more information. With luck, that information will
lead the agent to automatically discover that, in the example,
mailto:alice@example.com can indeed be used to send mail to the
person named Alice Birpemswick. The browser can then offer Mary
the option to send e-mail to Alice, or to add Alice to her
address book.
Good Practice
RDFa should be used to make information conveyed in HTML self-describing.
Documents such as the one shown above are self-describing when
served using the Content-type application/xhtml+xml. The
specification for that media type [XHTMLMediaType] designates it
as a member of the family of application/____+xml media types, and
the specification for those types [XMLMediaType] in turn provides
for markup to be interpreted based on the namespaces used in the
document. Finally, the namespace document for XHTML
[XHTMLNamespace] allows use of XHTML with RDFa. So, taken
together, these specifications provide normatively for use of RDFa
in documents served with Content-type application/xhtml+xml.
RDFa provides a standard means of encoding RDF information in XHTML documents, but many other XML variants lack that capability. Furthermore, RDFa requires explicit encoding of each triple in the XHTML instance, and that may in some cases be impractical. [GRDDL] provides a standard means of extracting triples from a broad range of XML document formats. Each GRDDL-enabled XML document links to a transformation that, when applied to the document, produces RDF triples. Typically, the same GRDDL transformation can be used on entire families of similar XML documents.
For example, assume that Albert uses a GRDDL-enabled user agent to retrieve an XML document containing the following fragment:
<employees xmlns="http://example.org/employeeNS">
<employee name="Bob Smith">
<email>BobSmith@example.org</email>
</employee>
</employees>
Note that, unlike the earlier examples, this is neither in HTML
nor in RDF; we can assume that
http://example.org/employeeNS is a namespace created
by some particular business for use in its own business documents.
Albert's agent has no built-in knowledge of this namespace, and so
cannot do much with it. Now assume that Albert instead retrieves a
different document. Most of the markup and data in it is identical
to the first, but this document is GRDDL-enabled:
<employees xmlns="http://example.org/employeeNS"
xmlns:grddl="http://www.w3.org/2003/g/data-view#"
grddl:transformation=
"http://example.org/GRDDL_For_employeeNS.xsl">
<employee name="Bob Smith">
<email>BobSmith@example.org</email>
</employee>
</employees>
Albert's user agent is GRDDL-aware, so it transforms the
<employees> information to RDF using the
supplied GRDDL_For_employeeNS.xsl transformation. If
Albert is lucky, that transformation produces RDF triples that the
agent understands, or that the agent can dynamically discover how
to process using the techniques described above in 5 RDF and the
Self-Describing Semantic Web. As in the earlier examples, Albert's
user agent offers to send mail to Bob Smith.
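The agent's behavior can be sketched as follows. This is an illustration under stated assumptions rather than a GRDDL implementation: a real agent would dereference the transformation URI and run the XSLT it finds there, whereas here a hypothetical Python function (employee_transform) stands in for GRDDL_For_employeeNS.xsl. The choice of FOAF predicates and of blank-node labels for the output is likewise an assumption, since the finding does not say what triples the transformation produces.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"
EMP_NS = "http://example.org/employeeNS"
FOAF = "http://xmlns.com/foaf/0.1/"  # assumed output vocabulary

DOC = """<employees xmlns="http://example.org/employeeNS"
    xmlns:grddl="http://www.w3.org/2003/g/data-view#"
    grddl:transformation="http://example.org/GRDDL_For_employeeNS.xsl">
  <employee name="Bob Smith">
    <email>BobSmith@example.org</email>
  </employee>
</employees>"""


def employee_transform(root):
    """Hypothetical stand-in for the XSLT named by grddl:transformation."""
    triples = []
    for index, employee in enumerate(root.findall("{%s}employee" % EMP_NS)):
        node = "_:employee%d" % index  # blank-node label for this employee
        triples.append((node, FOAF + "name", employee.get("name")))
        email = employee.findtext("{%s}email" % EMP_NS)
        if email:
            triples.append((node, FOAF + "mbox", "mailto:" + email.strip()))
    return triples


# Stand-in for retrieving a transformation by URI.
TRANSFORMS = {"http://example.org/GRDDL_For_employeeNS.xsl": employee_transform}


def grddl_triples(xml_text):
    """Return triples for a GRDDL-enabled document, or [] if it names no transformation."""
    root = ET.fromstring(xml_text)
    attribute = root.get("{%s}transformation" % GRDDL_NS)
    if attribute is None:
        return []
    triples = []
    for uri in attribute.split():  # the attribute may list several URIs
        transform = TRANSFORMS.get(uri)
        if transform is not None:
            triples.extend(transform(root))
    return triples


for triple in grddl_triples(DOC):
    print(triple)
```

The same function applied to the first, non-GRDDL document returns an empty list, which mirrors Albert's agent being unable to do much with that document.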
Good Practice
GRDDL should be used to integrate XML documents into the self-describing Semantic Web.
The Web is an important medium for publishing information, and so it is often important to get agreement about what has in fact been published, and by whom. Consider the following example:
Senator Smith alleges that he has been
libeled by an article published on the Web, and so the senator
files suit against the purported publishers. He produces in court
the log of a response to an HTTP GET for URI
http://publisher.example.com/oursenator.html:
HTTP/1.1 200 OK
Date: Wed, 22 Oct 2008 02:43:22 GMT
Server: Apache
Content-Type: text/html; charset=utf-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>The problem with Senator Smith</title>
</head>
<body>
<h1>Senator Smith is a Liar and a Thief!</h1>
<p>Our senator, Mr. Smith, steals money from children,
and he lies on his income tax returns!
</p>
</body>
</html>
When confronted by the judge, the owners of the Web site attempt to claim that they have not in fact published information about the senator. "Yes", they say, "our server did accept a TCP/IP connection at port 80 and it responded with some bits, but those bits don't mean what you think they mean. They might look to you like HTML, but that's not what we intended. We haven't said anything derogatory about the senator."
In fact, according to the specifications for the Web, the publishers are wrong; those specifications indicate unambiguously that the bits returned are to be interpreted as the UTF-8 encoding of Unicode characters, that those characters are to be interpreted as HTML, and that the text within the HTML is to be read in English. According to the pertinent specifications, the publishers have indeed called the senator a liar and a thief!
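The first steps of that interpretation can be sketched with Python's standard library. This is only an illustration: the function name interpret is hypothetical, the header block is abbreviated from the log above, and the fallback to ISO-8859-1 when no charset is given reflects our reading of the HTTP/1.1 default for text types rather than anything stated in this finding.

```python
from email.parser import Parser

# Header lines taken from the logged response (status line omitted).
RAW_HEADERS = (
    "Server: Apache\n"
    "Content-Type: text/html; charset=utf-8\n"
)

# A fragment of the entity body, as the bytes sent on the wire.
BODY = "<h1>Senator Smith is a Liar and a Thief!</h1>".encode("utf-8")


def interpret(headers_text, body_bytes):
    """Use the Content-Type header to decide how to read the body bytes."""
    headers = Parser().parsestr(headers_text, headersonly=True)
    media_type = headers.get_content_type()   # e.g. "text/html"
    charset = headers.get_content_charset()   # e.g. "utf-8", or None
    text = body_bytes.decode(charset or "iso-8859-1")  # assumed fallback
    return media_type, charset, text


media_type, charset, text = interpret(RAW_HEADERS, BODY)
print(media_type, charset)
print(text)
```

The point of the sketch is that nothing about the interpretation is left to the recipient's guesswork: the media type and character encoding are both read from the response itself.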
How, though, can we know whether these
specifications apply at all? Perhaps the publishers could claim
that some other specifications applied? Unfortunately for them, the
Web is unambiguous on this point as well. Starting with a URI such
as http://publisher.example.com/oursenator.html
and the specification for URIs [URI], all the applicable
specifications for interpreting an http-scheme Web representation
can be unambiguously discovered by recursively following
references to other specifications. Using this specific example to
illustrate:
The [URI] specification indicates that the registration of URI schemes is provided for in [RFC2717]. This in turn indicates that IANA maintains a registry of URI schemes, which is at [IANASchemeRegistry]. That page in turn shows that URIs employing the http scheme are governed by RFC 2616 [HTTP].
RFC 2616 describes the interpretation of
the Content-type header, indicating that the value of this header
is an Internet Media Type, and provides a reference to
[RFC1590] for looking up the interpretation of
particular media types. RFC 1590 in turn indicates that a registry
of such media types is available from IANA, and from that
[IANAMediaTypeRegistry] one can discover that the documentation of
media type text/html is found at [HTMLMediaType]. This finally
provides the normative interpretation of the charset parameter,
verifying that the information is indeed to be interpreted as
UTF-8, and that the markup in the HTTP entity body is HTML.
Similar techniques can be used to
determine that the lang attribute in the
HTML does indeed call for text in the document to be read in
English.
...and so on.
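The chain of references traced in the steps above can be modeled as a small directed graph and searched mechanically. The sketch below is illustrative only: the dictionary records just the references named in this example, keyed by the citation labels used in this finding, and the function name reference_path and the label SomePrivateFormat are hypothetical.

```python
from collections import deque

# Each specification maps to the specifications or registries it refers to,
# restricted to the links mentioned in the example.
REFERENCES = {
    "URI": ["RFC2717"],
    "RFC2717": ["IANASchemeRegistry"],
    "IANASchemeRegistry": ["HTTP"],
    "HTTP": ["RFC1590"],
    "RFC1590": ["IANAMediaTypeRegistry"],
    "IANAMediaTypeRegistry": ["HTMLMediaType"],
    "HTMLMediaType": [],
}


def reference_path(start, target, references):
    """Breadth-first search for a chain of references from start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for cited in references.get(path[-1], []):
            if cited not in seen:
                seen.add(cited)
                queue.append(path + [cited])
    return None  # no chain of references exists


print(reference_path("URI", "HTMLMediaType", REFERENCES))
print(reference_path("URI", "SomePrivateFormat", REFERENCES))
```

A specification for which no such path exists, like the hypothetical SomePrivateFormat here, cannot be discovered by following references from the specification for URIs.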
Of course, determining whether Senator
Smith has been libeled is beyond the scope of Web Architecture. The
architecture does provide the means by which one can determine
unambiguously that he has been called a liar and a thief by the
owners of site publisher.example.com. In general, it is possible to
determine which specifications apply to interactions with resources
identified by any particular URI; as explained above, these
specifications can be located by recursively following references
from the specifications for URIs themselves.
Representations published using HTTP (or other schemes and protocols that are appropriately registered) are thus self-describing not just in the general sense of being interpretable using widely available information, but in the particular sense of having an interpretation that follows from the URI used for access and from the core specifications of the Web. We therefore refer to such representations as being grounded in the Web. From the URI referenced and the specification for URIs [URI], the other specifications for interacting with the resource and for interpreting its representations are determined unambiguously. Furthermore, this characteristic implies an important degree of accountability for those who serve information on the Web. As illustrated by the example above, Web architecture settles many important questions relating to what information has been published, and by whom.
As explained in 4 Creating New Formats and Standards, new formats and protocols are occasionally needed for use with the Web. For information published using those new technologies to be grounded in the Web, it's essential that the pertinent specifications be reachable by references, preferably in the form of Web links, from the specifications for existing Web technologies. Means by which this can be achieved include:
The new specification can be registered in a suitable registry, such as the IANA registries for URI schemes [IANASchemeRegistry] or Internet media types [IANAMediaTypeRegistry]. Suitable registries are those that are themselves normatively linked or referenced from applicable specifications for the Web (e.g. the IANA scheme registry is discussed in [RFC2717], which in turn is referenced from [URI]).
Existing specifications can be edited and republished to link explicitly to specifications for the new technology.
In certain cases, the new specification may be discoverable automatically. For example, when a new namespace is used in XML, it may be sufficient to publish a namespace document (see [XMLNamespaceDocuments]), as [XMLMediaType] is itself properly linked, and it already provides for extension of XML using Namespaces.
Good Practice
Specifications for interactions with Web resources, and for interpretation of resource representations, should be linked (directly or indirectly) from the specification for URIs [URI].
Ad hoc exploration of the Web is possible only if resource representations are self-describing. Using the techniques described above and starting with an http- or https-scheme URI, a user agent can proceed step by step to retrieve a representation, reliably discover the conventions that have been used to encode it, and if necessary, dynamically find instructions for processing it. Furthermore, it is possible, starting with a URI and the specification for URIs [URI], to follow successive references to specifications that apply to interactions with the identified resource. The uniform identification of pertinent specifications greatly reduces ambiguities regarding what information has been published in the Web, and by whom.
Those who invent new document formats, new markup tags, or new
conventions for encoding particular data values should use the
techniques described above to make those formats self-describing,
and to ensure that the pertinent specifications can be discovered
by following references from existing ones. When these techniques
are used, and when self-describing representations are linked
together, the Web as a whole can support reliable, ad hoc
discovery of information.