This document is also available in these non-normative formats: XML .
Copyright
©
2006,
2007,
2008,
2009
2008
W3C
®
(
MIT
,
ERCIM
INRIA
,
Keio
),
All
Rights
Reserved.
W3C
liability
,
trademark
,
document
use
,
and
software
licensing
rules
apply.
The Web is designed to support flexible exploration of information by human users and by automated agents. For such exploration to be productive, information published by many different sources and for a variety of purposes must be comprehensible to a wide range of Web client software, and to users of that software.
HTTP and other Web technologies can be used to deploy resource representations that are self-describing : information about the encodings used for each representation is provided explicitly within the representation. Starting with a URI, there is a standard algorithm that a user agent can apply to retrieve and interpret such representations. Furthermore, representations can be what we refer to as grounded in the Web , by ensuring that specifications required to interpret them are determined unambiguously based on the URI, and that explicit references connect the pertinent specifications to each other. Web-grounding ensures that the specifications needed to interpret information on the Web can be identified unambiguously. When such self-describing, Web-grounded resources are linked together, the Web as a whole can support reliable, ad hoc discovery of information.
This finding describes how document formats, markup conventions, attribute values, and other data formats can be designed to facilitate the deployment of self-describing, Web-grounded Web content.
This document is an editors' copy that has no official standing.
THIS IS A BRANCH DRAFT WITH A PROPOSAL FOR RESOLVING CONCERNS THAT RFC 3023 DOES NOT CLEARLY REQUIRE THAT NAMESPACE-QUALIFIED MARKUP BE INTERPRETED PER THE DEFINITION OF THE NAMESPACE. SEE SECTION 4.2.3 Self-describing XML Documents .
This
document
has
been
produced
by
for
the
W3C
Technical
Architecture
Group
(TAG)
.
The
TAG
It
is
an
editor's
draft
that
has
not
been
approved
this
finding
at
by
the
TAG.
During
its
22 January 2009
teleconference
of
15
January,
2009,
the
TAG
reviewed
the
previous
draft
of
14
January,
2009
.
Two
small
changes
were
requested
to
section
6
Accountability
and
Grounding
Information
in
the
Web
,
and
those
are
reflected
in
this
draft
(see
sentence
beginning
"In
fact,
the
specifications...".)
Unless
TAG
members
raise
addition
concerns
between
now
and
then,
the
TAG
intends
to
take
a
decision
on
its
teleconference
of
22
January
2009
to
publish
this
version
of
the
document
(with
suitable
final
editorial
changes)
as
an
approved
TAG
finding.
Additional TAG findings , both accepted and in draft state, may also be available. The TAG may incorporate this and other findings into future versions of the [AWWW] .
Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org ( archive ).
1
Introduction
2
The
Web's
Standard
Retrieval
Algorithm
3
Use
of
Widely
Deployed
Standards
and
Formats
4
Creating
New
Formats
and
Standards
4.1
Use
Existing
URI
Schemes,
Protocols,
and
Media
Types
4.2
URI-based
Extensibility
4.2.1
Example:
The
Atom
Syndication
Format
4.2.2
Example:
Microformats
4.2.3
Self-describing
XML
Documents
5
RDF
and
the
Self-Describing
Semantic
Web
5.1
Using
RDFa
To
Produce
Self-describing
HTML
5.2
Using
GRDDL
to
Bridge
From
XML
To
RDF
6
Accountability
and
Grounding
Information
in
the
Web
6.1
Grounding
New
Specifications
in
the
Web
7
Conclusions
8
References
Supporting ad-hoc exploration is a goal of the Web. Users must therefore be able to get useful information from documents prepared by people whom they don't know, and with whom they have not coordinated in advance. The chapters below explain in more detail how the following techniques can be used to create, deploy and access self-describing Web resource representations that can be correctly interpreted using only widely available information (note that the terms resource and representation , as used here, are defined in [AWWW] ):
Each representation should include standard machine-readable indications, such as HTTP Content-type headers, XML encoding declarations, etc., of the standards and conventions used to encode it.
Documents
used
as
Web
resource
representations
should,
when
practical,
be
encoded
using
widely
deployed
formats
such
as
text/html
and
image/jpeg
,
and
deployed
using
HTTP.
Machine-processable
specifications
for
interpreting
new
formats
should
be
provided
on
the
Web,
and
linked
to
from
representations
that
use
the
formats.
Examples
of
linkable
specifications
include
OWL
ontologies,
RDDL
documents,
GRDDL
transformations,
etc.
By
following
links
to
such
specifications,
user
agents
can
dynamically
obtain
information
needed
to
process
new
representation
formats.
Web resource representations should be grounded in the Web : i.e., the specifications required for their interpretation should be discoverable by recursively following links starting with the specification for URIs [URI] .
For
integration
with
the
Semantic
Web,
self-describing
representations
should
convey
RDF
triples,
either
directly
in
the
representation,
by
linking
to
the
triples
(perhaps
using
<link>
elements
in
HTML
or
the
link:
header
in
HTTP),
or
by
linking
to
transformations
using
technologies
such
as
GRDDL.
A standard HTTP-based algorithm is used to deploy, retrieve and interpret self-describing Web resource representations.
Furthermore, when self-describing representations are linked together, the Web as a whole can support reliable, ad hoc discovery of information.
Principle
Self-describing resources promote ad hoc discovery of information.
Good Practice
Web resource representations should be self-describing.
The sections below discuss in more detail the techniques needed to create self-describing content for the Web, how to extend the Web with new formats that are themselves self-describing, how to publish self-describing Semantic Web data, why it's important that interpretation of Web representations be grounded unambiguously in the core specifications of the Web, and how a standard HTTP-based algorithm enables users to retrieve and interpret self-describing resource representations.
HTTP is the most widely deployed protocol on the Web, and it is designed to facilitate the deployment of self-describing Web resource representations. Indeed, there is a standard algorithm that a user agent can employ to attempt to obtain and interpret the representation of any Web resource that is accessible using the HTTP protocol. Consider the following example, which is representative of many simple Web interactions:
Bob
is
reading
a
Web
page
which
includes
a
link
to
http://example.com/todaysnews
.
Bob
has
had
no
previous
contact
with
the
owner
of
the
referenced
resource,
and
his
browser
has
not
been
specially
configured
for
access
to
it.
The
steps
taken
by
Bob's
browser
when
he
clicks
the
link
illustrate
a
typical
path
through
the
standard
retrieval
algorithm
of
the
Web
(readers
unfamiliar
with
the
HTTP
protocol
may
find
it
useful
to
consult
either
[HTTP]
,
or
one
of
the
many
HTTP
introductions
available
on
the
Web).
Bob's
browser...
parses
the
URI
and,
from
the
http:
at
the
beginning,
determines
that
the
http
scheme
has
been
used
—
this
tells
the
browser
that
a
representation
retrieved
using
the
HTTP
protocol
is
authoritative
looks
up
the
DNS
name
[DNS]
example.com
to
determine
the
associated
IP
address
opens a TCP stream to port 80 at the IP address determined above
formats
an
HTTP
GET
request
for
resource
/todaysnews
,
and
sends
that
to
the
server:
GET /todaysnews HTTP/1.1 Host: example.com User-Agent: TAG Sample HttpClient v1.0 Accept: */* Accept-language: en-us
reads this response from the server:
HTTP/1.1 200 OK Date: Tue, 28 Aug 2007 01:49:33 GMT Server: Apache Content-Type: text/html; charset=utf-8 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>Today's news</title> </head> <body> <h1>Today's News: Oh boy!!</h1> [HTML FOR NEWS REPORT HERE] </body> </html>
from
the
status
code
(200)
determines
that
the
request
has
been
successfully
processed,
and
that
a
representation
of
the
resource
is
available
in
the
Content-Type
and
the
entity-body
inspects
the
returned
Content-Type
and
determines
that
it
is
UTF-8
encoded
text/html
,
a
standard
media
type
that
the
browser
supports
passes
the
entity-body
to
its
HTML
rendering
engine,
which
uses
the
markup
in
the
HTML
to
determine
the
title
of
the
page
(Today's
News),
the
rest
of
the
document's
structure,
and
so
on
—
the
browser
presents
the
page
to
Bob
Neither
Bob
nor
his
browser
has
any
advance
knowledge
of
the
nature
of
the
resource
or
the
fact
that
its
representation
is
provided
in
HTML,
yet
the
browser
successfully
retrieves
the
representation,
determines
its
format,
and
renders
it
for
him.
The
link
could
have
been
to
an
image/jpeg
picture,
an
application/atom+xml
feed,
or
to
a
document
containing
application/rdf+xml
data.
Bob's
browser
could
in
each
case
determine
the
format.
Indeed,
as
Bob
continues
to
browse
the
Web,
his
browser
is
able
to
determine
the
format
of
each
representation
that
is
retrieved,
and
can
determine
how
to
present
it
to
him.
This
example
shows
how
HTTP
enables
the
deployment
of
self-describing
Web
resources.
Consider
instead
a
different
example,
in
which
Bob
clicks
on
a
link
to
ftp://example.com/todaysnews
.
Although
Bob's
browser
can
easily
open
an
FTP
connection
to
retrieve
a
file,
there
is
no
way
for
the
browser
to
reliably
determine
the
nature
of
the
information
received.
Even
if
the
URI
were
ftp://example.com/todaysnews.html
the
browser
would
be
guessing
if
it
assumed
that
the
file's
contents
were
HTML,
since
no
normative
specification
ensures
that
data
from
ftp
URIs
ending
in
.html
is
in
any
particular
format.
The Web's retrieval algorithm works best when used with the core suite of protocols and formats that are most widely deployed, and that are capable of supporting retrieval of self-describing representations. These core technologies include: DNS, HTTP 1.1, HTML 4, XML, as well as widely deployed image formats such as image/jpeg and image/gif. As discussed in 5 RDF and the Self-Describing Semantic Web , RDF, OWL and GRDDL are among the additional core technologies that enable self-describing Semantic Web content. A flow diagram illustrating more details of the Web's standard retrieval algorithm is provided in A Diagram of the Web's Retrieval Algorithm .
Successful communication depends on the supplier and the consumer(s) of a document having a shared understanding of the information conveyed, and that in turn requires at least some shared assumptions about the form in which the information is represented. The simplest way to achieve this is if the media type, the document encoding, and any other conventions used for the representation are standards and are widely deployed.
Consider
Susan,
who
buys
a
new
digital
camera.
The
software
supplied
with
her
camera
uploads
photos
to
the
Web
using
the
widely-deployed
image/jpeg
media
type,
and
her
Web
server
correctly
labels
served
representations
with
that
Content-Type
.
Millions
of
user
agents
deployed
around
the
world
are
preconfigured
to
display
Susan's
photographs
and
to
extract
metadata
such
as
camera
settings
from
them.
Search
engines
are
likely
to
index
them
in
helpful
ways
too.
Now
consider
instead
Mary,
who
buys
a
different
camera
with
software
that
does
not
use
widely
deployed
Web
formats.
Indeed,
the
camera's
manufacturer
has
invented
a
new
"raw"
file
format
that
takes
advantage
of
the
camera's
special
features.
The
provided
photo
management
software
not
only
uses
that
format
locally,
it
also
uploads
photos
to
Mary's
Web
server
in
that
same
form.
Indeed,
it
even
uploads
a
.htaccess
file
,
configuring
the
server
to
label
served
representations
with
the
proprietary
Content-Type
image/x-fancyrawphotoformat
.
In
this
example,
there
are
no
outright
violations
of
Web
architecture,
but
the
decision
to
use
an
uncommon,
proprietary,
unregistered
and
apparently
experimental
media
type
is
unfortunate.
No
existing
Web
user
agents
recognize
the
image/x-fancyrawphotoformat
media
type,
search
engine
spiders
are
unlikely
to
extract
useful
information
from
pictures
in
that
format,
and
so
on.
Unlike
Susan's,
which
can
be
viewed
by
almost
anyone,
Mary's
photos
are
at
best
useful
to
a
few
people
who
have
the
proprietary
software
needed
to
decode
them.
Good Practice
Web resource representations should be published using widely deployed standards and formats.
The
techniques
described
above
apply
in
the
many
cases
where
widely
deployed
media
types
such
as
image/jpeg
are
sufficient,
but
the
Web
is
used
for
a
broad
and
continually
growing
range
of
information.
No
fixed
set
of
formats
and
standards
can
fully
meet
the
need
to
encode
all
such
information
for
machine
processing.
Of
course,
ways
can
be
found
to
convey
almost
any
information
using
standard
media
types.
An
employment
record,
for
example,
can
be
transmitted
as
either
text/plain
or
text/html
.
The
resulting
document
may
be
quite
suitable
for
browsing,
but
it
might
not
facilitate
automated
discovery
of
the
employee's
name,
his
or
her
date
of
hire,
and
so
on.
To
meet
such
needs,
new
standards
must
be
created,
e.g.
for
marking
up
the
names
and
dates.
Similarly,
the
need
may
arise
to
use
new
values
for
individual
fields
such
as
rel
attributes
on
HTML
link
elements
(see
[TAGIssue51]
).
So, although the Web requires self-describing documents that can be understood using only widely deployed standards, there is also a continual need for new formats and encoding conventions. How can new formats and encodings be deployed in a manner that is self-describing? The following sections explore ways of creating new formats and encoding conventions that maximize interoperability with existing Web infrastructure, and that can be used to create self-describing documents.
Innovations can be introduced to the Web at many different architectural layers. For example:
New URI schemes can be introduced
New transfer protocols can be deployed
New media types can be introduced
New namespace-qualified markup can be defined for XML
New RDF properties and ontologies can be defined for the Semantic Web
Often, a given capability could in principle be deployed at any of several different layers. For example, new sorts of content, such as movies, could be made available using new URI schemes and/or with new protocols, but doing so would require updating hundreds of millions of user agents, servers, proxies, and so on to understand these changes to the core mechanisms of the Web. Usually it is preferable to leverage the existing core mechanisms of the Web, such as http-scheme URIs and the HTTP protocol, as these are widely deployed. Indeed, one should usually leverage as many existing layers of the Web's architecture as is practical when introducing new function.
Good Practice
When extending the Web with new formats and functions, use existing URI schemes, protocols, and media types wherever practical.
One way to do this is to use URI-based extensibility within existing media types, as described in the sections below.
Many documents, particularly those that convey machine-readable data or messages, encode information using specifications that are specialized to particular purposes. Such specifications may cover details of particular data formats such as lists of customers or inventory records, results of scientific experiments, listings for television shows, details of university course offerings, information about molecular structures or drug tests, etc. Because of the great variety and number of such formats and their specifications, it's not practical to assume that even most of them will be directly implemented by typical Web user agents. Instead, the Web provides means by which the necessary specifications can be discovered, and to a significant degree implemented, dynamically and automatically. This is done by:
ensuring that every specification, and in many cases each markup tag name or data value used, is identified with a URI
ensuring that such URIs are used in the instance either directly as data values or tag names, or else to identify the encodings used
including in Web representations URIs that identify the specifications needed to interpret those representations
In other words, it should be possible to discover from each Web representation the conventions used to encode it, and particularly in cases where those conventions are not widely deployed, to find within the representations links to specifications, ontologies and/or programs necessary for interpreting the representation. So, just as the Web may be used to dynamically discover a great wealth of resources, it can also be used to dynamically discover the specifications, ontologies, or programs needed to interpret the representations of those resources.
Good Practice
Web representations should link to the information needed to support automatic processing of those representations.
The
Atom
Syndication
Format
[ATOM]
is
an
XML-based
format
for
syndicating
information
about
blogs
and
other
Web
resources.
Atom
syndications
are
served
with
with
Content-Type
application/atom+xml
,
and
thus
can
be
recognized
by
user
agents.
ATOM
entries
can
include
<atom:link>
elements
such
as
the
following:
<entry> <title>An interesting picture</title> <link rel="enclosure" type="image/jpeg" length="12345" href="http://example.org/interestingPic"/> <content type="xhtml" xml:lang="en" xml:base="http://example.org/"> <div xmlns="http://www.w3.org/1999/xhtml"> <p><[Update: Here's an interesting picture.]</p> </div> </content> </link> </entry>
The
link
elements
identify
external
resources,
in
this
case
an
image/jpeg
photograph.
Furthermore,
each
link
can
carry
a
rel
attribute
which
specifies
the
relationship
between
the
linked
resource
and
the
ATOM
entry
that
links
to
it.
In
the
example
above,
the
relationship
is
specified
as
enclosure
which,
according
to
the
ATOM
specification,
indicates
that
the
linked
photograph
may
have
been
too
large
for
inline
processing
with
the
rest
of
the
feed.
What's
of
interest
for
this
finding
is
the
fact
that
values
of
the
rel
attribute
are
URIs
(actually
[IRI]
s,
which
are
the
internationalized
form
of
URIs),
or
else
the
values
can
be
mapped
to
URIs.
This
means
that
anyone,
anywhere
can
invent
a
new
sort
of
link
relationship,
can
assign
a
URI
to
identify
that
relationship,
and
can
use
that
value
in
the
rel
attribute.
For
example:
<entry> <title>An interesting picture</title> <link rel="http://example.org/SomeNewATOMRelationship" type="image/jpeg" length="12345" href="http://example.org/interestingPic"/> <content type="xhtml" xml:lang="en" xml:base="http://example.org/"> <div xmlns="http://www.w3.org/1999/xhtml"> <p><[Update: Here's an interesting picture.]</p> </div> </content> </link> </entry>
Furthermore,
anyone
doing
this
can
(and
indeed
should)
provide
information
about
that
new
relationship
via
HTTP
from
the
assigned
URI.
For
convenience,
the
ATOM
specification
also
provides
that
short
form
names
such
as
enclosure
in
the
first
example
can
be
registered
with
IANA,
and
ATOM
provides
a
deterministic
mapping
to
a
URI
for
each
of
these.
These
URIs
are
formed
by
prepending
the
fixed
base
URI
http://www.iana.org/assignments/relation/
to
the
short
form.
Thus,
the
first
example
above
is
in
fact
using
the
relationship
http://www.iana.org/assignments/relation/enclosure
.
This example shows how use of URIs for data values enables distributed assignment of new values. More importantly for this finding, the use of URIs for such values provides the opportunity for information about those values to be discovered dynamically on the Web.
[Microformats]
provide
a
simple
means
of
marking
up
data
in
HTML
Web
pages.
The
presence
of
a
microformat
is
typically
indicated
by
the
appearance
of
an
identifying
value
such
as
vcard
in
an
HTML
class
attribute,
and
particular
data
items
are
usually
marked
with
other
class
values.
Indeed,
somewhat
confusingly,
the
[hCard]
specification
requires
that
hCards
be
identified
with
the
root
class
name
vcard
,
as
in
the
example
below.
This
hCard
provides
contact
information
for
the
North
American
office
of
the
W3C:
<div class="vcard"> <a class="fn org url" href="http://www.w3.org/">World Wide Web Consortium</a> <div class="adr"> <span class="type">Work</span>: <div class="street-address">32 Vassar Street</div> <span class="locality">Cambridge</span>, <abbr class="region" title="Massachusetts">MA</abbr> <span class="postal-code">02139</span> <div class="country-name">USA</div> </div> <div class="tel"> <span class="type">Work</span> +1-617-253-2613 </div> <div class="tel"> <span class="type">Fax</span> +1-617-258-5999 </div> </div>
In
general,
microformats
such
as
hCard
are
not
self-describing,
because
there
is
no
requirement
in
the
HTML
media
type
specifications
that
class
attribute
values
such
as
vcard
or
type
be
interpreted
per
the
hCard
specification.
Indeed,
lacking
any
specific
indication
that
the
resource
owner
has
intended
this
interpretation,
it
is
dangerous
for
clients
to
assume
hCard
semantics
—
there
is
a
real
risk
that
some
HTML
Web
pages
use
values
like
type
,
value
or
even
in
principle
vcard
for
other
purposes.
Unlike
some
other
microformats,
hCard
does
provide
an
option
for
deploying
in
a
way
that
is
self-describing.
The
hCard
profile
specifies
a
value
for
the
profile
attribute
of
the
HTML
4.01
[HTML401]
<HEAD>
element:
<head profile='http://www.w3.org/2006/03/hcard'>
and presence of this profile value indicates that class attributes can be reliably interpreted per the hCard specification. (Note, however, that there is ongoing discussion as to whether the profile attribute will be included as part of HTML 5, and if not, whether some other mechanism will be provided for signaling the use of extensions such as microformats.)
So,
microformats
are
self-describing
only
when
profiles
or
other
means
licensed
by
a
pertinent
media
type
specification
are
used
to
enable
them.
Unfortunately,
few
microformats
have
such
profiles,
and
even
when
profiles
are
available,
evidence
suggests
that
they
are
not
universally
applied.
User
agents
that
infer
the
presence
of
microformats
without
reliable
indicators
such
as
<HEAD>
element
profiles
are
at
risk
of
extracting
incorrect
data
from
Web
pages.
XML
Namespaces
[XMLNamespaces]
facilitate
the
creation
of
self-describing
XML
documents.
When
namespace-qualified
attribute
and
element
names
are
used
in
XML,
recursive
processing
from
the
root
element
may
be
applied
to
determine
not
just
the
overall
nature
of
the
document,
but
also
the
meaning
in
context
of
all
sub-elements.
Doing
this,
however,
requires
understanding
of
the
semantics
of
each
named
element.
Although
a
few
specific
XML
variants
such
as
application/xhtml+xml
may
be
directly
supported
by
some
user
agents,
no
user
agent
can
build
in
support
for
the
ever
growing
set
of
XML
languages
used
for
Web
representations.
This
section
describes
how
namespace
documents,
discoverable
from
the
XML
tag
names
in
the
markup,
can
be
used
to
make
such
languages
self-describing,
and
to
enable
automated
processing
of
them.
When
XML
namespaces
are
used,
each
XML
element
is
named
with
a
qualified
name
,
consisting
of
a
prefix
and
a
local
name.
In
the
following
example,
the
root
element
has
the
qualified
name
<inventory:inventoryItem>
:
<inventory:inventoryItem xmlns:inventory="http://example.org/inventoryNamespace"> <inventory:itemNumber> 87354 </inventory:itemNumber> <inventory:quantityAvailable> 152 </inventory:quantityAvailable> </inventory:inventoryItem>
Qualified
names
map
to
expanded
names
such
as
{http://example.org/inventoryNamespace,inventoryItem}
,
comprised
of
a
namespace
name
URI
(
http://example.org/inventoryNamespace
)
and
a
local
name
(
inventoryItem
).
The
namespace
name
URI
serves
at
least
two
roles:
the
most
obvious
and
the
most
widely
understood
is
to
distinguish
expanded
names
in
one
namespace
from
those
in
another;
the
other
role,
and
the
one
that
is
most
important
for
purposes
of
this
finding,
is
that
it
provides
Web
identification
for
the
namespace
itself.
The
namespace
is
a
Web
resource,
and
like
any
other
resource,
it
can
and
should
provide
representations
using
HTTP.
A
user
agent
processing
an
XML
document
can
retrieve
descriptions
of
the
namespaces
used
in
that
document,
and
can
use
that
retrieved
information
to
determine
how
to
correctly
process
the
XML
markup.
The
TAG
Finding
"Associating
Resources
with
Namespaces"
[NamespaceDocuments]
,
recommends
the
use
of
[RDDL]
as
a
preferred
means
of
documenting
namespaces.
RDDL
is
itself
extensible,
but
it
is
commonly
used
to
suggest
XML
Schemas
(in
any
of
several
languages),
XSLT
Stylesheets,
etc.
that
are
usable
with
markup
from
the
namespace
being
described.
Example:
assume
that
user
Bob
is
browsing
the
Web,
and
that
he
follows
a
link
to
a
resource
that
returns
the
XML
above
as
its
representation.
Specifically,
Bob's
browser
uses
2
The
Web's
Standard
Retrieval
Algorithm
to
retrieve
the
representation,
to
determine
its
character
encoding,
and
to
discover
that
its
Content-type
is
application/inventory+xml
.
Of
course,
it's
very
unlikely
that
Bob's
browser
has
built
in
knowledge
of
the
inventory
XML
language,
but
the
Content-type
makes
clear
[XMLMediaType]
that
the
representation
can
be
interpreted
as
XML
with
Namespaces.
The
root
element
tag
is
from
namespace
http://example.org/inventoryNamespace
,
which
uses
the
http
scheme,
so
Bob's
browser
does
an
HTTP
GET
from
that
URI.
What
comes
back
is
a
RDDL
document
containing
the
following
<rddl:resource>
element:
<rddl:resource xlink:role="http://www.w3.org/1999/XSL/Transform" xlink:arcrole="http://www.w3.org/1999/xhtml" xlink:href="http://example.org/InventoryToBrowsableHTML.xslt" xlink:title="Transform Inventory XML to HTML for Browsing"> </rddl:resource>
This
designates
a
stylesheet
(
http://example.org/InventoryToBrowsableHTML.xslt
)
that
can
be
applied
to
format
the
inventory
XML
as
HTML
—
the
browser
automatically
retrieves
and
applies
the
stylesheet,
producing
HTML
that
is
rendered
on
the
screen.
Without
any
manual
intervention
from
Bob,
his
browser
automatically
displays
the
inventory
record
in
a
format
that
is
convenient
to
read
and
print.
Bob's
browser
may
also
be
enabled
for
XML
validation,
in
which
case
it
can
look
in
the
RDDL
for
a
link
to
a
schema
to
validate
the
inventory
markup.
Bob's
browser
has,
by
retrieving
and
processing
the
RDDL
document,
extended
itself
for
processing
of
the
inventory
markup
language.
Unless
the
RDDL
provides
a
link
to
one
or
more
executable
programs
that
process
inventory
records,
it's
unlikely
that
Bob's
browser
can
automatically
discover
everything
that
one
might
reasonably
want
to
know
about
processing
inventory
markup.
Still,
even
the
limited
automatic
function
described
above
is
very
useful,
and
RDDL
is
an
extensible
framework
that
can
be
easily
adapted
to
provide
new
kinds
of
information
about
namespaces.
Note
that
because
RDDL
documents
are
themselves
XML,
GRDDL
can
be
applied
to
derive
RDF
statements
from
them
(see
5.2
Using
GRDDL
to
Bridge
From
XML
To
RDF
).
In
this
way,
self-describing
XML
documents
can
be
integrated
with
the
self-describing
Semantic
web.
The
"Associating
Resources
with
Namespaces"
Finding
[NamespaceDocuments]
describes
this
technique
in
more
detail.
Careful
readers
of
RFC
3023,
which
governs
Unfortunately,
this
example
is
oversimplified
in
at
least
one
way.
The
media
type
registration
[XMLMediaType]
for
application/xml
,
the
type
conventionally
used
to
transmit
XML
media
types,
will
note
on
the
Web,
does
not
unambiguously
require
that
it
only
allows
namespace
URIs
to
be
interpreted
as
specifying
the
semantics
of
qualified
names,
without
requiring
this.
This
means
be
applied
to
XML
documents.
At
this
time,
it
is
uncertain
whether
a
future
revision
to
that
processors
registration
might
require
such
an
interpretation,
but
in
the
meantime
Bob's
browser
cannot
be
sure
that
do
not
behave
as
the
interpretation
inferred
from
the
RDDL
is
the
one
described
in
intended.
In
principle,
another
media
type,
perhapes
from
the
example
above
are
not
in
violation
family
of
the
RFC.
But
processors
that
do
behave
types
application/____+xml
,
could
be
registered
to
provide
such
extensible
semantics,
and
a
user
agent
aware
of
such
a
type
could
indeed
dynamically
interpret
markup
in
that
way
are
not
only
allowed
by
the
RFC,
but
support
the
integration
of
manner
described
above.
In
any
case,
XML
into
the
self-describing
web.
qualified
names
are
an
important
example
of
URI-based
extensibility.
Those
designing
similar
languages
may
wish
to
carefully
consider
URI-based
extensibility
when
registering
media
types
for
use
with
their
languages.
RDF [RDF] provides an interoperable means of publishing and linking self-describing Web data resources, and for integrating representations rendered using other technologies such as XML. The result is a single, global self-describing Semantic Web that integrates not only resources that are themselves built or represented using RDF, but also the other Web resources to which that RDF links, as well as those that can be mapped to RDF using technologies such as [GRDDL] . Readers unfamiliar with RDF should consult the RDF primer [RDFPrimer] or the N3 Primer [N3Primer] as a prerequisite to understanding the discussion below.
Each
RDF
statement
is
a
triple
consisting
of
a
subject,
a
predicate
(typically
the
identifier
for
a
property,
or
for
a
relationship
between
two
Web
resources),
and
an
object
(the
value
of
the
property
or
the
referent
of
the
relationship).
The
subject,
the
predicate,
and
often
the
object
as
well,
are
themselves
identified
by
URIs,
enabling
the
dynamic
discovery
introduced
in
4.2
URI-based
Extensibility
above
—
if
a
user
agent
has
no
built
in
knowledge
of
some
particular
RDF
subject,
relationship,
or
object,
it
can
often
use
the
URI
to
retrieve
the
information
necessary
for
processing.
Indeed,
RDF's
Schema
[RDFSchema]
and
OWL
Ontology
technologies
[OWL]
together
offer
a
standard,
machine-processable
means
of
describing
relationships
between
RDF
statements,
e.g.
that
one
property
is
an
two
seemingly
differing
predicates
are
the
"
rdfs:subPropertyOf
owl:sameAs
of
another.
"
each
other.
As described in 2 The Web's Standard Retrieval Algorithm , the principal purpose of the Web's core retrieval algorithm is to obtain self-describing representations of Web resources. For the self-describing Semantic Web, the algorithm is extended to achieve a more particular goal: to directly obtain RDF triples that represent or indirectly obtain RDF triples that describe the referenced resource.
Consider Amy, who uses an RDF-enabled user agent to retrieve an RDF/XML document containing the following element:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:employeeData="http://example.org/EmployeeInformation#"> <employeeData:employee rdf:about="http://example.org/Employees#BobSmith"> <employeeData:name>Bob Smith</employeeData:name> <employeeData:email rdf:resource="mailto:BobSmith@example.org"/> </employeeData:employee> </rdf:RDF>
The
user
agent
is
general
purpose,
and
although
it
has
rules
for
certain
commonly
used
ontologies,
it
has
no
built
in
code
to
handle
the
employeeData
properties
in
the
above
example.
To
dynamically
acquire
the
necessary
function,
the
agent
does
an
HTTP
GET
for
http://example.org/EmployeeInformation
.
The
GET
returns
an
OWL
ontology,
from
which
the
agent
discovers
that
http://example.org/EmployeeInformation#email
is
rdfs:subPropertyOf
the
http://www.w3.org/2001/vcard-rdf/3.0#email
property,
one
that
the
agent
recognizes
as
designating
a
person's
e-mail
address.
The
agent
offers
Amy
the
option
to
send
e-mail
to
Bob
Smith.
Amy's
browser
has,
like
Bob's
in
the
example
above,
automatically
extended
itself
for
processing
the
employee
data.
Note
that
in
this
case,
the
user
agent
has
built
in
capabilities
corresponding
to
the
URI,
HTTP,
XML,
RDF,
and
RDFS
specifications,
and
awareness
of
the
vCard
ontology.
FOAF
specifications.
Additional
capability
was
enabled
dynamically
based
on
the
linked
ontology
information
acquired
at
runtime
by
following
links.
runtime.
Good Practice
Representations provided directly in RDF, or those for which automated means can be used to discover corresponding RDF, contribute to the self-describing Semantic Web.
Because its model is uniform, because all of its self-description is provided in the same model as the data itself, and because all RDF information is linked into the Web as a whole, RDF provides uniquely powerful facilities for dynamic integration of a self-describing Web. Therefore, it's particularly important that information not originally supplied in an RDF-specific format be convertible into RDF. The sections below discuss two means of doing this: the first shows how RDFa can integrate HTML documents into the Semantic Web, and the second illustrates the use of GRDDL to extract RDF from XML documents.
[RDFa] is a W3C Recommendation for embedding Semantic Web statements into XHTML Web pages (see also [RDFaSyntaxandProcessing] ). The following example illustrates how RDFa can integrate HTML into the self-describing Semantic Web:
Mary is exploring the Web using a browser that has been enhanced with capabilities for interpreting RDFa. Her browser knows to look through each XHTML Web page that she browses, picking out information from the RDFa, and helping her to use it. For example, the page might contain the following HTML, which represents a FOAF-based contact listing. (This example is adapted from one in [RDFa] ):
<?xml version="1.0" encoding="UTF-8"?> <html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0" xml:lang="en"> <head> <title>FOAF/RDFa Demo</title> </head> <body> <div typeof="foaf:Person" xmlns:foaf="http://xmlns.com/foaf/0.1/" about="http://example.org/staff/Alice"> <p property="foaf:name"> Alice Birpemswick </p> <p> Email: <a rel="foaf:mbox" href="mailto:alice@example.com">alice@example.com</a> </p> <p> Phone: <a rel="foaf:phone" href="tel:+1-617-555-7332">+1 617.555.7332</a> </p> </div> </body> </html>
Even
though
this
document
is
of
media
type
application/xhtml+xml
[XHTMLMediaType]
,
which
is
not
a
member
of
the
RDF
family
of
media
types,
an
RDFa-enabled
user
agent
can
extract
RDF
from
this
document.
This
document
conveys
as
RDF
a
set
of
semantic
Web
statements
about
the
Web
resource
http://example.org/staff/Alice
.
The
predicates
are
all
named
with
the
same
base
URI
http://xmlns.com/foaf/0.1/
,
for
which
the
shorthand
prefix
foaf
is
established
in
the
HTML.
Using
this
syntax,
the
RDFa
carries
triples
for
relationships
such
as
the
full
name
of
the
contact
(
http://xmlns.com/foaf/0.1/#name
),
which
is
Alice
Birpemswick
,
the
e-mail
address
(
http://xmlns.com/foaf/0.1/#mbox
)
which
is
mailto:alice@example.com
,
and
so
on.
An
RDFa-enabled
user
agent
can
extract
these
triples
and
use
them
to
help
Mary
work
with
the
data
they
contain,
or
to
integrate
with
other
Semantic
Web
information.
Indeed
RDF
is
designed
for
such
use
because,
as
discussed
above
in
5
RDF
and
the
Self-Describing
Semantic
Web
,
Semantic
Web
triples
are
inherently
self-describing.
If
a
user
agent
needs
more
information
about
the
processing
of
the
email
triple
it
can,
like
Amy's
user
agent,
do
an
HTTP
GET
for
URI
http://xmlns.com/foaf/0.1/
,
and
use
the
results
to
get
more
information.
That
information
may
allow
the
agent
to
automatically
discover
that,
in
the
example,
mailto:alice@example.org
can
indeed
be
used
to
send
mail
to
the
person
named
Alice
Birpemswick
.
The
browser
can
then
offer
Mary
the
option
to
send
e-mail
to
Alice,
or
to
add
Alice
to
her
address
book.
Good Practice
To integrate HTML information into the self-describing Semantic Web, use RDFa.
RDFa
documents
such
as
the
one
shown
above
are
grounded
in
the
Web
of
data
when
served
using
the
Content-type
application/xhtml+xml
.
The
specification
for
that
media
type
[XHTMLMediaType]
designates
it
as
a
member
of
the
family
of
application/____+xml
media
types,
and
the
specification
for
those
types
[XMLMediaType]
in
turn
provides
for
markup
to
be
interpreted
based
on
the
namespaces
used
in
the
document.
Finally,
the
namespace
document
for
XHTML
[XHTMLNamespace]
allows
use
of
XHTML
with
RDFa.
So,
taken
together,
these
specifications
provide
normatively
for
use
of
RDFa
in
documents
served
with
Content-type
application/xhtml+xml
.
Note
that
in
this
case,
the
user
agent
has
built
in
capabilities
corresponding
to
the
URI,
HTTP,
XML,
XHTML
and
RDFa
and
FOAF
specifications.
The
server
used
these
technologies
to
publish,
and
there
was
successful
communication.
RDFa provides a standard means of encoding RDF information in XHTML documents, but many other XML variants lack that capability. Furthermore, RDFa requires explicit encoding of each triple in the XHTML instance, and that may in some cases be impractical. [GRDDL] provides a standard means of extracting triples from a broad range of XML document formats. Each GRDDL-enabled XML document links to a transformation that, when applied to the document, produces RDF triples. Typically, the same GRDDL transformation can be used on entire families of similar XML documents.
For example, assume that Albert uses a GRDDL-enabled user agent to retrieve an XML document containing the following fragment:
<employees xmlns="http://example.org/employeeNS"> <employee name="Bob Smith"> <email>BobSmith@example.org</email> </employee> </employees>
Note
that,
unlike
the
earlier
examples,
this
is
neither
in
HTML
nor
in
RDF;
we
can
assume
that
http://example.org/employeeNS
is
a
namespace
created
by
some
particular
business
for
use
in
its
own
business
documents.
Albert's
agent
has
no
built
in
knowledge
of
this
namespace,
and
so
can
not
do
much
with
it.
Now
assume
that
Albert
instead
retrieves
a
different
document.
Most
of
the
markup
and
data
in
it
is
identical
to
the
first,
but
this
document
is
GRDDL
enabled:
<employees xmlns="http://example.org/employeeNS" xmlns:grddl="http://www.w3.org/2003/g/data-view#" grddl:transformation="http://example.org/GRDDL_For_employeeNS.xsl> <employee name="Bob Smith"> <email>BobSmith@example.org</email> </employee> </employees>
Albert's
user
agent
is
GRDDL
aware,
so
it
transforms
the
<employees>
information
to
RDF
using
the
supplied
GRDDL_For_employeeNS.xsl
transformation,
producing
RDF
triples
from
a
vCard-based
ontology
—
that
the
agent
has
knowledge
of
this
ontology,
and
it
uses
the
triples
to
determine
recognizes
as
designating
Bob
Smith's
email
address.
As
in
the
example
above,
Albert's
agent
offers
to
send
mail
to
Bob
Smith.
(The
agent
might
also
dynamically
discover
how
to
interpret
the
triples
using
the
techniques
described
above
in
5
RDF
and
the
Self-Describing
Semantic
Web
.)
Good Practice
To integrate information conveyed in XML into the self-describing Semantic Web, use GRDDL.
Note that in this case, the user agent was programmed to understand URI, HTTP, XML, XHTML, GRDDL and the vCard ontology. The information needed for translation from the XML vocabulary into RDF was loaded dynamically.
The Web is an important medium for publishing information, and so it is often important to get agreement about what has in fact been published, and by whom. Consider the following example:
Senator
Smith
alleges
that
he
has
been
libeled
by
an
article
published
on
the
Web,
and
so
the
senator
files
suit
against
the
purported
publishers.
He
produces
in
court
the
log
of
a
response
to
an
HTTP
GET
for
URI
http://publisher.example.com/oursenator.html
:
HTTP/1.1 200 OK Date: Tue, 22 Oct 2008 02:43:22 GMT Server: Apache Content-Type: text/html; charset=utf-8 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>The problem with Senator Smith</title> </head> <body> <h1>Senator Smith is a Liar and a Thief!</h1> <p>Our senator, Mr. Smith, steals money from children, and he lies on his income tax returns! </p> </body> </html>
When confronted by the judge, the owners of the Web site attempt to claim that they have not in fact published information about the senator. "Yes", they say, "our server did accept a TCP/IP connection at port 80 and it responded with some bits, but those bits don't mean what you think they mean. They might look to you like HTML, but that's not what we intended. We haven't said anything derogatory about the senator."
In fact, the specifications for the Web can be useful in establishing that a document was indeed published containing the statement: "Senator Smith is a Liar and a Thief!". The pertinent Web specifications indicate that the bits returned are to be interpreted as the UTF-8 encoding of Unicode characters, that those characters are to be interpreted as HTML, that the text within the HTML is to be read in English, and thus that the entity body is indeed to be read as containing the sentence quoted above.
How,
though,
can
we
know
whether
these
specifications
apply
at
all?
Perhaps
the
publishers
could
claim
that
some
other
specifications
applied?
Unfortunately
for
them,
the
Web
is
clear
on
that
point
as
well.
Starting
with
a
URI
such
as
http://localnewspaper.example.com/oursenator.html
and
the
specification
for
URIs
[URI]
,
all
the
applicable
specifications
for
interpreting
an
http-scheme
Web
representation
can
be
unambiguously
discovered
by,
recursively,
following
references
to
other
specifications.
Using
this
specific
example
to
illustrate:
The
[URI]
specification
indicates
that
the
registration
of
URI
schemes
is
provided
for
in
[BCP35]
[RFC2717]
.
This
in
turn
indicates
that
IANA
maintains
a
registry
of
URI
schemes,
which
is
at
[IANASchemeRegistry]
.
That
page
in
turn
shows
that
URIs
employing
the
http
scheme
are
governed
by
RFC
2616
[HTTP]
.
RFC
2616
describes
the
interpretation
of
the
Content-type
header,
indicating
that
the
value
of
this
header
is
an
Internet
Media
Type,
and
provides
a
reference
to
[RFC1590]
for
looking
up
the
interpretation
of
particular
media
types.
RFC
1590
in
turn
indicates
that
a
registry
of
such
media
types
is
available
from
IANA,
and
from
that
[IANAMediaTypeRegistry]
one
can
discover
that
the
documentation
of
media
type
text/html
is
found
at
[HTMLMediaType]
.
This
finally
provides
the
normative
interpretation
of
the
charset
parameter,
verifying
that
the
information
is
indeed
to
be
interpreted
as
UTF-8,
and
that
the
markup
in
the
HTTP
entity
body
is
HTML.
Similar
techniques
can
be
used
to
determine
that
the
lang
attribute
in
the
HTML
does
indeed
call
for
text
in
the
document
to
be
read
in
English.
...and so on.
Of course, determining whether Senator Smith has been libeled is beyond the scope of Web Architecture. The architecture does, however, provide the means by which one can determine unambiguously the specifications that apply to interpretation of the data retrieved. In general, it is possible to determine which specifications apply to interactions with resources identified by any particular URI; as explained above, these specifications can be located by recursively following references from the specifications for URIs themselves. Whatever the other legal arguments about the case, the interpretation of the document is well defined.
Documents published using HTTP (or other schemes and protocols that are appropriately registered) are thus self-describing not just in the general sense of being interpretable using widely available information, but in the particular sense of having an interpretation that follows from the URI used for access and from the core specifications of the Web. We therefore refer to such documents as being grounded in the Web . From the URI referenced and the specification for URIs [URI] , the other specifications for interacting with the resource and for interpreting its representations are determined unambiguously. Furthermore, this characteristic implies an important degree of accountability for those who serve information on the Web. As illustrated by the example above, Web architecture settles many important questions relating to what information has been published, and by whom.
As explained in 4 Creating New Formats and Standards , new formats and protocols are occasionally needed for use with the Web. For information published using those new technologies to be grounded in the Web, it's essential that the pertinent specifications be reachable by references, preferably in the form of Web links, from the specifications for existing Web technologies. Means by which this can be achieved include:
The
new
specification
can
be
registered
in
a
suitable
registry.
Examples
of
registries
that
are
available
today
include
the
IANA
registries
for
URI
schemes
[IANASchemeRegistry]
or
Internet
media
types
[IANAMediaTypeRegistry]
.
Suitable
registries
are
those
that
are
themselves
normatively
linked
or
referenced
from
applicable
specifications
for
the
Web
(e.g.
the
IANA
scheme
registry
is
discussed
in
[BCP35]
[RFC2717]
,
which
in
turn
is
referenced
from
[URI]
).
Registries
like
this
can
be
extremely
valuable,
but
there
is
also
a
logistical
and
social
cost
to
maintaining
them,
and
a
cost
to
developers
who
must
arrange
for
the
registration
of
new
entries.
In
return
for
these
costs,
the
community
benefits
from
avoidance
of
collisions
in
the
allocation
of
identifiers,
and
in
many
cases
from
careful
review
of
new
registrations.
Existing specifications can be edited and republished to link explicitly to specifications for the new technology.
In certain cases, the new specification may be discoverable automatically. For example, when a new namespace is used in XML, it may be sufficient to publish a namespace document (see [XMLNamespaceDocuments] ), as [XMLMediaType] is itself properly linked, and it already provides for extension of XML using Namespaces.
Good Practice
Specifications for interactions with Web resources, and for interpretation of resource representations, should be linked (directly or indirectly) from the specification for URIs [URI] .
Ad hoc exploration of the Web is possible only if resource representations are self-describing. Using the techniques described above and starting with an http- or https-scheme URI, a user agent can proceed step by step to retrieve a representation, reliably discover the conventions that have been used to encode it, and if necessary, dynamically find instructions for processing it. Furthermore, it is possible starting with a URI and the specification for URIs [URI] , to follow successive references to specifications that apply to interactions with the identified resource. These specifications, recursively, define the interpretation of information published in the Web.
Those who invent new document formats, new markup tags, or new conventions for encoding particular data values should use the techniques described above to make those formats self-describing, and to ensure that the pertinent specifications can be discovered by following references from existing ones. When these techniques are used, and when self-describing representations are linked together, the Web as a whole can support reliable, ad hoc discovery of information, and can grow to include ever more powerful languages and systems.