Metadata in XHTML2

Steven Pemberton, CWI/W3C, Amsterdam

About me

Senior researcher at CWI, the Dutch National Research Centre for Mathematics and Computer Science.

Involved in the Web from the beginning: organised two workshops at the first Web Conference in 1994

Chair of the HTML and Forms working groups at W3C

Co-author of CSS, HTML4, XHTML1, XForms, XML Events, XHTML2, etc.

Abstract

Using a clever mutation of the <link> and <meta> elements, and the addition of a 'role' attribute, XHTML2 allows authors to layer real semantics on top of documents, semantics with a clear relationship to RDF, so that XHTML2 can be properly integrated into the semantic web.

But by layering semantics on top of XHTML in this way, a lot of special-purpose formats are rendered unnecessary.

This talk discusses the XHTML2 approach to Metadata.

HTML and Structure

The NITF Tutorial states:

Web authors use HTML to describe the display of their pages. NITF, on the other hand, is designed to describe the substance of news articles.

This is actually not true: HTML was designed as a structure-defining language.

The browser manufacturers, in classic Marking Behaviour and not understanding the structure-defining design of HTML, went and added presentation features.

XHTML2

XHTML2 is the next iteration in the HTML family.

XHTML1 addressed the problems of turning HTML into an XML application.

XHTML2 addresses the remaining identified problems in HTML4/XHTML1.

XHTML2 Design Aims

In designing XHTML2, a number of design aims were kept in mind to help direct the design. These included:

As generic XML as possible: if a facility exists in XML, try to use that rather than duplicating it. This means that it already works to a large extent in existing browsers (the main missing functionality being XForms and XML Events).

More structure, less presentation: use stylesheets for defining presentation.

More usability: within the constraints of XML, try to make the language easy to write, and make the resulting documents easy to use.

More accessibility: 'designing for our future selves' – the design should be as inclusive as possible.

XHTML2 (more)

Better internationalization.

Better forms: after a decade of experience, we now know how to make forms a better experience.

Less scripting: achieving functionality through scripting is difficult for the author and restricts the type of user agent you can use to view the document. We have tried to identify current typical usage, and include those usages in markup.

More device independence: new devices coming online, such as telephones, PDAs, tablets, televisions and so on, mean that it is imperative to have a design that allows you to author once and render in different ways on different devices, rather than authoring new versions of the document for each type of device.

Better semantics: integrate XHTML into the Semantic Web.

Try to make the world a Better Place

Keep old communities happy

Keep new communities happy

Needs of Semantics in XHTML

Integration with RDF/Semantic Web

Readable and writable by the HTML community

Flexible, extensible

News and metadata

News distribution is all about content and metadata.

NewsML, for instance, is essentially a big metadata wrapper around XHTML.

The question is: where should the metadata go, and how should it be expressed?

The XHTML2 approach

What we have done is craftily mutate <meta> and <link> so that they look more or less the same to the HTML author, but now have a clear relationship to RDF.

Then we generalised.

This was originally proposed in a white paper, RDF/A (warning: details have changed since it was published), and, after much work in a joint Semantic Web/HTML WG task force, was adopted into XHTML2. That work is still not quite finished, since one detail, bnodes, still has to be finalised.

The approach: meta

Extend the meta element:

Example:

<meta property="dc:creator">Steven Pemberton</meta>

This is also still allowed:

<meta property="dc:creator" content="Steven Pemberton"/>
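
As a rough illustration of why the two spellings are interchangeable, the following Python sketch reads either form and returns the same property/value pair. The function name meta_statement is hypothetical, and the rule that an explicit content attribute takes precedence over the element's text is an assumption of this sketch rather than something stated in the talk.

```python
import xml.etree.ElementTree as ET


def meta_statement(xml_text):
    """Return (property, value) for a single <meta> element.

    Assumption of this sketch: a content attribute, when present,
    supplies the value; otherwise the element's text content does.
    """
    meta = ET.fromstring(xml_text)
    value = meta.get("content")
    if value is None:
        value = (meta.text or "").strip()
    return meta.get("property"), value


# The two forms shown on the slide.
new_form = '<meta property="dc:creator">Steven Pemberton</meta>'
old_form = '<meta property="dc:creator" content="Steven Pemberton"/>'

print(meta_statement(new_form))
print(meta_statement(old_form))
```

Both calls yield the same pair, so a consumer of the metadata does not need to care which form the author chose.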

The approach: link

Extend the link element slightly:

Example:

<link rel="dc:rights"
    href="http://example.com/terms/contract123"/>

The approach: role

Add a role attribute, applicable to any element, that specifies a semantic role for that element.

Examples

<p role="nitf:byline">By Joseph P. Reporter</p>
<p role="prism:copyright">&copy; Copyright 2001,
    Wanderlust Publications.
    All rights reserved.</p>

The approach: generalise

Having done that, we then allow all the attributes of <link> and <meta> on any element.

This was already allowed:

This work is licensed under the
 <a rel="dc:rights"
    href="http://creativecommons.org/licenses/by/2.0/">
    Creative Commons Attribution License</a>.

but you can also say things like this:

<body>
      <h property="title">My Life and Times</h>
      ...

which makes the top level heading and the title of the document the same thing, so they never get out of step.

The approach: about

The about attribute allows you to describe other documents, but also parts of the current document.

Example

<meta about="#p123" ...

One use among many is to allow richer metadata than the title attribute allows. Now we can just say that

<p id="p123" title="whatever">

is equivalent to:

<p id="p123">
   <meta about="#p123" property="title">whatever</meta>
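
The equivalence can be checked mechanically. The Python sketch below collects (subject, property, value) statements from either spelling and shows that they come out identical. The helper name title_statements is hypothetical, and the fragments are closed off so that they parse as well-formed XML, which the abbreviated slide snippets are not.

```python
import xml.etree.ElementTree as ET


def title_statements(xml_text):
    """Collect (subject, property, value) title statements in a fragment.

    Handles both spellings from the text: a title attribute on an
    element with an id, and a <meta about="#id" property="title"> child.
    """
    root = ET.fromstring(xml_text)
    found = set()
    for el in root.iter():
        if el.tag == "meta" and el.get("property") == "title" and el.get("about"):
            found.add((el.get("about"), "title", (el.text or "").strip()))
        elif el.get("id") and el.get("title") is not None:
            found.add(("#" + el.get("id"), "title", el.get("title")))
    return found


# Closed versions of the two fragments on the slide.
short_form = '<p id="p123" title="whatever"/>'
long_form = '<p id="p123"><meta about="#p123" property="title">whatever</meta></p>'

print(title_statements(short_form) == title_statements(long_form))
```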

Examples

<meta property="newsML:Identification">
    <meta property="newsML:ProviderId">Reuters.com</meta>
    <meta property="newsML:DateId">20050524</meta>
    ...
<p><span content="2005-05-23">Yesterday</span>,
   <span rel="references" href="..."
    property="foaf:fullName" content="Tony Blair"
   >the prime minister</span>
    travelled to ... </p>

Microformats

Because of the layered semantics, some formats are now, strictly speaking, unnecessary.

An RSS document, for instance, exists to describe something else, so you have to author twice, or otherwise ensure that the two forms stay up to date with each other.

However, RSS is just a simple hypertext language. You could get the same effect by just marking up the very document you are describing:

<h role="rss:title">...
<p role="rss:description">...
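
To make the point concrete, here is a minimal Python sketch of how a consumer might pull feed-like fields straight out of the marked-up page. The function feed_item, the sample page, and the idea of stripping the rss: prefix to get field names are all assumptions made for illustration; the talk does not prescribe an extraction method.

```python
import xml.etree.ElementTree as ET


def feed_item(xml_text, prefix="rss:"):
    """Map each role with the given prefix to the element's text."""
    root = ET.fromstring(xml_text)
    item = {}
    for el in root.iter():
        role = el.get("role", "")
        if role.startswith(prefix):
            # itertext() also gathers text inside nested inline markup;
            # split/join normalises the whitespace.
            text = " ".join("".join(el.itertext()).split())
            item[role[len(prefix):]] = text
    return item


# Hypothetical page content, marked up in the style of the slide.
page = """
<body>
  <h role="rss:title">Metadata in XHTML2</h>
  <p role="rss:description">How <em>role</em> layers semantics on a page.</p>
  <p>Ordinary text that carries no role.</p>
</body>
"""

print(feed_item(page))
```

Because the fields are read from the document itself, there is no second file to keep in step.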

Media

Finally I should say something about media, in particular images.

In XHTML2, the src attribute (and its related attributes) may be applied to any element, not just <img>, with the implication that the referenced resource and the element's content should be considered equivalent:

<p src="map.png">Turn left out of the station,
   walk straight on to the High Street, and turn right</p>
<img type="image/jpeg" src="gates.jpg">
     Bill Gates makes speech.
</img>
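
One way to read "equivalent" is that a user agent may present either the resource or the element's content, depending on what it can handle. The Python sketch below illustrates that reading only. The function render, the supported_types parameter, and the rule that a missing type attribute counts as displayable are assumptions of this sketch, not behaviour quoted from the XHTML2 draft.

```python
import xml.etree.ElementTree as ET


def render(xml_text, supported_types):
    """Pick what to present for one element carrying a src attribute.

    Assumption: the resource is used when its declared type is supported
    (or no type is declared); otherwise the element content is used.
    """
    el = ET.fromstring(xml_text)
    src = el.get("src")
    media_type = el.get("type")
    if src and (media_type is None or media_type in supported_types):
        return ("resource", src)
    text = " ".join("".join(el.itertext()).split())
    return ("content", text)


image = '<img type="image/jpeg" src="gates.jpg">Bill Gates makes speech.</img>'

print(render(image, {"image/jpeg"}))  # a graphical device
print(render(image, set()))           # a device that cannot show JPEGs
```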

The relationship to RDF

We can now say that <meta> and <link> define RDF triples:

The URL for the predicate is obtained by concatenating the namespace URL bound to the prefix with the rest of the value.
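
The concatenation rule can be sketched in a few lines of Python. Several things here are assumptions for illustration: the function names expand and triple, the explicit PREFIXES table (a real document would declare prefixes with xmlns attributes), the example document URL, and the choice of the containing document as subject when no about attribute is given. Relative about values such as #p123 are not resolved in this sketch.

```python
import xml.etree.ElementTree as ET

# Assumed prefix bindings; normally these come from xmlns declarations.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand(value, prefixes):
    """Concatenate the namespace URL for the prefix with the local part."""
    prefix, sep, local = value.partition(":")
    if not sep:
        return value  # unprefixed values are left alone in this sketch
    return prefixes[prefix] + local


def triple(xml_text, document_url, prefixes=PREFIXES):
    """Return (subject, predicate, object) for one <meta> or <link>."""
    el = ET.fromstring(xml_text)
    subject = el.get("about", document_url)
    if el.tag == "link":
        return subject, expand(el.get("rel"), prefixes), el.get("href")
    value = el.get("content")
    if value is None:
        value = (el.text or "").strip()
    return subject, expand(el.get("property"), prefixes), value


doc = "http://example.com/doc"  # hypothetical document URL
print(triple('<meta property="dc:creator">Steven Pemberton</meta>', doc))
print(triple('<link rel="dc:rights" href="http://example.com/terms/contract123"/>', doc))
```

The only difference between the two elements in this reading is the kind of object: <meta> supplies a literal string, <link> supplies a URL.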

A parallel development, GRDDL, can be used to extract the RDF triples from an XHTML2 document.

Why this solution is nice

You can explain it using HTML concepts.

If you don't care, you can just ignore it.

It doesn't require you to learn how to use RDF to be able to benefit from it.

You can build up layers of semantics, and slowly add them to existing content.

The RDF community get their triples without the HTML community having to learn RDF.

You can layer new semantics on top of XHTML2 without having to define a new document type.

Timescale

XHTML2 is going to Last Call Really Soon.

More details: www.w3.org/TR/xhtml2, www.w3.org/MarkUp (and add /Group to the end if you are a member company).

This talk: www.w3.org/2005/Talks/05-steven-Metadata-in-XHTML2