W3C Working Group Home Page Markup Guidelines

Before you dive into this document, you might want to read an introduction to why it matters. If you have already read this document and just want the specific implementation details, see the Quick Summary.

Table of Contents

1 Introduction
2 Underlying Concepts
3 Basic Design
    3.1 Information Classes
    3.2 Information Extraction
    3.3 Output: Diagnostics and Destination
4 Getting Started
    4.1 Use the Profile!
    4.2 Group Characteristics
    4.3 Choosing Markup Styles
    4.4 Referencing Other Documents
    4.5 Representing Dates and Date Ranges
    4.6 Mixing Markup Styles
    4.7 HTML Class Attribute Lists
5 Blocked Method
    5.1 Marking Blocks
    5.2 Marking Items
    5.3 Advanced Property Classes
    5.4 Information Class-Specific Details
6 Tabled Method
    6.1 Information Class-Specific Details
7 Advanced Method
    7.1 Information Class-Specific Details
8 Gotchas
9 Templates and Examples
10 Under The Hood
11 Quick Summary
12 Test Your Page

Introduction

Working group page editors who adhere to this page structure style guide can maintain their HTML 4.0 or XHTML working group page while an automated utility regenerates the information in RDF with little to no extra effort on their part aside from the initial setup. The RDF information can then be processed and reused for applications such as the W3C At A Glance portal. This builds on much of the work Dan Connolly and Dan Brickley did.

If a working group page editor decides to pursue their own page structure, they may use the Advanced Method guidelines; otherwise they must independently generate an RDF representation of their data.

This document is intended to be read in the order presented. Readers who skip sections may miss important details.

Questions and comments to w3c-tools.

Underlying Concepts

The working group pages are "screen scraped" for their content; based on the structure and naming of elements in the pages' XHTML or tidy'd HTML 4.0, information is extracted and reformatted in RDF/XML according to a Working Group Information schema. This screen scraping is accomplished using XSL stylesheets and a stylesheet processor. If cooperating working groups maintain a page structure adhering to the following guidelines, the same XSL stylesheet can be reused for each group to extract useful information.

Basic Design

The main design issue was balancing a working group page editor's freedom to determine page appearance against the need for recognizable page structure. It is generally too restrictive to rely on the order in which information appears, or on the formatting of a document, for clues about content, so the stylesheet is designed to recognize certain markup classes that denote items, or blocks of items, belonging to an information class.

Information Classes

There are seven classes of information recognized by this document: group characteristics, news, drafts, deliverables, participants, meetings, and teleconferences. Information that fits in these categories is described as follows:

Group characteristics
Information such as the name of the working group and its activity; described further in Group Characteristics.
News
Anything of interest to the working group, be it deliverable status, new software, or related developments in other areas of work. Those who use Eric Miller's RSS feed generator do not need to make any changes.
Drafts
Pre-technical report status documents which are not visible to the public ("editor's drafts" and the like). Those who use Bert Bos' work need not change their drafts markup.
Deliverables
Deliverables are documents published as technical reports; most information is derived from the actual report. Discrepancies between reports and working group page information should be corrected.
Participants
Working group participants; eventually, this will be compared against the member database to correct for deficiencies in either.
Meetings
Past and future face-to-face meetings.
Teleconferences
Past teleconferences; refer to the RDF teleconference calendar for future teleconferences.

Information Extraction

Extracted information falls between two granularities, specific and general, and blends of the two are possible. There are three methods of marking up pages for extraction: blocked, tabled, and advanced. Blocked is the most basic method, and its results tend to be general. Tabled results in specific information. Advanced may produce the same results as blocked; how specific the results are for both advanced and tabled depends on how much effort page editors put into marking their pages.

General information consists of a link to the working group's page and a description containing the entire text of the item. This is barely useful, so page editors are encouraged to do more to get specific results.

Specific information is most obviously achieved by using advanced property classes. Each information class has its own set of advanced property classes, described more fully in Advanced Property Classes (they were originally intended only for the advanced structure method, but can also be used in the blocked method).

Blocked Method is very simple and easy to set up, requiring no maintenance at its most basic level. Tabled Method requires slightly more work to set up but still needs no maintenance. Advanced Method requires a good deal of setup and some maintenance effort for any information added to a page. The more effort a page editor invests, the better the data extraction.

The different styles of markup are described in the following sections, starting first with notes on issues common to all methods, then moving to the easiest method, blocked, and finishing with advanced.

Output: Diagnostics and Destination

The information extracted from a working group's page will be placed in the group's main directory as scrapedOverview.rdf, a Member-visible file. The output RDF will contain a reference to a CSS file for displaying the results in a less aesthetically offensive manner, but make sure to check the output for any enclosed comments. Comments (using the <!-- --> syntax) carry important diagnostics about missing information; any deficiencies they report should be corrected.

Getting Started

The marking up process begins here. You could open up your main working group page in your editor of choice and follow along.

Use the Profile!

The head of a working group page must have a profile of http://www.w3.org/2002/12/wg

Example: Head Profile

<html xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://www.w3.org/2002/12/wg">
  <title>MetaML Working Group</title>
 </head>
...

The HTML 4.0 recommendation defines the profile attribute to be "the location of one or more meta data profiles, separated by white space." The dereferenced URI in this case currently returns an XHTML+RDDL document, but the significance of using it in the profile has not yet been determined.

So put the profile in; we are still working out how to make good use of it.

Group Characteristics

Working groups have several characteristics in common: all have a name, home page, charter, encompassing activity, and, hopefully, a summary of the group's main goals. To mark them for recognition by the stylesheet, use a class attribute (or a rel attribute when extracting a URI, like the location of the charter) on enclosing elements, or wrap each characteristic in a <span> element. The example below uses class="summary" for the summary, class="activity" for the encompassing activity, and rel="charter" for the charter link.

The home page of the working group is passed in as a parameter, as is the group's name, though the parser will fall back to the page <title> if the name parameter is missing.

Example: Basic WG Characteristics

<head>
 <title>MetaML Working Group</title>
</head>
<body>
 <p class="summary">
 This working group is part of the
 <span class="activity">Meta Activity</span>; the
 <a href="charter.html" rel="charter">charter</a>
 is available for viewing.  Most of our work
 involves...
 </p>
 ...
</body>

Choosing Markup Styles

You can use a different method for each class of information. Determine which classes of information you actually present and which of those you want to be extracted (note that news and deliverables currently have the most tools available for using the data, but other tools may become available for different information classes as more groups subscribe).

If a page is very simple in structure, Blocked Method is probably good enough. You can embellish Blocked Method with Advanced Property Classes for better results.

If the page makes heavy use of tables of information, Tabled Method may be a good choice. However, Tabled Method is very rigid in its expectations, so check to see if Advanced Method might be a better payoff for the effort.

Go with Advanced Method if the Blocked and Tabled Methods aren't sufficient. You might consider slightly restructuring your page to use Blocked Method if you don't want to deal with the maintenance effort involved with Advanced Method.

After choosing a markup style and discovering its general approach, check the Information Class-Specific Details of that section to see if there are any important notes about marking up the class you're interested in.

Referencing Other Documents

Page editors may structure their site by keeping an authoritative archive of information on a separate page. To support this style, editors can use either the <link> element with a rel attribute having one of the information class values within the <head> of a document, or the <a> element with a rel attribute (appropriate rel values are those listed in Marking Blocks). The separate page can use whatever markup style the page editor chooses.

Editors should choose only one authoritative source of information. If one type of information is presented on the main page and also on a separate page, designate one version as authoritative and mark up only that one on the main page. Don't use more than one of a <link>, an <a>, or an in-document class attribute on the same page for the same thing, or the scraper will end up extracting duplicate items.

Example: Non-Inline News

<head>
 <link rel="news" href="news.html" />
 <title>Silver Bullet Working Group</title>
</head>
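
The <a> alternative can be sketched the same way. This is a minimal illustration, not taken from an actual working group page: the file name meetings.html and the surrounding sentence are hypothetical, and the rel value assumes meetings is among the values recognized for blocks.

```html
<body>
 <p>
  Past and upcoming face-to-face meetings are kept on a
  <a rel="meetings" href="meetings.html">separate meetings page</a>.
 </p>
</body>
```

With this link in place, the main page should not also carry a <link rel="meetings"> or an element with class="meetings", per the duplication warning above.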

Representing Dates and Date Ranges

Use class="date" to mark any elements containing dates related to items. This particularly applies to news and meetings. As per the W3C Manual of Style, "either spell out the month or use an ISO-8601-derived form," YYYY-MM-DD (e.g., 2002-10-03 for October 3, 2002). The YYYY-MM-DD form is preferred in this case. Other recognized date forms are YYYY-MM, DD Month YYYY, and the Last-Modified HTTP header format. Anything else will probably be passed through or butchered, so take care to supply dates in a usable format.
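
A short sketch of date markup inside items, assuming a blocked meetings list; the meeting descriptions are made up. The first item uses the preferred YYYY-MM-DD form and the second uses the DD Month YYYY form.

```html
<ul class="meetings">
 <li><span class="date">2002-10-03</span>
     Face-to-face meeting hosted by Example, Inc.</li>
 <li><span class="date">14 November 2002</span>
     Face-to-face meeting held alongside a conference.</li>
</ul>
```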

Date ranges can take one of the following forms:

Mixing Markup Styles

Different styles of markup can be used within one document, but preferably not within one information class. The choice of style for one class of information is independent of the choice for other classes. For instance, if news is done using Blocked Method, then it should not also be marked with the Advanced Method; however, meetings can be marked with the Advanced Method.

No real errors will result if you do mix markup styles within an information class, and we're looking into removing the restriction.

HTML Class Attribute Lists

The XHTML specification allows for a list of classes as an attribute value; previously there was a restriction within these guidelines on the class attribute to avoid using this capability, but the restriction has been removed. Space-separated lists of values can now be used for classes and hyperlink relationships as well as the document's profile, but avoid using any of the values in these guidelines together within the same class or rel unless it's absolutely necessary; the XSLT does not catch instances when nonsensical classes are used in the same list.
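
As an illustration, a guideline class can share a list with a class used purely for styling. In this sketch, compact is a hypothetical presentational class that is not part of these guidelines.

```html
<!-- Fine: one guideline value (news) plus an unrelated styling class -->
<ul class="news compact">
 <li>New Working Draft</li>
</ul>

<!-- Avoid: two guideline values in one list; the XSLT will not flag it -->
<ul class="news meetings">
 <li>...</li>
</ul>
```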

Blocked Method

A block of related information, such as working group news, is delimited using a class attribute for a block-level XHTML tag. Recognized block-level tags are: <ul>, <ol>, <dl>, and <div> blocks with <p> items; <table> is a special case described in the next section.

An information item can be recognized through a consistent method for separating each item in a set of related information, such as the news for each day within a news block. Simply using item-level tags, such as <li> within a list, will correctly and consistently separate each item.
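
For instance, the <div> form might look like the following sketch, where each <p> is one news item. The item text is invented for illustration.

```html
<div class="news">
 <p>New Working Draft</p>
 <p>Beta Implementation Released</p>
</div>
```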

Marking Blocks

Block-level tags must have a class attribute indicating which set of information is contained within. These can be reused should the information be separated across blocks. The recognized values are the plural information class names: news, drafts, deliverables, participants, meetings, and teleconferences.

Example: News Block

<h3>News</h3>
<ul class="news">
 <li>New Working Draft</li>
 <li>Beta Implementation Released</li>
</ul>

Block classes can be reused within the same document if needed. For instance, many groups separate face-to-face meetings into one list, or block, per year. Each year's block can be assigned the class of meetings, and all the meetings will be extracted.
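
A sketch of the per-year layout just described; the headings, dates, and item text are hypothetical.

```html
<h3>Meetings in 2002</h3>
<ul class="meetings">
 <li><span class="date">2002-10-20</span> Face-to-face meeting</li>
</ul>
<h3>Meetings in 2001</h3>
<ul class="meetings">
 <li><span class="date">2001-11-05</span> Face-to-face meeting</li>
</ul>
```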

Marking Items

As in the above example, no markup outside of the normal XHTML is necessary to denote an item within a block.

If items are implemented with <ul>, make sure to use both opening and closing <li> tags around each item. This and other requirements for valid XHTML are found in the XHTML Recommendation. There should be no other child nodes in a block aside from its items.

The <dl> element is slightly different since it may contain related children using <dt> and <dd>, which do not contain one another. Items in this case must consist of a <dt> element followed by at most one <dd> element.
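
The <dt>/<dd> pairing can be sketched as follows. The dates and text are hypothetical; the first item has a <dd> and the second has none, and each counts as one item. How the <dd> text is interpreted is left to the class-specific parsing rules and is not asserted here.

```html
<dl class="meetings">
 <dt><span class="date">2002-10-20</span></dt>
 <dd>First face-to-face meeting, hosted by Example, Inc.</dd>
 <dt><span class="date">2002-11-03</span> Second face-to-face meeting</dt>
</dl>
```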

Advanced Property Classes

Each information class item is handled differently by the parser as it attempts to make best guesses about how an item's information is represented. The advanced property classes provide further clarification for which information is which within an item.

An advanced property class is to be used as the class attribute of an element inside an item. The text contained within the element will be extracted, except in the case of deliverable versions. Some advanced property classes are to be used as the rel attribute of an anchor element, in which case the value of the href attribute of the same element will be extracted. The following table sums up which advanced property classes can be used with which information class items:

Information Class | class                                                                  | rel
------------------+------------------------------------------------------------------------+-------------------
news              | title, link, description, date                                         | (none)
drafts            | title, description, date                                               | details
deliverables      | title, description, date; one of: note, wd, lc, ends, cr, pr, per, rec | details, versionof
participants      | name, email, lastname, firstname, organization, phone, role            | email
meetings          | description, date                                                      | agenda, minutes
teleconferences   | description, date                                                      | agenda, minutes
The particular version of a deliverable is designated in shorthand; there should be only one in an item. What each element contains is irrelevant, except for ends; the presence of the class is all that matters.
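
To illustrate, here is a sketch of a blocked meetings list embellished with advanced property classes from the table above: date and description as class values, agenda and minutes as rel values. The dates and file paths are hypothetical.

```html
<ul class="meetings">
 <li><span class="date">2002-10-20</span>
   <span class="description">First face-to-face meeting</span>
   (<a rel="agenda" href="2002/10/20-agenda.html">agenda</a>,
   <a rel="minutes" href="2002/10/20-minutes.html">minutes</a>)
 </li>
</ul>
```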

Information Class-Specific Details

News

There are two methods through which news blocks can be best understood. The first is similar to the W3C home page, that is, each news item is given a name or id attribute for other pages to link directly to. An RSS feed aggregator would link to the news item on the working group's page. The second is to make sure each news item contains only one link; in that case, an RSS feed aggregator would link to whatever the news item links to, using the full text as the link's description. Page editors should choose one or the other.

If neither is chosen (determined by the lack of an anchor tag with a name attribute and the presence of multiple other anchor tags), the reference and title will simply be the page the news is on and the first couple of words in the item. This is to ensure every group will have some sort of functioning RSS news feed, suboptimal though it may be.

Note that Eric Miller describes a method for generating RSS feeds in some of his At A Glance documentation. Editors who already use this method can continue to do so with no changes.

Interested parties can read about the exact process for the news XSLT rules.

Example: News Like W3C Home Page

<ul class="news">
 <li id="n200210181" name="n200210181">Notice that this
     works well for news that has no links.</li>
 <li id="n200210182" name="n200210182">Or
     <a href="a">an</a>
     <a href="b">arbitrary</a>
     <a href="c">number</a>
     of links</li>
</ul>

Example: One-Link News

<ul class="news">
 <li>This method is kind of like running your own
     <a href="http://purl.org/rss/1.0/">RSS feed</a>.
 </li>
 <li>It's also just a little bit less
     <a href="inflexible">useful</a>
     to have only one link.
 </li>
</ul>

Deliverables

Deliverables, if they've been published, refer to other documents from which an XSLT stylesheet can extract information. If a deliverable item links to the actual, published document, then the document will be used as the source of information and the information within the working group page's item will be ignored.

If the document access fails, a comment is left in the RDF saying as much, and the parsing falls through to the case when no link is available. Examine the RDF output for comments in general, as these are highly relevant diagnostics on what's missing. If the document stage is planned for the future, it is very important to note the date of the planned publication, as specifically as possible, using the date class, and the 'latest version' URI using the versionof rel.
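
A planned deliverable might be marked as in the following sketch. The Candidate Recommendation stage, the planned date, and the /TR/metaml URI are hypothetical values for the fictional MetaML group; cr is assumed here to stand for Candidate Recommendation.

```html
<ul class="deliverables">
 <li>The <span class="cr">Candidate Recommendation</span> of
   <a rel="versionof" class="title" href="/TR/metaml">MetaML 1.0</a>
   is planned for <span class="date">2003-06</span>.
 </li>
</ul>
```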

The parsing of deliverables treats a single link, whether marked or not, within an item as if it were the link to the actual document; be wary of making items with one link that don't actually point to the document. Details on the exact method of deliverables processing are available for the interested.

The following will all produce the same output (given that a TR Last Call draft does exist for the fictional group's deliverable).

Example: Blocked Method Deliverables

<ul class="deliverables">
 <li><span class="date">2002-10-01</span>
   This is the <span class="lc">Last Call</span> draft
   for <a rel="versionof" class="title"
   href="/TR/metaml">MetaML 1.0</a>.
   <p class="description">
   MetaML does...
   </p>
 </li>
 <li>
   This is <span class="date">gibberish</span> because
   it links to the <a href="/TR/WD-metaml-20030108">actual
   document</a>, allowing the parser to ignore all the
   <b class="title">junk</b> data, except:
   <p class="description">
   MetaML does...
   </p>
 </li>
 <li>
   This is also <span class="date">gibberish</span>,
   but the link to the <a href="/TR/WD-metaml-20030108"
   rel="details">actual document</a> is explicit,
   allowing more <a href="garbage">great links</a> to
   be presented within the page.
   <p class="description">
   MetaML does...
   </p>
 </li>
 ...
</ul>

Drafts

The same caveat about one link applies to drafts as well. In a draft item with one link, that link will be considered the link to the actual document. In this case, the parsing of the draft will use the link to determine the document's date. Further details on drafts processing are available for the interested.

Others

There are fewer caveats with the other information classes. The exact method of processing each item per its information class is described in the Class Extraction Rules, and you may not need to make any changes to get good results. The results most faithful to your intentions will definitely be provided by using advanced property classes, but there are other options available.

Tabled Method

Tabled structure resembles blocked structure, though it only applies to data in <table> elements. Tables provide a good balance between the amount of work a page editor must do and the amount of information that is extracted, though the method has very tight restrictions on the expected data. As is the nature of table data layouts, the first row is a prototype of the structure of the information that follows; a column labeled "Phone Number" will be the phone number for each person in the table.

To use tabled structure, the table must be given the correct class from those listed above in the blocked structure description, and the first row should be marked with the type of information in each column, as described below. Note that the advanced property classes here only mean that each data cell in the corresponding column contains that property; the classes are currently not considered in processing within each cell. Each column can have only one advanced property class associated with it.

Unmarked cells in the first row will be ignored when scraping the rest of the table.

Information Class-Specific Details

The semantics of advanced property classes differ under Tabled Method, as outlined for each class below. Note that there is no tabled version of news.

Deliverables

Since deliverables tables are normally abbreviated views of information, the deliverables table is expected to have a specific structure. The first column holds the title and a link to the latest version of the deliverable; each following column represents one stage of the deliverable and contains a link to the respective version and a date. There is some leeway: if the link in the first column is not to a URI starting with /TR/ or http://www.w3.org/TR/, it will be ignored. Tabled Method will try to access the TR document first before falling through to using the data in each cell.

If your table does not conform to this exact structure, you should use Advanced Method instead.

Example: Tabled Method for Deliverables

<table class="deliverables">
 <tbody>
  <tr>
   <th class="title">Deliverable</th>
   <th class="wd">WD1</th>
   <th class="rec">REC</th>
  </tr>
  <tr>
   <td>DREML 0.9</td>
   <td><a href="wd-dreml-0-9.html">1998-12-25</a></td>
   <td><a href="/TR/dreml">2000-01-01</a></td>
  </tr>
  <tr>
   <td>PUML 2.0</td>
   <td><a href="wd-puml-2-0.html">2000-07-14</a></td>
   <td />
  </tr>
 ...
 </tbody>
</table>

Drafts

Drafts are similar to deliverables in that the first column is expected to be a link to the document in question with the title of the document. The other columns can carry any of the other related advanced property classes in whichever order is convenient.
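
A sketch of a drafts table under these rules, with the first column carrying the title and the link to the draft. The column labels, file name, and cell contents are hypothetical.

```html
<table class="drafts">
 <tbody>
  <tr>
   <th class="title">Draft</th>
   <th class="date">Last updated</th>
   <th class="description">Notes</th>
  </tr>
  <tr>
   <td><a href="drafts/metaml-editors.html">MetaML 1.0
       editor's draft</a></td>
   <td>2002-10-01</td>
   <td>Incorporates Last Call comments</td>
  </tr>
 </tbody>
</table>
```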

Participants

There is no restriction on the order of columns and, because it's such common practice, the column with name can optionally be a link to the participant's email address. The email property class can also stand on its own (note that the text of the element with class email should be the email address, not the href, if there is one).
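
A sketch of a participants table; the names, address, and the unmarked Notes column are hypothetical. The name cell links to the participant's email address, and the Notes column has no class on its header cell, so it is skipped.

```html
<table class="participants">
 <tbody>
  <tr>
   <th class="name">Name</th>
   <th class="organization">Organization</th>
   <th class="role">Role</th>
   <th>Notes</th>
  </tr>
  <tr>
   <td><a href="mailto:somebody@example.org">Some Body</a></td>
   <td>Example, Inc.</td>
   <td>chair</td>
   <td>On leave until March</td>
  </tr>
 </tbody>
</table>
```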

Meetings and Teleconferences

There is no restriction on the order of columns, though the agenda and minutes should contain links to their respective documents.

Example: Tabled Method for Meetings

<table class="meetings">
 <tbody>
  <tr>
   <th class="date">Date</th>
   <th class="agenda">Agenda</th>
   <th class="minutes">Minutes</th>
  </tr>
  <tr>
   <td>2002-10-20</td>
   <td>
    <a href="2002/10/20-agenda.html">agenda</a>
   </td>
   <td>
    <a href="2002/10/20-minutes.html">minutes</a>
   </td>
  </tr>
  <tr>
   <td>2002-11-03</td>
   <td>
    <a href="2002/11/03-agenda.html">agenda</a>
   </td>
   <td>
    <a href="2002/11/03-minutes.html">minutes</a>
   </td>
  </tr>
 ...
 </tbody>
</table>

Advanced Method

The basic block structure will probably work for most page editors, saving them time and effort. However, not every editor chooses to structure their information in blocks. The Advanced Method guidelines require a good deal of time and effort for editors to mark their pages, though editors will be able to keep the layout they want.

The same rules for parsing apply to items in both Blocked and Advanced Method though the items themselves are marked differently. Instead of marking a block of information items, each item is marked on its own. This allows page editors to mix different information classes within one block, working the markup into the way they've already chosen to structure their page.

Except for news, the value of each item's class attribute is just the singular form of the noun used in the block's class attribute value: draft, deliverable, participant, meeting, and teleconference. A news item uses the class item.

Information Class-Specific Details

The classes of information remain the same, but blocks are no longer marked; instead, each individual item must be marked as an item from the class it belongs to, and each individual item's attributes should also be contained within the item element and marked. <dt>/<dd> list structures will also work by marking the dt portion. See the parsing rules for more information.

News

This markup is taken directly from Eric Miller's work. An item is marked with the attribute class of item and may contain elements with a class of one of date, title, link, or description. Note that Eric's code requires all URIs to be absolute. The following example is taken from his page:

Example: Advanced News

<div class="item" id="x20020125a">
 <a class="link" href=
  "http://www.w3.org/2001/sw/news#x20020125a">
 <img alt="-" width="17" height="11" src="/Icons/right" />
 <span class="title">An RDF Schema for P3P</span></a> :
 <span class="date">2002-01-25</span>,
 <span class="description">The <a href=
  "http://www.w3.org/P3P/">P3P Working Group</a> has
  published "An <a href=
  "http://www.w3.org/TR/2002/NOTE-p3p-rdfschema-20020125">
  RDF Schema for P3P</a>" as a W3C Note. Based on The <a
  href="http://www.w3.org/TR/2001/WD-P3P-20010928/">
  Platform for Privacy Preferences 1.0 (P3P1.0)
  Specification</a> Last Call Working Draft,
  the Note represents one possible RDF schema for P3P.
  P3P simplifies and automates the process of reading
  Web site privacy policies, promoting trust and
  confidence in the Web.</span>
</div>

Meetings and Teleconferences

Some page editors may want to list all meetings and teleconferences in chronological order.

Example: Meetings and Teleconferences

<ul>
 <li class="meeting"><span class="date">2002-12-01</span>
    <a rel="minutes" href="minutes.html">Minutes</a>,
    <a rel="agenda" href="agenda.html">Agenda</a>,
    <span class="description">Discussed stuff at our
    first face to face</span>
 </li>
 <li class="teleconference">
    <span class="date">2002-11-22</span>
    <a rel="minutes" href="112202min.html">minutes</a>
 </li>
 ...
</ul>

Drafts

This type of markup comes directly from Bert Bos' work.

Example: Advanced Drafts

<dl>
 <dt class="draft"><span class="date">2002-10-01</span>
   This is the editor's version of the Last Call draft
   for the <a class="title" rel="details"
   href="100102metaml.html">MetaML 1.0</a> specification.
 </dt>
 <dd>
   MetaML does... (this section is parsed as
   the description)
 </dd>
 ...
</dl>

Participants

Advanced markup may not be as useful for participants, but here's an example anyway to demonstrate the advanced property classes.

Example: Advanced Participants

<ul>
 <li class="participant"><a class="name"
   href="http://people.example.org/somebody/">Some Body</a>
   (<a href="mailto:somebody@example.org" rel="email">
   somebody@example.org</a>),
   the <span class="role">principal</span>
   from <span class="organization">Example, Inc.</span>
 </li>
 ...
</ul>

Gotchas

Working group pages should be validated before publishing changes! Stylesheet processors can't work with malformed XML, so run a copy of the XHTML through the validator when you think everything is ready.

Pages can be maintained in HTML 4.0 instead if they are passed through tidy before being scraped. There is no extra work on a page editor's part for either an HTML 4.0 or an XHTML page format.

Try to avoid sacrificing correct usage for formatting; in particular, avoid placing the &nbsp; entity in a table data cell to signal that it contains nothing. Instead, use <td/>.
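
For example, a meetings table row whose minutes are not yet available could leave the cell empty, as in this sketch (the date and path are hypothetical):

```html
<tr>
 <td>2002-12-01</td>
 <td><a href="2002/12/01-agenda.html">agenda</a></td>
 <!-- no minutes yet: an empty cell, not <td>&nbsp;</td> -->
 <td/>
</tr>
```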

The stylesheet is very forgiving when it comes to being able to locate blocks of information, but it is very unforgiving when items are not rigidly consistent in layout. Display and rearrange blocks as is necessary, but try to use the same representation for items. The stylesheet shouldn't change for anything except bugs, so switching representations is fine so long as the switch is consistent.

Templates and Examples

Combining everything above together in one place, these template files are provided along with links to compliant working group pages. The template (in text/plain, since the idea is to examine the markup) is marked up for the fictional MetaML Working Group; it uses a <link> to a separate meetings page for examples of how to reference external information. Also available are the RDF/XML result and news feed.

Thanks to Dan Connolly, Bert Bos, Dean Jackson, Max Froumentin, and Dominique Hazaël-Massieux for taking on the roles of early adopters. Due to their efforts, a number of working group pages already use the guidelines, allowing the regeneration of their data in RDF. The Web Ontology Language Working Group (result), the SVG Working Group (result), and the Math Working Group (result) are fine examples of guideline compliant XHTML pages. Most use a combination of blocked and advanced methods. The QA Working Group (result) is a set of public XHTML pages using the markup, with the publication roadmap a good example of tabled method for deliverables.

The CSS Working Group is a great example of an HTML 4 guideline compliant page. There is an RDF result set, but it is manually generated (there are known problems using the W3C public XSLT servlet with pages which require recursive confidential document fetching, see the caveat in Test Your Page).

For more examples, see the complete list of groups known to be using the guidelines.

Under The Hood

The scraping stylesheet is run using an XSLT processor, the current favorite of which is Saxon, written in Java.

An attempt to translate the XSLT rules to English rules is provided for developers, though the technically inclined users of these guidelines may also be interested.

A test suite exists for exercising the above requirements. If the stylesheet should ever change, it should still correctly handle the above tests.

Quick Summary

After reading this document, you can refer to a quick summary of what the screen scraper needs for scraping, and what it can and cannot recognize for a given block class.

Test Your Page

After reading through this guide and applying it to your working group page, you can use the W3C XSLT Service to check whether the scraper understands your work. Fill out your page URI below (check the box if the page isn't public access). Add to the supplied tidy URI if your page is in HTML 4; replace it if your page is XHTML.

Also supply a baseURI, which is the root URL of your page (e.g., http://www.w3.org/YourPath/WG/Overview.html has a baseURI of http://www.w3.org/YourPath/WG/), where YourPath is the location of your home page. The name of your working group is optional but suggested, and strongly suggested if the title on your main page is not strictly the name of your group.

Page URI:

Base URI:

Working Group Name:

Proxy basic authentication for your page

Caveat

Processing working group pages is known to work for public XHTML pages and isolated member-confidential XHTML or HTML 4 pages. Support for cases where other working group pages need to be accessed, either for processing or tidying via the XSLT, is in the works. Testing is possible on a page-by-page basis. Report problems and issues to Ryan Lee.

The XSLT processing may take some time.