W3C Group Pages: metadata extraction experiment

This page describes Amy and Dan's experiments with extracting information from W3C group pages. This document and the sample document (a longer template) that accompanies it are publicly visible. Some work referenced here may be Team or Member access only.

Draft templates

The FooML WG sample charter
FooML WG homepage (sample template)

XSLT experiment

(based on the web service created for the RSS extraction tool)

Related work

Bert's Member TR page
RSS/RDF newsfeed extraction from homepage
TR-automation project
Semantic Web Activity: Advanced Development (see 'Schedule Coordination and Dependency Tracking' section, lots of useful links)
WG charters in RDF/n3 (extracted by Amy)
W3C at a glance (RSS news item aggregation from group homepages)
template for WG (?IG, ?CG) charters: amended sample template (original - is this used??)
W3C Guide: How to Create and Update Charters
Process doc: 4.2.2. WG and IG charters
W3C Working Groups and Activities page
Activities page
Feb 2000 survey of WG and IG issue list documents

Goals

We want to make it easier to find out various things about the Working, Interest and Coordination Groups chartered under the W3C Process. We have some of this information in RDF already, while some of it is available only as prose within charter, home page and activity statement documents. Amy has collected (by hand) some information on group expiry dates. Rather than do this by hand again, we're trying to come up with some recommended HTML-based conventions that could be included in all such documents, so that this information can be extracted automatically (e.g. with XSLT) next time. We're starting by working on a demo of a WG homepage and/or charter document that includes some usefully extractable data.

See the sample group page for some of the working group properties that we are interested in gathering. The basics are those relating to schedule, associated people, and activity/domain. A more ambitious project might try to extract a lot more information, e.g. inter-group dependencies, versioning history, deliverables, full schedule, etc.
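To make the idea of an extractable HTML convention concrete, here is a minimal sketch in Python (standing in for the XSLT we actually have in mind). Everything specific in it is an assumption made for illustration: the class names (group-name, charter-end, chair, activity), the example.org URI, the person's name and the date are invented, and the page is assumed to be well-formed XHTML. It is not the markup used in the sample templates.

```python
import xml.etree.ElementTree as ET

# Hypothetical group homepage fragment. The class names and all values
# below are made up for this sketch; they are not an agreed convention.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>FooML WG homepage</title></head>
  <body>
    <h1 class="group-name">FooML Working Group</h1>
    <p>Chartered until <span class="charter-end">2003-06-30</span>.</p>
    <p>Chair: <span class="chair">A. N. Example</span></p>
    <p>Part of the
      <a class="activity" href="http://example.org/FooML/Activity">FooML Activity</a>.
    </p>
  </body>
</html>
"""


def extract_properties(xhtml_text, wanted_classes):
    """Return {class name: value} for elements carrying a wanted class.

    The value is the element's href attribute if it has one (links),
    otherwise its whitespace-normalised text content.
    """
    root = ET.fromstring(xhtml_text)
    found = {}
    for element in root.iter():
        for class_name in element.get("class", "").split():
            if class_name in wanted_classes:
                text = " ".join("".join(element.itertext()).split())
                found[class_name] = element.get("href") or text
    return found


if __name__ == "__main__":
    wanted = {"group-name", "charter-end", "chair", "activity"}
    for name, value in sorted(extract_properties(SAMPLE_PAGE, wanted).items()):
        print(name, "=", value)
```

The same lookup could be written as a handful of XSLT templates matching on the class attribute; the point is only that a small, fixed set of class names is enough to make schedule, people and activity information machine-readable.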

Issues, Next steps

Stuff we've run into, or to think about later...

Should this info go in the charter (formal, difficult to update without process), or group homepage (easier to get it added, less formal significance)? Or both? Current samples are for WG homepage... Or should it live in the activity statement?
What information do we want to extract? What is essential (dates?) vs. what would be nice to have (dependencies?)?
How do we represent this in RDF? Do we already have appropriate classes and properties defined in existing namespaces?
We want an XSLT script that extracts some of this information into RDF, and a demo application that merges it into a single page.
What do we do about expired (i.e. totally dead) activities, vs. those that are merely out of charter?
We want to know each group's relationship to its domain home page (where relevant?), activity home page, activity statement, etc. How can this be determined automatically (e.g. by following links from the /Member/ page)? Should we have a master list of all charter URIs, or have each group home page point (in a machine-visible way) to each of its charters?
What to do about versioning? Some groups re-use a URI when charters are updated.
What information can each charter reasonably be expected to contain?
What access control to set for all this info? Member-ACL?
How do IGs, CGs, WGs differ? Do we want to know different things about each?
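
As a rough sketch of the "extract into RDF, then merge into a single page" step listed above, the Python below turns already-extracted properties into N-Triples lines and builds a one-line-per-group overview sorted by charter end date. It is an illustration under stated assumptions, not a proposal: the vocabulary namespace, the property name charterEnd, the group URIs and the dates are all hypothetical, and it sidesteps the open questions about versioning and dead activities by only distinguishing "in charter" from "out of charter".

```python
from datetime import date

# Hypothetical vocabulary namespace; no such RDF schema is defined here.
VOCAB = "http://example.org/group-terms#"

# Invented sample data standing in for properties extracted from group pages.
GROUPS = {
    "http://example.org/FooML/": {"name": "FooML WG", "charterEnd": "2003-06-30"},
    "http://example.org/BarQL/": {"name": "BarQL IG", "charterEnd": "2001-12-31"},
}


def to_ntriples(group_uri, properties):
    """Serialise one group's properties as N-Triples lines with plain literals."""
    triples = []
    for name in sorted(properties):
        # Escape backslashes and double quotes inside the literal.
        literal = properties[name].replace("\\", "\\\\").replace('"', '\\"')
        triples.append('<%s> <%s%s> "%s" .' % (group_uri, VOCAB, name, literal))
    return triples


def merged_overview(groups, today):
    """Return one text line per group, earliest charter end date first."""
    rows = []
    for uri, props in groups.items():
        end = date.fromisoformat(props["charterEnd"])
        status = "out of charter" if end < today else "in charter"
        rows.append((end, uri, status))
    rows.sort()
    return ["%s  %s  (%s)" % (end.isoformat(), uri, status)
            for end, uri, status in rows]


if __name__ == "__main__":
    for uri, props in GROUPS.items():
        print("\n".join(to_ntriples(uri, props)))
    print("\n".join(merged_overview(GROUPS, date.today())))
```

Keying the merged view on the group URI assumes each group has one stable URI, which is exactly what the versioning question above puts in doubt.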

$Id: Overview.html,v 1.11 2002/04/23 17:33:05 danbri Exp $