W3C Group Pages: metadata extraction experiment
This page describes Amy and Dan's experiments with extracting
information from W3C group pages. This document and the sample document (longer template) that
accompanies it are publicly visible. Some work referenced
here may be Team or Member access only.
Draft templates
XSLT experiment
(based on the web service created for the RSS extraction
tool)
Related work
Goals
We want to make it easier to find out various things about
the Working, Interest and Coordination Groups chartered
within W3C process. We have some of this information in RDF
already, while some of it is available only as prose within
charter, home page and activity statement documents. Amy has
collected (by hand) some information on group expiry dates.
Rather than do this again, we're trying to come up with some
recommended HTML-based conventions that could be included in
all such documents, so this info can be extracted
automatically (eg. with XSLT) next time. We're starting by
working on a demo of a WG homepage and/or charter doc that
includes some usefully extractable data...
See the sample group page for some
properties related to working groups that we are interested
in gathering. The basics are those relating to schedule,
associated people, activity/domain. A more ambitious project
might try to extract a lot more information, eg. inter-group
dependencies, versioning history, deliverables, full schedule
etc.
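To make the extraction idea above concrete, here is a minimal sketch in Python (used here only for illustration; the experiment itself targets XSLT). It assumes a purely hypothetical convention in which a group page carries its basic properties as XHTML meta elements whose names start with "group.". The names group.type, group.charterEnd and group.activity, the sample page, and the function extract_group_metadata are all invented for this example and are not part of any agreed template.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample page: the "group.*" meta names are invented for
# illustration and are not an agreed W3C convention.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example Working Group</title>
    <meta name="group.type" content="WG" />
    <meta name="group.charterEnd" content="2003-01-31" />
    <meta name="group.activity" content="Example Activity" />
  </head>
  <body><p>Home page prose.</p></body>
</html>"""


def extract_group_metadata(xhtml_text, prefix="group."):
    """Return a dict of the properties found in prefixed meta elements."""
    root = ET.fromstring(xhtml_text)
    props = {}
    for meta in root.iter(XHTML_NS + "meta"):
        name = meta.get("name", "")
        if name.startswith(prefix):
            # Strip the prefix so "group.type" becomes the key "type".
            props[name[len(prefix):]] = meta.get("content", "")
    # Use the page title as a human-readable label for the group.
    title = root.find(f"{XHTML_NS}head/{XHTML_NS}title")
    if title is not None and title.text:
        props["title"] = title.text.strip()
    return props


if __name__ == "__main__":
    for key, value in sorted(extract_group_metadata(SAMPLE_PAGE).items()):
        print(f"{key}: {value}")
```

The sketch only works on well-formed XHTML, which is the same constraint an XSLT-based extractor would place on group pages.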
Issues, Next steps
Stuff we've run into, or to think about later...
-
Should this info go in the charter (formal, difficult to
update without process), or group homepage (easier to get
it added, less formal significance)? Or both? Current
samples are for WG homepage... Or should it live in the
activity statement?
-
What information do we want to extract? What is essential
(dates?) vs what would be nice to have (dependencies?)
-
How do we represent this in RDF? Do we already have
appropriate classes and properties defined in existing
namespaces?
-
We want an XSLT script that extracts some of this
information into RDF, and a demo application that merges it
into a single page.
-
What do we do about expired (ie. totally dead) activities,
vs those that are merely out of charter?
-
We want to know relationships to the domain home page (where
relevant), to the activity home page, statement etc. How can
this be determined automatically (eg. by following links
from the /Member/ page)? Should we have a master list of all
charter URIs, or have each group home page point (in a
machine-visible way) to each of its charters?
-
What to do about versioning? Some groups re-use a URI when
charters are updated.
-
What information can each charter reasonably be expected to
contain?
-
What access control to set for all this info? Member-ACL?
-
How do IGs, CGs, WGs differ? Do we want to know different
things about each?
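As a rough illustration of the "merge it into a single page" item in the list above, the following Python sketch takes per-group records and produces one line per group, sorted by charter end date and flagged when the charter has lapsed. The record keys (title, type, charterEnd) and the function summarize_groups are assumptions made for this example. Note that a date comparison can only tell us a group is out of charter; whether an activity is totally dead would need some other, explicitly recorded, property.

```python
from datetime import date


def summarize_groups(records, today):
    """Merge per-group records into one sorted list of summary lines.

    Each record is a dict with hypothetical keys "title", "type" and
    "charterEnd" (an ISO 8601 date string such as "2003-01-31").
    """
    rows = []
    # ISO dates sort correctly as plain strings.
    for rec in sorted(records, key=lambda r: r["charterEnd"]):
        end = date.fromisoformat(rec["charterEnd"])
        status = "out of charter" if end < today else "in charter"
        rows.append(
            f'{rec["title"]} ({rec["type"]}): '
            f"charter ends {end.isoformat()}, {status}"
        )
    return rows


if __name__ == "__main__":
    sample = [
        {"title": "B Group", "type": "IG", "charterEnd": "2003-01-31"},
        {"title": "A Group", "type": "WG", "charterEnd": "2002-01-31"},
    ]
    for line in summarize_groups(sample, date(2002, 4, 23)):
        print(line)
```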
$Id: Overview.html,v 1.11 2002/04/23 17:33:05 danbri Exp $