W3C Group Pages: metadata extraction experiment
    
    
      This page describes Amy and Dan's experiments with extracting
      information from W3C group pages. This document and the sample
      document (longer template) that accompanies it are publicly
      visible. Some work referenced here may be Team or Member access
      only.
    
    
      Draft templates
    
    
    
      XSLT experiment
    
    
      (based on the web service created for the RSS extraction
      tool)
    
    
    
      Related work
    
    
    
      Goals
    
    
      We want to make it easier to find out various things about
      the Working, Interest and Coordination Groups chartered
      under the W3C Process. We have some of this information in RDF
      already, while some of it is available only as prose within
      charter, home page and activity statement documents. Amy has
      collected (by hand) some information on group expiry dates.
      Rather than do this by hand again, we're trying to come up with
      some recommended HTML-based conventions that could be included
      in all such documents, so that this information can be extracted
      automatically (eg. with XSLT) next time. We're starting by
      working on a demo of a WG homepage and/or charter document that
      includes some usefully extractable data.
    
    
      See the sample group page for some of the working group
      properties that we are interested in gathering. The basics
      are those relating to schedule, associated people, and
      activity/domain. A more ambitious project might try to
      extract a lot more information, eg. inter-group dependencies,
      versioning history, deliverables, and the full schedule.
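
      As a rough illustration of the kind of extraction we have in
      mind, here is a minimal sketch in Python rather than XSLT. It
      assumes a purely hypothetical convention in which a group page
      carries its properties as meta elements whose names start with
      "group."; neither that prefix nor the property names below come
      from the sample templates, and the sample page is invented.

```python
from html.parser import HTMLParser

# Invented sample page. The "group." meta names are a hypothetical
# convention used only for this sketch.
SAMPLE_PAGE = """<html>
<head>
  <title>Example Working Group</title>
  <meta name="group.type" content="Working Group">
  <meta name="group.charter-expires" content="2003-01-31">
  <meta name="group.chair" content="A. Chair">
  <meta name="group.activity" content="Example Activity">
</head>
<body><h1>Example Working Group</h1></body>
</html>"""

PREFIX = "group."  # assumed naming convention, not an agreed one


class GroupMetaParser(HTMLParser):
    """Collect <meta name="group.*" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name") or ""
        if name.startswith(PREFIX):
            # Store the property under its name without the prefix.
            self.properties[name[len(PREFIX):]] = attr_map.get("content") or ""


def extract_group_properties(html_text):
    """Return a dict of the group properties found in html_text."""
    parser = GroupMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.properties


if __name__ == "__main__":
    for key, value in sorted(extract_group_properties(SAMPLE_PAGE).items()):
        print(f"{key}: {value}")
```

      The same idea could be written as an XSLT template matching
      the meta elements; the point is only that a small, fixed set
      of machine-visible names makes the schedule, people and
      activity data easy to pull out.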
    
    
      Issues, Next steps
    
    
      Stuff we've run into, or to think about later...
    
    
      - 
        Should this info go in the charter (formal, difficult to
        update without process), or group homepage (easier to get
        it added, less formal significance)? Or both? Current
        samples are for WG homepage... Or should it live in the
        activity statement?
      
- 
        What information do we want to extract? What is essential
        (dates?) vs what would be nice (dependencies?)?
      
- 
        How do we represent this in RDF? Do we already have
        appropriate classes and properties defined in existing
        namespaces?
      
- 
        We want an XSLT script that extracts some of this
        information into RDF, and a demo application that merges it
        into a single page.
      
- 
        What do we do about expired (ie. totally dead) activities,
        vs those that are merely out of charter?
      
- 
        We want to know each group's relationship to its domain home
        page (where relevant), activity home page, activity statement
        etc. How can this be determined automatically (eg. by
        following links from the /Member/ page)? Should we have a
        master list of all charter URIs, or have each group home page
        point (in a machine-visible way) to each of its charters?
      
- 
        What to do about versioning? Some groups re-use a URI when
        charters are updated.
      
- 
        What information can each charter reasonably be expected to
        contain?
      
- 
        What access control to set for all this info? Member-ACL?
      
- 
        How do IGs, CGs, WGs differ? Do we want to know different
        things about each?
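
      To make the "extract into RDF, then merge into a single page"
      item above a little more concrete, here is a minimal Python
      sketch. It assumes each group page has already been reduced to
      a dictionary of properties with an ISO-format "charter-expires"
      date; the vocabulary namespace, property names and page URIs
      are all hypothetical. It flags a group as out of charter purely
      by date, so it does not answer the question of how to tell an
      out-of-charter group from a totally dead one.

```python
from datetime import date

# Hypothetical namespace; it does not refer to any existing vocabulary.
VOCAB = "http://example.org/group-terms#"


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(page_uri, properties):
    """Turn one page's properties into N-Triples lines (plain literals)."""
    return [
        f'<{page_uri}> <{VOCAB}{key}> "{escape_literal(properties[key])}" .'
        for key in sorted(properties)
    ]


def merge_summary(groups, today):
    """Merge per-group data into one listing, earliest expiry first.

    groups maps a page URI to its properties dict; each dict must have
    a "charter-expires" value in YYYY-MM-DD form (an assumed name).
    """
    rows = []
    for uri, props in groups.items():
        expires = date.fromisoformat(props["charter-expires"])
        status = "out of charter" if expires < today else "in charter"
        rows.append((expires, uri, status))
    rows.sort()
    return [
        f"{expires.isoformat()}  {status:<14}  {uri}"
        for expires, uri, status in rows
    ]


if __name__ == "__main__":
    groups = {
        "http://example.org/wg-a": {"charter-expires": "2003-01-31"},
        "http://example.org/wg-b": {"charter-expires": "2001-06-30"},
    }
    for uri, props in groups.items():
        print("\n".join(to_ntriples(uri, props)))
    print("\n".join(merge_summary(groups, date(2002, 4, 23))))
```

      Keeping the RDF step separate from the summary step means the
      merged page can be regenerated from the triples alone, whichever
      document (charter, home page or activity statement) ends up
      carrying the data.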
      
    $Id: Overview.html,v 1.11 2002/04/23 17:33:05 danbri Exp $