Use the W3C XSLT Service to check if the scraper understands your page. Replace the tidy URI if your page is already in XHTML. Also supply a base URI (probably the same as the URI of your page).
Examples: see the QA Working Group's implementation (view the page source) and the resulting information extraction (in RDF), the Web Ontology Language Working Group's implementation and results, and the CSS Working Group's implementation and static results.
Processing working group pages is known to work for public XHTML pages and isolated member-confidential XHTML or HTML 4 pages. Support for cases where other working group pages need to be accessed, either for processing or tidying via the XSLT, is in the works. Testing is possible on a page-by-page basis. Report problems and issues to Ryan Lee (also copying to w3c-tools@w3.org).
The XSLT processing may take some time.
The <head> must have a profile attribute of http://www.w3.org/2002/12/wg.
The name of the working group is derived from the <title> element if the name is not provided to the parser beforehand.
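As a rough illustration of these two rules, the following Python sketch checks the profile and falls back to the title. It is not the scraper itself (which runs as XSLT); the sample page, the group name, and the function name group_name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
WG_PROFILE = "http://www.w3.org/2002/12/wg"

# Hypothetical sample page; the group name is invented for illustration.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2002/12/wg">
    <title>Example Working Group</title>
  </head>
  <body><p>Placeholder content.</p></body>
</html>"""


def group_name(xhtml_text, known_name=None):
    """Return the group name, or None if the page lacks the profile."""
    root = ET.fromstring(xhtml_text)
    head = root.find(XHTML + "head")
    if head is None:
        return None
    # The profile attribute may hold several space-separated URIs.
    if WG_PROFILE not in head.get("profile", "").split():
        return None
    # A name supplied beforehand takes precedence over the <title>.
    if known_name is not None:
        return known_name
    title = head.find(XHTML + "title")
    if title is None or not title.text:
        return None
    return title.text.strip()


print(group_name(PAGE))
```

A page without the profile yields None here, which is one simple way to model "the scraper does not treat this as a working group page".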
The activity name is found within an element with class activity.
A summary of the group's purpose is found within an element with class summary.
The link to the charter is found in an <a> element with a rel attribute of value charter.
Authoritative information located on pages other than the main page can be referred to by using a rel attribute on anchor or link elements with a value of one of the respective information classes: news, drafts, deliverables, participants, meetings, teleconferences.
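The markup rules above can be sketched with a small page and a reader for it. This is an illustrative approximation in Python, not the XSLT the service runs; the sample page, file names, and the scan function are hypothetical, and relative links are left unresolved against the base URI.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
INFO_CLASSES = {"news", "drafts", "deliverables",
                "participants", "meetings", "teleconferences"}

# Hypothetical sample page; names and file paths are invented.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2002/12/wg">
    <title>Example Working Group</title>
    <link rel="participants" href="members.html" />
  </head>
  <body>
    <p>Part of the <span class="activity">Example Activity</span>.</p>
    <p class="summary">The group develops example specifications.</p>
    <p><a rel="charter" href="charter.html">Charter</a></p>
    <p><a rel="meetings" href="meetings.html">Meetings</a></p>
  </body>
</html>"""


def text_of(elem):
    """Collapse the element's inner text to single-spaced form."""
    return " ".join("".join(elem.itertext()).split())


def scan(xhtml_text):
    root = ET.fromstring(xhtml_text)
    info = {"activity": None, "summary": None, "charter": None, "external": {}}
    for elem in root.iter():
        classes = elem.get("class", "").split()
        rels = elem.get("rel", "").split()
        # First element carrying each class wins.
        for name in ("activity", "summary"):
            if name in classes and info[name] is None:
                info[name] = text_of(elem)
        href = elem.get("href")
        if href and elem.tag in (XHTML + "a", XHTML + "link"):
            # The charter link is looked for on <a> elements only.
            if elem.tag == XHTML + "a" and "charter" in rels:
                info["charter"] = href
            # Other pages are referenced by information class name.
            for rel in rels:
                if rel in INFO_CLASSES:
                    info["external"][rel] = href
    return info


print(scan(PAGE))
```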
See the guidelines on Blocked Method. Recognized block-level tags are: <ul>, <ol>, <dl>, and <div> blocks with <p> items; <table> is discussed below. A summary of the information:
Information Class | class | rel |
---|---|---|
(none) | activity, summary | charter, activity |
news | title, link, description, date | |
drafts | title, description, date | details |
deliverables | title, description, date; one of: note, wd, lc, ends, cr, pr, rec | details, versionof |
participants | name, email, lastname, firstname, organization | email |
meetings | description, date | agenda, minutes |
teleconferences | description, date | agenda, minutes |
See also the rules for parsing an item for each information class.
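A Blocked Method list for the drafts class might look like the fragment in the sketch below. The exact item rules live in the linked guidelines, so this rests on an assumption: that the block element carries the information class name as its class, in the same way the Tabled Method marks a table. The fragment, file names, and parse_block are hypothetical, and only <li> and <p> items are handled.

```python
import xml.etree.ElementTree as ET

# Class and rel keywords for the drafts information class, from the table.
DRAFT_CLASSES = ("title", "description", "date")
DRAFT_RELS = ("details",)

# Hypothetical fragment. Assumption: the block element carries the
# information class name ("drafts") as its class.
BLOCK = """<ul class="drafts">
  <li><a class="title" rel="details" href="draft-a.html">Draft A</a>
      (<span class="date">2003-01-15</span>)</li>
  <li><span class="title">Draft B</span>:
      <span class="description">An early outline.</span></li>
</ul>"""


def local(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def text_of(elem):
    """Collapse the element's inner text to single-spaced form."""
    return " ".join("".join(elem.itertext()).split())


def parse_block(fragment):
    block = ET.fromstring(fragment)
    items = []
    for item in block:
        if local(item.tag) not in ("li", "p"):
            continue
        record = {}
        for elem in item.iter():
            # Marked elements inside the item supply the fields.
            for cls in elem.get("class", "").split():
                if cls in DRAFT_CLASSES:
                    record.setdefault(cls, text_of(elem))
            for rel in elem.get("rel", "").split():
                if rel in DRAFT_RELS and elem.get("href"):
                    record.setdefault(rel, elem.get("href"))
        items.append(record)
    return items


for draft in parse_block(BLOCK):
    print(draft)
```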
See the guidelines on Tabled Method. The <table> element takes the information class as its class, and each cell of the first row may take one of the classes listed for that information class. Unmarked columns are ignored.
See the table above for table classes and column classes. Note that a column class applies to the entire column; column classes are not used to extract information per cell (the semantics differ for this method, though the keywords remain the same).
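The column semantics can be sketched as follows: the classes on the first row's cells name the columns, and every later row is read by position. This Python approximation is only illustrative; the sample table, its dates and places, and parse_table are invented for the example, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical meetings table; the "Where" column is unmarked on purpose.
TABLE = """<table class="meetings">
  <tr><th class="date">When</th><th>Where</th><th class="description">Topic</th></tr>
  <tr><td>2003-03-03</td><td>Boston</td><td>Face-to-face meeting</td></tr>
  <tr><td>2003-06-10</td><td>Budapest</td><td>Last Call review</td></tr>
</table>"""


def local(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def text_of(elem):
    """Collapse the element's inner text to single-spaced form."""
    return " ".join("".join(elem.itertext()).split())


def parse_table(fragment):
    table = ET.fromstring(fragment)
    info_class = table.get("class", "")
    rows = [row for row in table.iter() if local(row.tag) == "tr"]
    if not rows:
        return info_class, []
    header, body = rows[0], rows[1:]
    # Map column position to the class given on the first-row cell.
    columns = {}
    for index, cell in enumerate(header):
        classes = cell.get("class", "").split()
        if classes:
            columns[index] = classes[0]
    records = []
    for row in body:
        cells = list(row)
        # Unmarked columns never appear in `columns`, so they are skipped.
        records.append({name: text_of(cells[i])
                        for i, name in columns.items() if i < len(cells)})
    return info_class, records


print(parse_table(TABLE))
```

The body cells carry no class at all, which reflects the point that the keywords label whole columns rather than individual cells.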
See the style guideline notes on Advanced Method. The Advanced Method is based on marking individual items that belong to an information class. An item may contain marked elements holding more specific information.
See also the rules for parsing an item for each information class (the same rules as Blocked Method items).