Re: Data in HTML Crawl

On Sunday, 20 November 2011 at 20:30, Manu Sporny wrote:
> The first crawl will look for the frequency of
> Microdata/RDFa/Microformats documents on the Web along with usage data
> for each attribute/Microformats class. This crawl will help determine if
> RDFa Lite 1.1 is going to break backwards compatibility in an
> unacceptable way and will give us some usage figures on all three
> languages. A wiki has been created to outline the types of tests we
> intend to run:
> 
> http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
I think what we need is a complete map of all the tags and their attributes. I don't think we need to crawl every single page in the index (that would probably distort the results). Instead, we should get an idea of how many domains are in the index and then randomly select pages from those. Some domains hold more pages than others, so if one domain holds 1,000,000+ pages that all use the same PHP template (e.g., Wikipedia), crawling everything would skew the results because that domain would be over-represented.
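
A minimal sketch of that selection, assuming the index can be read as a flat list of URLs. The function name, its parameters, and the use of the hostname as a stand-in for "domain" are my assumptions for illustration; a real run would probably group by registered domain instead.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_pages_by_domain(urls, n_domains, pages_per_domain=1, seed=None):
    """Pick domains uniformly at random, then pages uniformly within each.

    Every chosen domain contributes at most `pages_per_domain` pages, so a
    domain with a million templated pages counts no more than a small one.
    """
    rng = random.Random(seed)

    # Group the index by hostname (assumption: hostname approximates domain).
    by_domain = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip URLs that have no hostname
            by_domain[host].append(url)

    # Sorting first keeps the result reproducible for a given seed.
    domains = rng.sample(sorted(by_domain), min(n_domains, len(by_domain)))

    sample = []
    for domain in domains:
        pages = by_domain[domain]
        sample.extend(rng.sample(pages, min(pages_per_domain, len(pages))))
    return sample
```

Sampling the domain first and the page second is what stops a single large site from dominating the counts.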

My partner is a statistician, and she can tell us exactly how many domains we need to look at before diminishing returns set in (say, if we want to tolerate an error rate of 1-3%).
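
For a rough feel of the numbers in the meantime, here is a sketch using the textbook sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. This is my assumption about what such an estimate might look like, not her method: it presumes simple random sampling, a 95% confidence level (z = 1.96), and the worst case p = 0.5. The function name is hypothetical.

```python
import math


def sample_size(margin_of_error, p=0.5, z=1.96):
    """Sample size needed to estimate a proportion within +/- margin_of_error.

    p is the expected proportion (0.5 is the most conservative choice) and
    z is the normal quantile for the confidence level (1.96 for 95%).
    """
    return math.ceil(z * z * p * (1 - p) / margin_of_error ** 2)


if __name__ == "__main__":
    for e in (0.03, 0.02, 0.01):
        print(f"margin {e:.0%}: about {sample_size(e)} samples")
```

Under those assumptions a 3% margin needs a little over a thousand samples and a 1% margin a little under ten thousand, which is far less than the whole index.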

How I think this should be done: 

1. Create a database that will hold the elements, attributes, and the frequency of each occurrence (element and attribute). 
2. Pick a random page from a random domain.
3. Parse the page with html5lib: it follows the HTML5 parsing algorithm, so it builds the same DOM a conforming browser would, even for malformed markup.
4. For each element in the document, populate the database with the name of the element and each of its attributes.
5. If an attribute has a name found in the wiki (about, content, datatype, class, etc.), record its attribute value(s).
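
Steps 3 to 5 could be sketched roughly as below. To keep the example self-contained it uses Python's standard-library `html.parser` instead of html5lib and in-memory counters instead of a database, so it does not do HTML5 error recovery; both are substitutions of mine. `TRACKED_ATTRS` is a hypothetical subset of the names listed on the wiki, and `UsageTally` is an illustrative name.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical subset of the attribute names listed on the wiki.
TRACKED_ATTRS = {"about", "content", "datatype", "class"}
# Attributes whose value is a space-separated list of tokens.
TOKEN_LIST_ATTRS = {"class"}


class UsageTally(HTMLParser):
    """Count element names, (element, attribute) pairs, and tracked values."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()    # element name -> occurrences
        self.attributes = Counter()  # (element, attribute) -> occurrences
        self.values = Counter()      # (attribute, value) -> occurrences

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        self.elements[tag] += 1
        for name, value in attrs:
            self.attributes[(tag, name)] += 1
            if name not in TRACKED_ATTRS or not value:
                continue
            if name in TOKEN_LIST_ATTRS:
                for token in value.split():
                    self.values[(name, token)] += 1
            else:
                self.values[(name, value)] += 1


if __name__ == "__main__":
    tally = UsageTally()
    tally.feed('<div class="vcard" about="#me"><span class="fn">Al</span></div>')
    print(tally.elements.most_common())
    print(tally.values.most_common())
```

Feeding every sampled page through one `UsageTally` instance accumulates the frequencies across the whole sample.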



Round 1, done. Remove all outliers (N < statistically significant value). We can now see at least which element and attribute names are being used.
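
The outlier step could be as simple as the filter below, assuming the counts live in a `collections.Counter`. The name `drop_rare` and the `min_count` parameter are hypothetical, and the actual cutoff is something the statistics would have to supply.

```python
from collections import Counter


def drop_rare(counts, min_count):
    """Return a new Counter without entries seen fewer than min_count times."""
    return Counter({key: n for key, n in counts.items() if n >= min_count})


if __name__ == "__main__":
    seen = Counter({"div": 500, "span": 320, "blink": 2})
    print(drop_rare(seen, min_count=5))
```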

Round 2 could be more focused. 

 1. Search all domains for use of [RDFa|Microdata|Microformats], also recording how many times they are not encountered (for balance).
 2. Record the fragment structure to a given depth (<foo> x <bar>bah <baz> zzz</bar> qqq</foo>). 
 3. Analyse the common usage patterns (e.g., is an address, person, or event marked up in a valid way?) 
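
One way step 2 might work is to reduce each fragment to a tag-only signature cut off at a given depth, so that identical structures can be counted together. The sketch below again uses the standard-library `html.parser` rather than html5lib, and the names `FragmentSignature` and `fragment_signature` are mine, for illustration only.

```python
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FragmentSignature(HTMLParser):
    """Collect start and end tags down to max_depth, ignoring text."""

    def __init__(self, max_depth):
        super().__init__()
        self.max_depth = max_depth
        self.stack = []  # currently open elements
        self.parts = []  # pieces of the signature

    def handle_starttag(self, tag, attrs):
        if len(self.stack) < self.max_depth:
            self.parts.append(f"<{tag}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, nothing to close
        # Implicitly close anything left open inside the element.
        while self.stack:
            open_tag = self.stack.pop()
            if len(self.stack) < self.max_depth:
                self.parts.append(f"</{open_tag}>")
            if open_tag == tag:
                break


def fragment_signature(html, max_depth):
    parser = FragmentSignature(max_depth)
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the fragment.
    while parser.stack:
        open_tag = parser.stack.pop()
        if len(parser.stack) < max_depth:
            parser.parts.append(f"</{open_tag}>")
    return "".join(parser.parts)


if __name__ == "__main__":
    print(fragment_signature("<foo> x <bar>bah <baz> zzz</bar> qqq</foo>", 2))
```

Counting these signatures across pages would then show which structures are common enough to be worth checking for validity in step 3.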
 

Received on Monday, 21 November 2011 08:29:20 UTC