Alexa Global Top 500 against HTML 5 validation

Last month, Brian Wilson published a survey on validation. He took the URIs of the top 500 sites listed by Alexa and sent them to the W3C Markup Validator. Brian concluded that 32 of the 487 URLs passed validation (6.57%). Recently, W3C created a beta instance of an html 5 conformance checker.

So today I decided to take the January 2008 list of web sites and send them to the beta instance of the html 5 conformance checker. I created a very simple Python script (as usual, if my code horrifies you, any kind suggestion to improve it is welcome). Be careful: you will need to install httplib2. The file alexa.txt contains the list of URIs, one per line. To be sure to check against html 5, I forced the html 5 doctype with the doctype=HTML5 parameter in the checker URI.

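# Python 3; requires the third-party httplib2 package
import time
from urllib.parse import quote

import httplib2

CHECKER = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri="

h = httplib2.Http(".cache")

# alexa.txt holds one URI per line; skip empty lines
with open("alexa.txt", "r") as f:
    urllist = [line.strip() for line in f if line.strip()]

for url in urllist:
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    # percent-encode the target URI so its own query string is not misread
    urlrequest = CHECKER + quote(url, safe="")
    try:
        resp, content = h.request(urlrequest, "HEAD")
        status = resp.get("x-w3c-validator-status")
        if status is None or status == "Abort":
            print(url, "FAIL")
        else:
            print(url, status, resp.get("x-w3c-validator-errors"), resp.get("x-w3c-validator-warnings"))
    except Exception:
        # network error or timeout: report the site instead of silently dropping it
        print(url, "ERROR")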
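
The script prints one line per site, so the counts still have to be added up. Here is a minimal sketch of how that could be done, assuming the output was saved as lines and that the status is always the second whitespace-separated field. The function name tally_statuses and the sample lines are hypothetical, made up for illustration only; they are not real results.

```python
from collections import Counter


def tally_statuses(lines):
    """Count the second whitespace-separated field (the status) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts


# Made-up sample lines in the same shape as the script's output
sample = [
    "http://example.org/ Valid 0 0",
    "http://example.com/ Invalid 12 3",
    "http://example.net/ FAIL",
]

for status, count in sorted(tally_statuses(sample).items()):
    print(status, count)
```

Counting on the second field keeps the helper independent of how many extra columns (errors, warnings) a line carries.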

Before I give the results, repeat after me 10 times: the html 5 conformance checker is in beta, which means it is not stable and still being tested. The html 5 specification is a Working Draft, which means it is highly likely to change. The test covers only the home page of each site.

The January 2008 file contains 485 web sites. 23 (4.7%) could not be validated, most of the time because the site was too slow. Only 4 (< 1%) sites were declared valid html 5 by the conformance checker. If Henri Sivonen could do the same thing with his instance of the html 5 conformance checker, that would help to know whether my results are silly or in the right ballpark.
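
For anyone who wants to double-check the percentages quoted in this post, a tiny sketch follows. The helper name share is hypothetical; the counts are the ones given in the text.

```python
def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


print(share(32, 487))  # Brian's survey: pages that passed validation
print(share(23, 485))  # this run: sites that could not be validated
print(share(4, 485))   # this run: sites declared valid html 5
```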
