Alexa Global Top 500 against HTML 5 validation

Last month, Brian Wilson published a survey on validation. He took the URIs of the top 500 sites given by Alexa and sent them to the W3C Markup Validator. Brian concluded that 32 of the 487 URLs passed validation (6.57%). Recently, W3C created a beta instance of an html 5 conformance checker.

So today I decided to take the January 2008 list of web sites and send them to the beta instance of the html 5 conformance checker. I created a very simple Python script (as usual, if my code horrifies you, any kind suggestion to improve it is welcome). Be careful: you will need to install httplib2. The file alexa.txt contains the list of URIs, one per line. To be sure to check against html 5, I forced the html 5 doctype.

import time

import httplib2

# Cache responses locally in the ".cache" directory
h = httplib2.Http(".cache")

# alexa.txt holds one URI per line
with open("alexa.txt", "r") as f:
    urllist = f.readlines()

for url in urllist:
    url = url.strip()
    if not url:
        continue  # skip empty lines
    # Force the html 5 doctype so every page is checked as html 5
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
    resp, content = h.request(urlrequest, "HEAD")
    if resp['x-w3c-validator-status'] == "Abort":
        print(url, "FAIL")
    else:
        print(url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings'])
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
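
One fragile spot in a script like this is the raw string concatenation of the target URI into the checker URL: a site address that carries its own query string could be misread by the checker. Here is a minimal sketch of a safer way to build the request URL, assuming Python 3 and the same endpoint as in the script above. The helper name build_check_url is my own invention for illustration, not part of httplib2 or the validator.

```python
from urllib.parse import urlencode

# Endpoint copied from the script above; the beta instance may move or disappear.
CHECKER = "http://qa-dev.w3.org/wmvs/HEAD/check"


def build_check_url(uri, doctype="HTML5"):
    """Return the checker URL for `uri`, percent-encoding the query values.

    build_check_url is a hypothetical helper, not an API of the validator.
    """
    query = urlencode({"doctype": doctype, "uri": uri})
    return CHECKER + "?" + query


# A URI with its own query string stays a single `uri` parameter
print(build_check_url("http://example.com/?a=1&b=2"))
```

The result could be passed to h.request(...) in place of the concatenated string.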

Before I give the results, repeat after me 10 times: the html 5 conformance checker is in beta, which means not stable and still in testing. The html 5 specification is a Working Draft, which means it is highly likely to change. The test covers only the home page of each site.

The January 2008 file contains 485 web sites. 23 (4.7%) could not be validated; most of the time, the site was too slow. Only 4 (< 1%) sites were declared valid html 5 by the conformance checker. If Henri Sivonen could do the same thing with his instance of the html 5 conformance checker, that would help me know whether my results are silly or in the right ballpark.
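
For anyone who wants to redo the counting, here is a minimal sketch of how the script's output could be tallied, assuming Python 3 and output lines of the form "<url> FAIL" or "<url> <status> <errors> <warnings>". The sample lines are made up for illustration and are not survey data; the function names tally and percentage are hypothetical.

```python
from collections import Counter


def tally(lines):
    """Count the second field (FAIL or the validator status) of each output line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts


def percentage(part, total):
    """Share of `part` in `total`, as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Hypothetical sample output, NOT real survey results
sample = [
    "http://example.com/ Valid 0 0",
    "http://example.org/ Invalid 12 3",
    "http://example.net/ FAIL",
    "http://example.edu/ Invalid 40 1",
]

counts = tally(sample)
total = sum(counts.values())
for status, n in counts.items():
    print(status, n, percentage(n, total))

# The two figures quoted above: 23 and 4 sites out of 485
print(percentage(23, 485))  # 4.74
print(percentage(4, 485))   # 0.82
```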

11 thoughts on “Alexa Global Top 500 against HTML 5 validation”

  1. I am surprised that there are sites which use the HTML5 doctype. All other sites fail HTML5 validation right from the doctype.

  2. Please ignore my previous comment. I run a local v.nu instance, which does not override the document doctype as yours does.

    I set Preset to HTML5 (experimental) and Parser to HTML5, but the original doctype is still used.

  3. Interesting… I thought one of the design principles of HTML5 was paving the cowpaths and being backward-compatible, which, perhaps via simplistic logic, meant that anything conforming to an older version of HTML would still conform to HTML5. In other words, the number of sites passing the html5 check should be higher, not lower, than with the DTD-based validation. At least that’s my assumption.

    It would be interesting to look in more detail at which errors made the html5 checking fail. That would clarify whether html5 has drifted from earlier versions of html, whether the checker has bugs, etc. Concatenate the XML output(s) of validation and look into that?

  4. More than 99% (minus 4.7%, sorta) of sites without HTML 5 doctypes don’t validate as HTML 5? I think more tests may be needed here. Do pages with XHTML Strict tend to validate or not validate as HTML 4.01? Time permitting, maybe pages with HTML 5 doctypes could be checked for validation against HTML 5. ;)

    In all seriousness, this is interesting information that’s difficult for me to apply to a question. Is this roughly the same as searching current pages for any use of a tag/attribute that is planned to change in HTML 5? Is it fair to read between the lines that this test indicates that over 99% of pages would need to be changed in some way to conform to HTML 5 as it stands?

  5. Am I the only one who thinks this is utterly pointless? What exactly is the point of validating pages to a doctype they don’t even claim to implement? That’d be as stupid as calling valid HTML pages invalid because they aren’t well-formed XML, even when they make no claim to be.

    Sure, most of these pages aren’t valid anything, but I’m sure most of them aren’t even aiming for HTML5. They’re much more likely to be some sort of HTML4 or XHTML or at least aiming for those two standards. (If, of course, the developers care at all. Sadly, many don’t.)

    I support validation, and standards, and HTML5, but this is a waste of time and brains. I’m of the opinion that any page that triggers quirks mode should just be considered invalid anyway and ignored, which is probably 486 (or so) of the 500 sites on the list.

  6. @Levi,

    HTML 5 is supposed to be designed by looking at content as it is deployed on the Web, so as to understand the practices of users and tools and to avoid creating too much discrepancy.

    The goal of this validation was to show that, in fact, switching to html 5 will require a lot of effort from users. Do not forget that html 5 is a moving target, so this study, which didn’t take a long time (just the time to develop the script ;) ), could be run again anytime depending on the status of html5.

  7. Very interesting article. A good Alexa ranking is always a good calling card, and although it is not 100% representative or reliable, it indicates that we are doing things right with our web page.

  8. I support HTML5 and, of all industries, ours is certainly one where everything is constantly being improved and upgraded. While some may argue that HTML5 is being pushed too far ahead of its development maturity, I’m glad to see that things are being driven forward!

    We have created sites that meet the current HTML5 standards and it did not take an overwhelming effort.

  9. @Karl,

    The Conformance Checker has improved greatly since your comment. Either version. Perhaps Brian Wilson would run a new report with the current Alexa 500: how interesting would the results be compared with the previous report?

  10. Just curious as to how many errors the top 100 online businesses, ranked by revenue, receive when run through W3C validation?

    What’s the point of this again?
