Preventing non-www hostnames from being indexed

W3C’s main web site www.w3.org is load-balanced over a number of servers using the excellent HAProxy load balancing software. Before we discovered HAProxy we used round-robin DNS for a number of years, with www.w3.org rotated among a number of servers with names like web1.w3.org, web2.w3.org, etc.

One unfortunate side effect of this practice was that these individual hostnames would get picked up by Google and other site crawlers, with undesirable consequences such as increased server load, diluted pagerank, less effective caching, and broken links when old hostnames went out of service.

We came up with a simple way to avoid this issue: we created a file called nobots.txt that forbids all crawling/indexing by bots, and started returning it instead of our usual robots.txt file when the HTTP host is not www.w3.org, using this rule in our Apache config:

RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
RewriteRule ^/robots\.txt$ /nobots.txt [L]
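
The nobots.txt file itself is just a robots exclusion file that forbids everything, in standard robots.txt syntax:

User-agent: *
Disallow: /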

This prevents our site from being indexed at individual hostnames that may come and go over time.

Of course, the best way to indicate a site’s preferred hostname is to issue real HTTP redirects pointing to the right place; we didn’t do that in this case because we wanted to keep the ability to view our site on a specific server, for example to verify that our web mirroring software is working correctly or to work around some temporary network issue. We do issue HTTP redirects in many other cases, e.g. to redirect w3.org or w3c.org to www.w3.org.
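
As a sketch of what such a redirect looks like (VirtualHost details simplified here; our actual configuration differs), redirecting the bare w3.org hostname might be done like this in Apache:

<VirtualHost *:80>
    ServerName w3.org
    # Send everything to the canonical hostname with a permanent redirect
    Redirect permanent / http://www.w3.org/
</VirtualHost>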

List-Id: for filtering mail from W3C lists

There are a number of email message headers that people may use to filter messages from W3C lists. The best header to use for this purpose is List-Id: (specified in RFC 2919), because this header is intended to remain constant throughout the lifetime of a list.

Using other headers to filter mail from our lists may not work as reliably, because the headers may change when we make changes to our email infrastructure or mailing list software.

Today we made a change to our list software to cause it to stop generating Sender: headers, because doing so was causing DKIM signatures to fail in some cases, resulting in messages bouncing and people being removed from our mailing lists.

If your email filters were based on the Sender header, please update them to use List-Id instead, as this will provide a more reliable and future-proof means of filtering.
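
For example, a Sieve filter keyed on List-Id for our public-qa-dev list would look like this (the folder name is hypothetical; adapt both to your own setup):

require ["fileinto"];

# Match on the stable List-Id header rather than Sender
if header :contains "list-id" "public-qa-dev.w3.org" {
    fileinto "w3c/public-qa-dev";
}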

We have a wiki page that provides info on how to filter messages based on List-Id in various email clients; additions welcome!

Why we chose Mercurial as our Decentralized Versioning System

After having asked the community about which decentralized versioning system W3C should use, we made our DVCS platform operational a few months ago, powered by Mercurial. The obvious contender was Git — here are some of the reasons why we picked Mercurial over Git, in the hope that this analysis can be useful to others.

Choosing a DVCS platform in an organization such as W3C is a lot more complicated than just following the trend, which seems to be in favor of Git today. For example, we had to take into account the tools that come along with the system and the clients available on the various operating systems used by the diverse set of Working Group participants and open source developers, as well as the support offered in third-party software. Mercurial clearly has an edge over Git there, especially in terms of supported clients on Windows.

A lot of our users are able to use the most advanced features of both DVCSes, but a significant portion of them just want to share files, easily keep track of what they are doing, deal with conflicts, and possibly be guided through a graphical user interface.

We haven’t heard any request for fancy history modification features: while Git is clearly more powerful in letting you manipulate your history, we liked the simplicity of the Mercurial command set and the availability of several GUIs.

We will have to administer a growing number of repositories, and we knew before launching the service that we would get a lot of requests to tweak their configurations. As a result, the Systems Team needed something easy to set up, easy to hack on, and easy to integrate with our existing systems. Indeed, we couldn’t rely on any external hosting service like GitHub or BitBucket, as we wanted all our repositories to be tied to our existing W3C accounts (through LDAP).
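
To give a rough idea of why this felt simple (a sketch only, with hypothetical paths and LDAP URL, not our actual configuration): publishing a directory of repositories with Mercurial’s bundled hgweb CGI script and authenticating pushes against an LDAP directory takes only a few lines of Apache configuration.

ScriptAlias /hg "/var/hg/hgweb.cgi"

<Location /hg>
    AuthType Basic
    AuthName "Mercurial repositories"
    AuthBasicProvider ldap
    AuthLDAPURL "ldap://ldap.example.org/ou=people,dc=example,dc=org?uid"
    # Let anyone read, but require an LDAP account to push
    <LimitExcept GET HEAD>
        Require valid-user
    </LimitExcept>
</Location>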

Looking into the future, we also verified that Mercurial integrates well with Trac, which we are considering using for some of our open source projects such as our Validators and Unicorn.

Mercurial has offered all those tools for a long time and lots of tutorials are easy to find on the Web.

To summarize, we liked both systems a lot. On the one hand, Git seemed more powerful, with more features, but offered more than what we needed. On the other hand, Mercurial came with more tools and seemed genuinely simpler, both for our users and for us, in terms of integration with other tools and services.

Validator 0.8.6 release wraps up a year’s worth of bug fixes and code clean-up

Over the past year, millions of pages were validated every day using W3C’s services. We’d like to thank those who keep the service running smoothly by helping other users, carefully reporting problems, and developing code. We’d especially like to thank Ville Skyttä for his continuing maintenance of the validator’s check perl script and Henri Sivonen for his continuing work on the validator.nu code.

Changes in the 0.8.6 release notes include:

The validator.nu codebase has seen over 100 fixes and enhancements since our 0.8.5 release in March 2009.

Integration with Unicorn, a project which aims to provide the big picture about the quality of a Web page by gathering the results of several tools, has improved. Unicorn development continues with a recent release of its own and will be a focus in the coming months.


Authors: Dan Connolly and Ted Guild

Decentralized versioning system at W3C

We’ve heard from several groups and individuals that they would like W3C to host a public decentralized versioning repository for W3C-related work items, such as editors drafts, test suites, tools and software.

The goal of such a repository would be to host the reference versions of these items, while allowing as many people as possible to modify, branch, and patch the content of the repository, without the hurdles that CVS creates for this kind of cooperation.

As we look into experimenting with such a service, we are hitting the question that many others have encountered in the process: which decentralized versioning system should we choose?

The two main contenders seem to be Git and Mercurial: Git seems to have a growing number of tools and more advanced features, while Mercurial seems to be easier to use, and possibly easier to set up on a larger number of platforms. Here are some of the comparisons we found in our early investigations.

We’re interested to hear feedback on this question, in particular in the form of sharing experience of using them (inside or outside of the W3C community), and pros and cons of both systems.

We’re trying to have that discussion on our publicly archived mailing list public-qa-dev@w3.org (where I sent a message similar to this post), but feel free to use blog comments if you find that more practical.

Tracking requests

W3C is a fifteen-year-old organization where plenty of people come to collaborate, with wide variation among them in terms of operating systems, computer proficiency, corporate setup, etc. A number of our users manage to be even geekier than we are in the Systems Team, while for many others, a computer is just a tool that really ought to “just work” (which it still often doesn’t).

The result of that interesting mix is that we have set up, over time, a fairly large number of tools to facilitate collaboration: several hundred mailing lists, an IRC server with a few handy bots, a fine-grained access control system, a questionnaire system, a flexible editing system combined with a robust mirroring scheme, wikis, blogs, various bug and issue tracking systems, etc.

And as all these tools are entirely bug-free and work seamlessly together and for all our users (NOT!), we have always had a need to track requests from our users to create and manage various accounts, set up new instances of these tools, correct or work around bugs, etc.

Over the years, the way we have tracked and managed all these requests has evolved, toward somewhat more formalism as the number of our users and tools grew.

When I joined the Systems Team in 2000, our tracking was entirely based on manual scanning of mail threads sent to one of our internal mailing lists: basically, each of us watched for what looked like a request and checked whether anybody had responded to it; if nobody had, and the request was something you could manage and you had some time on your hands, you would take it on.

Given how informal it was, it actually worked quite well, although over time we added some purely conventional practices: requiring the use of a specific mailing list for a specific type of request, or adding a [closed] flag to the subject of a thread to signal that the request had been dealt with (so that others could simply delete the thread without bothering to read it).

But as the number of tools and users continued to grow, we started to get complaints that some requests had not been dealt with at all, and it became clear that we lacked overall visibility into what still needed a response.

Back in 2002, I wrote a quick XSLT style sheet that helped us get more visibility into the state of our requests: it took the threaded view of the archive of our request mailing list, looked for any thread that didn’t contain a message starting with our now-conventional [closed] flag, and presented a report showing all the requests that hadn’t been closed, as well as those that hadn’t had an answer at all.
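
A minimal sketch of that approach, assuming the threaded archive view can be read as XML with one thread element per thread and a subject attribute per message (the real archive markup differed):

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Report every thread with no message whose subject carries
       the conventional [closed] flag. -->
  <xsl:template match="/archive">
    <ul>
      <xsl:for-each select="thread[not(message[starts-with(@subject, '[closed]')])]">
        <li>
          <xsl:value-of select="message[1]/@subject"/>
          <!-- A thread with a single message never got an answer. -->
          <xsl:if test="count(message) = 1"> (no answer yet)</xsl:if>
        </li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>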

And again, that fairly simple system served us well for quite a few years; some other W3C groups even started re-using it for tracking their issues, and a similar version of the tool, based on more Semantic Web technologies, was used by a couple of groups to track their specifications’ Last Call comments.

But no matter how well that solution worked, we decided last year that we would finally move to a proper ticket-tracking system, the well-known open-source RT, to get the following advantages over our existing hack:

  • get a clear view of who was working on which ticket;
  • be able to assign a given ticket to someone, even if that person hadn’t picked it up yet;
  • easily find tickets that were stalled due to lack of response from the requester (as opposed to stalled because we hadn’t acted on them).

The first few months with RT weren’t quite so rosy, actually, as we had to find ways to integrate it as smoothly as possible in our current procedures and infrastructures, and with our mailing list habits.

Some of the changes we’ve brought to it include:

  • make it track messages sent as part of a given thread (as identified by the In-Reply-To header) as belonging to the same ticket, even when the ticket id number is not included in the subject – with an existing patch to that end;
  • change the way it modifies message subjects (with the Branded Queues extension);
  • partially fix the way it sends messages and notices through the configuration UI;
  • make it understand our [closed] convention so that we could continue using mail as our primary way to close a ticket, using a simple “Scrip” inspired by another RT user’s contribution (sketched below).
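
A rough sketch of such a Scrip, in RT 3.x-era Perl (the regular expression and status are simplified compared to what we actually run):

# Custom condition code: fire on incoming mail whose subject
# carries the [closed] flag.
my $txn = $self->TransactionObj;
return 0 unless $txn->Type eq 'Correspond';
my $attachment = $txn->Attachments->First;
return 0 unless $attachment;
my $subject = $attachment->GetHeader('Subject') || '';
return $subject =~ /\[closed\]/i ? 1 : 0;

# Custom action (commit) code: resolve the ticket the message
# belongs to.
$self->TicketObj->SetStatus('resolved');
return 1;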

There are still some rough edges – RT seems particularly reluctant, for some reason, to send carbon copies when using the Web interface; we need to integrate it better with our existing accounts system so that our users can better follow progress on their requests; and a few user interface and HTTP behavior problems still make me cringe.

But overall, I think the tool has certainly helped us regain control over our growing number of requests, and is also hopefully steadily allowing us to offer a better experience to our community.

Validator Dev Watch: fuzzy matching for unknown elements/attributes

Unknown elements and attributes at the top of the validation error charts

According to MAMA’s survey of validation of a few million web pages, the most common validation error is either “There is no attribute X” or “Element X undefined”. In other words, instances where the document uses elements or attributes which are not standard. As explained in the Validator’s documentation of errors, the most likely reasons for these errors are:

  1. Typos. The user wrote <acornym> when what was really meant was <acronym>. I am not sure whether this is the most common error, but it can be a terribly frustrating one. “What do you mean, acronym is not a standard element. Of course it is! Oh, wait, I made a typo…”
  2. Non-standard elements. Again, I don’t have statistics about which elements/attributes trigger this error most of the time, but I would bet on the <embed> element and the target attribute (which, by the way, is only available in Transitional doctypes). For those we can’t do much, other than recommend using another doctype and point to standard ways of using <object> to display Flash content.
  3. Case-sensitive XHTML. This one bites me more often than I’d like to admit. Copy and paste a snippet of code that uses e.g. the onLoad attribute, test the functionality in a few browsers – they will gladly oblige – then see the validator throw an error because, of course, in lowercase XHTML, onLoad isn’t a known attribute; onload is.

What makes these errors frustrating is not so much the difficulty they present. Anyone carefully reading the error message and the explanation that comes with it will easily fix their markup. Unfortunately, for a number of good and bad reasons, few of us ever read the explanations: those tend to be a bit long, propose possible causes for the problem, and a list of potential solutions – and most people will just ignore or gloss over them.

Suggestive power

One way we found of making the validator more user-friendly here is to escalate the most likely solution up into the error message itself. In other words, compare:

Error Line 12, Column 14: there is no attribute “crass”

<spam crass="foo">typos in attribute and element</span>

lengthy explanation here…

with…

Error Line 12, Column 14: there is no attribute “crass”. Maybe you meant “class” or “classid”?

<spam crass="foo">typos in attribute and element</span>

same lengthy explanation here…

The former is what the latest stable release of the markup validator will output. The latter is what I implemented last week, and can be tested on our test instance of the validator.

How is it implemented?

Since the validator is coded in perl, we looked for perl modules implementing algorithms to calculate the edit distance between strings. We found String::Approx, which implements Levenshtein edit-distance matching. Take this algorithm, plug in a list of all known elements and attributes in HTML, and after moderate hacking, my code would very easily find that <spam> should be <span>; some extra tweaking yielded good results suggesting that <acornym> could be corrected to <acronym>.

For some reason, however, I could not find a way to make the String::Approx algorithm reliably suggest onload as a replacement for onLoad – it seems to consider character substitution expensive, regardless of the fact that the substitution is from a character to its uppercase equivalent. A trivial additional test took care of this glitch, and we seem to be all set to have a more usable validator in the upcoming release.
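
Putting the two together, here is a minimal sketch of the approach (the list of known attributes is truncated and hard-coded for illustration; the validator derives the real one from its DTDs):

use strict;
use warnings;
use String::Approx qw(amatch);

# A handful of known HTML attributes; the real list is much longer.
my @known = qw(class classid charset cite code color cols colspan onload);

sub suggestions {
    my ($unknown) = @_;
    # amatch() keeps the entries within the given edit distance
    # ("1" = at most one insertion, deletion or substitution).
    my @close = amatch($unknown, ['1'], @known);
    # The trivial extra test: catch pure case differences
    # (onLoad vs onload) that the matcher prices too high.
    push @close, grep { lc($_) eq lc($unknown) } @known;
    my %seen;
    return grep { !$seen{$_}++ } @close;
}

print join(", ", suggestions("crass")), "\n";   # class, ...
print join(", ", suggestions("onLoad")), "\n";  # onload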

What do you think?

What do you think of this feature? Would you have implemented it differently?

Any suggestion for a better way to word/present the suggested correction for unknown element/attributes? Any thought on other small improvements to the validator which would dramatically improve its usability?

New Link Checker opens series of open source releases

I like writing change logs: that’s usually a sign that a period of development for one of our tools is ending and that we are about to release the product of weeks, sometimes months, of work. This one is a winner… As we release a new version of the W3C Link Checker, the additions to the changelog span almost 400 lines and summarize more than two years of work: a new UI, many bugs fixed, and a generally friendlier and more reliable tool. Kudos to the developers and contributors, including Ville Skyttä, Michael Ernst and Brett Bieber!

For those who would like to play with the new link checker on their own machine (did you know it works as a command-line tool, too?) or server, the open source code for the latest version (4.4) should very soon be available on CPAN.

More W3C tool releases should come very soon. In the meantime, tell us how you like this new release of the link checker in the comments, and get involved to make the next release even better: comments and bug reports are welcome, and so are donations to the W3C validator program.

Validator donation program: the first numbers are in!

A few months ago, when a small group of people gathered to discuss an idea that would later become the W3C Validator Donation and Sponsorship Program, one of the “design constraints” we almost naturally gave ourselves was to strive for transparency. The W3C has sometimes been perceived as closed or secretive – partly because of its structure as an industrial consortium where part of the membership value is early, confidential access to some information, and in spite of the fact that all that confidential information always ends up exposed for all to see through public reviews, as early as the draft stage…

On the other hand, in the case of the validators, where all the work is done through public mailing lists and a public code repository, it has always been much easier to share everything with the community, and it felt natural that the funding of validator work through donations and sponsorships would follow that trend.

I heart Validator

With the donation program entering its 7th week, and more big things to come, it felt like a good time to start analysing some of the data and sharing it with you. All of this is subject to caution, especially given my weak number-crunching abilities and the less-than-stellar chemistry between the PayPal UI and me, but it should be interesting nevertheless.

How much?

If this were a FAQ, there would probably be only a single question: “how much did the program receive so far?” You people are obsessed with money or what? :)

As of today (Jan 26, 2009) our balance is approximately 1900 euros (2500 US$), with a little over 100 donations received. Not all donations were identified as coming from a specific country, but of the 65% that were, a good half came from the USA, followed by Japan, France, and Germany. Japanese donors were, on average, the most generous (around $35 per donation), while the biggest ($250) and smallest (a number of $1 donations from people who probably took the “A dollar for every time validation saved your day?” quip too literally) contributions came from the USA. We only had to refund a couple of people, when the small amount would not have been enough to cover the fees and taxes taken by PayPal on every transaction – but we thank you all for your help!

The few people with whom I’ve shared this figure so far have fairly consistently asked whether the amount received isn’t surprisingly small. Of course, I’d rather the count were in the millions and we could already achieve full independence, but given that we have so far only tapped a small core community of fans and blogger friends, I think the result has been extremely positive, way beyond the raw numbers:

  • What actually mattered in this first phase of the program was to raise awareness of the fact that although the validators are free to use, building and maintaining them is far from free for the W3C. We did not really know how the community would react, and the response has been fantastic. Almost everyone learning about the donation program was supportive, and many started spreading the word through blogs, forums, microblogging, etc.

    The past weeks have also been a great opportunity to have constructive discussions about W3C, its finances and the validators. Questions and misunderstandings, such as whether the W3C is insanely rich or why hosting/bandwidth is not the main cost of the validators (the main cost is human effort) could finally be addressed, and even if the program had not made a penny, for that alone it would have been worth the effort. We need more open dialogue like this around W3C.

  • An unexpected byproduct of the donation program was an incredible rise in goodwill around our open source products. With the global economy in a relatively sad state, a lot of the friends of the validators thought: “I might not have much cash to spare, but maybe I could help?”

    After years of saying, with limited success, that the validators belonged to the community and that their progress depended on the goodwill of all, we’ve seen a renewed activity around the projects, many people bringing a very positive attitude to discussions, development and bug reporting – in the paraphrased words of a famous orator: “Ask not why this validator is not to your liking, ask what you can do to make it better”.

What next?

If things go well, the next few weeks will see us switch gears and push the donation and sponsorship program to another level. Our small team of staff and contributors has been working really hard to prepare new releases of the HTML Validator, CSS Validator and Link Checker, and all three new releases will feature the donation program, as well as our first sponsor(s), prominently:

  1. development version of the Markup Validator
  2. development version of the Link Checker
  3. development version of the CSS Validator

These new releases will also make the value of sponsorship much more obvious, and our small team will keep pushing hard to close a deal with the first batch of sponsors: if you think your company should really be in there, Contact Us! We’re cooking up cool ideas for validator subscriptions, and goodies, too!

This is only the beginning. The past few weeks have shown that a lot of people care about the validators and are ready to help to keep them alive, to keep them growing, to keep them free. We need to keep that good energy going: please keep spreading the word, donate if you haven’t, talk around you about the sponsorship opportunity, and most important, get involved in those projects.

Cumulative Voting with JQuery

W3C has its own system, known as WBS, designed to create, answer and gather results from votes, straw polls, registrations, etc.; I have been developing and maintaining this system over the past six years. I’m hoping to someday find the time and energy to make that tool open source, but given that this requires abstracting away many of its W3C-isms, I’m not holding my breath. WBS supports a variety of question types that can be included in the questionnaires it creates: open comments, radio-selectable lists of choices, checkbox-selectable lists of choices, timetable selectors, etc. It is also (more or less well) designed so that new types of questions can be added reasonably easily. As part of an upcoming survey for W3C Members, it was requested that WBS handle a new type of question, to support cumulative voting.

Screenshot of Cumulative Vote in WBS, without Javascript

Adding the server-side part of that new control was relatively straightforward, thanks to the WBS architecture; but given the nature of cumulative voting, it seemed more or less required to complete it with a client-side layer (read “Javascript”) to ensure a minimum level of usability. Although I started writing Javascript almost as soon as I started writing Web pages, some twelve years ago, I had grown wary of doing it; I had started coding with it lightly again over the past two years, but found it quite cumbersome to write, in particular when trying to do it properly, that is to say with graceful degradation, event binding, proper DOM operations, etc. So I figured that if everyone was raving about these new Javascript frameworks, there might be some reason to it and I had better pay attention – and thus started using JQuery, mostly because it happened to be already set up on the W3C site and I hadn’t heard or read bad things about it.

Screenshot of Cumulative Vote in WBS, with Javascript layer

Sure enough, one hour and a few tutorial pages later, I had written the 30 lines of code needed to support what I had in mind to ease interaction with the cumulative voting. Quite impressive! “Write less, do more” seems to be right on target. The use of CSS selectors to pick the elements on which you want to act is bliss. I did have to tweak the code a bit to improve its performance – my initial code didn’t scale well when the number of options to vote on went up; I fixed it by relying on event delegation, essentially reducing the number of event bindings by attaching the onChange handler to the container of the select elements rather than to each of them (a sketch of this appears at the end of this post). For the sake of illustration, let’s compare the first couple of pieces of the code to what I would have needed to write without JQuery (not to mention cross-browser compatibility issues):

  • Without:
    window.addEventListener('load', function() {
  • With:
    $(document).ready(function() {
  • Without:
    // Adding a line to the list with the total displayed
    var divs = document.getElementsByTagName("div");
    for (var i = 0; i < divs.length; i++) {
      var div = divs[i];
      if (div.className == 'cumulative') {
        var cumulativeSelectors = div.getElementsByTagName("ul");
        for (var j = 0; j < cumulativeSelectors.length; j++) {
          var ul = cumulativeSelectors[j];
          if (ul.className == "compare") {
            var li = document.createElement("li");
            var totalSpan = document.createElement("span");
            var totalInput = document.createElement("input");
            var remainingSpan = document.createElement("span");
            var remainingInput = document.createElement("input");
            totalSpan.appendChild(document.createTextNode('Total'));
            remainingSpan.appendChild(document.createTextNode('Remaining'));
            // class is not a DOM property; className is
            totalSpan.className = remainingSpan.className = 'label';
            totalInput.disabled = remainingInput.disabled = true;
            li.appendChild(totalSpan);
            li.appendChild(document.createTextNode(':'));
            li.appendChild(totalInput);
            li.appendChild(remainingSpan);
            li.appendChild(remainingInput);
            // without this, the new line never makes it into the document
            ul.appendChild(li);
          }
        }
      }
    }
  • With:
    // Adding a line to the list with the total displayed
    $("div.cumulative ul.compare").append(
        "<li><span class='label'>Total</span>: <input disabled='disabled' /> " +
        "<span class='label'>Remaining</span> <input disabled='disabled' /></li>");

And that’s taking the less complex part of the code… I think I’m going to like coding in Javascript again! I would be curious to read the XForms equivalent of this widget – any taker? Possible improvements to my cumulative voting widget include:

  • Packaging it as a JQuery widget – this could possibly be useful to others?
  • Using sliders to give a better visual feedback
  • Annotating the controls with WAI ARIA for improved accessibility and usability
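
For the curious, here is roughly what the event-delegation fix mentioned above looked like (the handler name recomputeTotals and the markup details are hypothetical reconstructions, not the actual WBS code):

// Before: one change handler per select element – does not scale.
$("div.cumulative select").change(recomputeTotals);

// After: a single delegated handler on the container; the change
// event bubbles up from whichever select was modified. (Note that
// the change event does not bubble in some older browsers.)
$("div.cumulative").bind("change", function(event) {
    if (event.target && event.target.nodeName.toLowerCase() === "select") {
        recomputeTotals();
    }
});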