Role
W3C maintains an internationally distributed network of servers and services that support Public, Member, and Team audiences in pursuit of the Consortium's technical and social objectives. W3C uses these systems to manage its Activities and Working Groups according to W3C Process.
Design
W3C's systems infrastructure is based almost completely on open source software running on Debian GNU/Linux servers. Many of our tools are built using the popular LAMP platform (Linux, Apache, MySQL, and the Perl, PHP, and Python scripting/programming languages).
Status
We try our best to document known outages and disruptions to our services on the Web; should you encounter a problem with one of our services not documented on that page, please let us know at <web-human@w3.org>.
W3C Systems Team Submits Application to GSOC
Posted on 11, March 2009, by Ted Guild in Homegrown tools
Cumulative Voting with JQuery
Posted on 20, January 2009, by Dominique Hazael-Massieux in Homegrown tools
W3C has its own system designed to create, answer and gather results from votes, strawpolls, registrations, etc., known as WBS; I have been developing and maintaining this system over the past six years - I'm hoping to find the time and energy someday to make that tool open source, but given that this requires abstracting many of its W3C-isms, I'm not holding my breath.
WBS knows a variety of types of questions that can be included in the questionnaires it creates: open comments, radio-selectable lists of choices, checkbox-selectable lists of choices, timetable selectors, etc. It is also (more or less well) designed so that new types of questions can be added reasonably easily.
As part of an upcoming survey for W3C Members, it was requested that WBS handle a new type of question to support cumulative voting.

Adding the server-side part of that new control was relatively straightforward, thanks to the WBS architecture; but given the nature of cumulative voting, it seemed more or less required to complete it with a client-side layer (read "Javascript") to ensure a minimum level of usability.
Although I started writing Javascript almost as soon as I started writing Web pages some twelve years ago, I had grown wary of doing it; I restarted coding with it lightly over the past two years, but had found it quite cumbersome to write, in particular when wanting to do it properly, that is to say with graceful degradation, event binding, proper DOM operations, etc.
So I figured that if everyone is raving about these new Javascript frameworks, there may be some reason for it and that I had better pay attention - and thus started using JQuery, mostly because it happened to be already set up on the W3C site and because I hadn't heard or read bad things about it.

Sure enough, one hour and a few tutorial pages later, I had written the 30 lines of code that were needed to support what I had in mind to ease the interactions with the cumulative voting. Quite impressive! "Write less, do more" seems to be right on target. The usage of CSS selectors to pick the elements on which you want to act is bliss.
I did have to tweak the code a bit to improve its performance - my initial code didn't scale well when the number of options on which to vote went up; I fixed it by relying on event delegation, essentially reducing the number of event bindings by attaching the onChange event to the container of the select elements rather than to each of them.
For the sake of illustration, let's compare the code to the one I would have needed to write without using JQuery (and not mentioning cross-browser compatibility issues) for the first couple of lines:
| Without | With |
|---|---|
| (code sample not preserved) | (code sample not preserved) |
(syntax highlighter credits go to Enjoy*Study.)
And that's taking the less complex part of the code…
I think I'm going to like coding in Javascript again! I would be curious to read the XForms equivalent of this widget - any takers?
Possible improvements to my cumulative voting widget include:
- Packaging it as a JQuery widget - this could possibly be useful to others?
- Using sliders to give a better visual feedback
- Annotating the controls with WAI ARIA for improved accessibility and usability
W3C's Excessive DTD Traffic
Posted on 8, February 2008, by Ted Guild in Homegrown tools
If you view the source code of a typical web page, you are likely to see something like this near the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
and/or
<html xmlns="http://www.w3.org/1999/xhtml" ...>
These refer to HTML DTDs and namespace documents hosted on W3C's site.
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350 Mbps, for resources that haven't changed in years.
The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.
Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.
A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)
We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.
We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:
- Pay attention to HTTP response codes
This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.
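As a minimal sketch of what checking return codes can look like, the helper below maps an HTTP status code to a client decision. The function name `next_action`, the returned labels, and the one-hour default back-off are all assumptions made for illustration; it also only handles the numeric (seconds) form of the Retry-After header, not the date form.

```python
from typing import Optional

DEFAULT_BACKOFF_SECONDS = 3600  # assumed default; pick what suits your application


def next_action(status: int, retry_after: Optional[str] = None) -> str:
    """Decide what to do with an HTTP response (hypothetical helper)."""
    if 200 <= status < 300:
        return "use"          # success: process the body
    if status == 304:
        return "use-cached"   # not modified: keep the local copy
    if status in (429, 503):
        # The server is asking us to slow down: wait before trying again,
        # honoring Retry-After when it is given as a number of seconds.
        if retry_after is not None and retry_after.strip().isdigit():
            return f"wait {int(retry_after)}s"
        return f"wait {DEFAULT_BACKOFF_SECONDS}s"
    return "give-up"          # any other status: do not retry in a loop
```

The point is simply that a 503 leads to waiting, never to an immediate retry of the same request.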
- Honor HTTP caching/expiry information
Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)
Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.
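To make the idea concrete, here is a rough sketch of a freshness check based on the `max-age` directive of a Cache-Control header. The function `is_fresh` is hypothetical, and it simplifies real HTTP caching considerably: it measures age from the local fetch time and ignores the Expires, Age and Date headers, so treat it as an illustration rather than a complete cache.

```python
import re
import time
from typing import Optional


def is_fresh(cache_control: str, fetched_at: float,
             now: Optional[float] = None) -> bool:
    """Return True if a cached copy may be reused without a new request."""
    now = time.time() if now is None else now
    directives = cache_control.lower()
    if "no-store" in directives or "no-cache" in directives:
        return False
    match = re.search(r"max-age\s*=\s*(\d+)", directives)
    if match is None:
        return False  # no explicit lifetime: revalidate instead of guessing
    return (now - fetched_at) < int(match.group(1))


NINETY_DAYS = 90 * 24 * 3600  # 7,776,000 seconds

# A DTD fetched one day ago with a 90-day lifetime is still fresh.
still_fresh = is_fresh(f"max-age={NINETY_DAYS}", fetched_at=0, now=86400)
```

With a check like this in front of every fetch, a resource that declares a 90-day lifetime is requested at most a handful of times per year.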
- If you implement HTTP in a software library, allow for caching
Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.
Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.
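The mapping an XML catalog performs can be illustrated with a plain dictionary. This is not the catalog file format itself, only a sketch of the lookup; the local directory and the function name `resolve_local` are assumptions.

```python
from pathlib import Path
from typing import Optional

# Hypothetical directory holding local copies of the DTDs.
LOCAL_DTD_DIR = Path("/usr/local/share/dtd")

# Map each remote system identifier to its locally stored copy.
CATALOG = {
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
        LOCAL_DTD_DIR / "xhtml1-strict.dtd",
}


def resolve_local(system_id: str) -> Optional[Path]:
    """Return the local path for a known URI, or None if it is not mapped."""
    return CATALOG.get(system_id)
```

A parser hook that consults such a mapping first never needs to touch the network for the DTDs it already has on disk.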
- Take responsibility for your outgoing network traffic
If you install software that interacts with other sites over the network, you should be aware of how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.
- Don't fetch stuff unless you actually need it
Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!
- Identify your user agents
When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1, which provide no information on the actual software responsible for making the requests.
Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.
It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see, for example, How to change the User-Agent of Python's urllib.
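As one possible illustration in present-day Python, the standard `urllib.request.Request` class accepts a `headers` mapping. The product name and contact URL below are placeholders, and the request is only constructed here, not sent.

```python
import urllib.request

# Placeholder identity: replace with your product name and a real contact URL.
USER_AGENT = "ExampleValidator/1.0 (+https://example.org/contact)"


def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the software making it."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")
# urllib stores header names capitalized, hence "User-agent" in the lookup.
agent = request.get_header("User-agent")
```

Passing `request` to `urllib.request.urlopen` would then send the custom header instead of the generic Python-urllib one.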
We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:
Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?
You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.
At the W3C Systems Team's request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.
Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?
What do other medium/large sites do to detect and prevent abuse?
We are not alone in receiving excessive schema and namespace requests; take, for example, the stir when the DTD for RSS 0.91 disappeared.
For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.
Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?
Should we just ignore the issue and serve all these requests?
What if we start receiving 10 billion DTD requests/day instead of 100 million?
Authors: Gerald Oskoboiny and Ted Guild