Role
W3C maintains an internationally distributed network of servers and services that support Public, Member, and Team audiences in pursuit of the Consortium's technical and social objectives. W3C uses these systems to manage its Activities and Working Groups according to W3C Process.
Design
W3C's systems infrastructure is based almost entirely on open source software running on Debian GNU/Linux servers. Many of our tools are built on the popular LAMP platform (Linux, Apache, MySQL, and the Perl, PHP, and Python scripting/programming languages).
Status
We try our best to document known outages and disruptions to our services on the Web; should you encounter a problem with one of our services not documented on that page, please let us know at <web-human@w3.org>.
W3C's Excessive DTD Traffic
Posted on 8, February 2008, by Ted Guild in Homegrown tools
If you view the source code of a typical web page, you are likely to see something like this near the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
and/or
<html xmlns="http://www.w3.org/1999/xhtml" ...>
These refer to HTML DTDs and namespace documents hosted on W3C's site.
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.
The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.
Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.
A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)
We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.
We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:
Pay attention to HTTP response codes
This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.
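As a minimal sketch of what checking return codes can look like, the Python below maps a status code to an action and applies it to a single fetch. The function names and the three action labels are illustrative assumptions, not part of any W3C tool; the point is only that a 503 should lead to backing off rather than to an immediate repeat of the same request.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code to an action label (labels are illustrative)."""
    if 200 <= status < 300:
        return "use"
    if status == 304:
        # Not Modified: the copy we already hold is still good.
        return "use_cached"
    if status in (429, 503):
        # The server is asking us to slow down; do not hammer it.
        return "retry_later"
    # Other error responses will not improve by repeating the same request.
    return "give_up"


def fetch_once(url, opener=urllib.request.urlopen):
    """Fetch url once and return (action, body); opener is injectable for tests."""
    try:
        with opener(url) as response:
            return classify_status(response.status), response.read()
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for error statuses; the code is still available.
        return classify_status(err.code), None
```

A caller that receives "retry_later" would wait (honoring any Retry-After header the server sent) before trying again, and one that receives "give_up" would log the failure instead of looping.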
Honor HTTP caching/expiry information
Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)
Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.
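To make the expiry point concrete, here is a simplified Python sketch that works out how long a response may be reused from its Cache-Control and Expires headers. It is an illustration under stated assumptions, not a complete HTTP cache: the function name is hypothetical, the headers are passed in as a plain dict, and details such as Age, Vary, and heuristic freshness are ignored.

```python
import re
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return how many seconds a response may be reused, or 0 if unknown.

    headers is a plain dict of response header names to values.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    cache_control = lowered.get("cache-control", "")
    if re.search(r"\bno-(store|cache)\b", cache_control):
        return 0
    match = re.search(r"\bmax-age=(\d+)", cache_control)
    if match:
        return int(match.group(1))

    # Fall back to Expires minus Date when max-age is absent.
    if "expires" in lowered and "date" in lowered:
        try:
            expires = parsedate_to_datetime(lowered["expires"])
            date = parsedate_to_datetime(lowered["date"])
            return max(0, int((expires - date).total_seconds()))
        except (TypeError, ValueError):
            # Unparseable or mismatched dates: treat as not cacheable.
            return 0
    return 0
```

For a resource served with a 90-day expiry this returns 7776000 seconds, so a client that stores the response has no need to contact the server again during that period.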
If you implement HTTP in a software library, allow for caching
Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.
Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.
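The idea behind a catalog, mapping a well-known URI to a local copy, can be sketched in Python with the standard library's SAX entity resolver interface. This is a simplified stand-in for a real XML catalog, not an implementation of one; the class name and the local path are hypothetical and would need to match wherever the DTD copies are actually stored.

```python
from xml.sax.handler import EntityResolver

# Hypothetical local path; adjust to wherever the DTD copies are stored.
LOCAL_COPIES = {
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
        "/usr/local/share/dtd/xhtml1-strict.dtd",
}


class LocalCopyResolver(EntityResolver):
    """Resolve known system identifiers to local files instead of the network."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def resolveEntity(self, publicId, systemId):
        # Return the local path when we have one; otherwise leave it unchanged.
        return self.mapping.get(systemId, systemId)
```

A parser obtained from xml.sax.make_parser() accepts such an object through its setEntityResolver method, so lookups for the listed identifiers never leave the machine.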
Take responsibility for your outgoing network traffic
If you install software that interacts with other sites over the network, you should be aware of how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.
Don't fetch stuff unless you actually need it
Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!
Identify your user agents
When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1, which provide no information on the actual software responsible for making the requests.
Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.
It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see, for example, How to change the User-Agent of Python's urllib.
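As a brief illustration with Python's standard urllib.request module, the sketch below builds a request carrying an identifying User-Agent. The product name, version, and contact URL are placeholders invented for this example and should be replaced with details of the real software.

```python
import urllib.request

# Hypothetical product name, version, and contact URL; replace with your own.
USER_AGENT = "ExampleFeedChecker/1.0 (+http://example.org/contact)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies the calling software."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the returned object to urllib.request.urlopen sends the custom header in place of the generic Python-urllib default.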
We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:
Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?
You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.
At the W3C Systems Team's request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.
Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?
What do other medium/large sites do to detect and prevent abuse?
We are not alone in receiving excessive schema and namespace requests; take, for example, the stir when the DTD for RSS 0.91 disappeared.
For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.
Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?
Should we just ignore the issue and serve all these requests?
What if we start receiving 10 billion DTD requests/day instead of 100 million?
Authors: Gerald Oskoboiny and Ted Guild
Upgrading to Apache 2.2.8 using debian and ssh
Posted on 31, January 2008, by Jose in Open Source Software
I've just upgraded all our production servers to Apache 2.2.8 from 2.2.4. This operation requires adding three custom patches we made to solve some issues with Apache. Applying these patches and updating the servers is quite easy when you use debian and ssh. Without going into specific detail, here's an overall view of the steps that need to be taken.
- Get the source files. Go to the debian package page for Apache 2.2.8. Yep, this is for debian unstable. We're going to backport it to debian stable, which is what we use on our production servers. From that page, download the original tar file and the diff file. Note that you could also get the original tar file (or a more recent one) from the Apache ASF httpd home page
- Create the debian source package. Unpack the tar file and rename the target dir from http2 to apache2-2.2.8. Uncompress the diff file and apply it using patch -p0 < diff_file_name. This will create the environment for compiling the debian package.
- Install the custom patches. Go to apache2-2.2.8/debian/patches and copy the patches there. I generate the patches using diff -c. I had to prefix each patch with a specific debian prologue; use the other patches as a model. Each patch file has to have a number and a name describing what it does. Once you add the patches, you need to update the file 00list so that it includes them. Just do ls [0-9]??_* > 00list. Note that if you're upgrading to a new apache2 version, you need to make sure each patch is still needed and adjust it accordingly.
- Generate the debian package. cd to apache2-2.2.8 and invoke dpkg-buildpackage. Don't forget to do a chmod +x debian/rules beforehand.
You're all set up. Time to complete the operation? Just some minutes. Complexity? If you already have the patches, none. Results: fresh .deb packages with our custom patches.
To roll out our apache2 custom .deb packages, I put them in our local debian apt repository. All of our production servers point to it. Moreover, in each server, the respective apache2 packages are put on hold and are pinned to our apt server. This makes sure that they will only be updated when we explicitly request so, and that they will come from our repository.
In each server I have a local script that will do the apt update and install of the apache2 packages. From the comfort of my main box, I go through the list of servers invoking the local script through ssh. Time and effort to roll out the new version? Negligible. This presumes I'd already tried out the server and patches on a test server.
List of apache 2.2 patches available
Posted on 31, January 2008, by Jose in Open Source Software
As the rollout period for patches submitted to apache-dev can often be long, I have put together a list of all our contributed patches and the issues they solve so that other apache users can profit from them. All these patches and test cases were sent to the Apache server's bugzilla. You will actually download each patch from its bugzilla entry.
All of these patches are being used without problems in our W3C production servers.