Role

W3C maintains an internationally distributed network of servers and services that support Public, Member, and Team audiences in pursuit of the Consortium's technical and social objectives. W3C uses these systems to manage its Activities and Working Groups according to W3C Process.

Design

W3C's systems infrastructure is based almost completely on open source software running on Debian GNU/Linux servers. Many of our tools are built on the popular LAMP platform (Linux, Apache, MySQL, and the Perl, PHP, and Python scripting/programming languages).

Status

We try our best to document known outages and disruptions to our services on the Web; should you encounter a problem with one of our services not documented on that page, please let us know at <web-human@w3.org>.

Email address obfuscation in mailing list archives

Our mailing lists and their online archives are among W3C's most important system services. Thousands of people participate in these lists, and the archives now contain millions of messages dating back to 1991.

These archives are an essential resource for groups collaborating on standards work, to build shared context and record the history behind various decisions.

Many of our discussion lists are public, with public archives; unfortunately one of the side effects of making these archives public is that the email addresses contained therein can be harvested by spammers looking for new victims.

Occasionally we receive requests to obfuscate or remove email addresses from our archives in an attempt to prevent or delay spammers from harvesting these addresses. We have implemented simple measures to foil the simplest harvesters (replacing each character of the domain name with its corresponding HTML entity) but have been reluctant to remove the email addresses completely because they are an essential part of the record.
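As a rough illustration of the simple measure described above, here is a minimal Python sketch that replaces each character of the domain with an HTML entity. It assumes numeric character references and leaves the local part and the @ sign untouched; the function name is hypothetical, and this is not necessarily how our archive software does it.

```python
import html


def obfuscate_domain(address):
    """Replace each character of the domain part with a numeric HTML entity."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # no "@" found, so leave the text untouched
    encoded = "".join(f"&#{ord(ch)};" for ch in domain)
    return f"{local}@{encoded}"


def decode(markup):
    """What a slightly smarter harvester does: one standard-library call."""
    return html.unescape(markup)
```

A browser renders the encoded form as the normal address, so human readers are unaffected. The `decode` helper shows why this only foils the simplest harvesters: reversing it takes a single call.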

Some email address obfuscation methods are more effective than others in preventing spam; usually there is a tradeoff between effectiveness and usability/accessibility for human readers. Most of the obfuscation techniques that preserve the full email address in some form are straightforward to decode and therefore ineffective in the long run as address harvesters are updated to compensate for new obfuscation techniques.

Even if we completely removed the email addresses from our archives, posters would still be subject to spam, since spammers can simply subscribe to our public lists and harvest email addresses as they are distributed in the original email messages. (This fact is noted in Google Groups' documentation on its email address masking.)

One option would be to remove email addresses or mask them like Google does (e.g. display gerald@w3.org as ger...@w3.org), and make the original messages available only to authenticated users, for example people who have a W3C Member or Invited Expert account. This would help reduce the amount of spam received by participants in the short term, while making the data available to people we know and trust.
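A masking scheme like this can be sketched in a few lines of Python. This is only an illustration: the three-character prefix is inferred from the single example above rather than from any documented rule, the regular expression is a deliberately loose approximation of an email address, and the function names are hypothetical.

```python
import re

# Loose approximation of an email address, good enough for a sketch.
ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_address(address, keep=3):
    """Keep the first `keep` characters of the local part and mask the rest."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address
    return f"{local[:keep]}...@{domain}"


def mask_text(text, keep=3):
    """Mask every address-like string found in a block of text."""
    return ADDRESS.sub(lambda match: mask_address(match.group(0), keep), text)
```

Note that a local part of three characters or fewer is not hidden at all, and unlike the entity encoding, the masked form cannot be decoded: the rest of the address is simply gone from the archived copy.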

Personally, I feel that doing so is a bad idea. Email addresses are not secrets, and pretending they are is misleading and a waste of effort. If you maintain a web site or blog and participate in online communities like W3C's, keeping your email address a secret for an extended period of time will be very difficult, and once it's out, it's out, as spammers sell or swap lists of millions of addresses with each other.

If you care about your online reputation and want to stand behind what you say, you should want your identity to be associated with the things you write. If not, you can create a disposable email address at one of the thousands of free email hosting sites, and use that when participating in public forums. (This is also noted in Google's docs.)

Removing email addresses from our archives has negative consequences beyond aesthetics and usability for human readers: an email address is the best machine-readable way to identify the author of an email message, so omitting it from an archived message loses one of the most important pieces of semantic data about that message.

Meanwhile, spam will continue to come in as addresses are leaked by other means — the solution to that isn't to try to hide from spammers, but to develop better spam-blocking tools, including smart agents that use existing data (such as the history of previous correspondence from your mailboxes and public archives like W3C's, and data on various relationships in social networking sites) to generate a list of trusted correspondents.

W3C's mailing lists are generally spam-free even though we invite anyone with an email address to provide feedback (subject to the posting policy of the given mailing list) — this openness is an important part of W3C's process, so we invested in the tools needed to make it happen. If others have spam problems they should do likewise!

Giving up on email is not the answer. In the words of John Gilmore,

We have built a communication system that lets anyone in the world send information to anyone else in the world, arriving in seconds, at any time, at an extremely low and falling cost. THIS WAS NOT A MISTAKE! IT WAS NOT AN ACCIDENT! The world collectively has spent trillions of dollars and millions of person-years, over hundreds of years, to build this system -- because it makes society vastly better off than when communication was slow, expensive, regional, and unreliable.

What do you think? Is it worth preserving the machine-readability of details like this in our mailing list archives, or should we remove them in the interest of hiding from spammers, even though that won't work in the long run?

Should we build systems that generate and consume rich semantic data about our world, or hide these details because a few parasites might use the data the wrong way?

W3C's Excessive DTD Traffic

If you view the source code of a typical web page, you are likely to see something like this near the top:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

and/or

<html xmlns="http://www.w3.org/1999/xhtml" ...>

These refer to HTML DTDs and namespace documents hosted on W3C's site.

Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350 Mbps, for resources that haven't changed in years.

The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.

Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.

A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
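To make the idea concrete, here is a minimal Python sketch of that kind of monitor: a sliding-window counter per client and resource that switches to 503 responses once a threshold is crossed. The class name, thresholds, and message text are all assumptions made for illustration; this is not the system we actually run.

```python
from collections import defaultdict, deque


class AbuseMonitor:
    """Count recent requests per (client IP, path) and flag excessive ones."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests      # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self._hits = defaultdict(deque)       # (ip, path) -> request timestamps

    def check(self, ip, path, now):
        """Return (status, message) for a request arriving at time `now`."""
        hits = self._hits[(ip, path)]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            return 503, f"Too many requests for {path}; please cache it locally."
        return 200, "OK"
```

Rejected requests still count toward the window, so a client that keeps hammering the server stays blocked until it actually slows down.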

But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)

We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.

We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:

  • Pay attention to HTTP response codes

    This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.

  • Honor HTTP caching/expiry information

    Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)

    Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.

  • If you implement HTTP in a software library, allow for caching

    Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.

    Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.

  • Take responsibility for your outgoing network traffic

    If you install software that interacts with other sites over the network, you should be aware how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.

  • Don't fetch stuff unless you actually need it

    Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!

  • Identify your user agents

    When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1 which provide no information on the actual software responsible for making the requests.

    Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.

    It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see, for example, How to change the User-Agent of Python's urllib.
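Several of the suggestions above can be combined in one small client. The following Python sketch, built on the standard urllib module, sets a descriptive User-Agent, treats HTTP errors as a signal to back off, and reuses a cached copy while it is still fresh. The class and constant names, the product string, the one-day fallback lifetime, and the one-hour back-off are all assumptions; the cache is in memory only and reads only Cache-Control max-age, so a real deployment would want persistent storage and fuller HTTP caching support.

```python
import time
import urllib.error
import urllib.request

# Hypothetical identifier; use your own product name and contact URL.
USER_AGENT = "ExampleValidator/1.0 (+https://example.org/contact)"


class BackingOff(Exception):
    """Raised when a recent error means we should not contact the server yet."""


def max_age(cache_control, default):
    """Return max-age (in seconds) from a Cache-Control header, else default."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return default


class PoliteFetcher:
    def __init__(self, opener=urllib.request.urlopen, clock=time.time,
                 default_ttl=86400, backoff=3600):
        self._opener = opener
        self._clock = clock
        self._default_ttl = default_ttl  # assumed fallback when no max-age is sent
        self._backoff = backoff          # assumed pause after an HTTP error
        self._cache = {}                 # url -> (expires_at, body)
        self._retry_at = {}              # url -> earliest time to try again

    def fetch(self, url):
        now = self._clock()
        cached = self._cache.get(url)
        if cached and cached[0] > now:
            return cached[1]             # still fresh: no network request at all
        if self._retry_at.get(url, 0) > now:
            raise BackingOff(url)        # the server said no recently; wait
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with self._opener(request) as response:
                body = response.read()
                header = response.headers.get("Cache-Control") or ""
        except urllib.error.HTTPError:
            # urlopen raises HTTPError for 4xx/5xx responses such as 503.
            self._retry_at[url] = now + self._backoff
            raise
        self._cache[url] = (now + max_age(header, self._default_ttl), body)
        return body
```

With something like this in place, an application that calls `fetch` for the same DTD thousands of times a day would contact the origin server only when its cached copy expires, and would stop retrying for a while after an error response.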

We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:

  • Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?

    You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.

    At the W3C Systems Team's request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.

  • Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?

  • What do other medium/large sites do to detect and prevent abuse?

    We are not alone in receiving excessive schema and namespace requests. Take, for example, the stir when the DTD for RSS 0.91 disappeared.

    For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.

    Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?

  • Should we just ignore the issue and serve all these requests?

    What if we start receiving 10 billion DTD requests/day instead of 100 million?


Authors: Gerald Oskoboiny and Ted Guild

Upgrading to Apache 2.2.8 using Debian and ssh

I've just upgraded all our production servers to Apache 2.2.8 from 2.2.4. This operation requires adding three custom patches we made to solve some issues with Apache. Applying these patches and updating the servers is quite easy when you use Debian and ssh. Without going into specific detail, here's an overall view of the steps that need to be taken.

  1. Get the source files. Go to the Debian package page for Apache 2.2.8. Yes, this is for Debian unstable; we're going to backport it to Debian stable, which is what we use on our production servers. From that page, download the original tar file and the diff file. Note that you could also get the original tar file (or a more recent one) from the Apache ASF httpd home page.
  2. Create the Debian source package. Unpack the tar file and rename the resulting directory from http2 to apache2-2.2.8. Uncompress the diff file and apply it using patch -p0 < diff_file_name. This will create the environment for compiling the Debian package.
  3. Install the custom patches. Go to apache2-2.2.8/debian/patches and copy the patches there. I generate the patches using diff -c. I had to prefix each patch with a specific Debian prologue; use the other patches in that directory as a model. Each patch file has to have a number and a name describing what it does. Once you add the patches, you need to update the file 00list so that it includes them. Just do ls [0-9]??_* > 00list. Note that if you're upgrading to a new apache2 version, you need to make sure each patch is still needed and adjust it as necessary.
  4. Generate the Debian package. cd to apache2-2.2.8, run chmod +x debian/rules, and then invoke dpkg-buildpackage.
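If you prefer to script the 00list update from step 3 rather than type the ls command, something like the following Python sketch should be equivalent. The function name is made up for this example, and it assumes the patches directory contains only plain files, as in the steps above.

```python
from pathlib import Path


def write_patch_list(patches_dir):
    """Rewrite 00list with every file matching [0-9]??_*, one name per line."""
    patches_dir = Path(patches_dir)
    # Same pattern as the shell glob; 00list itself does not match it.
    names = sorted(path.name for path in patches_dir.glob("[0-9]??_*"))
    (patches_dir / "00list").write_text("".join(f"{name}\n" for name in names))
    return names
```

For example, `write_patch_list("apache2-2.2.8/debian/patches")` returns the patch names it wrote, which is handy for a quick visual check before building.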

You're all set up. Time to complete the operation? Just a few minutes. Complexity? If you already have the patches, none. Results: fresh .deb packages with our custom patches.

To roll out our custom apache2 .deb packages, I put them in our local Debian apt repository. All of our production servers point to it. Moreover, on each server, the respective apache2 packages are put on hold and are pinned to our apt server. This makes sure that they will only be updated when we explicitly request it, and that they will come from our repository.

On each server I have a local script that does the apt update and install of the apache2 packages. From the comfort of my main box, I go through the list of servers, invoking the local script through ssh. Time and effort to roll out the new version? Negligible. This presumes I've already tried out the new version and patches on a test server.
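The roll-out loop could look roughly like this in Python. The host names and the remote script path in the usage note are placeholders, not our real ones, and the sketch assumes key-based ssh access to each server is already set up.

```python
import subprocess


def ssh_command(host, remote_script):
    """Build the argument list for running a script on a remote host via ssh."""
    return ["ssh", host, remote_script]


def roll_out(hosts, remote_script, run=subprocess.run):
    """Run the upgrade script on each host in turn; return hosts that failed."""
    failed = []
    for host in hosts:
        result = run(ssh_command(host, remote_script))
        if result.returncode != 0:
            failed.append(host)  # non-zero exit status: needs a closer look
    return failed
```

A call such as `roll_out(["web1.example.org", "web2.example.org"], "/usr/local/sbin/upgrade-apache2")` visits the servers one at a time and reports any where the script exited with an error, so a failed upgrade does not go unnoticed.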

W3C Systems Team