Email address obfuscation in mailing list archives

Among W3C’s most important system services are our mailing lists and their corresponding online archives. Thousands of people participate in these lists, and the archives now contain millions of messages dating back to 1991.

These archives are an essential resource for groups collaborating on standards work, helping them build shared context and preserving a record of the history behind various decisions.

Many of our discussion lists are public, with public archives; unfortunately one of the side effects of making these archives public is that the email addresses contained therein can be harvested by spammers looking for new victims.

Occasionally we receive requests to obfuscate or remove email addresses from our archives in an attempt to prevent or delay spammers from harvesting these addresses. We have implemented simple measures to foil the simplest harvesters (replacing each character of the domain name with its corresponding HTML entity) but have been reluctant to remove the email addresses completely because they are an essential part of the record.
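
By way of illustration, here is a minimal Python sketch of that kind of obfuscation, replacing each character of the domain part with a numeric character reference. The function name and output format are illustrative only, not the code our archives actually use:

    def obfuscate_domain(address):
        """Replace each character of the domain part of an email address
        with a numeric HTML character reference, e.g. '.' becomes '&#46;'.
        A toy sketch of the idea, not our production code."""
        local, _, domain = address.partition("@")
        encoded = "".join("&#%d;" % ord(c) for c in domain)
        return local + "@" + encoded

    print(obfuscate_domain("gerald@w3.org"))
    # gerald@&#119;&#51;&#46;&#111;&#114;&#103;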

Some email address obfuscation methods are more effective than others in preventing spam; usually there is a tradeoff between effectiveness and usability/accessibility for human readers. Most of the obfuscation techniques that preserve the full email address in some form are straightforward to decode and therefore ineffective in the long run as address harvesters are updated to compensate for new obfuscation techniques.

Even if we completely removed the email addresses from our archives, posters would still be subject to spam, since spammers can simply subscribe to our public lists and harvest email addresses as they are distributed in the original email messages. (This fact is noted in Google Groups’ documentation on its email address masking.)

One option would be to remove email addresses or mask them like Google does (e.g. display gerald@w3.org as ger...@w3.org), and make the original messages available only to authenticated users, for example people who have a W3C Member or Invited Expert account. This would help reduce the amount of spam received by participants in the short term, while making the data available to people we know and trust.
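
Such masking amounts to very little code; a hypothetical Python sketch, where the three-character cutoff simply mirrors the example above:

    def mask_address(address, keep=3):
        """Show only the first few characters of the local part,
        e.g. gerald@w3.org -> ger...@w3.org (illustrative sketch only)."""
        local, _, domain = address.partition("@")
        return local[:keep] + "..." + "@" + domain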

Personally, I feel that doing so is a bad idea. Email addresses are not secrets, and pretending they are is misleading and a waste of effort. If you maintain a web site or blog and participate in online communities like W3C’s, keeping your email address a secret for an extended period of time will be very difficult, and once it’s out, it’s out, as spammers sell or swap lists of millions of addresses with each other.

If you care about your online reputation and want to stand behind what you say, you should want your identity to be associated with the things you write. If not, you can create a disposable email address at one of the thousands of free email hosting sites and use that when participating in public forums (this, too, is noted in Google’s docs).

Removing email addresses from our archives has negative consequences beyond aesthetics and usability for human readers: an email address is the best machine-readable way to identify the author of an email message, so omitting it from an archived message discards one of the most important pieces of semantic data about that message.

Meanwhile, spam will continue to come in as addresses are leaked by other means — the solution to that isn’t to try to hide from spammers, but to develop better spam-blocking tools, including smart agents that use existing data (such as the history of previous correspondence from your mailboxes and public archives like W3C’s, and data on various relationships in social networking sites) to generate a list of trusted correspondents.

W3C’s mailing lists are generally spam-free even though we invite anyone with an email address to provide feedback (subject to the posting policy of the given mailing list) — this openness is an important part of W3C’s process, so we invested in the tools needed to make it happen. If others have spam problems they should do likewise!

Giving up on email is not the answer. In the words of John Gilmore,

We have built a communication system that lets anyone in the world send information to anyone else in the world, arriving in seconds, at any time, at an extremely low and falling cost. THIS WAS NOT A MISTAKE! IT WAS NOT AN ACCIDENT! The world collectively has spent trillions of dollars and millions of person-years, over hundreds of years, to build this system — because it makes society vastly better off than when communication was slow, expensive, regional, and unreliable.

What do you think? Is it worth preserving the machine-readability of details like this in our mailing list archives, or should we remove them in the interest of hiding from spammers, even though that won’t work in the long run?

Should we build systems that generate and consume rich semantic data about our world, or hide these details because a few parasites might use the data the wrong way?

W3C’s Excessive DTD Traffic

If you view the source code of a typical web page, you are likely to see something like this near the top:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

and/or

<html xmlns="http://www.w3.org/1999/xhtml" ...>

These refer to HTML DTDs and namespace documents hosted on W3C’s site.

Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say “this is HTML”. In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven’t changed in years.

The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.

Handling all these requests costs us considerably: servers, bandwidth, and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather spend these resources elsewhere, for example on improving the software and services needed by W3C and the Web community.

A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.

But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don’t care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)

We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.

We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:

  • Pay attention to HTTP response codes

    This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong. (A short Python sketch after this list illustrates this point together with the caching and User-Agent suggestions below.)

  • Honor HTTP caching/expiry information

    Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there is no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)

    Mark Nottingham’s caching tutorial is an excellent resource to learn more about HTTP caching.

  • If you implement HTTP in a software library, allow for caching

    Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.

    Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh’s Caching in with Resolvers article or Catalog support in libxml. (A Python/lxml sketch after this list shows the same idea.)

  • Take responsibility for your outgoing network traffic

    If you install software that interacts with other sites over the network, you should be aware of how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn’t make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.

  • Don’t fetch stuff unless you actually need it

    Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn’t even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don’t need it, don’t fetch it!

  • Identify your user agents

    When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1 which provide no information on the actual software responsible for making the requests.

    Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.

    It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see, for example, How to change the User-Agent of Python’s urllib.
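
To make the response-code, caching, and User-Agent suggestions concrete, here is a rough Python sketch of a client that identifies itself, checks what the server says, and reuses its cached copy until the server-supplied Expires time. The identification string, contact URL, and in-memory cache are illustrative assumptions, not a recommendation of any particular tool:

    import time
    import urllib.error
    import urllib.request
    from email.utils import parsedate_to_datetime

    # Placeholder identification; point it at your own project and contact page.
    USER_AGENT = "ExampleFetcher/1.0 (+https://example.org/contact)"

    _cache = {}   # url -> (fresh_until, body); a real client would persist this

    def fetch(url):
        """Fetch a URL politely: identify yourself, check the response code,
        and reuse the cached copy until the server-supplied Expires time."""
        cached = _cache.get(url)
        if cached and cached[0] > time.time():
            return cached[1]                      # still fresh: no network request at all

        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
                expires = resp.headers.get("Expires")
                if expires:                       # honor the server's expiry information
                    _cache[url] = (parsedate_to_datetime(expires).timestamp(), body)
                return body
        except urllib.error.HTTPError as err:     # e.g. a 503 telling you to back off
            raise RuntimeError("%s answered %d %s; not retrying blindly"
                               % (url, err.code, err.reason))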
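
And a sketch of the XML-specific suggestions (keeping DTD resolution local, and not loading DTDs you don’t need) using the lxml Python bindings for libxml. The file names and the catalog-style mapping are placeholders; a real deployment would more likely use an XML catalog as described above:

    from lxml import etree

    # If you do not actually need the DTD, tell the parser not to load it
    # or to touch the network at all.
    no_fetch_parser = etree.XMLParser(load_dtd=False, no_network=True,
                                      resolve_entities=False)
    doc = etree.parse("page.xhtml", no_fetch_parser)

    # If you do need the DTD (e.g. for validation), resolve it from a local
    # copy instead of w3.org. The mapping below stands in for an XML catalog.
    LOCAL_COPIES = {
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
            "/usr/share/xml/xhtml1-strict.dtd",   # placeholder local path
    }

    class LocalResolver(etree.Resolver):
        def resolve(self, system_url, public_id, context):
            local = LOCAL_COPIES.get(system_url)
            if local:
                return self.resolve_filename(local, context)
            return None   # fall back to lxml's default behaviour

    validating_parser = etree.XMLParser(load_dtd=True, dtd_validation=True)
    validating_parser.resolvers.add(LocalResolver())
    valid_doc = etree.parse("page.xhtml", validating_parser)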

We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:

  • Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?

    You might think something like “don’t request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days” would be obvious, but unfortunately it seems not.

    At the W3C Systems Team’s request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.

  • Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?

  • What do other medium/large sites do to detect and prevent abuse?

    We are not alone in receiving excessive schema and namespace requests; take, for example, the stir when the DTD for RSS 0.91 disappeared.

    For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.

    Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?

  • Should we just ignore the issue and serve all these requests?

    What if we start receiving 10 billion DTD requests/day instead of 100 million?


Authors: Gerald Oskoboiny and Ted Guild

Upgrading to Apache 2.2.8 using Debian and ssh

I’ve just upgraded all our production servers to Apache 2.2.8 from 2.2.4. This operation requires adding three custom patches we made to solve some issues with Apache. Applying these patches and updating the servers is quite easy when you use Debian and ssh. Without going into specific detail, here’s an overall view of the steps that need to be taken.

  1. Get the source files. Go to the Debian package page for Apache 2.2.8. Yep, this is for Debian unstable. We’re going to backport it to Debian stable, which is what we use on our production servers. From that page, download the original tar file and the diff file. Note that you could also get the original tar file (or a more recent one) from the Apache ASF httpd home page.
  2. Create the Debian source package. Explode the tar file and rename the target directory to apache2-2.2.8. Uncompress the diff file and apply it using patch -p0 < diff_file_name. This creates the environment for building the Debian package.
  3. Install the custom patches. Go to apache2-2.2.8/debian/patches and copy the patches there. I generate the patches using diff -c. I had to prefix each patch with a specific Debian prologue; use the existing patches as a model. Each patch file has to have a number and a name describing what it does. Once you add the patches, you need to update the file 00list so that it includes them; just do ls [0-9]??_* > 00list. Note that if you’re upgrading to a new apache2 version, you need to check whether each patch is still needed and adjust it accordingly.
  4. Generate the Debian package. cd to apache2-2.2.8 and invoke dpkg-buildpackage. Don’t forget to do chmod +x debian/rules first.

You’re all set. Time to complete the operation? Just a few minutes. Complexity? None, if you already have the patches. Result: fresh .deb packages with our custom patches.
To roll out our custom apache2 .deb packages, I put them in our local Debian apt repository, which all of our production servers point to. Moreover, on each server, the apache2 packages are put on hold and pinned to our apt server. This makes sure that they will only be updated when we explicitly request it, and that they will come from our repository.
On each server I have a local script that does the apt update and install of the apache2 packages. From the comfort of my main box, I go through the list of servers, invoking the local script through ssh. Time and effort to roll out the new version? Negligible. This presumes, of course, that I had already tried out the new server and patches on a test machine.
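
For the curious, the roll-out loop amounts to something like the following Python sketch. The hostnames and the remote script path are placeholders, not our real configuration, and the real script does a bit more checking:

    #!/usr/bin/env python
    """Rough sketch of the roll-out loop: run the local upgrade script on each
    production server over ssh, one server at a time."""
    import subprocess
    import sys

    SERVERS = ["www1.example.org", "www2.example.org", "www3.example.org"]
    REMOTE_SCRIPT = "/usr/local/sbin/upgrade-apache2"   # does the apt update + install

    for host in SERVERS:
        print("Upgrading", host)
        result = subprocess.run(["ssh", host, REMOTE_SCRIPT])
        if result.returncode != 0:
            sys.exit("Upgrade failed on %s; stopping the roll-out" % host)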

List of Apache 2.2 patches available

As the rollout period for patches submitted to apache-dev can often be long, I have put together a list of all our contributed patches and the issues they solve, so that other Apache users can benefit from them. All of these patches and test cases were submitted to the Apache httpd Bugzilla, and that is where you can download each patch, from its Bugzilla entry.
All of these patches are being used without problems on our W3C production servers.

W3C Mail Search Engine updated

W3C’s MASE search engine got a nice new year present in the shape of a new server. The additional computing muscle will be put to good use indexing and searching the hundreds of thousands of messages in W3C’s mailing-list forums: at the time of this writing, the public mailing lists hold more than 600,000 messages, and that’s not counting the internal lists of W3C staff and collaborators.

Moving to a new server was an opportunity to update the Namazu engine that powers this search engine, to create new, clean indexes, and to hack a little on new features.

Creating indexes from scratch led us to discover a tricky little bug in the mail search’s sorting algorithm. So far we had been using namazu --sort=date to sort results by date, but as we found out, this does not sort results according to the Date: mail header, but according to the timestamp of the locally archived mail file. The proper syntax is actually --sort=field:utc. You can’t make this one up, and it isn’t properly documented, but fortunately the namazu mailing lists got us the answer in no time.

The new feature was low-hanging fruit. Namazu was already indexing the Date: mail headers, and it supports field-specific searching, which we already use to search for messages from a specific person or with a given Subject:. By adding Date:-specific search, we now have a way to filter results and show only mails received in a given month or a given year. Extending this feature to allow filtering not only by a given month but by an arbitrary time span would be even better, but… that will be for later.

In theory, Namazu is not supposed to be used on a data set as massive as the W3C lists, but it works: our hacking was, and is, mostly limited to interfacing the engine with our lists system and making the indexing of individual lists real-time. The real-time indexing is a clever hack, but nothing complicated: our lists server creates a queue of messages that need to be indexed, and we feed that to the namazu indexer. One would usually give the namazu indexer a whole directory to check and re-index, which is way too slow. And if given a single file, mknmz would generally replace the contents of its existing index with this sole file.

The trick? Feed mknmz a single file, but run it with the --no-delete option, so that it keeps the existing index content and just appends the new file to it. We were afraid this might eventually corrupt the indexes, but so far, so good.
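
Schematically, the indexing loop looks something like the Python sketch below. The paths, the one-path-per-line queue format, and the -O index-directory option are illustrative assumptions; the --no-delete option is the actual trick described above:

    import subprocess

    QUEUE_FILE = "/var/lib/mase/queue"          # hypothetical: one message path per line
    INDEX_DIR = "/var/lib/mase/index/public"    # hypothetical: existing Namazu index

    with open(QUEUE_FILE) as queue:
        pending = [line.strip() for line in queue if line.strip()]

    for message_path in pending:
        # --no-delete keeps the existing index contents and just appends this file.
        subprocess.run(["mknmz", "--no-delete", "-O", INDEX_DIR, message_path],
                       check=True)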

Namazu is developed as an open-source project, and the source of our MASE search system is open too (albeit a little messy and undocumented, I’m afraid).

Systems Team Provides Audio Broadcast of W3C Technical Plenary Day

On Wednesday, November 7th, we provided an experimental audio broadcast of W3C’s Technical Plenary Day. We used Icecast as the streaming server and broadcast in Ogg Vorbis, providing some help links on getting Ogg Vorbis codecs working in various clients, and had an embedded Java applet Ogg client available as well. We will be making audio recordings available shortly, along with a transcript.

Online htmldiff service

Many webmasters have heard about or used the W3C Link Checker to find dead links on their pages, but few know that this service was initially created to help editors of W3C specifications find broken links in their documents, as required by W3C publication rules as a corollary of our motto on stable, cool URIs.

Every once in a while we provide new services to make the lives of our collaborators easier, and offer them to the public at large as much as possible; our latest toy in this category is an htmldiff service, which takes two online HTML documents and creates a new document highlighting the differences between them.

This is of course mostly useful for finding the changes between two versions of a given document – and indeed, it was created to help show the variations between two versions of a given Technical Report.

The tool itself is a pretty simple Python wrapper around Shane McCarron’s htmldiff perl script – I’m happy to share the code of the Python wrapper if anyone is interested.
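
For the curious, the wrapper boils down to something like the sketch below: fetch the two documents and hand them to the perl script. The script location, and the assumption that htmldiff takes two filenames and writes its result to standard output, are illustrative rather than a description of the actual service code:

    import subprocess
    import tempfile
    import urllib.request

    HTMLDIFF = "/usr/local/bin/htmldiff"   # placeholder location of the perl script

    def htmldiff(url_old, url_new):
        """Fetch two HTML documents and return htmldiff's marked-up comparison.
        A real service would also clean up the temporary files."""
        files = []
        for url in (url_old, url_new):
            tmp = tempfile.NamedTemporaryFile(suffix=".html", delete=False)
            tmp.write(urllib.request.urlopen(url).read())
            tmp.close()
            files.append(tmp.name)
        result = subprocess.run([HTMLDIFF] + files, capture_output=True, check=True)
        return result.stdout.decode("utf-8", "replace")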

Migration to Apache2

It took some time, but we finally made it, upgrading not only the Apache server but also the boxes on which it runs. We’re now using 64-bit boxes with 8GB of RAM running Apache 2.2, effectively halving the number of machines compared to the 32-bit boxes we used for Apache 1.3, with a better quality of service.

Most of the migration time was spent evaluating our configuration, checking the differences between Apache 1.3 and Apache 2.2, and verifying that everything worked well. We previously had no test procedures to automate testing of the server and confirm that everything works as usual. I started a small piece of code to test the aaa setup (see below) and now plan to migrate to the Apache Perl test framework, which I have found very useful. This will minimize the migration time the next time this issue arises. Some time was also needed to update various scripts and databases to cope with the differences between 32-bit and 64-bit architectures.

At W3C we use some custom aaa modules (Authentication and Access Control/Authorization, in Apache notation) that basically let you do more fine-grained access control than what Apache offers by default. In a nutshell, they let you use your IP address automatically in place of a password (a feature dating from the CERN server, which AFAIK has never been available in Apache), and provide fine-grained access control using ACLs, stating who can access which resource using which HTTP method. Although the code for these modules is available on our public CVS server, they are probably not useful to other users because we haven’t cleaned up or published the scripts that let you populate the aaa dbm files. You can use the CVS Web interface to see the changes I had to make to port the modules from Apache 1.3 to Apache 2.2 (not that many).

The Apache 2.2 architecture is cleaner and more modular, and provides more API hooks than Apache 1.3’s. This has minimized the number of places where a patch is needed. The only code patches I needed to apply were to allow passing the IP address to the access control modules in mod_auth_basic and mod_auth_digest (by default, Apache stops the request processing if the user hasn’t been authenticated). I also needed to extend mod_negotiation so that it takes into account the authorized content-negotiation resources, discarding the unauthorized ones (the default behavior is to discard everything if any resource is unauthorized). These were very minor extension patches, consisting of only a couple of lines each. The plan is to propose them to the Apache developers. For the moment, we have made alternate modules for mod_auth_basic, mod_auth_digest, and mod_negotiation that include these minor patches, and we use them instead of the ones Apache provides. We didn’t need to patch the core Apache code to add these features, as was the case in Apache 1.3. This is good.

On the other hand, we found a bug in the Apache 2.2 server. Content negotiation and internal rewrites generate something that Apache calls a sub-request. When handling sub-requests, Apache 2.2 simulates a complete request. However, because of a speed-optimization change, Apache doesn’t take .htaccess files into account when processing sub-requests… although it did so in Apache 1.3 and Apache 2.0. This is clearly a bug. I submitted a bug report, a small code patch, and an extension to the Apache test suite in May (corrected since). The bug was recognized as such by some members of the Apache team and I got some initial guidance. However, the patch hasn’t yet been committed; the Apache team seems to be too overwhelmed to evaluate it. We went ahead and installed Apache2 with this patch and it has been working as expected. Perhaps one day the Apache team will adopt it. Compared to a previous patch I sent them a couple of years ago, the delay between submission and committing has increased from one month to five months and counting.

For the packaging and installation of Apache, we’re now using our own Debian packages, derived 99.99% from the official Apache 2.2 Debian packages. They are clean, efficient, and easy to install. Moreover, when you have patches, you just have to drop them into a directory and the Debian packaging system will take care of patching the code when building a new package. As you may guess, we are generating our own Apache2 Debian packages to be able to apply our patch.
This also lets us benefit from the Debian Apache 2.2 security patches.

Apache 2.2.6 came out at the same time as our migration to 2.2.4. The apache-dev list reveals some issues regarding mod_perl, so we’re going to wait a while before upgrading again, probably to 2.2.7.

Systeam Starts Public Blog

W3C’s Systems Team shares many of the challenges and ambitions of other websites’ software developers and administrators. We thought it would be valuable both to others and to ourselves to open up some of our thought processes publicly.
We use, contribute back to, and create Open Source software, and believe others may have similar interests in features and customization. Not all of our software has been made Open Source; some is too particular to our organization, or we have not taken the time to make it more abstract, clean it up, document it, and package it for release.
We look forward to hearing back on our posts from members of the community with similar issues or interests. Suggestions are welcome. In order to keep a good signal-to-noise ratio, we will be moderating comments for relevance.