W3C's Excessive DTD Traffic
If you view the source code of a typical web page, you are likely to see something like this near the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" ...>
These refer to HTML DTDs and namespace documents hosted on W3C's site.
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350 Mbps, for resources that haven't changed in years.
The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.
Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.
A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)
We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.
We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:
- Pay attention to HTTP response codes
This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.
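As a rough illustration of what checking return codes can look like, here is a minimal Python sketch that stops after a bounded number of 503 responses instead of retrying forever. The names `fetch_with_backoff` and `GiveUp`, the 60-second starting delay and the limit of four attempts are all assumptions made for this example, not values recommended by W3C.

```python
import time
import urllib.error
import urllib.request


class GiveUp(Exception):
    """The server keeps refusing; stop requesting and tell a human."""


def fetch_with_backoff(url, open_url=urllib.request.urlopen,
                       sleep=time.sleep, max_attempts=4):
    """Fetch url, backing off on 503 and giving up after max_attempts.

    open_url and sleep are injectable so the logic can be exercised
    without touching the network or actually waiting.
    """
    delay = 60  # seconds; an arbitrary starting point for this sketch
    for attempt in range(1, max_attempts + 1):
        try:
            with open_url(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503:
                # Errors such as 403 or 404 will not go away by retrying.
                raise GiveUp(f"{url}: HTTP {err.code}") from err
            if attempt == max_attempts:
                raise GiveUp(f"{url}: still 503 after {attempt} attempts") from err
            sleep(delay)   # wait before asking again
            delay *= 2     # and wait twice as long the next time
```

The important part is the final `raise`: once the server has said no several times, the error is surfaced to whoever runs the software rather than being swallowed.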
- Honor HTTP caching/expiry information
Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)
Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.
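To make the idea concrete, the following Python sketch computes how long a response may be reused from its `Cache-Control` or `Expires` headers and keeps fresh copies in memory. It is deliberately simplified: `freshness_seconds` and `TinyCache` are hypothetical names, the `fetch` callable is assumed to return a `(headers, body)` pair, and directives such as `no-cache`, `no-store` and revalidation are ignored. A real application would normally rely on an existing HTTP cache instead.

```python
import re
import time
from email.utils import parsedate_to_datetime


def freshness_seconds(headers):
    """Seconds a response may be reused, from Cache-Control or Expires."""
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        return int(match.group(1))
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
        except (TypeError, ValueError):
            return 0  # unparseable date: treat the response as stale
        return max(0, int((expires - date).total_seconds()))
    return 0  # no explicit freshness information: treat as stale


class TinyCache:
    """In-memory cache that only refetches once an entry has expired."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch    # hypothetical callable: url -> (headers, body)
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # url -> (expiry timestamp, body)

    def get(self, url):
        entry = self._entries.get(url)
        if entry and self._clock() < entry[0]:
            return entry[1]    # still fresh: no network request at all
        headers, body = self._fetch(url)
        self._entries[url] = (self._clock() + freshness_seconds(headers), body)
        return body
```

With a 90-day expiry time, a long-running process using a cache like this would request each DTD about four times a year rather than thousands of times a day.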
- If you implement HTTP in a software library, allow for caching
Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.
Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.
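For readers unfamiliar with catalogs, this Python sketch shows the core idea: a small OASIS XML catalog maps the public and system identifiers from a DOCTYPE to a local file, so nothing needs to be fetched from the network. The local path `dtd/xhtml1-strict.dtd` and the function names are assumptions for illustration. Real catalog resolvers, such as the one in libxml, support many more entry types, so this is not a substitute for them.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}"

# A minimal catalog; the uri values are hypothetical local paths.
EXAMPLE_CATALOG = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="dtd/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="dtd/xhtml1-strict.dtd"/>
</catalog>
"""


def load_catalog(catalog_xml):
    """Return (public, system) dicts mapping identifiers to local URIs."""
    root = ET.fromstring(catalog_xml)
    public, system = {}, {}
    for entry in root.iter(CATALOG_NS + "public"):
        public[entry.get("publicId")] = entry.get("uri")
    for entry in root.iter(CATALOG_NS + "system"):
        system[entry.get("systemId")] = entry.get("uri")
    return public, system


def resolve(public, system, public_id=None, system_id=None):
    """Return the local URI for an external identifier, or None if unmapped."""
    if system_id in system:
        return system[system_id]
    if public_id in public:
        return public[public_id]
    return None
```

A parser hooked up to `resolve` would open the returned local file whenever it encounters the XHTML 1.0 Strict DOCTYPE.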
- Take responsibility for your outgoing network traffic
If you install software that interacts with other sites over the network, you should be aware of how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.
- Don't fetch stuff unless you actually need it
Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!
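Many tasks only need a document to be well-formed, in which case the DTD is never required. As one example, Python's standard `xml.etree.ElementTree` module is documented as not retrieving external DTDs, so the sketch below extracts a page title from an XHTML document while treating the DOCTYPE purely as an identifier. The sample page and the `page_title` helper are made up for illustration. Documents that rely on entities defined in the DTD (such as `&nbsp;`) would need a local copy of the DTD instead.

```python
import xml.etree.ElementTree as ET

# A sample page whose DOCTYPE names a DTD on www.w3.org.
PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Hello</p></body>
</html>
"""

XHTML = "{http://www.w3.org/1999/xhtml}"


def page_title(markup):
    """Parse the markup and return its title; the DTD is not fetched."""
    root = ET.fromstring(markup)
    return root.findtext(f"{XHTML}head/{XHTML}title")
```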
- Identify your user agents
When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Python-urllib/2.1, which provide no information on the actual software responsible for making the requests.
Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.
It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see for example How to change the User-Agent of Python's urllib.
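With the `urllib.request` module in current versions of Python, for instance, this takes a single extra argument. The product name and contact URL below are placeholders invented for this sketch; substitute values that identify your own software.

```python
import urllib.request

# Hypothetical product name and contact URL; replace with your own.
USER_AGENT = "ExampleFeedChecker/1.0 (+http://example.org/contact)"


def build_request(url):
    """Return a Request that identifies the software making it."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to `urllib.request.urlopen(build_request(url))` sends the custom header in place of the generic default.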
We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:
- Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?
You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.
- Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?
- What do other medium/large sites do to detect and prevent abuse?
We are not alone in receiving excessive schema and namespace requests; take for example the stir when the DTD for RSS 0.91 disappeared.
Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?
- Should we just ignore the issue and serve all these requests?
What if we start receiving 10 billion DTD requests/day instead of 100 million?