Role

W3C maintains an internationally distributed network of servers and services that support Public, Member, and Team audiences in pursuit of the Consortium's technical and social objectives. W3C uses these systems to manage its Activities and Working Groups according to W3C Process.

Design

W3C's systems infrastructure is based almost completely on open source software running on Debian GNU/Linux servers. Many of our tools are built using the popular LAMP platform (Linux, Apache, MySQL, and the Perl, PHP, and Python scripting/programming languages).

Status

We try our best to document known outages and disruptions to our services on the Web; should you encounter a problem with one of our services not documented on that page, please let us know at <web-human@w3.org>.

W3C's Excessive DTD Traffic

If you view the source code of a typical web page, you are likely to see something like this near the top:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

and/or

<html xmlns="http://www.w3.org/1999/xhtml" ...>

These refer to HTML DTDs and namespace documents hosted on W3C's site.

Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350 Mbps, for resources that haven't changed in years.

The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.

Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.

A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.

But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)

We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.

We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:

  • Pay attention to HTTP response codes

    This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.

  • Honor HTTP caching/expiry information

    Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)

    Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.

  • If you implement HTTP in a software library, allow for caching

    Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.

    Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.

  • Take responsibility for your outgoing network traffic

    If you install software that interacts with other sites over the network, you should be aware of how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.

  • Don't fetch stuff unless you actually need it

    Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!

  • Identify your user agents

    When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1 which provide no information on the actual software responsible for making the requests.

    Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.

    It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see, for example, How to change the User-Agent of Python's urllib.
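
Several of the suggestions above (check response codes, honor expiry information, identify your user agent) can be combined in a few lines. The following is only a minimal sketch using Python's standard urllib, not a reference implementation: the `USER_AGENT` value, the `PoliteFetcher` name and the in-memory cache are illustrative assumptions, and it reads only `Cache-Control: max-age`, ignoring `Expires` and revalidation.

```python
import re
import time
import urllib.error
import urllib.request

# Hypothetical identity: substitute your own product name and contact address.
USER_AGENT = "ExampleValidator/1.0 (+mailto:admin@example.org)"


def http_get(url):
    """Fetch url once; return (status, lower-cased headers, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            headers = {k.lower(): v for k, v in response.headers.items()}
            return response.status, headers, response.read()
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx; report the code instead of hiding it.
        headers = {k.lower(): v for k, v in error.headers.items()}
        return error.code, headers, b""


def max_age(headers):
    """Seconds of freshness from Cache-Control: max-age (0 if absent)."""
    match = re.search(r"max-age=(\d+)", headers.get("cache-control", ""))
    return int(match.group(1)) if match else 0


class PoliteFetcher:
    """Illustrative in-memory cache keyed by URL (an assumption of this sketch)."""

    def __init__(self, get=http_get, clock=time.time):
        self._get = get
        self._clock = clock
        self._cache = {}  # url -> (expiry timestamp, body)

    def fetch(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry and entry[0] > now:
            return entry[1]  # still fresh: no network traffic at all
        status, headers, body = self._get(url)
        if status != 200:
            # Surface the error to the caller rather than ignoring it.
            raise RuntimeError(f"GET {url} failed with HTTP {status}")
        self._cache[url] = (now + max_age(headers), body)
        return body
```

A long-running service would more likely want a persistent cache or a caching proxy; the point here is only that a repeated `fetch()` of a fresh resource never touches the network, and that a 503 stops the caller instead of being silently retried.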

We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:

  • Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?

    You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.

    At the W3C Systems Team's request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.

  • Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?

  • What do other medium/large sites do to detect and prevent abuse?

    We are not alone in receiving excessive schema and namespace requests; take, for example, the stir when the DTD for RSS 0.91 disappeared.

    For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.

    Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?

  • Should we just ignore the issue and serve all these requests?

    What if we start receiving 10 billion DTD requests/day instead of 100 million?


Authors: Gerald Oskoboiny and Ted Guild

75 comments

Comments:

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

As you might imagine, hosting these DTDs makes for some rather interesting emails and phone calls from people who think, for example, that we are hosting their website or are responsible for a particularly offensive spam email composed in HTML.
Permalink 2008-02-08 @ 13:08

Comment from: Nils [Visitor]

Just a simple question: why does an identifier have to be an http:// URI? This implies that there is gettable content available. Would it be possible to use, e.g., id:// URIs as identifiers?
Permalink 2008-02-08 @ 19:21

Comment from: Peter [Visitor]

Here's a good way to frame the question:

Are you in the business of hosting an XML schema validation service?

If no, then you have no responsibility to attempt to do a good job of it. Focus on what's important, i.e. something other than fixing buggy third-party products unrelated to your core business, and unrelated to serving out millions of requests per day simply because these same third-party products failed to cache or otherwise copy your schema.

Focus on what's important.
Permalink 2008-02-08 @ 19:49

Comment from: Steve [Visitor]

I definitely like the idea of defining which DTD to use through something other than a straight HTTP URL.

Another possible idea is that of decentralizing where the DTDs are actually stored, coupled with the above idea. A doctype declaration could then just be the name of the actual type of document. The browser (and supporting libraries) could keep track of "DTD servers", which could possibly be set up by anyone. This solves both problems: having a URI for the doctype, and eliminating the strain of rogue applications that suck up W3's bandwidth.
Permalink 2008-02-08 @ 20:41

Comment from: Poul-Henning Kamp [Visitor]

Instead of rejecting the requests with 503, it might be a much better strategy to serve them, but very very slowly.

Failures and error codes are routinely ignored in code, but things which basically work will happen, and if things are suddenly incredibly slow, people tend to notice.

Drip out the response at around 10 characters every other second, fast enough to keep the TCP connection open and the application running, but slow enough to totally wreck any hope of a decent response time at the other end.

One level nastier, and in general one level too nasty, is to return buggy contents for repeat offenders, hoping to make their applications fail with interesting diagnostics.

Poul-Henning

(Who has far more experience with this problem in NTP context than he ever wanted)
Permalink 2008-02-08 @ 22:06
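
The "serve them slowly" idea in this comment can be sketched as a generator that hands out a response body ten bytes at a time with a two-second pause between chunks. This is only an illustration under stated assumptions: the `drip` name and its parameters are made up here, and a real server would also have to bound the number of connections it is willing to hold open this way.

```python
import time


def drip(body, chunk_size=10, delay=2.0, sleep=time.sleep):
    """Yield body in small chunks, pausing before every chunk after the first."""
    for offset in range(0, len(body), chunk_size):
        if offset:
            sleep(delay)  # keeps the connection alive but painfully slow
        yield body[offset:offset + chunk_size]
```

Passing `sleep` as a parameter is a design choice that keeps the sketch testable without actually waiting.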

Comment from: Ben Strawson [Visitor] · http://www.gladsheim.com

One current problem is that all the DTDs are under http://www.w3.org/ which makes it difficult to distribute them around the Internet. Perhaps if new DTDs were placed at http://dtd.w3.org/ then DNS could be used to return the nearest server to the requesting agent - a technique used on a number of other sites of course.

In the short term, there may be little effect as most documents have www.w3.org in them, but over time people would start using the new hostname instead, and of course newer DTDs would not be available anywhere else.
Permalink 2008-02-08 @ 22:50

Comment from: Fabian [Visitor]

How about limiting every IP to one request for those special URLs per whatever amount of time is sane? People needing more than this (who?!) petition you for a GUID they can use as their user agent. Of course, abuse of this GUID makes it easy to ban. Forgive any blatant ignorance by enlightening me. :)
Permalink 2008-02-08 @ 23:01

Comment from: john [Visitor]

I don't agree with the attitude in the article. In most IT shops you have lots and lots of third party apps all over the place. I think most places are lucky if they have one person there that understands the issue. And that person is probably busy.
Permalink 2008-02-09 @ 01:26

Comment from: Veritas [Visitor] · http://nigelt.blog.com

And it has been slashdotted...I find that ironic...complain about traffic, get slashdotted.

Furthermore, they mention Java clients. There must be tons of sites using Java clients that make these requests, and across those tons of sites every visitor is a request, probably times the refresh, which is probably...a lot.
Permalink 2008-02-09 @ 02:59

Comment from: Martin Nicholls [Visitor] · http://mybrokenlogic.com/

You guys are completely right about this - the number of badly behaved robots, spiders and other tools around the internet these days is starting to get silly.

We block a few of the generic ones like libwww-perl and some others, and yet they keep coming back for more.

We had another spider looking for RSS feeds that didn't exist going round & round in circles and eventually putting enough load on our server for it to send me an alert at 4am so I had to get up and block its IP range (that's another annoying trend - spiders over massive and disparate IP address ranges that won't go away.. *glares at Slurp*).

These aren't exactly the same thing, but they are kinda of the same ilk - badly behaved robots doing what they shouldn't.

There's one more thing that drives me nuts that's kinda part of this family of annoyances - robots that pick out URLs that don't exist and aren't linked from anywhere. The number of 404 requests we get from robots is insane. What I really wish is that robot developers would use the referer header so we can figure out where they got the URL from.

A little off topic, I grant you, but it's kinda the same thing. It's time robots were written properly so they don't hit servers in a way that seems like an attack; then they might not be treated as such all the time.
Permalink 2008-02-09 @ 03:04

Comment from: Jack [Visitor]

All of the examples I have seen for XHTML include the DTD statements. I suspect most people, myself included, really do not understand when a DTD statement is needed and when it is not. If it is true that "software does not usually need to fetch these resources", then examples and explanation on when and when not to use them could potentially help.
Permalink 2008-02-09 @ 03:16

Comment from: Gerald Oskoboiny [Member] · http://www.w3.org/People/Gerald/

To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, on the top page of del.icio.us/popular, etc.; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-minute sample of the logs I just checked.
Permalink 2008-02-09 @ 03:16

Comment from: Douglas W. Goodall [Visitor] · http://www.goodall.com

Please accept my humble observations. While attempting to learn about W3 and the evolving language of the hyperweb, I had certain misconceptions that must be common. I thought you were in the business of providing certain basic reference schemas that were critical to writing well-formed web pages in the time of XML and XHTML. My first few compliant web pages were cut-and-pasted versions of example pages from W3. Only now, after reading about your problem on /., do I even have a clue that it is not so. I see two ways out of this. The first is education. You must make it clear that important schemas need to live somewhere else, keeping in mind that existing ones may have to last decades for the sake of legacy documents. The other possibility is that you accept responsibility for hosting these keystone documents, but push them out onto Akamai servers or find some other way of not being the bottleneck of the hyperverse. Although this is a terrible hack, your most common existing DTDs could be cached in the browsers themselves, not unlike the root certificates. It is a compliment that people think you are the center of the Document universe, if only you can survive it. Thank you very much for your time. Doug
Permalink 2008-02-09 @ 03:25

Comment from: Jim Russell [Visitor] · http://bjimba.blogspot.com

I've been saying for years that whoever originally suggested using HTTP URLs as namespace URIs should be drummed out of the corps. I *never* use HTTP URLs to identify my XML namespaces, and neither should anyone else. Remember, folks, it is not required to be fetchable, it is only required to be unique.
Permalink 2008-02-09 @ 03:35

Comment from: Greg Hemphill [Visitor] · http://www.webstop.com/

What about getting the ISPs and major internet backbones involved? For example, an ISP could scan the URL traffic and return a locally cached version of the DTD.

Also, what if DTD requests were handled more like DNS requests, where you have a handful of root servers on the backbones and each ISP has one or more DTD servers with cached copies of the DTD schemas? The various DTD servers would then only update their caches when the TTL settings of the DTD schema expire.
Permalink 2008-02-09 @ 03:59

Comment from: Will [Visitor] · http://forkforge.org

I think that with a problem as massive as this, defeating it will require working from several different angles at once.

For starters, I'd try working with the people who make the libraries that get used by the offending apps to try and convince them to change their APIs. If the Python urllib API changed so that you had to use a custom user agent to initialize the library, it'd be easier to track down the rogue apps. (Or at least print a mode-deprecation warning to stderr if you don't give a custom User string.)
Once you find the bad apples, try to get in touch with the developers. Ask them what materials they learned from. Was there a specific online tutorial that taught them to do things badly? If so, try and submit some revised text, so that a second printing will lead fewer astray, or a tutorial webpage can update its offerings.

Maybe try to dedicate a small group who spend at least one hour every week on giving the problem publicity: posting a blog about the problem, submitting an article to slashdot, writing a tutorial on the right way to do things, actively searching out tutorials that somebody might use if they were learning how to write an app that could have these problems and asking the tutorial authors to include some discussion of the issue, etc.

And others have suggested making things painfully slow when abusive behavior is encountered. I agree with that. Break some apps, force the issue. I'd hate you if you broke one of my apps, but it still is probably the correct thing to do.
Permalink 2008-02-09 @ 04:05

Comment from: Guillaume Coté [Visitor]

I think you should insist to library providers that a catalog be configured in the default settings. I remember the last time I had to set one up in Java: it was complex and time-consuming. I would find it better if a standard catalog were configured by default and an empty catalog could be used only when specifically required.

Most developers don't have the time to play with these things.
Permalink 2008-02-09 @ 04:16

Comment from: Wayne Dixon [Visitor] · http://www.waynesworkshop.com

Can one host these files locally and just change their own URIs to reflect this? I'd be glad to host the things that I need locally; it'd make it quicker and much simpler for coding purposes.

Or if you had an SVN server that I could check for updates every so often, then if there was a change, it'd be downloaded. This would be much simpler in my opinion.
Permalink 2008-02-09 @ 04:50

Comment from: Andrew [Visitor]

Tarpitting the response sounds like a good idea, but it would require the capacity to sustain hundreds of times more connections than the servers do presently. That's not free, either.

Rate limiting the requests from individual IPs is more appealing -- it would just require some decent packet filtering software in front of the servers, still not free but not a significant cost. This will reduce the bandwidth you're paying for, but it probably won't reduce the connection attempts significantly.

I think the mistake you're making here is that you only returned 503s for "hours or days". Broken software is often broken due to neglect. Days is nothing. Think months, years maybe.

OK, that's contrary to the mission of the W3C, I understand.

Here's an alternative, if rate limiting requests turns out, as I suspect, to have zero effect after months:

Announce that the URIs are changing. dtd.w3.org is a good idea [1]. Wait a few months. Start returning 301s for the old request paths (far less bandwidth, and just as fast or faster than returning the DTD). Broken software that doesn't cache results probably won't follow the redirect (try it for a few hours to see). Humans that don't follow the w3c news, but do care, will check in a regular browser and see what's going on and update their software. If broken software follows the redirect, send them a special error page with instructions on how to clear their src address from the blacklist.

Even if all of the above fails, you'll still have gained by moving to a new host part of the URI -- as mentioned before, you can distribute the load to other hosts in different places, and you can put them behind slow connections without affecting www.w3.org.

[1] dtd.w3.org (or any three-character nodename to replace "www") is important because the source to some of the software could be lost or otherwise unavailable. Replacing "www" with "dtd" in a binary editor is simple, but changing it to a string of a different length might not be. Obviously don't change the request path portions of the URIs either. :)

PS: Preview appears to be broken -- any changes I made, and then re-previewed, were lost in both the preview display and the textarea. Hopefully the edits will make it through when I 'Send' for real. If they do, this comment will be visible.
Permalink 2008-02-09 @ 04:55

Comment from: Arman Sharif [Visitor] · http://jreform.sourceforge.net

Why not force people to host the DTDs they need on their own servers?

For example, W3 could offer DTDs for download as zip files (which could also be distributed through other sites / mirrors).

Another possibility is to present DTDs within regular web pages so people could copy and paste them.
Permalink 2008-02-09 @ 05:42

Comment from: Don Marti [Visitor] · http://linuxworld.com/community/

You could always tarpit them. The naughty software starts running slowly, users complain, and the vendors fix it.
Permalink 2008-02-09 @ 05:44

Comment from: Daniel Barkalow [Visitor]

I think the main source of the problem is that the XML spec (section 5.1) implies that a validating parser without any persistent state is supposed to do this, and so they mostly do. For example, the default Java SAX handler, if given an XHTML document, will fetch all of the DTD parts, extremely slowly. Of course, it'll use a User-Agent like Java/1.6.0, because the application author probably doesn't realize that the application will be making any network connections at all.
Permalink 2008-02-09 @ 06:22

Comment from: John [Visitor]

I second the "serve them slow" suggestion. The owners of the bad software, who usually won't notice a 503, very likely will notice when their precious software slows down to a crawl and they don't get their TPS reports on time.
Permalink 2008-02-09 @ 06:58

Comment from: David [Visitor]

Yes this is a huge problem. The applications that do these requests slow themselves down and the whole network. There is no need. For Java apps, see "Apache XML Commons resolver" http://xml.apache.org/commons/components/resolver/ which developed from Norm Walsh's work.
Permalink 2008-02-09 @ 07:31

Comment from: Devon [Visitor] · http://devonyoung.com/

That's insane. I thought I had it bad with "deaf user-agents" that refuse to hear me scream 410 or 302 at them after saying it loud and clear 20 times each day for a month. I feel for you.
Permalink 2008-02-09 @ 07:49

Comment from: ivan [Visitor]

I think you have to combine the solutions that Poul-Henning Kamp and Ben Strawson proposed.

I think a good solution is to use DNS load balancing with a separate domain (e.g. schema.w3c.org or dtd.w3c.org) and find partners who can operate reliable mirrors (I think you can easily find such partners). To solve the current problem you should permanently redirect (HTTP 301) requests to the separate domain. You can slow down serving requests for the biggest crawlers.

Besides these solutions, maybe you have to contact the major XML library vendors to ask them to disable validation or enable caching of DTDs by default, and to write best practices about validation in their documentation.

I can't understand why somebody would use validation this way; it's the slowest I can imagine. :-)

Best regards,
ivan
Permalink 2008-02-09 @ 09:04

Comment from: Lyle [Visitor]

I have to agree with a delay. After a user requests more than 100 in an hour, put them on a 2 second delay; after 1,000 a 10 second delay.

That may alert them to the fact that something is wrong.

And for those addresses that hit the 10 second delay and don't seem to notice or slow their requests after 3 months, cut 'em off after the first 100 requests in an hour.

Personally, I am also curious about all of the new developer tools for Firefox, IE, etc. that perform validation on every page. I hope these are using caching mechanisms, but given the ease with which they can be implemented, they could quickly set up a distributed DoS.
Permalink 2008-02-09 @ 09:28
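
The delay tiers proposed in this comment (more than 100 requests in an hour: 2 seconds; more than 1,000: 10 seconds) might look roughly like the sketch below. The `HourlyCounter` name, the sliding one-hour window and the in-memory storage are assumptions made for illustration; the three-month cut-off rule is not modelled.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "in an hour", interpreted here as a sliding window


def delay_for(count):
    """Map a client's request count in the last hour to a delay in seconds."""
    if count > 1000:
        return 10.0
    if count > 100:
        return 2.0
    return 0.0


class HourlyCounter:
    """Tracks request timestamps per IP address (in memory only)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._hits = defaultdict(deque)

    def record(self, ip):
        """Record one request and return the delay to apply to it."""
        now = self._clock()
        hits = self._hits[ip]
        hits.append(now)
        # Forget requests that fell out of the one-hour window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        return delay_for(len(hits))
```

As other commenters point out, many clients can sit behind one NAT address, so per-IP counting like this is a blunt instrument.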

Comment from: Steve [Visitor] · http://www.stevekallestad.com

It's good that this blog post is getting attention - I'm sure many of the developers of various libraries are starting to take notice. IMO, that's the biggest problem right there - libraries, their documentation, and their default behavior.

It's a hard sell for a library developer to include a disk-based cache mechanism as part of the parsing library when something like that really should be the responsibility of the program making use of the library. If every XML parsing library were to include file I/O as part of the package, that would be overkill. Certainly publishing tutorials about caching for performance reasons, including sample code, would be a good thing. Also, memory-based caching should really be included by default for those libraries that can.

I think the best way to resolve the problem for W3C would be to create some tutorials for using some of the more common parsing libraries that exhibit undesired behavior by default, and as part of those tutorials show performance benchmarks. As those kinds of things start hitting the top pages for Google searches, developers will take notice and start to build better solutions.

I think short term fixes like 503 or even 404 errors will end up doing very little to resolve your issues long term, and probably will have very little effect short term as well.

I also think that setting up a subdomain for DTDs (dtd.w3.org) is a great idea. It at least opens up some potential solutions down the road (geo-responsive DNS, etc.).
Permalink 2008-02-09 @ 09:36

Comment from: Nils H [Visitor]

Some cases like "Java/1.6.0" user agent may be that it's a Java applet or application using the standard Java library. Changing the user agent string here may not always be possible and blocking it may cause harm to a lot of applications on the net.

One way that was suggested is to limit the throughput to the resource, and that may be fine, but that should be done dynamically to not introduce problems for "normal" applications. To make things worse - in some cases a lot of requests may appear from a single IP address, but that address may be a NAT firewall for a large company. Of course - such companies should have a web cache.

A limitation in throughput may not even be easy for the application developer to track down since the delay may be masked in a library and the developer may end up hunting problems in all other places than the DTD link.

The use of a separate DTD serving address may actually be a good idea, since it will allow for a distributed load. The downside is that it will take some time before it becomes effective and that it will require some resources. The good side is that the amount of data served is very small and also very predictable, so it's not very problematic to set up such a server.
Permalink 2008-02-09 @ 09:39

Comment from: Henri Sivonen [Visitor] · http://hsivonen.iki.fi/no-dtd/

Your list of questions omits two important ones:

  1. Are DTDs a bad idea and should we get rid of them?
  2. Is using URIs as identifiers that aren't meant to be dereferenced a bad idea and should we stop using them that way?
Permalink 2008-02-09 @ 10:34

Comment from: Barry Hunter [Visitor]

Perhaps 503 isn't the ideal status code, since it basically means "we are busy now, try again later". You don't want them to try again, but rather to use a cached version. 403? This is of course on the precondition that you are only serving this to persistent bad guys ;)
Permalink 2008-02-09 @ 13:13

Comment from: Michael Daconta [Visitor] · http://www.daconta.us

Hi Folks,
Sorry, but I think the W3C is at fault here. If a namespace should not be retrievable, then DON'T use an http URL to identify it.
If you want to foster or require the use of caching, define some kind of optional cache identifier to define a namespace.
In general, I think the identity of a resource being retrievable is a good thing. It promotes the idea that the "definition" of an identified thing can be discovered by retrieving it. That makes sense - the only disconnect here is that we are talking about a "well-known" thing like HTML where the logic breaks down. However, for not very-well-known tag sets, this makes a lot of sense.
So, I recommend you make an alternative syntax to specify that a local cached copy exists (or must exist). Then you can switch the default to that "cached copy URI".
If such a caching URI exists, my apologies for not researching it before posting.
Best wishes,

- Mike
Permalink 2008-02-09 @ 13:35

Comment from: NSK [Visitor] · http://karastathis.org/

How about tracking down those who generate too many requests and sending them a nice bill with lots of 000s?
Permalink 2008-02-09 @ 14:02

Comment from: Jay [Visitor]

The standards encourage this behavior. The HTML 4.01 spec says, "The URI in each document type declaration allows user agents to download the DTD and any entity sets that are needed." And the XHTML spec says, "If the user agent claims to be a validating user agent, it must also validate documents against their referenced DTDs". General purpose SGML or XML parsers will not embed the HTML DTDs, and even if they have the luxury of a cache it isn't likely to persist across process invocations. There is obviously a software component to this problem, and developers need to be aware. But as you point out, the problem is not limited to the W3C. The best way forward will be to improve infrastructure, and in particular, to find sustainable caching strategies. I would like to think that the Scalability TAG will come up with solutions, but the emails are not encouraging.
Permalink 2008-02-09 @ 15:34

Comment from: uv [Visitor]

Track the IP to a person (use a court order if necessary) and find out what is the offending software. Then sue them for abuse. One case like that and you would get much more publicity than via slashdot.

Track the IP to an ISP and ask them to install a transparent proxy for your site, or to contact their user and tell them to configure one.

Convince Sun that the next update of Java 6 (and Apache commons) would install a local cache. Same for Python urllib2 and Perl's libwww.

I like the idea of slowing down offensive connections, but since that may be hard on the server level, you can just return a wrong DTD. Make it have a valid syntax, so the DTD parser would not fail, but contain no real elements. If that fails, use a completely invalid DTD.
PermalinkPermalink 2008-02-09 @ 16:16

Comment from: Paul Boddie [Visitor]

If we ignore the intentional "dual purpose" of the URLs concerned (that they should be used as a unique identifier, yet can also be used to consult the DTD), probably the biggest reason why the URLs are getting so many hits is because many parsing toolkits have bad defaults: that for many implementations of APIs like SAX, you have to go out of your way to override various "resolver" classes in order for your application not to go onto the network and retrieve stuff. So it's quite possible that most people don't even know that their code is causing network requests until their application seems to be freezing for an unforeseen reason which they then discover to be related to slow network communications.

My first experience with excessive network usage probably arose with various Java libraries, but it's true that Python's standard library has similar mechanisms, and if you look at tools like xsltproc there are actually command switches called --nonet and --novalid, implying that you'll probably fetch DTDs with that software unless you start using these switches.

Who is responsible? Well, I don't think you can put too much blame on the application authors. If using XML technologies has to involve a thorough perusal of the manual in order to know which switches, set in the "on" position by default, have to be turned off in order for the application to behave nicely, then the people writing the manual have to review their own decisions.

Some clearer language in various specifications would help, rather than having to read around the committee hedging their bets on most of the issues the whole time.
PermalinkPermalink 2008-02-09 @ 17:06

Comment from: Josh Anderson [Visitor]

Ban them. It doesn't have to be complicated. Just ban them until the number of requests goes down and stays that way.

Start with individual IP addresses. If you're banning a significant number of IP addresses in a class (say, 50% of them), just ban that whole network. Keeping track of all the addresses/classes efficiently would require a bit of clever data structure usage, but it could be done. (I'd start with a trie with a child-count in each node.)
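To make the trie idea concrete, here is a minimal sketch in Python. It is an illustration only: the class name `BanTrie`, the 50% threshold default, and the choice to aggregate only at octet boundaries (/8, /16, /24) are assumptions, not anything W3C runs.

```python
import ipaddress


class _Node:
    """One trie level; `banned` counts the banned addresses below this node."""
    __slots__ = ("children", "banned")

    def __init__(self):
        self.children = {}
        self.banned = 0


class BanTrie:
    """Trie keyed on IPv4 octets with a banned-address count in every node."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # share of a network that triggers a network-wide ban
        self.root = _Node()

    @staticmethod
    def _octets(ip):
        # .packed yields 4 bytes; iterating over bytes gives integers 0..255
        return ipaddress.IPv4Address(ip).packed

    def ban(self, ip):
        node, path = self.root, [self.root]
        for octet in self._octets(ip):
            node = node.children.setdefault(octet, _Node())
            path.append(node)
        if node.banned:  # already banned: do not count the address twice
            return
        for visited in path:
            visited.banned += 1

    def is_banned(self, ip):
        node = self.root
        for depth, octet in enumerate(self._octets(ip)):
            node = node.children.get(octet)
            if node is None:
                return False
            capacity = 256 ** (3 - depth)  # addresses covered by this node
            if node.banned >= self.threshold * capacity:
                return True  # the address itself, or enough of its network, is banned
        return False
```

With the default threshold, banning 128 addresses in one /24 makes `is_banned` return True for every address in that /24, while addresses in neighbouring networks are unaffected.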

Anyway, just ban them.
PermalinkPermalink 2008-02-09 @ 23:13

Comment from: Gerald Oskoboiny [Member] · http://www.w3.org/People/Gerald/

Thanks to everyone for your comments. I'll try to reply in more detail later, just a quick followup for now:

The tarpitting idea sounds worth trying; does anyone have specific software to recommend that is able to keep tens of thousands of concurrent connections open on a typical cheap Linux box?
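For illustration only, this is roughly what an HTTP tarpit could look like with Python's asyncio, where each open connection is just a sleeping coroutine. The response text, the port, and the names `drip`, `handle` and `serve` are assumptions, and whether this holds tens of thousands of connections on a cheap box is untested here.

```python
import asyncio

# Hypothetical response body; a real tarpit might serve the DTD itself.
RESPONSE = (b"HTTP/1.0 200 OK\r\n"
            b"Content-Type: text/plain\r\n\r\n"
            b"Please cache DTDs locally.\r\n")


async def drip(writer, body, chunk_size=1, pause=1.0):
    """Send `body` a few bytes at a time, sleeping between chunks."""
    for start in range(0, len(body), chunk_size):
        writer.write(body[start:start + chunk_size])
        await writer.drain()
        await asyncio.sleep(pause)


async def handle(reader, writer):
    """Read the request line, then answer as slowly as configured."""
    try:
        await reader.readline()
        await drip(writer, RESPONSE)
    except ConnectionError:
        pass  # the client gave up, which is the point of a tarpit
    finally:
        writer.close()


async def serve(host="127.0.0.1", port=8080):
    """Accept connections forever; every client gets the slow response."""
    server = await asyncio.start_server(handle, host, port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve())` would start the listener; the per-connection cost while waiting is one suspended coroutine rather than a thread or process.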
PermalinkPermalink 2008-02-10 @ 09:10

Comment from: Sander Bos [Visitor]

As Arman Sharif says, you would think that the W3C would make it easy to download the schema files they don't want you to access directly. But in my experience the W3C does not offer this properly.

I actually wrote some software that reads XHTML documents into XML DOMs. As soon as the XML parser encounters an entity reference, the URL will be loaded. So I created a local resolving mechanism with an entity resolver to read the DTDs from local files; however:
- I had to go to all the individual specifications and download the individual specs there, and create my own full repository (I don't have the source-code here, but I am quite sure I ended up with over 50 files for 5 or specs).
- Create my own mapping file that goes from public (handling broken in .Net XML parsers) and system ids to my local files.
- And then of course implement entity resolving to actually pick up my local files.
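The three steps above can be sketched in Python with the standard `xml.sax.handler.EntityResolver` interface. This is a minimal sketch, not the code described in this comment: the directory name `dtds`, the two mapping dictionaries, and the class name `LocalDTDResolver` are assumptions made for the example.

```python
import os
from xml.sax.handler import EntityResolver

# Hypothetical mapping files, shown inline: identifier -> file name in the local repository.
PUBLIC_MAP = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xhtml1-strict.dtd",
}
SYSTEM_MAP = {
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd": "xhtml1-strict.dtd",
}


class LocalDTDResolver(EntityResolver):
    """Resolve public/system identifiers to files in a local directory only."""

    def __init__(self, base_dir, public_map, system_map):
        self.base_dir = base_dir
        self.public_map = public_map
        self.system_map = system_map

    def resolveEntity(self, publicId, systemId):
        name = self.public_map.get(publicId) or self.system_map.get(systemId)
        if name is None:
            # Refuse instead of silently falling back to the network.
            raise LookupError("no local copy for %r / %r" % (publicId, systemId))
        return os.path.join(self.base_dir, name)
```

A SAX reader can be given such an object through `setEntityResolver()`; whether and when the parser consults it depends on the parser and on its external-entity settings, so that part needs checking per library.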

Every time a developer implements an application that loads html documents using a standard XML parser (a quite common thing I would say), they need to perform these steps to alleviate stress on the W3C servers.

What I actually naively expected this article (found from slashdot) to contain when I opened it was a link to an archive with the files for all your stable specifications in one, with an id->path mapping, and some sample resolver code for common parser libraries in various languages. Does it exist?

(and caching it after one request is not usable for many situations, since my reason for caching was actually not to lessen W3C's Internet bill but allow the application to run without Internet access)
PermalinkPermalink 2008-02-10 @ 09:47

Comment from: Peufeu [Visitor]

130 million requests a day is about 1500 per second. On a few static files. A 3-year-old laptop running lighttpd can manage that easily (and actually a lot more). On my server, which is the cheapest Core2, lighttpd handles 200 req/s and it uses a few percent CPU.

Just use lighttpd or nginx (obviously you should forget about Apache!)

Now here is my suggestion to get rid of the spamming.

You need two servers, a main server and a backup server. Both are machines suitable for serving a few thousand static requests per second, i.e. the cheapest Core 2 boxes you can get. You will need to adjust the allowed number of sockets on both, of course, to allow as many concurrent connections as you can. Don't forget to enable zlib compression.

Now you implement some redirections:
- When www.w3.org sees a request for a DTD, it redirects to dtd.w3.org which runs on the "main" box.
- The "main" box has a good connection (like 100 Mbits)
- When the "main" box receives an HTTP request, it looks up the client IP address. If this address has submitted few requests, it serves the request. However, if this IP has submitted, say, more than 5 requests in a period of a few hours, it redirects it to the "backup" server, which is dtd2.w3.org, on a different IP.

To implement this you will need to code a simple C module for lighttpd.

- The "backup" server is connected to the internet via a completely separate, rather slow (10 Mbits) connection. It just serves static files.

So, the "main" server will always be fast and responsive, and the "backup" server will always have its connection horribly saturated.

Therefore, any client will get fast response on the first request from the "main" server. Well behaving clients will cache it, and it ends there. Badly behaving clients will not cache it and will request again, they will get redirected to the "backup" server and feel the pain.
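The per-IP decision described above is small enough to sketch in Python instead of a lighttpd C module. This is only an illustration of the counting logic: the class name `RepeatTracker`, the 3-hour window, and the "main"/"backup" return values are assumptions.

```python
import time
from collections import defaultdict, deque


class RepeatTracker:
    """Remember recent request times per client IP and pick a server."""

    def __init__(self, limit=5, window=3 * 3600, clock=time.monotonic):
        self.limit = limit      # requests allowed per window before redirecting
        self.window = window    # window length in seconds ("a few hours")
        self.clock = clock      # injectable clock, which makes testing easy
        self.hits = defaultdict(deque)

    def choose_server(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        recent.append(now)
        return "backup" if len(recent) > self.limit else "main"
```

A front end would translate "backup" into an HTTP redirect to the slow host; a real deployment would also need to bound the memory used by `hits`.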


PermalinkPermalink 2008-02-10 @ 13:55

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Thank you all for the comments.

Scaling is not the long-term solution, as it does not address the cause; however, it is something we will have to continue to do, and we appreciate the suggestions made in this area.

By making this post we are trying to increase awareness so ideally this gets resolved as far upstream at the library level as possible since that will have the broadest effect. Community involvement with their respective development platforms of choice will help as we have had mixed success in identifying and contacting software and library maintainers. Some have been very responsive and are implementing caching catalogs or making static catalogs a default instead of an afterthought left to those installing and utilizing the library. Some developers have noticed our blocking scheme and have contacted us letting us know they have taken corrective steps on their own.

For those who wondered why these schemata and namespace resources are made available via HTTP to begin with: we intend for them to be dereferenced as needed, but expected the request volume to be more reasonable given the caching directives in the HTTP spec. The performance cost of going over the network repeatedly for the same thing should be reason enough for developers to cache. Since many of these systems ignore response codes, a tarpit solution might very well succeed in gaining their attention, and it has some entertainment value. If their application performance suffers substantially enough, developers may take notice.

As mentioned, many of these systems only understand a small handful of the various HTTP responses (200 OK, 302 Found, 401 Unauthorized, 403 Forbidden, 404 Not Found). We are more than slightly curious how browser plugins would handle HTTP 401 "Authorization Required" responses to their requests; we suspect, for instance, that one large ISP's webmail plugin is a traffic culprit. Inside the realm part of the WWW-Authenticate header it would be quite tempting to give the technical support phone number of the ISP who has not listened to our repeated phone calls and emails on the subject. That would likely get their attention and potentially encourage them to correct the plugin.

Identifying the sources of W3C's abusive DTD traffic can be quite time consuming and difficult, depending on the data in the HTTP request. One rather odd case we see often has the HTTP Referer set to "file:///C:/WINDOWS/fonts/set.ttf", and we have so far not found the related software. For identifying some sources we have found resources provided by various organizations (e.g. McAfee SiteAdvisor) that catalog browser plugins, software network interactions and viruses quite helpful. We would very much like to collaborate with such organizations or similar community efforts to help us identify more of the software responsible for this traffic. We have made a couple of efforts to establish contacts within such organizations, but unfortunately emails and phone calls have not gone very far. Specific suggestions or contacts for us to follow up with would be appreciated.

PermalinkPermalink 2008-02-10 @ 18:30

Comment from: Louis [Visitor] · http://netsharc.wordpress.com/

The number of comments here, and on Slashdot, blaming the W3C for specifying a URL as a DTD id and thinking it is the correct behavior for programs to fetch that URL, is amazing...

It's just one more proof that the internet is gaining more widespread use.. the percentage of idiots online is now probably equal to the percentage of idiots in the world...
PermalinkPermalink 2008-02-10 @ 18:37

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Martin Nicholls,

I cannot agree more with your sentiment towards poorly behaving bots/crawlers, they are getting out of hand. There has been talk around here at W3C and elsewhere on starting an activity for directives governing bot interactions with a website. There have been some scattered conventions which should be standardized and improved upon.

For instance polite bots could

  • identify themselves properly, including a URI with various information
    • how to submit complaints
    • means to authenticate the crawler and the IP range it is coming from
  • respect directives regarding frequency, concurrency, etc. given in response headers of the site being crawled
  • advertise peak hours with higher thresholds if the crawler would like to schedule its return
  • server being crawled could make available data of resources and last modified dates so more intelligent, minimal crawls can be made, saving both sides resources

Those bots that do not abide by these conventions and overstep the boundaries spelled out can be spotted and blocked through automated means.

Those that do could do their indexing in as efficient a manner as is comfortable for the website being crawled.

PermalinkPermalink 2008-02-10 @ 18:56

Comment from: Joshua [Visitor]

Why not change the doctype tag to something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"dtd://TR/xhtml1/DTD/xhtml1-strict.dtd">


Then it's known where the DTD is located in case it's actually needed, but it would only be used by applications and libraries that probably need it, since it would require special handling instead of blind handling. Slowing traffic for the http link wouldn't be great, because on a few occasions I've downloaded the DTDs to learn the document types (you do want valid html instead of what most tutorials provide, right?).

Still unique, still locatable, and hard to misinterpret.
PermalinkPermalink 2008-02-10 @ 22:05

Comment from: David A. Spitzley [Visitor]

If I'm understanding correctly, this problem is located firmly in the code realm, not the HTML realm. For those of us who are simply web developers, and not application developers, am I correct that no changes to our DTD referral practices are called for?
PermalinkPermalink 2008-02-11 @ 16:58

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

David, Correct, in fact we encourage DTD referral by web developers to differentiate what their markup is.
PermalinkPermalink 2008-02-11 @ 17:17

Comment from: Stephen Fletcher [Visitor] · http://www.thief.net

There is some good SMTP tarpit software in OpenBSD. It isn't hard to change the BSD-specific C calls to Linux-based calls using the same functions from Rsync. I guess it would need to be modified for HTTP as well, however.
PermalinkPermalink 2008-02-12 @ 01:16

Comment from: Clyde Kessel [Visitor]

Serving requests very very slowly will unduly penalize the innocent bystanders. Many of us have no idea where to look when an application has slow response. Our systems are running software from dozens of vendors, and we will have no idea which of the vendors is running slowly. So we will just suffer. Or, we will decide our computer is old and needs to be replaced with a faster one: One which can hit your website even more frequently.
PermalinkPermalink 2008-02-12 @ 14:37

Comment from: JR [Visitor]

I've had similar problems with dumb crawlers that couldn't handle escaped '&' entities in URLs and would bombard the server with invalid requests for them. So I have sympathy.

However, I don't know if the tarpit solution is a good idea. What's John Q. Public who's running some misbehaving software going to think? "Oh, this must be what they mean when they say XML is slow. This problem never happened before I put the DOCTYPE on all my files. That guy who pushed us to adopt XHTML is a moron." Fix the problem going forward by changing the scheme for identifying DTDs, but think carefully before spreading the pain just to save the W3C some inconvenience.
PermalinkPermalink 2008-02-13 @ 05:48

Comment from: Martin v. Löwis [Visitor]

I would like to point out that, contrary to what Ted says, the system identifier *is* a downloadable resource, not just an identification (unlike the namespace URI, which is mere identification). Specifically, XML 1.0, section 4.4.3, says that a parser MUST "include" references to parsed entities if it is validating (so it's no option not to read the DTD), and it MAY do so even if non-validating.

In particular, reading the DTD is necessary even in non-validating mode in case the document contains entity references.

Of course, to read the DTD, one might be able to use an alternate URL based on the public identifier. Unfortunately, catalogs are not in wide-spread use, and W3C does nothing to promote them.
PermalinkPermalink 2008-02-16 @ 00:09

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Martin,

Where did I say System Ids are not downloadable resources? This post is about the frequency of the downloads, disregarding HTTP caching directives.
PermalinkPermalink 2008-02-16 @ 04:42

Comment from: Daniel [Visitor]

Ted,
I think Martin was referring to the following excerpt, which does sound like "these URIs are identifiers, not for download".

Note that these are not hyperlinks, these URIs are used for identification. This is a machine-readable way to say -this is HTML-. In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over



As many have pointed out, the data downloaded is needed, so it'd be great if W3C could provide basic catalogs/suggestions to be used as the sane default.

You have to keep in mind that caching is a great solution for many given scenarios, but low level tools/libraries cannot be expected to assume caching/catalogs are THE right thing to do when the spec includes checking against what the URI points to.

You saying that developers were supposed to implement caching due to their own performance concerns makes me wonder: what if most already do that? What will change when they do?

What if most hits you get are from software that scrapes "http://[...]" from data and follows that? What if library per-process/thread cache is already there but the system forks for each URL? How about distributing batches of URLs to visit?

So IMO the W3C has a chance of simply postponing the issue if no steps are taken towards providing local, reliable catalogs to the community and changing the recommended http:// URIs to something else (like the dtd:// above).
PermalinkPermalink 2008-02-17 @ 04:32

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Daniel,

A misunderstanding then, my apologies for us being ambiguous. We went with that wording to avoid going into the differences between DTDs and namespaces which parsers have no need to dereference as there may not be anything of use to them as is the case with xmlns="http://www.w3.org/1999/xhtml". DTDs are meant to be downloaded for machine processing but reasonably not incessantly by an application running on a machine. We are seeing XML processors grab these even when they are not using them.

Making catalogs available has come up before and we certainly will consider it. Catalogs still would need to make their way into the various tools and libraries, many of which do not come with any. We are just one of the many organizations and individuals making these sorts of resources (namespaces, DTD) available so tool and library developers will still have to collect these.

It is difficult for tool and library developers to know what markup will run through their utilities for validation or transformation, not to mention that new schemata are always being created. Because of this, the best solution is a caching XML Catalog resolver, which I understand is part of Glassfish. The library adds DTDs to its cache as it needs them; caching is part of the HTTP protocol.

PermalinkPermalink 2008-02-17 @ 05:48

Comment from: Daniel [Visitor]

Ted,

Thanks a lot for the discussion.

I like the caching idea, but believe distributing the load extends it. The general issue with caching is related to where to cache. Local (or in memory, per process, etc.) caches will be less efficient than shared ones. Shared (system, library) caches will have their own load of issues. So you might end up with much nicer libraries and still be hammered by requests.

Notice that scaling up and mirroring amounts to an extremely-shared cache. I believe having the machinery for mirroring in place (checksums, compressed snapshots, change notifications) could lead to lower level mirroring (dtd-daemon, anyone?).

The benefits would be:
Network admins could save resources that lazy programmers forgot to (and legacy code would automagically stop being so nasty).

Users could get performance boosts by installing software that tricks dumb apps to fetch DTDs from a local cache, regardless of upstream actions.

Library developers (and even dumb programmers) would have a Darn Easy® recommended route to caching, as local-ish mirroring and checksums would be discussed all over the place (and faster, cheaper, tastier).

Also, maybe you should talk to Coral Content Distribution Network regarding forwarding traffic. It might be interesting for them to have such a huge source of input to their research.

On a meta note, I think it could be very useful to have a central location (wiki?) to gather resources and discussions on this issue.
PermalinkPermalink 2008-02-18 @ 00:43

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

One of the arguments against having caching resolvers in XML libraries has been that this is attainable outside of the library, which it certainly is, with a caching proxy server for instance. It is a very worthwhile solution and is why we give the caching directives in the first place.

We have seen a number of corporate and large ISP HTTP proxies hammering us because of some XML application[s] running behind them. Sometimes the network admins would, if they were responsive at all to us, add caching to their proxy setup or less often track down the parties responsible for the software causing the traffic. More often they would refuse to add caching to their proxy or any other action citing cost or complexity. Bandwidth is cheaper than equipment and admin time I guess.

PermalinkPermalink 2008-02-18 @ 01:31

Comment from: Stéphane Bortzmeyer [Visitor] · http://www.bortzmeyer.org/

It is strange (and probably an indication of the lack of XML knowledge
of many posters here) that no one mentioned the best solution on the
application side: catalogs. They have been part of SGML and XML for
many years, so there is good support for them. Any XML parser should
support catalogs and, then, the DTD would be retrieved on the local
disk and not through the network.

http://www.oasis-open.org/committees/entity/spec-2001-08-06.html
http://www.sagehill.net/docbookxsl/Catalogs.html

(Of course, there are always broken programs and sites where catalogs
will not be installed, so there is still a need for other measures.)
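As a rough illustration of what a catalog lookup involves, the Python sketch below reads a tiny OASIS XML Catalog and maps identifiers to local paths. It is a simplification: the sample catalog, the local `dtd/` paths, and the function names are assumptions, and features such as `prefer`, `rewriteSystem` and `nextCatalog` are ignored.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical two-entry catalog pointing at a local copy of the DTD.
SAMPLE_CATALOG = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="dtd/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="dtd/xhtml1-strict.dtd"/>
</catalog>
"""


def load_catalog(text):
    """Return (public, system) dictionaries built from catalog XML text."""
    root = ET.fromstring(text)
    public = {el.get("publicId"): el.get("uri")
              for el in root.iter("{%s}public" % CATALOG_NS)}
    system = {el.get("systemId"): el.get("uri")
              for el in root.iter("{%s}system" % CATALOG_NS)}
    return public, system


def resolve(public, system, public_id=None, system_id=None):
    """Try the system identifier first, then the public one; None if unknown."""
    if system_id in system:
        return system[system_id]
    return public.get(public_id)
```

In practice one would rely on the catalog support already built into the parser or toolkit in use; the sketch only shows that the lookup is a local table consultation and needs no network access.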

PermalinkPermalink 2008-02-19 @ 13:27

Comment from: Stéphane Bortzmeyer [Visitor] · http://www.bortzmeyer.org/

A few people mentioned testing that the file has been changed or
not. The HTTP protocol has an If-Modified-Since header for precisely
this purpose and W3C's server honors it. You can set it, for instance,
with curl:

% curl -v --time-cond 20080201 -o html.dtd http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
...
< HTTP/1.0 304 Not Modified
[No download]

% curl -v --time-cond 20000201 -o html.dtd http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
...
< HTTP/1.0 200 OK
[And the file is actually downloaded]

Of course, this requires a program that sends it and has a local
storage to keep the DTD, but recommending this technique may help
(among other techniques like HTTP caching, XML catalogs, terminating
the offenders, etc).
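The same conditional request can be made from a program. Below is a minimal Python sketch using only the standard library; the function names are made up for the example, and storing the DTD and the time of the last fetch is left to the caller.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def http_date(moment):
    """Format an aware datetime as an HTTP date, e.g. for If-Modified-Since."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def conditional_request(url, last_fetched):
    """Build a GET request that asks for the body only if it changed."""
    return urllib.request.Request(
        url, headers={"If-Modified-Since": http_date(last_fetched)})


def fetch_if_changed(url, last_fetched):
    """Return the new body, or None when the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetched)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

A caller would keep the local copy whenever `fetch_if_changed` returns None, and overwrite it (and record the new fetch time) otherwise.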

PermalinkPermalink 2008-02-19 @ 13:36

Comment from: Les [Visitor]

I want to do a system call and keep a copy of a DTD on my computer. I have read about this and tried it. How do I know it is working?
I am using !DOCTYPE html SYSTEM "loose.dtd"
Does the browser actually refer to the W3C DTD I have placed on the computer, or do I need to put the identifier in as well, before the URL

as :
!DOCTYPE html PUBLIC "-W3C//DTD Html 4.0 Transitional//EN" "loose.dtd" ?

PermalinkPermalink 2008-02-23 @ 22:55

Comment from: Jeroen Pulles [Visitor] · http://redslider.net/

Ted,

Adding to what Daniel already said:

I just happen to be stacking together a new flavor of modular XHTML in the spirit of XHTML+RDFa for the backend of a new website I'm working on.

I'm using libxml on MacOS X via MacPorts. MacPorts has a package with HTML4 dtd's, but not XHTML and it does not supply a catalog with the DTD's. I'll accept that I'll have to add a new entry to the catalog, but I'll still have to get the DTD's in the first place.

I have three options at this point: download each DTD (module) manually through my webbrowser at /MarkUp/DTD, let wget crawl the DTD directory, or download Debian's w3c-dtd-xhtml package and rip the files from that package. Hardly convenient.

I assume that a lot of developers will even ignore the speed problems whilst getting their new apps to work.

I think it would really help if W3C would package its DTD's in a tar.gz, and perhaps even pro-actively work with package maintainers to distribute these files.

Obviously, this will not be a quick fix to your bandwidth problems, but I think it does address the core of the problem: Too many developers are not aware of the inner workings of XML validation (or validating parsing) and assume 'it just works'.

my 2 cents
PermalinkPermalink 2008-03-18 @ 12:05

Comment from: kev [Visitor]

Well, this could be an example of rogue crawlers, bots and spiders having an effect on the website. Large numbers of crawlers fetch pages and follow anything that looks like a link; they cannot analyse the HTML, so they fail to ignore the URIs in the DOCTYPE and xmlns declarations.

I agree we can't just depend on w3 to persistently follow up as everyone has the responsibility to help.
PermalinkPermalink 2008-12-28 @ 15:30

Comment from: Jon Leech [Visitor] · http://www.opengl.org/sdk/docs/man/

Can anyone comment on the combination of IE7, Docbook-generated XHTML, and the DTD http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd that's embedded in such XHTML? We've generated a bunch of man pages built using Docbook 4.3 (see the link at top) which are now all failing in IE because (as best we can figure it out) w3.org is rejecting requests from the IE user agent.

While I sympathize with the bandwidth concerns discussed above, the question is what can we do about it? We're not the source of the offending app, Microsoft is, but we bear the consequences. Even if MS were to turn around a caching patch quickly, it probably wouldn't get widely deployed for years and I imagine the W3C admins will not lift the IE ban until then.

All I can think of to do on our end is to locally cache the DTD (and the entity files it references, IE also tries to fetch those) on our server and patch all of the documents to refer to those.

BTW, I don't see any reason this isn't affecting the combination of IE with every Docbook-generated XHTML doc in the world, if they're built using the standard stylesheet distribution.

Are there any other options within our control, that don't require cooperation from W3C or Microsoft?
PermalinkPermalink 2009-02-25 @ 08:23

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Jon,

So I notice in these man pages, the main frame (eg http://www.opengl.org/sdk/docs/man/xhtml/glBindAttribLocation.xml ) is already XHTML markup but served with HTTP header Content-Type: application/xhtml+xml

If you serve this [X]HTML markup as HTML instead of XML, MSIE will not call its XML processor, which in turn tries to dereference the DTD from us. Try serving it with a .html extension and/or Content-Type: text/html and your problem should be resolved.
PermalinkPermalink 2009-02-26 @ 13:26

Comment from: Jon Leech [Visitor] · http://www.opengl.org/sdk/docs/man/

Hi Ted,
If we serve our content as HTML instead of XML, as you suggest, then IE will not invoke the MathPlayer plugin to render MathML content and the pages aren't rendered correctly. Getting MathML displayed properly in the man pages is pretty much the whole point of this exercise, so I don't think that will work for us. We've modified the man pages to refer to a local cache of the DTD, and that seems to work well enough.
PermalinkPermalink 2009-03-16 @ 03:15

Comment from: Helen Gavaghan [Visitor] · http://www.gavaghancommunications.com

I would like to know why certain workable tags have to be deprecated. It seems fine to me that some people have certain choices, but I like working with basic html and the creativity it allows. The permissible tags stifle creativity, and it seems to me to be anticompetitive if I am not allowed to use basic html tags. Why should the underline option be blocked? It is obvious when one moves one's cursor over the word that it is not a link.
And if the document is public, surely it is obvious it is public. Surely the standardisation that matters technically is at the base markup level, not what someone wants to impose. It seems to me both creativity and competitiveness are being stifled.
Helen Gavaghan
PermalinkPermalink 2009-05-17 @ 14:43

Comment from: Shannon Stout [Visitor]

I'm really sorry to interrupt the conversations. All this is way above my level. I ended up here just trying to resolve an issue I started having a couple of days ago while searching, and I am basically taking "shots in the dark" on the source info I'm searching to try to find out why, and what files, are causing me to be redirected when I try to select links. The following is a cut of the code...

can anybody tell from this what I need to remove from my registry files or what settings need to be changed in my browser or give me any sort of clue how I can fix the issue or is this all pretty normal? thanks in advance for taking time out from your other issue to enlighten my ignorance.
PermalinkPermalink 2009-05-20 @ 00:00

Comment from: Gummistiefel [Visitor] · http://www.gubeda.de

Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
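The escalation schedule is easy to pin down in a few lines of Python. This is just a sketch of the arithmetic: the one-second starting delay and the six-minute cap are assumptions chosen to match the "5 or 6 minutes" above, and `tarpit_delay` is a made-up name.

```python
def tarpit_delay(request_number, base=1.0, factor=3.0, cap=360.0):
    """Seconds to stall the Nth request seen from one client in the window.

    The first request is served immediately; every later one waits three
    times as long as the one before it, up to `cap` seconds.
    """
    if request_number <= 1:
        return 0.0
    return min(cap, base * factor ** (request_number - 2))
```

With these defaults the delays for requests 1 through 8 are 0, 1, 3, 9, 27, 81, 243 and 360 seconds, so a client that refetches on every document becomes unusable within a handful of requests.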
PermalinkPermalink 2009-05-26 @ 23:33

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Microsoft blog article on how to more efficiently invoke MSXML in your applications.

PermalinkPermalink 2009-05-27 @ 20:36

Comment from: Robin Berjon [Visitor] · http://berjon.com/


It's not a complete or ideal solution, but have you considered in-place editing of the relevant DTDs to make them smaller while maintaining their semantics? It's unpleasant, but Process allows for it.

Take for instance http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. It is 25kb. Instead, replace it with a DTD that only contains:

<!ENTITY % x SYSTEM "http://dtd.w3.org/xhtml1-strict">%x

That's 56 bytes, 455 times smaller than the raw version you have to serve to those stupid libraries that often don't send Accept-Encoding: gzip (even if they support it), and still 120 times smaller than the gzip version.

Now this assumes that an important subset of the requests that are made don't actually do anything useful with the content and so don't make a second request to the actual content. I suspect it's worth a shot, or at least worth testing.

It has the additional advantage that using a different DNS means that you might be able to use load distribution tricks not available to you for the general website.

Anyway, just a thought!
PermalinkPermalink 2009-06-10 @ 12:19

Comment from: Matthias Kraft [Visitor]

Please tell people from Saxon not to reload the xhtml.dtd every time you open an internet document with the xpath-document() function.
PermalinkPermalink 2009-06-10 @ 18:58

Comment from: Dave [Visitor]

In case anyone experiences this issue when using Java facelets(!) look at this bug report:
https://facelets.dev.java.net/issues/show_bug.cgi?id=352
PermalinkPermalink 2009-06-13 @ 13:13

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

We are now seeing such extreme surges in traffic that our automatic and manual methods simply cannot keep up. Increases in serving capacity are readily consumed by this traffic and our site becomes overwhelmed. As such we are taking some more drastic temporary measures which we hope to be able to back down shortly. We are sorry for the impact this is causing the community. We continue experimenting with various methods including some of those suggested by posters here.

If you are impacted, file a bug report with the developers of the library or utility you use, asking them to implement a [caching] catalog solution. You may also put a caching proxy in front of your application as an immediate remedy, populating the cache with a user agent we are not blocking DTD access to.

Java-based applications and libraries presently account for nearly a quarter of our DTD traffic (in the hundreds of millions a day). There is also another, more substantial source of traffic, which the vendor is working to correct in the hopefully near future.
PermalinkPermalink 2009-06-15 @ 13:29
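
The "[caching] catalog solution" requested above could look roughly like the following. This is only a minimal sketch, not code from any of the libraries mentioned: the class name CachingDtdResolver and the loader function are hypothetical, and the cache lives in memory for the lifetime of the process. It relies only on the standard org.xml.sax.EntityResolver interface, so each DTD is loaded at most once however many documents are parsed.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that loads each external entity once and then reuses it. */
final class CachingDtdResolver implements EntityResolver {
    // system identifier (the DTD URI) -> DTD text
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumption: the caller supplies how a DTD is obtained (local file, bundled resource, ...).
    private final Function<String, String> loader;

    CachingDtdResolver(Function<String, String> loader) {
        this.loader = loader;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null; // nothing to key on; let the parser use its default behaviour
        }
        // The loader runs only on the first request for a given system identifier.
        String text = cache.computeIfAbsent(systemId, loader);
        if (text == null) {
            return null; // loader had no copy; the parser falls back to its default
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

A parser would be pointed at such a resolver with, for example, DocumentBuilder.setEntityResolver(...) or XMLReader.setEntityResolver(...). A loader that reads DTD copies shipped with the application avoids the network entirely.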

Comment from: Dominique Hazael-Massieux [Member] · http://www.w3.org/People/Dom/

To ensure that Saxon doesn't hit the W3C Web site when transforming XHTML content, see:
http://saxon.wiki.sourceforge.net/XML+Catalogs
Permalink 2009-06-25 @ 13:39

Comment from: Dan Craft [Visitor]

I'm trying to write a well-behaved system using the Xerces DOM parser. There seem to be two things I need to do: set the User-Agent to indicate I'm not the raw Java libs, and manage a local cache.

I can set the System user.agent, which URLConnection then uses in the request. It appends the Java "Java/vers" string to the one I provide, giving e.g. "DSS/1.0 Java/1.6.0_13". I believe this is the correct format and intention for UserAgent, indicating the primary system and version followed by any subsystem.

You're still denying this request. Are you searching for the Java identifier *anywhere* in the string? That precludes any Java-based system (at least, ones not controlling the headers all the way down, i.e. using most any libraries) from behaving properly and working.

Caveat: My understanding is limited.
Permalink 2009-06-25 @ 18:44
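
For readers trying the same thing, here is a small sketch of identifying an application on its own requests. The class name IdentifiedFetcher and the contact URL are hypothetical; "DSS/1.0" is the example product token from the comment above. Two assumptions are worth stating: the documented JDK networking property for the default agent string is http.agent (to which the runtime appends its "Java/version" token, as described above), and a User-Agent header set explicitly on the connection is, as far as I understand, sent as given.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper that opens connections carrying an identifying User-Agent. */
final class IdentifiedFetcher {
    // Product token from the comment above, plus a hypothetical contact URL.
    static final String USER_AGENT = "DSS/1.0 (+http://example.org/dss-contact)";

    /** Prepares, but does not yet send, a request for the given URI. */
    static URLConnection open(String uri) {
        try {
            URLConnection connection = URI.create(uri).toURL().openConnection();
            // Set the header on this request only; no global state is changed.
            connection.setRequestProperty("User-Agent", USER_AGENT);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Setting the header per connection only covers requests the application makes itself. Fetches issued deep inside a parser library still use the JVM-wide default, which is why pairing this with a local cache or catalog matters.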

Comment from: Steve Donie [Visitor] · http://donie.endofinternet.org

For those who (like me) run into this when using the xslt task in Ant (the Java build tool): take a look at the xslt task manual and its section on xmlcatalog. That allowed me to keep the DTD files locally and use them from there. I was transforming XHTML, so that also required downloading several entity files.
Permalink 2009-06-27 @ 03:30

Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted

Dan,

Changing the user-agent is commendable, especially if you post it somewhere it can be indexed so that people can contact you if there is an issue with it.

Instead of writing something to maintain your own cache, look to Xerces' XML Catalog capabilities, which I wish were the default instead of an afterthought.

For the time being I am also relaxing the filtering based on your suggestion. There is one particular Java UA that prepends a string that is causing 80 million or so hits/day at present. We contacted them after researching the user-agent used.
Permalink 2009-07-01 @ 22:18
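
The XML Catalog approach recommended in several comments above can be sketched as follows. This is an illustration under assumptions, not the Xerces resolver itself: it uses the javax.xml.catalog API that ships with the JDK from Java 9 onward, and the class name LocalCatalogSketch, the temporary directory, and the file names are hypothetical. The catalog maps the XHTML 1.0 Strict public identifier to a local file, so the parser is handed a file: location instead of the w3.org URI.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;

/** Hypothetical demo: resolve the XHTML 1.0 Strict DTD through a local OASIS catalog. */
final class LocalCatalogSketch {
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN";
    static final String SYSTEM_ID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";

    // One-entry catalog; the relative uri is resolved against the catalog file's location.
    static final String CATALOG_XML =
        "<?xml version=\"1.0\"?>\n"
        + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
        + "  <public publicId=\"" + PUBLIC_ID + "\" uri=\"xhtml1-strict.dtd\"/>\n"
        + "</catalog>\n";

    /** Writes the catalog into dir and returns a resolver backed by it. */
    static CatalogResolver resolverIn(Path dir) {
        try {
            Path catalog = dir.resolve("catalog.xml");
            Files.write(catalog, CATALOG_XML.getBytes(StandardCharsets.UTF_8));
            return CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the location the catalog substitutes for the W3C-hosted DTD. */
    static String resolvedLocation() {
        try {
            Path dir = Files.createTempDirectory("dtd-catalog");
            // In real use a copy of xhtml1-strict.dtd must exist next to catalog.xml.
            return resolverIn(dir).resolveEntity(PUBLIC_ID, SYSTEM_ID).getSystemId();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because CatalogResolver also implements org.xml.sax.EntityResolver, the same object can be passed to DocumentBuilder.setEntityResolver(...). XHTML documents additionally need the entity files the DTD references, as noted in the Ant comment above, so a real catalog would list those too.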

W3C Systems Team