Role
W3C maintains an internationally distributed network of servers and services that support Public, Member, and Team audiences in pursuit of the Consortium's technical and social objectives. W3C uses these systems to manage its Activities and Working Groups according to W3C Process.
Design
W3C's systems infrastructure is based almost completely on open source software running on Debian GNU/Linux servers. Many of our tools are built on the popular LAMP platform (Linux, Apache, MySQL, and the Perl, PHP, and Python scripting/programming languages).
Status
We try our best to document known outages and disruptions to our services on the Web; should you encounter a problem with one of our services not documented on that page, please let us know at <web-human@w3.org>.
W3C's Excessive DTD Traffic
Posted on 8 February 2008, by Ted Guild in Homegrown tools
If you view the source code of a typical web page, you are likely to see something like this near the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
and/or
<html xmlns="http://www.w3.org/1999/xhtml" ...>
These refer to HTML DTDs and namespace documents hosted on W3C's site.
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.
The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.
Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.
A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)
We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.
We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:
- Pay attention to HTTP response codes
This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.
- Honor HTTP caching/expiry information
Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were requesting DTDs from our site more than three hundred thousand times per day each, per IP address.)
Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.
- If you implement HTTP in a software library, allow for caching
Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.
Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.
- Take responsibility for your outgoing network traffic
If you install software that interacts with other sites over the network, you should be aware how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.
- Don't fetch stuff unless you actually need it
Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!
- Identify your user agents
When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic user-agent headers such as Java/1.6.0 or Python-urllib/2.1 which provide no information on the actual software responsible for making the requests.
Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.
It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see for example How to change the User-Agent of Python's urllib.
We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:
Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?
You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.
At the W3C Systems Team's request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.
Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?
What do other medium/large sites do to detect and prevent abuse?
We are not alone in receiving excessive schema and namespace requests; take for example the stir when the DTD for RSS 0.91 disappeared.
For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.
Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?
Should we just ignore the issue and serve all these requests?
What if we start receiving 10 billion DTD requests/day instead of 100 million?
Authors: Gerald Oskoboiny and Ted Guild
Comments:
Are you in the business of hosting a XML schema validation service?
If no, then you have no responsibility to attempt to do a good job of it. Focus on what's important, i.e. something other than fixing buggy third-party products unrelated to your core business, and unrelated to serving out millions of requests per day simply because these same third-party products failed to cache or otherwise copy your schema.
Focus on what's important.
Another possible idea is that of decentralizing where the DTDs are actually stored, coupled with the above idea. A doctype declaration could then just be the name of the actual type of document. The browser (and supporting libraries) could keep track of "DTD servers", which could possibly be set up by anyone. This solves both problems, both of having a URI for the doctype and for eliminating the strain of rogue applications who suck up W3's bandwidth.
Failures and error codes are routinely ignored in code, but things which basically work will happen, and if things are suddenly incredibly slow, people tend to notice.
Drip out the response at around 10 characters every other second, fast enough to keep the TCP connection open and the application running, but slow enough to totally wreck any hope of response time at the other end.
One level nastier, and in general one level too nasty, is to return buggy contents for repeat offenders, hoping to make their applications fail with interesting diagnostics.
Poul-Henning
(Who has far more experience with this problem in NTP context than he ever wanted)
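The dripping behaviour described in this comment can be sketched as a small generator. This is an illustration only: the name drip and its parameters are hypothetical, and a server holding tens of thousands of such connections would need non-blocking I/O rather than a blocking sleep per connection.

```python
import time


def drip(body, chunk_size=10, delay=2.0, sleep=time.sleep):
    """Yield body in small chunks, pausing between them to slow the client down."""
    for start in range(0, len(body), chunk_size):
        if start:
            sleep(delay)  # pause before every chunk except the first
        yield body[start:start + chunk_size]
```

At 10 bytes every 2 seconds, a 25 KB DTD would take well over an hour to deliver, which is the kind of slowdown the comment is hoping somebody will notice.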
In the short term, there may be little effect as most documents have www.w3.org in them, but over time people would start using the new hostname instead, and of course newer DTDs would not be available anywhere else.
Furthermore they mention Java clients. There must be tons of sites using Java clients that make these requests, and out of those tons of sites every visitor is a request, probably times the refresh, which is probably... a lot.
We block a few of the generic ones like libwww-perl and some others, and yet they keep coming back for more.
We had another spider looking for RSS feeds that didn't exist going round & round in circles and eventually putting enough load on our server for it to send me an alert at 4am so I had to get up and block its IP range (that's another annoying trend - spiders over massive and disparate IP address ranges that won't go away.. *glares at Slurp*).
These aren't exactly the same thing, but they are kind of the same ilk - badly behaved robots doing what they shouldn't.
There's one more thing that drives me nuts that's kinda part of this family of annoyances - robots that pick out URLs that don't exist and aren't linked from anywhere. The number of 404 requests we get from robots is insane. What I really wish is that robot developers would use the referer header so we can figure out where they got the URL from.
A little off topic I grant you but it's kinda the same thing at the same time, it's time robots were written properly so they don't hit servers in a way that seems like an attack - then they might not be treated as such all the time?
Also what if DTD requests were handled more like DNS requests? Where you have a handful of root servers on the back bones and each ISP has one or more DTD servers with cached copies of the DTD schemas. The various DTD servers would then only update their caches when the TTL settings of the DTD schema expire.
For starters, I'd try working with the people who make the libraries that get used by the offending apps to try and convince them to change their APIs. If the python urllib API changed so that you had to use a custom user agent to initialize the library, it'd be easier to track down the rogue apps. (Or at least print a mode-deprecation warning to stderr if you don't give a custom User string)
Once you find the bad apples, try to get in touch with the developers. Ask them what materials they learned from. Was there a specific online tutorial that taught them to do things badly? If so, try and submit some revised text, so that a second printing will lead fewer astray, or a tutorial webpage can update its offerings.
Maybe try to dedicate a small group who spend at least one hour every week on giving the problem publicity. Posting a blog about the problem, submitting an article to slashdot, writing a tutorial on the right way to do things, actively searching out tutorials that somebody might use if they were learning how to write an app that could have these problems, and asking the tutorial to include some discussion of the issue, etc.
And, others have suggested making things painfully slow when abusive behavior is encountered. I agree with that. Break some apps, force the issue. I'd hate you if you broke one of my apps, but it still is probably the correct thing to do.
Most developers don't have the time to play with these things.
or if you had an SVN server that I could check for updates every so often and if there was a change, then it'd be downloaded. This would be much simpler in my opinion.
Rate limiting the requests from individual IPs is more appealing -- it would just require some decent packet filtering software in front of the servers, still not free but not a significant cost. This will reduce the bandwidth you're paying for, but it probably won't reduce the connection attempts significantly.
I think the mistake you're making here is that you only returned 503s for "hours or days". Broken software is often broken due to neglect. Days is nothing. Think months, years maybe.
OK, that's contra to the mission of the w3c, I understand.
Here's an alternative, if rate limiting requests turns out, as I suspect, to have zero effect after months:
Announce that the URIs are changing. dtd.w3.org is a good idea [1]. Wait a few months. Start returning 301s for the old request paths (far less bandwidth, and just as fast or faster than returning the DTD). Broken software that doesn't cache results probably won't follow the redirect (try it for a few hours to see). Humans that don't follow the w3c news, but do care, will check in a regular browser and see what's going on and update their software. If broken software follows the redirect, send them a special error page with instructions on how to clear their src address from the blacklist.
Even if all of the above fails, you'll still have gained by moving to a new host part of the URI -- as mentioned before, you can distribute the load to other hosts in different places, and you can put them behind slow connections without affecting www.w3.org.
[1] dtd.w3.org (or any three-character nodename to replace "www") is important because the source to some of the software could be lost or otherwise unavailable. Replacing "www" with "dtd" in a binary editor is simple, but changing it to a string of a different length might not be. Obviously don't change the request path portions of the URIs either. :)
PS: Preview appears to be broken -- any changes I made, and then re-previewed, were lost in both the preview display and the textarea. Hopefully the edits will make it through when I 'Send' for real. If they do, this comment will be visible.
For example, W3 could offer DTDs for download as zip files (which could also be distributed through other sites / mirrors).
Another possibility is to present DTDs within regular web pages so people could copy and paste them.
I think a good solution is to use DNS load-balancing with a separate domain (e.g. schema.w3c.org or dtd.w3c.org) and find partners who can operate reliable mirrors (I think you can easily find such partners). To solve the current problem you should permanently redirect (HTTP 301) requests to the separate domain. You can slow down serving requests for the biggest crawlers.
Besides these solutions, maybe you should contact the major XML library vendors and ask them to disable validation or enable DTD caching by default, and to write best practices about validation in their documentation.
I can't understand why somebody would use validation this way; it's the slowest I can imagine. :-)
Best regards,
ivan
That may alert them to the fact that something is wrong.
And for those addresses that hit the 10 second delay and don't seem to notice or slow their requests after 3 months, cut 'em off after the first 100 requests in an hour.
Personally, I am also curious about all of the new developer tools for Firefox, IE, etc. that perform validation on every page. I hope these are using caching mechanisms, but given the ease with which they can be implemented, they could quickly set up a distributed DoS.
It's a hard sell for a library developer to include a disk based cache mechanism as part of the parsing library when something like that really should be the responsibility of the program making use of the library. If every XML parsing library were to include file io as part of the package, that would be overkill. Certainly publishing tutorials about caching for performance reasons including sample code would be a good thing. Also, memory based caching should really be included by default for those libraries that can.
I think the best way to resolve the problem for w3c would be to create some tutorials for using some of the more common parsing libraries that exhibit undesired behavior by default and as part of those tutorials show performance benchmarks. As those kinds of things start hitting the top pages for google searches, developers will take notice and start to build better solutions.
I think short term fixes like 503 or even 404 errors will end up doing very little to resolve your issues long term, and probably will have very little effect short term as well.
I also think that setting up a subdomain for dtds is a great idea. (dtd.w3.org). It at least opens up some potential solutions down the road. (geo-responsive dns, etc).
One way that was suggested is to limit the throughput to the resource, and that may be fine, but that should be done dynamically to not introduce problems for "normal" applications. To make things worse - in some cases a lot of requests may appear from a single IP address, but that address may be a NAT firewall for a large company. Of course - such companies should have a web cache.
A limitation in throughput may not even be easy for the application developer to track down since the delay may be masked in a library and the developer may end up hunting problems in all other places than the DTD link.
The use of a separate DTD serving address may actually be a good idea, since it will allow for a distributed load. The downside is that it will take some time before it becomes effective and that it will require some resources. The good side is that the amount of data served is very small and also very predictable, so it's not very problematic to set up such a server.
Your list of questions omits two important ones:
- Are DTDs a bad idea and should we get rid of them?
- Is using URIs as identifiers that aren't meant to be dereferenced a bad idea and should we stop using them that way?
Sorry but I think the W3C is at fault here. If a namespace should not be retrievable then DON'T use an http URL to identify it.
If you want to foster or require the use of caching define some kind of optional cache identifier to define a namespace.
In general, I think the identity of a resource being retrievable is a good thing. It promotes the idea that the "definition" of an identified thing can be discovered by retrieving it. That makes sense - the only disconnect here is that we are talking about a "well-known" thing like HTML where the logic breaks down. However, for not very-well-known tag sets, this makes a lot of sense.
So, I recommend you make an alternative syntax to specify that a local cached copy exists (or must exist). Then you can switch the default to that "cached copy URI".
If such a caching URI exists, my apologies for not researching it before posting.
Best wishes,
- Mike
Track the IP to an ISP and ask them to install a transparent proxy for your site, or to contact their user and tell them to configure one.
Convince Sun that the next update of Java 6 (and Apache commons) would install a local cache. Same for Python urllib2 and Perl's libwww.
I like the idea of slowing down offensive connections, but since that may be hard on the server level, you can just return a wrong DTD. Make it have a valid syntax, so the DTD parser would not fail, but contain no real elements. If that fails, use a completely invalid DTD.
My first experience with excessive network usage probably arose with various Java libraries, but it's true that Python's standard library has similar mechanisms, and if you look at tools like xsltproc there are actually command switches called --nonet and --novalid, implying that you'll probably fetch DTDs with that software unless you start using these switches.
Who is responsible? Well, I don't think you can put too much blame on the application authors. If using XML technologies has to involve a thorough perusal of the manual in order to know which switches, set in the "on" position by default, have to be turned off in order for the application to behave nicely, then the people writing the manual have to review their own decisions.
Some clearer language in various specifications would help, rather than having to read around the committee hedging their bets on most of the issues the whole time.
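As an illustration of the kind of switch this comment is talking about, here is a small sketch with Python's standard xml.sax that parses an XHTML document without fetching the DTD named in its DOCTYPE. The helper names ElementCollector and parse_offline are hypothetical. Recent Python releases already leave external general entities off by default; the setFeature call just makes that choice explicit.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

XHTML = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hello</p></body></html>"""


class ElementCollector(ContentHandler):
    """Record element names so we can see that parsing succeeded."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_offline(data):
    """Parse data without loading external entities such as the remote DTD."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)  # never go to the network
    handler = ElementCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.names
```

The trade-off is that entities defined only in the DTD cannot be expanded this way, so documents that rely on them still need a local copy of the DTD.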
Start with individual IP addresses. If you're banning a significant number of IP addresses in a class (say, 50% of them), just ban that whole network. Keeping track of all the addresses/classes efficiently would require a bit of clever data structure usage, but it could be done. (I'd start with a trie with a child-count in each node.)
Anyway, just ban them.
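The trie with a child count per node suggested here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class names BanNode and BanTrie, the use of one trie level per IPv4 octet (so networks are only ever /8, /16 or /24), and the rule that a network's count includes only addresses banned individually.

```python
import ipaddress


class BanNode:
    def __init__(self):
        self.children = {}    # next octet -> BanNode
        self.banned = 0       # individually banned addresses below this node
        self.blocked = False  # True once this whole prefix or address is banned


class BanTrie:
    def __init__(self, threshold=0.5):
        self.root = BanNode()
        self.threshold = threshold  # fraction of a network that triggers a network ban

    def is_banned(self, address):
        node = self.root
        for octet in ipaddress.IPv4Address(address).packed:
            node = node.children.get(octet)
            if node is None:
                return False
            if node.blocked:
                return True
        return False

    def ban(self, address):
        if self.is_banned(address):
            return  # already covered by an address ban or a network ban
        node = self.root
        path = []
        for octet in ipaddress.IPv4Address(address).packed:
            node = node.children.setdefault(octet, BanNode())
            path.append(node)
        path[-1].blocked = True  # depth 4 is the address itself
        for depth, ancestor in enumerate(path[:-1], start=1):
            ancestor.banned += 1
            capacity = 256 ** (4 - depth)  # addresses in a /8, /16 or /24
            if ancestor.banned >= self.threshold * capacity:
                ancestor.blocked = True
                ancestor.children.clear()  # the subtree is no longer needed
```

Lookups cost at most four dictionary accesses per request, and banning a whole network frees the memory used by its individual entries.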
The tarpitting idea sounds worth trying; does anyone have specific software to recommend that is able to keep tens of thousands of concurrent connections open on a typical cheap Linux box?
I actually wrote some software that reads XHTML documents into XML DOMs. As soon as the XML parser encounters an entity reference the URL will be loaded. So I created a local resolving mechanism with an entity resolver to read the DTDs from local, however:
- I had to go to all the individual specifications and download the individual specs there, and create my own full repository (I don't have the source-code here, but I am quite sure I ended up with over 50 files for 5 or specs).
- Create my own mapping file that goes from public (handling broken in .Net XML parsers) and system ids to my local files.
- And then of course implement entity resolving to actually pick up my local files.
Every time a developer implements an application that loads html documents using a standard XML parser (a quite common thing I would say), they need to perform these steps to alleviate stress on the W3C servers.
What I actually naively expected this article (found from slashdot) to contain when I opened it was a link to an archive with the files for all your stable specifications in one, with an id->path mapping, and some sample resolver code for common parser libraries in various languages. Does it exist?
(and caching it after one request is not usable for many situations, since my reason for caching was actually not to lessen W3C's Internet bill but allow the application to run without Internet access)
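The mapping and resolver steps listed in this comment might be sketched as follows with Python's xml.sax interfaces. This is not an official W3C mapping: the dtd-cache directory, the LOCAL_DTDS table and the LocalResolver class are hypothetical, and the table covers only XHTML 1.0 Strict and the three entity sets it pulls in.

```python
import os
from xml.sax.handler import EntityResolver

# Hypothetical table: public or system identifier -> file name in a local directory.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xhtml1-strict.dtd",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd": "xhtml1-strict.dtd",
    "-//W3C//ENTITIES Latin 1 for XHTML//EN": "xhtml-lat1.ent",
    "-//W3C//ENTITIES Symbols for XHTML//EN": "xhtml-symbol.ent",
    "-//W3C//ENTITIES Special for XHTML//EN": "xhtml-special.ent",
}


class LocalResolver(EntityResolver):
    """Resolve known identifiers to local files and refuse everything else."""

    def __init__(self, directory="dtd-cache", mapping=LOCAL_DTDS):
        self.directory = directory
        self.mapping = mapping

    def resolveEntity(self, publicId, systemId):
        filename = self.mapping.get(publicId) or self.mapping.get(systemId)
        if filename is None:
            # Failing loudly keeps the application from silently using the network.
            raise LookupError(f"no local copy registered for {systemId!r}")
        return os.path.join(self.directory, filename)
```

A SAX parser can be pointed at such a resolver with parser.setEntityResolver(LocalResolver()). Because unknown identifiers raise an error instead of falling back to HTTP, the application also keeps working without Internet access, which was the commenter's original goal.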
Just use lighttpd or nginx (obviously you should forget about Apache !)
Now here is my suggestion to get rid of the spamming.
You need two servers, a main server and a backup server. Both are machines suitable for serving a few thousands static requests per second, ie. the cheapest Core 2 boxes you can get. You will need to adjust the allowed number of sockets on both, of course, to allow as many concurrent connections as you can. Don't forget to enable zlib compression.
Now you implement some redirections :
- When www.w3.org sees a request for a DTD, it redirects to dtd.w3.org which runs on the "main" box.
- The "main" box has a good connection (like 100 Mbits)
- When the "main" box receives a HTTP request, it looks up the client IP address. If this address has submitted few requests, it serves the requests. However, if this IP has submitted say, more than 5 requests in a period of a few hours, it redirects it to the "backup" server, which is dtd2.w3.org, on a different IP.
To implement this you will need to code a simple C module for lighttpd.
- The "backup" server is connected to the internet via a completely separate, rather slow (10 Mbits) connection. It just serves static files.
So, the "main" server will always be fast and responsive, and the "backup" server will always have its connection horribly saturated.
Therefore, any client will get fast response on the first request from the "main" server. Well behaving clients will cache it, and it ends there. Badly behaving clients will not cache it and will request again, they will get redirected to the "backup" server and feel the pain.
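The decision made by the "main" box can be sketched independently of any web server. The comment proposes a C module for lighttpd; the Python below only illustrates the counting logic, and the class name HostChooser, the three-hour window, and the unbounded in-memory table are assumptions.

```python
import time
from collections import defaultdict, deque


class HostChooser:
    """Decide which host should serve a client, based on its recent request count."""

    def __init__(self, limit=5, window=3 * 3600, clock=time.time):
        self.limit = limit    # requests allowed per window on the main server
        self.window = window  # window length in seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def choose(self, client_ip):
        now = self.clock()
        recent = self.hits[client_ip]
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # forget requests older than the window
        recent.append(now)
        return "dtd.w3.org" if len(recent) <= self.limit else "dtd2.w3.org"
```

A production version would have to bound the memory used per client, for example by storing a counter and a window start time instead of every timestamp.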
Thank you all for the comments.
Scaling is not the long-term solution as it does not address the cause; however, it is something we will have to continue to do, and we appreciate the suggestions made in this area.
By making this post we are trying to increase awareness so ideally this gets resolved as far upstream at the library level as possible since that will have the broadest effect. Community involvement with their respective development platforms of choice will help as we have had mixed success in identifying and contacting software and library maintainers. Some have been very responsive and are implementing caching catalogs or making static catalogs a default instead of an afterthought left to those installing and utilizing the library. Some developers have noticed our blocking scheme and have contacted us letting us know they have taken corrective steps on their own.
For those who wondered why these schemata and namespace resources are made available via HTTP to begin with: we intend for them to be dereferenced as needed, but expected this to be more reasonable given the caching directives in the HTTP spec. The performance cost of going over the network repeatedly for the same thing should be reason enough for developers to cache. Since many of these systems ignore response codes, a tarpit solution might very well succeed in gaining their attention, and it has some entertainment value. If their application performance suffers substantially enough, developers may take notice.
As mentioned, many of these systems only understand a small handful of the various HTTP responses (200 OK, 302 Found, 401 Unauthorized, 403 Forbidden, 404 Not Found). We are more than slightly curious how browser plugins would handle HTTP 401 "Authorization Required" responses to their requests; we have, for instance, suspected that a particular large-scale ISP's webmail plugin is a traffic culprit. Inside the realm part of the WWW-Authenticate header it would be quite tempting to give the technical support phone number of the ISP that has not listened to our repeated phone calls and emails on the subject. That would likely get their attention and potentially encourage them to correct the plugin.
Identifying the sources of W3C's abusive DTD traffic can be quite time consuming and difficult depending on the data in the HTTP request. One rather odd case we see often has the HTTP Referer set to "file:///C:/WINDOWS/fonts/set.ttf", and we have so far not found the related software. For identifying some sources we have found resources provided by various organizations (e.g. McAfee SiteAdvisor) that catalog browser plugins, software network interactions and viruses quite helpful. We would very much like to collaborate with such organizations or similar community efforts to help us identify more software responsible for this traffic. We have made a couple of efforts to establish contacts within a couple of such organizations but unfortunately emails and phone calls have not gone very far. Specific suggestions or contacts for us to follow up with would be appreciated.
It's just one more proof that the internet is gaining more widespread use.. the percentage of idiots online is now probably equal to the percentage of idiots in the world...
Martin Nicholls,
I cannot agree more with your sentiment towards poorly behaving bots/crawlers, they are getting out of hand. There has been talk around here at W3C and elsewhere on starting an activity for directives governing bot interactions with a website. There have been some scattered conventions which should be standardized and improved upon.
For instance polite bots could
- identify themselves properly including URI with various information
- how to submit complaints
- means to authenticate crawler and IP range it is coming from
- respect directives regarding frequency, concurrency, etc. given in response headers of site being crawled
- advertise peak hours with higher thresholds if the crawler would like to schedule its return
- server being crawled could make available data of resources and last modified dates so more intelligent, minimal crawls can be made saving both sides resources
Those bots that do not abide by these conventions and overstep the boundaries spelled out can be spotted and blocked through automated means.
Those that do could do their indexing in as efficient a manner as is comfortable for the website being crawled.
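One of the scattered conventions that already exists is robots.txt, including the non-standard but widely recognised Crawl-delay line. The sketch below shows how a polite bot could consult it with Python's urllib.robotparser; the robots.txt content, the helper name crawl_policy and the bot name are made up for the example, and the response-header directives proposed above would need a convention that has not been defined yet.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content a site might publish (hypothetical).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_policy(robots_lines, user_agent, url):
    """Return (allowed, delay_in_seconds) for user_agent fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)
```

A crawler would call this before each request, skip URLs that are not allowed, and wait at least the returned delay between requests to the same site.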
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"dtd://TR/xhtml1/DTD/xhtml1-strict.dtd">
Then it's known where the DTD is located in case it's actually needed, but it would only be used by applications and libraries that would probably need it, since it would require special handling instead of blind handling. Slowing traffic for the http link wouldn't be great, because on a few occasions I've downloaded the DTDs for learning the document types (you do want valid HTML instead of what most tutorials provide, right?).
Still unique, still locatable, and hard to misinterpret.
However, I don't know if the tarpit solution is a good idea. What's John Q. Public who's running some misbehaving software going to think? "Oh, this must be what they mean when they say XML is slow. This problem never happened before I put the DOCTYPE on all my files. That guy who pushed us to adopt XHTML is a moron." Fix the problem going forward by changing the scheme for identifying DTDs, but think carefully before spreading the pain just to save the W3C some inconvenience.
In particular, reading the DTD is necessary even in non-validating mode in case the document contains entity references.
Of course, to read the DTD, one might be able to use an alternate URL based on the public identifier. Unfortunately, catalogs are not in wide-spread use, and W3C does nothing to promote them.
Where did I say System Ids are not downloadable resources? This post is about the frequency of the downloads, disregarding HTTP caching directives.
I think Martin was referring to the following excerpt, which does sound like "these URIs are identifiers, not for download".
Note that these are not hyperlinks, these URIs are used for identification. This is a machine-readable way to say -this is HTML-. In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over
As many have pointed out, the data downloaded is needed, so it'd be great if W3C could provide basic catalogs/suggestions to be used as the sane default.
You have to keep in mind that caching is a great solution for many given scenarios, but low level tools/libraries cannot be expected to assume caching/catalogs are THE right thing to do when the spec includes checking against what the URI points to.
You saying that developers were supposed to implement caching due to their own performance concerns makes me wonder: what if most already do that? What will change when they do?
What if most hits you get are from software that scrapes "http://[...]" from data and follows that? What if a per-process/thread library cache is already there but the system forks for each URL? How about distributing batches of URLs to visit?
So IMO the W3C has a chance of simply postponing the issue if no steps are taken towards providing local, reliable catalogs to the community and changing the recommended http:// URIs to something else (like the dtd:// above).
A misunderstanding then; my apologies for our being ambiguous. We went with that wording to avoid going into the differences between DTDs and namespaces. Parsers have no need to dereference namespaces, as there may not be anything of use to them at that URI, as is the case with xmlns="http://www.w3.org/1999/xhtml". DTDs are meant to be downloaded for machine processing, but reasonably, not incessantly by an application running on a machine. We are seeing XML processors grab these even when they are not using them.
Making catalogs available has come up before and we certainly will consider it. Catalogs still would need to make their way into the various tools and libraries, many of which do not come with any. We are just one of the many organizations and individuals making these sorts of resources (namespaces, DTD) available so tool and library developers will still have to collect these.
It is difficult for tool and library developers to know what markup will run through their utilities for validation or transformation, not to mention that new schemata are always being created. Because of this, the best solution is a caching XML Catalog resolver, which I understand is part of Glassfish. The library adds DTDs to its cache as it needs them; caching is part of the HTTP protocol.
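To make the caching resolver idea concrete, here is a minimal sketch in Python. It is not how Glassfish or any particular library does it; the class name CachingResolver and the injected fetch function are hypothetical, and the sketch ignores HTTP expiry headers entirely (a real resolver should honor them). It only shows the core behavior: each DTD is requested upstream once and served from disk afterwards.

```python
import hashlib
from pathlib import Path
from typing import Callable


class CachingResolver:
    """Fetch each DTD at most once and serve later requests from disk."""

    def __init__(self, cache_dir, fetch: Callable[[str], bytes]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical hook: any function mapping a URL to its bytes.
        self.fetch = fetch

    def _path_for(self, url: str) -> Path:
        # Hash the URL so any system identifier maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def resolve(self, url: str) -> bytes:
        path = self._path_for(url)
        if path.exists():
            return path.read_bytes()  # cache hit: no network traffic
        data = self.fetch(url)        # cache miss: one request upstream
        path.write_bytes(data)
        return data
```

Because the cache lives on disk rather than in the process, it survives restarts and is shared by every process that points at the same directory.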
Thanks a lot for the discussion.
I like the caching idea, but believe distributing the load extends it. The general issue with caching is related to where to cache. Local (or in-memory, per-process, etc.) caches will be less efficient than shared ones. Shared (system, library) caches will have their own load of issues. So you might end up with much nicer libraries and still be hammered by requests.
Notice that scaling up and mirroring amounts to an extremely-shared cache. I believe having the machinery for mirroring in place (checksums, compressed snapshots, change notifications) could lead to lower level mirroring (dtd-daemon, anyone?).
The benefits would be:
Network admins could save resources that lazy programmers forgot to (and legacy code would automagically stop being so nasty).
Users could get performance boosts by installing software that tricks dumb apps to fetch DTDs from a local cache, regardless of upstream actions.
Library developers (and even dumb programmers) would have a Darn Easy® recommended route to caching, as local-ish mirroring and checksums would be discussed all over the place (and faster, cheaper, tastier).
Also, maybe you should talk to Coral Content Distribution Network regarding forwarding traffic. It might be interesting for them to have such a huge source of input to their research.
On a meta note, I think it could be very useful to have a central location (wiki?) to gather resources and discussions on this issue.
One of the arguments against having caching resolvers in XML libraries has been that this is attainable outside of the library, which it certainly is, with a caching proxy server for instance. It is a very worthwhile solution and is why we send the caching directives in the first place.
We have seen a number of corporate and large ISP HTTP proxies hammering us because of some XML application[s] running behind them. Sometimes the network admins would, if they were responsive at all to us, add caching to their proxy setup or less often track down the parties responsible for the software causing the traffic. More often they would refuse to add caching to their proxy or any other action citing cost or complexity. Bandwidth is cheaper than equipment and admin time I guess.
of many posters here) that no one mentioned the best solution on the
application side: catalogs. They have been part of SGML and XML for
many years, so there is good support for them. Any XML parser should
support catalogs and, then, the DTD would be retrieved on the local
disk and not through the network.
http://www.oasis-open.org/committees/entity/spec-2001-08-06.html
http://www.sagehill.net/docbookxsl/Catalogs.html
(Of course, there are always broken programs and sites where catalogs
will not be installed, so there is still a need for other measures.)
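For readers who have not seen a catalog, here is a rough sketch in Python of what the lookup amounts to. The catalog uses the OASIS XML Catalogs namespace and its public/system entries; the local file path is a made-up example, and the resolution order is simplified (no prefer attribute, no delegation, no rewrite rules). Real parsers ship their own catalog support, so treat this as an illustration only.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# A tiny catalog; the file:// location is a hypothetical local path.
CATALOG = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///usr/share/xml/xhtml/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///usr/share/xml/xhtml/xhtml1-strict.dtd"/>
</catalog>
"""


def load_catalog(text):
    """Return (public, system) dictionaries mapping identifiers to URIs."""
    root = ET.fromstring(text)
    public, system = {}, {}
    for entry in root.iter("{%s}public" % CATALOG_NS):
        public[entry.get("publicId")] = entry.get("uri")
    for entry in root.iter("{%s}system" % CATALOG_NS):
        system[entry.get("systemId")] = entry.get("uri")
    return public, system


def resolve(public_id, system_id, catalog):
    """Map a DOCTYPE's identifiers to a local URI, else keep the system id."""
    public, system = catalog
    if system_id in system:
        return system[system_id]
    if public_id in public:
        return public[public_id]
    return system_id  # no catalog entry: the parser would go to the network
```

With such a catalog installed, a document carrying the XHTML 1.0 Strict DOCTYPE resolves to the local file, and only unknown DTDs fall through to the network.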
not. The HTTP protocol has an If-Modified-Since header for precisely
this purpose and W3C's server honors it. You can set it, for instance,
with curl:
% curl -v --time-cond 20080201 -o html.dtd http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
...
< HTTP/1.0 304 Not Modified
[No download]
% curl -v --time-cond 20000201 -o html.dtd http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
...
< HTTP/1.0 200 OK
[And the file is actually downloaded]
Of course, this requires a program that sends it and has a local
storage to keep the DTD, but recommending this technique may help
(among other techniques like HTTP caching, XML catalogs, terminating
the offenders, etc).
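The same conditional request can be made from a program. The following is a sketch in Python using only the standard library; the function names build_conditional_request and refresh are mine, and using the local file's modification time as the If-Modified-Since date is an assumption about how the local copy is kept.

```python
import email.utils
import os
import urllib.error
import urllib.request


def build_conditional_request(url, local_path):
    """Return a Request carrying If-Modified-Since when a local copy exists."""
    request = urllib.request.Request(url)
    if os.path.exists(local_path):
        mtime = os.path.getmtime(local_path)
        # HTTP dates are always expressed in GMT.
        request.add_header("If-Modified-Since",
                           email.utils.formatdate(mtime, usegmt=True))
    return request


def refresh(url, local_path):
    """Download url only if it changed; return True if the file was rewritten."""
    request = build_conditional_request(url, local_path)
    try:
        with urllib.request.urlopen(request) as response:
            data = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the local copy
            return False
        raise
    with open(local_path, "wb") as out:
        out.write(data)
    return True
```

Note that urllib reports a 304 response as an HTTPError, which is why the "not modified" case is handled in the except branch.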
I am using !DOCTYPE html SYSTEM "loose.dtd"
Does the browser actually refer to the W3C DTD I have placed on the computer, or do I need to put the public identifier in as well, before the URL, as in:
!DOCTYPE html PUBLIC "-W3C//DTD Html 4.0 Transitional//EN" "loose.dtd" ?
Adding to what Daniel already said:
I just happen to be stacking together a new flavor of modular XHTML in the spirit of XHTML+RDFa for the backend of a new website I'm working on.
I'm using libxml on MacOS X via MacPorts. MacPorts has a package with HTML4 dtd's, but not XHTML and it does not supply a catalog with the DTD's. I'll accept that I'll have to add a new entry to the catalog, but I'll still have to get the DTD's in the first place.
I have three options at this point: download each DTD (module) manually through my web browser at /MarkUp/DTD, let wget crawl the DTD directory, or download Debian's w3c-dtd-xhtml package and rip the files from that package. Hardly convenient.
I assume that a lot of developers will even ignore the speed problems whilst getting their new apps to work.
I think it would really help if W3C would package its DTD's in a tar.gz, and perhaps even pro-actively work with package maintainers to distribute these files.
Obviously, this will not be a quick fix to your bandwidth problems, but I think it does address the core of the problem: too many developers are not aware of the inner workings of XML validation (or validating parsing) and assume 'it just works'.
my 2 cents
I agree we can't just depend on w3 to persistently follow up as everyone has the responsibility to help.
While I sympathize with the bandwidth concerns discussed above, the question is what can we do about it? We're not the source of the offending app, Microsoft is, but we bear the consequences. Even if MS were to turn around a caching patch quickly, it probably wouldn't get widely deployed for years and I imagine the W3C admins will not lift the IE ban until then.
All I can think of to do on our end is to locally cache the DTD (and the entity files it references, IE also tries to fetch those) on our server and patch all of the documents to refer to those.
BTW, I don't see any reason this isn't affecting the combination of IE with every Docbook-generated XHTML doc in the world, if they're built using the standard stylesheet distribution.
Are there any other options within our control, that don't require cooperation from W3C or Microsoft?
So I notice in these man pages, the main frame (eg http://www.opengl.org/sdk/docs/man/xhtml/glBindAttribLocation.xml ) is already XHTML markup but served with HTTP header Content-Type: application/xhtml+xml
If you serve this [X]HTML markup as HTML instead of XML, MSIE will not call its XML processor, which in turn tries to dereference the DTD from us. Try serving it with an .html extension and/or Content-Type: text/html and your problem should be resolved.
If we serve our content as HTML instead of XML, as you suggest, then IE will not invoke the MathPlayer plugin to render MathML content and the pages aren't rendered correctly. Getting MathML displayed properly in the man pages is pretty much the whole point of this exercise, so I don't think that will work for us. We've modified the man pages to refer to a local cache of the DTD, and that seems to work well enough.
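For anyone wanting to do the same, patching the documents can be scripted. This is a sketch in Python, not the script actually used for the man pages; the function name point_doctype_at and the local path in the usage example are hypothetical. It rewrites only the system identifier of a PUBLIC DOCTYPE and leaves the public identifier alone. The local copy must also include any entity files the DTD references.

```python
import re

# Matches: <!DOCTYPE name PUBLIC "public id" "system id"
# Group 1 is everything up to the opening quote of the system identifier,
# group 2 is its closing quote.
DOCTYPE_RE = re.compile(
    r'(<!DOCTYPE\s+\S+\s+PUBLIC\s+"[^"]*"\s+")[^"]*(")'
)


def point_doctype_at(document, local_uri):
    """Replace the system identifier of the first PUBLIC DOCTYPE."""
    # A function replacement avoids backslash surprises in local_uri.
    return DOCTYPE_RE.sub(
        lambda match: match.group(1) + local_uri + match.group(2),
        document,
        count=1,
    )


if __name__ == "__main__":
    page = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
            '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
            '<html xmlns="http://www.w3.org/1999/xhtml"></html>')
    print(point_doctype_at(page, "dtd/xhtml1-strict.dtd"))
```

The namespace URI on the html element is untouched, since it is an identifier and is never fetched.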
And if the document is public, surely it is obvious it is public. Surely the standardisation that matters technically is at the base markup level, not what someone wants to impose. It seems to me both creativity and competitiveness are being stifled.
Helen Gavaghan
can anybody tell from this what I need to remove from my registry files or what settings need to be changed in my browser or give me any sort of clue how I can fix the issue or is this all pretty normal? thanks in advance for taking time out from your other issue to enlighten my ignorance.
It's not a complete or ideal solution, but have you considered in-place editing of the relevant DTDs to make them smaller while maintaining their semantics? It's unpleasant, but Process allows for it.
Take for instance http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. It is 25kb. Instead, replace it with a DTD that only contains:
<!ENTITY % x SYSTEM "http://dtd.w3.org/xhtml1-strict">%x
That's 56 bytes, 455 times smaller than the raw version you have to serve to those stupid libraries that often don't send Accept-Encoding: gzip (even if they support it), and still 120 times smaller than the gzip version.
Now this assumes that an important subset of the requests that are made don't actually do anything useful with the content and so don't make a second request to the actual content. I suspect it's worth a shot, or at least worth testing.
It has the additional advantage that using a different DNS means that you might be able to use load distribution tricks not available to you for the general website.
Anyway, just a thought!
https://facelets.dev.java.net/issues/show_bug.cgi?id=352
If you are impacted, file a bug report with the developers of the library or utility you use, asking them to implement a [caching] catalog solution. You may also put a caching proxy in front of your application as an immediate remedy, populating the cache with a user agent we are not blocking DTD access to.
Java-based applications and libraries presently account for nearly a quarter of our DTD traffic (which is in the hundreds of millions of requests a day). There is also another, more substantial source of traffic, which the vendor is working to correct in the hopefully near future.
http://saxon.wiki.sourceforge.net/XML+Catalogs
I can set the System user.agent, which URLConnection then uses in the request. It appends the Java "Java/vers" string to the one I provide, giving e.g. "DSS/1.0 Java/1.6.0_13". I believe this is the correct format and intention for UserAgent, indicating the primary system and version followed by any subsystem.
You're still denying this request. Are you searching for the Java identifier *anywhere* in the string? That precludes any Java-based system (at least, ones not controlling the headers all the way down, i.e. using most any libraries) from behaving properly and working.
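The difference between the two matching rules can be shown in a few lines of Python. I do not know what rule W3C actually applies, so both functions below are hypothetical; the only facts assumed are the ones above, namely that the default Java agent string has the form "Java/vers" and that a custom string is placed in front of it.

```python
def blocked_by_substring(user_agent):
    """Hypothetical filter: reject any agent string mentioning Java at all."""
    return "Java" in user_agent


def blocked_by_default_agent(user_agent):
    """Hypothetical filter: reject only the unmodified default Java agent."""
    return user_agent.startswith("Java/")


if __name__ == "__main__":
    for agent in ("Java/1.6.0_13", "DSS/1.0 Java/1.6.0_13"):
        print(agent,
              blocked_by_substring(agent),
              blocked_by_default_agent(agent))
```

The second rule lets an identified application through, at the cost of also letting through any misbehaving client that prepends its own string.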
Caveat: My understanding is limited.
Changing the user-agent is commendable, especially if you post it somewhere it can be indexed so that people can contact you if there is an issue with it.
Instead of writing something to maintain your own cache, look to Xerces' XML Catalog capabilities, which I wish were the default instead of an afterthought.
For the time being I am also relaxing the filtering based on your suggestion. There is one particular Java UA that prepends a string that is causing 80 million or so hits/day at present. We contacted them after researching the user-agent used.
Comment from: Ted Guild [Member] · http://www.w3.org/People/Ted