W3C's Excessive DTD Traffic
If you view the source code of a typical web page, you are likely to see something like this near the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
and/or
<html xmlns="http://www.w3.org/1999/xhtml" ...>
These refer to HTML DTDs and namespace documents hosted on W3C's site.
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.
The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema.
Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.
A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)
We have identified some of the specific software causing this excessive traffic and have been in contact with the parties responsible to explain how their product or service is essentially creating a Distributed Denial of Service (DDoS) attack against W3C. Some have been very responsive, correcting the problem in a timely manner; unfortunately others have been dragging on for quite some time without resolution, and a number of sources remain unidentified.
We would like to see this issue resolved once and for all, not just for our own needs but also to improve the quality of software deployed on the Web at large. Therefore we have a number of suggestions for those writing and deploying such software:
- Pay attention to HTTP response codes
This is basic good programming practice: check your return codes, otherwise you have no idea when something goes wrong.
- Honor HTTP caching/expiry information
Resources on our site are served in a cache-friendly way: our DTDs and schemata generally have explicit expiry times of 90 days or more, so there's no reason to request these resources several times a day. (In one case we noticed, a number of IP addresses at one company were each requesting DTDs from our site more than three hundred thousand times per day.)
Mark Nottingham's caching tutorial is an excellent resource to learn more about HTTP caching.
- If you implement HTTP in a software library, allow for caching
Any software that makes HTTP requests to other sites should make it straightforward to enable the use of a cache. Applications that use such libraries to contact other sites should clearly document how to enable caching, and preferably ship with caching enabled by default.
Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml.
- Take responsibility for your outgoing network traffic
If you install software that interacts with other sites over the network, you should be aware how it works and what kind of traffic it generates. If it has the potential to make thousands of requests to other sites, make sure it uses an HTTP cache to prevent inflicting abuse on other sites. If the software doesn't make it straightforward to do so, file a bug report with the vendor, seek alternatives, or use an intercepting proxy server with a built-in cache.
- Don't fetch stuff unless you actually need it
Judging from the response to our 503 errors, much of the software requesting DTDs and schemata from our site doesn't even need them in the first place, so requesting them just wastes bandwidth and slows down the application. If you don't need it, don't fetch it!
- Identify your user agents
When deploying software that makes requests to other sites, you should set a custom User-Agent header to identify the software and provide a means to contact its maintainers. Many of the automated requests we receive have generic User-Agent headers such as Java/1.6.0 or Python-urllib/2.1, which provide no information on the actual software responsible for making the requests. Some sites (e.g. Google, Wikipedia) block access to such generic user-agents. We have not done that yet but may consider doing so.
It is generally quite easy to set a custom User-Agent with most HTTP software libraries; see for example How to change the User-Agent of Python's urllib. A rough Java sketch covering several of these suggestions appears just after this list.
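To make the first, second, and last of these suggestions concrete, here is a rough Java sketch; the class name, the User-Agent string, and the two cache helpers are hypothetical placeholders, and real code would need proper error handling:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PoliteDtdFetch {
    public static void main(String[] args) throws Exception {
        URL dtd = new URL("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        HttpURLConnection conn = (HttpURLConnection) dtd.openConnection();
        // Identify the software and a way to contact its maintainers,
        // rather than a generic "Java/1.6.0".
        conn.setRequestProperty("User-Agent", "ExampleValidator/1.0 (+http://example.org/contact)");
        long cachedTimestamp = loadCachedTimestamp(); // hypothetical helper: when we last fetched it
        if (cachedTimestamp > 0) {
            conn.setIfModifiedSince(cachedTimestamp); // ask for the DTD only if it has changed
        }
        int status = conn.getResponseCode();          // actually look at the response code
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: our cached copy is still fresh, nothing to download
        } else if (status == HttpURLConnection.HTTP_OK) {
            InputStream in = conn.getInputStream();
            saveToCache(in, conn.getExpiration());    // hypothetical helper: honor the Expires time
        } else {
            // 503 or anything else: back off instead of hammering the server
            throw new IllegalStateException("Unexpected HTTP status " + status);
        }
    }

    private static long loadCachedTimestamp() { return 0L; }
    private static void saveToCache(InputStream in, long expires) { /* persist locally */ }
}

Libraries that already expose a cache or catalog mechanism make most of this unnecessary; the point is simply to look at the status code, identify yourself, and avoid re-downloading something that has not changed.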
We are interested in feedback from the community on what else we can do to address the issue of this excessive traffic. Specifically:
- Do we need to make our specifications clearer in terms of HTTP caching and best practices for software developers?
You might think something like "don't request the same resource thousands of times a day, especially when it explicitly tells you it should be considered fresh for 90 days" would be obvious, but unfortunately it seems not.
At the W3C Systems Team's request the W3C TAG has agreed to take up the issue of Scalability of URI Access to Resources.
- Do you have any examples of specific applications that do things right/wrong by default, or pointers to documentation on how to enable caching in software packages that might be affecting us?
- What do other medium/large sites do to detect and prevent abuse?
We are not alone in receiving excessive schema and namespace requests; take for example the stir when the DTD for RSS 0.91 disappeared.
For other types of excessive traffic, we have looked at software to help block or rate-limit requests, e.g. mod_cband, mod_security, Fail2ban.
Some of the community efforts in identifying abusive traffic are too aggressive for our needs. What do you use, and how do you use it?
- Should we just ignore the issue and serve all these requests?
What if we start receiving 10 billion DTD requests/day instead of 100 million?
Authors: Gerald Oskoboiny and Ted Guild
I definitely like the idea of defining which DTD to use through something other than a straight HTTP URL.
Another possible idea is that of decentralizing where the DTDs are actually stored, coupled with the above idea. A doctype declaration could then just be the name of the actual type of document. The browser (and supporting libraries) could keep track of "DTD servers", which could possibly be set up by anyone. This solves both problems, both of having a URI for the doctype and for eliminating the strain of rogue applications who suck up W3's bandwidth.
Instead of rejecting the requests with 503, it might be a much better strategy to serve them, but very very slowly.
Failures and error codes are routinely ignored in code, but things which basically work will keep happening, and if things are suddenly incredibly slow, people tend to notice.
Drip out the response at around 10 characters every other second: fast enough to keep the TCP connection open and the application running, but slow enough to totally wreck any hope of response time at the other end.
One level nastier, and in general one level too nasty, is to return buggy contents for repeat offenders, hoping to make their applications fail with interesting diagnostics.
Poul-Henning
(Who has far more experience with this problem in NTP context than he ever wanted)
One current problem is that all the DTDs are under http://www.w3.org/ which makes it difficult to distribute them around the Internet. Perhaps if new DTDs were placed at http://dtd.w3.org/ then DNS could be used to return the nearest server to the requesting agent - a technique used on a number of other sites of course.
In the short term, there may be little effect as most documents have www.w3.org in them, but over time people would start using the new hostname instead, and of course newer DTDs would not be available anywhere else.
You guys are completely right about this - the number of badly behaved robots, spiders and other tools around the internet these days is starting to get silly.
We block a few of the generic ones like libwww-perl and some others, and yet they keep coming back for more.
We had another spider looking for RSS feeds that didn't exist going round & round in circles and eventually putting enough load on our server for it to send me an alert at 4am, so I had to get up and block its IP range (that's another annoying trend - spiders spread over massive and disparate IP address ranges that won't go away.. *glares at Slurp*).
These aren't exactly the same thing, but they are kinda the same ilk - badly behaved robots doing what they shouldn't.
There's one more thing that drives me nuts that's kinda part of this family of annoyances - robots that pick out URLs that don't exist and aren't linked from anywhere. The number of 404 requests we get from robots is insane. What I really wish is that robot developers would use the Referer header so we can figure out where they got the URL from.
A little off topic I grant you, but it's kinda the same thing at the same time: it's time robots were written properly so they don't hit servers in a way that seems like an attack - then they might not be treated as such all the time.
To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us/popular , etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
Please accept my humble observations. While attempting to learn about W3 and the evolving language of the hyperweb, I had certain misconceptions that must be common. I thought you were in the business of providing certain basic reference schemas that were critical to writing well formed web pages in the time of XML and XHTML. My first few compliant web pages were cut and pasted versions of example pages from W3. Only now, after reading about your problem on /., do I even have a clue that it is not so. I see two ways out of this. The first is education. You must make it clear that important schemas need to live somewhere else, keeping in mind that existing ones may have to last decades for the sake of legacy documents. The other possibility is that you accept responsibility for hosting these keystone documents, but push them out onto Akamai servers or some other way of not being the bottleneck of the hyperverse. Although this is a terrible hack, your most common existing DTDs could be cached in the browsers themselves, not unlike the root certificates. It is a compliment that people think you are the center of the Document universe, if only you can survive it. Thank you very much for your time. Doug
Wholeheartedly agree with this. I considered downloading a copy of the schemas I'd wanted and storing them in the "resources" directory of the website I was currently building. This way I could change "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" to "http://mywebsite/resourcespath/xhtml1-strict.dtd" but unfortunately, I received a 503. Go figure.
Downloading the DTD may or may not be a working idea in practice – it is certainly not recommended, since the purpose of the URI, as the text above clearly says, is *not* to give a place to download some file; it is to uniquely identify the DTD. This becomes especially important once you start combining documents using different (or maybe not so different) DTDs, but even for single files, it is the best chance an application has to check whether it can handle the file.
What about getting the ISPs and major internet backbones involved? For example, an ISP could scan the URL traffic and return a locally cached version of the DTD.
Also, what if DTD requests were handled more like DNS requests? You would have a handful of root servers on the backbones, and each ISP would have one or more DTD servers with cached copies of the DTD schemas. The various DTD servers would then only update their caches when the TTL settings of the DTD schema expire.
That would require deep packet inspection (or even transparent proxies), since the ISP usually does not see the URL. I'd rather my ISP didn't do that to my traffic.
I think that with a problem as massive as this, defeating it will require working from several different angles at once.
For starters, I'd try working with the people who make the libraries that get used by the offending apps to try and convince them to change their APIs. If the Python urllib API changed so that you had to use a custom user agent to initialize the library, it'd be easier to track down the rogue apps. (Or at least print a deprecation warning to stderr if you don't give a custom User-Agent string.)
Once you find the bad apples, try to get in touch with the developers. Ask them what materials they learned from. Was there a specific online tutorial that taught them to do things badly? If so, try to submit some revised text, so that a second printing will lead fewer astray, or a tutorial webpage can update its offerings.
Maybe try to dedicate a small group who spend at least one hour every week on giving the problem publicity: posting a blog about the problem, submitting an article to slashdot, writing a tutorial on the right way to do things, actively searching out tutorials that somebody might use if they were learning how to write an app that could have these problems and asking the tutorial authors to include some discussion of the issue, etc.
And, others have suggested making things painfully slow when abusive behavior is encountered. I agree with that. Break some apps, force the issue. I'd hate you if you broke one of my apps, but it still is probably the correct thing to do.
I think you should insist that library providers configure a catalog in the default settings. I remember the last time I had to set one up in Java; it was complex and time consuming. I would find it better if a standard catalog were configured by default and an empty catalog could be used only when specifically required.
Most developers don't have the time to play with these things.
Tarpitting the response sounds like a good idea, but it would require the capacity to sustain hundreds of times more connections than the servers do presently. That's not free, either.
Rate limiting the requests from individual IPs is more appealing -- it would just require some decent packet filtering software in front of the servers, still not free but not a significant cost. This will reduce the bandwidth you're paying for, but it probably won't reduce the connection attempts significantly.
I think the mistake you're making here is that you only returned 503s for "hours or days". Broken software is often broken due to neglect. Days is nothing. Think months, years maybe.
OK, that's contra to the mission of the w3c, I understand.
Here's an alternative, if rate limiting requests turns out, as I suspect, to have zero effect after months:
Announce that the URIs are changing. dtd.w3.org is a good idea [1]. Wait a few months. Start returning 301s for the old request paths (far less bandwidth, and just as fast or faster than returning the DTD). Broken software that doesn't cache results probably won't follow the redirect (try it for a few hours to see). Humans that don't follow the w3c news, but do care, will check in a regular browser and see what's going on and update their software. If broken software follows the redirect, send them a special error page with instructions on how to clear their src address from the blacklist.
Even if all of the above fails, you'll still have gained by moving to a new host part of the URI -- as mentioned before, you can distribute the load to other hosts in different places, and you can put them behind slow connections without affecting www.w3.org.
[1] dtd.w3.org (or any three-character nodename to replace "www") is important because the source to some of the software could be lost or otherwise unavailable. Replacing "www" with "dtd" in a binary editor is simple, but changing it to a string of a different length might not be. Obviously don't change the request path portions of the URIs either. :)
PS: Preview appears to be broken -- any changes I made, and then re-previewed, were lost in both the preview display and the textarea. Hopefully the edits will make it through when I 'Send' for real. If they do, this comment will be visible.
I think the main source of the problem is that the XML spec (section 5.1) implies that a validating parser without any persistent state is supposed to do this, and so they mostly do. For example, the default Java SAX handler, if given an XHTML document, will fetch all of the DTD parts, extremely slowly. Of course, it'll use a User-Agent like Java/1.6.0, because the application author probably doesn't realize that the application will be making any network connections at all.
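As a minimal illustration of how to switch that network fetch off (assuming a Xerces-based parser such as the one bundled with recent JDKs; the input file name is a placeholder):

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class NoDtdFetch {
    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        // Xerces feature: don't download the external DTD when we aren't validating against it.
        spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        SAXParser parser = spf.newSAXParser();
        // "page.xhtml" is a placeholder; with the feature off, the DOCTYPE's
        // system identifier is never dereferenced over the network.
        parser.parse("page.xhtml", new DefaultHandler());
    }
}

Note that a document relying on entities declared in the DTD (such as &nbsp;) will then typically fail to parse, so a local catalog or entity resolver is the more complete fix.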
Yes this is a huge problem. The applications that do these requests slow themselves down and the whole network. There is no need. For Java apps, see "Apache XML Commons resolver" http://xml.apache.org/commons/components/resolver/ which developed from Norm Walsh's work.
I think you have to combine the solutions Poul-Henning Kamp and Ben Strawson wrote about.
I think the good solution is to use DNS load-balancing with a separate domain (e.g. schema.w3c.org or dtd.w3c.org) and find partners who can operate reliable mirrors (I think you can easily find such partners). To solve the current problem you should permanently redirect (HTTP 301) requests to the separate domain. You can slow down serving requests for the biggest crawlers.
Besides these solutions, maybe you have to contact the major XML library vendors to ask them to disable validation or enable caching DTDs by default, and to write best practices about validation in their documentation.
I can't understand why somebody would use validation this way; it's the slowest approach I can imagine. :-)
Best regards,
ivan
I have to agree with a delay. After a user requests more than 100 in an hour, put them on a 2 second delay; after 1,000 a 10 second delay.
That may alert them to the fact that something is wrong.
And for those addresses that hit the 10 second delay and don't seem to notice or slow their requests after 3 months, cut 'em off after the first 100 requests in an hour.
Personally, I am also curious about all of the new developer tools for firefox, ie, etc. that perform validation on every page. I hope these are using caching mechanisms, but given the ease with which they can be implemented, they could quickly set up a distributed DOS.
It's good that this blog post is getting attention - I'm sure many of the developers of various libraries are starting to take notice. IMO, that's the biggest problem right there - libraries, their documentation, and their default behavior.
It's a hard sell for a library developer to include a disk-based cache mechanism as part of the parsing library when something like that really should be the responsibility of the program making use of the library. If every XML parsing library were to include file I/O as part of the package, that would be overkill. Certainly publishing tutorials about caching for performance reasons, including sample code, would be a good thing. Also, memory-based caching should really be included by default for those libraries that can.
I think the best way to resolve the problem for w3c would be to create some tutorials for using some of the more common parsing libraries that exhibit undesired behavior by default and as part of those tutorials show performance benchmarks. As those kinds of things start hitting the top pages for google searches, developers will take notice and start to build better solutions.
I think short term fixes like 503 or even 404 errors will end up doing very little to resolve your issues long term, and probably will have very little effect short term as well.
I also think that setting up a subdomain for dtds is a great idea. (dtd.w3.org). It at least opens up some potential solutions down the road. (geo-responsive dns, etc).
Some cases, like the "Java/1.6.0" user agent, may be a Java applet or application using the standard Java library. Changing the user agent string here may not always be possible, and blocking it may cause harm to a lot of applications on the net.
One way that was suggested is to limit the throughput to the resource, and that may be fine, but that should be done dynamically to not introduce problems for "normal" applications. To make things worse - in some cases a lot of requests may appear from a single IP address, but that address may be a NAT firewall for a large company. Of course - such companies should have a web cache.
A limitation in throughput may not even be easy for the application developer to track down since the delay may be masked in a library and the developer may end up hunting problems in all other places than the DTD link.
The use of a separate DTD serving address may actually be a good idea, since it will allow for a distributed load. The downside is that it will take some time before it becomes effective and that it will require some resources. The good side is that the amount of data served is very small and also very predictable, so it's not very problematic to set up such a server.
Hi Folks,
Sorry, but I think the W3C is at fault here. If a namespace should not be retrievable then DON'T use an http URL to identify it.
If you want to foster or require the use of caching define some kind of optional cache identifier to define a namespace.
In general, I think the identity of a resource being retrievable is a good thing. It promotes the idea that the "definition" of an identified thing can be discovered by retrieving it. That makes sense - the only disconnect here is that we are talking about a "well-known" thing like HTML where the logic breaks down. However, for not very-well-known tag sets, this makes a lot of sense.
So, I recommend you make an alternative syntax to specify that a local cached copy exists (or must exist). Then you can switch the default to that "cached copy URI".
If such a caching URI exists, my apologies for not researching it before posting.
Best wishes,
- Mike
The standards encourage this behavior. The 4.01 spec says, "The URI in each document type declaration allows user agents to download the DTD and any entity sets that are needed." And the XHTML spec says, "If the user agent claims to be a validating user agent, it must also validate documents against their referenced DTDs". General purpose SGML or XML parsers will not embed the HTML DTDs, and even if they have the luxury of a cache it isn't likely to persist across process invocations. There is obviously a software component to this problem, and developers need to be aware. But as you point out, the problem is not limited to the W3C. The best way forward will be to improve infrastructure, and in particular, to find sustainable caching strategies. I would like to think that the Scalability TAG will come up with solutions, but the emails are not encouraging.
Track the IP to a person (use a court order if necessary) and find out what is the offending software. Then sue them for abuse. One case like that and you would get much more publicity than via slashdot.
Track the IP to an ISP and ask them to install a transparent proxy for your site, or to contact their user and tell them to configure one.
Convince Sun that the next update of Java 6 (and Apache commons) would install a local cache. Same for Python urllib2 and Perl's libwww.
I like the idea of slowing down offensive connections, but since that may be hard on the server level, you can just return a wrong DTD. Make it have a valid syntax, so the DTD parser would not fail, but contain no real elements. If that fails, use a completely invalid DTD.
If we ignore the intentional "dual purpose" of the URLs concerned (that they should be used as a unique identifier, yet can also be used to consult the DTD), probably the biggest reason why the URLs are getting so many hits is because many parsing toolkits have bad defaults: that for many implementations of APIs like SAX, you have to go out of your way to override various "resolver" classes in order for your application not to go onto the network and retrieve stuff. So it's quite possible that most people don't even know that their code is causing network requests until their application seems to be freezing for an unforeseen reason which they then discover to be related to slow network communications.
My first experience with excessive network usage probably arose with various Java libraries, but it's true that Python's standard library has similar mechanisms, and if you look at tools like xsltproc there are actually command switches called --nonet and --novalid, implying that you'll probably fetch DTDs with that software unless you start using these switches.
Who is responsible? Well, I don't think you can put too much blame on the application authors. If using XML technologies has to involve a thorough perusal of the manual in order to know which switches, set in the "on" position by default, have to be turned off in order for the application to behave nicely, then the people writing the manual have to review their own decisions.
Some clearer language in various specifications would help, rather than having to read around the committee hedging their bets on most of the issues the whole time.
Thanks to everyone for your comments. I'll try to reply in more detail later, just a quick followup for now:
The tarpitting idea sounds worth trying; does anyone have specific software to recommend that is able to keep tens of thousands of concurrent connections open on a typical cheap Linux box?
As Arman Sharif says, you would think that the W3C would make it easy to download the schema files they don't want you to access directly. But it's my experience that the W3C does not offer this properly.
I actually wrote some software that reads XHTML documents into XML DOMs. As soon as the XML parser encounters an entity reference the URL will be loaded. So I created a local resolving mechanism with an entity resolver to read the DTDs from local, however:
- I had to go to all the individual specifications and download the individual specs there, and create my own full repository (I don't have the source code here, but I am quite sure I ended up with over 50 files for 5 or so specs).
- Create my own mapping file that goes from public ids (public id handling is broken in .Net XML parsers) and system ids to my local files.
- And then of course implement entity resolving to actually pick up my local files.
Every time a developer implements an application that loads html documents using a standard XML parser (a quite common thing I would say), they need to perform these steps to alleviate stress on the W3C servers.
What I actually naively expected this article (found via slashdot) to contain when I opened it was a link to an archive with the files for all your stable specifications in one place, with an id->path mapping, and some sample resolver code for common parser libraries in various languages. Does it exist?
(and caching it after one request is not usable for many situations, since my reason for caching was actually not to lessen W3C's Internet bill but to allow the application to run without Internet access)
130 million requests a day is about 1500 per second. On a few static files. A 3 years-old laptop running lighttpd can manage that easily (and actually a lot more). On my server which is the cheapest Core2, lighttpd handles 200 req/s and it uses a few percent CPU.
Just use lighttpd or nginx (obviously you should forget about Apache!)
Now here is my suggestion to get rid of the spamming.
You need two servers, a main server and a backup server. Both are machines suitable for serving a few thousand static requests per second, i.e. the cheapest Core 2 boxes you can get. You will need to adjust the allowed number of sockets on both, of course, to allow as many concurrent connections as you can. Don't forget to enable zlib compression.
Now you implement some redirections:
- When www.w3.org sees a request for a DTD, it redirects to dtd.w3.org which runs on the "main" box.
- The "main" box has a good connection (like 100 Mbits)
- When the "main" box receives a HTTP request, it looks up the client IP address. If this address has submitted few requests, it serves the requests. However, if this IP has submitted say, more than 5 requests in a period of a few hours, it redirects it to the "backup" server, which is dtd2.w3.org, on a different IP.
To implement this you will need to code a simple C module for lighttpd.
- The "backup" server is connected to the internet via a completely separate, rather slow (10 Mbits) connection. It just serves static files.
So, the "main" server will always be fast and responsive, and the "backup" server will always have its connection horribly saturated.
Therefore, any client will get fast response on the first request from the "main" server. Well behaving clients will cache it, and it ends there. Badly behaving clients will not cache it and will request again, they will get redirected to the "backup" server and feel the pain.
Thank you all for the comments.
Scaling is not the long term solution as it does not address the cause; however, it is something we will have to continue to do, and we appreciate suggestions made in this area.
By making this post we are trying to increase awareness so ideally this gets resolved as far upstream at the library level as possible since that will have the broadest effect. Community involvement with their respective development platforms of choice will help as we have had mixed success in identifying and contacting software and library maintainers. Some have been very responsive and are implementing caching catalogs or making static catalogs a default instead of an afterthought left to those installing and utilizing the library. Some developers have noticed our blocking scheme and have contacted us letting us know they have taken corrective steps on their own.
For those who wondered why these schemata and namespace resources are made available via HTTP to begin with: we intend for them to be dereferenced as needed, but expected this to be more reasonable given the caching directives in the HTTP spec. The performance cost of going over the network repeatedly for the same thing should be reason enough for developers to cache. Since many of these systems ignore response codes, a tarpit solution might very well succeed in gaining their attention, plus it has some entertainment value. If their application performance suffers substantially enough, developers may take notice.
As mentioned, many of these systems only understand a small handful of the various HTTP responses (200 OK, 302 Found, 401 Unauthorized, 403 Forbidden, 404 Not Found). We are more than slightly curious how the browser plugins (in one case, for instance, we suspect a particular large-scale ISP's webmail plugin is a traffic culprit) would handle HTTP 401 "Authorization Required" responses to their requests. Inside the realm part of the WWW-Authenticate header it would be quite tempting to give the technical support phone number of the ISP that has not listened to our repeated phone calls and emails on the subject. That would likely get their attention and potentially encourage them to correct the plugin.
Identifying the sources of W3C's abusive DTD traffic can be quite time consuming and difficult depending on the data in the HTTP request. One rather odd case we see often has the HTTP Referer of "file:///C:/WINDOWS/fonts/set.ttf" and we have so far not found the related software. For identifying some, we have found resources provided by various organizations (e.g. McAfee SiteAdvisor) that catalog browser plugins, software network interactions and viruses quite helpful. We would very much like to collaborate with such organizations or similar community efforts to help us identify more of the software responsible for this traffic. We have made a couple of efforts to establish contacts within a couple of such organizations but unfortunately emails and phone calls have not gone very far. Specific suggestions or contacts for us to follow up with would be appreciated.
Martin Nicholls,
I cannot agree more with your sentiment towards poorly behaving bots/crawlers, they are getting out of hand. There has been talk around here at W3C and elsewhere on starting an activity for directives governing bot interactions with a website. There have been some scattered conventions which should be standardized and improved upon.
For instance, polite bots could:
- identify themselves properly, including a URI with various information and how to submit complaints
- provide a means to authenticate the crawler and the IP range it is coming from
- respect directives regarding frequency, concurrency, etc. given in response headers of the site being crawled
- advertise peak hours with higher thresholds if the crawler would like to schedule its return
The server being crawled could also make available data on resources and last-modified dates so more intelligent, minimal crawls can be made, saving both sides resources.
Those bots that do not abide by these conventions and overstep the boundaries spelled out can be spotted and blocked through automated means.
Those that do could do their indexing in as efficient a manner as is comfortable for the website being crawled.
Why not change the doctype tag to something like:
<!DOCTYPE PUBLIC html "-//W3C//DTD XHTML 1.0 Strict//EN"
"dtd://TR/xhtml1/DTD/xhtml1-strict.dtd">
Then it's known where the DTD is located in case it's actually needed, but it would only be used by applications and libraries that probably need it, since it would require special handling instead of blind handling. Slowing traffic for the http link wouldn't be great, because on a few occasions I've downloaded the DTDs for learning the document types (you do want valid HTML instead of what most tutorials provide, right?).
Still unique, still locatable, and hard to misinterpret.
There is some good SMTP tarpit software in OpenBSD. It isn't hard to change the BSD-specific C calls to Linux-based calls using the same functions from rsync. I guess it would need to be modified for HTTP as well, however.
Serving requests very very slowly will unduly penalize the innocent bystanders. Many of us have no idea where to look when an application has slow response. Our systems are running software from dozens of vendors, and we will have no idea which of the vendors is running slowly. So we will just suffer. Or, we will decide our computer is old and needs to be replaced with a faster one: One which can hit your website even more frequently.
I've had similar problems with dumb crawlers that couldn't handle escaped '&' entities in URLs and would bombard the server with invalid requests for them. So I have sympathy.
However, I don't know if the tarpit solution is a good idea. What's John Q. Public who's running some misbehaving software going to think? "Oh, this must be what they mean when they say XML is slow. This problem never happened before I put the DOCTYPE on all my files. That guy who pushed us to adopt XHTML is a moron." Fix the problem going forward by changing the scheme for identifying DTDs, but think carefully before spreading the pain just to save the W3C some inconvenience.
I would like to point out that, contrary to what Ted says, the system identifier *is* a downloadable resource, not just an identification (unlike the namespace URI, which is mere identification). Specifically, XML 1.0, section 4.4.3, says that a parser MUST "include" references to parsed entities if it is validating (so not reading the DTD is not an option), and it MAY do so even if non-validating.
In particular, reading the DTD is necessary even in non-validating mode in case the document contains entity references.
Of course, to read the DTD, one might be able to use an alternate URL based on the public identifier. Unfortunately, catalogs are not in wide-spread use, and W3C does nothing to promote them.
Martin,
Where did I say System Ids are not downloadable resources? This post is about the frequency of the downloads, disregarding HTTP caching directives.
Ted,
I think Martin was referring to the following excerpt, which does sound like "these URIs are identifiers, not for download".
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over
As many have pointed out, the data downloaded is needed, so it'd be great if W3C could provide basic catalogs/suggestions to be used as the sane default.
You have to keep in mind that caching is a great solution for many given scenarios, but low-level tools/libraries cannot be expected to assume caching/catalogs are THE right thing to do when the spec includes checking against what the URI points to.
Your saying that developers were supposed to implement caching due to their own performance concerns makes me wonder: what if most already do that? What will change when they do?
What if most hits you get are from software that scrapes "http://[...]" from data and follows it? What if a library's per-process/thread cache is already there but the system forks for each URL? How about distributing batches of URLs to visit?
So IMO the W3C has a chance of simply postponing the issue if no steps are taken towards providing local, reliable catalogs to the community and changing the recommended http:// URIs to something else (like the dtd:// above).
Daniel,
A misunderstanding then; my apologies for us being ambiguous. We went with that wording to avoid going into the differences between DTDs and namespaces, which parsers have no need to dereference as there may not be anything of use to them, as is the case with xmlns="http://www.w3.org/1999/xhtml". DTDs are meant to be downloaded for machine processing, but reasonably not incessantly by an application running on a machine. We are seeing XML processors grab these even when they are not using them.
Making catalogs available has come up before and we certainly will consider it. Catalogs still would need to make their way into the various tools and libraries, many of which do not come with any. We are just one of the many organizations and individuals making these sorts of resources (namespaces, DTDs) available, so tool and library developers will still have to collect these.
It is difficult for tool and library developers to know what markup will run through their utilities for validation or transformation, not to mention that new schemata are always being created. Because of this, the best solution is a caching XML Catalog resolver, as I understand is part of Glassfish. The library will add DTDs to its cache as it needs them; caching is part of the HTTP protocol.
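A caching resolver along those lines can be sketched in a few lines of Java; the class name, cache location and User-Agent string below are illustrative assumptions rather than a reference implementation, and a real one would also honor the Expires header instead of keeping files forever:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class CachingResolver implements EntityResolver {
    // Hypothetical cache location; one file per system identifier.
    private final File cacheDir = new File(System.getProperty("user.home"), ".dtd-cache");

    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        File cached = new File(cacheDir, Integer.toHexString(systemId.hashCode()));
        if (!cached.exists()) {
            // First request only: fetch over HTTP with a descriptive User-Agent,
            // then keep the copy on disk for every later parse.
            cacheDir.mkdirs();
            URLConnection conn = new URL(systemId).openConnection();
            conn.setRequestProperty("User-Agent", "ExampleApp/1.0 (+http://example.org/contact)");
            InputStream in = conn.getInputStream();
            FileOutputStream out = new FileOutputStream(cached);
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            out.close();
            in.close();
        }
        InputSource source = new InputSource(new FileInputStream(cached));
        source.setSystemId(systemId);
        return source;
    }
}

It would then be installed with something like documentBuilder.setEntityResolver(new CachingResolver()), or the equivalent on an XMLReader.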
Ted,
Thanks a lot for the discussion.
I like the caching idea, but believe distributing the load extends it. The general issue with caching is related to where to cache. Local (or in-memory, per-process, etc.) caches will be less efficient than shared ones. Shared (system, library) caches will have their own load of issues. So you might end up with much nicer libraries and still be hammered by requests.
Notice that scaling up and mirroring amounts to an extremely-shared cache. I believe having the machinery for mirroring in place (checksums, compressed snapshots, change notifications) could lead to lower level mirroring (dtd-daemon, anyone?).
The benefits would be:
- Network admins could save resources that lazy programmers forgot to (and legacy code would automagically stop being so nasty).
- Users could get performance boosts by installing software that tricks dumb apps to fetch DTDs from a local cache, regardless of upstream actions.
- Library developers (and even dumb programmers) would have a Darn Easy® recommended route to caching, as local-ish mirroring and checksums would be discussed all over the place (and faster, cheaper, tastier).
Also, maybe you should talk to Coral Content Distribution Network regarding forwarding traffic. It might be interesting for them to have such a huge source of input to their research.
On a meta note, I think it could be very useful to have a central location (wiki?) to gather resources and discussions on this issue.
One of the arguments against having caching resolvers in XML libraries has been this is attainable outside of the library, which it certainly is with a caching proxy server for instance. It is a very worthwhile solution and why we give the caching directives in the first place.
We have seen a number of corporate and large ISP HTTP proxies hammering us because of some XML application[s] running behind them. Sometimes the network admins would, if they were responsive at all to us, add caching to their proxy setup or less often track down the parties responsible for the software causing the traffic. More often they would refuse to add caching to their proxy or any other action citing cost or complexity. Bandwidth is cheaper than equipment and admin time I guess.
It is strange (and probably an indication of the lack of XML knowledge of many posters here) that no one mentioned the best solution on the application side: catalogs. They have been part of SGML and XML for many years, so there is good support for them. Any XML parser should support catalogs, and then the DTD would be retrieved from the local disk and not through the network.
http://www.oasis-open.org/committees/entity/spec-2001-08-06.html
http://www.sagehill.net/docbookxsl/Catalogs.html
(Of course, there are always broken programs and sites where catalogs will not be installed, so there is still a need for other measures.)
A few people mentioned testing whether the file has changed or not. The HTTP protocol has an If-Modified-Since header for precisely this purpose, and W3C's server honors it. You can set it, for instance, with curl:
% curl -v --time-cond 20080201 -o html.dtd http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
...
< HTTP/1.0 304 Not Modified
[No download]
% curl -v --time-cond 20000201 -o html.dtd http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
...
< HTTP/1.0 200 OK
[And the file is actually downloaded]
Of course, this requires a program that sends it and has local storage to keep the DTD, but recommending this technique may help (among other techniques like HTTP caching, XML catalogs, terminating the offenders, etc).
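For Java users, the Apache XML Commons Resolver mentioned earlier in this thread makes catalog use fairly painless. A rough sketch, assuming a catalog file at a hypothetical path that maps the W3C public/system identifiers to local copies:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.xml.resolver.tools.CatalogResolver;

public class CatalogParse {
    public static void main(String[] args) throws Exception {
        // Semicolon-separated list of OASIS XML catalog files; this path is a
        // placeholder that would map W3C public/system ids to copies on disk.
        System.setProperty("xml.catalog.files", "/etc/xml/catalog");

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setEntityResolver(new CatalogResolver()); // from Apache XML Commons Resolver
        db.parse("page.xhtml"); // placeholder input; DTD lookups now hit the catalog, not www.w3.org
    }
}

The same CatalogResolver also implements URIResolver, so it can be handed to a JAXP Transformer for XSLT work.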
Ted,
Adding to what Daniel already said:
I just happen to be stacking together a new flavor of modular XHTML in the spirit of XHTML+RDFa for the backend of a new website I'm working on.
I'm using libxml on Mac OS X via MacPorts. MacPorts has a package with HTML4 DTDs, but not XHTML, and it does not supply a catalog with the DTDs. I'll accept that I'll have to add a new entry to the catalog, but I'll still have to get the DTDs in the first place.
I have three options at this point: download each DTD (module) manually through my web browser at /MarkUp/DTD, let wget crawl the DTD directory, or download Debian's w3c-dtd-xhtml package and rip the files from that package. Hardly convenient.
I assume that a lot of developers will even ignore the speed problems whilst getting their new apps to work.
I think it would really help if W3C would package its DTDs in a tar.gz, and perhaps even pro-actively work with package maintainers to distribute these files.
Obviously, this will not be a quick fix to your bandwidth problems, but I think it does address the core of the problem: too many developers are not aware of the inner workings of XML validation (or validating parsing) and assume 'it just works'.
my 2 cents
Well, this could be an example of rogue crawlers, bots and spiders having an effect on the website: large numbers of crawlers that take in pages, catch everything that seems like a link, and cannot analyse the HTML well enough to ignore the links in the DTD and xmlns declarations.
I agree we can't just depend on the W3C to persistently follow up, as everyone has a responsibility to help.
Can anyone comment on the combination of IE7, Docbook-generated XHTML, and the DTD http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd that's embedded in such XHTML? We've generated a bunch of man pages built using Docbook 4.3 (see the link at top) which are now all failing in IE because (as best we can figure out) w3.org is rejecting requests from the IE user agent.
While I sympathize with the bandwidth concerns discussed above, the question is what can we do about it? We're not the source of the offending app, Microsoft is, but we bear the consequences. Even if MS were to turn around a caching patch quickly, it probably wouldn't get widely deployed for years and I imagine the W3C admins will not lift the IE ban until then.
All I can think of to do on our end is to locally cache the DTD (and the entity files it references, IE also tries to fetch those) on our server and patch all of the documents to refer to those.
BTW, I don't see any reason this isn't affecting the combination of IE with every Docbook-generated XHTML doc in the world, if they're built using the standard stylesheet distribution.
Are there any other options within our control, that don't require cooperation from W3C or Microsoft?
Jon,
So I notice in these man pages, the main frame (e.g. http://www.opengl.org/sdk/docs/man/xhtml/glBindAttribLocation.xml) is already XHTML markup but served with the HTTP header Content-Type: application/xhtml+xml.
If you serve this [X]HTML markup as HTML instead of XML, MSIE will not call its XML processor, which in turn tries to dereference the DTD from us. Try serving it with an .html extension and/or Content-Type: text/html and your problem should be resolved.
Hi Ted,
If we serve our content as HTML instead of XML, as you suggest, then IE will not invoke the MathPlayer plugin to render MathML content and the pages aren't rendered correctly. Getting MathML displayed properly in the man pages is pretty much the whole point of this exercise, so I don't think that will work for us. We've modified the man pages to refer to a local cache of the DTD, and that seems to work well enough.
Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
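A sketch of that kind of escalating delay, in Java purely for illustration (the class, the per-IP map and the numbers are invented here, and entries would still need to expire after the "certain timeframe" mentioned above):

import java.util.concurrent.ConcurrentHashMap;

public class EscalatingDelay {
    // Per-client delay in milliseconds, keyed by IP address (or IP plus User-Agent).
    private final ConcurrentHashMap<String, Long> delays = new ConcurrentHashMap<String, Long>();

    // Call once per incoming DTD request and sleep for the returned number of
    // milliseconds before writing the response.
    public long nextDelayMillis(String clientKey) {
        Long previous = delays.get(clientKey);
        // First request: no delay. Each later request: triple the previous delay,
        // starting at one second and capped at roughly six minutes.
        long next = (previous == null) ? 0 : Math.max(1000, previous * 3);
        next = Math.min(next, 6 * 60 * 1000L);
        delays.put(clientKey, next);
        return next;
    }
}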
Microsoft blog article on how to more efficiently invoke MSXML in your applications.
It's not a complete or ideal solution, but have you considered in-place editing of the relevant DTDs to make them smaller while maintaining their semantics? It's unpleasant, but Process allows for it.
Take for instance http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. It is 25kb. Instead, replace it with a DTD that only contains:
<!ENTITY % x SYSTEM "http://dtd.w3.org/xhtml1-strict">%x;
That's 57 bytes, roughly 450 times smaller than the raw version you have to serve to those stupid libraries that often don't send Accept-Encoding: gzip (even if they support it), and still about 120 times smaller than the gzipped version.
Now this assumes that an important subset of the requests that are made don't actually do anything useful with the content and so don't make a second request to the actual content. I suspect it's worth a shot, or at least worth testing.
It has the additional advantage that using a different DNS means that you might be able to use load distribution tricks not available to you for the general website.
Anyway, just a thought!
Please tell the people from Saxon not to reload the XHTML DTD every time you open an internet document with the XPath document() function.
The source of the problem is that the URI actually exists. If the URI did not exist, then everything would be forced to implement local caches of the required files and there would be little sustained traffic to w3.org.
If the URI is really a name and not a resource then it should not map to a resource; e.g. xhtml1-strict could have been identified as www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd instead of including the http://, and then this kind of issue would have been forced to resolve itself earlier in everyone's development and test cycles.
In case anyone experiences this issue when using Java facelets(!) look at this bug report:
https://facelets.dev.java.net/issues/show_bug.cgi?id=352
We are now seeing such extreme surges in traffic that our automatic and manual methods simply cannot keep up. Increases in serving capacity are readily consumed by this traffic and our site becomes overwhelmed. As such we are taking some more drastic temporary measures which we hope to be able to back down shortly. We are sorry for the impact this is causing the community. We continue experimenting with various methods including some of those suggested by posters here.
If you are impacted file a bug report with the developers of the library or utility you use asking them to implement a [caching] catalog solution. You may also put a caching proxy in front of your application for immediate remedy to your situation, populating the cache with a user agent we are not blocking DTD access to.
Java-based applications and libraries are presently accounting for nearly a quarter of our DTD traffic (in the hundreds of millions of requests a day). There is also another more substantial source of traffic which the vendor is working to correct, hopefully in the near future.
To ensure that Saxon doesn't hit W3C Web site when transforming XHTML content, see:
http://saxon.wiki.sourceforge.net/XML+Catalogs
Trying to write a well-behaved system using the Xerces DOM parser. There seem to be two things I need to do: set the User-Agent to indicate I'm not the raw Java libs, and manage a local cache.
I can set the system user.agent property, which URLConnection then uses in the request. It appends the Java "Java/vers" string to the one I provide, giving e.g. "DSS/1.0 Java/1.6.0_13". I believe this is the correct format and intention for User-Agent, indicating the primary system and version followed by any subsystem.
You're still denying this request. Are you searching for the Java identifier *anywhere* in the string? That precludes any Java-based system (at least, ones not controlling the headers all the way down, i.e. using most any libraries) from behaving properly and working.
Caveat: My understanding is limited.
For those that (like me) run into this when using the Ant (Java build tool) xslt task - take a look at the xslt task manual and the section on xmlcatalog. That allowed me to keep the DTD files locally and use them from there. I was transforming XHTML, so that also required downloading several entity files as well.
Dan,
Changing the user-agent is commendable especially if you post it somewhere it can be indexed and people can contact you if there is an issue with it.
Instead of writing something to maintain your own cache, look to Xerces' XML Catalog capabilities, which I wish were the default instead of an afterthought.
For the time being I am also relaxing the filtering based on your suggestion. There is one particular Java UA that prepends a string that is causing 80 million or so hits/day at present. We contacted them after researching the user-agent used.
I use a variety of mostly Java libraries and tools. The authoring tools and IDEs are generally good about using local cached copies of the DTDs. The libraries and tools like Saxon do not. While there are many historical reasons for why we have the implementations and behavior we have today, the best fix is for the library maintainers to enable DTD caching by default. After all, it is the libraries that are fetching the documents in the first place. Of course, for that to work, the W3 will have to still serve the documents so that they can be cached, but that will likely continue to cause the current problem given the long delay likely to occur between the time the libraries are changed and the time when they largely replace the currently deployed libraries.
In the meantime, I am trying to resolve my own problems by implementing the necessary XML catalogs. However, I am now struggling with the problem of assembling all of the DTDs and related documents I need to cache locally. (My difficulty is not an isolated case. See Validating XHTML Basic 1.1 (http://people.w3.org/~dom/archives/2009/06/validating-xhtml-basic-1-1/) for one other example.) My first attempt at a catalog simply included the XHTML DTDs, but then Saxon complained it could not find xhtml-lat1.ent. So then I needed to retrieve the referenced entities documents. Then I needed to do the same for each DTD I required. Facing the tedious prospect of pulling down each document individually, I went looking for an archive containing all current DTDs and related documents. After refining several searches and restricting them to the W3 site in the hope of finding an official—or, at least, semi-official—distribution, I finally located the DTD library (http://validator.w3.org/sgml-lib.tar.gz) made available as part of the Markup Validation Service (http://validator.w3.org/docs/install.html) distribution.
That was too difficult and may not have been the best solution in any case. However, it does highlight the need to make available an official archive or library distribution and to make it clearly available from somewhere on the home page even if it is only listed on a page referenced from a “Downloads” link. If you want to encourage people to use local cached catalogs, help make it easier to assemble the necessary documents.
I arrived at the page "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" by following the link in the first sentence of section "A.1.2. XHTML-1.0-Transitional" in the W3C document "XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition)", where it states, "The file DTD/xhtml1-transitional.dtd is a normative part of this specification". The phrase "DTD/xhtml1-transitional.dtd" links through to the page in question.
An annotated version of a DTD is available by following the link in the following sentence in the same section, but, contrary to what is stated, this is clearly not an annotated version of the first file.
The first file (the "normative" part of the specification) is either the phrase "see http://w3.org/brief/MTE2" or this page, neither of which, as displayed in my browser window, is well-formed XML (if I understand the W3C XML specification correctly).
Perhaps the W3C would suffer fewer problems of the kind discussed above if it maintained accurate online documentation.
I am desperately sorry, but I am to blame for some hundreds of those millions of requests. One of the projects I did a while ago used PHP, called "the best tool for the web" by some. And while I am aware of this problem, I found no way to disable schema fetching in PHP without messing with the PHP core itself. I had neither the time to mess with it nor permissions to deploy those fixes. So, perhaps, you also need to contact the authors of PHP XML parsers to persuade them to fix it (because I already gave up).
OK, so, how do I parse an XHTML file using only the JDK? This fails with a 503:
DocumentBuilderFactory.newInstance().newDocumentBuilder().parse("http://www.weststats.com/Items/right_arm/");
And how do I parse it within less than 10 lines of code?
As a developer, all I see is that code that used to work doesn't work anymore, because the naughty w3c decided to break it.
You're a bad w3c. Yes you are! :)
Now, I have read the whole thread, I understood it, and I understand w3c's position, but as an end user (or end programmer, whatever), I still have to wonder: why do hundreds of programmers have to implement complicated caching techniques because you didn't see this coming and didn't plan ahead?
Just complain to Xerces and Sun, and maybe Operating System manufacturers, and ask Sun and/or Xerces to cache such resources, at the JRE installation level. Or even better, at the operating system level.
As for me, I will try to stay as far away from w3c "standards" as possible.
OK. Problem fixed. How:
Installed Squid for Windows, from http://squid.acmeconsulting.it/download/squid-2.7.STABLE7-bin.zip to C:\squid.
Copied cachemgr.conf.default, mime.conf.default and squid.conf.default from C:\squid\etc to cachemgr.conf, mime.conf and squid.conf.
Modified this line in squid.conf:
http_access allow localnet
to:
http_access allow localhost
Ran these commands at a command prompt:
c:\squid\sbin>.\squid.exe -i
c:\squid\sbin>.\squid.exe -z
c:\squid\sbin>net start squid
Modified my Java application, adding these lines:
System.setProperty("http.proxyHost", "localhost");
System.setProperty("http.proxyPort", "3128");
System.setProperty("http.agent", "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7.2) Gecko/20040803");
You can also set the properties from command line, with
java -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 -Dhttp.agent="Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7.2) Gecko/20040803" yourClass
After you successfully access the DTDs from your application and the cache is populated, you can remove the fake http.agent line, or replace it with something useful.
I am just trying to make a simple application do an XPath query on a valid HTML document, and it automatically pings you for the DTD. Don't block IPs. Get Sun to fix their lousy library. There is no way that an XPath evaluation should open a connection to the internet to fetch a DTD from w3.org. And they don't give you any way to disable it. If they just had a nice IoC-type framework it would be easy to fix, but they are just not that smart.
Go back and reread your specification. The DTD declaration consists of two parts, the public identifier and the system identifier. The system identifier is supposed to point to where the local system can find the DTD. It may be a bad idea to use the w3.org address in all the examples, and there may be misbehaving user agents out there, but it is correct behavior for user agents to look up the DTD.
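For illustration, a document can keep the public identifier but point the system identifier at a local copy; the relative path here is just an example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">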
The conformance section for XHTML actually uses the w3.org addresses without indicating those can technically change. Learn from that mistake for future specifications.
It looks like you're now blocking attempts by Microsoft's .Net XmlDocument object. To echo Jon Leech's comment back in February, we are at the mercy of whatever is going on at Microsoft.
Our use case is manipulating an html page using the .Net XML objects. Since the html page contains HTML entities, we need the DOCTYPE reference, but recently that has been generating the 503 error.
Does anyone know of a work around?
The XmlReader and XmlDocument classes in the Microsoft .Net Framework also try to fetch the specified URL; it is not only Java.
Anyway, I don't like that the DTD identifier takes the 'http://...' form.
It is quite easy to mistake for a download location, isn't it?
Microsoft releases fix for Microsoft XML Core Services (MSXML).
Full release information
If you have a Windows platform that is being blocked from accessing W3C, ensure you have this update installed.
I was surprised recently to learn that a valid XML DOCTYPE declaration requires a URI in addition to an FPI in a public identifier (unlike SGML, which I understand needs just an FPI). I too have been bitten by the fervent attempts of Web software to dereference DTDs (for instance, WebKit ostensibly does not have the XHTML+RDFa DTD in its catalogue).
I think, however, that it is entirely reasonable to expect to be able to dereference a URL (specifically), especially within a framework that affords the dereferencing of URIs. By my reading, §4.2.2 of the XML spec indeed discusses the dereferencing of URIs in system identifiers.
I cannot speak for the HTML equivalent, except to assume it would ideally follow the same rules as SGML in which a URI may be omitted from the public identifier.
XML Namespaces using HTTP URIs should ideally have something present on the other end at the very least as a courtesy, but a sane XML processor should not attempt to dereference them on sight.
This problem was also endemic to RDF before Linked Data gathered steam—there would be HTTP URIs used as identifiers everywhere but relatively few would correspond to live HTTP resources.
Do I think vendors of Web software would serve their customers better if they kept DTD catalogues up to date? Indubitably. Do I think they are doing a disservice to high-traffic targets like the W3C by not including their DTDs? Absolutely. Do I think that complying in this manner is the path of least resistance? Unfortunately not.
I think if you mint an ostensibly dereferenceable URI, you should expect attempts to dereference it. If you are in the business of dereferencing HTTP URIs, however, you should make an attempt to comply with their cache directives.
When using Internet Explorer 7 to browse to an HTML document on a misbehaving server, the following may happen...
The server responds with:
HTTP/1.0 200 OK
... http headers ....
Content-Type: text/xml;charset=utf-8
... http headers ....
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="nl" xmlns="http://www.w3.org/1999/xhtml">
.... the actual html content .....
</html>
This is an HTML document encoded as XML and sent with a content-type of text/xml.
Internet Explorer 7 interprets this as XML and tries to resolve the URI "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".
Now see what happens when IE7 tries to resolve this:
GET http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd HTTP/1.1
Accept: */*
Referer: http://some-address/some-document.html
UA-CPU: x86
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
... other http headers ...
And the answer from W3C:
HTTP/1.0 503 Service Unavailable
Date: Thu, 07 Jan 2010 13:41:41 GMT
Server: Apache/2
Content-Location: msie7.asis
Vary: negotiate,User-Agent
TCN: choice
Retry-After: 864000
Cache-Control: max-age=21600
Expires: Thu, 07 Jan 2010 19:41:41 GMT
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Content-Length: 28
Content-Type: text/plain
... other Http headers ...
see http://w3.org/brief/MTE2
One solution:
Tell Microsoft that this URI should be cached, or tell Microsoft to create a catalog of DTDs inside its IE7 and IE8 browsers.
I think it should be possible, using the standard Windows updates, to distribute a patch that solves a large part of this problem.
Please tell Microsoft that they are part of this problem.
I am afraid I am guilty of a few unnecessary requests to W3C DTDs. I just realized that a SAXParser in Java downloads the DTD even if the parser is non-validating, which is the default. I don't think this is very well documented, so I am posting it here to warn others.
This code will create a parser that doesn't download the DTDs.
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
SAXParser parser = factory.newSAXParser();
I downloaded the Java JDK from Sun/Oracle website a couple days ago. Included in the JDK is a class called DocumentBuilder which provides access to the SAX xml parser (you see where this is going?). I compiled and ran a Java program which builds an object tree from the xml. The run failed in the SAX parser when processing the following xml: . That caused the parser to try to access this website, getting the 503 error.
Can somebody please contact Sun about replacing the SAX parser with one which does not cause the website access? I don't have the slightest idea who to talk to.
And can I get a copy of the SAX package which doesn't exhibit the problem, as discussed in the previous post by Henrik Solgaard?
Thanks.
I filed a bug report with Oracle (sun) on this. I included the URL for this blog. Hopefully they will react.
Workaround:
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
IE8 transforms the code of a page if an X-UA-Compatible HTTP header is served together with the page. The transformation has several effects: uppercasing tag names (always) and inserting the <META content="IE=8.0000" http-equiv="X-UA-Compatible"> element (always). And, if the page contains the HTML5 doctype -
<!DOCTYPE html>
, then IE8 even replaces it with a legacy, non-official HTML4 doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
I followed the URL of that doctype, only to reach a page with a message to contact staff about how I reached that page.
Some hunches, based on the fact that Ted said that the misuse of these URIs has only increased since he first posted this article:
- X-UA-Compatible has increased in popularity.
- The HTML5 doctype has also become more popular.
- IE8 has become more popular.
- Chrome Frame has become more popular.
- HTML5 forbids non-standard META elements, thus one must use the HTTP header to be valid as HTML5.
All the above points to factors that could explain why the abuse has increased, provided that IE8 is involved in this. (Note that even when a X-UA-Compatible header is used to request Chrome Frame, any IE8 without Chrome Frame installed, will be affected, regardless.)
This post seems to be giving some folks the wrong impression. It is not possible to correctly parse most XHTML documents using an XML parser without reading the DTD. You can cache the DTD so you don't read it more than once. You can read a local copy instead. You can point the DOCTYPE to a different copy of the DTD. But if you leave out the DTD completely and use a generic XML parser, you will lose information such as default attribute values (including namespace declarations) and entity definitions. This is arguably a design flaw in XML, but it is one we have to deal with.
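To make "read a local copy instead" concrete with a JAXP parser, an EntityResolver along these lines works. This is only a sketch, assuming you have bundled the DTD and its .ent files under a /dtd resource directory of your own choosing:
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Serves bundled copies of the XHTML DTD and entity files instead of fetching them from w3.org,
// so default attributes and entity definitions are still available to the parser.
public class LocalXhtmlResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.startsWith("http://www.w3.org/TR/xhtml1/DTD/")) {
            String name = systemId.substring(systemId.lastIndexOf('/') + 1);
            InputStream local = LocalXhtmlResolver.class.getResourceAsStream("/dtd/" + name);
            if (local != null) {
                InputSource source = new InputSource(local);
                source.setPublicId(publicId);
                source.setSystemId(systemId);
                return source;
            }
        }
        return null; // fall back to the parser's default behaviour (i.e. fetching)
    }
}
Registered via builder.setEntityResolver(new LocalXhtmlResolver()), the parser still reads the full DTD, so nothing is lost, but nothing goes over the network either.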
I can't believe it, but Dreamweaver CS5 is causing this error. Here's the error text:
An exception occurred! Type:NetAccessorException, Message:Could not open file: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
So DW CS5 attempted to open this "URL". Amazing that some developer at Adobe was so clueless (and I have a lot of respect for the quality of development that goes on at Adobe).
I just found this thread. I read many of the comments but not all. Apologies if this duplicates what someone else has said.
As some commenters have said the XML spec encourages validating parsers to download the DTD from the location given, and a document might not be identified correctly if the original DTD is not given with the original URI.
Of course software can and should cache regularly accessed documents that are unlikely to change. But the other major error most software designers are making is not properly understanding or using the standalone declaration. If standalone="yes" the DTD probably doesn't need to be read in most cases. But the standalone declaration has to be set correctly in the document too.
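For example, a document whose interpretation does not depend on its external DTD can declare that up front in its XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>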
I only found this issue after several months of my XML schema validation working fine. Basically, I was validating the schema in .NET and assumed the default settings would be set in an appropriate way. It's unfortunate, as I think a lot of developers will get caught out by this the first time they use the validator.
If the stick is to slow the servers, how about a carrot to go with it?
Make the DTDs easier to find and download as zip files, along with a catalog.xml file.
OK, only half the issue, but it'd help. The other half is setting up a catalog resolver to read the catalog file.
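For the second half, the wiring is not much code either. A sketch using the Apache xml-commons resolver (assuming its jar is on the classpath, and that the catalog.xml lives wherever you unpacked the DTDs; the path below is just an example):
import javax.xml.parsers.SAXParserFactory;
import org.apache.xml.resolver.tools.CatalogResolver;
import org.xml.sax.XMLReader;

public class CatalogParseSketch {
    public static void main(String[] args) throws Exception {
        // Tell the resolver where the catalog file lives.
        System.setProperty("xml.catalog.files", "/usr/local/share/xml/w3c/catalog.xml");

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new CatalogResolver());
        reader.parse(args[0]); // DTD and entity references now resolve through the catalog
    }
}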
Ditto David's comment. Is there a zip file I can get to 'seed' an off-line EntityResolver I am building? Something in the form of an XML Catalog would be great, but just a .zip of the actual dtd files from w3.org would be useful too.
Various commenters ask for a ZIP of DTDs. It took me a few minutes to figure out where to download them from so I thought it worth posting links:
HTML 4:
http://www.w3.org/TR/html401/html40.zip
(note no top-level directory in ZIP)
XHTML 1.0:
http://www.w3.org/TR/xhtml1/xhtml1.zip
We implemented an application to analyse a fairly large website's XHTML pages for Accessibility. Part of this process involved reading the XHTML into an XmlDocument using standard .Net methods. Because the DTD is stated in the way that it is, these methods automatically looked up the web address. This may be a noob mistake but there will be lots and lots of noobs who won't realise that by doing the above, they are going to be calling the web address.
Our application stopped working and it was only after trawling through Wireshark analysis that we worked out what was going wrong. We had to report the reason for the failure to the customer (government). I'm sure you can imagine the surprise and subsequent guffaws that went around the office when a high level manager of the customer insisted that we get in touch with the head of the W3C to resolve this issue. Needless to say, he was ignored and we developed a workaround.
Anyway, in my humble opinion, it is not always the fault of the people who make software that uses the specified location of the DTD to read the XHTML. Perhaps a more foolproof method is required. "dtd://TR/xhtml1/DTD/xhtml1-strict.dtd" to be used perhaps?
Or, maybe getting Oracle and Microsoft to make the standard .Net and Java XML parsers create the cache for the DTD out of the box would be most efficient.
Another idea, although maybe not workable, would be to return a page with a set of links to standard code that will cache the DTDs instead of returning 503 responses.
I apologize in advance, but I have a somewhat dissenting opinion.
If a schema on w3.org's site references other schemas, which in turn reference other schemas that redefine still other schemas, it's a little much to expect the coder or application to know which ones to download. I've tried downloading every reference I can find (in the VoiceXML 2.1 schema) and loading them for unit testing is still taking several minutes per file, which probably means that despite my best efforts, your site is still being hit.
If you put your URL at the top, that URL is gonna get hit. Come up with another address, or an alternative attribute at the top of the schema that tells the processing file not to hit you.
This is all very well, and I can see why you would block IP addresses for badly written code. However, the authors of the article are making huge assumptions about the people who might be trying to download your resources. I am suffering the 503 response at present. I'm not a web site author or programmer; I'm a Unix systems administrator trying to implement a local copy of your validator with HTML5 support. To do this, I have to install the validator.nu application. This won't build, as it is attempting to download one of your DTDs. I work for a reasonably large University with several thousand users who mostly get onto the Web via a (very large) caching proxy. This effectively means that we will all appear to have the same IP address.
I don't have the skills to change the validator.nu application builder and I don't have time to learn Python only to forget it when I never use it again. So what do you suggest? I can't change my external IP address and can't change the code of (probably several hundred) students who may very well be writing poor code/html.
If I can successfully install the validator.nu engine then I may take away a considerable amount of traffic to both your and their website as all conformance checking will be done locally.
@MATT: thank you for the links. Yes, I had been searching for a ZIP of the DTDs to download onto our dedicated Windows servers.
For those that don't have the links - there is a patch to the validator.nu build script that downloads the DTDs as zip files; the patch is available at https://bitbucket.org/validator/build/issue/2/received-error-retrying
Thanks to AlanJ for this information
Could someone post a full C# example of how to validate an XML document against an XSD (and/or a DTD?) with local copies of all or some of the XSDs (including recursive considerations and possible security issues)?
msdn is a mess (as always...)
There is a C# library, but it doesn't work either.
http://sourceforge.net/projects/w3cmarkupvalida/?source=dlp
Hi all.
This particular issue continues to perplex me. There are a number of XML Schemas and DTD documents that are required for general purpose XML processing (not just HTML, xhtml, etc), and the 'throttling' has impacted me in a number of ways. Despite the availability of web catalog systems, and other resources, it is 'painful' and inconvenient to manage.
I have decided to start up a new open-source project (in Java) specifically targeting this type of issue. It is intended to be a 'simple' EntityResolver. It respects the Cache-Control headers in HTTP responses in order to maintain an on-disk cache of web resources. Only if the cache entry goes out of date will the source server be re-queried. The initial version is a functional resolver that is suitable for running in a multi-threaded environment, and in a way where multiple JVMs can share the same cache location.
In other words, given that the w3c.org standard cache-control timeout is 90 days, it should be possible to set up a single cache folder for all your Java programs and have each document only pulled once every 90 days from w3c.org.
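Stripped to its essence, the idea looks something like this. It is a simplified illustration only, not the actual project code: the real resolver reads the Cache-Control headers rather than hard-coding a 90-day window, and the cache directory here is made up:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Fetch a web resource once, then serve it from disk until the cached copy
// is older than the assumed 90-day expiry.
public class CachingResolverSketch implements EntityResolver {
    private static final long MAX_AGE_MS = 90L * 24 * 60 * 60 * 1000;
    private final File cacheDir = new File(System.getProperty("user.home"), ".dtd-cache");

    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (systemId == null || !systemId.startsWith("http")) {
            return null; // let the parser handle anything that is not a web resource
        }
        File cached = new File(cacheDir, Integer.toHexString(systemId.hashCode()) + ".cache");
        if (!cached.exists() || System.currentTimeMillis() - cached.lastModified() > MAX_AGE_MS) {
            cacheDir.mkdirs();
            try (InputStream in = new URL(systemId).openStream();
                 FileOutputStream out = new FileOutputStream(cached)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
        InputSource source = new InputSource(Files.newInputStream(cached.toPath()));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
Wired in with reader.setEntityResolver(...), every JVM pointed at the same cache directory shares the single downloaded copy.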
Further, I anticipate that future versions will be expanded to allow a 'chain' of resolvers where 'catalog' type resolvers can be queried, and only if those fail, will the caching resolver be used.
This should allow for a well-behaved, easy to use, efficient EntityResolver system.
If anyone is interested in playing with the code, you can grab it from GitHub (Apache 2.0 license) at https://github.com/rolfl/Resolver . You can contact me through GitHub, and feel free to offer suggestions, criticisms, etc.
Rolf,
A caching catalog is the ideal solution and you might want to look at this earlier start on the subject.
http://norman.walsh.name/2007/09/07/treadLightly
Ideally such a resolver would make its way upstream into the JDK. Past lobbying efforts for that to happen have not been successful. I will gladly try to raise attention to your effort.
Hi Ted.
Thanks for the feedback, and encouragement. I don't think this is the right place for a full discussion on the requirements, etc. I have set up an 'issue' and a wiki page for the proposal, and I am hoping to maybe have people comment on it there.
It would be great to make it 'official' that way (part of the JRE). I guess I set my sights low by 'hoping' this could be a future apache-commons type tool.... ;-), and if not that, then I could possibly incorporate it into the JDOM project. Right now it is too early, though.
So, please head over to: https://github.com/rolfl/Resolver/issues/1
It would be great if you could contact me directly on this too. I would love to pick your brains on some ideas.
Thanks
Rolf
xhtml1-strict.dtd states "Copyright (c) 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved." Therefore, a local copy of the DTD can't legally be obtained from any source other than the W3C. Even if the DTD URI were only used for identification, there would still be a download attempt, just to another W3C URL. For non-XHTML-specific, general XML tools, it is quite unlikely that they have local copies of, or know download URLs for, all kinds of (even custom) DTDs, which may additionally be less static than the XHTML DTDs, so they just attempt to download them from somewhere if the URI is an HTTP URL.
I've written a Java wrapper in order to do XSLT on the command line. Since the tool is intended to be usable for all kinds of XML input, it doesn't have special handling for XHTML. However, to avoid automatic download attempts, I've disabled URL resolving completely, because - as said - those URLs are intended as identifiers. As Elliotte Rusty Harold said here on 2010-05-08, it is impossible to successfully parse XHTML without the corresponding DTD and entity definitions. So I added a mechanism to resolve references locally based on a configuration file, where a user has to obtain the DTD and entity definition files from you manually (due to legal restrictions, they're not redistributable, so I can't pre-package them with my software). Therefore, I don't do any caching of automatic download attempts, because other DTDs than the XHTML DTD may change more frequently or may become inaccessible, and, as said, the URL is for identification, not for download.
This is how I've implemented it: xsltransformator1 (released under GNU AGPL 3 or any later version).
If you changed your licensing of the DTD and entity files (at least permitting redistribution; you could still prohibit modifications to the files), you would both improve the user experience and lower the traffic to your site.
Stephan,
We would very much like to see our schemata included in library and tool catalogs and you certainly can include them.
W3C document license
[[Permission to copy, and distribute the contents of this document, or the W3C document from which this statement is linked, in any medium for any purpose and without fee or royalty is hereby granted, provided that you include the following on ALL copies of the document, or portions thereof, that you use:]]
Thank you very much for your reply! I wasn't aware of the W3C Document License and I really appreciate the open approach of the W3C regarding licensing. Would then this modified header comment make the XHTML 1.0 Strict DTD freely redistributable? If so, I would gladly pre-package it with my XHTML-processing tools, so that no user or setup would need to download the DTD from W3C servers.
That is certainly fine, and I'm trying to get clarification (hence the delay) on whether a file in the same directory, instead of a comment, is sufficient, in case you find that preferable.
That DTD hasn't changed since 2002 and is unlikely to change but you may want to check in the future.
Ideally, XML processors would require these to be stored in a caching catalog and go by the caching directives given in HTTP. Packaging manually is certainly better than going over the net thousands of times a day (or hour) for the same resource, so thank you for doing so.
Ted, thank you very much for your help! I won't put a file with only the notification in the pre-packaged DTD directory, but will instead place a corresponding notice as a header comment in every single W3C document. The linked one was just to demonstrate how it would look in the actual DTD.
I got clarification and indeed as I suspected the license can be in a separate file instead of embedded as comment. Do whichever you prefer.
I've just applied the license to the W3C documents I'm using with my tools, so hopefully this commit isn't already a license violation; if it is, please let me know how to comply. Maybe other people can use those license headers as well. In any case, thank you very much for your efforts; I really appreciate it!
Reading even the most primitive XHTML 1.1 file with an XML processor requires no fewer than 38 files to be obtained either from W3C or from a local catalog, due to modularization. It seems that none of the required W3C files includes the W3C Document License in its header comment by default, so they're not easy to redistribute, and downloading them for each software installation might seem to be an option - if downloading is even possible, since some system IDs don't provide a URL (and therefore won't be processable at all without a redistributed local catalog).
I've just committed the XHTML 1.1 files, all extended by the W3C Document License in their header comments. The way the W3C Document License works as a redistributable license for free software packages is this: I've obtained those documents from the copyright holder (W3C), which granted me the right to distribute the documents for any purpose (including sublicensing) as long as I comply with the license. So I distribute them as part of a free software package while sublicensing the W3C documents to every recipient, who gets the right to distribute the package and the W3C documents under my sublicensing, without needing to become a licensee of W3C directly. As the W3C Document License isn't particularly freedom-protective in itself (it allows restrictive sublicensing), but is on the other hand free enough (respecting the four essential freedoms for software), I didn't sublicense the W3C documents to my users under the GNU AGPL 3 or later (yet).
I would really like to see the W3C or OASIS publish sets of catalogs for download. This would save hours of time trying to configure each tool. NetBeans allows developers to import an existing catalog.
Sorry guys, but I am a bit offended at ending up writing this. I have spent several days now trying to figure out how to write an XML Schema as it ought to be written. I have been quite diligent, and I have no idea what I have done wrong. It might be that the Java package I am using is broken (I doubt it). It might be that the schema I am trying to import (http://www.w3.org/TR/speech-grammar/grammar.xsd) does something wrong (but I doubt it). In particular, I have been working from the W3C document: http://www.w3.org/TR/xmlschema-0/#SchemaInMultDocs
Why is it so complex, and why is it such that silly mistakes (obviously) can have catastrophic consequences? I can't help feeling it is your own fault in some way. I am not doing anything weird; there are thousands trying to do the same, and thus the problem, I suspect. Please make it easier for us to use schemas the right way.
I would talk to the people behind the Java package, or the library it is based on. Quite a few development platforms understand the issue: how it is inefficient and poor design, not to mention potentially abusive, to incessantly request a remote resource while ignoring caching directives.
We see a high percentage of Java user-agent strings. I have filed bug reports with some prominent libraries that have gone unanswered. I suspect some of these are not actively maintained.
Hello,
I am totally confused! I have not visited the W3C site for more than 2 weeks, yet I receive an abuse message telling me that I have attempted to use the site more than 500 times in 10 minutes! I am trying to set up a WordPress site for a local voluntary group. I selected the theme fmedicine because it is the only one that states it is W3C validated! I believe in W3C's goals and want to adhere to its high quality levels. Could it be possible that something in the theme is accessing your site? I believe in w3c!
A heads up:
The internal XML validator of Java for Linux, used to validate XML against an XSD, with this version of Java:
$ java -version
java version "1.7.0_121"
OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.3)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)
using this script (from this location: https://github.com/amouat/xsd-validator/blob/master/xsdv.sh):
#!/bin/bash
# call xsdv
#First find out where we are relative to the user dir
callPath=${0%/*}
if [[ -n "${callPath}" ]]; then
callPath=${callPath}/
fi
echo java -cp ${callPath}build:${callPath}lib/xsdv.jar xsdvalidator.validate "$@"
java -cp ${callPath}build:${callPath}lib/xsdv.jar xsdvalidator.validate "$@"
Always makes a request to: "http://www.w3.org/2001/XMLSchema.xsd"
With the XSD I'm using (15118-2, I'm not at liberty to share this XSD)
This script is very common, and although I'm not that familiar with XML, I wouldn't be surprised if a lot of XSDs contain a reference to XMLSchema.xsd. I have a workaround for this, and I do NOT need (or want) to be unbanned either, since being banned lets me easily identify when I'm accessing an external website.
This seems fundamental to the Java implementation on Linux - so if you can get that fixed, you might eliminate a lot of requests.
I have to generate millions of XML messages for the particular project I'm working on. I had no idea that Java was stupid enough to be accessing your site every time I validated a single XML message I generated.
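For anyone else hitting this: on a JAXP 1.5-capable JDK (which 7u121 should be), the schema validator can apparently be told not to touch anything external at all. This is a sketch I have not tested against 15118-2; a schema that genuinely needs to import something over http will then fail with an explicit error instead of quietly hitting w3.org, which is arguably what you want:
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class OfflineSchemaValidation {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // JAXP 1.5 properties: an empty string means "no external access allowed".
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");

        Schema schema = factory.newSchema(new StreamSource(new File(args[0])));
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File(args[1])));
        System.out.println(args[1] + " is valid against " + args[0]);
    }
}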
-Rich
This article has some good info on how to disable external entities in various XML software:
https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet