Caching XML data at install time
The W3C web server is spending most of its time serving DTDs to various bits of XML processing software. In a follow-up comment on an item on DTD traffic, Gerald says:
To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us/popular , etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
Evidently there's software out there that makes a lot of use of the DTDs at W3C and they fetch a new copy over the Web for each use. As far as this software is concerned, these DTDs are just data files, much like the timezone database your operating system uses to convert between UTC and local times. The tz database is updated with respect to changes by various jurisdictions from time to time and the latest version is published on the Web, but your operating system doesn't go fetch it over the Web for each use. It uses a cached copy. A copy was included when your operating system was installed and your machine checks for updates once a week or so when it contacts the operating system vendor for security updates and such. So why doesn't XML software do likewise?
It's pretty easy to put together an application out of components in such a way that you don't even realize that it's fetching DTDs all the time. For example, if you use xsltproc like this...
$ xsltproc agendaData.xsl weekly-agenda.html >,out.xml
... you might not even notice that it's fetching the DTD and several related files. But with a tiny HTTP proxy, we can see the traffic. In one window, start the proxy:
$ python TinyHTTPProxy.py
Any clients will be served...
Serving HTTP on 0.0.0.0 port 8000 ...
And in another, run the same XSLT transformation with a proxy:
$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html
Now we can see what's going on:
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:00] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd HTTP/1.0" - -
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent HTTP/1.0" - -
bye
bye
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent HTTP/1.0" - -
bye
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent HTTP/1.0" - -
bye
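The source of TinyHTTPProxy.py isn't shown here, but the idea is small enough to sketch. Here's a minimal stand-in in Python (the class and helper names are mine, not from the original script): it prints a line for each proxied GET, then relays the response from the origin server.

```python
import http.server
import urllib.parse
import urllib.request

def log_line(host, request_line):
    """Format a log entry roughly like the proxy output above (my format)."""
    return 'connect to %s:80 -- %s' % (host, request_line)

class LoggingProxyHandler(http.server.BaseHTTPRequestHandler):
    """Log each proxied GET and relay the upstream response."""
    def do_GET(self):
        # For a proxy request, self.path is an absolute URI, e.g.
        # http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
        host = urllib.parse.urlsplit(self.path).netloc
        print(log_line(host, 'GET ' + self.path))
        with urllib.request.urlopen(self.path) as upstream:
            body = upstream.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

# To run it on the same address/port as in the transcript above, uncomment:
# http.server.HTTPServer(('0.0.0.0', 8000), LoggingProxyHandler).serve_forever()
```

Any client that honors the http_proxy environment variable will then show up in this log.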
This is the default behaviour of xsltproc, but it's not the only choice:
- You can pass xsltproc the --novalid option to tell it to skip DTDs altogether.
- You can set up an XML catalog as a form of local cache.
To set up this sort of cache, first grab copies of what you need:
$ mkdir xhtml1
$ cd xhtml1/
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
--15:29:04--  http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
           => `xhtml1-transitional.dtd'
Resolving www.w3.org... 128.30.52.53, 128.30.52.52, 128.30.52.51, ...
Connecting to www.w3.org|128.30.52.53|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32,111 (31K) [application/xml-dtd]

100%[====================================>] 32,111       170.79K/s

15:29:04 (170.65 KB/s) - `xhtml1-transitional.dtd' saved [32111/32111]

$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
...
$ ls
xhtml1-transitional.dtd  xhtml-lat1.ent  xhtml-special.ent  xhtml-symbol.ent
And then in a file such as xhtml-cache.xml, put a little catalog:
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI uriStartString="http://www.w3.org/TR/xhtml1/DTD/"
              rewritePrefix="./" />
</catalog>
Then point xsltproc to the catalog file and try it again:
$ export XML_CATALOG_FILES=~/xhtml1/xhtml-cache.xml
$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html
This time, the proxy won't show any traffic. The data was all accessed from local copies.
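The rewriteURI rule in the catalog above is just prefix replacement. For readers unfamiliar with catalog semantics, the rule is simple enough to mirror in a few lines of Python (the function name here is mine, purely illustrative):

```python
def rewrite_uri(uri, uri_start_string, rewrite_prefix):
    """Apply an XML-catalog rewriteURI rule to one URI: if the URI begins
    with uriStartString, replace that prefix with rewritePrefix."""
    if uri.startswith(uri_start_string):
        return rewrite_prefix + uri[len(uri_start_string):]
    return uri

# With the catalog above, a W3C DTD URI maps to a file beside the catalog:
local = rewrite_uri('http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent',
                    'http://www.w3.org/TR/xhtml1/DTD/', './')
# local == './xhtml-lat1.ent'
```

libxml2 (and hence xsltproc) does this resolution for every system identifier it is asked to fetch, which is why the proxy stays quiet.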
While XSLT processors such as xsltproc and Xalan have no technical dependency on the XHTML DTDs, I suspect they're used with XHTML enough that shipping copies of the DTDs along with the XSLT processing software is a win all around. Or perhaps the traffic comes from the use of XSLT processors embedded in applications, and the DTDs should be shipped with those applications. Or perhaps shipping the DTDs with underlying operating systems makes more sense. I'd have to study the traffic patterns more to be sure.
p.s. I'd rather not deal with DTDs at all; newer schema technologies make them obsolete as far as I'm concerned. But
- some systems were designed before better schema technology came along, and W3C's commitment to persistence applies to those systems as well, and
- the point I'm making here isn't specific to DTDs; catalogs work for all sorts of XML data, and the general principle of caching at install time goes beyond XML altogether.
Isn't it possible to back-trace the IP addresses that fetch the DTDs and find the sources? If the top 90% of the hits come from just a few sources, it should be possible to contact those sources directly and ask them to do non-validating parsing or otherwise relieve W3C's servers of the constant DTD downloading.
Most XML libraries I've used (including System.Xml in Microsoft .NET) validate by default and thus download all the DTDs referenced in the XML documents they are told to parse. They also do this invisibly: unless you're behind a strict firewall, you won't notice anything besides a bit of latency during parsing. Even behind a strict firewall, Microsoft .NET won't tell you exactly what's causing the error; you only get a mysterious TimeoutException.
I think the right place to bundle the DTDs is within the XML parsers themselves. They could then default to using locally cached copies of the DTDs instead of fetching them online, and the cache could be extended with any new DTDs the parser finds, kept locally at least until the online version has a new ETag or Last-Modified date.
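The parser-level bundling idea can be sketched with a standard SAX entity resolver in Python. The mapping table and local paths here are hypothetical, just to show the shape of the mechanism:

```python
import xml.sax
import xml.sax.handler

# Hypothetical table of bundled copies shipped with the parser;
# the local path is an assumption, not a real installation layout.
LOCAL_DTDS = {
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd':
        '/usr/share/xml/xhtml1/xhtml1-transitional.dtd',
}

def resolve_system_id(system_id):
    """Redirect a well-known remote system id to a bundled local copy,
    falling back to the original id when we have no copy."""
    return LOCAL_DTDS.get(system_id, system_id)

class BundledDTDResolver(xml.sax.handler.EntityResolver):
    def resolveEntity(self, public_id, system_id):
        return resolve_system_id(system_id)

parser = xml.sax.make_parser()
parser.setEntityResolver(BundledDTDResolver())
# parser.parse('weekly-agenda.html')  # would now read the DTD locally
```

The same resolver hook is where the "extend the cache as you go" logic would live: on a miss, fetch once, save the copy, and add it to the table.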
Perhaps there would even be enough bytes to save if the XML parsers did conditional GETs (If-Modified-Since / If-None-Match) instead of ordinary GETs when fetching the DTDs? That still requires a local cache, but no pre-bundling is needed, and the cache logic is kept to a bare minimum, complexity-wise.
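A conditional GET of that kind is a few lines with Python's standard library (function names here are mine): send the validators from the cached copy, and treat a 304 response as "keep what you have".

```python
import urllib.error
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET that asks the server to reply 304 if nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header('If-None-Match', etag)
    if last_modified:
        req.add_header('If-Modified-Since', last_modified)
    return req

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (status, body); (304, None) means the cached copy is fresh."""
    req = conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, None  # cache hit: reuse the local copy
        raise
```

A 304 still costs a round trip per DTD, but it moves the bytes served from tens of kilobytes to a few hundred per request.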
I think it's worth mentioning that another solution would be to use a caching HTTP proxy in front of the client application. Instead of accessing the w3.org site directly, the client application would access the proxy, which would use the cache-control features of HTTP to decide whether to return a locally cached copy or fetch a fresh one from w3.org.
We do back-trace the IP addresses and contact the worst abusers, but (a) that takes a lot of time and we're not getting ahead of the curve, and (b) much of the load is caused by software that is widely distributed, so the party we really should contact is upstream of the parties at those IP addresses.
Yes, distributing DTDs with XML parsers makes sense.
Using a caching HTTP proxy in front of the client application is likely to be effective in the case of one high-traffic client, but less so in these widely distributed cases, since it would require effort by lots of end users.
Outboard caching proxies make sense when the fetching software comes from one party and the proxy comes from another, but the point of this article is to get developers of fetching software to bundle some caching mechanism and pre-seed the cache at install time.
Build in a one-second delay when serving the DTDs; any event-based server (including Jigsaw) should be able to take it...