Caching XML data at install time

The W3C web server is spending most of its time serving DTDs to various bits of XML processing software. In a follow-up comment on an item on DTD traffic, Gerald says:

To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us/popular, etc.; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.

Evidently there's software out there that makes heavy use of the DTDs at W3C and fetches a new copy over the Web for each use. As far as this software is concerned, these DTDs are just data files, much like the timezone database your operating system uses to convert between UTC and local times. The tz database is updated from time to time as various jurisdictions change their rules, and the latest version is published on the Web, but your operating system doesn't go fetch it over the Web for each use. It uses a cached copy. A copy was included when your operating system was installed, and your machine checks for updates once a week or so when it contacts the operating system vendor for security updates and such. So why doesn't XML software do likewise?

It's pretty easy to put together an application out of components in such a way that you don't even realize that it's fetching DTDs all the time. For example, if you use xsltproc like this...

$ xsltproc agendaData.xsl weekly-agenda.html >,out.xml

... you might not even notice that it's fetching the DTD and several related files. But with a tiny HTTP proxy, we can see the traffic. In one window, start the proxy:

$ python TinyHTTPProxy.py 
Any clients will be served...
Serving HTTP on 0.0.0.0 port 8000 ...

And in another, run the same XSLT transformation with a proxy:

$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html

Now we can see what's going on:

	connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:00] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd HTTP/1.0" - -
	connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent HTTP/1.0" - -
	bye
	bye
	connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent HTTP/1.0" - -
	bye
	connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent HTTP/1.0" - -
	bye

This is the default behaviour of xsltproc, but it's not the only choice:

  • Running xsltproc --novalid tells it to skip loading DTDs altogether.
  • You can set up an XML catalog as a form of local cache.

To set up this sort of cache, first grab copies of what you need:

$ mkdir xhtml1
$ cd xhtml1/
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
--15:29:04--  http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
           => `xhtml1-transitional.dtd'
Resolving www.w3.org... 128.30.52.53, 128.30.52.52, 128.30.52.51, ...
Connecting to www.w3.org|128.30.52.53|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32,111 (31K) [application/xml-dtd]

100%[====================================>] 32,111       170.79K/s             

15:29:04 (170.65 KB/s) - `xhtml1-transitional.dtd' saved [32111/32111]
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
...
$ ls
xhtml1-transitional.dtd  xhtml-lat1.ent  xhtml-special.ent  xhtml-symbol.ent
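
Rather than copying URLs by hand, the list of files to fetch can be read off the proxy output shown earlier. The following Python sketch assumes log lines in exactly the format printed above; the helper name requested_urls is made up for this example and is not part of TinyHTTPProxy.

```python
import re

# Matches the request part of a proxy log line such as:
# localhost - - [05/Sep/2008 15:35:00] "GET http://... HTTP/1.0" - -
REQUEST = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def requested_urls(log_lines):
    """Return each requested URL once, in order of first appearance."""
    seen = []
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# A few lines copied from the proxy output above.
sample = [
    "\tconnect to www.w3.org:80",
    'localhost - - [05/Sep/2008 15:35:00] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd HTTP/1.0" - -',
    'localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent HTTP/1.0" - -',
    "\tbye",
]
for url in requested_urls(sample):
    print(url)
```

Each printed URL is one file that belongs in the local cache directory.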

And then, still in that directory, put a little catalog in a file such as xhtml-cache.xml:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI
      uriStartString="http://www.w3.org/TR/xhtml1/DTD/"
      rewritePrefix="./" />
</catalog>
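
To make the effect of the rewriteURI entry concrete, here is a minimal Python sketch of the prefix rewriting it describes. It is not how libxml2 implements catalogs; the function name rewrite_uri and the catalog location under /home/user/xhtml1 are assumptions made for illustration. It follows two rules from the OASIS XML Catalogs specification: when several entries match, the longest uriStartString wins, and a relative rewritePrefix such as "./" is resolved against the location of the catalog file itself.

```python
from urllib.parse import urljoin

# Hypothetical rule table mirroring the catalog above:
# (uriStartString, rewritePrefix) pairs.
RULES = [("http://www.w3.org/TR/xhtml1/DTD/", "./")]


def rewrite_uri(uri, rules, catalog_base):
    """Return the rewritten URI, or None if no rule matches.

    catalog_base is the URI of the catalog file; a relative
    rewritePrefix is resolved against it.
    """
    matches = [(start, prefix) for start, prefix in rules if uri.startswith(start)]
    if not matches:
        return None
    # The longest matching start string takes precedence.
    start, prefix = max(matches, key=lambda rule: len(rule[0]))
    # Swap the matched prefix for the resolved local prefix.
    return urljoin(catalog_base, prefix) + uri[len(start):]


# Assumed location of the catalog file, for illustration only.
base = "file:///home/user/xhtml1/xhtml-cache.xml"
local = rewrite_uri("http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent", RULES, base)
print(local)
```

This is why the catalog file sits next to the downloaded DTD and entity files: with rewritePrefix="./", every URI under the W3C prefix maps to a file of the same name beside the catalog.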

Then point xsltproc to the catalog file and try it again:

$ export XML_CATALOG_FILES=~/xhtml1/xhtml-cache.xml
$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html

This time, the proxy won't show any traffic. The data was all accessed from local copies.

While XSLT processors such as xsltproc and Xalan have no technical dependency on the XHTML DTDs, I suspect they're used with XHTML enough that shipping copies of the DTDs along with the XSLT processing software is a win all around. Or perhaps the traffic comes from the use of XSLT processors embedded in applications, and the DTDs should be shipped with those applications. Or perhaps shipping the DTDs with underlying operating systems makes more sense. I'd have to study the traffic patterns more to be sure.

p.s. I'd rather not deal with DTDs at all; newer schema technologies make them obsolete as far as I'm concerned. But

  • some systems were designed before better schema technology came along, and W3C's commitment to persistence applies to those systems as well, and
  • the point I'm making here isn't specific to DTDs; catalogs work for all sorts of XML data, and the general principle of caching at install time goes beyond XML altogether.
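
As a rough illustration of that general principle, the Python sketch below looks for a copy shipped with the software before touching the network. Everything named here (open_resource, the bundled directory, the fetch callback) is hypothetical; it is one way such a lookup could be structured, not a description of any existing tool.

```python
import os
import tempfile
from urllib.parse import urlparse


def open_resource(url, bundled_dir, fetch):
    """Return the bytes for url, preferring a copy shipped at install time.

    bundled_dir holds files named after the last path segment of the URL.
    fetch is called only when no bundled copy exists.
    """
    name = os.path.basename(urlparse(url).path)
    local_path = os.path.join(bundled_dir, name)
    if name and os.path.isfile(local_path):
        with open(local_path, "rb") as handle:
            return handle.read()
    return fetch(url)


# A stand-in for a real HTTP fetch that records what would hit the network.
network_requests = []


def fake_fetch(url):
    network_requests.append(url)
    return b"fetched"


# A temporary directory stands in for the install tree.
with tempfile.TemporaryDirectory() as bundled:
    with open(os.path.join(bundled, "xhtml-lat1.ent"), "wb") as handle:
        handle.write(b"bundled copy")
    cached = open_resource(
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent", bundled, fake_fetch)
    missing = open_resource(
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent", bundled, fake_fetch)

print(cached, missing, network_requests)
```

Only the file that was not shipped triggers a fetch; a periodic update check, like the one operating systems do for the tz database, would refresh the bundled copies separately.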

 
