The W3C web server is spending most of its time serving DTDs to
various bits of XML processing software. In a follow-up comment on an item on DTD traffic, Gerald says:
To try to help put these numbers into perspective, this blog post
is currently #1 on slashdot, #7 on reddit, the top page of
del.icio.us/popular , etc; yet www.w3.org is still serving more than
650 times as many DTDs as this blog post, according to a 10-min
sample of the logs I just checked.
Evidently there’s software out there that makes a lot of use of the
DTDs at W3C and they fetch a new copy over the Web for each use. As
far as this software is concerned, these DTDs are just data files,
much like the timezone database your operating system uses to convert
between UTC and local times. The tz database
is updated with respect to changes by various jurisdictions from time
to time and the latest version is published on the Web, but your
operating system doesn’t go fetch it over the Web for each use. It
uses a cached copy. A copy was included when your operating system
was installed and your machine checks for updates once a week or so
when it contacts the operating system vendor for security updates and
such. So why doesn’t XML software do likewise?
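The "cached copy, refreshed once a week or so" policy is simple to express. Here is a minimal sketch in Python; the function name `needs_refresh` and the one-week default are assumptions made for illustration, not something any particular XML tool implements:

```python
import os
import time

WEEK = 7 * 24 * 60 * 60  # seconds; stands in for "once a week or so"

def needs_refresh(path, max_age=WEEK, now=None):
    """Return True if the cached file is missing or older than max_age seconds."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return True  # no cached copy yet, so one has to be fetched
    return age > max_age
```

Software following this pattern would touch the network only when `needs_refresh` returns True, and read the local file for every other use.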
It’s pretty easy to put together an application out of components
in such a way that you don’t even realize that it’s fetching DTDs
all the time. For example, if you use xsltproc like this…
$ xsltproc agendaData.xsl weekly-agenda.html >,out.xml
… you might not even notice that it’s fetching the DTD and several
related files. But with a tiny HTTP
proxy, we can see the traffic. In one window, start the proxy:
$ python TinyHTTPProxy.py
Any clients will be served...
Serving HTTP on 0.0.0.0 port 8000 ...
And in another, run the same XSLT transformation with a proxy:
$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html
Now we can see what’s going on:
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:00] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd HTTP/1.0" - -
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent HTTP/1.0" - -
bye
bye
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent HTTP/1.0" - -
bye
connect to www.w3.org:80
localhost - - [05/Sep/2008 15:35:01] "GET http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent HTTP/1.0" - -
bye
This is the default behaviour of xsltproc, but
it’s not the only choice:
- You can use xsltproc --novalid to tell it to skip DTDs altogether.
- You can set up an XML catalog
as a form of local cache.
To set up this sort of cache, first grab copies of
what you need:
$ mkdir xhtml1
$ cd xhtml1/
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
--15:29:04--  http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
           => `xhtml1-transitional.dtd'
Resolving www.w3.org... 184.108.40.206, 220.127.116.11, 18.104.22.168, ...
Connecting to www.w3.org|22.214.171.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32,111 (31K) [application/xml-dtd]
100%[====================================>] 32,111       170.79K/s
15:29:04 (170.65 KB/s) - `xhtml1-transitional.dtd' saved [32111/32111]
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
...
$ wget http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
...
$ ls
xhtml1-transitional.dtd  xhtml-lat1.ent  xhtml-special.ent  xhtml-symbol.ent
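If you would rather script that step than run wget by hand, something like the following Python sketch would do. The names `populate_cache` and `_download` are made up for this example; the four file names are the ones fetched above. It skips anything already on disk, so running it twice generates no extra traffic:

```python
import os
import urllib.request

BASE = "http://www.w3.org/TR/xhtml1/DTD/"
FILES = [
    "xhtml1-transitional.dtd",
    "xhtml-lat1.ent",
    "xhtml-symbol.ent",
    "xhtml-special.ent",
]

def _download(url):
    """Fetch url over HTTP and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def populate_cache(dest, fetch=_download):
    """Save each file in FILES into dest, skipping files already present.

    Returns the list of file names that were actually fetched.
    """
    os.makedirs(dest, exist_ok=True)
    fetched = []
    for name in FILES:
        path = os.path.join(dest, name)
        if os.path.exists(path):
            continue  # already cached; no network traffic needed
        with open(path, "wb") as out:
            out.write(fetch(BASE + name))
        fetched.append(name)
    return fetched
```

Calling `populate_cache("xhtml1")` once at install time would leave the same four files in the directory as the wget session. The `fetch` parameter is there only so the function can be exercised without touching the network.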
And then, in a file in that same directory such as
xhtml-cache.xml, put a little catalog:
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI
    uriStartString="http://www.w3.org/TR/xhtml1/DTD/"
    rewritePrefix="./" />
</catalog>
Then point xsltproc to the catalog file and try it again:
$ export XML_CATALOG_FILES=~/xhtml1/xhtml-cache.xml
$ http_proxy=http://127.0.0.1:8000 xsltproc agendaData.xsl weekly-agenda.html
This time, the proxy won’t show any traffic. The data was all
accessed from local copies.
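To make the catalog less of a black box, here is a rough Python sketch of what a rewriteURI rule does to a URI. It illustrates the idea only and is not how xsltproc or libxml2 implement it; the function name `rewrite` is invented for the example. A real resolver would also resolve the rewritten prefix against the location of the catalog file, which this sketch leaves out:

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# The same catalog as in xhtml-cache.xml above.
CATALOG = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI
    uriStartString="http://www.w3.org/TR/xhtml1/DTD/"
    rewritePrefix="./" />
</catalog>
"""

def rewrite(uri, catalog_xml):
    """Apply the longest matching rewriteURI rule; return uri unchanged if none match."""
    root = ET.fromstring(catalog_xml)
    best = None  # (uriStartString, rewritePrefix) of the longest match so far
    for rule in root.iter("{%s}rewriteURI" % CATALOG_NS):
        start = rule.get("uriStartString")
        if start and uri.startswith(start):
            if best is None or len(start) > len(best[0]):
                best = (start, rule.get("rewritePrefix", ""))
    if best is None:
        return uri
    start, prefix = best
    # Swap the matched prefix for the local one, keeping the rest of the URI.
    return prefix + uri[len(start):]

if __name__ == "__main__":
    print(rewrite("http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent", CATALOG))
```

With this catalog, the DTD URI and the three entity-file URIs all map to files sitting next to the catalog, while any URI outside http://www.w3.org/TR/xhtml1/DTD/ passes through untouched.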
While XSLT processors such as xsltproc and Xalan have no
technical dependency on the XHTML DTDs, I suspect they’re used with
XHTML enough that shipping copies of the DTDs along with the XSLT
processing software is a win all around. Or perhaps the traffic comes
from the use of XSLT processors embedded in applications, and the DTDs
should be shipped with those applications. Or perhaps shipping the
DTDs with underlying operating systems makes more sense. I’d have to
study the traffic patterns more to be sure.
p.s. I’d rather not deal with DTDs at all; newer schema technologies make them obsolete as far as I’m concerned. But
- some systems were designed before better schema technology came along, and W3C’s commitment to persistence applies to those systems as well, and
- the point I’m making here isn’t specific to DTDs; catalogs work for all sorts of XML data, and the general principle of caching at install time goes beyond XML altogether.