Writing a link checker looks simple enough. There are quite a few of them, including of course the now venerable W3C Link Checker. Ditto for web spiders and indexers. That there are so many of them must surely be a sign that there is nothing very complex behind them. Basically, each of these programs just parses HTML documents, finds links, and follows them. Lather, rinse, repeat. No?
Unfortunately, the trouble and the devil are always in the details. Notwithstanding the fact that parsing HTML in the wild is quite a challenge in itself, there are plenty of ways for this software to fail at its seemingly simple job:
- There are many HTML attributes that take a URI as a value; it's not just <a href="…">.
- There are some mechanisms to determine base URIs and dereference relative links which even some modern browsers don’t seem to know about.
- … and we haven’t started talking about IRIs yet
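To make the first two points concrete, here is a minimal sketch of the kind of work involved, using only the Python standard library. The attribute list and the sample document are illustrative, not exhaustive, and real-world handling (IRIs, `xml:base`, malformed markup) would need considerably more care:

```python
# A sketch of link extraction with base-URI resolution, using only the
# standard library. Illustrative only: URI_ATTRS is far from complete.
from html.parser import HTMLParser
from urllib.parse import urljoin

# A few of the many (tag, attribute) pairs that carry URIs -- not just <a href>.
URI_ATTRS = {
    ("a", "href"), ("link", "href"), ("img", "src"),
    ("script", "src"), ("form", "action"), ("blockquote", "cite"),
}

class LinkCollector(HTMLParser):
    """Collect URIs, resolving relative links against the document's base URI.

    A <base href="..."> element, if present, overrides the URI the document
    was retrieved from -- one of the base-URI mechanisms that clients
    sometimes get wrong.
    """
    def __init__(self, document_uri):
        super().__init__()
        self.base = document_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and "href" in attrs:
            self.base = urljoin(self.base, attrs["href"])
            return
        for (t, name) in URI_ATTRS:
            if tag == t and name in attrs:
                self.links.append(urljoin(self.base, attrs[name]))

doc = """<html><head><base href="http://example.org/docs/"></head>
<body><a href="page.html">a</a><img src="../pix/logo.png"></body></html>"""

collector = LinkCollector("http://example.org/")
collector.feed(doc)
# Relative links resolve against the <base> element, not the retrieval URI:
# ['http://example.org/docs/page.html', 'http://example.org/pix/logo.png']
print(collector.links)
```

Note that this sketch stops at the easy part: it does nothing about IRIs, whose non-ASCII characters must be percent-encoded as UTF-8 before being dereferenced, which is exactly where many tools stumble.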
To get a clearer view of all this, we have started hacking together a “Link Test Suite”, with a harness to run it against our W3C Link Checker, for a start. Join the thread on the public tools hacking list for more information (and some musings on unit testing in Python), or check out the source. As usual, this is all under the W3C open source license, so do use, contribute, hack, or just take the code and run away with it.