link test suite
Writing a link checker looks simple enough. There are quite a few of them, including of course the now venerable W3C Link Checker. Ditto for web spiders and indexers. There are so many of them that it has to be a sign there is nothing very complex behind them. Basically, each of those programs just parses HTML documents, finds links, and follows them. Lather, rinse, repeat. No?
Unfortunately, the devil is always in the details. Leaving aside the fact that parsing HTML in the wild is quite a challenge in itself, there are tons of ways for this kind of software to fail at its seemingly simple job:
- There are many HTML attributes which take a URI as a value. It's not just <a href="…">
- There are some mechanisms to determine base URIs and dereference relative links which even some modern browsers don't seem to know about.
- ... and we haven't started talking about IRIs yet
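To make the first two points concrete, here is a minimal sketch in Python, using only the standard library. It is not how the W3C Link Checker does it: the names `URI_ATTRIBUTES`, `LinkExtractor` and `extract_links` are made up for illustration, the attribute table is a deliberately incomplete sample, and resolving everything against the first `<base href>` is a simplification of the real rules. IRIs are left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, incomplete sample of (element, attribute) pairs whose
# value is a URI. A real checker would need a much longer table.
URI_ATTRIBUTES = {
    ("a", "href"), ("area", "href"), ("link", "href"),
    ("img", "src"), ("script", "src"), ("iframe", "src"), ("frame", "src"),
    ("img", "longdesc"), ("blockquote", "cite"), ("q", "cite"),
    ("ins", "cite"), ("del", "cite"), ("form", "action"),
    ("object", "data"), ("body", "background"),
}


class LinkExtractor(HTMLParser):
    """Collect URI-valued attributes and remember the document's base URI."""

    def __init__(self, document_uri):
        super().__init__()
        self.base = document_uri   # default base: the document's own URI
        self.base_seen = False     # only the first <base href> is honoured
        self.raw_links = []        # attribute values, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "base":
            if not self.base_seen and attrs.get("href"):
                # A relative <base href> is resolved against the document URI.
                self.base = urljoin(self.base, attrs["href"])
                self.base_seen = True
            return
        for name, value in attrs.items():
            if value and (tag, name) in URI_ATTRIBUTES:
                self.raw_links.append(value)

    def links(self):
        # Simplification: resolve all links once parsing is done, so the
        # base applies even to links that appeared before <base>.
        return [urljoin(self.base, link) for link in self.raw_links]


def extract_links(html, document_uri):
    """Return the absolute URIs found in `html`, in document order."""
    parser = LinkExtractor(document_uri)
    parser.feed(html)
    parser.close()
    return parser.links()
```

Fed a page fetched from one host whose `<base href>` points at another, the sketch resolves every relative link, including `longdesc` and `cite` values, against the declared base rather than against the address the page was fetched from. A checker that only looks at `<a href>`, or that ignores `<base>`, will miss links or report broken ones that are in fact fine.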
To see a little more clearly through all this, we have started hacking together a "Link Test Suite", with a harness to run it against our W3C Link Checker, for a start. Join the thread on the public tools hacking list for more information (and some musings on unit testing in Python), or check out the source. As usual, this is all under the W3C open source license, so do use, contribute, hack, or just take the code and run away with it.
Filed by Olivier Théreaux on January 28, 2008 2:54 AM in Tools