DmozRdf

From W3C Wiki

The Open Directory Project (ODP, also known as dmoz) makes dumps of its Web directory available in an RDF-like XML format.

There have been some issues with the exact formatting of the data, however. See ODP/dmoz data dump error report for some problems. Although the dumps are frequently referred to as RDF, they are not actually standard RDF. The ODP data dump format was created by one of the architects of the W3C RDF standard, but this happened before the standard was completed. By the time the standard was finalized, it had diverged quite a bit from the format used in the ODP dump. The ODP folks have never updated the dump format because, by the time anyone realized it was non-standard, it was already in regular use by a lot of sites (including Google).

To parse the dump, you can use either generic XML tools or custom ODP-RDF parsing tools. Standard RDF tools are likely to have problems.

What tools, scripts, etc. are available for cleaning up the RDF representation of the ODP?

Most of the reports of formatting errors in the dump have been due to two issues: 1) the not-quite-RDF format, which causes parsing problems for people using RDF parsers, and 2) prior to 2003, very frequent occurrences of bad UTF-8 encoding and/or illegal XML characters in the data. The second kind of error occurs because much of the ODP data comes from foreign-language categories that use local (non-UTF-8) encodings, and the editing scripts have had insufficient error checking to prevent locally encoded data from leaking into the UTF-8 encoded data dump. This problem has almost been eliminated at this point thanks to improved error checking, a plan to convert all categories to native UTF-8 (this should be completed in 2003), and a weekly error report that alerts ODP staff to problems.

If the error report shows no errors, you can expect to have no problems parsing the ODP data dump as a UTF-8 encoded XML file. You will still run into problems any time you try to parse it as a standard RDF file, since the ODP dump isn't one.

If the error report shows errors but you don't want to wait for next week's dump, you will need to purge the errors yourself. If the report shows only bad UTF-8 byte sequences, this is easy to fix. At the command line on any system with iconv installed, type:

 iconv -c -f UTF-8 -t UTF-8 dirty.rdf.u8 > clean.rdf.u8

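The -c option tells iconv to discard any input that cannot be converted, so this strips the bad UTF-8 byte sequences and leaves everything else untouched.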
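Before cleaning, it can be useful to confirm that a downloaded dump really contains bad sequences and to see how much data the cleaning pass would discard. The following is a minimal sketch, not part of any ODP tooling: the function names is_valid_utf8 and count_dropped_bytes are made up for illustration, and the sketch assumes an iconv that exits with a non-zero status on invalid input when -c is not given (as GNU iconv does).

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# Without -c, iconv stops with a non-zero exit status at the first
# invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: print how many bytes "iconv -c" would drop.
count_dropped_bytes() {
    before=$(wc -c < "$1")
    after=$(iconv -c -f UTF-8 -t UTF-8 "$1" 2> /dev/null | wc -c)
    echo $((before - after))
}
```

For example, `is_valid_utf8 dirty.rdf.u8 || count_dropped_bytes dirty.rdf.u8` prints a byte count only when the file needs cleaning.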

If the error report shows any illegal XML characters, you'll need to get rid of those too. A simple Perl script can do the job for you here:

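 #!/usr/bin/perl
 # Strip control characters that are illegal in XML 1.0
 # (everything below 0x20 except tab, newline and carriage return).
 use strict;
 use warnings;
 while (<>) { s/[\x00-\x08\x0b\x0c\x0e-\x1f]//g; print; }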

Save the script as clean.pl and use it like so:

 perl clean.pl < dirty.rdf.u8 > clean.rdf.u8
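If the report shows both kinds of error, the two cleaning passes above can be chained into a single pipeline. The sketch below is one possible way to do that and swaps the Perl script for tr, which deletes the same set of control characters; the function name clean_dump is made up for illustration.

```shell
# Hypothetical wrapper combining both cleaning passes; writes to stdout.
clean_dump() {
    # Pass 1: drop byte sequences that are not valid UTF-8.
    # Pass 2: delete the control characters that XML 1.0 forbids
    # (octal 000-010, 013, 014 and 016-037), keeping tab, newline and
    # carriage return. LC_ALL=C makes tr work on single bytes, so
    # multi-byte UTF-8 characters pass through unchanged.
    iconv -c -f UTF-8 -t UTF-8 "$1" 2> /dev/null |
        LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}
```

Usage would then look like `clean_dump dirty.rdf.u8 > clean.rdf.u8`.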

Other ODP data dump parsing and cleaning tools:

  • Steve's ODP/dmoz software: sample code to import the ODP data dump into an SQL database, and links to other ODP tools and information.
  • Sergey Melnik's scripts: these are a bit out of date and attempt both to purge illegal characters and to convert the format from ODP-specific to standard RDF. The ODP format has evolved since these scripts were written, and they may not work with recent dumps. Use at your own risk.
  • others?
