W3C Mail Search Engine updated

W3C’s MASE search engine got a nice New Year present, in the shape of a new server. The additional computing muscle will be put to good use indexing and searching the hundreds of thousands of messages on the W3C’s mailing-list forums: at the time of this writing, the public mailing-lists hold more than 600,000 messages, and that’s not counting the internal lists of W3C staff and collaborators.

Moving to a new server was an opportunity to update the Namazu engine which powers this search engine, create new, clean indexes, and hack a little on new features.

Creating indexes from scratch led us to discover a tricky little bug in the way the mail search sorts results. So far, we had been using namazu --sort=date to sort results by date, but as we found out, this does not sort results according to the Date: mail header, but according to the timestamp of the locally archived mail file. The proper syntax is actually --sort=field:utc. Can’t make this one up, and it isn’t properly documented, but fortunately the Namazu mailing-lists got us the answer in no time.
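To make the difference concrete, here is a minimal Python sketch of the two orderings. It is not Namazu's code: the tuple layout, the function names and the sample messages are all made up for illustration. It only shows why sorting on the archive file's timestamp goes wrong once old mails have been re-archived.

```python
from email.utils import parsedate_to_datetime

# Hypothetical records: (Date: header, mtime of the archived file, subject).
# The older mail was written to disk last, e.g. after rebuilding the archive.
messages = [
    ("Mon, 03 Jan 2005 10:00:00 +0000", 1200000300.0, "older mail, re-archived last"),
    ("Tue, 02 Jan 2007 09:30:00 +0100", 1200000100.0, "newer mail, re-archived first"),
]


def sort_by_file_mtime(msgs):
    """Newest first according to the file timestamp (the buggy ordering)."""
    return sorted(msgs, key=lambda m: m[1], reverse=True)


def sort_by_date_header(msgs):
    """Newest first according to the parsed Date: header (what we wanted)."""
    return sorted(msgs, key=lambda m: parsedate_to_datetime(m[0]), reverse=True)


if __name__ == "__main__":
    print(sort_by_file_mtime(messages)[0][2])
    print(sort_by_date_header(messages)[0][2])
```

With this sample data, the first ordering puts the 2005 mail on top simply because its file is the most recent one on disk, while the second puts the 2007 mail first.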

The new feature was low-hanging fruit. Namazu was already indexing the Date: mail headers, and it supports field-specific searching, which we already use to search for messages from a specific person or with a given Subject:. By adding Date:-specific search, we now have a way to filter results and show only mails received in a given month, or a given year. Extending this feature to allow filtering not only by a given month but by an arbitrary time span would be even better, but… that will be for later.
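The filtering logic itself is simple. The sketch below shows the idea in plain Python rather than in Namazu's query syntax; the function name and the sample data are hypothetical, and it compares against the date as written by the sender, without converting time zones.

```python
from email.utils import parsedate_to_datetime


def filter_by_date(messages, year, month=None):
    """Keep (date_header, subject) pairs sent in the given year,
    and in the given month when one is specified."""
    kept = []
    for date_header, subject in messages:
        sent = parsedate_to_datetime(date_header)
        if sent.year == year and (month is None or sent.month == month):
            kept.append((date_header, subject))
    return kept


if __name__ == "__main__":
    sample = [
        ("Mon, 03 Jan 2005 10:00:00 +0000", "first"),
        ("Tue, 02 Jan 2007 09:30:00 +0100", "second"),
        ("Thu, 15 Mar 2007 17:45:00 +0000", "third"),
    ]
    print(filter_by_date(sample, 2007))      # whole year
    print(filter_by_date(sample, 2007, 3))   # a single month
```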

In theory, Namazu is not supposed to be used on such a massive set of data as the W3C lists, but it works: our hacking was, and is, mostly limited to interfacing the engine with our lists system, and making the indexing of individual lists real-time. The real-time indexing is a clever hack, but nothing complicated: our lists server creates a queue of messages that need to be indexed, and we feed that to the Namazu indexer, mknmz. One would usually give mknmz a whole directory to check and re-index, which is way too slow. And if given a single file, mknmz would generally replace the contents of its existing index with this sole file.

The trick? Feed mknmz a single file, but run it with the --no-delete option, so that it keeps the existing index content and just appends the new file to it. We were afraid this might eventually corrupt the indexes, but so far, so good.
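A rough sketch of that queue-draining loop, in Python, could look like the following. This is not our actual MASE code: the function names and paths are invented, and the --output-dir option used to point mknmz at the index directory is an assumption worth checking against mknmz --help for your version. The sketch only builds the command lines, so it can be tried without Namazu installed.

```python
from collections import deque


def build_mknmz_command(index_dir, message_path):
    """Command line to append one message to an existing index.

    --no-delete comes from the trick described above; --output-dir is
    assumed here to be the way to select the target index directory.
    """
    return ["mknmz", "--no-delete", f"--output-dir={index_dir}", message_path]


def drain_queue(queue, index_dir):
    """Turn every queued message path into one mknmz invocation, in order."""
    commands = []
    while queue:
        commands.append(build_mknmz_command(index_dir, queue.popleft()))
    return commands


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    pending = deque(["/archives/www-style/0001.html", "/archives/www-style/0002.html"])
    for command in drain_queue(pending, "/indexes/www-style"):
        print(" ".join(command))
```

To actually run the indexer, each command list could be passed to subprocess.run(command, check=True), so that a failing mknmz call stops the loop instead of being silently skipped.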

Namazu is developed as an open-source project, and the source of our MASE search system is open too (albeit a little messy and undocumented, I’m afraid).
