W3C mailing list search services - Developer's information

This is the documentation for the W3C Mailing list search service, a search system providing easy access to the hypertext archives of all messages sent to W3C mailing lists.

The search system is closely tied to the W3C mailing list system, including its file structure and choice of archive generator (hypermail), but a developer or administrator with reasonable Perl skills should be able to adapt it to another system.

For this reason, we provide the code for the Mail Archive Search Engine (MASE) freely for others to re-use.

Principles and Architecture

This search system works under the assumption that raw e-mail messages contain a lot of well-structured data, and should be what a search system uses as a source of information, rather than the hypertext documents generated for the online archives.

A search system that indexes a different source from the one the user navigates has an obvious disadvantage compared with simpler systems: it must tie the two (the source and the generated online content) together transparently for the user.

Access realms for the W3C mailing list archives

One detail of the online mailing list archives is that they have several (cascaded) realms of access control.

The general public can access (and search) public lists, while a more limited audience can also access member-restricted lists, and a core audience (W3C Staff) can access all lists, including but not limited to Public and Member lists.

Archives with Group realm have special handling, with access for each archive restricted to the group in question. (W3C staff: see implementation details)
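The cascade described above can be sketched as an ordered set of realms. This is only an illustration: the names `Realm` and `can_access` are hypothetical, and the rule that staff can also read Group archives is an assumption drawn from "can access all lists", not from the actual implementation.

```python
from enum import IntEnum


class Realm(IntEnum):
    """Access realms, ordered so that a higher value implies wider access."""
    PUBLIC = 1
    MEMBER = 2
    TEAM = 3  # W3C staff


def can_access(user_realm, list_realm, list_group=None, user_groups=frozenset()):
    """Return True if a user may read (and search) a given list archive.

    Group-restricted archives (list_group is set) are checked against the
    user's group memberships. Staff are assumed to read everything.
    """
    if user_realm == Realm.TEAM:
        return True
    if list_group is not None:
        return list_group in user_groups
    # Cascade: a Member user can read Public lists, but not the reverse.
    return user_realm >= list_realm
```

For example, `can_access(Realm.MEMBER, Realm.PUBLIC)` is true, while `can_access(Realm.PUBLIC, Realm.MEMBER)` is false.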

Information sources

The W3C Mail Archive Search Engine (MASE) uses, as its main source, a repository of e-mail messages in maildir format (one file per message, one message per file). The generated online hypertext archives are also used as a minor indexing source.
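To illustrate why raw messages make a convenient source, here is a minimal sketch that pulls a few structured fields out of a single message using Python's standard `email` package. MASE itself relies on Namazu for this. The sample message and the `extract_fields` helper are made up for the example.

```python
from email import message_from_bytes
from email.policy import default

# A made-up message, standing in for the content of one file in the repository.
RAW_MESSAGE = (
    b"From: Alice Example <alice@example.org>\r\n"
    b"To: www-example@example.org\r\n"
    b"Subject: Structured data in mail\r\n"
    b"Message-ID: <abc123@example.org>\r\n"
    b"Date: Mon, 01 Mar 2004 10:00:00 +0000\r\n"
    b"\r\n"
    b"Headers and body are already cleanly separated.\r\n"
)


def extract_fields(raw_bytes):
    """Parse one raw message and return the fields an indexer might use."""
    msg = message_from_bytes(raw_bytes, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "message_id": str(msg["Message-ID"]).strip(),
        "from": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "body": body.get_content() if body is not None else "",
    }
```

In a maildir-style repository, the same function could be applied to the bytes read from each message file.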

Indexes

MASE uses two types of indexes:

Indexes of the information contained in the messages. The body of each message is fully indexed, as is much of the header information. Namazu is used as the back-end for this purpose.
An index of message-ids for all the e-mails in the archive, coupled to their location in the archive. A relational database system is used for that purpose.
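The second index is essentially a lookup table from message-id to archive location. A rough sketch follows, using SQLite as a stand-in for the relational database. The table name, column names and example path are assumptions made for illustration, not the schema MASE uses.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (and create if needed) a message-id to location table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS message_location ("
        " message_id TEXT PRIMARY KEY,"
        " archive_path TEXT NOT NULL)"
    )
    return conn


def record_location(conn, message_id, archive_path):
    """Store where a message lives; re-recording the same id overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO message_location VALUES (?, ?)",
        (message_id, archive_path),
    )
    conn.commit()


def resolve(conn, message_id):
    """Return the archive location for a message-id, or None if unknown."""
    row = conn.execute(
        "SELECT archive_path FROM message_location WHERE message_id = ?",
        (message_id,),
    ).fetchone()
    return row[0] if row else None
```

A call such as `record_location(conn, "<abc123@example.org>", "www-example/2004Mar/0001.html")` followed by `resolve(conn, "<abc123@example.org>")` returns the stored (hypothetical) path.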

Indexing mechanisms

Scheduled indexing

A preliminary script called lists.pl classifies lists into realms, checking that, for each list, both the repository of raw messages and the generated archives are present. It outputs flat lists in an etc subdirectory. @@TODO - change the absolute paths in search.pl and make that a variable @@
A general update script orchestrates the rest of the process:
1. For each list, update-list.sh first calls the Namazu indexing engine, which updates the message indexes with the content and metadata of messages received since the last scheduled run.
  
  Once a list has been indexed by Namazu, another script crawls the generated hypertext archive and, for each archived message, enters its message-id into the message-id database. This relies on the archiving software keeping that information, which Hypermail, used for the W3C mailing list archives, fortunately does. The script uses a double timestamp system (in a run subdirectory) in case new messages arrive during the indexing process.
2. Finally, update-realm.sh runs Namazu's indexer on all the lists for each of the main access realms. The resulting indexes are used for queries on "All Lists" (or all lists of a specific realm).
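The text does not spell out how the double timestamp system works. One plausible reading, sketched below, is that a first stamp is written before the crawl starts and only replaces the "last run" stamp once the crawl has finished, so pages arriving mid-run are picked up next time. The file names `current` and `last`, the use of file modification times, and the `incremental_run` helper are all assumptions.

```python
import os
import time
from pathlib import Path


def read_stamp(path):
    """Return the stored timestamp, or 0.0 if the stamp file does not exist."""
    try:
        return float(path.read_text())
    except FileNotFoundError:
        return 0.0


def incremental_run(run_dir, archive_dir, process):
    """Process archive pages modified since the start of the previous run."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    last_stamp = run_dir / "last"
    current_stamp = run_dir / "current"

    since = read_stamp(last_stamp)
    # First stamp: record the start time before touching the archive.
    current_stamp.write_text(repr(time.time()))

    handled = []
    for page in sorted(Path(archive_dir).rglob("*.html")):
        if page.stat().st_mtime >= since:
            process(page)
            handled.append(page)

    # Second stamp: promoted only after a complete run. Because it holds the
    # start time, pages added during this run are seen again by the next one.
    os.replace(current_stamp, last_stamp)
    return handled
```

Under this scheme a page may occasionally be processed twice, which is harmless as long as recording a message-id is idempotent.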

Realtime indexing

The W3C setup uses a MySQL database to enable real-time indexing of all incoming mail to our mailing lists.

On the lists server, the list dispatching and archiving script (update-archive) writes to a MySQL database and queues messages.
On the mail search server, a daemon (namazuIndexSingleMessage, launched by /etc/init.d/mase start|stop) polls the SQL queue and calls the Namazu indexer (mknmz) with a specific option (--no-delete) that lets it append a single message to a given index, rather than reindex the whole list each time, as Namazu would do by default.
Once indexed, the new message is removed from the SQL queue.
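The queue-and-poll pattern above can be sketched as follows. SQLite stands in for MySQL so the example is self-contained, and the table layout, function names and `index_message` callback are hypothetical. In the real daemon the callback's role is played by the call to mknmz.

```python
import sqlite3


def create_queue(conn):
    """Create the queue table that the archiving side writes to."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " list_name TEXT NOT NULL,"
        " message_path TEXT NOT NULL)"
    )


def enqueue(conn, list_name, message_path):
    """Called when a new message is archived: add it to the queue."""
    conn.execute(
        "INSERT INTO index_queue (list_name, message_path) VALUES (?, ?)",
        (list_name, message_path),
    )
    conn.commit()


def drain_queue(conn, index_message):
    """Index every queued message, removing each row only after success."""
    rows = conn.execute(
        "SELECT id, list_name, message_path FROM index_queue ORDER BY id"
    ).fetchall()
    done = 0
    for row_id, list_name, message_path in rows:
        # If indexing raises, the row stays queued and is retried later.
        index_message(list_name, message_path)
        conn.execute("DELETE FROM index_queue WHERE id = ?", (row_id,))
        conn.commit()
        done += 1
    return done
```

A daemon would call `drain_queue` in a loop, sleeping between polls. Deleting a row only after its message has been indexed means that a crash leaves unprocessed messages in the queue rather than losing them.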

Interfaces

A search interface (source) uses Namazu to query the message indexes (content and metadata) and returns results in a format specified by templates.
The search script then processes the results and formats them into various usable outputs (HTML, RDF) (source for the presenter).
The search results link to message-id based URIs, which a message-id system (source) resolves to the actual location of the message in the hypertext archive.

Getting the source

In March 2019 we moved this code into an internal repository. However, you can browse around the code and config data used on our site until then.
