RDF for mail filtering: FOAF whitelists

Dan Brickley <danbri@w3.org>

This seems to be happening to a lot of us recently: we finally decide to do something about all the annoying spam messages cluttering up our mailboxes. This is a writeup of my work in progress on the use of RDF for exchanging mail-filtering information. I have an implementation, 'foafwhite', built using my RubyRDF tools.

Background

A few weeks ago (2001-12-31) I decided (New Year's resolution) that spam was wasting so much of my time that it was worth making the effort to set up a filtering system. I read up on procmail and installed some filters that do basic content analysis to route the most obvious spams to a 'probably spam' folder, which I check periodically. While there were surprisingly few false positives (mostly friends using '!!!' too liberally), lots of spam escaped detection. So I've been looking for a better approach. Other folk have been using whitelist-based filtering, which is based on the idea that you keep a 'whitelist' of known email addresses, and filter unknown senders into a folder for occasional scrutiny. After some bad spam weather, I decided to try combining this technique with content-based filtering, so that genuine messages from unknown addresses would also be separated from the most obvious spam. This document is mostly about the use of RDF to exchange whitelist data, so that we minimise false positives in whitelist-based filtering.

Whitelist-based spam filtering

The basic whitelist idea is simple. Keep a text file with a list of email addresses, and when email arrives, have a mail filtering tool check to see if you know the sender's (alleged) address.

Everything I learned about this came from Gerald's whitelist-based spam filtering page. Rather than repeat his words here, go read his writeup.

RDF Whitelist data exchange: the problem

Having installed whitelist filtering, you'll soon start seeing non-spam messages appear in your 'unknown senders' folder, since you won't have a complete list of all the email addresses of people who'll be writing to you. So the obvious idea is to share whitelists with friends and colleagues, to get more complete coverage of the addresses most likely to write to you. And the obvious problem is that people aren't entirely comfortable sharing text files that list every non-spam email address they're familiar with. There are privacy issues for the sharer, and for the owners of the addresses being shared.

This is an experiment based on the idea of sharing lists of garbled email addresses, ie instead of sharing 'mailto:danbri@w3.org' we might share '357fdd378d61684762ed88277192cfdf001189af', which is what we get when we feed that address to the sha1 algorithm. Consumers of this data can do the same thing with addresses from incoming mail, and then check to see if the resulting value is on the (garbled) whitelist.
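The garbling step is small enough to sketch. This is a minimal illustration using Ruby's standard Digest library, not an excerpt from foafwhite.rb; the helper names garble and garble_address are made up for this example. The one thing that matters is that publisher and consumer hash exactly the same string, 'mailto:' prefix included.

```ruby
require 'digest/sha1'

# Hash a mailbox URI (e.g. 'mailto:someone@example.com') with SHA-1 and
# return the 40-character lowercase hex digest.
# (garble is a hypothetical helper name, not part of foafwhite.rb.)
def garble(mailbox_uri)
  Digest::SHA1.hexdigest(mailbox_uri)
end

# Turn a bare address taken from a mail header into a mailto: URI before
# hashing, so both sides of the exchange hash the same string.
def garble_address(address)
  garble("mailto:#{address.strip}")
end

puts garble_address('danbri@w3.org')
```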

There's no great call for using RDF or XML to exchange such simple data, but since I want to use the same infrastructure for exchanging other mail-related information (eg. descriptions of mailing lists, PGP keys), using RDF made sense for me.

FOAFWhite prototype

So I built a prototype. It was pretty easy. I've not hooked it up to my mail system yet since there's only one feed (mine). It works though.

Here's my foafwhite.xml file. This is generated from my private .whitelist file, by running ./foafwhite.rb --export

The 'foafwhite' implementation uses my RubyRDF package, and consists of a single script, 'foafwhite.rb' (which contains some more embedded documentation). It needs Ruby, the RDF utilities from the file basicrdf.rb, as well as an (external) RDF parser that can generate N-Triples. I use 'rdfdump' from (the CVS version of) Redland. N-Triples is a line-oriented text format that can represent RDF graphs; it isn't a formal W3C spec and isn't good for internationalisation etc. But for this app it got a prototype working (since there's no native Ruby RDF parser and I don't have API-based access to the Redland RDF parser yet). I'm working on improving the situation (eg. using an XSLT RDF parser).

Other Implementations

Some more tools for producing this data: foafwhite.py (mnot, uses LDAP to generate list); shamail.py (aaronsw; takes list from STDIN). Note that simple-minded tools for consuming this data could probably be constructed with grep (so long as the sha1'd addresses weren't split over multiple lines).
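For what it's worth, the grep idea might look something like this in Ruby. It is only a sketch: extract_hashes is a hypothetical name, and the regular expression assumes the attribute is written as sha1Value="..." with the whole 40-digit hash intact, as in the data format shown in this document. A real RDF parser is the robust route.

```ruby
# Pull every 40-hex-digit sha1Value out of a whitelist document with a
# regular expression instead of an RDF parser (the "grep" approach).
# A hash that has been split across lines will simply not match.
def extract_hashes(rdf_xml_text)
  rdf_xml_text.scan(/sha1Value="([0-9a-fA-F]{40})"/).flatten.map(&:downcase)
end

sample = <<~XML
  <foaf:NonSpamMailboxURI foaf:sha1Value="721ae0b3232bf1ce6486d952fa6629ff31e6edf6"/>
  <foaf:NonSpamMailboxURI foaf:sha1Value="fb7efcdeb2e9ea622c8afd337299cd3b58cd35ec"/>
XML

puts extract_hashes(sample)
```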

Data format

The data looks like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/">

<foaf:NonSpamMailboxURI foaf:sha1Value="721ae0b3232bf1ce6486d952fa6629ff31e6edf6"/>

<foaf:NonSpamMailboxURI foaf:sha1Value="fb7efcdeb2e9ea622c8afd337299cd3b58cd35ec"/>

</rdf:RDF>

The corresponding RDF schema descriptions of foaf:NonSpamMailboxURI and foaf:sha1Value have not yet been written. Those should make it clear that asserting that a mailbox is a non-spam mailbox is no kind of guarantee or promise. All you're saying if you assert this is that the mailbox isn't known to be implicated in any spamming incidents (yet, etc...).
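Producing a file in this format from a plain list of addresses takes only a few lines. The following is a rough sketch of what an export step could look like, not the actual code behind ./foafwhite.rb --export; export_whitelist is a hypothetical name, and it assumes each input address is a bare mailbox (no 'mailto:' prefix).

```ruby
require 'digest/sha1'

# Build an RDF/XML whitelist in the format shown above from a list of
# bare email addresses. Only the SHA-1 digests end up in the output.
def export_whitelist(addresses)
  rdf_ns  = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  foaf_ns = 'http://xmlns.com/foaf/0.1/'
  lines = ["<rdf:RDF xmlns:rdf=\"#{rdf_ns}\" xmlns:foaf=\"#{foaf_ns}\">"]
  addresses.each do |addr|
    digest = Digest::SHA1.hexdigest("mailto:#{addr.strip}")
    # Hex digests contain only 0-9 and a-f, so no XML escaping is needed.
    lines << "<foaf:NonSpamMailboxURI foaf:sha1Value=\"#{digest}\"/>"
  end
  lines << '</rdf:RDF>'
  lines.join("\n")
end

puts export_whitelist(['alice@example.org', 'bob@example.org'])
```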

Case sensitive? Initially I downcased everything, before realising that the bit before the @ is case sensitive. Implementations can (but needn't -- this is a bit fiddly) downcase the domain-name portion of the email address, for mailto: mailbox URIs.
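One possible way to do the domain-only downcasing is sketched here. normalise_address is a hypothetical helper, not something foafwhite.rb is known to contain, and it only helps if whoever publishes a whitelist and whoever checks against it apply the same rule before hashing.

```ruby
# Downcase only the domain part of an address, leaving the local part
# (everything before the last '@') untouched, since it may be case sensitive.
def normalise_address(address)
  trimmed = address.strip
  local, at, domain = trimmed.rpartition('@')
  return trimmed if at.empty? # no '@' found: leave the string alone
  "#{local}@#{domain.downcase}"
end

puts normalise_address('Dan.Brickley@W3.ORG')
```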

What it does

Using an external RDF parser, we can read in RDF/XML whitelists in the above format (@@todo: add these constructs to the FOAF schema).

These are then stored on disk in an (S)DBM file in a form that's reasonably fast to check. I could use a general RDFdb but for now this is fine.

Usage: ./foafwhite.rb --check suspect-address@hmm.example.com
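The lookup itself amounts to a set-membership test on the hashed address. The sketch below stands in for the DBM file with a plain text file of digests (one per line) loaded into a Set; that storage choice and the names load_whitelist and whitelisted? are assumptions made for illustration, not a description of how foafwhite.rb stores its data.

```ruby
require 'digest/sha1'
require 'set'

# Read a file containing one hex digest per line into a Set.
# (A stand-in for the on-disk DBM file; blank lines are ignored.)
def load_whitelist(path)
  File.readlines(path, chomp: true).map(&:strip).reject(&:empty?).to_set
end

# True if the SHA-1 of the sender's mailto: URI is on the whitelist.
def whitelisted?(whitelist, address)
  whitelist.include?(Digest::SHA1.hexdigest("mailto:#{address.strip}"))
end
```

A mail filter would call whitelisted? with the address from the incoming message's From: header and route the message to the 'unknown senders' folder when it returns false.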

Sample whitelists

Here are some whitelists people have generated:

WARNING!

Before exposing any of your information (or your email correspondents) with these tools or techniques, be aware:

the software is barely tested; check it to your own satisfaction before use (eg. I might leave in debug statements that reveal un-scrambled mailboxes)

Think it through! The technique has subtleties, and we are dealing with personal data here (your own and that of others). Someone with a CD-ROM of 30 million email addresses could try to brute-force reverse engineer your whitelist. Since a whitelist is just 'non-spam mailboxes you have heard of', this isn't too bad in theory (since you could for example maintain a vast aggregate whitelist based on using these tools). But consider someone with a small whitelist that reflected controversial or private interests.
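To see why the hashing is obfuscation rather than real protection, here is how little work the brute-force case takes. This is an illustrative sketch only (recover_addresses is a made-up name): anyone holding a list of candidate addresses can hash each one and see which digests appear in a published whitelist.

```ruby
require 'digest/sha1'
require 'set'

# Given published digests and a list of candidate addresses, return the
# candidates whose mailto: digest appears on the published list.
def recover_addresses(published_hashes, candidate_addresses)
  wanted = published_hashes.to_set
  candidate_addresses.select do |addr|
    wanted.include?(Digest::SHA1.hexdigest("mailto:#{addr}"))
  end
end
```

The cost is one SHA-1 per candidate address, so the size of the attacker's address list is the only real barrier.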

After you've thought it through, let me know what you think! Send mail to www-rdf-interest@w3.org (big public list) or rdfweb-dev@yahoogroups.com (small public list), or stop by the RDF Interest Group IRC channel.

More scenarios (which could be either cool apps or disasters waiting to happen...)

If mailing list software exposed list member mail addresses (for public lists) in this fashion:

we could filter mailing list content for spam more effectively
cross-mailing-list membership patterns could be analysed: which (hashed) mailboxes connect IETF WG lists to W3C WG lists?
spammers could use this (plus conventional harvesting and targeted brute force) to find out more about mailbox owners
we could run 'friend of a friend' graph tools over the data to get a big picture of web communities
whitelist exchange could encourage spammers to forge 'From:' headers, which can be misleading and distressing for those whose (previously trusted) addresses are misused in spam messages
forged 'From:' headers could encourage people to digitally sign their email
baddies could watch updates to your whitelist-in-rdf file, and learn about new additions to your whitelist (consider random order, adding noise etc)
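The 'random order, adding noise' idea from the last item could be sketched as follows. This is a hypothetical helper (obscure is not part of foafwhite), and it assumes decoys are simply random 40-digit hex strings, which look like SHA-1 digests but will almost certainly never match a real address.

```ruby
require 'securerandom'

# Return the whitelist digests in random order, mixed with some random
# decoy values, so that successive published versions are harder to diff.
def obscure(hashes, decoys: 0)
  noise = Array.new(decoys) { SecureRandom.hex(20) } # 20 bytes = 40 hex digits
  (hashes + noise).shuffle
end
```

Fresh decoys on every export would themselves show up as churn to someone comparing versions, so a publisher might prefer to keep a stable set of decoys and only add to it; that is a design question this sketch leaves open.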

Design Issues

Fragmented scribbles from IRC chat:

<danbri> I was thinking of a three-layer sort of deal. I'd keep a largeish DBM with mail whitelist. Organisations could host a bigger one (fallback before sending to 'unknown senders') and 3rd parties could flaunt soap services that offer this to world at large.

<danbri> privacy issues creep in. You could flag incoming mail by possible topic: ./foafwhite.rb --check who@where.example.com --database=webgeeks

<danbri> can do foaf stuff too... run graph traversal over it to find the hashes of people who link different communities

<danbri> One could query eg rdfweb/soap for people's email addresses, sha them, and find out which lists they're on. This info's mostly all public anyway, but pretty obscure...

What next?

So this will only be interesting if some people share their data... (and there are privacy / security issues still...)

Gerald has (privately) shared his whitelist in this format, and I'm looking into trying the lists.w3.org accept list, which contains a LOT of email addresses of people in the Web community. And possibly a few spammers too.

Code: rather scruffy, has some nasty hacks, but basically seems to work. But will I trust my mailbox to it? ;-)

What is FOAF?

I nearly forgot. FOAF stands for 'friend of a friend', and is part of the 'semantic web vapourware' project RDFWeb, which I'm working on with some friends. We're experimenting with the use of RDF 'in the wild', using hypertext references (rdfs:seeAlso) amongst a Web of RDF/XML documents, with the goal of getting a bunch of interesting, useful and varied RDF data out there for people to use. FOAF is a basic RDF vocabulary that describes people, documents, images etc., and is used (somewhat chaotically) as the RDF schema namespace to support various of these prototypes. The coolest thing we've done yet is some photo metadata stuff that marks up the people who appear in photos: see "the co-depiction experiment" for details of that. Also some PGP/GPG web of trust experiments, which will doubtless resurface in a mail-filtering context at some point.

Random asides: So this sub-project is FOAFWhite and the Seven Spammers, or something. Or maybe I'll think of another name for it... In general RDFWeb is stuff I do outside my (W3C, ILRT etc.) work, since it's (a) often silly (b) occasionally tasteless (c) I guess I don't need to give reasons. The point of RDFWeb is to build goofy fun stuff that's useful and interesting and inspires people to re-implement a better version in half the time... The sign of a good RDFWeb project is that it gets picked up and finished off elsewhere...

$Id: intro.html,v 1.18 2003/04/28 18:02:55 danbri Exp $