bayesfilter

Introduction

Bayesfilter is a Bayesian spam filter designed to be used in conjunction with an IMAP server that uses Maildir for storage, such as Courier-IMAP. Once it is set up, the user can classify mail as spam or as ham simply by moving it into or out of the spam folder.

I started using a Bayesian spam filter after reading Paul Graham's now-famous article "A Plan for Spam". A follow-up article, "Better Bayesian Filtering", explores the subject further. I first used a Perl script, bayespam, which I modified to suit my needs. However, it quickly became apparent that its performance was unsatisfactory: Perl tends to use huge amounts of memory once you start using a large number of anonymous data structures.

So I wrote a new filter from scratch in C (with help from Christophe Rolland, thanks!), making use of C's low-level memory management capabilities. The result is a filter that uses little memory (only a few MB while running) and is quite fast (it can analyze about 1000 mails per second on a 2.4 GHz Pentium 4).

Requirements

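To use bayesfilter, you will need an IMAP server that stores mail in Maildir format (such as Courier-IMAP), procmail for mail delivery, and a C compiler with make to build the source.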

Files

The source code for bayesfilter is distributed as a tarball, bayesfilter-1.1.tar.gz. It has been tested on FreeBSD, Linux and Solaris, and probably works on any UNIX-like OS.

Installation

Bayesfilter consists of two parts: bayesfilter_builddatabase and bayesfilter_checkmail. The former scans your existing mail and builds a database of spam and nonspam words, which is then used for evaluating incoming mail. The latter, bayesfilter_checkmail, checks incoming mail against that database.
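To give an idea of what "checking mail against the database" involves, here is a minimal sketch of the combining step described in Paul Graham's "A Plan for Spam": each word in a message has a spam probability, and those are merged into a single score. The function names and the 0.9 threshold are illustrative assumptions taken from that article, not from the bayesfilter source, whose actual implementation may differ.

```c
#include <stddef.h>

/* Threshold above which a message is treated as spam.
 * Assumption: 0.9 is the value suggested in "A Plan for Spam";
 * bayesfilter itself may use a different cutoff. */
#define SPAM_THRESHOLD 0.9

/* Combine per-word spam probabilities into one score for a message:
 *   P = prod(p_i) / (prod(p_i) + prod(1 - p_i))
 * Returns a value in [0, 1]. With no words, or with contradictory
 * certainties (some p_i == 0 and some p_i == 1), returns 0.5. */
double combine_probabilities(const double *p, size_t n)
{
    double spam = 1.0; /* running product of p_i       */
    double ham = 1.0;  /* running product of (1 - p_i) */
    size_t i;

    for (i = 0; i < n; i++) {
        spam *= p[i];
        ham *= 1.0 - p[i];
    }
    if (spam + ham == 0.0)
        return 0.5;
    return spam / (spam + ham);
}

/* Hypothetical helper: nonzero if the combined score exceeds the threshold. */
int looks_like_spam(const double *p, size_t n)
{
    return combine_probabilities(p, n) > SPAM_THRESHOLD;
}
```

In Graham's scheme only the most "interesting" words (those whose probability is farthest from 0.5) are passed to the combining step; the sketch simply combines whatever it is given.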

Unpack and build the code with the following commands:

tar xfz bayesfilter-1.1.tar.gz
cd bayesfilter-1.1
make

Now you'll have two binaries, bayesfilter_builddatabase and bayesfilter_checkmail. Copy those to a convenient location, for example /usr/local/bin. If you don't have root access, any place in your home directory will work as well.

Next, you will need to have procmail running. Install it, then set up your ~/.procmailrc recipe file. A basic recipe file can look like this:

# Please check if all the paths in PATH are reachable, remove the ones
# that are not.

PATH=$HOME/bin:/usr/bin:/bin:/usr/local/bin:.
MAILDIR=$HOME/Maildir # You'd better make sure it exists
DEFAULT=$MAILDIR/
LOGFILE=$MAILDIR/from
LOCKFILE=$HOME/.lockmail
VERBOSE=no

Now, enable procmail by setting up a ~/.forward file that looks like this:

| /usr/local/bin/procmail

Before proceeding, make sure that your mail delivery works! Send yourself some test mails, and make sure they're accepted properly by checking the logfile, ~/Maildir/from.

Now, you'll have to create the bayesfilter configuration file, ~/.bayesfilter. In the tarball, you'll find an example file, bayesfilter.example. Copy it to ~/.bayesfilter, and modify it as you see fit. In particular, add the spam and nospam directories that exist in your Maildir. Note that bayesfilter does not descend into subdirectories when scanning the mail, so you'll have to specify each directory separately.

You can now generate the spam database by running bayesfilter_builddatabase. This shouldn't take more than a few seconds, depending on the speed of your computer.

The last step is to make procmail use bayesfilter_checkmail to evaluate incoming mail. An example recipe comes with the tarball; add the contents of procmail.recipe.example to your ~/.procmailrc to activate it. Note that the recipe assumes that a "spam" folder exists on your IMAP server. If it doesn't, either create it or modify the recipe.

Usage

Using bayesfilter is very simple. To classify mail as spam, simply move it into the spam folder; to classify it as ham, move it out again. The next time bayesfilter_builddatabase is run, your classification will influence the spam word database.
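To illustrate why moving a message changes future results, here is a rough sketch of how a per-word spam probability can be derived from word counts, following the formula in Paul Graham's "A Plan for Spam". The function name, the parameter names and the constants (0.4 for rare words, the 0.01 to 0.99 clamp, the doubling of nonspam counts) are assumptions taken from that article; the formula bayesfilter actually uses may differ.

```c
/* Clamp x to the range [lo, hi]. */
static double clamp(double x, double lo, double hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Spam probability of a single word (hypothetical helper).
 *   spam_count, ham_count: occurrences of the word in spam / nonspam mail
 *   nspam, nham:           number of spam / nonspam messages scanned
 * Nonspam occurrences are counted double to bias against false positives.
 * Words seen too rarely to judge get a neutral-ish 0.4. */
double word_probability(unsigned long spam_count, unsigned long ham_count,
                        unsigned long nspam, unsigned long nham)
{
    double b, g;

    if (nspam == 0 || nham == 0 || 2 * ham_count + spam_count < 5)
        return 0.4;

    b = (double)spam_count / (double)nspam;      /* spam frequency    */
    g = 2.0 * (double)ham_count / (double)nham;  /* nonspam frequency */
    if (b > 1.0) b = 1.0;
    if (g > 1.0) g = 1.0;

    return clamp(b / (g + b), 0.01, 0.99);
}
```

Under these assumptions, a word seen 20 times in 1000 spam messages and twice in 1000 nonspam messages scores 0.02 / (0.02 + 0.004), or roughly 0.83. Each message you move into or out of the spam folder shifts these counts, and with them the score of every word it contains.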

This of course requires bayesfilter_builddatabase to be run at regular intervals. The best way to do this is to define a cron job, for example:

50 3 * * * /usr/local/bin/bayesfilter_builddatabase

Also, be aware that you will have to change ~/.bayesfilter if you add or remove folders from your IMAP server.

License

Bayesfilter is released under a BSD-style license. This means that you can do just about anything with it, as long as you don't claim you wrote it. Also, if it breaks your system, you can't hold me responsible.

Copyright © 2003-2004 Benjamin Lutz. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.