Training sitewide spam filters

How does one enable end-user training of a site-wide Bayesian spam filter for SpamAssassin when the users are reading mail through Microsoft Exchange and the filtering takes place on several Linux MX servers?

We have created two public folders, should-be-spam and should-be-ham. We created an exchange user, spamiam, that has full rights to these folders. End-users move misclassified mail from their inbox or junk-mail folder into the appropriate should-be public folder.

At the top of every hour, this script runs on one of the MX servers:

/usr/local/scripts/get_ham_spam
#!/bin/sh
# Start with an empty local mailbox for spamiam.
rm -f /var/spool/mail/spamiam
touch /var/spool/mail/spamiam
chown spamiam:mail /var/spool/mail/spamiam
# Fetch every message in the should-be-spam public folder (-a) and delete it from the server (-K).
su spamiam -c 'fetchmail -a -K -f /usr/local/scripts/spamiam.fetchmailrc -r "Public Folders/should-be-spam"'
# Append to the web-served mbox, then train the Bayes database on it as spam.
cat /var/spool/mail/spamiam >> /var/www/html/spamstuff/should-be-spam
sa-learn --spam --mbox /var/www/html/spamstuff/should-be-spam
# Repeat the same steps for the should-be-ham folder.
rm -f /var/spool/mail/spamiam
touch /var/spool/mail/spamiam
chown spamiam:mail /var/spool/mail/spamiam
su spamiam -c 'fetchmail -a -K -f /usr/local/scripts/spamiam.fetchmailrc -r "Public Folders/should-be-ham"'
cat /var/spool/mail/spamiam >> /var/www/html/spamstuff/should-be-ham
sa-learn --ham --mbox /var/www/html/spamstuff/should-be-ham

/usr/local/scripts/spamiam.fetchmailrc
poll exchange.xxxx.com
proto imap
user spamiam
password xxxxxxxxx
is spamiam here

At 15 minutes past each hour, the two other mail servers use wget to grab the should-be files to their local /tmp and run sa-learn.

get-ham-spam
#! /bin/sh
cd /tmp
rm -f should-be-spam should-be-ham
wget -q http://xxx.xxx.com/spamstuff/should-be-spam
wget -q http://xxx.xxx.com/spamstuff/should-be-ham
sa-learn --spam --mbox should-be-spam
sa-learn --ham --mbox should-be-ham
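
The schedule described above could be expressed with cron entries along these lines. This is only a sketch: the post does not show the actual crontabs, and the location of get-ham-spam on the two other servers is an assumed placeholder.

```
# On the source MX server (root's crontab): top of every hour
0 * * * * /usr/local/scripts/get_ham_spam

# On each of the two other mail servers: 15 minutes past every hour
# (script path is an assumption, not given in the post)
15 * * * * /usr/local/scripts/get-ham-spam
```

The 15-minute offset gives the source server time to finish fetching and appending before the other servers download the files.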

The files are included in logrotate on the source server, so they get zeroed every Sunday morning.
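
A logrotate stanza for this might look roughly like the following. The post does not include the real configuration, so the retention count, file mode, and ownership here are all assumptions; only the two file paths come from the script above.

```
# Hypothetical /etc/logrotate.d/spamstuff (file name assumed)
/var/www/html/spamstuff/should-be-spam /var/www/html/spamstuff/should-be-ham {
    weekly                  # rotate once a week
    rotate 4                # keep four old copies (assumed)
    missingok               # do not complain if a file is absent
    notifempty              # skip rotation when nothing was collected
    create 0644 root root   # recreate an empty, world-readable file (assumed owner and mode)
}
```

Because the hourly script appends with `>>`, the file would be recreated on the next run even without the `create` line; it is listed so the web server always has a file to serve.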

{ 3 } Comments

  1. Charles | February 23, 2007 at 6:49 am | Permalink

    Awesome script, it does exactly what I need… but I am having a slight problem with it. When there is a large quantity of mail being pulled from the exchange server, I notice a delay between the time when the fetchmail is finished and when all the messages are delivered to the local user’s mbox folder.

    Let’s say that 50 messages are pulled from exchange with fetchmail. Fetchmail completes its run, and I can actually see the /var/spool/spamuser message file growing in size for about 2 minutes.

    If I actually log on as that user and go into “mail”, it shows 17 messages. If I quit and go back into mail again, it shows 26, etc.

    It seems like the delivery process is a bit slow, so I am wondering if there is a way to make the script wait 5 minutes before processing the next command, “cat” and “sa-learn”…

  2. Charles | February 23, 2007 at 8:51 am | Permalink

    Never mind, I totally forgot about “sleep”. Works perfectly now.

  3. Steve | February 23, 2007 at 9:41 am | Permalink

    Just FYI, there’s an issue with fetchmail and Exchange’s IMAP implementation. Occasionally, Exchange fails to report that fetchmail’s attempt to delete a message succeeded, causing fetchmail to error out. It always resolves itself on the next cycle.
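
On the delivery delay raised in the comments: a fixed sleep between fetchmail and the cat/sa-learn steps works, as noted. One possible alternative is to poll the spool file until its size stops changing. The helper below is a sketch, not part of the original script; the name wait_for_stable and its default interval and check count are made up for illustration. A quiet interval does not prove delivery has finished, so the interval should be comfortably longer than the gaps observed between deliveries.

```shell
#!/bin/sh
# Hypothetical helper (not in the original post): wait until a file's size
# is unchanged across one polling interval.
# Usage: wait_for_stable FILE [INTERVAL_SECONDS] [MAX_CHECKS]
# Returns 0 once the size is stable, 1 if it kept changing for MAX_CHECKS
# intervals, and 2 if the file cannot be read.
wait_for_stable() {
    file=$1
    interval=${2:-10}
    max_checks=${3:-30}

    [ -r "$file" ] || return 2

    # Reading from stdin keeps the file name out of wc's output;
    # tr strips any padding some wc implementations add.
    last=$(wc -c < "$file" | tr -d '[:space:]')
    checks=0
    while [ "$checks" -lt "$max_checks" ]; do
        sleep "$interval"
        size=$(wc -c < "$file" | tr -d '[:space:]')
        if [ "$size" -eq "$last" ]; then
            return 0
        fi
        last=$size
        checks=$((checks + 1))
    done
    return 1
}

# Example call, placed after each fetchmail line in get_ham_spam:
#   wait_for_stable /var/spool/mail/spamiam 15 20
```

With the example arguments the script would wait at least 15 seconds and at most about five minutes before moving on to cat and sa-learn.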
