Steve’s Stuff

Faults in the clouds of delusion

Training sitewide spam filters


How does one enable end-user training of a site-wide Bayesian spam filter for SpamAssassin when the users are reading mail through Microsoft Exchange and the filtering takes place on several Linux MX servers?

We have created two public folders, should-be-spam and should-be-ham, and an Exchange user, spamiam, that has full rights to both. End-users move misclassified mail from their inbox or junk-mail folder into the appropriate should-be public folder.

At the top of every hour, this script runs on one of the MX servers:

/usr/local/scripts/get_ham_spam
#! /bin/sh
# Fetch user-reported spam from the Exchange public folder and train on it.
# Start from an empty spool so only this run's mail is appended below.
rm -f /var/spool/mail/spamiam
touch /var/spool/mail/spamiam
chown spamiam:mail /var/spool/mail/spamiam
su spamiam -c 'fetchmail -a -K -f /usr/local/scripts/spamiam.fetchmailrc -r "Public Folders/should-be-spam"'
cat /var/spool/mail/spamiam >> /var/www/html/spamstuff/should-be-spam
sa-learn --spam --mbox /var/www/html/spamstuff/should-be-spam

# Same steps for mail that was wrongly classified as spam.
rm -f /var/spool/mail/spamiam
touch /var/spool/mail/spamiam
chown spamiam:mail /var/spool/mail/spamiam
su spamiam -c 'fetchmail -a -K -f /usr/local/scripts/spamiam.fetchmailrc -r "Public Folders/should-be-ham"'
cat /var/spool/mail/spamiam >> /var/www/html/spamstuff/should-be-ham
sa-learn --ham --mbox /var/www/html/spamstuff/should-be-ham
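The script above does not look at fetchmail's exit status. fetchmail exits 0 when it retrieved mail and 1 when there was none to retrieve; other values indicate an error. A minimal sketch of a check, assuming a helper named fetch_status_ok (a hypothetical name, not part of the original script):

```shell
# Hypothetical helper: decide whether a fetchmail exit status is acceptable.
# 0 = mail retrieved, 1 = no mail waiting; anything else is treated as an error.
fetch_status_ok() {
    case $1 in
        0|1) return 0 ;;
        *)   return 1 ;;
    esac
}
```

One way to use it would be to call fetch_status_ok $? right after each su line and skip the cat and sa-learn steps for that folder when it fails, leaving the retry to the next hourly run.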

/usr/local/scripts/spamiam.fetchmailrc
poll exchange.xxxx.com
proto imap
user spamiam
password xxxxxxxxx
is spamiam here
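Because this file holds a password, it is worth restricting its permissions. fetchmail also refuses to use a run-control file that is more permissive than 0710, and expects it to be owned by the user running fetchmail (spamiam here). A minimal sketch, assuming a helper named lock_down_rc (a hypothetical name) and the path used above:

```shell
# Hypothetical helper: make a fetchmail run-control file readable and
# writable by its owner only. Ownership (chown spamiam) is assumed to be
# set separately by root.
lock_down_rc() {
    rc=${1:-/usr/local/scripts/spamiam.fetchmailrc}
    chmod 600 "$rc"
}
```

Calling lock_down_rc with no argument applies it to the default path.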

At 15 past each hour, the two other mail servers use wget to grab the should-be files to their local /tmp and run sa-learn.

get-ham-spam
#! /bin/sh
cd /tmp
rm -f should-be-spam should-be-ham
wget -q http://xxx.xxx.com/spamstuff/should-be-spam
wget -q http://xxx.xxx.com/spamstuff/should-be-ham
sa-learn --spam --mbox should-be-spam
sa-learn --ham --mbox should-be-ham
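If a download fails, the script above still hands sa-learn a file that may be missing, and the file can also be empty right after the weekly rotation. A minimal sketch of a guard, assuming a helper named learn_if_nonempty and an SA_LEARN override variable (both hypothetical, introduced only for illustration):

```shell
# Hypothetical helper: train only when the downloaded mbox exists and is
# non-empty. SA_LEARN can be overridden (e.g. SA_LEARN=echo) for a dry run.
learn_if_nonempty() {
    kind=$1    # "spam" or "ham"
    mbox=$2    # path to the downloaded mbox file
    if [ -s "$mbox" ]; then
        ${SA_LEARN:-sa-learn} "--$kind" --mbox "$mbox"
    else
        echo "skipping $kind: $mbox is missing or empty" >&2
        return 1
    fi
}
```

In get-ham-spam, the two sa-learn lines could then become learn_if_nonempty spam should-be-spam and learn_if_nonempty ham should-be-ham.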

The should-be files are included in logrotate on the source server, so they are emptied every Sunday morning.


Written by Steve

July 14th, 2006 at 8:57 am

Posted in Spam, Tips

4 Responses to 'Training sitewide spam filters'


  1. Awesome script, it does exactly what I need… but I am having a slight problem with it. When a large quantity of mail is pulled from the Exchange server, I notice a delay between the time fetchmail finishes and the time all the messages are delivered to the local user’s mbox folder.

    Let’s say that 50 messages are pulled from Exchange with fetchmail. Fetchmail completes its run, and I can actually see the /var/spool/spamuser message file growing in size for about 2 minutes.

    If I actually log on as that user and go into “mail”, it shows 17 messages. If I quit and go back into mail again, it shows 26, etc.

    It seems like the delivery process is a bit slow, so I am wondering if there is a way to make the script wait 5 minutes before processing the next command, “cat” and “sa-learn”…

    Charles

    23 Feb 07 at 6:49 am

  2. Never mind, I totally forgot about “sleep”. Works perfectly now.

    Charles

    23 Feb 07 at 8:51 am

  3. Just FYI, there’s an issue with fetchmail and Exchange’s IMAP implementation. Occasionally, Exchange fails to report that fetchmail’s attempt to delete a message succeeded, causing fetchmail to error out. It always resolves itself on the next cycle.

    Steve

    23 Feb 07 at 9:41 am
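As an alternative to the fixed sleep mentioned in the comments, the script could wait until the spool file stops growing. This is only a sketch: the function name wait_until_stable and its default interval and timeout are assumptions, and the interval needs to be longer than any pause in local delivery, or the function will return too early.

```shell
# Hypothetical helper: return 0 once the file's size is unchanged between
# two consecutive checks, or 1 if the timeout expires first.
wait_until_stable() {
    file=$1
    interval=${2:-5}     # seconds between size checks
    timeout=${3:-300}    # give up after this many seconds
    waited=0
    last=-1
    while [ "$waited" -lt "$timeout" ]; do
        if [ -f "$file" ]; then
            size=$(wc -c < "$file" | tr -d ' ')
        else
            size=0
        fi
        if [ "$size" -eq "$last" ]; then
            return 0
        fi
        last=$size
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}
```

In the hourly script, a call such as wait_until_stable /var/spool/mail/spamiam 30 300 could sit between each fetchmail run and the cat that follows it.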
