Feeding your mail gateway a proper spam diet

In a previous post I described the process of how to get a Linux based mail filtering gateway set up on your network to check for viruses and do some basic filtering, eventually delivering messages to your Exchange server.

In this post I will expand on the various ways to “train” and customize your SpamAssassin mail filter to do more checks to weed out spam and generally lower the amount of junk that is making its way to your users’ inbox.

There are a number of things that aren’t enabled by default in SpamAssassin.  Obviously this isn’t as efficient as we would like, so there is a little bit of extra leg work getting everything set up the way it should be.

Tightening up Postfix:

This is the first step to improving the efficiency of your filtering process.  There are a number of checks that can be enabled in the configuration file (/etc/postfix/main.cf) here to fight the incoming spam.  I have appended these various checks to the end of my configuration posted previously to lower the amount of spam getting through by ensuring proper sending addresses, valid recipients, proper domains, etc.

smtpd_helo_required = yes
smtpd_sender_restrictions =
 reject_non_fqdn_sender,
 reject_unknown_sender_domain
smtpd_recipient_restrictions =
 permit_mynetworks,
 reject_unauth_destination,
 reject_invalid_hostname,
 reject_unknown_sender_domain,
 reject_unknown_recipient_domain,
 reject_non_fqdn_recipient

Configuring these properties will cause an immediate drop in the amount of spam that makes its way through the filter so the importance of getting this implemented cannot be overstated.

Make your spam filter happy, feed it spam:

The next technique that I will discuss took me FOREVER to figure out, so I hope that by sharing what I have learned I will help people save time in their own implementations.  It didn’t help that IMAP wasn’t enabled on our Exchange server, but I will save that story for another day.

Essentially you want to get a good chunk of SPAM and HAM emails messages to your mail filter for SpamAssassin to apply it Bayesian filtering techniques to learn how to classify incoming messages(statistical analysis stuff, I don’t know a lot about the specifics).

My first thought was to have users copy SPAM messages into a public folder on my Exchange server and pull the messages down directly to my mail gateway.  BUT that dream was shattered when I discovered that IMAP support for public folders had been dropped in the version of Exchange I am using (Exchange 2010).

So I dabbled with a few ideas that weren’t very graceful, the most notable of which was copying the Exchange public folder into Thunderbird then copying the mbox file from Thunderbird to the mail gateway, yuck.  I finally got some help from my friends over at ServerFault.  I basically had to install and configure fetchmail to go out and look for two specified mailboxes on my Exchange, one SPAM account (a spam collection account I created) and one HAM account (my personal inbox).

To install fetchmail issue the following command:

sudo aptitude install fetchmail

Next, we need to configure fetchmail to look at our specified IMAP acounts, so we need to edit the config file ~/.fetchmailrc

poll mail.domain.com protocol IMAP port 993:
auth password user "domain/spamacct" with password "password" ssl
auth password user "domain/hamacct" with password "password" ssl

Modify the permissions so that only the specified user can read/write the config file

chmod 600 .fetchmailrc

Finally you should be able to pull the emails onto your mail gateway by issuing the following command:

fetchmail -a -v -n -k --folder inbox

At this point the mail should be on your mail server in the directory /var/spool/mail/USER.  The final step is to feed the mail into the Bayesian filter provided by SpamAssassin.  To do this, issue the following command:

sa-learn --showdots --mbox --spam spam
sa-learn --showdots --mbox --ham ham

I had to fool around with the mail file names when I first copied them to the server to read as “spam” and “ham” but that should be easy enough to accomplish.

To check how the learning process is going we need to check the sa-learn database for the tokens, ham and spam it has received.  There are a few ways to check the database but the easiest I have found is to enter the following into the command line:

sa-learn --dump magic

This will output a number of results, the most important of which are the nham, nham and ntoken outputs.  Here is a sample from the initial training stages from my spam filter:

bruticus@bruticus:~$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        341          0  non-token data: nspam
0.000          0        210          0  non-token data: nham
0.000          0      69078          0  non-token data: ntokens
0.000          0 1318421928          0  non-token data: oldest atime
0.000          0 1319205954          0  non-token data: newest atime
0.000          0 1319142287          0  non-token data: last journal sync atime
0.000          0 1319142287          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

Ideally you want the nham and nspam outputs around or above the 1000 message mark, but the filter can begin working with as little as 200 of each.

Also, I have read that the best way to train is to feed SpamAssassin the newest spam and ham messages that you have, so make sure to look for the newest messages to feed it.  I read that it has something to do with the Bayesian analysis.

NOTE:  Try to do the spam/ham learning step of the process in off hours or a slow time because it adds a tremendous amount of overhead to Postfix to process all the messages as well the machine itself taking up a large chunk of memory.

That’s it. The spam filter should be able to filter out even more messages now thanks to the bayesian filtering that we just enabled.

Final Step:

This one may or may not be overkill, I just implemented it yesterday and haven’t had a chance to get any feedback from it yet.  If you are in a multi-language  environment  this addition may not be feasible either.  With this step we are going to enable a SpamAssassin plugin to attempt to detect the email language and filter out everything that isn’t either English or Spanish.

To do this we need to enable the plugin so open up the SpamAssassin config, /etc/spamassassin/v310.pre and uncomment the following line,

loadplugin Mail::SpamAssassin::Plugin::TextCat

Then we need to edit the main SpamAssassin configuration file, /etc/spamassassin/local.cf to filter out all non English or Spanish languages, this line can be added anywhere, I chose to add it under the Bayesian filtering sections.

ok_languages en es
ok_locales en es

Conclusion:

That is pretty much it, at least for now. There are possibly a few other things to modify but I need to see how efficient the spam filter is at this point before I decide if I need to add any more layers.  I have a feeling that things are pretty good at this point and adding more filtering wouldn’t really add much value to the filter.

I am very satisfied with the results that I have attained with this project and hope to keep refining the process as I see fit.  Although, at some point I think I am just going to need to take a look at is and say “enough is enough”.  So, if you have any questions or ideas for improvement let me know, I would be glad to hear them.

Resources:

http://wiki.apache.org/spamassassin/SingleUserUnixInstall#Enable_IMAP_LearnAsSpam_folder
http://wiki.apache.org/spamassassin/RemoteImapFolder
http://www.byteplant.com/support/cleanmail/howtolearnexchange.html
http://faisal.com/docs/salearn
http://allaboutexchange.blogspot.com/2007/08/how-to-configure-spamassassin-bayesian.html
http://serverfault.com/questions/320594/bayesian-filtering-for-exchange-2010/320838#320838
http://www.howtoforge.com/debian_etch_fetchmail
http://spamassassinbook.packtpub.com/chapter9_preview.htm
http://www.linuxhomenetworking.com

Josh Reichardt

Josh is the creator of this blog, a system administrator and a contributor to other technology communities such as /r/sysadmin and Ops School. You can also find him on Twitter and Facebook.