Spam Filtering for Mail Exchangers: How to reject junk mail in incoming SMTP transactions.
The time has come to look at the content of the message itself. This is what conventional spam and virus scanners do, as they normally operate on the message after it has been accepted. However, in our case, we perform these checks before issuing the final 250 response, so that we have a chance to reject the mail on the spot rather than later generating Collateral Spam.
If your incoming mail exchangers are very busy (i.e. large site, few machines), you may find that performing some or all of these checks directly in the mail exchanger is too costly. In particular, running Virus Scanners and Spam Scanners takes up a fair amount of CPU time.
If so, you will want to set up dedicated machines for these scanning operations. Most server-side anti-spam and anti-virus software can be invoked over the network, i.e. from your mail exchanger. More on this in the following chapters, where we discuss implementation for the various MTAs.
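Strictly speaking, RFC 2822 requires only the Date: and From: header fields, and says that Message-ID: should be present. In practice, though, legitimate mail nearly always contains at least the following header lines, so a message that lacks any of them deserves suspicion:
From: ...
To: ...
Subject: ...
Message-ID: ...
Date: ...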
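As a rough illustration of this check, here is a minimal Python sketch. The function name, the list of expected headers, and the way the envelope sender is passed in are my own assumptions, not part of any particular MTA. It makes an exception for bounces (empty Envelope Sender), since some specialized MTAs do not generate a Message-ID: for those.

```python
from email import message_from_bytes

# Header lines we expect to see; the exact set is a local policy choice.
EXPECTED_HEADERS = ("From", "To", "Subject", "Message-ID", "Date")


def missing_headers(raw_message, envelope_sender):
    """Return the names of expected header lines that the message lacks.

    raw_message is the message as received after DATA (bytes);
    envelope_sender is the MAIL FROM address, "" for a bounce.
    """
    msg = message_from_bytes(raw_message)
    # Header lookups are case-insensitive in the email package.
    missing = [name for name in EXPECTED_HEADERS if msg.get(name) is None]
    # Tolerate a missing Message-ID on bounces (empty envelope sender).
    if envelope_sender == "" and "Message-ID" in missing:
        missing.remove("Message-ID")
    return missing
```

A mail exchanger would typically reject the message if the returned list is non-empty, and mention the missing header lines in the rejection response.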
Addresses presented in the message header (i.e. the To:, Cc:, From: ... fields) should be syntactically valid. Enough said.
For each address in the message header:
If the address is local, is the local part (before the @ sign) a valid mailbox?
If the address is remote, does the domain part (after the @ sign) exist?
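The two questions above could be sketched in Python roughly as follows. Everything named here (LOCAL_DOMAINS, LOCAL_MAILBOXES, host_resolves, bad_header_addresses) is hypothetical. In particular, the default existence test only asks whether the name resolves to an address; a real implementation would more likely look for MX records through a DNS library, and would take the list of valid mailboxes from wherever your site keeps it.

```python
import socket
from email import message_from_bytes
from email.utils import getaddresses

# Hypothetical site configuration.
LOCAL_DOMAINS = {"example.com"}
LOCAL_MAILBOXES = {"alice", "postmaster"}


def host_resolves(domain):
    """Crude existence test: does the name have an address record?"""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


def bad_header_addresses(raw_message, domain_exists=host_resolves):
    """Return header addresses that fail the local or remote check."""
    msg = message_from_bytes(raw_message)
    values = []
    for name in ("From", "To", "Cc"):
        values.extend(msg.get_all(name, []))
    bad = []
    for _display_name, address in getaddresses(values):
        local_part, at, domain = address.rpartition("@")
        if not at or not local_part or not domain:
            bad.append(address)          # not of the form local@domain
        elif domain.lower() in LOCAL_DOMAINS:
            if local_part.lower() not in LOCAL_MAILBOXES:
                bad.append(address)      # local domain, unknown mailbox
        elif not domain_exists(domain):
            bad.append(address)          # remote domain does not exist
    return bad
```

Passing the existence test as a parameter keeps the DNS lookup replaceable, for example by a caching resolver or by a stub when testing.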
One trait of junk mail is that it is sent to a large number of addresses. If 50 other recipients have already flagged a particular message as spam, why couldn't you use this fact to decide whether or not to accept the message when it is delivered to you? Better yet, why not set up Spam Traps that feed a public pool of known spam?
I am glad you asked. As it turns out, such pools do exist:
These tools have progressed beyond simple signature checks that only trigger if you receive an identical copy of a message that is known to be junk mail. Rather, they evaluate common patterns, to account for slight variations in the message header and body.
Messages containing non-printable characters are rare. When they do show up, the message is nearly always a virus, or in some cases spam written in a non-western language, without the appropriate MIME encoding.
One particular case is where the message contains NUL characters (ordinal zero). Even if you decide that figuring out what a non-printable character means is more complex than beneficial, you might consider checking for this character. That is because some Mail Delivery Agents, such as the Cyrus Mail Suite, will ultimately reject messages that contain it. If you use such software, you should definitely consider getting rid of NUL characters.
On the other hand, the (now obsolete) RFC 822 specification did not explicitly prohibit NUL characters in the message. For this reason, as an alternative to rejecting messages that contain them, you may choose to strip these characters from the message before delivering it to Cyrus.
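Both options are simple to express. The following Python sketch (with function names of my own choosing) operates on the raw message bytes as received; whether you reject or strip remains a policy decision.

```python
def contains_nul(raw_message):
    """True if the raw message (bytes) contains a NUL character."""
    return b"\x00" in raw_message


def strip_nul(raw_message):
    """Return a copy of the raw message with all NUL characters removed."""
    return raw_message.replace(b"\x00", b"")
```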
Similarly, it might be worthwhile to validate the MIME structure of incoming messages. MIME decoding errors or inconsistencies do not happen very often; but when they do, the message is almost certainly junk. Moreover, such errors may indicate potential problems in subsequent checks, such as File Attachment Checks, Virus Scanners, or Spam Scanners.
In other words, if the MIME encoding is illegal, reject the message.
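One way to approximate such a check, if you happen to script your filter in Python, is to let the standard email package parse the message and then look at the defects it recorded. This is only a sketch under that assumption: the function name is mine, and the parser is deliberately lenient, so an empty result does not prove that the MIME structure is flawless.

```python
from email import message_from_bytes


def mime_defects(raw_message):
    """Return the names of the parser defects found in any MIME part."""
    msg = message_from_bytes(raw_message)
    found = []
    # walk() visits the top-level message and every nested part.
    for part in msg.walk():
        found.extend(type(defect).__name__ for defect in part.defects)
    return found
```

For instance, a message that declares itself multipart but names no boundary is reported with a NoBoundaryInMultipartDefect, which would be grounds for rejecting it.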
When was the last time someone sent you a Windows screensaver (".scr" file) or Windows Program Information File (".pif") that you actually wanted?
Consider blocking messages with "Windows executable" file attachment(s) - i.e. file names that end with a period followed by any of a number of three-letter combinations such as the above. This check consumes significantly fewer resources on your server than Virus Scanners, and may also catch new viruses for which a signature does not yet exist in your anti-virus scanner.
For a more-or-less comprehensive list of such "file name extensions", please visit: http://support.microsoft.com/default.aspx?scid=kb;EN-US;290497.
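A file name check of this kind might look as follows in Python. The function name is hypothetical, and the extension set is only an illustrative starting point: ".scr" and ".pif" come from the text above, ".exe" is my own addition, and you would fill in the rest from a list such as the one referenced above.

```python
from email import message_from_bytes

# Illustrative subset only; extend it from a fuller list of extensions.
BLOCKED_EXTENSIONS = (".scr", ".pif", ".exe")


def blocked_attachments(raw_message):
    """Return attachment file names that end in a blocked extension."""
    msg = message_from_bytes(raw_message)
    blocked = []
    for part in msg.walk():
        # get_filename() looks at Content-Disposition, then Content-Type.
        filename = part.get_filename()
        if filename and filename.strip().lower().endswith(BLOCKED_EXTENSIONS):
            blocked.append(filename)
    return blocked
```

Note that this only inspects the names declared in the MIME headers; files packed inside an archive such as a ".zip" are not seen by it.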
A number of different server-side virus scanners are available. To name a few:
In situations where you are not willing to block all potentially dangerous files based on their file names alone (consider ".zip" files), such scanners are helpful. Also, they will be able to catch viruses that are not transmitted as file attachments, such as the "Bagle.R" virus that arrived in March 2004.
In most cases, the machine performing the virus scan does not need to be your mail exchanger. Most of these anti-virus scanners can be invoked on a different host over a network connection.
Anti-virus software mainly detects viruses based on a set of signatures for known viruses, or virus definitions. These need to be updated regularly, as new viruses are developed. Also, the software itself should be kept up to date at all times for maximum accuracy.
Similarly, anti-spam software can be used to classify messages based on a large set of heuristics, including their content, standards compliance, and various network checks such as DNS Blacklists and Junk Mail Signature Repositories. In the end, such software typically assigns a composite "score" to each message, indicating the likelihood that the message is spam, and classifies the message as spam if the score is above a certain threshold.
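To make the idea of a composite score concrete, here is a toy Python sketch. It is not how any particular product works: the three rules, their weights, the threshold, and the dictionary keys describing the message are all invented for illustration, and real filters apply hundreds of rules with carefully tuned weights.

```python
# Each rule: (description, weight, test). All values here are hypothetical.
RULES = [
    ("subject is all upper case", 1.5, lambda m: m["subject"].isupper()),
    ("sending host is on a DNS blacklist", 3.0, lambda m: m["dnsbl_listed"]),
    ("body matches a junk mail signature", 4.0, lambda m: m["signature_hit"]),
]
THRESHOLD = 5.0


def score(message):
    """Return (total score, descriptions of the rules that matched)."""
    hits = [(desc, weight) for desc, weight, test in RULES if test(message)]
    total = sum(weight for _desc, weight in hits)
    return total, [desc for desc, _weight in hits]


def is_spam(message):
    """Classify the message as spam once its score reaches the threshold."""
    total, _matched = score(message)
    return total >= THRESHOLD
```

The point of the design is that no single heuristic decides the outcome; a message has to trip several rules, or one heavily weighted rule plus another, before it is classified as spam.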
Two of the most popular server-side heuristic anti-spam filters are:
These tools undergo a constant evolution as spammers find ways to circumvent their various checks. For instance, consider "creative" spelling, such as "GR0W lO 1NCH35". So, just like anti-virus software, if you use anti-spam software, you should update it frequently for the highest level of accuracy.
I use SpamAssassin, although to minimize impact on machine resources, it is no longer my first line of defense. Out of approximately 500 junk mail delivery attempts to my personal address per day, about 50 reach the point where they are checked by SpamAssassin (mainly because they are forwarded from one of my other accounts, so the checks described above are not effective). Out of these 50 messages, roughly one ends up in my inbox every 2 or 3 days.
Some specialized MTAs, such as certain mailing list servers, do not automatically generate a Message-ID: header for "bounced" messages (Delivery Status Notifications). These messages are identified by an empty Envelope Sender.
The IMAP protocol does not allow NUL characters to be transmitted to the mail user agent, so the Cyrus developers decided that the easiest way to deal with messages containing them was to reject them.