The Email Filter that "Bayes" at the Moon
Jul. 28th, 2004 07:26 pm
So, like any red-blooded American 'netter these days, I've been getting waaaaay too much spam email. At first, I attempted to get rid of most of it by sorting anything that was only bcc'd to me (that is, email that was only addressed to me "invisibly") into a separate "might be spam" folder to check every so often (after sorting all my mailing lists into their own folders). The problem was, I didn't check it often enough, and it got a lot of false positives: mailing lists I never bothered to set up filtering rules for, or lists for which the filtering rules weren't quite right, and other things. And after a while, spamming software started sending out individual emails to people instead of one single email to a thousand bcc'd, and so the messages would wind up in my inbox anyway.
After that, I set up SpamAssassin on my box, and it did a fairly decent job of sorting out the wheat from the chaff, but after a while it just wasn't working very well either. I was about at my wit's end; my in-box's usefulness was declining fast.
Until I finally got around to looking at how to set up SpamAssassin's Bayesian filtering function.
SpamAssassin, for folks who don't know and don't feel like clicking the link, is a scoring spam detector. It looks at each email that comes in and analyzes it for various spammy characteristics. It asks things like, "Is it all HTML?" "Is it from an email address or domain that gets used for spamming?" and so on, assigns a point value for each answer, then totals them up and sees if the mail makes it over a certain threshold of spamminess. For Unixy computers (Linux, BSD, etc.), it's a must-have in today's spam-ridden environment.
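To make the "assign points, total them up, compare to a threshold" idea concrete, here's a rough sketch in Python. The rule names, point values, and the spam.example domain are all made up for illustration; they are not SpamAssassin's actual rules or scores, and SpamAssassin itself is not written this way.

```python
# Hypothetical rules and point values, for illustration only.
# These are NOT SpamAssassin's real rules or scores.
SPAM_DOMAINS = {"spam.example"}
THRESHOLD = 5.0

RULES = [
    ("HTML_ONLY", 0.5, lambda msg: msg["content_type"] == "text/html"),
    ("KNOWN_SPAM_DOMAIN", 3.5,
     lambda msg: msg["from"].rsplit("@", 1)[-1] in SPAM_DOMAINS),
    ("SHOUTING_SUBJECT", 1.0, lambda msg: msg["subject"].isupper()),
]


def score_message(msg):
    """Return (total points, list of (rule name, points) that fired)."""
    hits = [(name, pts) for name, pts, test in RULES if test(msg)]
    total = sum(pts for _, pts in hits)
    return total, hits


def is_spam(msg):
    """A message counts as spam once its total reaches the threshold."""
    total, _ = score_message(msg)
    return total >= THRESHOLD


junk = {"from": "deals@spam.example", "subject": "BUY NOW",
        "content_type": "text/html"}
total, hits = score_message(junk)
print(total, [name for name, _ in hits], is_spam(junk))
```

Every rule here is a fixed test someone wrote by hand, which is exactly why this style of filter is easy to dodge once spammers know what the tests look for.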
The problem is that spammers are always finding ways to get around the filters: breaking up key words with periods or other punctuation, replacing letters with numbers, and so on. V1a.6ra. C/al:s. And so on. And the scoring mechanisms just can't keep up.
But then along comes Bayesian filtering to save the day. Do you remember the advertising slogan for the ill-fated Dreamcast video game console, "it's thinking"? Well, a slogan for Bayesian filtering might be "it's learning," because that's what Bayesian filtering is all about.
Bayesian filtering is a way of letting you train your computer to recognize spam directly, without having to rely on scorefiles filled with bypassable rules. Much as police dogs are trained to sniff out drugs and explosives by letting them smell samples, a Bayesian filter is trained to sniff out spam and good email (sometimes called "ham," to contrast it to spam) and tell which is which. After it's had enough examples to compare, the filter starts to get a feel for what spam and good mail each look like, and can pass its judgment on to be considered with the rest of the SpamAssassin scores.
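For the curious, the "train it on samples" idea can be sketched as a textbook naive Bayes classifier over word counts. This is only a toy under my own assumptions: the TinyBayes class, its tokenizer, and the add-one smoothing are illustrative choices, and SpamAssassin's real Bayes code picks tokens and combines probabilities differently.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes over word counts (illustrative, not SpamAssassin's)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def learn(self, text, label):
        """Show the filter one example; label is 'spam' or 'ham'."""
        self.words[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) from the examples seen so far."""
        if 0 in self.messages.values():
            raise ValueError("need at least one spam and one ham example")
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: the fraction of training messages with this label.
            score = math.log(self.messages[label] / total_msgs)
            total_words = sum(self.words[label].values())
            for word in self.tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.words[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into a probability between 0 and 1.
        return 1 / (1 + math.exp(log_score["ham"] - log_score["spam"]))


bayes = TinyBayes()
bayes.learn("cheap pills buy now", "spam")
bayes.learn("buy cheap watches now", "spam")
bayes.learn("lunch meeting on friday", "ham")
bayes.learn("notes from the friday meeting", "ham")
print(bayes.spam_probability("buy cheap pills"))
print(bayes.spam_probability("friday lunch meeting"))
```

With only four training messages the numbers are crude, but the first probability lands above one half and the second below it, and each extra call to learn() nudges the counts a little further.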
SpamAssassin only requires being shown 200 spam and 200 good-email ("ham") messages before the filters can start to work—but the bigger sample it gets, the more accurate it gets, so I've currently fed it 1580 spams and 4874 hams...and I'm still going through mailboxes to weed out spam and feed it more goodies.
And already I have evidence it's starting to work. Looking at an example of the score sheet that SpamAssassin appends to emails totalling up how spammy it thinks they are, I see this. Note the BAYES_99 line.
Content analysis details:   (7.7 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 HTML_60_70             BODY: Message is 60% to 70% HTML
 0.1 HTML_FONTCOLOR_BLUE    BODY: HTML font color is blue
 0.1 HTML_MESSAGE           BODY: HTML included in message
 5.4 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.0000]
 0.3 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.1 HTML_FONTCOLOR_RED     BODY: HTML font color is red
 0.6 MIME_HTML_NO_CHARSET   RAW: Message text in HTML without charset
 0.9 FORGED_YAHOO_RCVD      'From' yahoo.com does not match 'Received' headers
 0.1 RCVD_IN_SORBS          RBL: SORBS: sender is listed in SORBS
                            [80.161.113.251 listed in dnsbl.sorbs.net]
Thus we see that the Bayesian filter judged this particular mail was spam...and added a whopping 5.4 points to the spam score, bringing it up to 7.7 points, 2.7 more than the filter requires. If not for the Bayesian filter, it would only have scored 2.3 and ended up in my mailbox.
And even if something gets sorted into the wrong mailbox (a spam that should be a ham, or vice versa), I can just tell the filter "You sorted this wrong. Here's the right category," and it learns and remembers for next time.
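The arithmetic in that score sheet is easy to double-check. Here's a small Python sketch that re-adds the points exactly as they appear in the report; the variable names are my own, and Decimal is used only so the tenths add up without floating-point fuzz.

```python
from decimal import Decimal

# Points copied from the score sheet quoted in this post.
report = [
    ("HTML_60_70", "0.1"),
    ("HTML_FONTCOLOR_BLUE", "0.1"),
    ("HTML_MESSAGE", "0.1"),
    ("BAYES_99", "5.4"),
    ("MIME_HTML_ONLY", "0.3"),
    ("HTML_FONTCOLOR_RED", "0.1"),
    ("MIME_HTML_NO_CHARSET", "0.6"),
    ("FORGED_YAHOO_RCVD", "0.9"),
    ("RCVD_IN_SORBS", "0.1"),
]
required = Decimal("5.0")

total = sum(Decimal(points) for _, points in report)
without_bayes = sum(Decimal(points) for name, points in report
                    if name != "BAYES_99")

print(total)                     # 7.7
print(total - required)          # 2.7
print(without_bayes)             # 2.3
print(without_bayes >= required)  # False
```

So the other eight rules together only reach 2.3 points, well short of the 5.0 needed, and the single Bayesian rule is what carries the message over the line.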
Bayesian filtering is working well enough so far that I've gone ahead and disabled the ineffective bcc sorting, and moved SpamAssassin from being the last step in the sort (previously I had sorted all the mailing lists into their own mailboxes before I ran SpamAssassin on the rest, not wanting to take the chance it would declare mailing list posts to be spam) to the first (so hopefully now it will remove even those spam posts that come over mailing lists while leaving the rest of the lists untouched).
Maybe I should see whether any good Bayesian apps are available for Windows and get Mom & Dad set up with one. Or have Aaron do it.