[personal profile] robotech_master
So, like any red-blooded American 'netter these days, I've been getting waaaaay too much spam email. At first, I attempted to get rid of most of it by sorting anything that was only bcc'd to me (that is, email that was only addressed to me "invisibly") into a separate "might be spam" folder to check every so often (after sorting all my mailing lists into their own folders). The problem was, I didn't check it often enough, and it got a lot of false positives—mailing lists I never bothered to set up filtering rules for, or lists for which the filtering rules weren't quite right, and other things. And after a while, spamming software started sending out individual emails to people instead of one single email to a thousand bcc'd, and so the messages would wind up in my inbox anyway.

After that, I set up SpamAssassin on my box, and it did a fairly decent job of sorting out the wheat from the chaff, but after a while it just wasn't working very well either. I was about at my wit's end; my inbox's usefulness was declining fast.

Until I finally got around to looking at how to set up SpamAssassin's Bayesian filtering function.

SpamAssassin, for folks who don't know and don't feel like clicking the link, is a scoring spam detector. It looks at each email that comes in and analyzes it for various spammy characteristics. It asks things like, "Is it all HTML?" "Is it from an email address or domain that gets used for spamming?" and so on, assigns a point value for each answer, then totals them up and sees if the mail makes it over a certain threshold of spamminess. For Unixy computers (Linux, BSD, etc.), it's a must-have in today's spam-ridden environment.
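To make the scoring idea concrete, here's a rough Python sketch of a points-and-threshold filter. Everything in it is an assumption for illustration: the rule names, the point values, the `spam.example` domain, and the choice to treat "at or above the threshold" as spam. It is not how SpamAssassin's real rules are written.

```python
# Toy scoring filter: each rule is (name, points, test function).
# Rule names, point values, and the domain list are invented for illustration.

THRESHOLD = 5.0                  # hypothetical cutoff
BAD_DOMAINS = {"spam.example"}   # hypothetical list of known spam domains


def is_html_only(msg):
    return msg["content_type"] == "text/html"


def from_bad_domain(msg):
    domain = msg["from"].rsplit("@", 1)[-1].lower()
    return domain in BAD_DOMAINS


def shouts_in_subject(msg):
    letters = [c for c in msg["subject"] if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


RULES = [
    ("HTML_ONLY", 1.5, is_html_only),
    ("BAD_DOMAIN", 3.0, from_bad_domain),
    ("ALL_CAPS_SUBJECT", 1.0, shouts_in_subject),
]


def score(msg):
    """Return (total points, list of (rule name, points) that fired)."""
    hits = [(name, pts) for name, pts, test in RULES if test(msg)]
    return sum(pts for _, pts in hits), hits


def is_spam(msg):
    total, _ = score(msg)
    return total >= THRESHOLD


example = {
    "from": "deals@spam.example",
    "subject": "BUY NOW",
    "content_type": "text/html",
}
print(score(example), is_spam(example))
```

No single rule is enough to cross the line here; it's the pile-up of small hits that does it, which is the whole point of a scoring approach.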

The problem is that spammers are always finding ways to get around the filters: breaking up key words with periods or other punctuation, replacing letters with numbers, and so on. V1a.6ra. C/al:s. And the scoring mechanisms just can't keep up.
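Here's a small Python sketch of why a fixed rule loses this arms race. The keyword pattern is a made-up stand-in for a handwritten rule, not one of SpamAssassin's actual rules.

```python
import re

# Hypothetical keyword rule: fires only when the word appears spelled normally.
KEYWORD_RULE = re.compile(r"\bviagra\b", re.IGNORECASE)


def keyword_rule_hits(text):
    return KEYWORD_RULE.search(text) is not None


for sample in ["Cheap Viagra here", "Cheap V1a.6ra here", "Cheap V.i.a.g.r.a here"]:
    print(sample, "->", keyword_rule_hits(sample))
```

Only the first sample trips the rule. You could keep writing new patterns for every mangled spelling, but the spammer only has to invent one more variation to slip past again.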

But then along comes Bayesian filtering to save the day. Do you remember the advertising slogan for the ill-fated Dreamcast video game console, "it's thinking"? Well, a slogan for Bayesian filtering might be "it's learning," because that's what Bayesian filtering is all about.

Bayesian filtering is a way of letting you train your computer to recognize spam directly, without having to rely on scorefiles filled with bypassable rules. Much as police dogs are trained to sniff out drugs and explosives by letting them smell samples, a Bayesian filter is trained to sniff out spam and good email (sometimes called "ham," to contrast it to spam) and tell which is which. After it's had enough examples to compare, the filter starts to get a feel for what spam and good mail each look like, and can pass its judgment on to be considered with the rest of the SpamAssassin scores.
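For the curious, the train-by-example idea can be sketched in a few lines of Python. This is a plain naive Bayes classifier with add-one smoothing, which I'm assuming purely for illustration; it is not SpamAssassin's own implementation, and the class name, tokenizer, and sample messages are all invented.

```python
import math
import re
from collections import Counter


class ToyBayesFilter:
    """Minimal naive Bayes spam/ham classifier (illustrative only)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def learn(self, text, label):
        """Show the filter one example; label is 'spam' or 'ham'."""
        self.token_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_msgs = sum(self.message_counts.values())
        log_p = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common this kind of mail has been so far.
            lp = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            total_tokens = sum(self.token_counts[label].values())
            for tok in self.tokenize(text):
                # Add-one smoothing so unseen words never give log(0).
                count = self.token_counts[label][tok]
                lp += math.log((count + 1) / (total_tokens + len(vocab) + 1))
            log_p[label] = lp
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(min(log_p["ham"] - log_p["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


f = ToyBayesFilter()
f.learn("cheap pills buy now", "spam")
f.learn("buy cheap watches now", "spam")
f.learn("lunch meeting moved to noon", "ham")
f.learn("notes from the meeting attached", "ham")
print(f.spam_probability("buy cheap pills"))   # above 0.5: looks like spam
print(f.spam_probability("meeting notes"))     # below 0.5: looks like ham
```

Nobody wrote a rule saying "pills" is spammy; the filter worked that out from the examples. Correcting a mistake is just another call to `learn` with the right label.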

SpamAssassin only requires being shown 200 spam and 200 good-email ("ham") messages before the filters can start to work—but the bigger sample it gets, the more accurate it gets, so I've currently fed it 1580 spams and 4874 hams...and I'm still going through mailboxes to weed out spam and feed it more goodies.

And already I have evidence it's starting to work. Here's an example of the score sheet that SpamAssassin appends to each email, totalling up how spammy it thinks the message is. Note the BAYES_99 line, which is the Bayesian filter's verdict.
Content analysis details:   (7.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 HTML_60_70             BODY: Message is 60% to 70% HTML
 0.1 HTML_FONTCOLOR_BLUE    BODY: HTML font color is blue
 0.1 HTML_MESSAGE           BODY: HTML included in message
 5.4 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.0000]
 0.3 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.1 HTML_FONTCOLOR_RED     BODY: HTML font color is red
 0.6 MIME_HTML_NO_CHARSET   RAW: Message text in HTML without charset
 0.9 FORGED_YAHOO_RCVD      'From' yahoo.com does not match 'Received' headers
 0.1 RCVD_IN_SORBS          RBL: SORBS: sender is listed in SORBS
                            [80.161.113.251 listed in dnsbl.sorbs.net]
Thus we see that the Bayesian filter judged this particular mail was spam...and added a whopping 5.4 points to the spam score, bringing it up to 7.7 points—2.7 more than the filter requires. If not for the Bayesian filter, it would only have scored 2.3 and ended up in my mailbox.
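If you want to check my arithmetic, here's a quick Python sketch. The point values are copied straight from the score sheet above; the variable names are just mine.

```python
# Points per rule, copied from the score sheet above.
scores = {
    "HTML_60_70": 0.1,
    "HTML_FONTCOLOR_BLUE": 0.1,
    "HTML_MESSAGE": 0.1,
    "BAYES_99": 5.4,
    "MIME_HTML_ONLY": 0.3,
    "HTML_FONTCOLOR_RED": 0.1,
    "MIME_HTML_NO_CHARSET": 0.6,
    "FORGED_YAHOO_RCVD": 0.9,
    "RCVD_IN_SORBS": 0.1,
}
REQUIRED = 5.0  # "5.0 required" from the sheet's first line

# Round to one decimal place to hide floating-point noise.
total = round(sum(scores.values()), 1)
without_bayes = round(total - scores["BAYES_99"], 1)
margin = round(total - REQUIRED, 1)

print("total:", total)                  # 7.7
print("without Bayes:", without_bayes)  # 2.3
print("over the line by:", margin)      # 2.7
print("caught without Bayes?", without_bayes >= REQUIRED)  # False
```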

And even if something gets sorted into the wrong mailbox (a spam that should be a ham, or vice versa), I can just tell the filter "You sorted this wrong. Here's the right category," and it learns and remembers for next time.

Bayesian filtering is working well enough so far that I've gone ahead and disabled the ineffective bcc sorting. I've also moved SpamAssassin from the last step in the sort to the first. Previously I sorted all the mailing lists into their own mailboxes before running SpamAssassin on the rest, not wanting to take the chance it would declare mailing list posts to be spam. Now, hopefully, it will remove even the spam posts that come over mailing lists while leaving the rest of the list traffic untouched.

Maybe I should see whether any good Bayesian apps are available for Windows and get Mom & Dad set up with one. Or have Aaron do it.

(no subject)

Date: 2004-07-29 04:22 am (UTC)
From: [identity profile] bluelang.livejournal.com
I would be very much interested in hearing if you find a Bayesian filter for Windows....

(no subject)

Date: 2004-07-29 05:24 am (UTC)
From: [identity profile] robotech-master.livejournal.com
As it happens, there are quite a few of the little buggers, based on SpamAssassin alone. The Apache wiki has a listing page (http://wiki.apache.org/spamassassin/CommercialProducts). Of course, they're commercial products, as none of the open source/free software people particularly cared to put all that work into supporting Windows without monetary incentive, but if spam is that much of a problem for you then you probably won't mind kicking in some cash to solve it.
