Zimbra 8.6: Bayesian Poisoning

Let me start by saying that this problem is not unique to Zimbra, and it certainly isn’t unique to version 8.6.  However, I was using Zimbra 8.6 when I ran into it, so this is how I fixed it there.

What is Bayesian Poisoning?

One of the core tenets of spam filtering is using Bayesian probability to raise or lower a message’s score based on the likelihood that it is spam.  This is done by compiling a database, often called the Bayes DB, of tokens produced by the Bayesian filter.  These tokens are keywords and combinations that push the probability that a given message is spam either up or down.  Bayesian poisoning is when that DB is intentionally populated with invalid references, so that either more spam is marked as not spam or more legitimate mail is marked as spam.

https://en.wikipedia.org/wiki/Bayesian_poisoning

My Situation

Below you will see the version of Zimbra that I was running at the time of the incident, though looking back I have been dealing with this problem to varying degrees as far back as 7.0.1 or so.

$ zmcontrol -v
Release 8.6.0_GA_1153.RHEL6_64_20141215151155 RHEL6_64 FOSS edition.

I was originally looking for clues about where the messages came from.  My initial suspicion was a misconfiguration in my secondary MX that let spammers flood through it and then inherit the trust that box had (that might have been the root of my poisoning, but I haven’t confirmed it yet).  I didn’t find the real problem until I started looking at the X-Spam-Status header, specifically the tests section.

Analysis of Message Headers

Below is a sample section of the headers, prior to fixing the problem.

X-Spam-Flag: NO
X-Spam-Score: -0.383
X-Spam-Level:
X-Spam-Status: No, score=-0.383 tagged_above=-10 required=5
tests=[BAYES_00=-1.9, HTML_FONT_LOW_CONTRAST=0.001,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001]
autolearn=no autolearn_force=no

Let’s break down each component here.

The X-Spam-Flag is pretty self-explanatory: the spam filters have determined this message is not spam (it is ham), and as such it will be processed with the usual gusto of the email subsystems.

X-Spam-Flag: NO

The X-Spam-Score is the sum of all of the test results run against this message, and it determines whether the message is flagged as spam.  Here it is a negative number; the higher the number, the greater the chance the message is actual spam.

X-Spam-Score: -0.383

The X-Spam-Level is simply a graphical representation of the score, with one asterisk for each whole point.  A score of 1.5 would be represented by “*”, the single asterisk matching the whole number “1” in 1.5, and a score of 7.2 would have a level of “*******”.  Our score is negative, which is less than 1, so it is represented by 0 asterisks and the header is empty.

X-Spam-Level:
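
As a quick check of that rule, here is a minimal shell sketch that rebuilds the level string from a score.  The function name spam_level is my own, and truncating the score to its whole number is an assumption based on the examples above rather than something taken from the amavis source.

```shell
# Sketch: rebuild the X-Spam-Level string from a score.
# One asterisk per whole point; scores below 1 (including negatives) give none.
spam_level() {
    awk -v score="$1" 'BEGIN {
        n = int(score)                  # truncate: 7.2 -> 7, -0.383 -> 0
        stars = ""
        for (i = 0; i < n; i++) stars = stars "*"
        print stars
    }'
}

spam_level 1.5      # prints *
spam_level 7.2      # prints *******
spam_level -0.383   # prints an empty line
```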

The X-Spam-Status is where all the magic happens.  Here we have “required”, the numerical score at which the message will be tagged as spam, and, of more value, the actual tests that were run and how each one affected the score.  This is where the problem is, and BAYES_00 is the culprit.  That one test has a value of -1.9, which is huge, but the score itself is correct: the “00” means that there is virtually no chance (0-1%) that this message is spam.  All of your legitimate messages should fall into this category as well, so we can’t just tinker with that score.  There are additionally BAYES_05, BAYES_20, BAYES_40, BAYES_50, BAYES_60, BAYES_80, BAYES_95, and BAYES_99, each covering a progressively higher band of spam probability, up to 99-100% for BAYES_99.  So the core of this problem is that we have a spam filter that thinks beyond a doubt that actual spam has no chance of being spam.

X-Spam-Status: No, score=-0.383 tagged_above=-10 required=5
tests=[BAYES_00=-1.9, HTML_FONT_LOW_CONTRAST=0.001,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001]
autolearn=no autolearn_force=no
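
To see how much weight a single test carries, it helps to add the per-test values up yourself.  The sketch below is one way to do that with standard tools; sum_tests is a hypothetical helper name, and it assumes the header has the same tests=[NAME=value, ...] layout as the sample above.

```shell
# Sketch: add up the per-test scores inside tests=[...] of an X-Spam-Status header.
# Reads the (possibly folded) header on stdin and prints the total to 3 decimals.
# Steps: unfold the header into one line, keep only the text inside [...],
# split it into one NAME=value pair per line, then sum the values.
sum_tests() {
    tr -d '\n' |
    sed -n 's/.*tests=\[\([^]]*\)].*/\1/p' |
    tr ',' '\n' |
    awk -F= '{ total += $2 } END { printf "%.3f\n", total }'
}

sum_tests <<'EOF'
X-Spam-Status: No, score=-0.383 tagged_above=-10 required=5
tests=[BAYES_00=-1.9, HTML_FONT_LOW_CONTRAST=0.001,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001]
autolearn=no autolearn_force=no
EOF
```

For the sample header this prints -0.383, matching the reported score.  Take BAYES_00 out of the sum and the same message would sit at 1.517, which shows how far that one test drags the total down.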

Another interesting item of note: in my environment, 90-95% of the spam I analyzed included HTML_FONT_LOW_CONTRAST, yet the score for that test was only 0.001.  This test looks at the formatting of the message and checks whether the background and text colors are close enough that the text is difficult to read with the naked eye (in my environment it was often a white background with off-white or very light gray text).  So in addition to fixing the Bayesian DB poisoning, I also raised the score for this test to reduce the likelihood that these messages get through the filter.  Even with all of the other tests adding to the spam score, the total could not overcome the -1.9 handed out by the poisoned Bayesian DB.

Resolving Bayesian DB Poisoning

The bad news is that there is no way to “fix” Bayes DB poisoning.  Basically everything that the Bayes DB knows is wrong, so the proper fix is to start with a fresh DB and re-train it.

Below you will see the location of the Bayes DB.

[zimbra@mail:~]$ pwd
/opt/zimbra
[zimbra@mail:~]$ cd .spamassassin/
[zimbra@mail:~/.spamassassin]$ ls -lh
total 4.4M
-rw-------. 1 zimbra zimbra 332K Sep 8 09:47 bayes_seen
-rw-------. 1 zimbra zimbra 4.7M Sep 8 09:47 bayes_toks

Now, to remove it we must stop the Zimbra services.  I suspect stopping just amavis might be enough, but to be safe I stopped the entire Zimbra service.

[root@mail:~]# service zimbra stop

Then simply remove both the bayes_seen and bayes_toks files.

[zimbra@mail:~/.spamassassin]$ rm bayes_*
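
If you would rather keep the poisoned files around for later inspection, a more cautious variant of this step moves them aside instead of deleting them.  This is only a sketch: retire_bayes_db is a hypothetical helper name, the default path is the directory shown above, and it should still only be run while the services are stopped.

```shell
# Sketch: move the Bayes files into a timestamped sibling directory
# instead of deleting them, and print where they went.
retire_bayes_db() {
    bayes_dir="${1:-/opt/zimbra/.spamassassin}"   # directory holding bayes_seen / bayes_toks
    backup_dir="${bayes_dir%/}.poisoned-$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    for f in "$bayes_dir"/bayes_*; do
        [ -f "$f" ] || continue                   # skip if the glob matched nothing
        mv "$f" "$backup_dir"/ || return 1
    done
    echo "$backup_dir"
}

# Usage, as the zimbra user with the services stopped:
#   retire_bayes_db /opt/zimbra/.spamassassin
```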

Once the files are gone, we can restart zimbra.

[root@mail:~]# service zimbra start

Let’s look at those files again and compare the file sizes.  This listing was taken after some time and training; the files might not show up immediately.

[zimbra@mail:~]$ cd .spamassassin/
[zimbra@mail:~/.spamassassin]$ ls -lh
total 288K
-rw-------. 1 zimbra zimbra 12K Sep 8 20:25 bayes_seen
-rw-------. 1 zimbra zimbra 332K Sep 8 22:00 bayes_toks

Training Zimbra

Training Zimbra is pretty simple: use the Mark as Spam button in the webmail application.  Of course it takes some time, because you need to wait for actual spam to come in before you can train on it.  To speed up the process and actually see the progress, I like to run zmtrainsa manually so it learns from the messages marked as spam right away.  Its output shows how effective that learn was across the whole system.

[zimbra@mail:~]$ zmtrainsa
20150912120115 Starting spam/ham extraction from system accounts.
[] INFO: Total messages processed: 1
[] INFO: Total messages processed: 0
20150912120119 Finished extracting spam/ham from system accounts.
20150912120119 Starting spamassassin training.
Learned tokens from 1 message(s) (1 message(s) examined)
Learned tokens from 0 message(s) (0 message(s) examined)
bayes: synced databases from journal in 1 seconds: 3129 unique entries (3910 total entries)
20150912120124 Finished spamassassin training.
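
When there are many messages to learn from, the interesting numbers are easy to lose in the output.  Here is a small sketch that totals them; summarize_training is a hypothetical helper name, and it assumes the "Learned tokens from N message(s) (M message(s) examined)" wording shown above.

```shell
# Sketch: total the "Learned tokens from N message(s) (M message(s) examined)"
# lines from zmtrainsa output read on stdin.
summarize_training() {
    awk '/^Learned tokens from/ {
        learned += $4              # N: messages that added tokens
        gsub(/\(/, "", $6)         # field 6 looks like "(1"; drop the parenthesis
        examined += $6             # M: messages that were looked at
    } END { printf "learned=%d examined=%d\n", learned, examined }'
}

summarize_training <<'EOF'
20150912120119 Starting spamassassin training.
Learned tokens from 1 message(s) (1 message(s) examined)
Learned tokens from 0 message(s) (0 message(s) examined)
20150912120124 Finished spamassassin training.
EOF
```

For the run above this prints learned=1 examined=1.  If learned comes out lower than examined, the usual explanation is that some of those messages had already been learned in an earlier run.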

Above we see that it processed the 1 message that I marked as spam, and that it was able to learn tokens from that message.  This means it was a good learn, and it should increase the effectiveness of your spam filter.  For the first 4 hours I was getting almost no spam, and then there was a burst over the next 12-16 hours where it seemed like perhaps I had been poisoned again already (that was frustrating).  I just kept on training, and by the end of the second day my spam volume had dropped and approximately 80-90% of the spam messages were being tagged and placed into the Junk folder.

To give you an idea of load so that you can extrapolate your expected timings based on my experiences, most days my system receives 3,000-4,000 messages a day with once a week bursts of up to 18,000 messages a day.
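
One thing worth knowing while you wait: with stock settings, SpamAssassin does not apply any BAYES_* test until the database has learned at least 200 spam and 200 ham messages (the bayes_min_spam_num and bayes_min_ham_num defaults), so a freshly wiped DB is silent for a while.  The sketch below reads the counters from sa-learn --dump magic.  The helper name bayes_ready is hypothetical, and how you invoke sa-learn on a Zimbra box (its path, and whether it needs --dbpath) is an assumption you should check against your own install.

```shell
# Sketch: read `sa-learn --dump magic` output on stdin and report whether the
# database has reached the minimum number of learned spam and ham messages.
bayes_ready() {
    awk -v min="${1:-200}" '
        /non-token data: nspam/ { nspam = $3 }   # third column holds the count
        /non-token data: nham/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d ", nspam, nham
            if (nspam + 0 >= min + 0 && nham + 0 >= min + 0) print "ready"
            else print "still learning"
        }'
}

# Assumed invocation as the zimbra user; adjust the sa-learn path or add
# --dbpath if your install keeps the database elsewhere:
#   sa-learn --dump magic | bayes_ready
```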

Customizing SpamAssassin Rules

This procedure is different for previous versions, so please do your homework if you are not on Zimbra 8.6.

We need to add the following to sauser.cf.  This should only be done after an extensive analysis of your spam otherwise at best it will not have any effect.  You might need to create sauser.cf if you haven’t previously customized other rules.

[zimbra@mail:~/data/spamassassin/localrules]$ more sauser.cf
ifplugin Mail::SpamAssassin::Plugin::HTMLEval
# <gen:mutable>
# DEFAULT - score HTML_FONT_LOW_CONTRAST 0.713 0.001 0.786 0.001
score HTML_FONT_LOW_CONTRAST 1.5 1.5 1.5 1.5
# </gen:mutable>
endif
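
The four numbers on a score line are easy to misread.  SpamAssassin picks one of them depending on how it is running: the first applies with Bayes and network tests both off, the second with only network tests on, the third with only Bayes on, and the fourth with both on.  The sketch below just illustrates that lookup; active_score is a hypothetical helper, not part of SpamAssassin.

```shell
# Sketch: pick the value SpamAssassin would use from a four-value score line.
# Columns: 1 = no Bayes, no network tests; 2 = no Bayes, network tests;
#          3 = Bayes, no network tests;    4 = Bayes and network tests.
active_score() {
    # $1 = yes/no (Bayes in use), $2 = yes/no (network tests in use),
    # remaining arguments = the four scores from the score line
    bayes="$1"; net="$2"; shift 2
    idx=1
    if [ "$net" = yes ]; then idx=$((idx + 1)); fi
    if [ "$bayes" = yes ]; then idx=$((idx + 2)); fi
    shift $((idx - 1))
    printf '%s\n' "$1"
}

# The stock values from the DEFAULT comment above, with Bayes and network tests on:
active_score yes yes 0.713 0.001 0.786 0.001   # prints 0.001
```

That is why the headers earlier showed 0.001 for this test.  Since a score line with a single value applies to all four cases, writing score HTML_FONT_LOW_CONTRAST 1.5 would have the same effect as repeating 1.5 four times.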

You can use this to modify other rules as you see fit as well.

https://wiki.apache.org/spamassassin/Rules/

Conclusion

So 4 days into this, no complaints.  I still have spam that gets through, but that is by design: it is a small number, and none of it is BAYES_00.  In other words, the filter knows these messages might be spam and would rather I take a look to confirm than get overzealous.  I am still using the adjusted HTML_FONT_LOW_CONTRAST rule, even though it shoots the scores through the roof now that the Bayesian filter is actually doing its job.

Here is a header from a message that it let through.  The important thing here is the absence of the BAYES_00 score.  It is instead replaced by BAYES_20, which only discounts the spam score by 0.001, so the filter is unsure about the message and Bayes has a largely neutral effect.

X-Spam-Flag: NO
X-Spam-Score: 1.935
X-Spam-Level: *
X-Spam-Status: No, score=1.935 tagged_above=-10 required=5
tests=[BAYES_20=-0.001, DATE_IN_FUTURE_06_12=1.947,
SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01,
URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
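
For reference, here is how a Bayes probability maps onto the rule names, based on the bands in SpamAssassin’s stock rules as I understand them.  The helper name bayes_rule is hypothetical, and the handling of values that fall exactly on a boundary is an assumption.

```shell
# Sketch: map a Bayes spam probability (0.0 to 1.0) to its BAYES_* rule name.
bayes_rule() {
    awk -v p="$1" 'BEGIN {
        if      (p < 0.01) r = "BAYES_00"   #  0 -  1%
        else if (p < 0.05) r = "BAYES_05"   #  1 -  5%
        else if (p < 0.20) r = "BAYES_20"   #  5 - 20%
        else if (p < 0.40) r = "BAYES_40"   # 20 - 40%
        else if (p < 0.60) r = "BAYES_50"   # 40 - 60%
        else if (p < 0.80) r = "BAYES_60"   # 60 - 80%
        else if (p < 0.95) r = "BAYES_80"   # 80 - 95%
        else if (p < 0.99) r = "BAYES_95"   # 95 - 99%
        else               r = "BAYES_99"   # 99 - 100%
        print r
    }'
}

bayes_rule 0.005   # prints BAYES_00
bayes_rule 0.15    # prints BAYES_20
bayes_rule 0.90    # prints BAYES_80
```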

Here is a header from a message that it caught and tagged.  This one also has no BAYES_00; in its place is BAYES_80, which indicates that the Bayesian DB now has some good information in it.

X-Spam-Flag: YES
X-Spam-Score: 8.74
X-Spam-Level: ********
X-Spam-Status: Yes, score=8.74 tagged_above=-10 required=5
tests=[AC_HTML_NONSENSE_TAGS=1.999, BAYES_80=2, HTML_MESSAGE=0.001,
SPF_PASS=-0.001, STYLE_GIBBERISH=3.499, T_RP_MATCHES_RCVD=-0.01,
UNPARSEABLE_RELAY=0.001, URIBL_BLOCKED=0.001, URIBL_JP_SURBL=1.25]
autolearn=no autolearn_force=no

So the bottom line is that if you are dealing with an overwhelming amount of spam in your inbox, it warrants some investigation, and hopefully this will help you sort through the problem a little quicker than I did.