SpamAssassin Tutorial

From Apis Networks Wiki


Introduction

SpamAssassin is a tool designed to detect e-mails that may be spam. It detects spam by two methods. First, it analyzes each e-mail for phrases and subject matter commonly found in spam, and adds a corresponding score to the e-mail.

SpamAssassin may also call upon Bayesian filtering to analyze the e-mail for words or phrases that occur in the spam you commonly receive. This feature is personalized to the content of the e-mails that you receive; that is to say, your database of tokens will differ significantly from the database of tokens another user has collected over time.

SpamAssassin configuration files are stored in a hidden directory inside your home directory called .spamassassin/.

Editing and Syntax Checking

Editing

Inside your SpamAssassin directory (/home/yourusername/.spamassassin) are several files of interest (descriptions quoted from the sa-learn manual; see man -S 1 sa-learn):

  • bayes_toks: The database of tokens, containing the tokens learnt, their count of occurrences in ham and spam, and the message count of the last message they were seen in. This database also contains some 'magic' tokens, as follows: the number of ham and spam messages learnt, the number of tokens in the database, the message-count of the last expiry run, the message-count of the oldest token in the database, and the message-count of the current message (to the nearest 5000). This is a database file, using the first one of the following database modules that SpamAssassin can find in your perl installation: "DB_File", "GDBM_File", "NDBM_File", or "SDBM_File".
  • bayes_seen: A map of message-ID to what that message was learnt as. This is used so that SpamAssassin can avoid re-learning a message it's already seen, and so it can reverse the training if you later decide that message was previously learnt incorrectly. This is a database file, using the first one of the following database modules that SpamAssassin can find in your perl installation: "DB_File", "GDBM_File", "NDBM_File", or "SDBM_File".
  • bayes_journal: While SpamAssassin is scanning mails, it needs to track which tokens it uses in its calculations. So that many processes can read the databases simultaneously, but only one can write at a time, this uses a 'journal' file. When you run "sa-learn --rebuild", the journal is read, and the tokens that were accessed during the journal's lifetime will have their last-access time updated in the "bayes_toks" database.
  • auto-whitelist: The auto-whitelist database. Despite the name, it is not a fixed list of approved senders: it records the average score of past mail from each sender and pulls the score of new mail from that sender toward that average, so senders with a history of ham are much less likely to be labeled as spam. SpamAssassin updates it automatically as mail is scanned.
  • user_prefs: Your SpamAssassin preferences file — the file we're interested in editing.

Pop open user_prefs in your favorite text editor and let's get to work on understanding the structure.

use_bayes 1
# How many hits before a mail is considered spam.
required_hits 5
bayes_auto_learn 1
add_header all Score _SCORE_

On the left-hand side of each line is a directive, which tells SpamAssassin which default behavior to override. Everything to its right tells SpamAssassin how to handle that behavior. Here we see the use_bayes directive, followed by a space and the value 1. This is equivalent to true, or "enable this". SpamAssassin will therefore enable use_bayes, which, if we look it up in SpamAssassin's online manual, turns on Bayesian filtering. You may bring the manual up from your shell by typing perldoc Mail::SpamAssassin::Conf. In our example the following values are defined:

  • use_bayes: Enable Bayesian filtering
    • 1: We are enabling it
  • required_hits: Number of hits to mark an e-mail as spam
    • 5: our defined value
  • bayes_auto_learn: Whether to enable Bayesian auto-learning
    • 1: We are enabling it.
  • add_header: Add a custom header to certain types of e-mails
    • all Score _SCORE_: Whether an e-mail is spam or ham, a header is added to the e-mail outlining the score determined by SpamAssassin. _SCORE_ is substituted with the score SpamAssassin determines.
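
The directive/value split described above can be sketched in a few lines of shell. This is only an illustration: the temporary file and the list_directives function are names invented for this sketch, and the awk script simply treats the first word of each line as the directive and the rest as its value, skipping comments and blank lines.

```shell
# Throwaway copy of the example configuration so the sketch is self-contained.
prefs=$(mktemp)
cat > "$prefs" <<'EOF'
use_bayes 1
# How many hits before a mail is considered spam.
required_hits 5
bayes_auto_learn 1
add_header all Score _SCORE_
EOF

# list_directives FILE (hypothetical helper)
# Prints "directive => value" for every line that is not a comment or blank.
list_directives() {
    awk '
        /^[ \t]*(#|$)/ { next }                  # skip comments and blank lines
        {
            directive = $1                       # first word is the directive
            sub(/^[ \t]*[^ \t]+[ \t]*/, "")      # strip it, leaving the value
            printf "%s => %s\n", directive, $0
        }
    ' "$1"
}

list_directives "$prefs"
```

Running it against your real /home/yourusername/.spamassassin/user_prefs instead of the temporary file gives a quick overview of which defaults you are overriding.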

Notice how we skipped over the line # How many hits before a mail is considered spam.? That is because it is a comment, a note you leave yourself to explain another line. SpamAssassin does not interpret comments as directives. Comments are ubiquitous and appear in almost every application's configuration. Some begin with #, others with //, and still others with ;. The /* comment */ form is seldom seen in configuration files and is more common in code. SpamAssassin uses # to mark a comment.

Syntax Checking

Now that we've modified our configuration a bit, let's make sure it works before using it permanently. This can be done by typing spamassassin --lint from the command-line.

spamassassin --lint
if [ $? -eq 0 ] ; then
   echo "Clean"
else
   echo "Error!"
fi

If you see "Clean", then no warnings were emitted and SpamAssassin understands the configuration.
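
When the check fails, it helps to see why. The sketch below wraps the same test in a function and, on failure, re-runs the lint with SpamAssassin's debug flag (-D) to show the tail of the diagnostic output. The function name check_prefs and the SA_BIN variable are inventions of this sketch; SA_BIN only exists so the command can be swapped out, and it defaults to the real spamassassin command.

```shell
# check_prefs (hypothetical helper)
# SA_BIN is an assumed override for this sketch; by default the real
# spamassassin command is used.
check_prefs() {
    if "${SA_BIN:-spamassassin}" --lint; then
        echo "Clean"
    else
        echo "Error! Last lines of the debug output:" >&2
        # -D enables debug output; keep only the tail, where lint problems show up
        "${SA_BIN:-spamassassin}" --lint -D 2>&1 | tail -n 20 >&2
        return 1
    fi
}
```

Call check_prefs after each change to user_prefs; it prints "Clean" and returns 0 only when the lint passes.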

Important: Whenever you are writing custom rules (explained later on), it is important to run spamassassin --lint to ensure the configuration is understood by SpamAssassin fully. Failure to do so may result in your configuration file NOT being used when spamassassin is called upon.
 

Automatically Tagging International E-mails

If you only receive e-mails in English, Cyrillic, Greek, or whatever the language may be, SpamAssassin can be configured to automatically label e-mails written in any other language as spam, regardless of content. SpamAssassin has two modes of detecting and marking such e-mails: by character set and by language.

By Character Set

Marking e-mails as spam based upon character sets covers only the handful of character sets listed below, but it is considerably more effective than marking by language. Each e-mail is analyzed for the proportion of its text written in a given character set.

ok_locales all 
ok_locales en
ok_locales en ja zh

These are three alternative settings. With the first, all e-mails are accepted and no score is given to e-mails based upon a character set. This means that an e-mail written solely in the Cyrillic character set (Russian, for example) would only be labeled as spam if it contains other strings SpamAssassin recognizes as spam. With the second, only Western character sets (en) are permitted; this is a good general solution for stopping spam written in languages you cannot read. With the third, e-mails written in Western, Japanese, and Chinese character sets are permitted.
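
Because only one of these settings should be active at a time, switching between them means replacing the existing line rather than adding another. A minimal sketch of that, assuming a helper named set_directive (invented here) and a throwaway file standing in for user_prefs:

```shell
# set_directive FILE NAME VALUE... (hypothetical helper)
# Drops any existing line whose first word is NAME, then appends the new one.
# Only suitable for directives that should appear once.
set_directive() {
    file=$1
    name=$2
    shift 2
    tmp=$(mktemp) || return 1
    # Keep every line whose first word is not NAME (comments are untouched).
    awk -v name="$name" '$1 != name' "$file" > "$tmp" || return 1
    printf '%s %s\n' "$name" "$*" >> "$tmp"
    cat "$tmp" > "$file"    # overwrite in place so file permissions are kept
    rm -f "$tmp"
}

# Demonstration on a temporary file; the real one is ~/.spamassassin/user_prefs.
prefs=$(mktemp)
printf 'use_bayes 1\nok_locales all\n' > "$prefs"
set_directive "$prefs" ok_locales en ja zh
cat "$prefs"
```

After the call the file contains a single ok_locales line with the new value, and the unrelated use_bayes line is left alone.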

Only the following character sets are recognized as valid options. The left-hand side is the value and the right-hand side is a description of that value:

en - Western character sets in general
ja - Japanese character sets
ko - Korean character sets
ru - Cyrillic character sets
th - Thai character sets
zh - Chinese (both simplified and traditional) character sets

Important: This is a lot more effective than using a language-based filter in SpamAssassin.

By Language

While not always accurate and not as effective as marking by character set, SpamAssassin can attempt to determine what language an e-mail is written in and award, by default, 2.8 points to the e-mail if it is in a non-preferred language.

ok_languages all 
ok_languages en 
ok_languages en ja zh 

The first setting permits all languages to come through with no additional score added to the e-mail. The second adds 2.8 points to all e-mails not written in English. You can list multiple languages by separating each language code with a space, as in the third setting, which accepts e-mails written in English, Japanese, and Chinese.
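
To see what the 2.8 points mean in practice, the arithmetic can be worked through in the shell. The base scores below are invented for illustration; the 2.8 penalty and the threshold of 5 (required_hits in the earlier example) come from this tutorial. The function name score_verdict is likewise an assumption of this sketch.

```shell
# score_verdict BASE PENALTY THRESHOLD (hypothetical helper)
# Adds the language penalty to a base score and compares the total
# with the spam threshold.
score_verdict() {
    LC_ALL=C awk -v b="$1" -v p="$2" -v t="$3" 'BEGIN {
        total = b + p
        verdict = (total >= t) ? "spam" : "ham"
        printf "%.1f + %.1f = %.1f -> %s\n", b, p, total, verdict
    }'
}

# 2.5 and 1.0 are made-up base scores from other rules.
score_verdict 2.5 2.8 5
score_verdict 1.0 2.8 5
```

In other words, the language penalty alone does not mark a message as spam at the default threshold; it only tips messages that already look somewhat suspicious.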

The following list gives each recognized language code and its corresponding name:

af - Afrikaans
am - Amharic
ar - Arabic
be - Byelorussian
bg - Bulgarian
bs - Bosnian
ca - Catalan
cs - Czech
cy - Welsh
da - Danish
de - German
el - Greek
en - English
eo - Esperanto
es - Spanish
et - Estonian
eu - Basque
fa - Persian
fi - Finnish
fr - French
fy - Frisian
ga - Irish Gaelic
gd - Scottish Gaelic
he - Hebrew
hi - Hindi
hr - Croatian
hu - Hungarian
hy - Armenian
id - Indonesian
is - Icelandic
it - Italian
ja - Japanese
ka - Georgian
ko - Korean
la - Latin
lt - Lithuanian
lv - Latvian
mr - Marathi
ms - Malay
ne - Nepali
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
qu - Quechua
rm - Rhaeto-Romance
ro - Romanian
ru - Russian
sa - Sanskrit
sco - Scots
sk - Slovak
sl - Slovenian
sq - Albanian
sr - Serbian
sv - Swedish
sw - Swahili
ta - Tamil
th - Thai
tl - Tagalog
tr - Turkish
uk - Ukrainian
vi - Vietnamese
yi - Yiddish
zh - Chinese (both Traditional and Simplified)
zh.big5 - Chinese (Traditional only)
zh.gb2312 - Chinese (Simplified only)
Important: This is not always effective; SpamAssassin may fail to flag some e-mails written in an unwanted language (a false negative).
 

Implementing Custom Rules

Suppose that the stock rules included with SpamAssassin are insufficient to meet your demands. You would like to include some other pre-written rules, or perhaps even write your own ruleset. In either case, you will need to modify how maildrop handles the e-mails. Traditionally on the servers, maildrop calls spamc, a SpamAssassin client that connects to the SpamAssassin server (spamd).

Including Existing Rules

Custom rules may be added through

SpamAssassin has a repository of drop-in rulesets available, so all that is needed is to download a ruleset into /home/username/.spamassassin/ and reference it from your user_prefs file. Include the following line inside user_prefs:

include /home/username/.spamassassin/rulefile.cf

Substitute username with your username and rulefile.cf with the ruleset file you downloaded. If the file is saved somewhere other than inside your home directory, adjust the full path accordingly. Remember that received e-mails are processed chroot'd, so the path must be valid inside the chroot.
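
Adding the include line by hand is easy to get wrong if the ruleset is missing or the line is already there. The following is a minimal sketch of a safer way to do it; the add_include function and the demonstration directory are invented for this example, and the real files would live under /home/username/.spamassassin/.

```shell
# add_include PREFS RULEFILE (hypothetical helper)
# Appends "include RULEFILE" to PREFS unless that exact line already exists.
add_include() {
    target=$1
    rules=$2
    if [ ! -r "$rules" ]; then
        echo "cannot read ruleset: $rules" >&2
        return 1
    fi
    line="include $rules"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -e "$line" "$target" 2>/dev/null; then
        printf '%s\n' "$line" >> "$target"
    fi
}

# Demonstration with throwaway files.
demo_dir=$(mktemp -d)
touch "$demo_dir/rulefile.cf" "$demo_dir/user_prefs"
add_include "$demo_dir/user_prefs" "$demo_dir/rulefile.cf"
add_include "$demo_dir/user_prefs" "$demo_dir/rulefile.cf"   # second call changes nothing
cat "$demo_dir/user_prefs"
```

The second call leaves the file unchanged, so the helper can be re-run safely. It does not validate the ruleset itself; linting is still required.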

Important: always make sure your rules are understood by SpamAssassin by linting them via

spamassassin --lint

Writing Your Own

Occasionally, the stock configuration from SpamAssassin is ineffective at catching new variants of spam. You may implement custom rules to catch these deviations; however, custom rules are strongly discouraged by the SpamAssassin authors. Typically, training SpamAssassin on the missed messages and ramping up Bayes scoring are much more effective. Rules may be added to your user_prefs file or to another file and included via the include /home/username/somefile.ext syntax. For a good primer, see SpamAssassin's wiki entry on custom rules.

Important: always make sure your rules are understood by SpamAssassin by linting them via
spamassassin --lint
 

Whitelisting

It's occasionally useful to whitelist either individual addresses or entire domains, for example if you regularly receive e-mail from someone that is frequently marked as spam incorrectly. To do this, use the whitelist_from option in your user_prefs file. The argument to whitelist_from can be either a specific e-mail address or a wildcarded address, which lets you whitelist an entire domain. For example:

# Whitelist just someone@somewhere.com
whitelist_from someone@somewhere.com

# Whitelist mail from anyone at the myfriends.com domain
whitelist_from *@myfriends.com

You can whitelist your own domain, but be aware that spammers will often spoof addresses within your own domain, so this may not work well.
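
To get a feel for which senders a wildcarded entry covers, the pattern can be tried out with a shell glob. This is only an approximation of how SpamAssassin matches addresses: the matches_whitelist function is invented for this sketch, only the * wildcard is exercised, and shell matching is case-sensitive.

```shell
# matches_whitelist ADDRESS PATTERN (hypothetical helper)
# Returns 0 when ADDRESS matches the glob PATTERN, 1 otherwise.
matches_whitelist() {
    case $1 in
        $2) return 0 ;;   # unquoted $2 is interpreted as a glob pattern
        *)  return 1 ;;
    esac
}

# The addresses below are made up for the demonstration.
for addr in alice@myfriends.com bob@lists.myfriends.com eve@elsewhere.com; do
    if matches_whitelist "$addr" '*@myfriends.com'; then
        echo "$addr: whitelisted"
    else
        echo "$addr: not whitelisted"
    fi
done
```

Note that bob@lists.myfriends.com does not match *@myfriends.com, because the pattern requires the domain to follow the @ sign directly; a subdomain would need its own entry.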

Creating a spam trap

See SMTP: Creating a spam trap

Using Apis Networks' SpamAssassin Configuration Wizard

The SpamAssassin Configuration Wizard is designed to simplify the complex options you can modify in SpamAssassin. Additionally, under certain conditions it creates a maildrop recipe that moves marked spam to specific IMAP folders or deletes it entirely. This is an elegant way to modify some of the advanced options in SpamAssassin, and if you have FTP access, the wizard will automatically upload the new configuration files to your site. The wizard is available not only to sub-users of a site, but also to the primary user of a domain, called the Site Administrator.

