Spam filtering

Various anti-spam techniques are used to prevent email spam (unsolicited bulk email).

No technique is a complete solution to the spam problem, and each trades off incorrectly rejecting legitimate email (false positives) against failing to reject all spam (false negatives), along with the associated costs in time and effort.

Anti-spam techniques can be broken into four broad categories: those that require actions by individuals, those that can be automated by email administrators, those that can be automated by email senders and those employed by researchers and law enforcement officials.

People tend to be much less bothered by spam slipping through filters into their mailbox (false negatives) than by having desired email ("ham") blocked (false positives). Balancing false negatives (missed spam) against false positives (rejected good email) is critical for a successful anti-spam system. Some systems let individual users control this balance, for example by setting "spam score" limits. Most techniques produce both kinds of error, to varying degrees. For example, an anti-spam system may use techniques that have a high false negative rate (missing a lot of spam) in order to reduce the number of false positives (rejected good email).
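The effect of a "spam score" limit on the two error types can be illustrated with a small sketch. The scores, labels, and the function name `error_counts` below are hypothetical and chosen only for illustration; real filters derive scores from many weighted tests.

```python
from typing import List, Tuple

# Hypothetical (score, is_spam) pairs; the values are invented for illustration.
SCORED_MAIL: List[Tuple[float, bool]] = [
    (0.5, False),
    (2.0, False),
    (4.5, False),  # legitimate mail that happens to score fairly high
    (3.0, True),   # spam that happens to score fairly low
    (6.0, True),
    (9.0, True),
]


def error_counts(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when mail scoring
    at or above `threshold` is rejected as spam."""
    false_positives = sum(
        1 for score, is_spam in scored if score >= threshold and not is_spam
    )
    false_negatives = sum(
        1 for score, is_spam in scored if score < threshold and is_spam
    )
    return false_positives, false_negatives


if __name__ == "__main__":
    for limit in (2.5, 5.0):
        fp, fn = error_counts(SCORED_MAIL, limit)
        print(f"limit={limit}: false positives={fp}, false negatives={fn}")
```

With this made-up data, a low limit of 2.5 rejects one legitimate message and misses no spam, while a higher limit of 5.0 rejects no legitimate mail but lets one spam message through. Raising the limit trades false positives for false negatives.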

Detecting spam based on the content of the email, either by detecting keywords such as "viagra" or by statistical means (content-based or non-content-based), is very popular. Content-based statistical methods and keyword detection can be very accurate when they are correctly tuned to the types of legitimate email that an individual receives, but they can also make mistakes, such as detecting the keyword "cialis" in the word "specialist" (see also Internet censorship: Over- and under-blocking). Spam originators frequently seek to defeat such measures with typographical techniques: replacing letters with accented variants or with alternative characters that appear identical to the intended characters but are internally distinct (e.g., replacing a Latin 'A' with a Cyrillic 'А'), or inserting other characters such as whitespace, nonprinting characters, or bullets into a term to defeat pattern matching. This creates an arms race that demands increasingly complex keyword-detection methods.
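The "specialist" mistake and some of the evasion tricks described above can be sketched as follows. This is a minimal illustration, not how any particular filter works: the keyword list, the tiny homoglyph table, and the function names `naive_match`, `normalize`, and `word_match` are all assumptions made for this example.

```python
import re
import unicodedata
from typing import List

KEYWORDS = ["viagra", "cialis"]

# Tiny illustrative table mapping a few Cyrillic look-alikes to Latin letters.
# A real table would be far larger; this one is an assumption for the sketch.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0410": "A",  # Cyrillic capital A
    "\u0441": "c",  # Cyrillic small es
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u0455": "s",  # Cyrillic small dze
})


def naive_match(text: str) -> List[str]:
    """Plain substring search: flags 'cialis' inside 'specialist'."""
    lowered = text.lower()
    return [k for k in KEYWORDS if k in lowered]


def normalize(text: str) -> str:
    """Undo some obfuscation: accents, zero-width characters, look-alikes."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (Mn) and format characters (Cf, e.g. zero-width space).
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) not in ("Mn", "Cf")
    )
    return kept.translate(HOMOGLYPHS).lower()


def word_match(text: str) -> List[str]:
    """Match keywords only as whole words, after normalization."""
    normalized = normalize(text)
    return [
        k for k in KEYWORDS
        if re.search(r"\b" + re.escape(k) + r"\b", normalized)
    ]
```

Here `naive_match("Ask a specialist")` reports "cialis", whereas `word_match` does not, and `word_match` still catches "viagra" written with an accented letter, a zero-width space, or a Cyrillic 'а'. The sketch does not handle visible inserted characters such as spaces or bullets ("v i a g r a"), which is part of why the arms race continues.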

...
Wikipedia