
Bayesian spam filtering


Naive Bayes classifiers are a popular statistical technique for e-mail filtering. They typically use bag-of-words features to identify spam e-mail, an approach commonly used in text classification.

Naive Bayes classifiers work by correlating the use of tokens (typically words, though sometimes other features) with spam and non-spam e-mails, and then using Bayes' theorem to calculate the probability that an e-mail is or is not spam.
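The token-correlation idea described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular filter: the function names (tokenize, train, spam_probability), the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration, and the sketch assumes the training data contains at least one spam and one non-spam message.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens (a bag-of-words view)."""
    return text.lower().split()


def train(messages):
    """Count token occurrences per class from (text, is_spam) pairs."""
    token_counts = {True: Counter(), False: Counter()}
    message_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        message_counts[is_spam] += 1
        token_counts[is_spam].update(tokenize(text))
    return token_counts, message_counts


def spam_probability(text, token_counts, message_counts):
    """Estimate P(spam | tokens), treating tokens as independent given the class."""
    vocabulary = set(token_counts[True]) | set(token_counts[False])
    total_messages = sum(message_counts.values())
    log_scores = {}
    for label in (True, False):
        # Log prior: fraction of training messages in this class.
        score = math.log(message_counts[label] / total_messages)
        class_total = sum(token_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # Tokens never seen in training carry no evidence here.
            # Add-one (Laplace) smoothing avoids zero probabilities (an assumption
            # of this sketch, not something the text prescribes).
            likelihood = (token_counts[label][token] + 1) / (class_total + len(vocabulary))
            score += math.log(likelihood)
        log_scores[label] = score
    # Bayes' theorem: normalise the two joint scores into a posterior probability.
    peak = max(log_scores.values())
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])
```

After training on a handful of labelled messages, calling spam_probability on a new message returns a value between 0 and 1; a filter would then compare that value against a threshold of its choosing. Working in log space keeps long messages from underflowing to zero when many small likelihoods are multiplied together.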

Naive Bayes spam filtering is a baseline technique for dealing with spam. It can tailor itself to the email needs of individual users and yields false-positive rates low enough to be generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.

The first known mail-filtering program to use a naive Bayes classifier was Jason Rennie's ifile program, released in 1996. The program was used to sort mail into folders. The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998. That work was deployed in commercial spam filters soon thereafter. However, in 2002 Paul Graham greatly decreased the false positive rate, so that a Bayesian filter could be used on its own as a single spam filter.

Variants of the basic technique have been implemented in a number of research works and commercial software products. Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself. CRM114, often cited as a Bayesian filter, is not intended to use a Bayes filter in production, but includes the "unigram" feature for reference.

