Sørensen similarity index


The Sørensen–Dice index, also known by several other names (see below), is a statistic used to compare the similarity of two samples. It was developed independently by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively. It is also known as the F1 score or the Dice similarity coefficient (DSC).

The index is known by several other names, most commonly the Sørensen index or Dice's coefficient. Other variants use "similarity coefficient" or "index" in the name. Common alternate spellings of Sørensen are Sorenson, Soerenson and Sörenson, and all three also appear with the -sen ending.

Other names include:

Sørensen's original formula was intended to be applied to presence/absence data, and is

$$QS = \frac{2|X \cap Y|}{|X| + |Y|}$$

where |X| and |Y| are the numbers of elements in the two samples and |X ∩ Y| is the number of elements common to both. Expressed in terms of true positives (TP), false positives (FP) and false negatives (FN), the index is

$$QS = \frac{2TP}{2TP + FP + FN}$$

as compared with the Jaccard index, which only counts true positives once in both the numerator and denominator. QS is the quotient of similarity and ranges between 0 and 1. It can be viewed as a similarity measure over sets.
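The set form and the count form of the index can be checked against each other with a small sketch. The function names `dice`, `dice_from_counts` and `jaccard` are hypothetical helpers introduced here for illustration, and returning 1.0 for two empty samples is an assumption, since the quotient is undefined in that case.

```python
def dice(x: set, y: set) -> float:
    """Sørensen–Dice index of two sets: 2|X ∩ Y| / (|X| + |Y|)."""
    if not x and not y:
        return 1.0  # assumption: two empty samples count as identical
    return 2 * len(x & y) / (len(x) + len(y))


def dice_from_counts(tp: int, fp: int, fn: int) -> float:
    """Same index from counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


def jaccard(x: set, y: set) -> float:
    """Jaccard index: |X ∩ Y| / |X ∪ Y| (shared elements counted once)."""
    if not x and not y:
        return 1.0  # same assumption as in dice()
    return len(x & y) / len(x | y)


x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}

tp = len(x & y)  # elements present in both samples
fp = len(y - x)  # elements only in y
fn = len(x - y)  # elements only in x

print(dice(x, y))                    # 4/7
print(dice_from_counts(tp, fp, fn))  # 4/7, same value
print(jaccard(x, y))                 # 2/5
```

For these two samples the Dice value (4/7) is larger than the Jaccard value (2/5), which reflects the shared elements being counted twice in the numerator and denominator.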

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors A and B:

$$s_v = \frac{2|A \cdot B|}{|A|^2 + |B|^2}$$

which gives the same outcome over binary vectors and also gives a more general similarity metric over vectors in general.
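A minimal sketch of the vector form, 2|A·B| / (|A|² + |B|²), written in plain Python. The name `dice_vectors` is a hypothetical helper, and the handling of two all-zero vectors (returning 1.0) is an assumption rather than part of the definition.

```python
def dice_vectors(a, b) -> float:
    """Vector form of the index: 2|A·B| / (|A|^2 + |B|^2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    norms = sum(p * p for p in a) + sum(q * q for q in b)
    if norms == 0:
        return 1.0  # assumption: two all-zero vectors count as identical
    return 2 * abs(dot) / norms


# Binary vectors encoding the sets {a, b, c} and {b, c, d, e}
# over the universe (a, b, c, d, e).
a = [1, 1, 1, 0, 0]
b = [0, 1, 1, 1, 1]
print(dice_vectors(a, b))  # 4/7, matching the set-based value

# The same expression also accepts real-valued vectors.
print(dice_vectors([1.0, 2.0], [2.0, 4.0]))  # 0.8
```

For binary vectors the dot product counts the shared elements and each squared norm counts the elements of one set, so the result agrees with the set formula.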

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities:

$$s = \frac{2|X \cap Y|}{|X| + |Y|}$$

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:

$$s = \frac{2 n_t}{n_x + n_y}$$

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x and n_y is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni, ig, gh, ht}
{na, ac, ch, ht}

Each set has four elements, and the intersection of these two sets has only one element: ht. Inserting these numbers into the formula gives

$$s = \frac{2 \cdot 1}{4 + 4} = 0.25$$
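The bigram calculation can be sketched in a few lines of Python, taking the words `night` and `nacht` as input. The helper names `bigrams` and `dice_bigram` are hypothetical, and the sketch follows the text in using sets of bigrams, so a bigram that repeats within a string is counted once; that choice, and the value returned for strings too short to have any bigram, are assumptions.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigram(x: str, y: str) -> float:
    """String similarity s = 2 * n_t / (n_x + n_y) over bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        return 1.0  # assumption: neither string has a bigram to compare
    n_t = len(bx & by)  # bigrams found in both strings
    return 2 * n_t / (len(bx) + len(by))


print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))       # ['ac', 'ch', 'ht', 'na']
print(dice_bigram("night", "nacht"))  # 0.25
```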

