Point-biserial correlation coefficient

The point-biserial correlation coefficient (r_pb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be "naturally" dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a variable is artificially dichotomized, the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.

The point-biserial correlation is mathematically equivalent to the Pearson (product moment) correlation, that is, if we have one continuously measured variable X and a dichotomous variable Y, r_XY = r_pb. This can be shown by assigning two distinct numerical values to the dichotomous variable.
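As an illustration of why the particular numerical coding does not matter, the following minimal sketch computes the Pearson correlation for the same dichotomy coded in two different ways. The data, the group labels "a" and "b", and the helper name `pearson` are hypothetical and chosen only for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: x is continuous, group is dichotomous.
x = [2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 3.0, 5.1]
group = ["a", "b", "a", "b", "b", "b", "a", "b"]

# Two arbitrary numeric codings of the same dichotomy.
y_01 = [1 if g == "b" else 0 for g in group]
y_37 = [7 if g == "b" else 3 for g in group]

r_01 = pearson(x, y_01)
r_37 = pearson(x, y_37)

# Both codings give the same coefficient (up to floating-point error).
print(r_01, r_37)
```

Any two distinct values give the same magnitude; swapping which group receives the larger value only flips the sign of the coefficient.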

To calculate r_pb, assume that the dichotomous variable Y has the two values 0 and 1. If we divide the data set into two groups, group 1, which received the value "1" on Y, and group 2, which received the value "0" on Y, then the point-biserial correlation coefficient is calculated as follows:

$$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}$$

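where s_n is the standard deviation used when you have data for every member of the population, with \bar{X} denoting the mean of X over all n data points:

$$s_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$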

Here M₁ is the mean value on the continuous variable X for all data points in group 1, and M₀ is the mean value on the continuous variable X for all data points in group 2. Further, n₁ is the number of data points in group 1, n₀ is the number of data points in group 2, and n is the total sample size. The formula for r_pb is a computational formula derived from the formula for r_XY in order to reduce the number of steps in the calculation; it is easier to compute than r_XY.
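The computational formula can be sketched in Python using only the standard library. This is an illustrative implementation under the assumption that Y is coded 0/1; the function names and the sample data are hypothetical. It uses r_pb = (M₁ − M₀) / s_n · sqrt(n₁ n₀ / n²) and checks the result against a direct Pearson calculation.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(xs, ys):
    """r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n**2), with ys coded 0/1."""
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    n1, n0, n = len(group1), len(group0), len(xs)
    s_n = pstdev(xs)  # population standard deviation (divides by n)
    return (mean(group1) - mean(group0)) / s_n * sqrt(n1 * n0 / n**2)

def pearson(xs, ys):
    """Direct Pearson product-moment correlation, for comparison."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: x is continuous, y is the 0/1 group indicator.
x = [2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 3.0, 5.1]
y = [0, 1, 0, 1, 1, 1, 0, 1]

r_formula = point_biserial(x, y)
r_pearson = pearson(x, y)

# The two values agree up to floating-point error.
print(r_formula, r_pearson)
```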

There is an equivalent formula that uses s_n₋₁:

$$r_{pb} = \frac{M_1 - M_0}{s_{n-1}} \sqrt{\frac{n_1 n_0}{n(n-1)}}$$

where s_n₋₁ is the standard deviation used when you only have data for a sample of the population:

$$s_{n-1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$

Note that this is merely an equivalent formula. It is not a variant intended for the case where you only have sample data; no such variant exists. The version using s_n₋₁ is useful if you are calculating point-biserial correlation coefficients in a programming language or other development environment that provides a function for calculating s_n₋₁ but not one for calculating s_n.
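To illustrate the equivalence, the sketch below computes r_pb twice with Python's standard `statistics` module: once with `pstdev` (which divides by n) and the factor sqrt(n₁ n₀ / n²), and once with `stdev` (which divides by n − 1) and the factor sqrt(n₁ n₀ / (n(n − 1))). The function names and the data are hypothetical, and Y is assumed to be coded 0/1.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

def split_groups(xs, ys):
    """Return the X values for group 1 (y == 1) and group 0 (y == 0)."""
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    return group1, group0

def rpb_population_sd(xs, ys):
    """Point-biserial correlation using s_n (divides by n)."""
    g1, g0 = split_groups(xs, ys)
    n1, n0, n = len(g1), len(g0), len(xs)
    return (mean(g1) - mean(g0)) / pstdev(xs) * sqrt(n1 * n0 / n**2)

def rpb_sample_sd(xs, ys):
    """Equivalent form using s_{n-1} (divides by n - 1)."""
    g1, g0 = split_groups(xs, ys)
    n1, n0, n = len(g1), len(g0), len(xs)
    return (mean(g1) - mean(g0)) / stdev(xs) * sqrt(n1 * n0 / (n * (n - 1)))

# Hypothetical data: x is continuous, y is the 0/1 group indicator.
x = [2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 3.0, 5.1]
y = [0, 1, 0, 1, 1, 1, 0, 1]

r_population = rpb_population_sd(x, y)
r_sample = rpb_sample_sd(x, y)

# Both forms yield the same coefficient (up to floating-point error).
print(r_population, r_sample)
```

The two forms agree because s_n₋₁ = s_n · sqrt(n / (n − 1)), so the change in the standard deviation is cancelled exactly by the change in the factor under the square root.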

...
Wikipedia