
Point-biserial correlation coefficient


The point-biserial correlation coefficient ($r_{pb}$) is a correlation coefficient used when one variable (e.g. Y) is dichotomous. Y can be "naturally" dichotomous, like whether a coin lands heads or tails, or it can be an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. An artificially dichotomized variable may be conceptualized as having an underlying continuity; if so, a biserial correlation is the more appropriate calculation.

The point-biserial correlation is mathematically equivalent to the Pearson (product-moment) correlation: if we have one continuously measured variable X and a dichotomous variable Y, then $r_{XY} = r_{pb}$. This can be shown by assigning two distinct numerical values to the dichotomous variable.
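The equivalence can be checked numerically. The sketch below is only an illustration: the data, the label names, and the helper `pearson_r` are made up for this example. It codes the same two-valued variable once as 0/1 and once as an arbitrary increasing pair (10 and 25), and computes the Pearson correlation with X in both cases.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: X is continuous, Y is a two-valued label.
x = [2.0, 3.5, 4.0, 5.5, 6.0, 7.5]
labels = ["tails", "tails", "heads", "tails", "heads", "heads"]

# Code the labels as 0/1, then as an arbitrary increasing pair (10, 25).
y_01 = [1 if lab == "heads" else 0 for lab in labels]
y_alt = [25 if lab == "heads" else 10 for lab in labels]

r_01 = pearson_r(x, y_01)
r_alt = pearson_r(x, y_alt)
print(r_01, r_alt)  # the two values agree up to floating-point rounding
```

Any two distinct codes give the same magnitude; swapping which category gets the larger code only flips the sign.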

To calculate $r_{pb}$, assume that the dichotomous variable Y has the two values 0 and 1. If we divide the data set into two groups, group 1 which received the value "1" on Y and group 2 which received the value "0" on Y, then the point-biserial correlation coefficient is calculated as follows:

$$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}$$

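where $s_n$ is the standard deviation used when you have data for every member of the population (with $\bar{X}$ the mean of all $n$ values of X):

$$s_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$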

Here $M_1$ is the mean value on the continuous variable X for all data points in group 1, and $M_0$ is the mean value on X for all data points in group 2. Further, $n_1$ is the number of data points in group 1, $n_0$ is the number of data points in group 2, and $n$ is the total sample size. This is a computational formula derived from the formula for $r_{XY}$ to reduce the number of steps in the calculation; it is easier to compute than $r_{XY}$.
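As a minimal sketch of the computational formula, assuming Y is coded as 0 and 1 and both groups are non-empty, the function below uses Python's `statistics.pstdev`, which divides by n, for the population standard deviation. The function name `point_biserial` and the data are illustrative only.

```python
import math
import statistics

def point_biserial(x, y):
    """Point-biserial correlation via (M1 - M0) / s_n * sqrt(n1 * n0 / n**2).

    y must contain only 0s and 1s, with at least one of each.
    """
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0, n = len(group1), len(group0), len(x)
    m1 = statistics.fmean(group1)  # mean of X where Y == 1
    m0 = statistics.fmean(group0)  # mean of X where Y == 0
    s_n = statistics.pstdev(x)     # population standard deviation (divides by n)
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n**2)

# Hypothetical data for illustration.
x = [2.0, 3.5, 4.0, 5.5, 6.0, 7.5]
y = [0, 0, 1, 0, 1, 1]
r_pb = point_biserial(x, y)
print(round(r_pb, 3))  # approximately 0.603
```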

There is an equivalent formula that uses $s_{n-1}$:

$$r_{pb} = \frac{M_1 - M_0}{s_{n-1}} \sqrt{\frac{n_1 n_0}{n(n-1)}}$$

where $s_{n-1}$ is the standard deviation used when you only have data for a sample of the population:

$$s_{n-1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$

Note that this is merely an equivalent formula that gives the same value of $r_{pb}$. It is not a separate formula for the case where you only have sample data, and no such version exists. The $s_{n-1}$ form is useful when you are calculating point-biserial correlation coefficients in a programming language or other environment that provides a function for $s_{n-1}$ but not one for $s_n$.
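The two forms agree because $s_n = s_{n-1}\sqrt{(n-1)/n}$, so the change in the standard deviation is exactly offset by the change under the square root. The sketch below illustrates this with made-up data and illustrative function names: `statistics.stdev` divides by n − 1 and `statistics.pstdev` divides by n.

```python
import math
import statistics

def split_groups(x, y):
    """Return the X values for Y == 1 and for Y == 0 (Y coded as 0/1)."""
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    return group1, group0

def r_pb_population_sd(x, y):
    """Form using s_n: (M1 - M0) / s_n * sqrt(n1 * n0 / n**2)."""
    g1, g0 = split_groups(x, y)
    n1, n0, n = len(g1), len(g0), len(x)
    diff = statistics.fmean(g1) - statistics.fmean(g0)
    return diff / statistics.pstdev(x) * math.sqrt(n1 * n0 / n**2)

def r_pb_sample_sd(x, y):
    """Form using s_{n-1}: (M1 - M0) / s_{n-1} * sqrt(n1 * n0 / (n * (n - 1)))."""
    g1, g0 = split_groups(x, y)
    n1, n0, n = len(g1), len(g0), len(x)
    diff = statistics.fmean(g1) - statistics.fmean(g0)
    return diff / statistics.stdev(x) * math.sqrt(n1 * n0 / (n * (n - 1)))

# Hypothetical data for illustration.
x = [2.0, 3.5, 4.0, 5.5, 6.0, 7.5]
y = [0, 0, 1, 0, 1, 1]
r_a = r_pb_population_sd(x, y)
r_b = r_pb_sample_sd(x, y)
print(math.isclose(r_a, r_b))  # both forms give the same coefficient
```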

Wikipedia