In statistics, omitted-variable bias (OVB) occurs when a statistical model incorrectly leaves out one or more relevant variables. The bias arises because the model attributes the effect of the missing variables to those that were included, over- or underestimating the effect of the included variables.
More specifically, OVB is the bias that appears in the parameter estimates of a regression analysis when the assumed specification is incorrect in that it omits an independent variable that is correlated with both the dependent variable and one or more of the included independent variables.
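Two conditions must hold for omitted-variable bias to exist in linear regression:
- the omitted variable must be a determinant of the dependent variable (that is, its true regression coefficient must be non-zero); and
- the omitted variable must be correlated with at least one independent variable included in the regression.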
Suppose the true cause-and-effect relationship is given by

$$y = a + bx + cz + u$$

with parameters $a$, $b$, $c$, dependent variable $y$, independent variables $x$ and $z$, and error term $u$. We wish to know the effect of $x$ itself upon $y$ (that is, we wish to obtain an estimate of $b$). But suppose that we omit $z$ from the regression, and suppose the relation between $x$ and $z$ is given by

$$z = d + fx + e$$

with parameters $d$, $f$ and error term $e$. Substituting the second equation into the first gives

$$y = (a + cd) + (b + cf)x + (u + ce).$$
If $y$ is regressed on $x$ alone, this last equation is what is estimated, and the regression coefficient on $x$ is an estimate of $(b + cf)$. It therefore captures not only the desired direct effect of $x$ upon $y$ (which is $b$) but the sum of that effect and the indirect effect (the effect $f$ of $x$ on $z$ times the effect $c$ of $z$ on $y$). Thus, by omitting the variable $z$ from the regression, we have estimated the total derivative of $y$ with respect to $x$ rather than its partial derivative with respect to $x$. The two differ if both $c$ and $f$ are non-zero.
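The result above can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation: the parameter values, sample size, random seed, and normally distributed errors are all assumptions chosen for illustration, and NumPy's least-squares solver stands in for any OLS routine. It regresses y on x alone (the "short" regression) and on both x and z (the "long" regression).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical parameter values, chosen only for illustration.
a, b, c = 1.0, 2.0, 3.0   # true model:          y = a + b*x + c*z + u
d, f = 0.5, 0.8           # auxiliary relation:  z = d + f*x + e

x = rng.normal(size=n)
e = rng.normal(size=n)
u = rng.normal(size=n)
z = d + f * x + e
y = a + b * x + c * z + u

# Short regression: y on a constant and x only (z is omitted).
X_short = np.column_stack([np.ones(n), x])
coef_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

# Long regression: y on a constant, x, and z.
X_long = np.column_stack([np.ones(n), x, z])
coef_long, *_ = np.linalg.lstsq(X_long, y, rcond=None)

print("short-regression slope on x:", coef_short[1], "| b + c*f =", b + c * f)
print("long-regression slope on x: ", coef_long[1], "| b =", b)
```

Under these assumptions the short-regression slope should settle near b + cf and its intercept near a + cd, while the long-regression slope should settle near b. Setting either `c` or `f` to zero should make the two slopes agree up to sampling noise.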
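As an example, consider a linear model of the form

$$y_i = x_i \beta + z_i \delta + u_i, \qquad i = 1, \dots, n,$$

where
- $x_i$ is a $1 \times p$ row vector of values of $p$ independent variables for observation $i$;
- $\beta$ is a $p \times 1$ column vector of unobservable parameters (the response coefficients of the dependent variable to each of the $p$ independent variables in $x_i$) to be estimated;
- $z_i$ is a scalar, the value of another independent variable for observation $i$;
- $\delta$ is a scalar, the unobservable parameter (the response coefficient of the dependent variable to $z_i$) to be estimated;
- $u_i$ is the unobservable error term for observation $i$;
- $y_i$ is the value of the dependent variable for observation $i$.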
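We collect the observations of all variables subscripted $i = 1, \dots, n$, and stack them one below another, to obtain the matrix $X$ and the vectors $Y$, $Z$, and $U$:

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad Z = \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix}, \quad U = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}.$$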
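The stacking step can be sketched in NumPy as follows. This is an illustrative sketch only: the sizes `n` and `p`, the coefficient names `beta` and `delta`, their values, and the randomly generated observations are all assumptions introduced here, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2  # hypothetical number of observations and of regressors in x_i

# One observation per index i: x_i is a length-p row, z_i and u_i are scalars.
x_rows = [rng.normal(size=p) for _ in range(n)]
z_vals = [rng.normal() for _ in range(n)]
u_vals = [rng.normal() for _ in range(n)]

beta = np.array([[1.0], [-2.0]])  # hypothetical p x 1 coefficient vector
delta = 0.5                       # hypothetical scalar coefficient on z

# Stack the observations one below another.
X = np.vstack(x_rows)                # n x p matrix
Z = np.array(z_vals).reshape(n, 1)   # n x 1 vector
U = np.array(u_vals).reshape(n, 1)   # n x 1 vector

# Row i of Y is x_i @ beta + z_i * delta + u_i.
Y = X @ beta + Z * delta + U         # n x 1 vector

print(X.shape, Y.shape, Z.shape, U.shape)
```

Each row of `Y` reproduces the per-observation model, so the stacked form is just a compact way of writing all n equations at once.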