In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy.
Suppose we have samples of two discrete random variables, X and Y. By constructing the joint distribution, PX,Y(x, y), from which we can calculate the conditional distributions, PX|Y(x|y) = PX,Y(x, y)/PY(y) and PY|X(y|x) = PX,Y(x, y)/PX(x), and calculating the various entropies, we can determine the degree of association between the two variables.
The entropy of a single distribution is given as:
while the conditional entropy is given as:
The uncertainty coefficient or proficiency is defined as:
and tells us: given Y, what fraction of the bits of X can we predict? (The above expression makes clear that the uncertainty coefficient is a normalised mutual information I(X;Y).) In this case we can think of X as containing the "true" values. Note that the value of U (but not H!) is independent of the base of the log since all logarithms are proportional.
The uncertainty coefficient is useful for measuring the validity of a statistical classification algorithm and has the advantage over simpler accuracy measures such as precision and recall in that it is not affected by the relative fractions of the different classes, i.e., P(x) . It also has the unique property that it won't penalize an algorithm for predicting the wrong classes, so long as it does so consistently (i.e., it simply rearranges the classes). This is useful in evaluating clustering algorithms since cluster labels typically have no particular ordering.
Symmetrised: The uncertainty coefficient is not symmetric with respect to the roles of X and Y. The roles can be reversed and a symmetrical measure thus defined as a weighted average between the two: