Kullback–Leibler divergence


In mathematical statistics, the Kullback–Leibler divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. In contrast to variation of information, it is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread. In the simple case, a Kullback–Leibler divergence of 0 indicates that the two distributions are identical (almost everywhere), while larger values indicate that they behave increasingly differently; the divergence has no upper bound and is infinite when the second distribution assigns zero probability to an outcome that the first can produce. In somewhat simplified terms, it is a measure of surprise, with applications in fields as diverse as applied statistics, fluid mechanics, neuroscience, and machine learning.

The Kullback–Leibler divergence was originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions; Kullback himself preferred the name discrimination information. The measure is discussed in Kullback's historic text, Information Theory and Statistics.

The Kullback–Leibler divergence from Q to P is often denoted D_KL(P‖Q).
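For reference, the standard definition for discrete distributions P and Q over the same set of outcomes can be written as below. The formula is not stated in this excerpt; it is the usual textbook form, with the logarithm base (2 for bits, e for nats) left as a choice of units, and with terms where P(x) = 0 taken to contribute zero.

```latex
D_{\mathrm{KL}}(P \parallel Q) \;=\; \sum_{x} P(x) \,\log \frac{P(x)}{Q(x)}
```

The sum is finite only if Q(x) > 0 wherever P(x) > 0.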

In the context of machine learning, D_KL(P‖Q) is often called the information gain achieved if P is used instead of Q. By analogy with information theory, it is also called the relative entropy of P with respect to Q. In the context of coding theory, D_KL(P‖Q) can be construed as measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P.
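The coding-theory reading can be checked numerically. The following is a minimal sketch, assuming two made-up distributions over four symbols; the function names and the values of P and Q are illustrative and not taken from the article. It compares the divergence with the difference between the cross-entropy (expected code length under a code optimized for Q) and the entropy of P.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Expected bits per symbol when samples from P use a code optimized for Q.

    Assumes q > 0 wherever p > 0.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """Divergence from Q to P in bits; infinite if Q rules out an outcome P allows."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

# Illustrative distributions over four symbols (assumed values).
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.125, 0.5, 0.25, 0.125]

# Extra bits paid for using Q's code on data that actually follows P.
extra_bits = cross_entropy_bits(P, Q) - entropy_bits(P)

print(kl_bits(P, Q))   # divergence from Q to P
print(extra_bits)      # same quantity, via code lengths
print(kl_bits(Q, P))   # swapping the arguments gives a different value
```

With these values the first two printed numbers agree, and the third differs, which illustrates the asymmetry of the measure.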

Expressed in the language of Bayesian inference, D_KL(P‖Q) is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P. In applications, P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q typically represents a theory, model, description, or approximation of P. In order to find a distribution Q that is closest to P, we can minimize KL divergence and compute an information projection.
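As a rough illustration of finding the Q closest to P, the sketch below searches a hypothetical one-parameter model family (binomial distributions with two trials) for the member with the smallest divergence from Q to a fixed P. The target distribution, the family, and the grid search are all assumptions chosen for brevity; practical methods would normally use an analytic or gradient-based minimizer.

```python
import math

def kl_divergence(p, q):
    """Divergence from Q to P in nats; infinite if q == 0 where p > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def binomial_pmf(n, theta):
    """Probabilities of 0..n successes in n trials with success rate theta."""
    return [math.comb(n, k) * theta**k * (1 - theta)**(n - k)
            for k in range(n + 1)]

# Assumed "true" distribution over the outcomes 0, 1, 2 (not itself binomial).
P = [0.5, 0.2, 0.3]

# Candidate models Q_theta: a coarse grid over the open interval (0, 1).
candidates = [i / 100 for i in range(1, 100)]

# Pick the candidate whose divergence from P is smallest.
best_theta = min(candidates, key=lambda t: kl_divergence(P, binomial_pmf(2, t)))
best_kl = kl_divergence(P, binomial_pmf(2, best_theta))

print(best_theta, best_kl)
```

In this setup the selected parameter matches the mean of P divided by the number of trials, and the remaining divergence is positive because P lies outside the family.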


...
Wikipedia

...