In many problems, such as classical detection and estimation, or Bayesian inference, we need a measure of closeness/similarity between two distributions. For instance, in hypothesis testing, we need to measure the "distance" between the hypothesized distribution and the true sample distribution. For generative models in machine learning, we need to measure the similarity between the distribution of the generated data and that of the true data so that we can optimize the parameters our generation model. To solve these problems, various statistical divergences are introduced.
For simplicity, we will avoid the notation of probability space. We will introduce statistical divergences using proper probability density functions (or probability mass functions). A statistical divergence between two probability distributions and maps two distributions to , and must satisfy the following two properties:
- if and only if = .
Here we do not require to be symmetric (hence in general). The two distributions and are very similar if is very close to zero.
Note: a statistical divergence is not necessarily a metric. Technically it is possible to define a metric over probability distributions under some conditions, and use this metric as a divergence.
In the following, we will give a brief list of commonly used statistical divergences. You can skip the details and jump to the quick reference card here.
The Kullback–Leibler divergence is probably the most commonly used statistical divergence. It is defined as follows:
It has other names such as information gain and relative entropy. The KL divergence has wide applications in machine learning, whenever you are maximizing some likelihood or posterior function, performing variational inference.
Example 1. Given two univariate Gaussians and , the KL divergence between and is given by
Note that when and .
The Jensen-Shannon divergence can be viewed as the symmetrized version of the KL divergence, which is defined as follows:
When symmetry is required, Jensen-Shannon divergence can be used to replace the KL divergence. Interestingly, in the original generative adversarial network (GAN) paper, the Jensen-Shannon divergence appears in the proof of Theorem 1, implying that the training process can be viewed as minimizing the Jensen-Shannon divergence between the true data and the generated data under some assumptions. For more details and relevant readings, check the following:
- I. J. Goodfellow et al., "Generative Adversarial Networks."
- How to Train your Generative Models? And why does Adversarial Training work so well?
- : we obtain KL divergence.
- : we obtain the total variation distance.
- : we obtain the squared Hellinger distance: (or ).
- : we obtain the Pearson's chi-square divergence: .
- : we obtain the Jensen-Shannon divergence.
S. Nowozin et al. from Microsoft Research show that f-divergence can be used for training generative neural samplers. For more details, check the following paper:
- S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: training generative neural samplers using variational divergence minimization," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 271–279.
For a more technical reading, check the following paper:
- M. D. Reid and R. C. Williamson, "Information, Divergence and Risk for Binary Experiments," Journal of Machine Learning Research, vol. 12, no. Mar, pp. 731–817, 2011.
in which the authors unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information.
Based on Rényi entropy, we can define another family of statistical divergences parameterized by . Given , the Rényi divergence is defined as
- When , we obtain the KL divergence.
- When , we obtain the doubled Bhattacharyya distance: . This can also be expressed in terms of the squared Hellinger distance: .
For more details, check the following paper:
- T. van Erven and P. Harremos, "Rényi divergence and Kullback-Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, Jul. 2014.
Bregman divergences defines a more general family of divergences. It can be defined between two general vectors (in the same vector space), not restricted to distributions. For completeness and its relationship to the KL divergence, we will include it here. Let be a differentiable strictly convex function, the Bregman divergence is defined as
Depending on the choice of , we obtain different divergences:
- : we obtain the squared Euclidean distance.
- : we obtain the squared Mahalanobis distance.
- : we obtain the discrete version of the KL divergence.
Bregman divergences find wide applications in machine learning. Moreover, in optimization, it is closely related to the mirror descent algorithm. For more details and relevant applications, check the following papers:
- A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, no. Oct, pp. 1705–1749, 2005. (Appendix A provides a good list of properties of Bregman divergence)
- A. Beck and M. Teboulle, "Mirror descent and nonlinear projected subgradient methods for convex optimization," Operations Research Letters, vol. 31, no. 3, pp. 167–175, May 2003.
- Meet the Bregman Divergences has a good list of references.
Quick Reference Card
KL divergence: .
Jensen-Shannon divergence: , .
f-divergence: , is convex, .
- : KL divergence.
- : Total variation distance.
- : Squared Hellinger distance.
- : Pearson's chi-square divergence.
- : Jensen-Shannon divergence.
Rényi divergence: , .
- : KL divergence.
- : Doubled Bhattacharyya distance.
Bregman divergence: , : differentiable strictly convex.
- : Squared Euclidean distance.
- : Squared Mahalanobis distance.
- : Discrete KL divergence.
S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, Mar. 1951. ↩
S. I. Amari, "-divergence is unique, belonging to both -divergence and Bregman divergence classes," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4925-4931, Nov. 2009. ↩
I. Taneja, "Relative information of type s, Csiszár’s f-divergence, and information inequalities," Information Sciences, vol. 166, no. 1-4, pp. 105–125, Oct. 2004. ↩
F. Nielsen and R. Nock, "On the chi square and higher-order chi distances for approximating f-divergences," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 10–13, Jan. 2014. ↩
The total variation distance is usually defined using supremum. Here if we directly plug in , we obtain an integral over the over the absolute differences between the two pdfs. To see why they are equivalent, check the discussion here. ↩