# A Brief List of Statistical Divergences

Date: 3-9 2018

Tags: statistics, deep learning

In many problems, such as classical detection and estimation, or Bayesian inference, we need a measure of closeness/similarity between two distributions. For instance, in hypothesis testing, we need to measure the "distance" between the hypothesized distribution and the true sample distribution. For generative models in machine learning, we need to measure the similarity between the distribution of the generated data and that of the true data so that we can optimize the parameters our generation model. To solve these problems, various *statistical divergences* are introduced.

For simplicity, we will avoid the notation of probability space. We will introduce statistical divergences using proper probability density functions (or probability mass functions). A statistical divergence $D(P||Q)$ between two probability distributions $P$ and $Q$ maps two distributions to $\mathbb{R}$, and must satisfy the following two properties:

- $D(P\|Q) \geq 0$;
- $D(P\|Q) = 0$ if and only if $P$ = $Q$.

Here we do not require $D(\cdot\|\cdot)$ to be symmetric (hence $D(P\|Q) \neq D(Q\|P)$ in general). The two distributions $P$ and $Q$ are very similar if $D(P\|Q)$ is very close to zero.

Note:a statistical divergence is not necessarily a metric. Technically it is possible to define a metric over probability distributions under some conditions, and use this metric as a divergence.

In the following, we will give a brief list of commonly used statistical divergences. You can skip the details and jump to the quick reference card here.

## KL Divergence

The *Kullback–Leibler divergence*^{[1]} is probably the most commonly used statistical divergence. It is defined as follows:

It has other names such as *information gain* and *relative entropy*. The KL divergence has wide applications in machine learning, whenever you are maximizing some likelihood or posterior function, performing variational inference.

**Example 1.** Given two univariate Gaussians $P \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q \sim \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence between $P$ and $Q$ is given by

Note that $D_{\mathrm{KL}}(P\|Q) = 0$ when $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$.

## Jensen-Shannon Divergence

The *Jensen-Shannon divergence* can be viewed as the symmetrized version of the KL divergence, which is defined as follows:

where $M = (P + Q)/2$.

When symmetry is required, Jensen-Shannon divergence can be used to replace the KL divergence. Interestingly, in the original generative adversarial network (GAN) paper, the Jensen-Shannon divergence appears in the proof of Theorem 1, implying that the training process can be viewed as minimizing the Jensen-Shannon divergence between the true data and the generated data under some assumptions. For more details and relevant readings, check the following:

- I. J. Goodfellow et al., "Generative Adversarial Networks."
- How to Train your Generative Models? And why does Adversarial Training work so well?

## f-Divergence

Now we introduce a family of statistical divergences called *f-divergence*^{[2]}^{[3]}. Given a convex function $f: (0, \infty) \mapsto \mathbb{R}$ satisfying $f(1) = 0$, the f-divergence is defined as

The non-negativity can be verified by combining Jensen's inequality and the fact that $f(x) = 0$. By choosing different $f$, we can obtain different divergences^{[4]}:

- $f(u) = u\log u$: we obtain KL divergence.
- $f(u) = |u - 1|$: we obtain the total variation distance
^{[5]}. - $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$: we obtain the squared Hellinger distance: $\frac{1}{2}\int (\sqrt{p(x)} - \sqrt{q(x)})^2 \mathrm{d}x$ (or $1 - \int\sqrt{p(x)q(x)} \mathrm{d}x$).
- $f(u) = (u - 1)^2$: we obtain the Pearson's chi-square divergence: $\int (p(x) - q(x))^2 / q(x) \mathrm{d}x$.
- $f(u) = -(u + 1)\log\frac{1+u}{2} + u\log u$: we obtain the Jensen-Shannon divergence.

S. Nowozin et al. from Microsoft Research show that f-divergence can be used for training generative neural samplers. For more details, check the following paper:

- S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: training generative neural samplers using variational divergence minimization," in
*Advances in Neural Information Processing Systems 29*, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 271–279.

For a more technical reading, check the following paper:

- M. D. Reid and R. C. Williamson, "Information, Divergence and Risk for Binary Experiments,"
*Journal of Machine Learning Research*, vol. 12, no. Mar, pp. 731–817, 2011.

in which the authors unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information.

## Rényi Divergence

Based on Rényi entropy, we can define another family of statistical divergences parameterized by $\alpha$. Given $\alpha \in (0, 1) \cup (1, \infty)$, the *Rényi divergence* is defined as

Specifically,

- When $\alpha \to 1$, we obtain the KL divergence.
- When $\alpha = 1/2$, we obtain the doubled Bhattacharyya distance: $-2\log\int \sqrt{p(x)q(x)} \mathrm{d}x$. This can also be expressed in terms of the squared Hellinger distance: $-2\log(1 - \mathrm{Hell}^2(P, Q))$.

For more details, check the following paper:

- T. van Erven and P. Harremos, "Rényi divergence and Kullback-Leibler divergence,"
*IEEE Transactions on Information Theory*, vol. 60, no. 7, pp. 3797–3820, Jul. 2014.

## Bregman Divergence

*Bregman divergences* defines a more general family of divergences. It can be defined between two general vectors (in the same vector space), not restricted to distributions. For completeness and its relationship to the KL divergence, we will include it here. Let $\phi(\mathbf{x})$ be a differentiable strictly convex function, the Bregman divergence is defined as

Depending on the choice of $\phi$, we obtain different divergences:

- $\phi(\mathrm{x}) = \|\mathrm{x}\|_2^2$: we obtain the squared Euclidean distance.
- $\phi(\mathrm{x}) = \mathrm{x}^T \mathrm{Q} \mathrm{x}$: we obtain the squared Mahalanobis distance.
- $\phi(\mathrm{x}) = \sum_i x_i \log x_i$: we obtain the discrete version of the KL divergence.

Bregman divergences find wide applications in machine learning. Moreover, in optimization, it is closely related to the mirror descent algorithm. For more details and relevant applications, check the following papers:

- A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences,"
*Journal of Machine Learning Research*, vol. 6, no. Oct, pp. 1705–1749, 2005. (Appendix A provides a good list of properties of Bregman divergence) - A. Beck and M. Teboulle, "Mirror descent and nonlinear projected subgradient methods for convex optimization,"
*Operations Research Letters*, vol. 31, no. 3, pp. 167–175, May 2003. - Meet the Bregman Divergences has a good list of references.

## Quick Reference Card

**KL divergence:** $D_{\mathrm{KL}}(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x$.

**Jensen-Shannon divergence:** $D_{\mathrm{JS}}(P\|Q) = \frac{1}{2}(D_{\mathrm{KL}}(P\|M) + D_{\mathrm{KL}}(P\|M))$, $M = (P+Q)/2$.

**f-divergence:** $D_f(P\|Q) = \int f\left(\frac{p(x)}{q(x)}\right) q(x) \mathrm{d}x$, $f: (0, \infty) \mapsto \mathbb{R}$ is convex, $f(1) = 0$.

- $f(u) = u\log u$: KL divergence.
- $f(u) = |u - 1|$: Total variation distance.
- $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$: Squared Hellinger distance.
- $f(u) = (u - 1)^2$: Pearson's chi-square divergence.
- $f(u) = -(u + 1)\log\frac{1+u}{2} + u\log u$: Jensen-Shannon divergence.

**Rényi divergence:** $D_{\alpha}(P\|Q) = \frac{1}{\alpha - 1} \log \int p(x)^\alpha q(x)^{1 - \alpha} \mathrm{d}x$, $\alpha \in (0, 1) \cup (1, \infty)$.

- $\alpha \to 1$: KL divergence.
- $\alpha = 1/2$: Doubled Bhattacharyya distance.

**Bregman divergence:** $D_{\phi}(\mathbf{x} \| \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \mathbf{x} - \mathbf{y}, \nabla\phi(\mathbf{y}) \rangle$, $\phi$: differentiable strictly convex.

- $\phi(\mathrm{x}) = \|\mathrm{x}\|_2^2$: Squared Euclidean distance.
- $\phi(\mathrm{x}) = \mathrm{x}^T \mathrm{Q} \mathrm{x}$: Squared Mahalanobis distance.
- $\phi(\mathrm{x}) = \sum_i x_i \log x_i$: Discrete KL divergence.

S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, Mar. 1951. ↩

S. I. Amari, "$\alpha$-divergence is unique, belonging to both $f$-divergence and Bregman divergence classes,"

*IEEE Transactions on Information Theory*, vol. 55, no. 11, pp. 4925-4931, Nov. 2009. ↩I. Taneja, "Relative information of type s, Csiszár’s f-divergence, and information inequalities,"

*Information Sciences*, vol. 166, no. 1-4, pp. 105–125, Oct. 2004. ↩F. Nielsen and R. Nock, "On the chi square and higher-order chi distances for approximating f-divergences,"

*IEEE Signal Processing Letters*, vol. 21, no. 1, pp. 10–13, Jan. 2014. ↩The total variation distance is usually defined using supremum. Here if we directly plug in $f(u)$, we obtain an integral over the over the absolute differences between the two pdfs. To see why they are equivalent, check the discussion here. ↩