# Mianzhi Wang

Ph.D. in Electrical Engineering

# A Brief List of Statistical Divergences

In many problems, such as classical detection and estimation or Bayesian inference, we need a measure of closeness/similarity between two distributions. For instance, in hypothesis testing, we need to measure the "distance" between the hypothesized distribution and the true sample distribution. For generative models in machine learning, we need to measure the similarity between the distribution of the generated data and that of the true data so that we can optimize the parameters of our generative model. To solve these problems, various statistical divergences have been introduced.

For simplicity, we will avoid the notation of probability spaces and introduce statistical divergences using proper probability density functions (or probability mass functions). A statistical divergence $D(P\|Q)$ maps a pair of probability distributions $P$ and $Q$ to $\mathbb{R}$, and must satisfy the following two properties:

1. $D(P\|Q) \geq 0$;
2. $D(P\|Q) = 0$ if and only if $P$ = $Q$.

Here we do not require $D(\cdot\|\cdot)$ to be symmetric (hence $D(P\|Q) \neq D(Q\|P)$ in general). The two distributions $P$ and $Q$ are very similar if $D(P\|Q)$ is very close to zero.

Note: a statistical divergence is not necessarily a metric, since it need not be symmetric or satisfy the triangle inequality. Conversely, under some conditions it is possible to define a metric over probability distributions and use that metric as a divergence.

In the following, we will give a brief list of commonly used statistical divergences. You can skip the details and jump to the quick reference card at the end of this post.

## KL Divergence

The Kullback–Leibler divergence[1] is probably the most commonly used statistical divergence. It is defined as follows:

$D_{\mathrm{KL}}(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x.$
(1)

It is also known as information gain or relative entropy. The KL divergence has wide applications in machine learning: it appears whenever you maximize a likelihood or posterior function, or perform variational inference.

Example 1. Given two univariate Gaussians $P \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q \sim \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence between $P$ and $Q$ is given by

$D_{\mathrm{KL}}(P\|Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.$

Note that $D_{\mathrm{KL}}(P\|Q) = 0$ when $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$.
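As a quick numerical sanity check, the closed form above can be compared against direct numerical integration of the definition in (1). The sketch below (using NumPy; the helper names are ours) does exactly that:

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL divergence between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

def kl_numerical(mu1, sigma1, mu2, sigma2, n=200001):
    """Trapezoidal approximation of the integral of p(x) log(p(x)/q(x))."""
    x = np.linspace(mu1 - 12 * sigma1, mu1 + 12 * sigma1, n)
    p = np.exp(-(x - mu1)**2 / (2 * sigma1**2)) / (sigma1 * np.sqrt(2 * np.pi))
    q = np.exp(-(x - mu2)**2 / (2 * sigma2**2)) / (sigma2 * np.sqrt(2 * np.pi))
    y = p * np.log(p / q)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

print(kl_gaussian(0.0, 1.0, 1.0, 2.0))   # closed form
print(kl_numerical(0.0, 1.0, 1.0, 2.0))  # numerical integral, should agree closely
```

The integration window of twelve standard deviations around $\mu_1$ is wide enough here that the truncated tails contribute negligibly.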

## Jensen-Shannon Divergence

The Jensen-Shannon divergence can be viewed as a symmetrized version of the KL divergence. It is defined as follows:

$D_{\mathrm{JS}}(P\|Q) = \frac{1}{2}(D_{\mathrm{KL}}(P\|M) + D_{\mathrm{KL}}(Q\|M)),$
(2)

where $M = (P + Q)/2$.
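For discrete distributions, the definition translates directly into code. A minimal sketch (helper names ours):

```python
import numpy as np

def kl_discrete(p, q):
    """Discrete KL divergence, summing only where p > 0 (using 0 log 0 = 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * (kl_discrete(p, m) + kl_discrete(q, m))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.5])
print(js_divergence(p, q))  # equals js_divergence(q, p) by symmetry
```

Note that, unlike the KL divergence, the JS divergence is always finite (it is bounded by $\log 2$ for the natural logarithm), since the mixture $M$ covers the support of both $P$ and $Q$.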

When symmetry is required, the Jensen-Shannon divergence can be used in place of the KL divergence. Interestingly, in the original generative adversarial network (GAN) paper, the Jensen-Shannon divergence appears in the proof of Theorem 1, implying that, under some assumptions, the training process can be viewed as minimizing the Jensen-Shannon divergence between the distribution of the true data and that of the generated data.

## f-Divergence

Now we introduce a family of statistical divergences called f-divergence[2][3]. Given a convex function $f: (0, \infty) \to \mathbb{R}$ satisfying $f(1) = 0$, the f-divergence is defined as

$D_f(P\|Q) = \int f\left(\frac{p(x)}{q(x)}\right) q(x) \mathrm{d}x.$
(3)

The non-negativity can be verified by combining Jensen's inequality with the requirement $f(1) = 0$: since $f$ is convex, $D_f(P\|Q) \geq f\left(\int \frac{p(x)}{q(x)} q(x) \mathrm{d}x\right) = f(1) = 0$. By choosing different $f$, we can obtain different divergences[4]:

• $f(u) = u\log u$: we obtain KL divergence.
• $f(u) = |u - 1|$: we obtain the total variation distance[5].
• $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$: we obtain the squared Hellinger distance: $\frac{1}{2}\int (\sqrt{p(x)} - \sqrt{q(x)})^2 \mathrm{d}x$ (or $1 - \int\sqrt{p(x)q(x)} \mathrm{d}x$).
• $f(u) = (u - 1)^2$: we obtain the Pearson's chi-square divergence: $\int (p(x) - q(x))^2 / q(x) \mathrm{d}x$.
• $f(u) = \frac{1}{2}\left(u\log u - (u + 1)\log\frac{1+u}{2}\right)$: we obtain the Jensen-Shannon divergence as defined in (2). (Without the factor $\frac{1}{2}$, we obtain twice the Jensen-Shannon divergence.)
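A generic discrete f-divergence is one line of code, and the choices above can then be evaluated by swapping in different $f$. A sketch (helper names ours; both distributions assumed strictly positive):

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete f-divergence: sum_i q_i * f(p_i / q_i); assumes p, q > 0."""
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

kl   = f_divergence(p, q, lambda u: u * np.log(u))              # KL divergence
tv   = f_divergence(p, q, lambda u: np.abs(u - 1))              # total variation (integral form)
hel2 = f_divergence(p, q, lambda u: 0.5 * (np.sqrt(u) - 1)**2)  # squared Hellinger
chi2 = f_divergence(p, q, lambda u: (u - 1)**2)                 # Pearson chi-square

print(kl, tv, hel2, chi2)  # all non-negative; each is 0 iff p == q
```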

S. Nowozin et al. from Microsoft Research showed in their f-GAN paper that f-divergences can be used for training generative neural samplers.

For a more technical reading, there is also work that unifies f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC curves, and statistical information.

## Rényi Divergence

Based on Rényi entropy, we can define another family of statistical divergences parameterized by $\alpha$. Given $\alpha \in (0, 1) \cup (1, \infty)$, the Rényi divergence is defined as

$D_{\alpha}(P\|Q) = \frac{1}{\alpha - 1} \log \int p(x)^\alpha q(x)^{1 - \alpha} \mathrm{d}x.$
(4)

Specifically,

• When $\alpha \to 1$, we obtain the KL divergence.
• When $\alpha = 1/2$, we obtain the doubled Bhattacharyya distance: $-2\log\int \sqrt{p(x)q(x)} \mathrm{d}x$. This can also be expressed in terms of the squared Hellinger distance: $-2\log(1 - \mathrm{Hell}^2(P, Q))$.
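Both special cases above are easy to check numerically for discrete distributions. A sketch (helper names ours; both distributions assumed strictly positive):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Discrete Rényi divergence of order alpha (alpha != 1); assumes p, q > 0."""
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

# As alpha -> 1, the Rényi divergence approaches the KL divergence.
kl = float(np.sum(p * np.log(p / q)))
print(renyi_divergence(p, q, 1 + 1e-6), kl)

# At alpha = 1/2 we recover the doubled Bhattacharyya distance.
bhat2 = float(-2 * np.log(np.sum(np.sqrt(p * q))))
print(renyi_divergence(p, q, 0.5), bhat2)
```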

## Bregman Divergence

Bregman divergences define a more general family of divergences. They can be defined between two vectors in the same vector space, and are not restricted to probability distributions. For completeness, and because of their relationship to the KL divergence, we include them here. Let $\phi(\mathbf{x})$ be a differentiable, strictly convex function; the Bregman divergence is then defined as

$D_{\phi}(\mathbf{x} \| \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \mathbf{x} - \mathbf{y}, \nabla\phi(\mathbf{y}) \rangle.$
(5)

Depending on the choice of $\phi$, we obtain different divergences:

• $\phi(\mathbf{x}) = \|\mathbf{x}\|_2^2$: we obtain the squared Euclidean distance.
• $\phi(\mathbf{x}) = \mathbf{x}^T \mathbf{Q} \mathbf{x}$ (with $\mathbf{Q}$ symmetric positive definite): we obtain the squared Mahalanobis distance $(\mathbf{x} - \mathbf{y})^T \mathbf{Q} (\mathbf{x} - \mathbf{y})$.
• $\phi(\mathbf{x}) = \sum_i x_i \log x_i$: we obtain the discrete version of the KL divergence (for probability vectors).
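A minimal sketch (names ours) showing how two of the choices of $\phi$ above recover familiar divergences:

```python
import numpy as np

def bregman(x, y, phi, grad_phi):
    """Bregman divergence D_phi(x || y) = phi(x) - phi(y) - <x - y, grad_phi(y)>."""
    return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.25, 0.25, 0.5])

# phi(x) = ||x||_2^2 yields the squared Euclidean distance.
sq_euc = bregman(x, y, lambda v: np.dot(v, v), lambda v: 2 * v)

# phi(x) = sum_i x_i log x_i yields the discrete KL divergence
# (for probability vectors, the linear terms cancel since sum(x - y) = 0).
kl = bregman(x, y,
             lambda v: float(np.sum(v * np.log(v))),
             lambda v: np.log(v) + 1)

print(sq_euc, kl)
```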

Bregman divergences find wide applications in machine learning. Moreover, in optimization, they are closely related to the mirror descent algorithm.

## Quick Reference Card

KL divergence: $D_{\mathrm{KL}}(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x$.

Jensen-Shannon divergence: $D_{\mathrm{JS}}(P\|Q) = \frac{1}{2}(D_{\mathrm{KL}}(P\|M) + D_{\mathrm{KL}}(Q\|M))$, $M = (P+Q)/2$.

f-divergence: $D_f(P\|Q) = \int f\left(\frac{p(x)}{q(x)}\right) q(x) \mathrm{d}x$, $f: (0, \infty) \to \mathbb{R}$ is convex, $f(1) = 0$.

• $f(u) = u\log u$: KL divergence.
• $f(u) = |u - 1|$: Total variation distance.
• $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$: Squared Hellinger distance.
• $f(u) = (u - 1)^2$: Pearson's chi-square divergence.
• $f(u) = \frac{1}{2}\left(u\log u - (u + 1)\log\frac{1+u}{2}\right)$: Jensen-Shannon divergence.

Rényi divergence: $D_{\alpha}(P\|Q) = \frac{1}{\alpha - 1} \log \int p(x)^\alpha q(x)^{1 - \alpha} \mathrm{d}x$, $\alpha \in (0, 1) \cup (1, \infty)$.

• $\alpha \to 1$: KL divergence.
• $\alpha = 1/2$: Doubled Bhattacharyya distance.

Bregman divergence: $D_{\phi}(\mathbf{x} \| \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \mathbf{x} - \mathbf{y}, \nabla\phi(\mathbf{y}) \rangle$, $\phi$: differentiable strictly convex.

• $\phi(\mathbf{x}) = \|\mathbf{x}\|_2^2$: Squared Euclidean distance.
• $\phi(\mathbf{x}) = \mathbf{x}^T \mathbf{Q} \mathbf{x}$: Squared Mahalanobis distance.
• $\phi(\mathbf{x}) = \sum_i x_i \log x_i$: Discrete KL divergence.

1. S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, Mar. 1951.

2. S. I. Amari, "$\alpha$-divergence is unique, belonging to both $f$-divergence and Bregman divergence classes," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4925-4931, Nov. 2009.

3. I. Taneja, "Relative information of type s, Csiszár’s f-divergence, and information inequalities," Information Sciences, vol. 166, no. 1-4, pp. 105–125, Oct. 2004.

4. F. Nielsen and R. Nock, "On the chi square and higher-order chi distances for approximating f-divergences," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 10–13, Jan. 2014.

5. The total variation distance is usually defined using a supremum over events, $\sup_A |P(A) - Q(A)|$. If we directly plug $f(u) = |u - 1|$ into (3), we instead obtain the integral of the absolute difference between the two pdfs, $\int |p(x) - q(x)| \mathrm{d}x$, which equals exactly twice the supremum-based definition (take $A = \{x : p(x) > q(x)\}$).