Mianzhi Wang

Ph.D. in Electrical Engineering

The Counterintuitive Behavior of High-Dimensional Gaussian Distributions

We all love Gaussian distributions. In statistical modeling, we love addition Gaussian noises. In Bayesian inference, we have Gaussian priors. In deep learning, we initialize weights using Gaussian distributions. In theoretical derivations, we make Gaussian assumptions. We are familiar with what 2D and 3D Gaussian distributions look like:

Gaussian 2d Gaussian 3d

Because we cannot easily visualize higher dimensional spaces, we have the impression that Gaussian samples should concentrate around the mean, and that the overall shape formed by these samples is a (hyper)ellipsoid determined by the covariance matrix.

In practice, we usually use Gaussian distributions with much higher dimensions. Does our intuition really hold for high-dimensional Gaussian distributions? Well, not really. Check the short write up by Ferenc Huszár first:

Counterintuitively, Gaussian distributions look like soap bubbles in high-dimensional spaces. Too see this, let us generate some normal samples (XN(0,I)\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})) and plot the distributions of their norms (X2\|\mathbf{X}\|_2):

Gaussian distance to origin

We can observe that for higher dimensional Gaussian samples, the mode of the distribution of their norms shifts far away from the origin. This implies that most of these samples are concentrated around a sphere, forming a "soap bubble". In fact, for this simple case, we mathematically derive the distribution of the norms. Given X(0D,ID)\mathbf{X} \sim \mathcal(\mathbf{0}_D, \mathbf{I}_D), we have

X2=i=1DXi2χD\|\mathbf{X}\|_2 = \sqrt{\sum_{i=1}^D X_i^2} \sim \chi_D

Here χD\chi_D denotes the chi-distribution with DD degrees of freedom. The mode of χD\chi_D is given by D1\sqrt{D - 1} for D>1D > 1, which explains why the the samples are concentrated around a sphere in higher dimensions.

This counterintuitive behavior has another implication: using Euclidean space in high-dimensional spaces for clustering may not be a good idea. Let us consider a simple example, where we have three clusters of Gaussian samples generated from the following three distributions:

X1(μ1,I), μ1=[0,0,0,]T,X2(μ2,I), μ2=[5,0,0,]T,X4(μ3,I), μ3=[10,0,0,]T.\begin{aligned} \mathbf{X}_1 \sim \mathcal(\mathbf{\mu}_1, \mathbf{I})&,\ \mu_1 = [0, 0, 0, \cdots]^T,\\ \mathbf{X}_2 \sim \mathcal(\mathbf{\mu}_2, \mathbf{I})&,\ \mu_2 = [5, 0, 0, \cdots]^T,\\ \mathbf{X}_4 \sim \mathcal(\mathbf{\mu}_3, \mathbf{I})&,\ \mu_3 = [10, 0, 0, \cdots]^T. \end{aligned}

If we project the samples along the first two dimensions, we obtain the following:

3 Gaussians Scatter

Clearly, these three 2D clusters can be easily identified by common clustering algorithms using the Euclidean distance. If we follow our original impression of Gaussian distributions, we may have the following thoughts: in higher dimensional spaces, these three clusters are three concentrated balls just like the 2D case, and we should be able to distinguish them using the Euclidean space without any problem. However, due to the counterintuitive behavior of high-dimensional Gaussian distributions, these thoughts are no longer valid.

Let us first plot the distributions of Euclidean distances from these Gaussian samples to the origin:

Distance to origin for 3 Gaussians

When the number of dimensions is three, we can clearly observe three different distributions. However, as the number of dimensions increase, these three distributions begin to shift away from the origin and merge into each other. When the number of dimensions is 500, the three distributions are not easily distinguishable.

We can further plot the distance matrix of these Gaussian samples as follows:

Distance matrix of 3 Gaussians

We can observe that in lower dimensional spaces, the three clusters are clearly distinguishable. However, as the number of dimensions increase, the distance matrix gradually loses its structure. When the number of dimensions is set to 500, the pair-wise distances are almost the same and the three clusters are almost indistinguishable. If we run clustering algorithms using the Euclidean distance in these cases, we will clearly get unsatisfying results.

The above observations are also counterintuitive. We can clearly identify the three clusters when projecting the samples along first two dimensions! Well, you now learned the importance of dimensionality reduction. In fact, for our contrived example, useful information is concentrated along the first dimension, while the remaining dimensions can be regarded as "noise dimensions". Instead of blindly applying clustering algorithms to the original data, we should apply PCA first (Here we have Gaussian samples so PCA naturally fits. When working with real-world samples, you may need to use non-linear dimensionality reduction methods, which leads to another interesting topic: manifold learning.)

For more some further readings, I recommend checking the following:

The Jupyter notebook generating all the figures in this article can be download here.