Poisson approximations to the negative binomial distribution

This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating the negative binomial distribution.

The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters \(p\) and \(r\). The parameter \(p\) is the probability that the coin lands on heads and \(r\) is the number of heads. If \(X\) has the negative binomial distribution, then the event \(X = x\) means that the first \(x+r-1\) tosses of the coin produced \(r-1\) heads and \(x\) tails, and that toss number \(x+r\) was a head. This means that the probability that \(X=x\) is given by

\(\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}\)

Here is a plot of the function \(f(x)\) for different values of \(r\) and \(p\).
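
As a rough sketch of how such a plot can be produced, scipy's nbinom.pmf(x, r, p) computes exactly the \(f(x)\) above (the parameter values below are illustrative choices, not the ones used for the figure):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative parameter choices only.
x = np.arange(0, 40)
for r, p in [(5, 0.5), (10, 0.7), (20, 0.8)]:
    # scipy's nbinom.pmf(x, r, p) equals f(x) = C(x+r-1, r-1) p^r (1-p)^x.
    plt.plot(x, stats.nbinom.pmf(x, r, p), marker="o", label=f"r = {r}, p = {p}")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
```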

Poisson approximations

When the parameter \(r\) is large and \(p\) is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that \(r(1-p)=\lambda\) for some positive real number \(\lambda\). As \(r\) becomes large, the negative binomial random variable with parameters \(p\) and \(r\) converges in distribution to a Poisson random variable with parameter \(\lambda\). This is illustrated in the picture below where three negative binomial distributions with \(r(1-p)=5\) approach the Poisson distribution with \(\lambda =5\).

Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of \(1/r\) and that the error is bigger if the limiting value of \(r(1-p)\) is larger.

It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \(\lambda = r(1-p)\). However, the mean of the negative binomial distribution is \(r(1-p)/p\). This suggests that we can get a better approximation by setting \(\lambda = r(1-p)/p\).

The change from \(\lambda = r(1-p)\) to \(\lambda = r(1-p)/p\) is small because \(p \approx 1\). However, this small change gives a much better approximation, especially for larger values of \(r(1-p)\). The below plot shows that both approximations have errors on the order of \(1/r\), but the constant for the second approximation is much better.
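
To make the comparison concrete, here is a small sketch of how the total variation distances can be computed with scipy (the values of \(r\), the limit \(r(1-p)=5\), and the truncation point are all illustrative assumptions):

```python
import numpy as np
from scipy import stats

def tv_distance(r, p, lam, x_max=5000):
    """Total variation distance between NB(r, p) and Poisson(lam),
    truncated at x_max (the neglected tail mass is negligible here)."""
    x = np.arange(x_max + 1)
    return 0.5 * np.abs(stats.nbinom.pmf(x, r, p) - stats.poisson.pmf(x, lam)).sum()

for r in (10, 100, 1000):
    p = 1 - 5 / r                               # so that r * (1 - p) = 5
    print(r,
          tv_distance(r, p, r * (1 - p)),       # first Poisson approximation
          tv_distance(r, p, r * (1 - p) / p))   # improved approximation
```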

Second order accurate approximation

It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation for the Poisson distribution is given in this paper.1 The approximation is

\(\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}q\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},\)

where \(\lambda = \frac{r(1-p)}{p}\), \(q=1-p\), and \(P_\lambda(x)\) is the Poisson pmf evaluated at \(x\).
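
Here is a minimal sketch of this approximation in Python (scipy supplies the Poisson and negative binomial pmfs; the parameter values are illustrative):

```python
import numpy as np
from scipy import stats

def gram_charlier_pmf(x, r, p):
    """Gram-Charlier approximation to the NB(r, p) pmf, as given above."""
    lam = r * (1 - p) / p
    q = 1 - p
    P = stats.poisson.pmf(x, lam)
    P_prev = stats.poisson.pmf(x - 1, lam)   # equals 0 when x - 1 < 0
    return P - 0.5 * q * ((x - lam) * P - (x - 1 - lam) * P_prev)

x = np.arange(0, 40)
r = 100
p = 1 - 5 / r
print(np.abs(gram_charlier_pmf(x, r, p) - stats.nbinom.pmf(x, r, p)).max())
```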

The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of \(1/r^2\). This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.

  1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo, it should say \(\frac{1}{2}q\) instead of \(\frac{1}{2}p\). ↩︎

Monte Carlo integration in high dimensions

Monte Carlo integration

John Cook recently wrote a cautionary blog post about using Monte Carlo to estimate the volume of a high-dimensional ball. He points out that if \(\mathbf{X}=(X_1,\ldots,X_n)\) are independent and uniformly distributed on the interval \([-1,1]\) then

\(\displaystyle{\mathbb{P}(X_1^2 + X_2^2 + \cdots + X_n^2 \le 1) = \frac{v_n}{2^n}},\)

where \(v_n\) is the volume of an \(n\)-dimensional ball with radius one. This observation means that we can use Monte Carlo to estimate \(v_n\).

To do this we repeatedly sample vectors \(\mathbf{X}_m = (X_{m,1},\ldots,X_{m,n})\) with \(X_{m,i}\) uniform on \([-1,1]\) and \(m\) ranging from \(1\) to \(M\). Next, we count the proportion of vectors \(\mathbf{X}_m\) with \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). Specifically, if \(Z_m\) is equal to one when \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) and equal to zero otherwise, then by the law of large numbers

\(\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \approx \frac{v_n}{2^n}},\)

which implies

\(\displaystyle{v_n \approx 2^n \frac{1}{M}\sum_{m=1}^M Z_m}.\)

This method of approximating a volume or integral by sampling and counting is called Monte Carlo integration and is a powerful general tool in scientific computing.
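
Here is a minimal sketch of the estimator in Python/NumPy (the sample size and the random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

def mc_ball_volume(n, M=1_000):
    """Plain Monte Carlo estimate of v_n from M uniform samples on [-1, 1]^n."""
    X = rng.uniform(-1.0, 1.0, size=(M, n))
    Z = np.sum(X**2, axis=1) <= 1.0   # Z_m = 1 when the m-th point lands in the unit ball
    return 2**n * Z.mean()

for n in (2, 5, 10, 20):
    print(n, mc_ball_volume(n))
```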

The problem with Monte Carlo integration

As pointed out by John, Monte Carlo integration does not work very well in this example. The plot below shows a large difference between the true value of \(v_n\) with \(n\) ranging from \(1\) to \(20\) and the Monte Carlo approximation with \(M = 1,000\).

The problem is that \(v_n\) is very small and the probability \(v_n/2^n\) is even smaller! For example when \(n = 10\), \(v_n/2^n \approx 0.0025\). This means that in our one thousand samples we only expect two or three occurrences of the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). As a result our estimate has a high variance.

The results get even worse as \(n\) increases. The probability \(v_n/2^n\) goes to zero faster than exponentially. Even with a large value of \(M\), our estimate will almost certainly be zero. Since \(v_n \neq 0\), the relative error in the approximation is 100%.

Importance sampling

Monte Carlo can still be used to approximate \(v_n\). Instead of using plain Monte Carlo, we can use a variance reduction technique called importance sampling (IS). Instead of sampling the points \(X_{m,1},\ldots, X_{m,n}\) from the uniform distribution on \([-1,1]\), we sample them from some other distribution called a proposal distribution. The proposal distribution should be chosen so that the event \(X_{m,1}^2 +\cdots +X_{m,n}^2 \le 1\) becomes more likely.

In importance sampling, we need to correct for the fact that we are using a new distribution instead of the uniform distribution. Instead of counting the number of times \(X_{m,1}^2+\cdots+X_{m,n}^2 \le 1\), we give a weight to each sample and then add up the weights.

If \(p\) is the density of the uniform distribution on \([-1,1]\) (the target distribution) and \(q\) is the density of the proposal distribution, then the IS Monte Carlo estimate of \(v_n\) is

\(\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \prod_{i=1}^n \frac{p(X_{m,i})}{q(X_{m,i})}},\)

where as before \(Z_m\) is one if \(X_{m,1}^2+\cdots +X_{m,n}^2 \le 1\) and \(Z_m\) is zero otherwise. As long as \(q(x)=0\) implies \(p(x)=0\), the IS Monte Carlo estimate will be an unbiased estimate of \(v_n\). More importantly, a good choice of the proposal distribution \(q\) can drastically reduce the variance of the IS estimate compared to the plain Monte Carlo estimate.

In this example, a good choice of proposal distribution is the normal distribution with mean \(0\) and variance \(1/n\). Under this proposal distribution, the expected value of \(X_{m,1}^2 +\cdots+ X_{m,n}^2\) is one and so the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) is much more likely.
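
Here is a sketch of the importance sampling estimator with this proposal (again with an arbitrary sample size and seed; scipy is only used for the normal log-density):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed

def is_ball_volume(n, M=1_000):
    """Importance sampling estimate of v_n with a N(0, 1/n) proposal per coordinate."""
    sd = 1.0 / np.sqrt(n)
    X = rng.normal(0.0, sd, size=(M, n))
    Z = np.sum(X**2, axis=1) <= 1.0          # indicator that the point lands in the ball
    # Weights p(X_m)/q(X_m) for the samples with Z_m = 1: the uniform density on
    # [-1, 1]^n is (1/2)^n, and such samples automatically lie inside the cube.
    w = np.zeros(M)
    w[Z] = np.exp(n * np.log(0.5) - stats.norm.logpdf(X[Z], scale=sd).sum(axis=1))
    return 2**n * w.mean()

for n in (2, 5, 10, 20, 50):
    print(n, is_ball_volume(n))
```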

Here are the IS Monte Carlo estimates with again \(M = 1,000\) and \(n\) ranging from \(1\) to \(20\). The results speak for themselves.

The relative error is typically less than 10%, a big improvement over the 100% relative error of plain Monte Carlo.

The next plot shows a close agreement between \(v_n\) and the IS Monte Carlo approximation on the log scale with \(n\) going all the way up to \(100\).

Notes

  1. There are exact formulas for \(v_n\) (available on Wikipedia). I used these to compare the approximations and compute relative errors. There are related problems where no formulas exist and Monte Carlo integration is one of the only ways to get an approximate answer.
  2. The post by John Cook also talks about why the central limit theorem can’t be used to approximate \(v_n\). I initially thought a technique called large deviations could be used to approximate \(v_n\) but again this does not directly apply. I was happy to discover that importance sampling worked so well!


The fast Fourier transform as a matrix factorization

The discrete Fourier transform (DFT) is a mapping from \(\mathbb{C}^n\) to \(\mathbb{C}^n\). Given a vector \(u \in \mathbb{C}^n\), the discrete Fourier transform of \(u\) is another vector \(U \in \mathbb{C}^n\) given by,

\(\displaystyle{U_k = \sum_{j=0}^{n-1} \exp\left(-2 \pi i kj/n\right)u_j}.\)

If we let \(\omega_n = \exp\left(-2 \pi i /n\right)\), then \(U\) can be written more compactly as

\(\displaystyle{U_k = \sum_{j=0}^{n-1} \omega^{kj}_n u_j }.\)

The map from \(u\) to \(U\) is linear and can therefore be represented as multiplication by an \(n \times n\) matrix \(D\). Looking at the equation above, we can see that if we set \(D_{kj} = \omega^{kj}_n\) and use \(0\)-indexing, then,

\(U = Du.\)

This suggests that calculating the vector \(U = Du\) would require roughly \(n^2\) operations. However, the famous fast Fourier transform shows that \(U\) can be computed in roughly \(n\log_2(n)\) operations. There are many derivations of the fast Fourier transform but I wanted to focus on one which is based on factorizing the matrix \(D\) into sparse matrices.
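
As a quick sanity check of the \(n^2\) approach, here is a minimal NumPy sketch that builds \(D\) explicitly and multiplies (numpy's FFT uses the same sign convention, so it can serve as a reference):

```python
import numpy as np

n = 8
k = np.arange(n)
omega_n = np.exp(-2j * np.pi / n)
D = omega_n ** np.outer(k, k)            # D_{kj} = omega_n^{kj}, with 0-indexing

u = np.random.randn(n)
U = D @ u                                # the direct transform: roughly n^2 operations
assert np.allclose(U, np.fft.fft(u))     # matches numpy's FFT convention
```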

Matrix factorizations

A factorization of a matrix \(A\) is a way of writing \(A\) as a product of matrices \(A = A_L\cdots A_2A_1\) where typically \(L\) is \(2\) or \(3\). Matrix factorizations can tell us information about the matrix \(A\). For example, the singular value decomposition tells us the spectral norm and singular vectors of \(A\) (and lots more). Other matrix factorizations can help us compute things related to the matrix \(A\). The QR decomposition helps us solve least-squares problems and the LU factorization lets us invert \(A\) in (almost) the same amount of time it takes to solve the equation \(Ax = b\).

Each of the factorizations above applies to arbitrary matrices. Every matrix has a singular value decomposition and a QR factorization. Likewise, all square matrices have an LU factorization (possibly after permuting rows). The DFT matrix \(D\) has a factorization which most other matrices do not. This factorization lets us write \(D = D_L\cdots D_2D_1\) where each matrix \(D_l\) is sparse.

Sparse matrices

A matrix \(A\) is sparse if many of the entries of \(A\) are zero. There isn’t a precise definition of “many”. A good heuristic is that if \(A\) has size \(m\) by \(n\), then \(A\) is sparse if the number of non-zero entries is of the order of \(m\) or \(n\). In contrast, a dense matrix is one where the number of non-zero entries is of the same order as \(mn\).

The structure of sparse matrices makes them especially nice to work with. They take up a lot less memory and they are quicker to compute with. If \(A \in \mathbb{C}^{m \times n}\) has only \(k\) non-zero elements, then for any vector \(x \in \mathbb{C}^n\), the matrix-vector product \(Ax\) can be computed in order \(k\) operations. If \(k\) is much less than \(mn\), then this can be a huge speed up.

This shows why a sparse factorization of the DFT matrix \(D\) will give a fast way of calculating \(Du\). Specifically, suppose that we have \(D = D_L \cdots D_2D_1\) and each matrix \(D_l\) has just \(k_l\) non-zero entries. Then, for any vector \(u \in \mathbb{C}^n\), we can compute \(U= Du=D_L \cdots D_2 D_1 u\) in stages. First we multiply \(u\) by \(D_1\) to produce \(u^{(1)}\). The output \(u^{(1)}\) is then multiplied by \(D_2\) to produce \(u^{(2)}\). We continue in this way until we finally multiply \(u^{(L-1)}\) by \(D_L\) to produce \(U = D_L u^{(L-1)}\). The total computation cost is of the order \(k_1 + k_2 + \cdots + k_L\) which can be drastically less than \(n^2\).

The DFT factorization

We’ll now show that \(D\) does indeed have a sparse matrix factorization. To avoid some of the technicalities, we’ll assume that \(n\) is a power of \(2\). That is, \(n = 2^m\) for some \(m\). Remember that the discrete Fourier transform of \(u\) is the vector \(U\) given by

\(\displaystyle{U_k = \sum_{j=0}^{n-1} \omega^{jk}_n u_j},\)

where \(\omega_n = \exp(-2\pi i/n)\). The idea behind the fast Fourier transform is to split the above sum into two terms. One where we add up the even values of \(j\) and one where we add up the odd values of \(j\). Specifically, \(U_k\) is equal to

\(\displaystyle{\sum_{j=0,\, j \text{ even}}^{n-1} \omega_n^{jk} u_j + \sum_{j = 0,\, j \text{ odd}}^{n-1} \omega_n^{jk}u_j = \sum_{j=0}^{n/2 - 1} \omega^{2jk}_n u_{2j} + \sum_{j=0}^{n/2-1} \omega^{(2j+1)k}_n u_{2j+1}}\)

These two sums can now be treated separately. In the first sum, the terms are of the form \(\omega^{2 jk}_n\). Each of these can be rewritten as follows,

\(\omega^{2 jk}_n = \exp(-2 \pi i /n)^{2 jk} = \exp(-2 \pi i /(n/2))^{jk} = \omega_{n/2}^{jk}\).

This can also be used to re-write \(\omega^{(2j+1)k}_n\),

\(\omega^{(2j+1)k}_n = \omega_n^k\omega^{2jk}_n = \omega_n^k \omega^{jk}_{n/2}.\)

By plugging both of these back into the expression for \(U_k\), we get

\(\displaystyle{U_k = \sum_{j=0}^{n/2-1} \omega_{n/2}^{jk} u_{2j} + \omega_n^k \sum_{j=0}^{n/2-1}\omega_{n/2}^{jk} u_{2j+1}. }\)

This shows that the Fourier transform of the \(n\)-dimensional vector \(u\) can be computed by combining the Fourier transforms of two \(n/2\)-dimensional vectors. One of these vectors contains the even-indexed elements of \(u\) and the other contains the odd-indexed elements of \(u\).

There is one more observation that we need. If \(k\) is between \(n/2\) and \(n-1\), then we can simplify the above expression for \(U_k\). Specifically, if we set \(k’=k-n/2\), then

\(\omega_{n/2}^{jk} = \omega_{n/2}^{j(n/2+k’)} = \omega_{n/2}^{jn/2}\omega_{n/2}^{jk’} = \omega_{n/2}^{jk’},\)

since \(\omega_{n/2}^{n/2}=1\). By plugging this into the above formula for \(U_k\), we get that if \(k\) is one of \(n/2, n/2+1,\ldots, n-1\), then

\(\displaystyle{U_k = \sum_{j=0}^{n/2-1} \omega_{n/2}^{j(k-n/2)} u_{2j} + \omega_n^k \sum_{j=0}^{n/2-1}\omega_{n/2}^{j(k-n/2)} u_{2j+1}}.\)

This reduces the computation since there are some terms we only have to compute once but that are used twice. The algebra we did above shows that we can compute the discrete Fourier transform of \(u \in \mathbb{C}^n\) in the following three steps.

  1. Form the vectors \(u^{\text{even}} = (u_0,u_2,\ldots,u_{n-2}) \in \mathbb{C}^{n/2}\) and \(u^{\text{odd}} = (u_1,u_3,\ldots,u_{n-1}) \in \mathbb{C}^{n/2}\).
  2. Calculate the discrete Fourier transforms of \(u^{\text{even}}\) and \(u^{\text{odd}}\) to give two vectors \(U^{\text{even}}, U^{\text{odd}} \in \mathbb{C}^{n/2}\).
  3. Compute \(U \in \mathbb{C}^n\) by,

\(\displaystyle{U_k = \begin{cases} U^{\text{even}}_k + \omega_n^k U^{\text{odd}}_k & \text{for } k = 0,1,\ldots, n/2-1,\\ U^{\text{even}}_{k-n/2}+\omega_n^k U^{\text{odd}}_{k-n/2} &\text{for } k=n/2,n/2+1,\ldots,n-1 \end{cases}}\)
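
Before turning these steps into matrices, here is a minimal sketch of the same recursion in code (Python/NumPy, assuming the length of \(u\) is a power of two); it is the radix-2 recursion given by the three steps above:

```python
import numpy as np

def fft_recursive(u):
    """Radix-2 FFT of a vector whose length is a power of two."""
    n = len(u)
    if n == 1:
        return u.astype(complex)
    # Steps 1 and 2: Fourier transforms of the even- and odd-indexed parts.
    U_even = fft_recursive(u[0::2])
    U_odd = fft_recursive(u[1::2])
    # Step 3: combine using the factors omega_n^k.
    omega = np.exp(-2j * np.pi * np.arange(n) / n)
    U = np.empty(n, dtype=complex)
    U[: n // 2] = U_even + omega[: n // 2] * U_odd
    U[n // 2 :] = U_even + omega[n // 2 :] * U_odd
    return U

u = np.random.randn(16)
assert np.allclose(fft_recursive(u), np.fft.fft(u))
```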

The steps above break down computing \(U=Du\) into three simpler steps. Each of these simpler steps is also a linear map and can therefore be written as multiplication by a matrix. If we call these matrices \(D_1,D_2\) and \(D_3\), then

  • The matrix \(D_1\) maps \(u\) to \(\begin{bmatrix} u^{\text{even}} \\ u^{\text{odd}}\end{bmatrix}\). That is

\(D_1 = \begin{bmatrix} 1&0&0&0&\cdots&0&0\\0&0&1&0&\cdots &0&0 \\\vdots& \vdots &\vdots & \vdots & \ddots & \vdots & \vdots \\0&0&0&0&\cdots&1&0 \\ 0&1&0&0&\cdots&0&0 \\ 0&0&0&1&\cdots&0&0 \\ \vdots&\vdots &\vdots & \vdots & \ddots & \vdots & \vdots \\ 0&0&0&0&\cdots&0&1\end{bmatrix} \in \mathbb{C}^{n \times n}.\)

  • The matrix \(D_2\) is block-diagonal and each block is the \(n/2\)-dimensional discrete Fourier transformation matrix. That is

\(D_2 = \begin{bmatrix}D^{(n/2)} & 0\\ 0 & D^{(n/2)}\end{bmatrix}\in \mathbb{C}^{n \times n},\)

where \(D^{(n/2)}\) is given by \(D^{(n/2)}_{jk} = \omega_{n/2}^{jk}\).

  • The matrix \(D_3\) combines two \(n/2\)-vectors according to the formula in step 3 above,

\(D_3 = \begin{bmatrix} 1&0&\cdots & 0 & \omega_n^0 & 0 & \cdots & 0\\ 0&1&\cdots&0 &0& \omega_n^1&\cdots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots& \vdots &\ddots&\vdots\\ 0&0&\cdots & 1 & 0 & 0 & \cdots & \omega_n^{n/2-1}\\1&0&\cdots & 0 & \omega_n^{n/2} & 0 & \cdots & 0\\ 0&1&\cdots&0 &0& \omega_n^{n/2+1}&\cdots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots& \vdots &\ddots&\vdots\\ 0&0&\cdots & 1 & 0 & 0 & \cdots & \omega_n^{n-1}\end{bmatrix}\in \mathbb{C}^{n \times n} \)

The three-step procedure above for calculating \(U=Du\) is equivalent to iteratively multiplying by \(D_1\), then by \(D_2\), and finally by \(D_3\). Another way of saying this is that \(D=D_3D_2D_1\) is a matrix factorization!

From the definitions, we can see that each of the matrices \(D_1,D_2\) and \(D_3\) have a lot of zeros. Specifically, \(D_1\) has just \(n\) non-zero entries (one in each row), \(D_2\) has \(2(n/2)^2\) non-zero entries (two blocks of \(n/2 \times n/2\) non-zeros) and \(D_3\) has \(2n\) non-zero entries (two in each row). This means that the product \(U=D_3D_2D_1u\) can be computed in roughly

\(n+2(n/2)^2 + 2n = 3n+2(n/2)^2,\)

operations. This is far away from the \(n\log_2(n)\) operations promised at the start of this blog post. However, the matrix \(D_2\) contains two copies of the \(n/2 \times n/2\) discrete Fourier transform matrix. This means we could apply the exact same argument to the \(n/2 \times n/2\) matrix \(D^{(n/2)}\). This lets us replace \((n/2)^2\) with \(\frac{3}{2}n + 2(n/4)^2\). This brings the number of operations down to

\(3n+2\left(\frac{3}{2}n + 2(n/4)^2\right) = 6n + 4(n/4)^2.\)

We can again keep going! If we recall that \(n=2^m\), then we get to repeat the process \(m\) times in total giving approximately,

\(3mn + 2^m = 3n\log_2(n)+n\) operations, since \(2^m = n\) and \(m = \log_2(n)\).

So by using lots of sparse matrices, we can compute \(U=Du\) in \(\approx n\log_2(n)\) operations instead of \(\approx n^2\) operations.

Concavity of the squared sum of square roots

The \(p\)-norm of a vector \(x\in \mathbb{R}^n\) is defined to be:

\(\displaystyle{\Vert x \Vert_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}}.\)

If \(p \ge 1\), then the \(p\)-norm is convex. When \(0<p<1\), this function is not convex and is not actually a norm. In fact, the function \(x \mapsto \Vert x \Vert_p\) is actually concave when \(0 < p < 1\) and all the entries of \(x\) are non-negative. This can be seen in the feature image for this post, which shows that the contours of this function are concave for \(n = 2\).

On a recent exam for the course Convex Optimization, we were asked to prove this when \(p = 1/2\). In this special case, the mathematics simplifies nicely.

When \(p = 1/2\), the \(p\)-norm can be described as the squared sum of square roots. Specifically,

\(\displaystyle{\Vert x \Vert_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i|} \right)^2}.\)

Note that we can expand the square and rewrite the function as follows

\(\displaystyle{\Vert x \Vert_{1/2} = \sum_{i=1}^n\sum_{k=1}^n\sqrt{\left|x_i\right|} \sqrt{|x_k|} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{|x_ix_k|}}.\)

If we restrict to \(x \in \mathbb{R}^n\) with \(x_i \ge 0\) for all \(i\), then this function simplifies to

\(\displaystyle{\Vert x \Vert_{1/2} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{x_ix_k}},\)

which is a sum of geometric means. The geometric mean function \((u,v) \mapsto \sqrt{uv}\) is concave when \(u,v \ge 0\). This can be proved by calculating the Hessian of this function and verifying that it is negative semi-definite.
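
Concretely, for \(u,v > 0\) a short calculation gives

\(\displaystyle{\nabla^2 \sqrt{uv} = \frac{1}{4\sqrt{uv}}\begin{bmatrix} -v/u & 1 \\ 1 & -u/v\end{bmatrix}},\)

which has negative trace and zero determinant, so both eigenvalues are non-positive and the Hessian is indeed negative semi-definite.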

From this, we can conclude that each function \(x \mapsto \sqrt{x_ix_k}\) is also concave. This is because \(x \mapsto \sqrt{x_ix_k}\) is a linear function followed by a concave function. Finally, any sum of concave functions is also concave and thus \(\Vert x \Vert_{1/2}\) is concave.

Hellinger distance

A similar argument can be used to show that Hellinger distance is a convex function. Hellinger distance, \(d_H(\cdot,\cdot)\) is defined on pairs of probability distributions \(p\) and \(q\) on a common set \(\{1,\ldots,n\}\). For such a pair,

\(\displaystyle{d_H(p,q) = \sum_{i=1}^n \left(\sqrt{p_i}-\sqrt{q_i}\right)^2},\)

which certainly doesn’t look convex. However, we can expand the square and use the fact that \(p\) and \(q\) are probability distributions. This shows us that Hellinger distance can be written as

\(\displaystyle{d_H(p,q) = \sum_{i=1}^n p_i – 2\sum_{i=1}^n \sqrt{p_iq_i}+\sum_{i=1}^n q_i = 2 – 2\sum_{i=1}^n\sqrt{p_iq_i}}.\)

Again, each function \((p,q) \mapsto \sqrt{p_iq_i}\) is concave and so the negative sum of such functions is convex. Thus, \(d_H(p,q)\) is convex.

The course

As a final comment, I’d just like to say how much I am enjoying the class. Prof. Stephen Boyd is a great lecturer and we’ve seen a wide variety of applications in the class. I’ve recently been reading a bit of John D Cook’s blog and agree with all he says about the course here.

The Singular Value Decomposition (SVD)

The singular value decomposition (SVD) is a powerful matrix decomposition. It is used all the time in statistics and numerical linear algebra. The SVD is at the heart of principal component analysis, demonstrates what’s going on in ridge regression, and gives one way to construct the Moore-Penrose inverse of a matrix. For more SVD love, see the tweets below.

A tweet by Women in Statistics and Data Science about the SVD.
The full thread is here https://twitter.com/WomenInStat/status/1285610321747611653?s=20&t=Elj62mGSc9gINHvbwt82Zw

In this post I’ll define the SVD and prove that it always exists. At the end we’ll look at some pictures to better understand what’s going on.

Definition

Let \(X\) be an \(n \times p\) matrix. We will define the singular value decomposition first in the case \(n \ge p\). The SVD consists of three matrices \(U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p}\) and \(V \in \mathbb{R}^{p \times p}\) such that \(X = U\Sigma V^T\). The matrix \(\Sigma\) is required to be diagonal with non-negative diagonal entries \(\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_p \ge 0\). These numbers are called the singular values of \(X\). The matrices \(U\) and \(V\) are required to have orthonormal columns so that \(U^TU=V^TV = I_p\), the \(p \times p\) identity matrix. Note that since \(V\) is square we also have \(VV^T=I_p\); however, we won’t have \(UU^T = I_n\) unless \(n = p\).

In the case when \(n \le p\), we can define the SVD of \(X\) in terms of the SVD of \(X^T\). Let \(\widetilde{U} \in \mathbb{R}^{p \times n}, \widetilde{\Sigma} \in \mathbb{R}^{n \times n}\) and \(\widetilde{V} \in \mathbb{R}^{n \times n}\) be the SVD of \(X^T\) so that \(X^T=\widetilde{U}\widetilde{\Sigma}\widetilde{V}^T\). The SVD of \(X\) is then given by transposing both sides of this equation giving \(U = \widetilde{V}, \Sigma = \widetilde{\Sigma}^T=\widetilde{\Sigma}\) and \(V = \widetilde{U}\).
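
In NumPy, this version of the SVD (for \(n \ge p\)) corresponds to numpy.linalg.svd with full_matrices=False; here is a minimal sketch with an arbitrary random matrix:

```python
import numpy as np

X = np.random.randn(7, 4)          # an arbitrary example with n = 7 >= p = 4

# full_matrices=False gives the "thin" SVD used in this post: U is n x p,
# the singular values come back as a vector, and V^T is p x p.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)

assert np.allclose(X, U @ Sigma @ Vt)
assert np.allclose(U.T @ U, np.eye(4))      # orthonormal columns of U
assert np.allclose(Vt @ Vt.T, np.eye(4))    # V is orthogonal
assert np.all(s[:-1] >= s[1:])              # singular values are sorted
```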

Construction

The SVD of a matrix can be found by iteratively solving an optimisation problem. We will first describe an iterative procedure that produces matrices \(U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p}\) and \(V \in \mathbb{R}^{p \times p}\). We will then verify that \(U,\Sigma \) and \(V\) satisfy the defining properties of the SVD.

We will construct the matrices \(U\) and \(V\) one column at a time and we will construct the diagonal matrix \(\Sigma\) one entry at a time. To construct the first columns and entries, recall that the matrix \(X\) is really a linear function from \(\mathbb{R}^p\) to \(\mathbb{R}^n\) given by \(v \mapsto Xv\). We can thus define the operator norm of \(X\) via

\(\Vert X \Vert = \sup\left\{ \|Xv\|_2 : \|v\|_2 =1\right\},\)

where \(\|v\|_2\) represents the Euclidean norm of \(v \in \mathbb{R}^p\) and \(\|Xv\|_2\) is the Euclidean norm of \(Xv \in \mathbb{R}^n\). The set of vectors \(\{v \in \mathbb{R}^p : \|v\|_2 = 1 \}\) is a compact set and the function \(v \mapsto \|Xv\|_2\) is continuous. Thus, the supremum used to define \(\Vert X \Vert\) is achieved at some vector \(v_1 \in \mathbb{R}^p\). Define \(\sigma_1 = \|X v_1\|_2\). If \(\sigma_1 \neq 0\), then define \(u_1 = Xv_1/\sigma_1 \in \mathbb{R}^n\). If \(\sigma_1 = 0\), then define \(u_1\) to be an arbitrary vector in \(\mathbb{R}^n\) with \(\|u_1\|_2 = 1\). To summarise we have

  • \(v_1 \in \mathbb{R}^p\) with \(\|v_1\|_2 = 1\).
  • \(\sigma_1 = \|X\| = \|Xv_1\|_2\).
  • \(u_1 \in \mathbb{R}^n\) with \(\|u_1\|_2=1\) and \(Xv_1 = \sigma_1u_1\).

We have now started to fill in our SVD. The number \(\sigma_1 \ge 0\) is the first singular value of \(X\) and the vectors \(v_1\) and \(u_1\) will be the first columns of the matrices \(V\) and \(U\) respectively.
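
As a quick numerical illustration of this first step, here is a sketch (NumPy, with an arbitrary random matrix and seed) of the fact that \(\sigma_1 = \|X\|\) is the largest value of \(\|Xv\|_2\) over unit vectors \(v\):

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
X = rng.standard_normal((6, 4))     # an arbitrary 6 x 4 example

sigma_1 = np.linalg.norm(X, 2)      # the operator norm of X

# Crude search: ||Xv||_2 over many random unit vectors never exceeds sigma_1
# and gets close to it.
V = rng.standard_normal((4, 100_000))
V /= np.linalg.norm(V, axis=0)
print(sigma_1, np.linalg.norm(X @ V, axis=0).max())
```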

Now suppose that we have found the first \(k\) singular values \(\sigma_1,\ldots,\sigma_k\) and the first \(k\) columns of \(V\) and \(U\). If \(k = p\), then we are done. Otherwise we repeat a similar process.

Let \(v_1,\ldots,v_k\) and \(u_1,\ldots,u_k\) be the first \(k\) columns of \(V\) and \(U\). The vectors \(v_1,\ldots,v_k\) split \(\mathbb{R}^p\) into two subspaces. These subspaces are \(S_1 = \text{span}\{v_1,\ldots,v_k\}\) and \(S_2 = S_1^\perp\), the orthogonal complement of \(S_1\). By restricting \(X\) to \(S_2\) we get a new linear map \(X_{|S_2} : S_2 \to \mathbb{R}^n\). Like before, the operator norm of \(X_{|S_2}\) is defined to be

\(\|X_{|S_2}\| = \sup\{\|X_{|S_2}v\|_2:v \in S_2, \|v\|_2=1\}\).

Since \(S_2 = \text{span}\{v_1,\ldots,v_k\}^\perp\) we must have

\(\|X_{|S_2}\| = \sup\{\|Xv\|_2:v \in \mathbb{R}^p, \|v\|_2=1, v_j^Tv = 0 \text{ for } j=1,\ldots,k\}.\)

The set \(\{v \in \mathbb{R}^p : \|v\|_2=1, v_j^Tv=0\text{ for } j=1,\ldots,k\}\) is a compact set and thus there exists a vector \(v_{k+1}\) such that \(\|Xv_{k+1}\|_2 = \|X_{|S_2}\|\). As before define \(\sigma_{k+1} = \|Xv_{k+1}\|_2\) and \(u_{k+1} = Xv_{k+1}/\sigma_{k+1}\) if \(\sigma_{k+1}\neq 0\). If \(\sigma_{k+1} = 0\), then define \(u_{k+1}\) to be any unit vector in \(\mathbb{R}^{n}\) that is orthogonal to \(u_1,u_2,\ldots,u_k\).

This process repeats until eventually \(k = p\) and we have produced matrices \(U \in \mathbb{R}^{n \times p}, \Sigma \in \mathbb{R}^{p \times p}\) and \(V \in \mathbb{R}^{p \times p}\). In the next section, we will argue that these three matrices satisfy the properties of the SVD.

Correctness

The defining properties of the SVD were given at the start of this post. We will see that most of the properties follow immediately from the construction but one of them requires a bit more analysis. Let \(U = [u_1,\ldots,u_p]\), \(\Sigma = \text{diag}(\sigma_1,\ldots,\sigma_p)\) and \(V= [v_1,\ldots,v_p]\) be the output from the above construction.

First note that by construction \(v_1,\ldots, v_p\) are orthonormal since each has unit norm and we always had \(v_{k+1} \in \text{span}\{v_1,\ldots,v_k\}^\perp\). It follows that the matrix \(V\) is orthogonal and so \(V^TV=VV^T=I_p\).

The matrix \(\Sigma\) is diagonal by construction. Furthermore, we have that \(\sigma_{k+1} \le \sigma_k\) for every \(k\). This is because both \(\sigma_k\) and \(\sigma_{k+1}\) were defined as maximum value of \(\|Xv\|_2\) over different subsets of \(\mathbb{R}^p\). The subset for \(\sigma_k\) contained the subset for \(\sigma_{k+1}\) and thus \(\sigma_k \ge \sigma_{k+1}\).

We’ll next verify that \(X = U\Sigma V^T\). Since \(V\) is orthogonal, the vectors \(v_1,\ldots,v_p\) form an orthonormal basis for \(\mathbb{R}^p\). It thus suffices to check that \(Xv_k = U\Sigma V^Tv_k\) for \(k = 1,\ldots,p\). Again by the orthogonality of \(V\) we have that \(V^Tv_k = e_k\), the \(k^{th}\) standard basis vector. Thus,

\(U\Sigma V^Tv_k = U\Sigma e_k = U\sigma_k e_k = \sigma_k u_k.\)

Above, we used that \(\Sigma\) was a diagonal matrix and that \(u_k\) is the \(k^{th}\) column of \(U\). If \(\sigma_k \neq 0\), then \(\sigma_k u_k = Xv_k\) by definition. If \(\sigma_k =0\), then \(\|Xv_k\|_2=0\) and so \(Xv_k = 0 = \sigma_ku_k\) also. Thus, in either case, \(U\Sigma V^Tv_k = Xv_k\) and so \(U\Sigma V^T = X\).

The last property we need to verify is that \(U\) is orthogonal. Note that this isn’t obvious. At each stage of the process, we made sure that \(v_{k+1} \in \text{span}\{v_1,\ldots,v_k\}^\perp\). However, in the case that \(\sigma_{k+1} \neq 0\), we simply defined \(u_{k+1} = Xv_{k+1}/\sigma_{k+1}\). It is not clear why this would imply that \(u_{k+1}\) is orthogonal to \(u_1,\ldots,u_k\).

It turns out that a geometric argument is needed to show this. The idea is that if \(u_{k+1}\) was not orthogonal to \(u_j\) for some \(j \le k\), then \(v_j\) couldn’t have been the value that maximises \(\|Xv\|_2\).

Let \(u_{k}\) and \(u_j\) be two columns of \(U\) with \(j < k\) and \(\sigma_j,\sigma_k > 0\). We wish to show that \(u_j^Tu_k = 0\). To show this we will use the fact that \(v_j\) and \(v_k\) are orthonormal and perform “polar-interpolation“. That is, for \(\lambda \in [0,1]\), define

\(v_\lambda = \sqrt{1-\lambda}\cdot v_{j}+\sqrt{\lambda} \cdot v_k.\)

Since \(v_{j}\) and \(v_k\) are orthogonal, we have that

\(\|v_\lambda\|_2^2 = (1-\lambda)\|v_{j}\|_2^2+\lambda\|v_k\|_2^2 = (1-\lambda)+\lambda = 1.\)

Furthermore \(v_\lambda\) is orthogonal to \(v_1,\ldots,v_{j-1}\). Thus, by definition of \(v_j\),

\(\|Xv_\lambda\|_2^2 \le \sigma_j^2 = \|Xv_j\|_2^2.\)

By the linearity of \(X\) and the definitions of \(u_j,u_k\),

\(\|Xv_\lambda\|_2^2 = \|\sigma_j\sqrt{1-\lambda}\cdot u_j+\sigma_{k}\sqrt{\lambda}\cdot u_{k}\|_2^2\).

Expanding the square and using \(\|u_j\|_2 = \|u_k\|_2 = 1\), we have

\((1-\lambda)\sigma_j^2 + 2\sqrt{\lambda(1-\lambda)}\cdot \sigma_j\sigma_k u_j^Tu_{k}+\lambda\sigma_k^2 = \|Xv_\lambda\|_2^2 \le \sigma_j^2.\)

Rearranging and dividing by \(\sqrt{\lambda}\) gives,

\(2\sqrt{1-\lambda}\cdot \sigma_j\sigma_k u_j^Tu_k \le \sqrt{\lambda}\cdot(\sigma_j^2-\sigma_k^2)\) for all \(\lambda \in (0,1]\).

Taking \(\lambda \to 0\) gives \(u_j^Tu_k \le 0\). Performing the same polar interpolation with \(v_\lambda’ = \sqrt{1-\lambda}v_j - \sqrt{\lambda}v_k\) shows that \(-u_j^Tu_k \le 0\) and hence \(u_j^Tu_k = 0\).

We have thus proved that \(U\) is orthogonal. This proof is pretty “slick” but it isn’t very illuminating. To better demonstrate the concept, I made an interactive Desmos graph that you can access here.

This graph shows example vectors \(u_j, u_k \in \mathbb{R}^2\). The vector \(u_j\) is fixed at \((1,0)\) and a quarter circle of radius \(1\) is drawn. Any vectors \(u\) that are outside this circle have \(\|u\|_2 > 1 = \|u_j\|_2\).

The vector \(u_k\) can be moved around inside this quarter circle. This can be done either by clicking and dragging on the point or by changing the values of \(a\) and \(b\) on the left. The red curve is the path of

\(\lambda \mapsto \sqrt{1-\lambda}\cdot u_j+\sqrt{\lambda}\cdot u_k\).

As \(\lambda\) goes from \(0\) to \(1\), the path travels from \(u_j\) to \(u_k\).

Note that there is a portion of the red curve near \(u_j\) that is outside the black circle. This corresponds to a small value of \(\lambda > 0\) that results in \(\|X v_\lambda\|_2 > \|Xv_j\|_2\) contradicting the definition of \(v_j\). By moving the point \(u_k\) around in the plot you can see that this always happens unless \(u_k\) lies exactly on the y-axis. That is, unless \(u_k\) is orthogonal to \(u_j\).