Category: applied-mathematics

  • Poisson approximations to the negative binomial distribution

    This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating it.

    The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters \(p\) and \(r\). The parameter \(p\) is the probability that the coin lands on heads and \(r\) is the required number of heads. If \(X\) has the negative binomial distribution, then \(X = x\) means that in the first \(x+r-1\) tosses of the coin there were exactly \(r-1\) heads and that toss number \(x+r\) was a head. This means that the probability that \(X=x\) is given by

    \(\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}\)

    Here is a plot of the function \(f(x)\) for different values of \(r\) and \(p\).
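
    A minimal sketch of how such a plot can be produced, assuming scipy and matplotlib are available (the \((r,p)\) pairs below are illustrative choices, not necessarily those used in the figure). Note that scipy.stats.nbinom uses the same parameterization: the number of tails before the \(r\)-th head.

    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import nbinom

    x = np.arange(0, 30)
    for r, p in [(5, 0.5), (10, 0.7), (20, 0.8)]:  # illustrative parameter values
        # scipy's nbinom counts failures (tails) before the r-th success (head),
        # which matches the pmf f(x) above.
        plt.plot(x, nbinom.pmf(x, r, p), marker="o", label=f"r={r}, p={p}")
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.legend()
    plt.show()
    ```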

    Poisson approximations

    When the parameter \(r\) is large and \(p\) is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that \(r(1-p)=\lambda\) for some fixed positive real number \(\lambda\). As \(r\) becomes large, the negative binomial random variable with parameters \(p\) and \(r\) converges in distribution to a Poisson random variable with parameter \(\lambda\). This is illustrated in the picture below where three negative binomial distributions with \(r(1-p)=5\) approach the Poisson distribution with \(\lambda =5\).

    Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of \(1/r\) and that the error is bigger if the limiting value of \(r(1-p)\) is larger.

    It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \(\lambda = r(1-p)\). However, the mean of the negative binomial distribution is \(r(1-p)/p\). This suggests that we can get a better approximation by setting \(\lambda = r(1-p)/p\).

    The change from \(\lambda = r(1-p)\) to \(\lambda = r(1-p)/p\) is small because \(p \approx 1\). However, this small change gives a much better approximation, especially for larger values of \(r(1-p)\). The plot below shows that both approximations have errors on the order of \(1/r\), but the constant for the second approximation is much better.
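
    As a rough sketch of how such a comparison can be computed, the total variation distance between the negative binomial pmf and each Poisson approximation can be evaluated numerically. The values of \(r\) and of the limit \(r(1-p)\) below are illustrative assumptions.

    ```python
    import numpy as np
    from scipy.stats import nbinom, poisson

    def tv_distance(f, g):
        # Total variation distance between two pmfs evaluated on the same grid.
        return 0.5 * np.sum(np.abs(f - g))

    lam = 5.0                      # assumed limiting value of r(1-p)
    x = np.arange(0, 200)          # grid large enough to hold almost all the mass
    for r in [10, 100, 1000]:      # assumed values of r
        p = 1 - lam / r            # chosen so that r(1-p) = lam
        nb = nbinom.pmf(x, r, p)
        approx1 = poisson.pmf(x, r * (1 - p))        # first Poisson approximation
        approx2 = poisson.pmf(x, r * (1 - p) / p)    # second Poisson approximation
        print(r, tv_distance(nb, approx1), tv_distance(nb, approx2))
    ```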

    Second order accurate approximation

    It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation for the Poisson distribution is given in this paper.1 The approximation is

    \(\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}(1-p)\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},\)

    where \(\lambda = \frac{r(1-p)}{p}\) as in the second Poisson approximation and \(P_\lambda(x)\) is the Poisson pmf evaluated at \(x\).
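
    Here is a minimal sketch of this corrected approximation, assuming scipy is available; the parameter values used in the quick check at the end are illustrative.

    ```python
    import numpy as np
    from scipy.stats import nbinom, poisson

    def gram_charlier_pmf(x, r, p):
        # Gram-Charlier corrected Poisson approximation to the negative binomial pmf.
        lam = r * (1 - p) / p
        P = poisson.pmf(x, lam)
        P_shift = poisson.pmf(x - 1, lam)   # equals zero at x = 0
        return P - 0.5 * (1 - p) * ((x - lam) * P - (x - 1 - lam) * P_shift)

    # Quick check: total variation distance to the exact negative binomial pmf.
    x = np.arange(0, 60)
    r, p = 100, 0.95                         # illustrative parameter values
    print(0.5 * np.abs(nbinom.pmf(x, r, p) - gram_charlier_pmf(x, r, p)).sum())
    ```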

    The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of \(1/r^2\). This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.

    1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo: it should say \(\frac{1}{2}q\) instead of \(\frac{1}{2}p\). ↩︎
  • Monte Carlo integration in high dimensions

    Monte Carlo integration

    John Cook recently wrote a cautionary blog post about using Monte Carlo to estimate the volume of a high-dimensional ball. He points out that if \(\mathbf{X}=(X_1,\ldots,X_n)\) are independent and uniformly distributed on the interval \([-1,1]\) then

    \(\displaystyle{\mathbb{P}(X_1^2 + X_2^2 + \cdots + X_n^2 \le 1) = \frac{v_n}{2^n}},\)

    where \(v_n\) is the volume of an \(n\)-dimensional ball with radius one. This observation means that we can use Monte Carlo to estimate \(v_n\).

    To do this, we repeatedly sample vectors \(\mathbf{X}_m = (X_{m,1},\ldots,X_{m,n})\) with \(X_{m,i}\) uniform on \([-1,1]\) and \(m\) ranging from \(1\) to \(M\). Next, we compute the proportion of vectors \(\mathbf{X}_m\) with \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). Specifically, if \(Z_m\) is equal to one when \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) and equal to zero otherwise, then by the law of large numbers

    \(\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \approx \frac{v_n}{2^n}},\)

    which implies

    \(\displaystyle{v_n \approx \frac{2^n}{M}\sum_{m=1}^M Z_m}.\)

    This method of approximating a volume or integral by sampling and counting is called Monte Carlo integration and is a powerful general tool in scientific computing.
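
    Here is a minimal sketch of the plain Monte Carlo estimator described above (the random seed is arbitrary):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def volume_plain_mc(n, M=1000):
        # Sample M points uniformly from the cube [-1, 1]^n and count the
        # fraction that land inside the unit ball.
        X = rng.uniform(-1.0, 1.0, size=(M, n))
        Z = np.sum(X**2, axis=1) <= 1.0
        return 2**n * Z.mean()

    print([volume_plain_mc(n) for n in range(1, 11)])
    ```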

    The problem with Monte Carlo integration

    As pointed out by John, Monte Carlo integration does not work very well in this example. The plot below shows a large difference between the true value of \(v_n\) and the Monte Carlo approximation with \(M = 1,000\), for \(n\) ranging from \(1\) to \(20\).

    The problem is that \(v_n\) is very small and the probability \(v_n/2^n\) is even smaller! For example when \(n = 10\), \(v_n/2^n \approx 0.0025\). This means that in our one thousand samples we only expect two or three occurrences of the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). As a result our estimate has a high variance.

    The results get even worse as \(n\) increases. The probability \(v_n/2^n\) goes to zero faster than exponentially. Even with a large value of \(M\), our estimate will typically be zero. Since \(v_n \neq 0\), the relative error in the approximation is 100%.

    Importance sampling

    Monte Carlo can still be used to approximate \(v_n\). Instead of using plain Monte Carlo, we can use a variance reduction technique called importance sampling (IS). Instead of sampling the points \(X_{m,1},\ldots, X_{m,n}\) from the uniform distribution on \([-1,1]\), we sample them from some other distribution called a proposal distribution. The proposal distribution should be chosen so that the event \(X_{m,1}^2 +\cdots +X_{m,n}^2 \le 1\) becomes more likely.

    In importance sampling, we need to correct for the fact that we are using a new distribution instead of the uniform distribution. Instead of counting the number of times \(X_{m,1}^2+\cdots+X_{m,n}^2 \le 1\), we give weights to each of the samples and then add up the weights.

    If \(p\) is the density of the uniform distribution on \([-1,1]\) (the target distribution) and \(q\) is the density of the proposal distribution, then the IS Monte Carlo estimate of \(v_n\) is

    \(\displaystyle{\frac{2^n}{M}\sum_{m=1}^M Z_m \prod_{i=1}^n \frac{p(X_{m,i})}{q(X_{m,i})}},\)

    where as before \(Z_m\) is one if \(X_{m,1}^2+\cdots +X_{m,n}^2 \le 1\) and \(Z_m\) is zero otherwise. As long as \(q(x)=0\) implies \(p(x)=0\), the IS Monte Carlo estimate will be an unbiased estimate of \(v_n\). More importantly, a good choice of the proposal distribution \(q\) can drastically reduce the variance of the IS estimate compared to the plain Monte Carlo estimate.

    In this example, a good choice of proposal distribution is the normal distribution with mean \(0\) and variance \(1/n\). Under this proposal distribution, the expected value of \(X_{m,1}^2 +\cdots+ X_{m,n}^2\) is one and so the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) is much more likely.
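
    Here is a sketch of the importance sampling estimator with this normal proposal (scipy is used for the normal log-density, and the exact formula \(v_n = \pi^{n/2}/\Gamma(n/2+1)\) is used only for comparison):

    ```python
    import numpy as np
    from math import pi, gamma
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def volume_is_mc(n, M=1000):
        # Proposal: independent N(0, 1/n) coordinates, chosen so that the
        # expected squared norm of a sample is one.
        sigma = 1.0 / np.sqrt(n)
        X = rng.normal(0.0, sigma, size=(M, n))
        inside = np.sum(X**2, axis=1) <= 1.0
        # The weights p/q only matter when Z_m = 1.  Inside the unit ball every
        # coordinate lies in [-1, 1], where the uniform density p equals 1/2.
        log_w = n * np.log(0.5) - norm.logpdf(X[inside], scale=sigma).sum(axis=1)
        return 2**n * np.sum(np.exp(log_w)) / M

    for n in (5, 10, 20):
        exact = pi**(n / 2) / gamma(n / 2 + 1)   # exact volume of the unit n-ball
        print(n, volume_is_mc(n), exact)
    ```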

    Here are the IS Monte Carlo estimates, again with \(M = 1,000\) and \(n\) ranging from \(1\) to \(20\). The results speak for themselves.

    The relative error is typically less than 10%, a big improvement over the 100% relative error of plain Monte Carlo.

    The next plot shows a close agreement between \(v_n\) and the IS Monte Carlo approximation on the log scale with \(n\) going all the way up to \(100\).

    Notes

    1. There are exact formulas for \(v_n\) (available on Wikipedia). I used these to compare the approximations and compute relative errors. There are related problems where no formulas exist and Monte Carlo integration is one of the only ways to get an approximate answer.
    2. The post by John Cook also talks about why the central limit theorem can’t be used to approximate \(v_n\). I initially thought a technique called large deviations could be used to approximate \(v_n\) but again this does not directly apply. I was happy to discover that importance sampling worked so well!


  • The fast Fourier transform as a matrix factorization

    The discrete Fourier transform (DFT) is a mapping from \(\mathbb{C}^n\) to \(\mathbb{C}^n\). Given a vector \(u \in \mathbb{C}^n\), the discrete Fourier transform of \(u\) is another vector \(U \in \mathbb{C}^n\) given by,

    \(\displaystyle{U_k = \sum_{j=0}^{n-1} \exp\left(-2 \pi i kj/n\right)u_j}.\)

    If we let \(\omega_n = \exp\left(-2 \pi i /n\right)\), then \(U\) can be written more compactly as

    \(\displaystyle{U_k = \sum_{j=0}^{n-1} \omega^{kj}_n u_j }.\)

    The map from \(u\) to \(U\) is linear and can therefore be represented as multiplication by an \(n \times n\) matrix \(D\). Looking at the equation above, we can see that if we set \(D_{kj} = \omega^{kj}_n\) and use \(0\)-indexing, then,

    \(U = Du.\)

    This suggests that calculating the vector \(U = Du\) would require roughly \(n^2\) operations. However, the famous fast Fourier transform shows that \(U\) can be computed in roughly \(n\log_2(n)\) operations. There are many derivations of the fast Fourier transform but I wanted to focus on one which is based on factorizing the matrix \(D\) into sparse matrices.
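
    For concreteness, here is a sketch of the dense \(O(n^2)\) approach: build \(D\) explicitly and multiply. The result is checked against numpy's FFT, which uses the same sign convention.

    ```python
    import numpy as np

    def dft_matrix(n):
        # D_{kj} = omega_n^{kj} with omega_n = exp(-2*pi*i/n), using 0-indexing.
        idx = np.arange(n)
        omega = np.exp(-2j * np.pi / n)
        return omega ** np.outer(idx, idx)

    n = 8
    u = np.random.default_rng(0).standard_normal(n)
    U = dft_matrix(n) @ u                   # O(n^2) matrix-vector product
    print(np.allclose(U, np.fft.fft(u)))    # numpy uses the same convention
    ```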

    Matrix factorizations

    A factorization of a matrix \(A\) is a way of writing \(A\) as a product of matrices \(A = A_L\cdots A_2A_1\) where typically \(L\) is \(2\) or \(3\). Matrix factorizations can tell us information about the matrix \(A\). For example, the singular value decomposition tells us the spectral norm and singular vectors of \(A\) (and lots more). Other matrix factorizations can help us compute things related to the matrix \(A\). The QR decomposition helps us solve least-squares problems and the LU factorization lets us invert \(A\) in (almost) the same amount of time it takes to solve the equation \(Ax = b\).

    Each of the factorizations above applies to arbitrary matrices. Every matrix has a singular value decomposition and a QR factorization. Likewise, every square matrix has an LU factorization (possibly after permuting its rows). The DFT matrix \(D\) has a factorization that most other matrices do not. This factorization lets us write \(D = D_L\cdots D_2D_1\) where each matrix \(D_l\) is sparse.

    Sparse matrices

    A matrix \(A\) is sparse if many of the entries of \(A\) are zero. There isn’t a precise definition of “many”. A good heuristic is that if \(A\) has size \(m\) by \(n\), then \(A\) is sparse if the number of non-zero entries is of the order of \(m\) or \(n\). In contrast, a dense matrix is one where the number of non-zero entries is of the same order as \(mn\).

    The structure of sparse matrices makes them especially nice to work with. They take up a lot less memory and they are quicker to compute with. If \(A \in \mathbb{C}^{m \times n}\) has only \(k\) non-zero elements, then for any vector \(x \in \mathbb{C}^n\), the matrix-vector product \(Ax\) can be computed in order \(k\) operations. If \(k\) is much less than \(mn\), then this can be a huge speed up.

    This shows why a sparse factorization of the DFT matrix \(D\) will give a fast way of calculating \(Du\). Specifically, suppose that we have \(D = D_L \cdots D_2D_1\) and each matrix \(D_l\) has just \(k_l\) non-zero entries. Then, for any vector \(u \in \mathbb{C}^n\), we can compute \(U= Du=D_L \cdots D_2 D_1 u\) in stages. First we multiply \(u\) by \(D_1\) to produce \(u^{(1)}\). The output \(u^{(1)}\) is then multiplied by \(D_2\) to produce \(u^{(2)}\). We continue in this way until we finally multiply \(u^{(L-1)}\) by \(D_L\) to produce \(U = D_L u^{(L-1)}\). The total computation cost is of the order \(k_1 + k_2 + \cdots + k_L\) which can be drastically less than \(n^2\).

    The DFT factorization

    We’ll now show that \(D\) does indeed have a sparse matrix factorization. To avoid some of the technicalities, we’ll assume that \(n\) is a power of \(2\). That is, \(n = 2^m\) for some \(m\). Remember that the discrete Fourier transform of \(u\) is the vector \(U\) given by

    \(\displaystyle{U_k = \sum_{j=0}^{n-1} \omega^{jk}_n u_j},\)

    where \(\omega_n = \exp(-2\pi i/n)\). The idea behind the fast Fourier transform is to split the above sum into two terms. One where we add up the even values of \(j\) and one where we add up the odd values of \(j\). Specifically,

    \(\displaystyle{U_k = \sum_{j=0,\, j \text{ even}}^{n-1} \omega_n^{jk} u_j + \sum_{j = 0,\, j \text{ odd}}^{n-1} \omega_n^{jk}u_j = \sum_{j=0}^{n/2 - 1} \omega^{2jk}_n u_{2j} + \sum_{j=0}^{n/2-1} \omega^{(2j+1)k}_n u_{2j+1}}\)

    We’ll now look at each of these sums separately. In the first sum, we have terms of the form \(\omega^{2 jk}_n\), which can be rewritten as follows,

    \(\omega^{2 jk}_n = \exp(-2 \pi i /n)^{2 jk} = \exp(-2 \pi i 2/n)^{jk} = \exp(-2 \pi i /(n/2))^{jk} = \omega_{n/2}^{jk}\).

    We can also use this result to rewrite the terms \(\omega^{(2j+1)k}_n\),

    \(\omega^{(2j+1)k}_n = \omega_n^k\omega^{2jk}_n = \omega_n^k \omega^{jk}_{n/2}.\)

    If we plug both of these back into the expression for \(U = Du\), we get

    \(\displaystyle{U_k = \sum_{j=0}^{n/2 – 1} \omega_{n/2}^{jk} u_{2j} + \sum_{j=0}^{n/2-1} \omega_n^k \omega_{n/2}^{jk} u_{2j+1} = \sum_{j=0}^{n/2-1} \omega_{n/2}^{jk} u_{2j} + \omega_n^k \sum_{j=0}^{n/2-1}\omega_{n/2}^{jk} u_{2j+1}. }\)

    This shows that the Fourier transform of the \(n\)-dimensional vector \(u\) can be computed by combining the Fourier transforms of two \(n/2\)-dimensional vectors. One of these vectors contains the even indexed elements of \(u\) and the other contains the odd indexed elements of \(u\).

    There is one more observation that we need. If \(k\) is between \(n/2\) and \(n-1\), then we can simplify the above expression for \(U_k\). Specifically, if we set \(k'=k-n/2\), then

    \(\omega_{n/2}^{jk} = \omega_{n/2}^{j(n/2+k')} = \omega_{n/2}^{jn/2}\omega_{n/2}^{jk'} = \exp(-2\pi i (n/2)/(n/2))^j \omega_{n/2}^{jk'} = \omega_{n/2}^{jk'}.\)

    If we plug this into the above formula for \(U_k\), then we get

    \(\displaystyle{U_k = \sum_{j=0}^{n/2-1} \omega_{n/2}^{j(k-n/2)} u_{2j} + \omega_n^k \sum_{j=0}^{n/2-1}\omega_{n/2}^{j(k-n/2)} u_{2j+1}}\) for \(k = n/2, n/2+1,\ldots, n-1\).

    This reduces the computation since there are some quantities we only have to compute once but can use twice. The algebra we did above shows that we can compute the discrete Fourier transform of \(u \in \mathbb{C}^n\) in the following three steps.

    1. Form the vectors \(u^{\text{even}} = (u_0,u_2,\ldots,u_{n-2}) \in \mathbb{C}^{n/2}\) and \(u^{\text{odd}} = (u_1,u_3,\ldots,u_{n-1}) \in \mathbb{C}^{n/2}\).
    2. Calculate the discrete Fourier transforms of \(u^{\text{even}}\) and \(u^{\text{odd}}\). This gives two vectors \(U^{\text{even}}, U^{\text{odd}} \in \mathbb{C}^{n/2}\).
    3. Compute \(U \in \mathbb{C}^n\) by,

    \(\displaystyle{U_k = \begin{cases} U^{\text{even}}_k + \omega_n^k U^{\text{odd}}_k & \text{for } k = 0,1,\ldots, n/2-1,\\ U^{\text{even}}_{k-n/2}+\omega_n^k U^{\text{odd}}_{k-n/2} &\text{for } k=n/2,n/2+1,\ldots,n-1 \end{cases}}\)
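
    These three steps translate directly into a short recursive implementation. The sketch below assumes \(n\) is a power of two and checks the result against numpy's FFT.

    ```python
    import numpy as np

    def fft_recursive(u):
        # Radix-2 FFT following steps 1-3 above; assumes len(u) is a power of two.
        u = np.asarray(u, dtype=complex)
        n = len(u)
        if n == 1:
            return u
        U_even = fft_recursive(u[0::2])              # steps 1 and 2, even entries
        U_odd = fft_recursive(u[1::2])               # steps 1 and 2, odd entries
        w = np.exp(-2j * np.pi * np.arange(n) / n)   # omega_n^k for k = 0, ..., n-1
        half = n // 2
        U = np.empty(n, dtype=complex)
        U[:half] = U_even + w[:half] * U_odd         # step 3, k = 0, ..., n/2 - 1
        U[half:] = U_even + w[half:] * U_odd         # step 3, k = n/2, ..., n - 1
        return U

    u = np.random.default_rng(1).standard_normal(16)
    print(np.allclose(fft_recursive(u), np.fft.fft(u)))
    ```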

    The steps above break down computing \(U=Du\) into three simpler steps. Each of these simpler steps is also a linear map and can therefore be written as multiplication by a matrix. If we call these matrices \(D_1,D_2\) and \(D_3\), then

    • The matrix \(D_1\) maps \(u\) to \(\begin{bmatrix} u^{\text{even}} \\ u^{\text{odd}}\end{bmatrix}\). That is

    \(D_1 = \begin{bmatrix} 1&0&0&0&\cdots&0&0\\0&0&1&0&\cdots &0&0 \\\vdots& \vdots &\vdots & \vdots & \ddots & \vdots & \vdots \\0&0&0&0&\cdots&1&0 \\ 0&1&0&0&\cdots&0&0 \\ 0&0&0&1&\cdots&0&0 \\ \vdots&\vdots &\vdots & \vdots & \ddots & \vdots & \vdots \\ 0&0&0&0&\cdots&0&1\end{bmatrix} \in \mathbb{C}^{n \times n}.\)

    • The matrix \(D_2\) is block-diagonal and each block is the \(n/2\)-dimensional discrete Fourier transformation matrix. That is

    \(D_2 = \begin{bmatrix}D^{(n/2)} & 0\\ 0 & D^{(n/2)}\end{bmatrix}\in \mathbb{C}^{n \times n},\)

    where \(D^{(n/2)}\) is given by \(D^{(n/2)}_{jk} = \omega_{n/2}^{jk}\).

    • The matrix \(D_3\) combines two \(n/2\)-vectors according to the formula in step 3 above,

    \(D_3 = \begin{bmatrix} 1&0&\cdots & 0 & \omega_n^0 & 0 & \cdots & 0\\ 0&1&\cdots&0 &0& \omega_n^1&\cdots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots& \vdots &\ddots&\vdots\\ 0&0&\cdots & 1 & 0 & 0 & \cdots & \omega_n^{n/2-1}\\1&0&\cdots & 0 & \omega_n^{n/2} & 0 & \cdots & 0\\ 0&1&\cdots&0 &0& \omega_n^{n/2+1}&\cdots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots& \vdots &\ddots&\vdots\\ 0&0&\cdots & 1 & 0 & 0 & \cdots & \omega_n^{n-1}\end{bmatrix}\in \mathbb{C}^{n \times n} \)

    The three-step procedure above for calculating \(U=Du\) is equivalent to iteratively multiplying by \(D_1\), then by \(D_2\) and finally by \(D_3\). Another way of saying this is that \(D=D_3D_2D_1\) is a matrix factorization!
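
    As a sanity check, the factorization \(D = D_3D_2D_1\) can be verified numerically for a small value of \(n\) (here \(n = 8\), an arbitrary choice), reusing the dft_matrix helper sketched earlier.

    ```python
    import numpy as np

    def dft_matrix(n):
        # Same helper as in the earlier sketch: D_{kj} = omega_n^{kj}.
        idx = np.arange(n)
        return np.exp(-2j * np.pi / n) ** np.outer(idx, idx)

    n = 8
    half = n // 2
    w = np.exp(-2j * np.pi * np.arange(n) / n)
    zeros = np.zeros((half, half))

    # D_1: permutation sending u to (u_even, u_odd).
    D1 = np.eye(n)[np.r_[0:n:2, 1:n:2]]
    # D_2: block diagonal with two copies of the (n/2)-point DFT matrix.
    D2 = np.block([[dft_matrix(half), zeros], [zeros, dft_matrix(half)]])
    # D_3: combines the two half-size transforms using the twiddle factors omega_n^k.
    D3 = np.block([[np.eye(half), np.diag(w[:half])],
                   [np.eye(half), np.diag(w[half:])]])

    print(np.allclose(D3 @ D2 @ D1, dft_matrix(n)))
    ```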

    From the definitions, we can see that each of the matrices \(D_1,D_2\) and \(D_3\) have a lot of zeros. Specifically, \(D_1\) has just \(n\) non-zero entries (one in each row), \(D_2\) has \(2(n/2)^2\) non-zero entries (two blocks of \(n/2 \times n/2\) non-zeros) and \(D_3\) has \(2n\) non-zero entries (two in each row). This means that the product \(U=D_3D_2D_1u\) can be computed in roughly

    \(n+2(n/2)^2 + 2n = 3n+2(n/2)^2,\)

    operations. This is far away from the \(n\log_2(n)\) operations promised at the start of this blog post. However, the matrix \(D_2\) contains two copies of the \(n/2 \times n/2\) discrete Fourier transform matrix. This means we could apply the exact same argument to the \(n/2 \times n/2\) matrix \(D^{(n/2)}\). This lets us replace \((n/2)^2\) with \(\frac{3}{2}n + 2(n/4)^2\). This brings the number of operations down to

    \(3n+2\left(\frac{3}{2}n + 2(n/4)^2\right) = 6n + 4(n/4)^2.\)

    We can keep going! Recalling that \(n=2^m\), we can repeat the process \(m\) times in total, giving approximately

    \(3mn + 2^m = 3n\log_2(n)+n\) operations.

    So by using lots of sparse matrices, we can compute \(U=Du\) in \(\approx n\log_2(n)\) operations instead of \(\approx n^2\) operations.

  • Concavity of the squared sum of square roots

    The \(p\)-norm of a vector \(x\in \mathbb{R}^n\) is defined to be:

    \(\displaystyle{\Vert x \Vert_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}}.\)

    If \(p \ge 1\), then the \(p\)-norm is convex. When \(0<p<1\), this function is not convex; in fact, it is concave when all the entries of \(x\) are non-negative. On a recent exam for the course Convex Optimization, we were asked to prove this when \(p = 1/2\). In this special case, the mathematics simplifies nicely.

    When \(p = 1/2\), the \(p\)-norm can be described as the squared sum of square roots. Specifically,

    \(\displaystyle{\Vert x \Vert_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i|} \right)^2}.\)

    Note that we can expand the square and rewrite the function as follows

    \(\displaystyle{\Vert x \Vert_{1/2} = \sum_{i=1}^n\sum_{k=1}^n\sqrt{\left|x_i\right|} \sqrt{|x_k|} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{|x_ix_k|}}.\)

    If we restrict to \(x \in \mathbb{R}^n\) with \(x_i \ge 0\) for all \(i\), then this function simplifies to

    \(\displaystyle{\Vert x \Vert_{1/2} =\sum_{i=1}^n\sum_{k=1}^n \sqrt{x_ix_k}},\)

    which is a sum of geometric means. The geometric mean function \((u,v) \mapsto \sqrt{uv}\) is concave when \(u,v \ge 0\). This can be proved by calculating the Hessian of this function and verifying that it is negative semi-definite.
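
    That calculation can also be carried out with a computer algebra system; the sketch below uses sympy (an assumption, any CAS works) to compute the Hessian and its eigenvalues, which turn out to be zero and a negative quantity when \(u, v > 0\).

    ```python
    import sympy as sp

    u, v = sp.symbols("u v", positive=True)
    f = sp.sqrt(u * v)

    # Hessian of the geometric mean (u, v) -> sqrt(uv).
    H = sp.hessian(f, (u, v))
    print(sp.simplify(H))

    # One eigenvalue is 0 and the other is -(u^2 + v^2)/(4 (u v)^(3/2)) < 0,
    # so the Hessian is negative semi-definite and sqrt(uv) is concave.
    print({sp.simplify(ev): mult for ev, mult in H.eigenvals().items()})
    ```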

    From this, we can conclude that each function \(x \mapsto \sqrt{x_ix_k}\) is also concave. This is because \(x \mapsto \sqrt{x_ix_k}\) is the concave geometric mean composed with the linear map \(x \mapsto (x_i,x_k)\), and composing a concave function with a linear map preserves concavity. Finally, any sum of concave functions is also concave and thus \(\Vert x \Vert_{1/2}\) is concave.

    Hellinger distance

    A similar argument can be used to show that Hellinger distance is a convex function. Hellinger distance, \(d_H(\cdot,\cdot)\), is defined on pairs of probability distributions \(p\) and \(q\) on a common set \(\{1,\ldots,n\}\). For such a pair,

    \(\displaystyle{d_H(p,q) = \sum_{i=1}^n \left(\sqrt{p_i}-\sqrt{q_i}\right)^2},\)

    which certainly doesn’t look convex. However, we can expand the square and use the fact that \(p\) and \(q\) are probability distributions. This shows us that Hellinger distance can be written as

    \(\displaystyle{d_H(p,q) = \sum_{i=1}^n p_i – 2\sum_{i=1}^n \sqrt{p_iq_i}+\sum_{i=1}^n q_i = 2 – 2\sum_{i=1}^n\sqrt{p_iq_i}}.\)

    Again, each function \((p,q) \mapsto \sqrt{p_iq_i}\) is concave and so the negative sum of such functions is convex. Thus, \(d_H(p,q)\) is convex.
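
    As an informal numerical check (not a proof), the midpoint convexity inequality for \(d_H\) can be tested on randomly generated pairs of distributions:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def d_H(p, q):
        # Hellinger distance as defined above.
        return np.sum((np.sqrt(p) - np.sqrt(q))**2)

    n = 5
    for _ in range(1000):
        # Two random pairs of probability distributions on {1, ..., n}.
        p1, q1 = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
        p2, q2 = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
        mid = d_H((p1 + p2) / 2, (q1 + q2) / 2)
        assert mid <= (d_H(p1, q1) + d_H(p2, q2)) / 2 + 1e-12
    print("midpoint convexity held on all random pairs")
    ```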

    The course

    As a final comment, I’d just like to say how much I am enjoying the class. Prof. Stephen Boyd is a great lecturer and we’ve seen a wide variety of applications in the class. I’ve recently been reading a bit of John D Cook’s blog and agree with all he says about the course here.