Author: michaelnhowes

  • Poisson approximations to the negative binomial distribution

    This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating it.

    The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads is recorded. The distribution depends on two parameters \(p\) and \(r\). The parameter \(p\) is the probability that the coin lands on heads and \(r\) is the number of heads. If \(X\) has the negative binomial distribution, then \(X = x\) means that in the first \(x+r-1\) tosses of the coin there were \(r-1\) heads and that toss number \(x+r\) was a head. This means that the probability that \(X=x\) is given by

    \(\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}\)

    Here is a plot of the function \(f(x)\) for different values of \(r\) and \(p\).
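
    A plot like this can be produced in a few lines of Python. Here is a sketch (the parameter values are only illustrative, and it assumes scipy and matplotlib are installed); scipy.stats.nbinom uses the same parametrisation as \(f(x)\) above.

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.stats import nbinom

    # Plot the negative binomial pmf for a few illustrative (r, p) pairs.
    x = np.arange(0, 30)
    for r, p in [(5, 0.5), (10, 0.7), (20, 0.8)]:
        plt.plot(x, nbinom.pmf(x, r, p), marker="o", label=f"r={r}, p={p}")
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.legend()
    plt.show()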

    Poisson approximations

    When the parameter \(r\) is large and \(p\) is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that \(r(1-p)=\lambda\) for some fixed positive real number \(\lambda\). As \(r\) grows, the negative binomial random variable with parameters \(p\) and \(r\) converges in distribution to a Poisson random variable with parameter \(\lambda\). This is illustrated in the picture below where three negative binomial distributions with \(r(1-p)=5\) approach the Poisson distribution with \(\lambda =5\).

    Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of \(1/r\) and that the error is bigger if the limiting value of \(r(1-p)\) is larger.

    It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \(\lambda = r(1-p)\). However, the mean of the negative binomial distribution is \(r(1-p)/p\). This suggests that we can get a better approximation by setting \(\lambda = r(1-p)/p\).

    The change from \(\lambda = r(1-p)\) to \(\lambda = r(1-p)/p\) is small because \(p \approx 1\). However, this small change gives a much better approximation, especially for larger values of \(r(1-p)\). The plot below shows that both approximations have errors on the order of \(1/r\), but the constant for the second approximation is much better.
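
    These error curves can be reproduced by computing the total variation distance directly from the pmfs. Below is a sketch (assuming scipy is available; the supports are truncated at a large value and the parameter values are illustrative) comparing the two choices of \(\lambda\).

    import numpy as np
    from scipy.stats import nbinom, poisson

    def tv_distance(f, g):
        # Total variation distance between two pmfs evaluated on the same grid.
        return 0.5 * np.sum(np.abs(f - g))

    lam = 5.0               # limiting value of r(1 - p)
    x = np.arange(0, 200)   # truncate both supports at a large value
    for r in [10, 100, 1000]:
        p = 1 - lam / r     # chosen so that r(1 - p) = lam
        nb = nbinom.pmf(x, r, p)
        error1 = tv_distance(nb, poisson.pmf(x, r * (1 - p)))      # lambda = r(1-p)
        error2 = tv_distance(nb, poisson.pmf(x, r * (1 - p) / p))  # lambda = r(1-p)/p
        print(r, error1, error2)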

    Second order accurate approximation

    It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation to the negative binomial distribution is given in this paper.1 The approximation is

    \(\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}(1-p)\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},\)

    where \(\lambda = \frac{r(1-p)}{p}\) as in the second Poisson approximation and \(P_\lambda(x)\) is the Poisson pmf evaluated at \(x\).
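
    A direct implementation of \(f_{GC}\) only takes a couple of lines. Here is a sketch (assuming scipy; the term \(P_\lambda(x-1)\) is zero when \(x=0\), and scipy's poisson.pmf returns zero outside the support).

    import numpy as np
    from scipy.stats import poisson

    def gram_charlier_pmf(x, r, p):
        # Gram-Charlier approximation to the negative binomial pmf.
        lam = r * (1 - p) / p
        correction = (x - lam) * poisson.pmf(x, lam) - (x - 1 - lam) * poisson.pmf(x - 1, lam)
        return poisson.pmf(x, lam) - 0.5 * (1 - p) * correction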

    The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of \(1/r^2\). This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.

    1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo: it should say \(\frac{1}{2}q\) instead of \(\frac{1}{2}p\). ↩︎
  • Jigsaw puzzles and the isoperimetric inequality

    The 2025 World Jigsaw Puzzle Championship recently took place in Spain. In the individual heats, competitors had to choose their puzzle. Each competitor could either complete a 500-piece rectangular puzzle or a 500-piece circular puzzle. For example, in Round F, puzzlers could choose one of the following two puzzles:

    There are many things to consider when picking a puzzle. First, the puzzles had already been released. This meant a puzzler could get an advantage by choosing to do a puzzle they had already completed. The image on the puzzle is also important, as is the puzzle’s style. In the example above, most puzzlers picked the circular puzzle because it had a clear colour gradient and distinct sections.

    The commentators mentioned that puzzlers might also consider the number of edge pieces. The circular puzzles have fewer edge pieces than the rectangular puzzles. This could make the rectangular puzzle easier because some puzzlers like to do the edge first and then build inwards.

    The difference in edge pieces isn’t a coincidence. A circular puzzle has fewer edge pieces than any other 500-piece puzzle. This fact is a consequence of a piece of mathematics called the isoperimetric inequality.

    The isoperimetric inequality

    The isoperimetric inequality is a statement about 2D shapes. The word ‘isoperimetric’ literally translates to ‘having the same perimeter.’ The isoperimetric inequality says that if a shape has the same perimeter as a particular circle, then the circle has a larger (or equal) area. For example, the square and circle below have the same perimeter but the circle has a larger area:

    The isoperimetric inequality can also be flipped around to give a statement about shapes with the same area. This ‘having the same area’ inequality states that if a shape has the same area as a circle, then the circle has a smaller (or equal) perimeter. For example, the square and circle below have the same area but the circle has a smaller perimeter:

    The ‘having the same area’ inequality applies to the jigsaw puzzles above. In a jigsaw puzzle, the area is proportional to the number of pieces and the perimeter is proportional to the number of edge pieces (assuming the pieces are all roughly the same size and shape). This means that if one puzzle has the same number of pieces as a circular puzzle (for example 500), then the circular puzzle will have fewer edge pieces! This is exactly what the commentator said about the puzzles in the competition!
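
    To put rough numbers on this, here is a quick back-of-the-envelope calculation in Python (the 20-by-25 grid and the assumption that every piece has roughly unit area are only for illustration).

    import math

    # A 20 x 25 rectangular puzzle has 500 pieces; pieces on the boundary are edge pieces.
    rows, cols = 20, 25
    rect_edge_pieces = 2 * (rows + cols) - 4    # 86 edge pieces

    # A circular puzzle with 500 unit-area pieces has area 500 and radius sqrt(500 / pi).
    # The number of edge pieces is roughly the circumference divided by a piece width of 1.
    radius = math.sqrt(500 / math.pi)
    circle_edge_pieces = 2 * math.pi * radius   # about 79 edge pieces

    print(rect_edge_pieces, round(circle_edge_pieces))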

    I don’t think many of the puzzlers thought about the isoperimetric inequality when picking their puzzles. But if you ever do a circular puzzle, remember that you’re doing a puzzle with as few edge pieces as possible!

  • The Multivariate Hypergeometric Distribution

    The hypergeometric distribution describes the number of white balls in a sample of \(n\) balls drawn without replacement from an urn with \(m_W\) white balls and \(m_B\) black balls. The probability of having \(x\) white balls in the sample is:

    \(\displaystyle{f(x; m_W, m_B, n) = \frac{\binom{m_W}{x} \binom{m_B}{n-x}}{\binom{m_W+m_B}{n}}}\)

    This is because there are \(\binom{m_W+m_B}{n}\) possible ways to sample \(n\) balls from the urn, which has \(m_W+m_B\) balls in total. Of these possibilities, there are \(\binom{m_W}{x} \binom{m_B}{n-x}\) ways of selecting exactly \(x\) white balls.

    The multivariate hypergeometric distribution is a generalization of the hypergeometric distribution. In the multivariate hypergeometric distribution, there are \(k\) colours instead of \(2\). Suppose that the urn contains \(m_1\) balls of the first colour, \(m_2\) balls of the second colour and so on. The total number of balls in the urn is \(M = m_1+m_2+\cdots + m_k\). Next, we draw \(n\) balls from the urn. Let \(X_i\) be the number of balls of colour \(i\) in our sample. The vector \(X=(X_1,X_2,\ldots, X_k)\) follows the multivariate hypergeometric distribution. The vector \(X\) satisfies the constraints \(X_i \in \{0,1,\ldots,m_i\}\) and \(\sum_{i=1}^k X_i = n\). The probability of observing \(X=x\) is

    \(\displaystyle{f(x; m, n) = \frac{\binom{m_1}{x_1} \binom{m_2}{x_2}\cdots \binom{m_k}{x_k}}{\binom{M}{n} }, \quad x_1+x_2+\cdots+x_k = n}\)

    This is because there are \(\binom{M}{n}\) possible ways of sampling \(n\) balls from the urn that has \(M\) balls in total. Likewise, the number of ways of choosing \(x_i\) balls of colour \(i\) is \(\binom{m_i}{x_i}\).
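
    For reference, here is a direct implementation of this pmf using math.comb (just a sketch; recent versions of scipy also provide scipy.stats.multivariate_hypergeom).

    from math import comb, prod

    def multivariate_hypergeom_pmf(x, m, n):
        # Probability of drawing x = (x_1, ..., x_k) balls of each colour when
        # n balls are drawn without replacement from an urn with counts m.
        assert sum(x) == n
        return prod(comb(m_i, x_i) for m_i, x_i in zip(m, x)) / comb(sum(m), n)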

    Sampling

    There are two common methods for sampling from the multivariate hypergeometric distribution. The first option is to sample \(n\) numbers from \(1,\ldots,M\) without replacement. The numbers \(1,\ldots,M\) represent the balls in the urn. We can let \(1,\ldots,m_1\) represent the balls of colour 1, let \(m_1+1,\ldots,m_1+m_2\) represent the balls of colour 2, and so on. Suppose that \(Y_1,\ldots,Y_n\) is the sample of size \(n\) without replacement from \(1,\ldots,M\). Next define:

    \(\displaystyle{X_i = \#\{j : m_1+m_2+\cdots+m_{i-1} < Y_j \le m_1+m_2+\cdots+m_{i-1}+m_i\}}\)

    Then the vector \(X=(X_1,\ldots,X_k)\) will be distributed according to the multivariate hypergeometric distribution. This method of sampling has complexity on the order of \(n\log(M)\). This is because sampling a random number of size \(M\) has complexity \(\log(M)\) and we need to sample \(n\) of these numbers.
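
    Here is a sketch of this first method using numpy (with the minor change that the balls are labelled \(0,\ldots,M-1\) rather than \(1,\ldots,M\)).

    import numpy as np

    def multivariate_hypergeometric_sample_v1(m, n):
        # Sample n ball labels without replacement, then count how many labels
        # fall into each colour's block of consecutive labels.
        m = np.asarray(m)
        y = np.random.choice(np.sum(m), size=n, replace=False)
        boundaries = np.cumsum(m)   # right end point of each colour's block
        colours = np.searchsorted(boundaries, y, side="right")
        return np.bincount(colours, minlength=len(m))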

    There is a second method of sampling that is more efficient when \(n\) is large and \(k\) is small. This is a sequential method that samples each \(X_i\) from a (univariate) hypergeometric distribution. The idea is that when sampling \(X_1\) we can treat the balls of colour \(1\) as “white” and all the other colours as “black”. This means that marginally, \(X_1\) is hypergeometric with parameters \(m_W = m_1\), \(m_B = m_2+\cdots + m_k\) and \(n = n\). Furthermore, conditional on \(X_1\), the second entry, \(X_2\), is also hypergeometric. The parameters for \(X_2\) are \(m_W = m_2\), \(m_B = m_3+\cdots+m_k\) and \(n=n-X_1\). In fact, all of the conditionals of \(X\) are hypergeometric random variables and so can be sampled sequentially. The Python code at the end of this post implements the sequential sampler.

    The complexity of the sequential method is bounded above by \(k\) times the complexity of sampling from the hypergeometric distribution. By using the ratio-of-uniforms method, this can be done in complexity on the order of \(\log(m_W+m_B)\). The overall complexity of the sequential method is therefore on the order of \(k\log(M)\) which can be much smaller than \(n\log(M)\).

    import numpy as np
    
    def multivariate_hypergeometric_sample(m, n):
        # m: counts of each colour in the urn, n: number of balls to draw.
        m = np.asarray(m)
        x = np.zeros_like(m)
        k = len(m)
        for i in range(k-1):
            if n == 0:
                break  # np.random.hypergeometric requires at least one draw
            # Treat colour i as "white" and all later colours as "black".
            x[i] = np.random.hypergeometric(m[i], np.sum(m[(i+1):]), n)
            n -= x[i]
        x[k-1] = n  # the last colour gets whatever remains
        return x
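
    For example, to draw 25 balls from an urn with three colours (the counts below are arbitrary):

    sample = multivariate_hypergeometric_sample(np.array([10, 20, 70]), 25)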
  • The maximum of geometric random variables

    I was working on a problem involving the maximum of a collection of geometric random variables. To state the problem, let \((X_i)_{i=1}^n\) be i.i.d. geometric random variables with success probability \(p=p(n)\). Next, define \(M_n = \max\{X_i : 1 \le i \le n\}\). I wanted to know the limiting distribution of \(M_n\) as \(p \to 0\) and \(n \to \infty\). I was particularly interested in the case when \(p\) was on the order of \(1/n\).

    It turns out that this problem has a very nice solution: If \(p\to 0\) as \(n \to \infty\), then for all real numbers \(c\)

    \(\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}\)

    That is, \(M_n\) is on the order of \(\frac{\log n}{p}\) and has fluctuations of order \(1/p\). Furthermore, these fluctuations are distributed according to the Gumbel distribution. Importantly, this result places no restrictions on how quickly \(p\) goes to \(0\).
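
    This limit is easy to check by simulation. The sketch below (the parameter choices are only illustrative) simulates \(M_n\) with \(p = 1/n\), rescales the maximum and compares the empirical probabilities with the Gumbel CDF.

    import numpy as np

    # Simulate M_n for n geometric random variables with p = 1/n. The rescaled
    # maximum p * M_n - log(n) should be approximately Gumbel distributed.
    n = 2_000
    p = 1 / n
    trials = 2_000
    maxima = np.random.geometric(p, size=(trials, n)).max(axis=1)
    rescaled = p * maxima - np.log(n)
    for c in [-1.0, 0.0, 1.0, 2.0]:
        print(c, np.mean(rescaled <= c), np.exp(-np.exp(-c)))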

    In the rest of this post, I’ll derive the above identity. The first step relates \(M_n\) to the maximum of a collection of exponential random variables. The second step uses a coupling to prove that the exponential and geometric random variables are close to each other.

    An easier version

    When \(p\) goes to \(0\), the scaled geometric random variables \(p X_i\) converge to exponential random variables. Exponential random variables are nicer to work with because they have a continuous distribution.

    This suggests that we could try replacing all the geometric random variables with exponential random variables. More precisely, we will replace \(X_i\) with \(Y_i/p\) where \(Y_i\) are exponentially distributed. Next, we will derive the main result for the exponential random variables. Then, in the next section, we will prove that \(X_i\) is sufficiently close to \(Y_i/p\).

    Let \(\widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}\) where \(Y_i\) are i.i.d and exponentially distributed. This means that for \(y \ge 0\), \(\mathbb{P}(Y_i \le y)=1-e^{-y}\). The condition \(\widetilde{M}_n \le y\) is equivalent to \(Y_i \le py\) for all \(i =1,\ldots,n\). Since the \(Y_i\)’s are independent, this implies that for \(y \ge 0\)

    \(\displaystyle{\mathbb{P}(\widetilde{M}_n \le y) = \prod_{i=1}^n \mathbb{P}(Y_i \le py) = \left(1-e^{-py}\right)^n}\)

    Now fix a real number \(c\) and let \(y = \frac{\log n}{p}+\frac{c}{p}\). Since \(n \to \infty\), we know that \(y \ge 0\) for large enough \(n\). Thus, for sufficiently large \(n\)

    \(\displaystyle{\mathbb{P}\left(\widetilde{M}_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \left(1-e^{-\log n - c}\right)^n = \left(1-\frac{e^{-c}}{n}\right)^n\to \exp(-e^{-c})}\)

    And so the main result holds for \(\widetilde{M}_n = \max\{Y_i/p : 1 \le i \le n\}\). We want to show that it holds for \(M_n = \max\{X_i : 1 \le i \le n\}\). The next section proves a close relationship between \(X_i\) and \(Y_i\).

    Exponential and geometric coupling

    It is true that as \(p \to 0\), \(pX_i\) converges in distribution to an exponential random variable. This is a result that holds for a single random variable. For an approximation like \(M_n \approx \widetilde{M}_n\), we need a result that works for every \(X_i\) at once. This seems tricky since it involves the convergence of an infinite collection of random variables. Fortunately, there is a simple approach based on coupling.

    For each geometric random variable \(X_i\), it is possible to produce an exponential random variable \(Y_i\) with

    \(\displaystyle{\frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},\)

    where \(\lambda = -\log(1-p)\). This is because \(\mathbb{P}(X_i \le x) = 1-(1-p)^{\lfloor x \rfloor}\) and \(\mathbb{P}(Y_i/\lambda \le x) = 1-e^{-\lambda x} = 1-(1-p)^{x}\). This means that

    \( \displaystyle{\mathbb{P}(Y_i/\lambda \le x-1) \le \mathbb{P}(X_i \le x)\le \mathbb{P}(Y_i/\lambda \le x) }.\)

    If we sample both \(X_i\) and \(Y_i\) with inverse transform sampling, and use the same uniform random variables, then we will have

    \(\displaystyle{ \frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},\)

    as claimed. If we use this inequality for every single pair \((X_i,Y_i)\) we will get

    \(\displaystyle{\max\{Y_i/ \lambda\} \le M_n \le \max\{Y_i/\lambda\} +1}\)

    As \(p \to 0\), we have that \(\lambda \to 0\) and \(\frac{p}{\lambda}=\frac{p}{-\log(1-p)} \to 1\). This means that the random variable \(\max\{Y_i/\lambda\} \) has the same limit as \(\widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}\). Furthermore, \(\max\{Y_i/\lambda\} +1\) also has the same limit as \(\widetilde{M}_n\). Since \( M_n\) is “sandwiched” between \(\max\{Y_i/\lambda\}\) and \(\max\{Y_i/\lambda\} +1\), the result must also hold for \(M_n\). And so

    \(\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}\)
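
    The coupling is also easy to see in code: if the geometric and exponential random variables are built from the same uniforms by inverse transform sampling, every pair satisfies the sandwich inequality. Here is a sketch (with an arbitrary choice of \(p\)):

    import numpy as np

    # Couple geometric and exponential random variables through the same uniforms
    # and check that Y / lambda <= X <= Y / lambda + 1 always holds.
    p = 0.01
    lam = -np.log(1 - p)
    u = np.random.uniform(size=100_000)
    x = np.ceil(np.log(1 - u) / np.log(1 - p))   # geometric with parameter p
    y = -np.log(1 - u)                           # exponential with rate 1
    print(np.all((y / lam <= x) & (x <= y / lam + 1)))   # prints True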

  • “Uniformly random”

    The term “uniformly random” sounds like a contradiction. How can the word “uniform” be used to describe anything that’s random? Uniformly random actually has a precise meaning, and, in a sense, means “as random as possible.” I’ll explain this with an example about shuffling cards.

    Shuffling cards

    Suppose I have a deck of ten cards labeled 1 through 10. Initially, the cards are face down and in perfect order. The card labeled 10 is on top of the deck. The card labeled 9 is second from the top, and so on down to the card labeled 1. The cards are definitely not random.

    Next, I generate a random number between 1 and 10. I then find the card with the corresponding label and put it face down on top of the deck. The cards are now somewhat random. The number on top could be anything, but the rest of the cards are still in order. The cards are random but they are not uniformly random.

    Now suppose that I keep generating random numbers and moving cards to the top of the deck. Each time I do this, the cards get more random. Eventually (after about 30 moves1) the cards will be really jumbled up. Even if you knew the first few cards, it would be hard to predict the order of the remaining ones. Once the cards are really shuffled, they are uniformly random.
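
    This move-a-random-card-to-the-top procedure is simple to simulate. Here is a short sketch in Python (the 30 moves match the average mentioned in the footnote):

    import random

    # Simulate the shuffle: repeatedly pick a random label and move that card
    # to the top of the deck (position 0 is the top).
    deck = list(range(10, 0, -1))   # card 10 on top, down to card 1 on the bottom
    for _ in range(30):
        label = random.randint(1, 10)
        deck.remove(label)
        deck.insert(0, label)
    print(deck)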

    Uniformly random

    A deck of cards is uniformly random if each of the possible arrangements of the cards is equally likely. After only moving one card, the deck of cards is not uniformly random. This is because there are only 10 possible arrangements of the deck. Once the deck is well-shuffled, all of the 3,628,800 possible arrangements are equally likely.

    In general, something is uniformly random if each possibility is equally likely. So the outcome of rolling a fair 6-sided die is uniformly random, but rolling a loaded die is not. The word “uniform” refers to the chance of each possibility (1/6 for each side of the die). These chances are all the same and “uniform”.

    This is why uniformly random can mean “as random as possible.” If you are using a fair die or a well-shuffled deck, there are no biases in the outcome. Mathematically, this means you can’t predict the outcome.

    Communicating research

    The inspiration for this post came from a conversation I had earlier in the week. I was telling someone about my research. As an example, I talked about how long it takes for a deck of cards to become uniformly random. They quickly stopped me and asked how the two words could ever go together. It was a good point! I use the words uniformly random all the time and had never realized this contradiction.2 It was a good reminder about the challenge of clear communication.

    Footnotes

    1. The number of moves it takes for the deck to be well-shuffled is actually random. But on average it takes around 30 moves. For the mathematical details, see Example 1 in Shuffling Cards and Stopping Times by David Aldous and Persi Diaconis. ↩︎
    2. Of the six posts I published last year, five contain the word uniform! ↩︎