Poisson approximations to the negative binomial distribution

This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating it.

The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters \(p\) and \(r\). The parameter \(p\) is the probability that the coin lands on heads and \(r\) is the number of heads. If \(X\) has the negative binomial distribution, then \(X = x\) means that in the first \(x+r-1\) tosses of the coin there were exactly \(r-1\) heads, and that toss number \(x+r\) was a head. This means that the probability that \(X=x\) is given by

\(\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}\)

Here is a plot of the function \(f(x)\) for different values of \(r\) and \(p\).
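To make the formula concrete, here is a minimal sketch (my own, not from the original post) that evaluates \(f(x)\) directly and checks it against scipy.stats.nbinom, which uses the same number-of-failures convention. The function name negbin_pmf and the example values of \(r\) and \(p\) are just illustrative choices.

from math import comb
from scipy.stats import nbinom

def negbin_pmf(x, r, p):
    # f(x) = C(x + r - 1, r - 1) * p^r * (1 - p)^x
    return comb(x + r - 1, r - 1) * p**r * (1 - p)**x

# Compare against scipy, which also counts failures before the r-th success.
r, p = 5, 0.6
for x in range(5):
    print(x, negbin_pmf(x, r, p), nbinom.pmf(x, r, p))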

Poisson approximations

When the parameter \(r\) is large and \(p\) is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that \(r(1-p)=\lambda\) for some positive real number \(\lambda\). As \(r\) grows, the negative binomial random variable with parameters \(p\) and \(r\) converges to a Poisson random variable with parameter \(\lambda\). This is illustrated in the picture below, where three negative binomial distributions with \(r(1-p)=5\) approach the Poisson distribution with \(\lambda =5\).

Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of \(1/r\) and that the error is bigger if the limiting value of \(r(1-p)\) is larger.

It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \(\lambda = r(1-p)\). However, the mean of the negative binomial distribution is \(r(1-p)/p\). This suggests that we can get a better approximation by setting \(\lambda = r(1-p)/p\).

The change from \(\lambda = r(1-p)\) to \(\lambda = r(1-p)/p\) is small because \(p \approx 1\). However, this small change gives a much better approximation, especially for larger values of \(r(1-p)\). The plot below shows that both approximations have errors on the order of \(1/r\), but the constant for the second approximation is much better.
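As a rough numerical illustration (a sketch of my own, not the code behind the plots), the total variation distance between the negative binomial distribution and each Poisson approximation can be estimated by summing over a truncated support. The helper name tv_distance, the truncation point and the parameter values are all assumptions made for the example.

import numpy as np
from scipy.stats import nbinom, poisson

def tv_distance(r, p, lam, x_max=2000):
    # Total variation distance between NB(r, p) and Poisson(lam),
    # computed over a truncated support.
    x = np.arange(x_max)
    return 0.5 * np.abs(nbinom.pmf(x, r, p) - poisson.pmf(x, lam)).sum()

lam0 = 5.0
for r in [10, 100, 1000]:
    p = 1 - lam0 / r                             # so that r * (1 - p) = 5
    print(r,
          tv_distance(r, p, r * (1 - p)),        # first approximation
          tv_distance(r, p, r * (1 - p) / p))    # mean-matched approximation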

Second order accurate approximation

It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier correction to the Poisson approximation is given in this paper.1 The approximation is

\(\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}q\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},\)

where \(\lambda = \frac{r(1-p)}{p}\), \(q=1-p\), and \(P_\lambda(x)\) is the Poisson pmf evaluated at \(x\).

The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of \(1/r^2\). This higher accuracy means that the error curve for the Gram–Charlier expansion has a steeper slope on the log-log plot.
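A direct translation of the formula above into code might look like the sketch below (my own; the function name and test parameters are illustrative). Note that scipy's poisson.pmf returns zero at \(x-1 = -1\), which is the convention the formula needs.

import numpy as np
from scipy.stats import nbinom, poisson

def gram_charlier_pmf(x, r, p):
    # f_GC(x) = P_lam(x) - (q/2) * ((x - lam) P_lam(x) - (x - 1 - lam) P_lam(x - 1))
    lam = r * (1 - p) / p
    q = 1 - p
    P = poisson.pmf(x, lam)
    P_prev = poisson.pmf(x - 1, lam)      # equals 0 when x - 1 < 0
    return P - 0.5 * q * ((x - lam) * P - (x - 1 - lam) * P_prev)

# Total variation error of the Gram-Charlier approximation at r(1 - p) = 5.
r, p = 200, 1 - 5 / 200
x = np.arange(100)
print(0.5 * np.abs(nbinom.pmf(x, r, p) - gram_charlier_pmf(x, r, p)).sum())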

  1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo, it should say \(\frac{1}{2}q\) instead of \(\frac{1}{2}p\). ↩︎

The Multivariate Hypergeometric Distribution

The hypergeometric distribution describes the number of white balls in a sample of \(n\) balls drawn without replacement from an urn with \(m_W\) white balls and \(m_B\) black balls. The probability of having \(x\) white balls in the sample is:

\(\displaystyle{f(x; m_W, m_B, n) = \frac{\binom{m_W}{x} \binom{m_B}{n-x}}{\binom{m_W+m_B}{n}}}\)

This is because there are \(\binom{m_W+m_B}{n}\) possible ways to sample \(n\) balls from the urn, which has \(m_W+m_B\) balls in total. Of these possibilities, there are \(\binom{m_W}{x} \binom{m_B}{n-x}\) ways of selecting exactly \(x\) white balls.
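As a quick sanity check (my own sketch, not from the post), the formula can be compared against scipy.stats.hypergeom, whose parameters are the population size, the number of white balls and the sample size. The function name and the example counts are illustrative.

from math import comb
from scipy.stats import hypergeom

def hypergeom_pmf(x, m_w, m_b, n):
    # f(x; m_W, m_B, n) = C(m_W, x) * C(m_B, n - x) / C(m_W + m_B, n)
    return comb(m_w, x) * comb(m_b, n - x) / comb(m_w + m_b, n)

m_w, m_b, n = 7, 5, 4
for x in range(n + 1):
    print(x, hypergeom_pmf(x, m_w, m_b, n), hypergeom.pmf(x, m_w + m_b, m_w, n))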

The multivariate hypergeometric distribution is a generalization of the hypergeometric distribution. In the multivariate hypergeometric distribution, there are \(k\) colours instead of \(2\). Suppose that the urn contains \(m_1\) balls of the first colour, \(m_2\) balls of the second colour and so on. The total number of balls in the urn is \(M = m_1+m_2+\cdots + m_k\). Next we draw \(n\) balls from the urn. Let \(X_i\) be the number of balls of colour \(i\) in our sample. The vector \(X=(X_1,X_2,\ldots, X_k)\) follows the multivariate hypergeometric distribution. The vector \(X\) satisfies the constraints \(X_i \in \{0,1,\ldots,m_i\}\) and \(\sum_{i=1}^k X_i = n\). The probability of observing \(X=x\) is

\(\displaystyle{f(x; m, n) = \frac{\binom{m_1}{x_1} \binom{m_2}{x_2}\cdots \binom{m_k}{x_k}}{\binom{M}{n} }, \quad x_1+x_2+\cdots+x_k = n}\)

This is because there are \(\binom{M}{n}\) possible ways of sampling \(n\) balls from the urn, which has \(M\) balls in total. Likewise, the number of ways of choosing \(x_i\) balls of colour \(i\) is \(\binom{m_i}{x_i}\), and multiplying these counts over the colours gives the numerator.
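The multivariate pmf is just as direct to code up. Here is a minimal sketch (the function name and example counts are my own choices).

from math import comb, prod

def multivariate_hypergeom_pmf(x, m):
    # f(x; m, n) = [prod_i C(m_i, x_i)] / C(M, n) with M = sum(m) and n = sum(x)
    return prod(comb(mi, xi) for mi, xi in zip(m, x)) / comb(sum(m), sum(x))

# Example: an urn with 5 + 3 + 2 = 10 balls and a sample of size 4.
print(multivariate_hypergeom_pmf([2, 1, 1], [5, 3, 2]))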

Sampling

There are two common methods for sampling from the multivariate hypergeometric distribution. The first option is to sample \(n\) numbers from \(1,\ldots,M\) without replacement. The numbers \(1,\ldots,M\) represent the balls in the urn. We can let \(1,\ldots,m_1\) represent the balls of colour 1, let \(m_1+1,\ldots,m_1+m_2\) represent the balls of colour 2, and so on. Suppose that \(Y_1,\ldots,Y_n\) is the sample of size \(n\) without replacement from \(1,\ldots,M\). Next define:

\(\displaystyle{X_i = \#\left\{j :\sum_{k=1}^{i-1}m_k < Y_j \le \sum_{k=1}^i m_k\right\}}\)

Then the vector \(X=(X_1,\ldots,X_k)\) will be distributed according to the multivariate hypergeometric distribution. This method of sampling has complexity on the order of \(n\log(M)\). This is because sampling a random number from \(1,\ldots,M\) has complexity on the order of \(\log(M)\), and we need to sample \(n\) of these numbers.
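Here is a sketch of this first method in NumPy (my own implementation of the description above; the function name and the use of np.random.choice, np.searchsorted and np.bincount are choices I made, not part of the post).

import numpy as np

def multivariate_hypergeometric_sample_v1(m, n):
    # m: ball counts per colour, n: sample size.
    m = np.asarray(m)
    M = m.sum()
    # Sample n labels from 0, ..., M - 1 without replacement.
    y = np.random.choice(M, size=n, replace=False)
    # Labels cum[i-1], ..., cum[i] - 1 belong to colour i.
    cum = np.cumsum(m)
    colours = np.searchsorted(cum, y, side='right')
    # Count how many sampled labels landed in each colour.
    return np.bincount(colours, minlength=len(m))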

There is a second method of sampling that is more efficient when \(n\) is large and \(k\) is small. This is a sequential method that samples each \(X_i\) from a (univariate) hypergeometric distribution. The idea is that when sampling \(X_1\) we can treat the balls of colour \(1\) as “white” and all the other colours as “black”. This means that marginally, \(X_1\) is hypergeometric with parameters \(m_W = m_1\), \(m_B = m_2+\cdots + m_k\) and \(n = n\). Furthermore, conditional on \(X_1\), the second entry, \(X_2\), is also hypergeometric. The parameters for \(X_2\) are \(m_W = m_2\), \(m_B = m_3+\cdots+m_k\) and \(n=n-X_1\). In fact, all of the conditionals of \(X\) are hypergeometric random variables and so can be sampled sequentially. The Python code at the end of this post implements the sequential sampler.

The complexity of the sequential method is bounded above by \(k\) times the complexity of sampling from the hypergeometric distribution. By using the ratio of uniforms distribution, this can be done in complexity on the order of \(\log(m_W+m_B)\). The overall complexity of the sequential method is therefore on the order of \(k\log(M)\), which can be much smaller than \(n\log(M)\).

import numpy as np

def multivariate_hypergeometric_sample(m, n):
    # m: array of ball counts for each colour, n: total sample size
    x = np.zeros_like(m)
    k = len(m)
    for i in range(k-1):
        # Treat colour i as "white" and all remaining colours as "black",
        # then draw x[i] from the (univariate) hypergeometric distribution.
        x[i] = np.random.hypergeometric(m[i], np.sum(m[(i+1):]), n)
        n -= x[i]
        if n == 0:
            # Stop early once the sample is exhausted.
            break
    # The last colour receives whatever remains of the sample.
    x[k-1] = n
    return x

For more about the multivariate hypergeometric distribution see this post about contingency tables and double cosets.

The maximum of geometric random variables

I was working on a problem involving the maximum of a collection of geometric random variables. To state the problem, let \((X_i)_{i=1}^n\) be i.i.d. geometric random variables with success probability \(p=p(n)\). Next, define \(M_n = \max\{X_i : 1 \le i \le n\}\). I wanted to know the limiting distribution of \(M_n\) as \(p \to 0\) and \(n \to \infty\). I was particularly interested in the case when \(p\) was on the order of \(1/n\).

It turns out that this problem has a very nice solution: If \(p\to 0\) as \(n \to \infty\), then for all real numbers \(c\)

\(\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}\)

That is, \(M_n\) is on the order of \(\frac{\log n}{p}\) and has fluctuations of order \(1/p\). Furthermore, these fluctuations are distributed according to the Gumbel distribution. Importantly, this result places no restrictions on how quickly \(p\) goes to \(0\).
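Here is a quick Monte Carlo check of this statement (a sketch of my own; the values of \(n\), \(p\) and \(c\) are arbitrary choices, and numpy's geometric sampler uses the number-of-trials convention that matches the CDF used later in this post).

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p = 1 / n                                  # p on the order of 1/n
c = 0.5
threshold = (np.log(n) + c) / p

trials = 2_000
hits = sum(rng.geometric(p, size=n).max() <= threshold for _ in range(trials))
print(hits / trials, np.exp(-np.exp(-c)))  # empirical vs exp(-e^{-c})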

In the rest of this post, I’ll derive the above identity. The first step relates \(M_n\) to the maximum of a collection of exponential random variables. The second step uses a coupling to prove that the exponential and geometric random variables are close to each other.

An easier version

When \(p\) goes to \(0\), the scaled geometric random variables \(p X_i\) converge to exponential random variables. Exponential random variables are nicer to work with because they have a continuous distribution.

This suggests that we could try replacing all the geometric random variables with exponential random variables. More precisely, we will replace \(X_i\) with \(Y_i/p\) where \(Y_i\) are exponentially distributed. Next, we will derive the main result for the exponential random variables. Then, in the next section, we will prove that \(X_i\) is sufficiently close to \(Y_i/p\).

Let \(\widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}\) where \(Y_i\) are i.i.d and exponentially distributed. This means that for \(y \ge 0\), \(\mathbb{P}(Y_i \le y)=1-e^{-y}\). The condition \(\widetilde{M}_n \le y\) is equivalent to \(Y_i \le py\) for all \(i =1,\ldots,n\). Since the \(Y_i\)’s are independent, this implies that for \(y \ge 0\)

\(\displaystyle{\mathbb{P}(\widetilde{M}_n \le y) = \prod_{i=1}^n \mathbb{P}(Y_i \le py) = \left(1-e^{-py}\right)^n}\)

Now fix a real number \(c\) and let \(y = \frac{\log n}{p}+\frac{c}{p}\). Since \(n \to \infty\), we know that \(y \ge 0\) for large enough \(n\). Thus, for sufficiently large \(n\)

\(\displaystyle{\mathbb{P}\left(\widetilde{M}_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \left(1-e^{-\log n - c}\right)^n = \left(1-\frac{e^{-c}}{n}\right)^n}.\)

And \(\left(1-\frac{e^{-c}}{n}\right)^n \to \exp(-e^{-c})\) as \(n \to \infty\). So the main result holds for \(\widetilde{M}_n = \max\{Y_i/p : 1 \le i \le n\}\). We want to show that it holds for \(M_n = \max\{X_i : 1 \le i \le n\}\). The next section proves a close relationship between \(X_i\) and \(Y_i\).

Exponential and geometric coupling

It is true that as \(p \to 0\), \(pX_i\) converges in distribution to an exponential random variable. This is a result that holds for a single random variable. For an approximation like \(M_n \approx \widetilde{M}_n\), we need a result that works for every \(X_i\) at once. This seems tricky since it involves the convergence of an infinite collection of random variables. Fortunately, there is a simple approach based on coupling.

For each geometric random variable \(X_i\), it is possible to produce an exponential random variable \(Y_i\) with

\(\displaystyle{\frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},\)

where \(\lambda = -\log(1-p)\). This is because \(\mathbb{P}(X_i \le x) = 1-(1-p)^{\lfloor x\rfloor}\) and \(\mathbb{P}(Y_i/\lambda \le x) = 1-e^{-\lambda x} = 1-(1-p)^{x}\). This means that

\( \displaystyle{\mathbb{P}(Y_i/\lambda \le x-1) \le \mathbb{P}(X_i \le x)\le \mathbb{P}(Y_i/\lambda \le x) }.\)

If we sample both \(X_i\) and \(Y_i\) with inverse transform sampling, and use the same uniform random variables, then we will have

\(\displaystyle{ \frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},\)

as claimed. If we use this inequality for every single pair \((X_i,Y_i)\) we will get

\(\displaystyle{\max\{Y_i/ \lambda\} \le M_n \le \max\{Y_i/\lambda\} +1}\)

As \(p \to 0\), we have that \(\lambda \to 0\) and \(\frac{p}{\lambda}=\frac{p}{-\log(1-p)} \to 1\). This means that the random variable \(\max\{Y_i/\lambda\} \) has the same limit as \(\widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}\). Furthermore, \(\max\{Y_i/\lambda\} +1\) also has the same limit as \(\widetilde{M}_n\). Since \( M_n\) is “sandwiched” between \(\max\{Y_i/\lambda\}\) and \(\max\{Y_i/\lambda\} +1\), the result must also hold for \(M_n\). And so

\(\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}\)
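The coupling itself is easy to check numerically. The sketch below (my own) draws both variables from the same uniform random variables via inverse transform sampling and verifies the sandwich inequality; the value of \(p\) and the sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
p = 0.01
lam = -np.log(1 - p)

u = rng.random(100_000)
x = np.ceil(np.log(1 - u) / np.log(1 - p))   # geometric via inverse transform
y = -np.log(1 - u)                           # standard exponential from the same uniforms
print(np.all(y / lam <= x), np.all(x <= y / lam + 1))   # both should print True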

“Uniformly random”

The term “uniformly random” sounds like a contradiction. How can the word “uniform” be used to describe anything that’s random? Uniformly random actually has a precise meaning, and, in a sense, means “as random as possible.” I’ll explain this with an example about shuffling cards.

Shuffling cards

Suppose I have a deck of ten cards labeled 1 through 10. Initially, the cards are face down and in perfect order. The card labeled 10 is on top of the deck. The card labeled 9 is second from the top, and so on down to the card labeled 1. The cards are definitely not random.

Next, I generate a random number between 1 and 10. I then find the card with the corresponding label and put it face down on top of the deck. The cards are now somewhat random. The number on top could be anything, but the rest of the cards are still in order. The cards are random but they are not uniformly random.

Now suppose that I keep generating random numbers and moving cards to the top of the deck. Each time I do this, the cards get more random. Eventually (after about 30 moves1) the cards will be really jumbled up. Even if you knew the first few cards, it would be hard to predict the order of the remaining ones. Once the cards are really shuffled, they are uniformly random.
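If you want to play with this yourself, here is a small sketch (mine, not from the post) of the move-a-random-card-to-the-top procedure described above.

import random

deck = list(range(10, 0, -1))        # card 10 on top, ..., card 1 at the bottom
for _ in range(30):
    card = random.randint(1, 10)     # generate a random label
    deck.remove(card)                # find that card ...
    deck.insert(0, card)             # ... and put it on top of the deck
print(deck)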

Uniformly random

A deck of cards is uniformly random if each of the possible arrangements of the cards is equally likely. After moving only one card, the deck of cards is not uniformly random. This is because only 10 arrangements of the deck are possible. Once the deck is well-shuffled, all of the 3,628,800 possible arrangements are equally likely.

In general, something is uniformly random if each possibility is equally likely. So the outcome of rolling a fair 6-sided die is uniformly random, but rolling a loaded die is not. The word “uniform” refers to the chance of each possibility (1/6 for each side of the die). These chances are all the same and “uniform”.

This is why uniformly random can mean “as random as possible.” If you are using a fair die or a well-shuffled deck, there are no biases in the outcome. Mathematically, this means you can’t predict the outcome.

Communicating research

The inspiration for this post came from a conversation I had earlier in the week. I was telling someone about my research. As an example, I talked about how long it takes for a deck of cards to become uniformly random. They quickly stopped me and asked how the two words could ever go together. It was a good point! I use the words uniformly random all the time and had never realized this contradiction.2 It was a good reminder about the challenge of clear communication.

Footnotes

  1. The number of moves it takes for the deck to be well-shuffled is actually random. But on average it takes around 30 moves. For the mathematical details, see Example 1 in Shuffling Cards and Stopping Times by David Aldous and Persi Diaconis. ↩︎
  2. Of the six posts I published last year, five contain the word uniform! ↩︎

Understanding the Ratio of Uniforms Distribution

The ratio of uniforms distribution is a useful distribution for rejection sampling. It gives a simple and fast way to sample from discrete distributions like the hypergeometric distribution1. To use the ratio of uniforms distribution in rejection sampling, we need to know the distribution’s density. This post summarizes some properties of the ratio of uniforms distribution and computes its density.

The ratio of uniforms distribution is the distribution of the ratio of two independent uniform random variables. Specifically, suppose \(U \in [-1,1]\) and \(V \in [0,1]\) are independent and uniformly distributed. Then \(R = U/V\) has the ratio of uniforms distribution. The plot below shows a histogram based on 10,000 samples from the ratio of uniforms distribution2.
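The samples are easy to generate. Here is a sketch of my own (the seed, bin count and matplotlib calls are arbitrary choices) that mirrors the histogram described above.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, size=10_000)   # U uniform on [-1, 1]
v = rng.uniform(0, 1, size=10_000)    # V uniform on [0, 1]
r = u / v

# Restrict to [-8, 8] for display, since R has heavy tails.
plt.hist(r[np.abs(r) <= 8], bins=80, density=True)
plt.show()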

The histogram has a flat section in the middle and then curves down on either side. This distinctive shape is called a “table mountain”. The density of \(R\) also has a table mountain shape.

And here is the density plotted on top of the histogram.

A formula for the density of \(R\) is

\[h(R) = \begin{cases} \frac{1}{4} & \text{if } -1 \le R \le 1, \\\frac{1}{4R^2} & \text{if } R < -1 \text{ or } R > 1.\end{cases}\]

The first case in the definition of \(h\) corresponds to the flat part of the table mountain. The second case corresponds to the sloping curves. The rest of this post uses geometry to derive the above formula for \(h(R)\).

Calculating the density

The point \((U,V)\) is uniformly distributed in the box \(B=[-1,1] \times [0,1]\). The image below shows an example of a point \((U,V)\) inside the box \(B\).

We can compute the ratio \(R = U/V\) geometrically. First we draw a straight line that starts at \((0,0)\) and goes through \((U,V)\). This line will hit the horizontal line \(y=1\). The \(x\) coordinate at this point is exactly \(R=U/V\).

In the above picture, all of the points on the dashed line map to the same value of \(R\). We can compute the density of \(R\) by computing an area. The probability that \(R\) is in a small interval \([R,R+dR]\) is \(\frac{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\})}{\text{Area}(B)}\) which is equal to

\[\frac{1}{2}\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}).\]

If we can compute the above area, then we will know the density of \(R\) because by definition

\(\displaystyle{h(R) = \lim_{dR \to 0} \frac{1}{2dR}\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\})}.\)

We will first work on the case when \(R\) is between \(-1\) and \(1\). In this case, the set \(\{(u,v) \in B : u/v \in [R, R+dR]\}\) is a triangle. This triangle is drawn in blue below.

The horizontal edge of this triangle has length \(dR\). The perpendicular height of the triangle from the horizontal edge is \(1\). This means that

\[\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}) =\frac{dR}{2}.\]

And so, when \(R \in [-1,1]\) we have

\(\displaystyle{h(R) = \lim_{dR\to 0} \frac{1}{2dR}\times \frac{dR}{2}=\frac{1}{4}}.\)

Now let’s work on the case when \(R\) is bigger than \(1\) or less than \(-1\). In this case, the set \(\{(u,v) \in B : u/v \in [R, R+dR]\}\) is again a triangle. But now the triangle has a vertical edge and is much skinnier. Below the triangle is drawn in red. Note that only points inside the box \(B\) are coloured in.

The vertical edge of the triangle has length \(\frac{1}{R} - \frac{1}{R+dR}= \frac{dR}{R(R+dR)}\). The perpendicular height of the triangle from the vertical edge is \(1\). Putting this together

\(\displaystyle{\text{Area}(\{(u,v) \in B : u/v \in [R, R+dR]\}) =\frac{dR}{2R(R+dR)}}.\)

And so

\(\displaystyle{h(R) = \lim_{dR \to 0} \frac{1}{2dR} \times \frac{dR}{2 R(R+dR)} = \frac{1}{4R^2}}.\)

And so putting everything together

\(\displaystyle{h(R) = \begin{cases} \frac{1}{4} & \text{if } -1 \le R \le 1, \\\frac{1}{4R^2} & \text{if } R < -1 \text{ or } R > 1.\end{cases}}\)
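To close, here is a small numerical check of the formula (my own sketch; the interval \([1,3]\), the seed and the sample size are arbitrary). It compares a Monte Carlo estimate of \(\mathbb{P}(a \le R \le b)\) with the integral of \(h\) over the same interval.

import numpy as np

def ratio_of_uniforms_density(r):
    # h(R) = 1/4 on [-1, 1] and 1/(4 R^2) outside.
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= 1, 0.25, 0.25 / r**2)

rng = np.random.default_rng(3)
u = rng.uniform(-1, 1, size=1_000_000)
v = rng.uniform(0, 1, size=1_000_000)
r = u / v

a, b = 1.0, 3.0
empirical = np.mean((r >= a) & (r <= b))
grid = np.linspace(a, b, 10_001)
exact = np.trapz(ratio_of_uniforms_density(grid), grid)   # = (1/4)(1/a - 1/b)
print(empirical, exact)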

Footnotes and references

  1. https://ieeexplore.ieee.org/document/718718 ↩︎
  2. For visual purposes, I restricted the sample to values of \(R\) between \(-8\) and \(8\). This is because the ratio of uniform distribution has heavy tails. This meant that there were some very large values of \(R\) that made the plot hard to see. ↩︎