Category: Probability

  • Monte Carlo integration in high dimensions

    Monte Carlo integration

    John Cook recently wrote a cautionary blog post about using Monte Carlo to estimate the volume of a high-dimensional ball. He points out that if \(\mathbf{X}=(X_1,\ldots,X_n)\) are independent and uniformly distributed on the interval \([-1,1]\) then

    \(\displaystyle{\mathbb{P}(X_1^2 + X_2^2 + \cdots + X_n^2 \le 1) = \frac{v_n}{2^n}},\)

    where \(v_n\) is the volume of an \(n\)-dimensional ball with radius one. This observation means that we can use Monte Carlo to estimate \(v_n\).

    To do this we repeatedly sample vectors \(\mathbf{X}_m = (X_{m,1},\ldots,X_{m,n})\) with \(X_{m,i}\) uniform on \([-1,1]\) and \(m\) ranging from \(1\) to \(M\). Next, we count the proportion of vectors \(\mathbf{X}_m\) with \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). Specifically, if \(Z_m\) is equal to one when \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) and equal to zero otherwise, then by the law of large numbers

    \(\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \approx \frac{v_n}{2^n}},\)

    which implies

    \(\displaystyle{v_n \approx \frac{2^n}{M}\sum_{m=1}^M Z_m}.\)

    This method of approximating a volume or integral by sampling and counting is called Monte Carlo integration and is a powerful general tool in scientific computing.
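    The recipe above is easy to implement. Here is a minimal sketch in Python (NumPy assumed; the function name `mc_ball_volume` is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_ball_volume(n, M=1000):
    """Plain Monte Carlo estimate of the volume of the n-dimensional unit ball."""
    # Sample M points uniformly from the cube [-1, 1]^n.
    X = rng.uniform(-1.0, 1.0, size=(M, n))
    # Z_m = 1 when the m-th point lands inside the unit ball.
    Z = (X ** 2).sum(axis=1) <= 1.0
    # v_n is approximately 2^n times the proportion of points inside.
    return 2.0 ** n * Z.mean()
```

    For example, with \(n = 2\) and a large \(M\) the estimate should be close to \(v_2 = \pi\).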

    The problem with Monte Carlo integration

    As pointed out by John, Monte Carlo integration does not work very well in this example. The plot below shows a large difference between the true value of \(v_n\) with \(n\) ranging from \(1\) to \(20\) and the Monte Carlo approximation with \(M = 1,000\).

    The problem is that \(v_n\) is very small and the probability \(v_n/2^n\) is even smaller! For example when \(n = 10\), \(v_n/2^n \approx 0.0025\). This means that in our one thousand samples we only expect two or three occurrences of the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). As a result our estimate has a high variance.

    The results get even worse as \(n\) increases. The probability \(v_n/2^n\) goes to zero faster than exponentially. Even with a large value of \(M\), our estimate will typically be zero. Since \(v_n \neq 0\), the relative error in the approximation is then 100%.

    Importance sampling

    Monte Carlo can still be used to approximate \(v_n\). Instead of plain Monte Carlo, we can use a variance reduction technique called importance sampling (IS). Instead of sampling the points \(X_{m,1},\ldots, X_{m,n}\) from the uniform distribution on \([-1,1]\), we sample them from some other distribution called a proposal distribution. The proposal distribution should be chosen so that the event \(X_{m,1}^2 +\cdots +X_{m,n}^2 \le 1\) becomes more likely.

    In importance sampling, we need to correct for the fact that we are using a new distribution instead of the uniform distribution. Instead of counting the number of times \(X_{m,1}^2+\cdots+X_{m,n}^2 \le 1\), we give a weight to each sample and then add up the weights.

    If \(p\) is the density of the uniform distribution on \([-1,1]\) (the target distribution) and \(q\) is the density of the proposal distribution, then the IS Monte Carlo estimate of \(v_n\) is

    \(\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \prod_{i=1}^n \frac{p(X_{m,i})}{q(X_{m,i})}},\)

    where as before \(Z_m\) is one if \(X_{m,1}^2+\cdots +X_{m,n}^2 \le 1\) and \(Z_m\) is zero otherwise. As long as \(q(x)=0\) implies \(p(x)=0\), the IS Monte Carlo estimate will be an unbiased estimate of \(v_n\). More importantly, a good choice of the proposal distribution \(q\) can drastically reduce the variance of the IS estimate compared to the plain Monte Carlo estimate.

    In this example, a good choice of proposal distribution is the normal distribution with mean \(0\) and variance \(1/n\). Under this proposal distribution, the expected value of \(X_{m,1}^2 +\cdots+ X_{m,n}^2\) is one and so the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) is much more likely.
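    Here is a sketch of the IS estimator with the \(N(0, 1/n)\) proposal described above (the weights are computed in log space for numerical stability; the function name is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_ball_volume(n, M=1000):
    """Importance sampling estimate of v_n with a N(0, 1/n) proposal."""
    sigma = 1.0 / np.sqrt(n)  # proposal std, so E[X_1^2 + ... + X_n^2] = 1
    X = rng.normal(0.0, sigma, size=(M, n))
    inside = (X ** 2).sum(axis=1) <= 1.0
    # Log densities: target p is uniform on [-1, 1] per coordinate,
    # proposal q is the N(0, sigma^2) density per coordinate.
    log_p = n * np.log(0.5)
    log_q = (-0.5 * (X / sigma) ** 2
             - np.log(sigma * np.sqrt(2.0 * np.pi))).sum(axis=1)
    # Weighted indicator estimates v_n / 2^n; rescale by the cube volume 2^n.
    weights = np.where(inside, np.exp(log_p - log_q), 0.0)
    return 2.0 ** n * weights.mean()
```

    Note that whenever a sample lands inside the ball it also lies inside the cube, so the target density at that point is always \((1/2)^n\) and the weight is well defined.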

    Here are the IS Monte Carlo estimates with again \(M = 1,000\) and \(n\) ranging from \(1\) to \(20\). The results speak for themselves.

    The relative error is typically less than 10%, a big improvement over the 100% relative error of plain Monte Carlo.

    The next plot shows a close agreement between \(v_n\) and the IS Monte Carlo approximation on the log scale with \(n\) going all the way up to \(100\).

    Notes

    1. There are exact formulas for \(v_n\) (available on Wikipedia). I used these to compare the approximations and compute relative errors. There are related problems where no formulas exist and Monte Carlo integration is one of the only ways to get an approximate answer.
    2. The post by John Cook also talks about why the central limit theorem can’t be used to approximate \(v_n\). I initially thought a technique called large deviations could be used to approximate \(v_n\) but again this does not directly apply. I was happy to discover that importance sampling worked so well!

    More sampling posts

  • Auxiliary variables sampler

    The auxiliary variables sampler is a general Markov chain Monte Carlo (MCMC) technique for sampling from probability distributions with unknown normalizing constants [1, Section 3.1]. Specifically, suppose we have \(n\) functions \(f_i : \mathcal{X} \to (0,\infty)\) and we want to sample from the probability distribution \(P(x) \propto \prod_{i=1}^n f_i(x).\) That is

    \(\displaystyle{ P(x) = \frac{1}{Z}\prod_{i=1}^n f_i(x)},\)

    where \(Z = \sum_{x \in \mathcal{X}} \prod_{i=1}^n f_i(x)\) is a normalizing constant. If the set \(\mathcal{X}\) is very large, then it may be difficult to compute \(Z\) or sample from \(P(x)\). To approximately sample from \(P(x)\) we can run an ergodic Markov chain with \(P(x)\) as a stationary distribution. Adding auxiliary variables is one way to create such a Markov chain. For each \(i \in \{1,\ldots,n\}\), we add an auxiliary variable \(U_i \ge 0\) such that

    \(\displaystyle{P(U \mid X) = \prod_{i=1}^n \mathrm{Unif}[0,f_i(X)]}.\)

    That is, conditional on \(X\), the auxiliary variables \(U_1,\ldots,U_n\) are independent and \(U_i\) is uniformly distributed on the interval \([0,f_i(X)]\). If \(X\) is distributed according to \(P\) and \(U_1,\ldots,U_n\) have the above auxiliary variable distribution, then

    \(\displaystyle{P(X,U) =P(X)P(U\mid X)\propto \prod_{i=1}^n f_i(X) \frac{1}{f_i(X)} I[U_i \le f_i(X)] = \prod_{i=1}^n I[U_i \le f_i(X)]}.\)

    This means that the joint distribution of \((X,U)\) is uniform on the set

    \(\widetilde{\mathcal{X}} = \{(x,u) \in \mathcal{X} \times (0,\infty)^n : u_i \le f_i(x) \text{ for all } i\}.\)

    Put another way, suppose we could jointly sample \((X,U)\) from the uniform distribution on \(\widetilde{\mathcal{X}}\). Then, the above calculation shows that if we discard \(U\) and only keep \(X\), then \(X\) will be sampled from our target distribution \(P\).

    The auxiliary variables sampler approximately samples from the uniform distribution on \(\widetilde{\mathcal{X}}\) by using a Gibbs sampler. The Gibbs sampler alternates between sampling \(U’\) from \(P(U \mid X)\) and then sampling \(X’\) from \(P(X \mid U’)\). Since the joint distribution of \((X,U)\) is uniform on \(\widetilde{\mathcal{X}}\), the conditional distributions \(P(X \mid U)\) and \(P(U \mid X)\) are also uniform. The auxiliary variables sampler thus transitions from \(X\) to \(X’\) according to the two step process

    1. Independently sample \(U_i\) uniformly from \([0,f_i(X)]\).
    2. Sample \(X’\) uniformly from the set \(\{x \in \mathcal{X} : f_i(x) \ge U_i \text{ for all } i\}\).
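    The two step process can be demonstrated on a toy example. Here is a sketch for a small finite state space with two factors, where step 2 can be carried out by brute-force enumeration (the specific factors \(f_1, f_2\) are my own illustrative choices; enumeration is only feasible because \(\mathcal{X}\) is tiny):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite state space with two positive factors, so P(x) ∝ f1(x) * f2(x).
xs = np.arange(50)
f1 = np.exp(-(xs - 10.0) ** 2 / 50.0)
f2 = np.exp(-(xs - 20.0) ** 2 / 50.0)

def aux_step(x):
    """One transition of the auxiliary variables sampler."""
    # Step 1: independently sample U_i uniformly from [0, f_i(x)].
    u1 = rng.uniform(0.0, f1[x])
    u2 = rng.uniform(0.0, f2[x])
    # Step 2: sample x' uniformly from {x : f_i(x) >= U_i for all i}.
    # The current state always belongs to this set, so it is never empty.
    candidates = xs[(f1 >= u1) & (f2 >= u2)]
    return rng.choice(candidates)

x, samples = 0, []
for _ in range(20000):
    x = aux_step(x)
    samples.append(x)
```

    The empirical distribution of the samples converges to \(P(x) \propto f_1(x)f_2(x)\) without the normalizing constant \(Z\) ever being computed.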

    Since the auxiliary variables sampler is based on a Gibbs sampler, we know that the joint distribution of \((X,U)\) will converge to the uniform distribution on \(\widetilde{\mathcal{X}}\). So when we discard \(U\), the distribution of \(X\) will converge to the target distribution \(P\)!

    Auxiliary variables in practice

    To perform step 2 of the auxiliary variables sampler we have to be able to sample uniformly from the sets

    \(\mathcal{X}_u = \{x \in \mathcal{X}:f_i(x) \ge u_i \text{ for all } i\}. \)

    Depending on the nature of the set \(\mathcal{X}\) and the functions \(f_i\), it might be difficult to do this. Fortunately, there are some notable examples where this step has been worked out. The very first example of auxiliary variables is the Swendsen-Wang algorithm for sampling from the Ising model [2]. In this model it is possible to sample uniformly from \(\mathcal{X}_u\). Another setting where we can sample exactly is when \(\mathcal{X}\) is the real numbers \(\mathbb{R}\) and each \(f_i\) is an increasing function of \(x\). This is explored in [3], where the authors apply auxiliary variables to sampling from Bayesian posteriors.

    There is an alternative to sampling exactly from the uniform distribution on \(\mathcal{X}_u\). Instead of sampling \(X’\) uniformly from \(\mathcal{X}_u\), we can run a Markov chain from the old \(X\) that has the uniform distribution as a stationary distribution. This approach leads to another special case of auxiliary variables which is called “slice sampling” [4].
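    In the simplest case with a single factor and a unimodal density, the slice \(\{x : f(x) \ge u\}\) is an explicit interval and can be sampled exactly. Here is a sketch for the unnormalized density \(f(x) = e^{-x^2/2}\) on \(\mathbb{R}\) (my own illustrative example, using exact slice sampling rather than an inner Markov chain):

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_sampler_normal(num_samples, x0=0.0):
    """Auxiliary variable (slice) sampler for f(x) = exp(-x^2 / 2)."""
    x, out = x0, []
    for _ in range(num_samples):
        # Step 1: U | X ~ Unif[0, f(X)].
        u = rng.uniform(0.0, np.exp(-0.5 * x * x))
        # Step 2: the slice {x : f(x) >= u} is the interval [-r, r]
        # with r = sqrt(-2 * log(u)); sample X' uniformly from it.
        r = np.sqrt(-2.0 * np.log(u))
        x = rng.uniform(-r, r)
        out.append(x)
    return np.array(out)
```

    The stationary distribution of this chain is the standard normal, so long runs should have sample mean near \(0\) and sample variance near \(1\).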

    References

    [1] Andersen HC, Diaconis P. Hit and run as a unifying device. Journal de la société française de statistique & revue de statistique appliquée. 2007;148(4):5-28. http://www.numdam.org/item/JSFS_2007__148_4_5_0/
    [2] Swendsen RH, Wang JS. Nonuniversal critical dynamics in Monte Carlo simulations. Physical review letters. 1987 Jan 12;58(2):86. https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.58.86
    [3] Damien P, Wakefield J, Walker S. Gibbs sampling for Bayesian non‐conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999 Apr;61(2):331-44. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00179
    [4] Neal RM. Slice sampling. The annals of statistics. 2003 Jun;31(3):705-67. https://projecteuclid.org/journals/annals-of-statistics/volume-31/issue-3/Slice-sampling/10.1214/aos/1056562461.full

  • Bernoulli’s inequality and probability

    Suppose we have independent events \(E_1,E_2,\ldots,E_m\), each of which occur with probability \(1-\varepsilon\). The event that all of the \(E_i\) occur is \(E = \bigcap_{i=1}^m E_i\). By using independence we can calculate the probability of \(E\),

    \(P(E) = P\left(\bigcap_{i=1}^m E_i\right) = \prod_{i=1}^m P(E_i) = (1-\varepsilon)^m.\)

    We could also get a lower bound on \(P(E)\) by using the union bound. This gives,

    \(P(E) = 1-P(E^c) = 1-P\left(\bigcup_{i=1}^m E_i^c\right) \ge 1-\sum_{i=1}^m P(E_i^c) = 1-m\varepsilon.\)

    We can therefore conclude that \((1-\varepsilon)^m \ge 1-m\varepsilon\). This is an instance of Bernoulli’s inequality. More generally, Bernoulli’s inequality states that

    \((1+x)^m \ge 1+mx,\)

    for all \(x \ge -1\) and \(m \ge 1\). This inequality states that the red line is always underneath the black curve in the picture below. For an interactive version of this graph where you can change the value of \(m\), click here.

    Our probabilistic proof only applies to the case when \(x = -\varepsilon\) is between \(-1\) and \(0\) and \(m\) is an integer. The more general result can be proved by using convexity. For \(m \ge 1\), the function \(f(x) = (1+x)^m\) is convex on the set \(x > -1\). The function \(g(x) = 1+mx\) is the tangent line to this function at the point \(x=0\). Convexity of \(f(x)\) means that the graph of \(f(x)\) is always above the tangent line \(g(x)\). This tells us that \((1+x)^m \ge 1+mx\).

    For \(m\) between \(0\) and \(1\), the function \((1+x)^m\) is no longer convex but actually concave and the inequality reverses. For \(m \le 0\), \((1+x)^m\) becomes convex again. These two cases are visualized below. In the first picture \(m = 0.5\) and the red line is above the black one. In the second picture \(m = -1\) and the black line is back on top.
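    These three regimes are easy to check numerically. A quick sketch (the grid of test points is my own choice):

```python
import numpy as np

# Compare (1+x)^m with its tangent line 1 + m*x on a grid of x > -1.
x = np.linspace(-0.99, 3.0, 1001)

def gap(m):
    """(1+x)^m minus (1 + m*x); the sign tells us which curve is on top."""
    return (1.0 + x) ** m - (1.0 + m * x)

tol = 1e-12
assert np.all(gap(3.0) >= -tol)   # m >= 1: Bernoulli's inequality holds
assert np.all(gap(-1.0) >= -tol)  # m <= 0: it holds again
assert np.all(gap(0.5) <= tol)    # 0 <= m <= 1: the inequality reverses
```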

  • Looking back on the blog

    Next week I’ll be starting the second year of my statistics PhD. I’ve learnt a lot from taking the first year classes and studying for the qualifying exams. Some of what I’ve learnt has given me a new perspective on some of my old blog posts. Here are three things that I’ve written about before and I now understand better.

    1. The Pi-Lambda Theorem

    An early post on this blog was titled “A minimal counterexample in probability theory“. The post was about a theorem from the probability course offered at the Australian National University. The theorem states that if two probability measures agree on a collection of subsets and the collection is closed under finite intersections, then the two probability measures agree on the \(\sigma\)-algebra generated by the collection. In my post I give an example which shows that you need the collection to be closed under finite intersections. I also show that you need to have at least four points in the space to find such an example.

    What I didn’t know then is that the above theorem is really a corollary of Dynkin’s \(\pi – \lambda\) theorem. This theorem was proved in my first graduate probability course, which was taught by Persi Diaconis. Professor Diaconis kept a running tally of how many times we used the \(\pi – \lambda\) theorem in his course and we got up to at least 10. (For more appearances by Professor Diaconis on this blog see here, here and here).

    If I were to write the above post again, I would talk about the \(\pi – \lambda\) theorem and rename the post “The smallest \(\lambda\)-system”. The example given in my post is really about needing at least four points to find a \(\lambda\)-system that is not a \(\sigma\)-algebra.

    2. Mallows’ Cp statistic

    The very first blog post I wrote was called “Complexity Penalties in Statistical Learning“. I wasn’t sure if I would write a second and so I didn’t set up a WordPress account. I instead put the post on the website LessWrong. I no longer associate myself with the rationality community but posting to LessWrong was straightforward and helped me reach more people.

    The post was inspired in two ways by the 2019 AMSI summer school. First, the content is from the statistical learning course I took at the summer school. Second, at the career fair many employers advised us to work on our writing skills. I don’t know if I would have started blogging if not for the AMSI Summer School.

    I didn’t know it at the time but the blog post is really about Mallows’ Cp statistic. Mallows’ Cp is an estimate of the test error of a regression model fit using ordinary least squares. It is equal to the training error plus a “complexity penalty” which takes into account the number of parameters. In the blog post I talk about model complexity and over-fitting. I also write down and explain Mallows’ Cp in the special case of polynomial regression.
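    In symbols, for a model with \(p\) parameters fit by least squares on \(n\) observations, one common form of the statistic is the training error plus \(2p\hat{\sigma}^2/n\). Here is a sketch for polynomial regression (assuming the noise variance is known; the simulated data and function names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def mallows_cp(X, y, sigma2):
    """Training error plus the complexity penalty 2 * p * sigma^2 / n."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
    training_error = np.mean((y - X @ beta) ** 2)
    return training_error + 2.0 * p * sigma2 / n

# Polynomial regression example: the true curve is cubic, noise variance is 1.
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 4.0 * x ** 3 + rng.normal(0.0, 1.0, 200)

def cp_for_degree(d):
    # Design matrix with columns 1, x, ..., x^d.
    return mallows_cp(np.vander(x, d + 1, increasing=True), y, sigma2=1.0)
```

    Comparing cp_for_degree across a range of degrees, an underfit degree-1 model is penalized through its large training error, while high-degree models pay through the complexity penalty.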

    In the summer school course I took, I don’t remember the name Mallows’ Cp being used but I thought it was a great idea and enjoyed writing about it. The next time I encountered Mallows’ Cp was in the linear models course I took last fall. I was delighted to see it again and learn how it fit into a bigger picture. More recently, I read Brad Efron’s paper “How Biased is the Apparent Error Rate of a Prediction Rule?“. The paper introduces the idea of “effective degrees of freedom” and expands on the ideas behind the Cp statistic.

    Incidentally, enrolment is now open for the 2023 AMSI Summer School! This summer it will be hosted at the University of Melbourne. I encourage any Australian mathematics or statistics students reading this to take a look and apply. I really enjoyed going in both 2019 and 2020. (Also if you click on the above link you can try to spot me in the group photo of everyone wearing red shirts!)

    3. Finitely additive probability measures

    In “Finitely additive measures” I talk about how hard it is to define a finitely additive measure on the set of natural numbers that is not countably additive. In particular, I talked about needing to use the Hahn-Banach extension theorem to extend the natural density from the collection of sets with density to the collection of all subsets of the natural numbers.

    There were a number of homework problems in my first graduate probability course that relate to this post. We proved that the sets with density are not closed under finite unions and we showed that the square free numbers have density \(6/\pi^2\).

    We also proved that any finite measure defined on an algebra of subsets can be extended to the collection of all subsets. This proof used Zorn’s lemma and the resulting measure is far from unique. The use of Zorn’s lemma relates to the main idea in my post, that defining a finitely additive probability measure is in some sense non-constructive.

    Other posts

    Going forward, I hope to continue publishing at least one new post every month. I look forward to one day writing another post like this when I can look back and reflect on how much I have learnt.

  • Non-measurable sets, cardinality and the axiom of choice

    The following post is based on a talk I gave at the 2022 Stanford statistics retreat. The talk was titled “Another non-measurable monster”.

    The opening slide from my presentation. The text reads “another non-measurable monster”. There are spooky pumpkins in the background. Image from https://www.shutterstock.com/image-photo/pumpkins-graveyard-spooky-night-halloween-backdrop-1803950377

    The material was based on the discussion and references given in this stackexchange post. The title is a reference to a Halloween lecture on measurability given by Professor Persi Diaconis.

    What’s scarier than a non-measurable set?

    Making every set measurable. Or rather one particular consequence of making every set measurable.

    In my talk, I argued that if you make every set measurable, then there exists a set \(\Omega\) and an equivalence relation \(\sim\) on \(\Omega\) such that \(|\Omega| < |\Omega / \sim|\). That is, the set \(\Omega\) has strictly smaller cardinality than the set of equivalence classes \(\Omega/\sim\). The contradictory nature of this statement is illustrated in the picture below

    We can think of the set \(\Omega\) as the collection of crosses drawn above. The equivalence relation \(\sim\) divides \(\Omega\) into the regions drawn above. The statement \(|\Omega|<|\Omega /\sim|\) means that in some sense there are more regions than crosses.

    To make sense of this we’ll first have to be a bit more precise about what we mean by cardinality.

    What do we mean by bigger and smaller?

    Let \(A\) and \(B\) be two sets. We say that \(A\) and \(B\) have the same cardinality and write \(|A| = |B|\) if there exists a bijection \(f:A \to B\). We can think of the function \(f\) as a way of pairing each element of \(A\) with a unique element of \(B\) such that every element of \(B\) is paired with an element of \(A\).

    We next want to define \(|A|\le |B|\) which means \(A\) has cardinality at most the cardinality of \(B\). There are two reasonable ways in which we could try to define this relationship

    1. We could say \(|A|\le |B|\) means that there exists an injective function \(f : A \to B\).
    2. Alternatively, we could say \(|A|\le |B|\) means that there exists a surjective function \(g:B \to A\).

    Definitions 1 and 2 say similar things and, in the presence of the axiom of choice, they are equivalent. Since we are going to be making every set measurable in this talk, we won’t be assuming the axiom of choice. Definitions 1 and 2 are thus no longer equivalent and we have a decision to make. We will use definition 1 in this talk. For justification, note that definition 1 implies that there exists a subset \(B’ \subseteq B\) such that \(|A|=|B’|\). We simply take \(B’\) to be the range of \(f\). This is a desirable property of the relation \(|A|\le |B|\) and it’s not clear how this could be achieved using definition 2.

    Infinite binary sequences

    It’s time to introduce the set \(\Omega\) and the equivalence relation we will be working with. The set \(\Omega\) is \(\{0,1\}^\mathbb{Z}\), the set of all functions \(\omega : \mathbb{Z} \to \{0,1\}\). We can think of each element \(\omega \in \Omega\) as an infinite sequence of zeros and ones stretching off in both directions. For example

    \(\omega = \ldots 1110110100111\ldots\).

    But this analogy hides something important. Each \(\omega \in \Omega\) has a “middle” which is the point \(\omega_0\). For instance, the two sequences below look the same but when we make \(\omega_0\) bold we see that they are different.

    \(\omega = \ldots 111011\mathbf{0}100111\ldots\),

    \(\omega’ = \ldots 1110110\mathbf{1}00111\ldots\).

    The equivalence relation \(\sim\) on \(\Omega\) can be thought of as forgetting the location of \(\omega_0\). More formally we have \(\omega \sim \omega’\) if and only if there exists \(n \in \mathbb{Z}\) such that \(\omega_{n+k} = \omega_{k}’\) for all \(k \in \mathbb{Z}\). That is, if we shift the sequence \(\omega\) by \(n\) we get the sequence \(\omega’\). We will use \([\omega]\) to denote the equivalence class of \(\omega\) and \(\Omega/\sim\) for the set of all equivalence classes.

    Some probability

    Associated with the space \(\Omega\) are functions \(X_k : \Omega \to \{0,1\}\), one for each integer \(k \in \mathbb{Z}\). These functions simply evaluate \(\omega\) at \(k\). That is, \(X_k(\omega)=\omega_k\). A probabilist or statistician would think of \(X_k\) as reporting the result of one of infinitely many independent coin tosses. Normally to make this formal we would have to first define a \(\sigma\)-algebra on \(\Omega\) and then define a probability on this \(\sigma\)-algebra. Today we’re working in a world where every set is measurable and so we don’t have to worry about \(\sigma\)-algebras. Indeed we have the following result:

    (Solovay, 1970)1 There exists a model of the Zermelo-Fraenkel axioms of set theory such that there exists a probability \(\mathbb{P}\) defined on all subsets of \(\Omega\) such that the \(X_k\) are i.i.d. \(\mathrm{Bernoulli}(0.5)\).

    This result is saying that there is a world in which, other than the axiom of choice, all the regular axioms of set theory hold. And in this world, we can assign a probability to every subset \(A \subseteq \Omega\) in a way so that the events \( \{X_k=1\}\) are all independent and have probability \(0.5\). It’s important to note that this is a true countably additive probability and we can apply all our familiar probability results to \(\mathbb{P}\). We are now ready to state and prove the spooky result claimed at the start of this talk.

    Proposition: Given the existence of such a probability \(\mathbb{P}\), \(|\Omega | < |\Omega /\sim|\).

    Proof: Let \(f:\Omega/\sim \to \Omega\) be any function. To show that \(|\Omega|<|\Omega /\sim|\) we need to show that \(f\) is not injective. To do this, we’ll first define another function \(g:\Omega \to \Omega\) given by \(g(\omega)=f([\omega])\). That is, \(g\) first maps \(\omega\) to \(\omega\)’s equivalence class and then applies \(f\) to this equivalence class. This is illustrated below.

    A commutative diagram showing the definition of \(g\) as \(g(\omega)=f([\omega])\).

    We will show that \(g : \Omega \to \Omega\) is almost surely constant with respect to \(\mathbb{P}\). That is, there exists \(\omega^\star \in \Omega\) such that \(\mathbb{P}(g(\omega)=\omega^\star)=1\). Each equivalence class \([\omega]\) is finite or countable and thus has probability zero under \(\mathbb{P}\). This means that if \(g\) is almost surely constant, then \(f\) cannot be injective and must map multiple (in fact infinitely many) equivalence classes to \(\omega^\star\).

    It thus remains to show that \(g:\Omega \to \Omega\) is almost surely constant. To do this we will introduce a third function \(\varphi : \Omega \to \Omega\). The map \(\varphi\) is simply the shift map and is given by \(\varphi(\omega)_k = \omega_{k+1}\). Note that \(\omega\) and \(\varphi(\omega)\) are in the same equivalence class for every \(\omega\in \Omega\). Thus, the map \(g\) satisfies \(g\circ \varphi = g\). That is \(g\) is \(\varphi\)-invariant.

    The map \(\varphi\) is ergodic. This means that if \(A \subseteq \Omega\) satisfies \(\varphi(A)=A\), then \(\mathbb{P}(A)\) equals \(0\) or \(1\). For example if \(A\) is the event that \(10110\) appears at some point in \(\omega\), then \(\varphi(A)=A\) and \(\mathbb{P}(A)=1\). Likewise if \(A\) is the event that the relative frequency of heads converges to a number strictly greater than \(0.5\), then \(\varphi(A)=A\) and \(\mathbb{P}(A)=0\). The general claim that all \(\varphi\)-invariant events have probability \(0\) or \(1\) can be proved using the independence of the \(X_k\).

    For each \(k\), define an event \(A_k\) by \(A_k = \{\omega : g(\omega)_k = 1\}\). Since \(g\) is \(\varphi\)-invariant we have that \(\varphi(A_k)=A_k\). Thus, \(\mathbb{P}(A_k)=0\) or \(1\). This gives us a function \(\omega^\star :\mathbb{Z} \to \{0,1\}\) given by \(\omega^\star_k = \mathbb{P}(A_k)\). Note that for every \(k\), \(\mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\}) = 1\). This is because if \(\omega_{k}^\star=1\), then \(\mathbb{P}(\{\omega: g(\omega)_k = 1\})=1\), by definition of \(\omega_k^\star\). Likewise if \(\omega_k^\star =0\), then \(\mathbb{P}(\{\omega:g(\omega)_k=1\})=0\) and hence \(\mathbb{P}(\{\omega:g(\omega)_k=0\})=1\). Thus, in both cases, \(\mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\})= 1\).

    Since \(\mathbb{P}\) is a probability measure, we can conclude that

    \(\mathbb{P}(\{\omega : g(\omega)=\omega^\star\}) = \mathbb{P}\left(\bigcap_{k \in \mathbb{Z}} \{\omega : g(\omega)_k = \omega_k^\star\}\right)=1\).

    Thus, \(g\) maps \(\Omega\) to \(\omega^\star\) with probability one, showing that \(g\) is almost surely constant and hence that \(f\) is not injective. \(\square\)

    There’s a catch!

    So we have proved that there cannot be an injective map \(f : \Omega/\sim \to \Omega\). Does this mean we have proved \(|\Omega| < |\Omega/\sim|\)? Technically no. We have proved the negation of \(|\Omega/\sim|\le |\Omega|\), which does not imply \(|\Omega| \le |\Omega/\sim|\). To argue that \(|\Omega| < |\Omega/\sim|\) we need to produce a map \(h: \Omega \to \Omega/\sim\) that is injective. Surprisingly this is possible and not too difficult. The idea is to find a map \(h_0 : \Omega \to \Omega\) such that \(h_0(\omega)\sim h_0(\omega’)\) implies \(\omega = \omega’\), and then set \(h(\omega) = [h_0(\omega)]\). This can be done by somehow encoding in \(h_0(\omega)\) where the centre of \(\omega\) is.

    A simpler proof and other examples

    Our proof was nice because we explicitly calculated the value \(\omega^\star\) where \(g\) sent almost all of \(\Omega\). We could have been less explicit and simply noted that the function \(g:\Omega \to \Omega\) was measurable with respect to the invariant \(\sigma\)-algebra of \(\varphi\) and hence almost surely constant by the ergodicity of \(\varphi\).

    This quicker proof allows us to generalise our “spooky result” to other sets. Below are two examples where \(\Omega = [0,1)\):

    • Fix \(\theta \in [0,1)\setminus \mathbb{Q}\) and define \(\omega \sim \omega’\) if and only if \(\omega + n \theta= \omega’\) for some \(n \in \mathbb{Z}\).
    • \(\omega \sim \omega’\) if and only if \(\omega – \omega’ \in \mathbb{Q}\).

    A similar argument shows that in Solovay’s world \(|\Omega| < |\Omega/\sim|\) in both cases. The result again follows from the ergodicity of the corresponding actions on \(\Omega\) under the uniform measure.

    Three takeaways

    I hope you agree that this example is good fun and surprising. I’d like to end with some remarks.

    • The first remark is some mathematical context. The argument given today is linked to some interesting mathematics called descriptive set theory. This field studies the properties of well behaved subsets (such as Borel subsets) of topological spaces. Descriptive set theory incorporates logic, topology and ergodic theory. I don’t know much about the field but in Persi’s Halloween talk he said that one “monster” was that few people are interested in the subject.
    • The next remark is a better way to think about our “spooky result”. The result is really saying something about cardinality. When we no longer use the axiom of choice, cardinality becomes a subtle concept. The statement \(|A|\le |B|\) no longer corresponds to \(A\) being “smaller” than \(B\) but rather that \(A\) is “less complex” than \(B\). This is perhaps analogous to some statistical models which may be “large” but do not overfit due to subtle constraints on the model complexity.
    • In light of the previous remark, I would invite you to think about whether the example I gave is truly spookier than non-measurable sets. It might seem to you that it is simply a reasonable consequence of removing the axiom of choice and restricting ourselves to functions we could actually write down or understand. I’ll let you decide.

    Footnotes

    1. Technically Solovay proved that there exists a model of set theory such that every subset of \(\mathbb{R}\) is Lebesgue measurable. To get the result for binary sequences we have to restrict to \([0,1)\) and use the binary expansion of \(x \in [0,1)\) to define a function \([0,1) \to \Omega\). Solovay’s paper is available here https://www.jstor.org/stable/1970696?seq=1