Tag: math

  • Monte Carlo integration in high dimensions

    Monte Carlo integration

    John Cook recently wrote a cautionary blog post about using Monte Carlo to estimate the volume of a high-dimensional ball. He points out that if \(\mathbf{X}=(X_1,\ldots,X_n)\) are independent and uniformly distributed on the interval \([-1,1]\) then

    \(\displaystyle{\mathbb{P}(X_1^2 + X_2^2 + \cdots + X_n^2 \le 1) = \frac{v_n}{2^n}},\)

    where \(v_n\) is the volume of an \(n\)-dimensional ball with radius one. This observation means that we can use Monte Carlo to estimate \(v_n\).

    To do this we repeatedly sample vectors \(\mathbf{X}_m = (X_{m,1},\ldots,X_{m,n})\) with \(X_{m,i}\) uniform on \([-1,1]\) and \(m\) ranging from \(1\) to \(M\). Next, we count the proportion of vectors \(\mathbf{X}_m\) with \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). Specifically, if \(Z_m\) is equal to one when \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) and equal to zero otherwise, then by the law of large numbers

    \(\displaystyle{\frac{1}{M}\sum_{m=1}^M Z_m \approx \frac{v_n}{2^n}},\)

    which implies

    \(\displaystyle{v_n \approx \frac{2^n}{M}\sum_{m=1}^M Z_m}.\)

    This method of approximating a volume or integral by sampling and counting is called Monte Carlo integration and is a powerful general tool in scientific computing.
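
    For concreteness, here is a minimal Python sketch of the plain Monte Carlo estimator just described (the function name and defaults are my own illustration; the original post does not include code):

    import numpy as np

    def mc_ball_volume(n, M=1000, rng=None):
        # Plain Monte Carlo estimate of the volume of the unit n-ball.
        rng = np.random.default_rng() if rng is None else rng
        # Sample M points uniformly from the cube [-1, 1]^n.
        X = rng.uniform(-1.0, 1.0, size=(M, n))
        # Z_m = 1 when the m-th point lands inside the unit ball.
        Z = np.sum(X**2, axis=1) <= 1.0
        # Rescale the proportion of hits by 2^n, the volume of the cube.
        return 2.0**n * Z.mean()

    print(mc_ball_volume(3))  # the true value v_3 = 4*pi/3 is roughly 4.19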

    The problem with Monte Carlo integration

    As pointed out by John, Monte Carlo integration does not work very well in this example. The plot below shows a large difference between the true value of \(v_n\) and the Monte Carlo approximation with \(M = 1,000\), for \(n\) ranging from \(1\) to \(20\).

    The problem is that \(v_n\) is very small and the probability \(v_n/2^n\) is even smaller! For example when \(n = 10\), \(v_n/2^n \approx 0.0025\). This means that in our one thousand samples we only expect two or three occurrences of the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\). As a result our estimate has a high variance.

    The results get even worse as \(n\) increases. The probability \(v_n/2^n\) goes to zero faster than exponentially. Even with a large value of \(M\), our estimate will most likely be exactly zero. Since \(v_n \neq 0\), the relative error of the approximation is then 100%.

    Importance sampling

    Monte Carlo can still be used to approximate \(v_n\). Instead of using plain Monte Carlo, we can use a variance reduction technique called importance sampling (IS). Instead of sampling the points \(X_{m,1},\ldots, X_{m,n}\) from the uniform distribution on \([-1,1]\), we can sample them from some other distribution, called a proposal distribution. The proposal distribution should be chosen so that the event \(X_{m,1}^2 +\cdots +X_{m,n}^2 \le 1\) becomes more likely.

    In importance sampling, we need to correct for the fact that we are using a new distribution instead of the uniform distribution. Instead of counting the number of times \(X_{m,1}^2+\cdots+X_{m,n}^2 \le 1\), we give a weight to each sample and then add up the weights.

    If \(p\) is the density of the uniform distribution on \([-1,1]\) (the target distribution) and \(q\) is the density of the proposal distribution, then the IS Monte Carlo estimate of \(v_n\) is

    \(\displaystyle{\frac{2^n}{M}\sum_{m=1}^M Z_m \prod_{i=1}^n \frac{p(X_{m,i})}{q(X_{m,i})}},\)

    where, as before, \(Z_m\) is one if \(X_{m,1}^2+\cdots +X_{m,n}^2 \le 1\) and \(Z_m\) is zero otherwise, and the factor \(2^n\) converts the estimated probability into a volume, just as in the plain Monte Carlo estimate. As long as \(q(x)=0\) implies \(p(x)=0\), the IS Monte Carlo estimate will be an unbiased estimate of \(v_n\). More importantly, a good choice of the proposal distribution \(q\) can drastically reduce the variance of the IS estimate compared to the plain Monte Carlo estimate.

    In this example, a good choice of proposal distribution is the normal distribution with mean \(0\) and variance \(1/n\). Under this proposal distribution, the expected value of \(X_{m,1}^2 +\cdots+ X_{m,n}^2\) is one and so the event \(X_{m,1}^2 + \cdots + X_{m,n}^2 \le 1\) is much more likely.
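
    Here is a minimal sketch of the importance sampling estimator with this \(N(0, 1/n)\) proposal (again my own illustration, not code from the post):

    import numpy as np

    def is_ball_volume(n, M=1000, rng=None):
        # Importance sampling estimate of the volume of the unit n-ball,
        # using a N(0, 1/n) proposal for each coordinate.
        rng = np.random.default_rng() if rng is None else rng
        var = 1.0 / n
        X = rng.normal(0.0, np.sqrt(var), size=(M, n))
        # Z_m = 1 when the m-th sample lands inside the unit ball.
        Z = np.sum(X**2, axis=1) <= 1.0
        # Log importance weights sum_i [log p(X_i) - log q(X_i)], with p the
        # Uniform[-1, 1] density; Z already forces |X_i| <= 1 whenever the
        # weight is used.
        log_p = -n * np.log(2.0)
        log_q = np.sum(-0.5 * np.log(2 * np.pi * var) - X**2 / (2 * var), axis=1)
        w = np.exp(log_p - log_q)
        # Rescale by 2^n, just as in the plain Monte Carlo estimate.
        return 2.0**n * np.mean(Z * w)

    Under this proposal a substantial fraction of the samples lands inside the ball, so the estimate no longer collapses to zero.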

    Here are the IS Monte Carlo estimates, again with \(M = 1,000\) and \(n\) ranging from \(1\) to \(20\). The results speak for themselves.

    The relative error is typically less than 10%, a big improvement over the 100% relative error of plain Monte Carlo.

    The next plot shows a close agreement between \(v_n\) and the IS Monte Carlo approximation on the log scale with \(n\) going all the way up to \(100\).

    Notes

    1. There are exact formulas for \(v_n\) (available on Wikipedia). I used these to compare the approximations and compute relative errors. There are related problems where no formulas exist and Monte Carlo integration is one of the only ways to get an approximate answer.
    2. The post by John Cook also talks about why the central limit theorem can’t be used to approximate \(v_n\). I initially thought a technique called large deviations could be used to approximate \(v_n\) but again this does not directly apply. I was happy to discover that importance sampling worked so well!


  • Uniformly sampling orthogonal matrices

    An \(n \times n\) matrix \(M \in \mathbb{R}^{n \times n}\) is orthogonal if \(M^\top M = I\). The set of all \(n \times n\) orthogonal matrices is a compact group written as \(O_n\). The uniform distribution on \(O_n\) is called Haar measure. There are many ways to generate random matrices distributed according to Haar measure, one of which is based on the Gram-Schmidt algorithm.

    Proposition. Let \(X\) be an \(n \times n\) matrix such that each entry \(X_{ij}\) is an independent standard normal random variable. Let \(x_1,x_2,\ldots,x_n \in \mathbb{R}^n\) be the columns of \(X\) and let \(q_1,q_2,\ldots,q_n \in \mathbb{R}^n\) be the result of applying Gram-Schmidt to \(x_1,x_2,\ldots,x_n\). Then the matrix \(M=[q_1,q_2,\ldots,q_n] \in O_n\) is distributed according to Haar measure.

    Another way of stating the above proposition is that if \(X\) has i.i.d. standard normal entries, and \(Q,R\) is a QR-factorization of \(X\) calculated using Gram-Schmidt, then \(Q \in O_n\) is distributed according to Haar measure. However, most numerical libraries do not use Gram-Schmidt to calculate the QR-factorization of a matrix. This means that if you generate a random \(X\) and then use your computer’s QR-factorization function, then the result might not be Haar distributed as the plot below shows.

    The plot shows an estimate of the distribution of \(\sqrt{n}M_{11}\), the top-left entry of a matrix \(M\). When \(M\) is correctly sampled, the distribution is symmetric about 0. When \(M\) is incorrectly sampled the distribution is clearly biased towards negative numbers.

    Fortunately there is a fix described in [1]! We can first use any QR-factorization algorithm to produce \(Q,R\) with \(X=QR\). We then compute a diagonal matrix \(L\) with \(L_{ii} = \mathrm{Sign}(R_{ii})\). Then the matrix \(M=QL\) is Haar distributed. The following Python code thus samples a Haar distributed matrix in \(O_n\).

    import numpy as np

    def sample_M(n):
        # Fill X with i.i.d. standard normal entries.
        X = np.random.randn(n, n)
        # Any QR factorization will do; numpy uses Householder reflections.
        Q, R = np.linalg.qr(X)
        # Diagonal sign correction L with L_ii = sign(R_ii).
        L = np.sign(np.diag(R))
        # M = QL is Haar distributed on O_n.
        return Q * L[None, :]
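
    As a quick sanity check (a sketch of my own, not from [1] or the original post), we can compare the top-left entry of the raw \(Q\) returned by np.linalg.qr with the corrected matrix; on typical LAPACK-backed builds of NumPy the raw entry is systematically non-positive:

    raw, fixed = [], []
    for _ in range(2000):
        X = np.random.randn(10, 10)
        Q, R = np.linalg.qr(X)
        raw.append(Q[0, 0])
        fixed.append((Q * np.sign(np.diag(R))[None, :])[0, 0])

    print(np.mean(raw))    # noticeably negative
    print(np.mean(fixed))  # close to zero, as expected under Haar measure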

    References

    [1] Mezzadri, Francesco. “How to generate random matrices from the classical compact groups.” arXiv preprint math-ph/0609050 (2006). https://arxiv.org/abs/math-ph/0609050

    I recently gave a talk at the Stanford student probability seminar on Haar distributed random matrices. My notes and many further references are available here.

  • Why is the fundamental theorem of arithmetic a theorem?

    The fundamental theorem of arithmetic states that every natural number can be factorized uniquely as a product of prime numbers. The word “uniquely” here means unique up to rearranging. The theorem means that if you and I take the same number \(n\) and I write \(n = p_1p_2\ldots p_k\) and you write \(n = q_1q_2\ldots q_l\) where each \(p_i\) and \(q_i\) is a prime number, then in fact \(k=l\) and we wrote the same prime numbers (but maybe in a different order).

    Most people happily accept this theorem as self-evident and believe it without proof. Indeed, some people take it to be so self-evident that they feel it doesn’t really deserve the name “theorem” – hence the title of this blog post. In this post I want to highlight two situations where an analogous theorem fails.

    Situation One: The Even Numbers

    Imagine a world where everything comes in twos. In this world nobody knows of the number one or indeed any odd number. Their counting numbers are the even numbers \(\mathbb{E} = \{2,4,6,8,\ldots\}\). People in this world can add numbers and multiply numbers just like we can. They can even talk about divisibility, for example \(2\) divides \(8\) since \(8 = 4\cdot 2\). Note that things are already getting a bit strange in this world. Since there is no number one, numbers in this world do not divide themselves.

    Once people can talk about divisibility, they can talk about prime numbers. A number is prime in this world if it is not divisible by any other number. For example \(2\) is prime but as we saw \(8\) is not prime. Surprisingly the number \(6\) is also prime in this world. This is because there are no two even numbers that multiply together to make \(6\).

    If a number is not prime in this world, we can reduce it to a product of primes. This is because if \(n\) is not prime, then there are two numbers \(a\) and \(b\) such that \(n = ab\). Since \(a\) and \(b\) are both smaller than \(n\), we can apply the same argument and recursively write \(n\) as a product of primes.

    Now we can ask whether or not the fundamental theorem of arithmetic holds in this world. Namely, we want to know if there is a unique way to factorize each number in this world. To get an idea we can start with some small even numbers.

    • \(2\) is prime.
    • \(4 = 2 \cdot 2\) can be factorized uniquely.
    • \(6\) is prime.
    • \(8 = 2\cdot 2 \cdot 2\) can be factorized uniquely.
    • \(10\) is prime.
    • \(12 = 2 \cdot 6\) can be factorized uniquely.
    • \(14\) is prime.
    • \(16 = 2\cdot 2 \cdot 2 \cdot 2\) can be factorized uniquely.
    • \(18\) is prime.
    • \(20 = 2 \cdot 10\) can be factorized uniquely.

    Thus it seems as though there might be some hope for this theorem. It at least holds for the first handful of numbers. Unfortunately we eventually get to \(36\) and we have:

    \(36 = 2 \cdot 18\) and \(36 = 6 \cdot 6\).

    Thus there are two distinct ways of writing \(36\) as a product of primes in this world and thus the fundamental theorem of arithmetic does not hold.
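
    These claims are easy to check by brute force; here is an illustrative Python sketch (mine, not from the original post):

    def is_even_world_prime(n):
        # n is prime in the even world if it is not a product of two even numbers.
        return not any(n % d == 0 and (n // d) % 2 == 0 for d in range(2, n, 2))

    def even_world_factorizations(n):
        # All ways (up to reordering) to write n as a product of even-world primes.
        if is_even_world_prime(n):
            return {(n,)}
        facts = set()
        for d in range(2, n, 2):
            if n % d == 0 and (n // d) % 2 == 0 and is_even_world_prime(d):
                for rest in even_world_factorizations(n // d):
                    facts.add(tuple(sorted((d,) + rest)))
        return facts

    print([n for n in range(2, 21, 2) if is_even_world_prime(n)])  # [2, 6, 10, 14, 18]
    print(even_world_factorizations(36))  # two factorizations: (2, 18) and (6, 6)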

    Situation Two: A Number Ring

    While the first example is fun and interesting, it is somewhat artificial. We are unlikely to encounter a situation where we only have the even numbers. It is however common and natural for mathematicians to be led into certain worlds called number rings. We will see one example here and see what effect assuming the fundamental theorem of arithmetic can have.

    Consider wanting to solve the equation \(x^2+19=y^3\) where \(x\) and \(y\) are both integers. One way to try to solve this is by rewriting the equation as \((x+\sqrt{-19})(x-\sqrt{-19}) = y^3\). With this rewriting we have left the familiar world of the whole numbers and entered the number ring \(\mathbb{Z}[\sqrt{-19}]\).

    In \(\mathbb{Z}[\sqrt{-19}]\) all numbers have the form \(a + b \sqrt{-19}\), where \(a\) and \(b\) are integers. Addition of two such numbers is defined like so

    \((a+b\sqrt{-19}) + (c + d \sqrt{-19}) = (a+c) + (b+d)\sqrt{-19}\).

    Multiplication is defined by using the distributive law and the fact that \(\sqrt{-19}^2 = -19\). Thus

    \((a+b\sqrt{-19})(c+d\sqrt{-19}) = (ac-19bd) + (ad+bc)\sqrt{-19}\).
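
    As a small illustration (my own sketch, not from the original post), this arithmetic can be implemented directly by representing \(a+b\sqrt{-19}\) as the pair \((a, b)\):

    def add(x, y):
        (a, b), (c, d) = x, y
        return (a + c, b + d)

    def mul(x, y):
        # (a + b*sqrt(-19))(c + d*sqrt(-19)) = (ac - 19bd) + (ad + bc)*sqrt(-19)
        (a, b), (c, d) = x, y
        return (a * c - 19 * b * d, a * d + b * c)

    print(mul((1, 1), (1, -1)))  # (20, 0), i.e. (1 + sqrt(-19))(1 - sqrt(-19)) = 20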

    Since we have multiplication we can talk about when a number in \(\mathbb{Z}[\sqrt{-19}]\) divides another and hence define primes in \(\mathbb{Z}[\sqrt{-19}]\). One can show that if \(x^2 + 19 = y^3\), then \(x+\sqrt{-19}\) and \(x-\sqrt{-19}\) are coprime in \(\mathbb{Z}[\sqrt{-19}]\) (see the references at the end of this post).

    This means that there are no primes in \(\mathbb{Z}[\sqrt{-19}]\) that divide both \(x+\sqrt{-19}\) and \(x-\sqrt{-19}\). If we assume that the fundamental theorem of arithmetic holds in \(\mathbb{Z}[\sqrt{-19}]\), then this implies that \(x+\sqrt{-19}\) must itself be a cube. This is because \((x+\sqrt{-19})(x-\sqrt{-19})=y^3\) is a cube and if two coprime numbers multiply to be a cube, then both of those coprime numbers must be cubes.

    Thus we can conclude that there are integers \(a\) and \(b\) such that \(x+\sqrt{-19} = (a+b\sqrt{-19})^3 \). If we expand out this cube we can conclude that

    \(x+\sqrt{-19} = (a^3-57ab^2)+(3a^2b-19b^3)\sqrt{-19}\).

    Thus in particular we have \(1=3a^2b-19b^3=(3a^2-19b^2)b\). This implies that \(b = \pm 1\) and \(3a^2-19b^2=\pm 1\). Hence \(b^2=1\) and \(3a^2-19 = \pm 1\). Now if \(3a^2 -19 =-1\), then \(a^2=6\) – a contradiction. Similarly if \(3a^2-19=1\), then \(3a^2=20\) – another contradiction. Thus we can conclude there are no integer solutions to the equation \(x^2+19=y^3\)!

    Unfortunately however, a bit of searching reveals that \(18^2+19=343=7^3\). Thus simply assuming that the ring \(\mathbb{Z}[\sqrt{-19}]\) has unique factorization led us to incorrectly conclude that an equation had no solutions. The question of unique factorization in number rings such as \(\mathbb{Z}[\sqrt{-19}]\) is a subtle and important one. Some of the flawed proofs of Fermat’s Last Theorem incorrectly assume that certain number rings have unique factorization – like we did above.
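
    For completeness, the counterexample and the cube expansion above can be checked numerically with the mul sketch from earlier (again just an illustration):

    print(18**2 + 19, 7**3)  # 343 343

    a, b = 2, 1
    print(mul(mul((a, b), (a, b)), (a, b)))                  # (-106, -7)
    print((a**3 - 57 * a * b**2, 3 * a**2 * b - 19 * b**3))  # (-106, -7), matching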

    References

    The lecturer David Smyth showed us that the even integers do not have unique factorization during a lecture of the great course MATH2222.

    The example of \(\mathbb{Z}[\sqrt{-19}]\) failing to have unique factorization and the consequences of this was shown in a lecture for a course on algebraic number theory by James Borger. In this class we followed the (freely available) textbook “Number Rings” by P. Stevenhagen. Problem 1.4 on page 8 is the example I used in this post. By viewing the textbook you can see a complete solution to the problem.