Tag: maths

  • Poisson approximations to the negative binomial distribution

    This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating it.

    The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters \(p\) and \(r\). The parameter \(p\) is the probability that the coin lands on heads and \(r\) is the number of heads. If \(X\) has the negative binomial distribution, then \(X = x\) means that in the first \(x+r-1\) tosses of the coin there were \(r-1\) heads and that toss number \(x+r\) was a head. This means that the probability that \(X=x\) is given by

    \(\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}\)
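
    As a quick sanity check, here is a minimal Python sketch (my own, not from the original post) that evaluates \(f(x)\) directly and compares it with scipy's nbinom, which uses the same failures-before-the-\(r\)-th-success convention. The parameter values are arbitrary.

    ```python
    from math import comb

    from scipy.stats import nbinom

    def neg_binom_pmf(x, r, p):
        """Probability of x tails before the r-th head, as in the formula above."""
        return comb(x + r - 1, r - 1) * p**r * (1 - p)**x

    # scipy's nbinom uses the same convention: number of failures before
    # the r-th success, with success probability p.
    r, p = 5, 0.7
    assert all(abs(neg_binom_pmf(x, r, p) - nbinom.pmf(x, r, p)) < 1e-12 for x in range(20))
    ```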

    Here is a plot of the function \(f(x)\) for different values of \(r\) and \(p\).

    Poisson approximations

    When the parameter \(r\) is large and \(p\) is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, fix a positive real number \(\lambda\) and suppose that \(r(1-p)=\lambda\). As \(r\) grows (and \(p\) correspondingly approaches one), the negative binomial distribution with parameters \(p\) and \(r\) converges to the Poisson distribution with parameter \(\lambda\). This is illustrated in the picture below where three negative binomial distributions with \(r(1-p)=5\) approach the Poisson distribution with \(\lambda =5\).

    Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of \(1/r\) and that the error is larger when the fixed value of \(r(1-p)\) is larger.

    It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \(\lambda = r(1-p)\). However, the mean of the negative binomial distribution is \(r(1-p)/p\). This suggests that we can get a better approximation by setting \(\lambda = r(1-p)/p\).

    The change from \(\lambda = r(1-p)\) to \(\lambda = r(1-p)/p\) is small because \(p \approx 1\). However, this small change gives a much better approximation, especially for larger values of \(r(1-p)\). The plot below shows that both approximations have errors on the order of \(1/r\), but the constant for the second approximation is much better.
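
    To illustrate both approximations, here is a small Python sketch (my own code; the choice \(r(1-p)=5\) and the truncation point are arbitrary) that computes the total variation distance to each Poisson approximation for increasing \(r\).

    ```python
    import numpy as np
    from scipy.stats import nbinom, poisson

    def tv_to_poisson(r, lam):
        """Total variation distance between NegBin(r, p), with p chosen so that
        r(1 - p) = 5, and the Poisson distribution with mean lam."""
        p = 1 - 5 / r
        x = np.arange(2000)            # the probability beyond this range is negligible
        return 0.5 * np.abs(nbinom.pmf(x, r, p) - poisson.pmf(x, lam)).sum()

    for r in [10, 100, 1000]:
        p = 1 - 5 / r
        # first approximation: lambda = r(1-p) = 5; second: lambda = r(1-p)/p = 5/p
        print(r, tv_to_poisson(r, 5), tv_to_poisson(r, 5 / p))
        # both errors shrink roughly like 1/r, but the second is much smaller
    ```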

    Second order accurate approximation

    It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation based on the Poisson distribution is given in this paper.1 The approximation is

    \(\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}(1-p)\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},\)

    where \(\lambda = \frac{r(1-p)}{p}\) as in the second Poisson approximation and \(P_\lambda(x)\) is the Poisson pmf evaluated at \(x\).
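
    Here is a short Python sketch of this approximation (my own implementation of the formula above, with arbitrary parameter values), again measured in total variation distance.

    ```python
    import numpy as np
    from scipy.stats import nbinom, poisson

    def gram_charlier_pmf(x, r, p):
        """Poisson-based Gram-Charlier approximation with lambda = r(1 - p)/p."""
        lam = r * (1 - p) / p
        P = poisson.pmf(x, lam)
        P1 = poisson.pmf(x - 1, lam)   # equals zero when x = 0
        return P - 0.5 * (1 - p) * ((x - lam) * P - (x - 1 - lam) * P1)

    r = 100
    p = 1 - 5 / r                      # so that r(1 - p) = 5
    x = np.arange(2000)
    print(0.5 * np.abs(nbinom.pmf(x, r, p) - gram_charlier_pmf(x, r, p)).sum())
    # the error is much smaller than for either pure Poisson approximation
    ```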

    The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of \(1/r^2\). This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.

    1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo: it should say \(\frac{1}{2}q\) instead of \(\frac{1}{2}p\). ↩︎
  • “Uniformly random”

    The term “uniformly random” sounds like a contradiction. How can the word “uniform” be used to describe anything that’s random? Uniformly random actually has a precise meaning, and, in a sense, means “as random as possible.” I’ll explain this with an example about shuffling cards.

    Shuffling cards

    Suppose I have a deck of ten cards labeled 1 through 10. Initially, the cards are face down and in perfect order. The card labeled 10 is on top of the deck. The card labeled 9 is second from the top, and so on down to the card labeled 1. The cards are definitely not random.

    Next, I generate a random number between 1 and 10. I then find the card with the corresponding label and put it face down on top of the deck. The cards are now somewhat random. The number on top could be anything, but the rest of the cards are still in order. The cards are random but they are not uniformly random.

    Now suppose that I keep generating random numbers and moving cards to the top of the deck. Each time I do this, the cards get more random. Eventually (after about 30 moves1) the cards will be really jumbled up. Even if you knew the first few cards, it would be hard to predict the order of the remaining ones. Once the cards are really shuffled, they are uniformly random.
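
    The “about 30 moves” claim can be checked with a small simulation. For this move-a-random-card-to-the-top shuffle, a standard fact is that the deck is uniformly random once every card has been picked at least once, so the number of moves needed behaves like the coupon collector’s problem, whose mean for ten cards is \(10(1+\tfrac{1}{2}+\cdots+\tfrac{1}{10}) \approx 29.3\). Here is a minimal Python sketch (my own, not from the post):

    ```python
    import random

    def moves_until_all_cards_touched(n=10):
        """Count moves of the 'pick a random card and put it on top' shuffle
        until every card has been moved at least once."""
        touched, moves = set(), 0
        while len(touched) < n:
            touched.add(random.randint(1, n))
            moves += 1
        return moves

    random.seed(0)
    samples = [moves_until_all_cards_touched(10) for _ in range(10_000)]
    print(sum(samples) / len(samples))   # roughly 29-30 on average
    ```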

    Uniformly random

    A deck of cards is uniformly random if each of the possible arrangements of the cards is equally likely. After moving only one card, the deck of cards is not uniformly random. This is because there are only 10 possible arrangements of the deck. Once the deck is well-shuffled, all of the \(10! = 3{,}628{,}800\) possible arrangements are equally likely.

    In general, something is uniformly random if each possibility is equally likely. So the outcome of rolling a fair 6-sided die is uniformly random, but rolling a loaded die is not. The word “uniform” refers to the chance of each possibility (1/6 for each side of the die). These chances are all the same and “uniform”.

    This is why uniformly random can mean “as random as possible.” If you are using a fair die or a well-shuffled deck, there are no biases in the outcome. Mathematically, this means you can’t predict the outcome.

    Communicating research

    The inspiration for this post came from a conversation I had earlier in the week. I was telling someone about my research. As an example, I talked about how long it takes for a deck of cards to become uniformly random. They quickly stopped me and asked how the two words could ever go together. It was a good point! I use the words uniformly random all the time and had never realized this contradiction.2 It was a good reminder about the challenge of clear communication.

    Footnotes

    1. The number of moves it takes for the deck to be well-shuffled is actually random. But on average it takes around 30 moves. For the mathematical details, see Example 1 in Shuffling Cards and Stopping Times by David Aldous and Persi Diaconis. ↩︎
    2. Of the six posts I published last year, five contain the word uniform! ↩︎
  • The discrete arcsine distribution

    The discrete arcsine distribution is a probability distribution on \(\{0,1,\ldots,n\}\). It is a U-shaped distribution: there are peaks at \(0\) and \(n\) and a dip in the middle. The figure below shows the probability distribution function for \(n=10, 15, 20\).

    The probability distribution function of the arcsine distribution is given by

    \(\displaystyle{p_n(k) = \frac{1}{2^{2n}}\binom{2k}{k}\binom{2n-2k}{n-k}\text{ for } k \in \{0,1,\ldots,n\}}\)
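
    A minimal Python sketch of this pmf (my own, with \(n=10\) chosen arbitrarily) confirms that it sums to one and is U-shaped:

    ```python
    from math import comb

    def discrete_arcsine_pmf(k, n):
        """p_n(k) on {0, 1, ..., n}."""
        return comb(2 * k, k) * comb(2 * (n - k), n - k) / 4**n

    n = 10
    probs = [discrete_arcsine_pmf(k, n) for k in range(n + 1)]
    print(abs(sum(probs) - 1) < 1e-12)        # True: the pmf sums to one
    print(probs[0], probs[n // 2], probs[n])  # peaks at 0 and n, dip in the middle
    ```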

    The discrete arcsine distribution is related to simple random walks and to an interesting Markov chain called the Burnside process. The connection with simple random walks is explained in Chapter 3, Volume 1 of An Introduction to Probability Theory and Its Applications by William Feller. The connection to the Burnside process was discovered by Persi Diaconis in Analysis of a Bose-Einstein Markov Chain.

    The discrete arcsine distribution gets its name from the continuous arcsine distribution. Suppose \(X_n\) is distributed according to the discrete arcsine distribution with parameter \(n\). Then the normalized random variable \(X_n/n\) converges in distribution to the continuous arcsine distribution on \([0,1]\). The continuous arcsine distribution has the probability density function

    \(\displaystyle{f(x) = \frac{1}{\pi\sqrt{x(1-x)}} \text{ for } 0 \le x \le 1}\)

    This means that the continuous arcsine distribution is a beta distribution with \(\alpha=\beta=1/2\). It is called the arcsine distribution because the cumulative distribution function involves the arcsine function

    \(\displaystyle{F(x) = \int_0^x f(y)dy = \frac{2}{\pi}\arcsin(\sqrt{x}) \text{ for } 0 \le x \le 1}\)
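
    As a quick numerical check (a sketch of my own using scipy's built-in arcsine distribution on \([0,1]\)), the closed-form CDF and the \(\mathrm{Beta}(1/2,1/2)\) identification agree:

    ```python
    import numpy as np
    from scipy.stats import arcsine, beta

    x = np.linspace(0.01, 0.99, 50)
    # scipy's arcsine distribution is supported on [0, 1]; its cdf matches
    # the closed form above and the Beta(1/2, 1/2) cdf.
    print(np.allclose(arcsine.cdf(x), 2 / np.pi * np.arcsin(np.sqrt(x))))
    print(np.allclose(arcsine.cdf(x), beta.cdf(x, 0.5, 0.5)))
    ```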

    There is another connection between the discrete and continuous arcsine distributions. The continuous arcsine distribution can be used to sample the discrete arcsine distribution. The two-step procedure below produces a sample from the discrete arcsine distribution with parameter \(n\) (a small simulation sketch follows the list):

    1. Sample \(p\) from the continuous arcsine distribution.
    2. Sample \(X\) from the binomial distribution with parameters \(n\) and \(p\).
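
    Here is a minimal simulation sketch of the two-step procedure (my own code; the value of \(n\) and the sample size are arbitrary), comparing the empirical frequencies with \(p_n(k)\):

    ```python
    from math import comb

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_samples = 10, 200_000

    p = rng.beta(0.5, 0.5, size=n_samples)   # step 1: continuous arcsine = Beta(1/2, 1/2)
    X = rng.binomial(n, p)                   # step 2: Binomial(n, p) given p

    empirical = np.bincount(X, minlength=n + 1) / n_samples
    exact = np.array([comb(2 * k, k) * comb(2 * (n - k), n - k) / 4**n for k in range(n + 1)])
    print(np.max(np.abs(empirical - exact)))   # small Monte Carlo error
    ```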

    This means that the discrete arcsine distribution is actually the beta-binomial distribution with parameters \(\alpha = \beta =1/2\). I was surprised when I was told this, and couldn’t find a reference. The rest of this blog post proves that the discrete arcsine distribution is an instance of the beta-binomial distribution.

    As I showed in this post, the beta-binomial distribution has probability distribution function:

    \(\displaystyle{q_{\alpha,\beta,n}(k) = \binom{n}{k}\frac{B(k+\alpha, n-k+\beta)}{B(\alpha,\beta)}},\)

    where \(B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}\) is the beta function. To show that the discrete arcsine distribution is an instance of the beta-binomial distribution we need to show that \(p_n(k)=q_{1/2,1/2,n}(k)\). That is,

    \(\displaystyle{ \binom{n}{k}\frac{B(k+1/2, n-k+1/2)}{B(1/2,1/2)} = \frac{1}{2^{2n}}\binom{2k}{k}\binom{2n-2k}{n-k}}\),
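
    Before going through the algebra, the identity is easy to check numerically. Here is a small sketch (my own, using scipy's log-beta function):

    ```python
    from math import comb

    import numpy as np
    from scipy.special import betaln

    def discrete_arcsine_pmf(k, n):
        return comb(2 * k, k) * comb(2 * (n - k), n - k) / 4**n

    def beta_binomial_pmf(k, n, a=0.5, b=0.5):
        return comb(n, k) * np.exp(betaln(k + a, n - k + b) - betaln(a, b))

    n = 12
    print(all(abs(discrete_arcsine_pmf(k, n) - beta_binomial_pmf(k, n)) < 1e-12
              for k in range(n + 1)))   # True
    ```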

    for all \(k = 0,1,\ldots,n\). To prove the above equation, we first simplify \(q_{1/2,1/2,n}(k)\). By definition,

    \(\displaystyle{\frac{B(k+1/2, n-k+1/2)}{B(1/2,1/2)} = \frac{\frac{\Gamma(k+1/2)\Gamma(n-k+1/2)}{\Gamma(n+1)}}{\frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1)}} = \frac{1}{n!}\frac{\Gamma(k+1/2)}{\Gamma(1/2)}\frac{\Gamma(n-k+1/2)}{\Gamma(1/2)}}\),

    where I have used that \(\Gamma(m)=(m-1)!\) if \(m\) is a natural number. The Gamma function \(\Gamma(x)\) also satisfies the property \(\Gamma(x+1)=x\Gamma(x)\). Using this repeatedly gives

    \(\displaystyle{\Gamma(k+1/2) = (k-1/2) \times (k-3/2) \times \cdots \times \frac{3}{2}\times\frac{1}{2}\times\Gamma(1/2) }. \)

    This means that

    \(\displaystyle{\frac{\Gamma(k+1/2)}{\Gamma(1/2)} = (k-1/2) \times (k-3/2) \times \cdots \times \frac{3}{2}\times\frac{1}{2} = \frac{(2k-1)\times(2k-3)\times \cdots \times 3 \times 1}{2^k}=\frac{(2k-1)!!}{2^k}},\)

    where \((2k-1)!!=(2k-1)\times (2k-3)\times\cdots \times 3 \times 1\) is the double factorial. The same reasoning gives

    \(\displaystyle{\frac{\Gamma(n-k+1/2)}{\Gamma(1/2)} =\frac{(2n-2k-1)!!}{2^{n-k}}}.\)

    And so

    \(\displaystyle{q_{1/2,1/2,n}(k) =\frac{1}{2^nk!(n-k)!}(2k-1)!!(2n-2k-1)!!}.\)

    We’ll now show that \(p_n(k)\) is also equal to the above final expression. Recall

    \(\displaystyle{p_n(k) = \frac{1}{2^{2n}} \binom{2k}{k}\binom{2(n-k)}{n-k} = \frac{1}{2^{2n}}\frac{(2k)!(2(n-k))!}{k!k!(n-k)!(n-k)!} = \frac{1}{2^nk!(n-k)!}\frac{(2k)!}{k!2^k}\frac{(2n-2k)!}{(n-k)!2^{n-k}}}.\)

    And so it suffices to show \(\frac{(2k)!}{k!2^k} = (2k-1)!!\) (and, by the same argument with \(k\) replaced by \(n-k\), \(\frac{(2n-2k)!}{(n-k)!2^{n-k}}=(2n-2k-1)!!\)). To see why this last claim holds, note that

    \(\displaystyle{\frac{(2k)!}{k!2^k} = \frac{(2k)\times (2k-1)\times(2k-2)\times\cdots\times 3 \times 2 \times 1}{(2k)\times (2k-2)\times \cdots \times 2} = (2k-1)!!}\)

    This shows that \(p_{n}(k)=q_{1/2,1/2,n}(k)\), as claimed.

  • Two sample tests as correlation tests

    Suppose we have two samples \(Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}\) and \(Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}\) and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

    The equal variance t-test

    In the equal variance t-test, we assume that \(Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)\) and \(Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)\), where \(\sigma^2\) is unknown. Our hypothesis that \(Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}\) and \(Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}\) are from the same distribution becomes the hypothesis \(\mu_0 = \mu_1\). The test statistic is

    \(t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}\),

    where \(\overline{Y}^{(0)}\) and \(\overline{Y}^{(1)}\) are the two sample means. The variable \(\hat{\sigma}\) is the pooled estimate of the standard deviation and is given by

    \(\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right)\).

    Under the null hypothesis, \(t\) follows the T-distribution with \(n_0+n_1-2\) degrees of freedom. We thus reject the null \(\mu_0=\mu_1\) when \(|t|\) exceeds the \(1-\alpha/2\) quantile of the T-distribution.
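
    Here is a small Python sketch (my own, on simulated data with arbitrary means and sample sizes) that computes \(t\) from the formulas above and checks it against scipy's equal-variance two-sample t-test:

    ```python
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    y0 = rng.normal(0.0, 1.0, size=30)   # first sample
    y1 = rng.normal(0.5, 1.0, size=40)   # second sample
    n0, n1 = len(y0), len(y1)

    sigma2 = (((y0 - y0.mean())**2).sum() + ((y1 - y1.mean())**2).sum()) / (n0 + n1 - 2)
    t = (y1.mean() - y0.mean()) / np.sqrt(sigma2 * (1 / n0 + 1 / n1))

    # scipy's equal-variance two-sample t-test gives the same statistic
    # (the sign depends on the order of the arguments).
    print(t, ttest_ind(y1, y0, equal_var=True).statistic)
    ```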

    Pooling the data

    We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let \(Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1}\) be the pooled data and for \(i = 1,\ldots, n_0+n_1\), define \(x_i \in \{0,1\}\) by

    \(x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}\)

    The assumptions that \(Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)\) and \(Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)\) can be rewritten as

    \(Y_i = \beta_0+\beta_1x_i + \varepsilon_i,\)

    where \(\varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)\). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \(\mu_0 = \mu_1\) is equivalent to \(\beta_1 = 0\). To test \(\beta_1 = 0\) we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is

    \(t’ = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},\)

    where \(\hat{\beta}_1\) is the ordinary least squares estimate of \(\beta_1\), \(X \in \mathbb{R}^{(n_0+n_1)\times 2}\) is the design matrix and \(\hat{\sigma}_{OLS}\) is an estimate of \(\sigma\) given by

    \(\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,\)

    where \(\hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1x_i\) is the fitted value of \(Y_i\).

    It turns out that \(t’\) is exactly equal to \(t\). We can see this by writing out the design matrix and calculating everything above. The design matrix has rows \([1,x_i]\) and is thus equal to

    \(X = \begin{bmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_{n_0}\\ 1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\ \vdots&\vdots\\ 1&0\\ 1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.\)

    This implies that

    \(X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},\)

    and therefore,

    \((X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .\)

    Thus, \((X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}\). So,

    \(t’ = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},\)

    which is starting to look like \(t\) from the two-sample test. Now

    \(X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}.\)

    And so

    \(\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\ \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.\)

    Thus, \(\hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)}\) and

    \(t’ = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.\)

    This means to show that \(t’ = t\), we only need to show that \(\hat{\sigma}_{OLS}^2=\hat{\sigma}^2\). To do this, note that the fitted values \(\hat{Y}\) are equal to

    \(\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.\)

    Thus,

    \(\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),\)

    which is exactly \(\hat{\sigma}^2\). Therefore, \(t'=t\) and the two-sample t-test is equivalent to a correlation test.
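
    The equality \(t' = t\) can also be checked numerically. The sketch below (my own code, generating the same simulated samples as in the earlier snippet) builds the design matrix with the dummy variable and computes \(t'\) directly from the ordinary least squares formulas:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    y0 = rng.normal(0.0, 1.0, size=30)
    y1 = rng.normal(0.5, 1.0, size=40)
    n0, n1 = len(y0), len(y1)

    # Pool the data and build the design matrix with the dummy variable x_i.
    y = np.concatenate([y0, y1])
    x = np.concatenate([np.zeros(n0), np.ones(n1)])
    X = np.column_stack([np.ones(n0 + n1), x])

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_ols = (resid**2).sum() / (n0 + n1 - 2)
    t_prime = beta_hat[1] / np.sqrt(sigma2_ols * np.linalg.inv(X.T @ X)[1, 1])

    # The classical two-sample t statistic from the start of the post.
    sigma2 = (((y0 - y0.mean())**2).sum() + ((y1 - y1.mean())**2).sum()) / (n0 + n1 - 2)
    t = (y1.mean() - y0.mean()) / np.sqrt(sigma2 * (1 / n0 + 1 / n1))
    print(np.isclose(t_prime, t))   # True
    ```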

    The Friedman-Rafsky test

    In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.

    In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two-sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead come from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

    I plan to write a follow-up post explaining these procedures, but you can also read about them in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

    References

    Persi Diaconis “Group representations in probability and statistics,” pp 104-106, Hayward, CA: Institute of Mathematical Statistics, (1988)

    Jerome H. Friedman, Lawrence C. Rafsky “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics, Ann. Statist. 7(4), 697-717, (July, 1979)

    Jerome H. Friedman, Lawrence C. Rafsky “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics, Ann. Statist. 11(2), 377-391, (June, 1983).

  • Non-measurable sets, cardinality and the axiom of choice

    The following post is based on a talk I gave at the 2022 Stanford statistics retreat. The talk was titled “Another non-measurable monster”.

    The opening slide from my presentation. The text reads “another non-measurable monster”. There are spooky pumpkins in the background. Image from https://www.shutterstock.com/image-photo/pumpkins-graveyard-spooky-night-halloween-backdrop-1803950377

    The material was based on the discussion and references given in this stackexchange post. The title is a reference to a Halloween lecture on measurability given by Professor Persi Diaconis.

    What’s scarier than a non-measurable set?

    Making every set measurable. Or rather one particular consequence of making every set measurable.

    In my talk, I argued that if you make every set measurable, then there exists a set \(\Omega\) and an equivalence relation \(\sim\) on \(\Omega\) such that \(|\Omega| < |\Omega / \sim|\). That is, the set \(\Omega\) has strictly smaller cardinality than the set of equivalence classes \(\Omega/\sim\). The contradictory nature of this statement is illustrated in the picture below

    We can think of the set \(\Omega\) as the collection of crosses drawn above. The equivalence relation \(\sim\) divides \(\Omega\) into the regions drawn above. The statement \(|\Omega|<|\Omega /\sim|\) means that in some sense there are more regions than crosses.

    To make sense of this we’ll first have to be a bit more precise about what we mean by cardinality.

    What do we mean by bigger and smaller?

    Let \(A\) and \(B\) be two sets. We say that \(A\) and \(B\) have the same cardinality and write \(|A| = |B|\) if there exists a bijective function \(f:A \to B\). We can think of the function \(f\) as a way of pairing each element of \(A\) with a unique element of \(B\) such that every element of \(B\) is paired with an element of \(A\).

    We next want to define \(|A|\le |B|\), which means \(A\) has cardinality at most the cardinality of \(B\). There are two reasonable ways in which we could try to define this relationship:

    1. We could say \(|A|\le |B|\) means that there exists an injective function \(f : A \to B\).
    2. Alternatively, we could say that \(|A|\le |B|\) means that there exists a surjective function \(g:B \to A\).

    Definitions 1 and 2 say similar things and, in the presence of the axiom of choice, they are equivalent. Since we are going to be making every set measurable in this talk, we won’t be assuming the axiom of choice. Definitions 1 and 2 are thus no longer equivalent and we have a decision to make. We will use definition 1 in this talk. For justification, note that definition 1 implies that there exists a subset \(B’ \subseteq B\) such that \(|A|=|B’|\). We simply take \(B’\) to be the range of \(f\). This is a desirable property of the relation \(|A|\le |B|\) and it’s not clear how this could be achieved using definition 2.

    Infinite binary sequences

    It’s time to introduce the set \(\Omega\) and the equivalence relation we will be working with. The set \(\Omega\) is \(\{0,1\}^\mathbb{Z}\), the set of all functions \(\omega : \mathbb{Z} \to \{0,1\}\). We can think of each element \(\omega \in \Omega\) as an infinite sequence of zeros and ones stretching off in both directions. For example

    \(\omega = \ldots 1110110100111\ldots\).

    But this analogy hides something important. Each \(\omega \in \Omega\) has a “middle” which is the point \(\omega_0\). For instance, the two sequences below look the same but when we make \(\omega_0\) bold we see that they are different.

    \(\omega = \ldots 111011\mathbf{0}100111\ldots\),

    \(\omega’ = \ldots 1110110\mathbf{1}00111\ldots\).

    The equivalence relation \(\sim\) on \(\Omega\) can be thought of as forgetting the location \(\omega_0\). More formally we have \(\omega \sim \omega’\) if and only if there exists \(n \in \mathbb{Z}\) such that \(\omega_{n+k} = \omega_{k}’\) for all \(k \in \mathbb{Z}\). That is, if we shift the sequence \(\omega\) by \(n\) we get the sequence \(\omega’\). We will use \([\omega]\) to denote the equivalence class of \(\omega\) and \(\Omega/\sim\) for the set of all equivalences classes.

    Some probability

    Associated with the space \(\Omega\) are functions \(X_k : \Omega \to \{0,1\}\), one for each integer \(k \in \mathbb{Z}\). These functions simply evaluate \(\omega\) at \(k\). That is, \(X_k(\omega)=\omega_k\). A probabilist or statistician would think of \(X_k\) as reporting the result of one of infinitely many independent coin tosses. Normally, to make this formal, we would have to first define a \(\sigma\)-algebra on \(\Omega\) and then define a probability on this \(\sigma\)-algebra. Today we’re working in a world where every set is measurable, and so we don’t have to worry about \(\sigma\)-algebras. Indeed, we have the following result:

    (Solovay, 1970)1 There exists a model of the Zermelo–Fraenkel axioms of set theory in which there is a probability \(\mathbb{P}\), defined on all subsets of \(\Omega\), under which the \(X_k\) are i.i.d. \(\mathrm{Bernoulli}(0.5)\).

    This result is saying that there is a world in which, other than the axiom of choice, all the regular axioms of set theory hold. And in this world, we can assign a probability to every subset \(A \subseteq \Omega\) in a way so that the events \( \{X_k=1\}\) are all independent and have probability \(0.5\). It’s important to note that this is a true countably additive probability and we can apply all our familiar probability results to \(\mathbb{P}\). We are now ready to state and prove the spooky result claimed at the start of this talk.

    Proposition: Given the existence of such a probability \(\mathbb{P}\), \(|\Omega | < |\Omega /\sim|\).

    Proof: Let \(f:\Omega/\sim \to \Omega\) be any function. To show that \(|\Omega|<|\Omega /\sim|\) we need to show that \(f\) is not injective. To do this, we’ll first define another function \(g:\Omega \to \Omega\) given by \(g(\omega)=f([\omega])\). That is, \(g\) first maps \(\omega\) to \(\omega\)’s equivalence class and then applies \(f\) to this equivalence class. This is illustrated below.

    A commutative diagram showing the definition of \(g\) as \(g(\omega)=f([\omega])\).

    We will show that \(g : \Omega \to \Omega\) is almost surely constant with respect to \(\mathbb{P}\). That is, there exists \(\omega^\star \in \Omega\) such that \(\mathbb{P}(g(\omega)=\omega^\star)=1\). Each equivalence class \([\omega]\) is finite or countable and thus has probability zero under \(\mathbb{P}\). On the other hand, the set \(\{\omega : g(\omega)=\omega^\star\}\) is a union of equivalence classes, namely those that \(f\) maps to \(\omega^\star\), and if \(g\) is almost surely constant this union has probability one. It therefore cannot consist of a single equivalence class, so \(f\) must map multiple (in fact infinitely many) equivalence classes to \(\omega^\star\) and cannot be injective.

    It thus remains to show that \(g:\Omega \to \Omega\) is almost surely constant. To do this we will introduce a third function \(\varphi : \Omega \to \Omega\). The map \(\varphi\) is simply the shift map and is given by \(\varphi(\omega)_k = \omega_{k+1}\). Note that \(\omega\) and \(\varphi(\omega)\) are in the same equivalence class for every \(\omega\in \Omega\). Thus, the map \(g\) satisfies \(g\circ \varphi = g\). That is \(g\) is \(\varphi\)-invariant.

    The map \(\varphi\) is ergodic. This means that if \(A \subseteq \Omega\) satisfies \(\varphi(A)=A\), then \(\mathbb{P}(A)\) equals \(0\) or \(1\). For example, if \(A\) is the event that \(10110\) appears at some point in \(\omega\), then \(\varphi(A)=A\) and \(\mathbb{P}(A)=1\). Likewise, if \(A\) is the event that the relative frequency of heads converges to a number strictly greater than \(0.5\), then \(\varphi(A)=A\) and \(\mathbb{P}(A)=0\). The general claim that all \(\varphi\)-invariant events have probability \(0\) or \(1\) can be proved using the independence of the \(X_k\).

    For each \(k\), define an event \(A_k\) by \(A_k = \{\omega : g(\omega)_k = 1\}\). Since \(g\) is \(\varphi\)-invariant we have that \(\varphi(A_k)=A_k\). Thus, \(\mathbb{P}(A_k)=0\) or \(1\). This gives us a function \(\omega^\star :\mathbb{Z} \to \{0,1\}\) given by \(\omega^\star_k = \mathbb{P}(A_k)\). Note that for every \(k\), \(\mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\}) = 1\). This is because if \(\omega_{k}^\star=1\), then \(\mathbb{P}(\{\omega: g(\omega)_k = 1\})=1\), by definition of \(\omega_k^\star\). Likewise if \(\omega_k^\star =0\), then \(\mathbb{P}(\{\omega:g(\omega)_k=1\})=0\) and hence \(\mathbb{P}(\{\omega:g(\omega)_k=0\})=1\). Thus, in both cases, \(\mathbb{P}(\{\omega : g(\omega)_k = \omega_k^\star\})= 1\).

    Since \(\mathbb{P}\) is a probability measure, we can conclude that

    \(\mathbb{P}(\{\omega : g(\omega)=\omega^\star\}) = \mathbb{P}\left(\bigcap_{k \in \mathbb{Z}} \{\omega : g(\omega)_k = \omega_k^\star\}\right)=1\).

    Thus, \(g\) maps \(\Omega\) to \(\omega^\star\) with probability one. This shows that \(g\) is almost surely constant and hence that \(f\) is not injective. \(\square\)

    There’s a catch!

    So we have proved that there cannot be an injective map \(f : \Omega/\sim \to \Omega\). Does this mean we have proved \(|\Omega| < |\Omega/\sim|\)? Technically no. We have proved the negation of \(|\Omega/\sim|\le |\Omega|\), which does not imply \(|\Omega| \le |\Omega/\sim|\). To argue that \(|\Omega| < |\Omega/\sim|\) we need to produce an injective map \(h: \Omega \to \Omega/\sim\). Surprisingly, this is possible and not too difficult. The idea is to find a map \(\tilde{h} : \Omega \to \Omega\) such that \(\tilde{h}(\omega)\sim \tilde{h}(\omega’)\) implies that \(\omega = \omega’\); composing \(\tilde{h}\) with the quotient map \(\omega \mapsto [\omega]\) then gives the required injection. This can be done by somehow encoding in \(\tilde{h}(\omega)\) where the centre of \(\omega\) is.

    A simpler proof and other examples

    Our proof was nice because we explicitly calculated the value \(\omega^\star\) where \(g\) sent almost all of \(\Omega\). We could have been less explicit and simply noted that the function \(g:\Omega \to \Omega\) was measurable with respect to the invariant \(\sigma\)-algebra of \(\varphi\) and hence almost surely constant by the ergodicity of \(\varphi\).

    This quicker proof allows us to generalise our “spooky result” to other sets. Below are two examples where \(\Omega = [0,1)\):

    • Fix \(\theta \in [0,1)\setminus \mathbb{Q}\) and define \(\omega \sim \omega’\) if and only if \(\omega + n \theta= \omega’ \pmod 1\) for some \(n \in \mathbb{Z}\).
    • \(\omega \sim \omega’\) if and only if \(\omega – \omega’ \in \mathbb{Q}\).

    A similar argument can be used to show that in Solovay’s world \(|\Omega| < |\Omega/\sim|\) for both of these examples. The proof goes through because the corresponding transformations of \(\Omega\) are ergodic under the uniform measure.

    Three takeaways

    I hope you agree that this example is good fun and surprising. I’d like to end with some remarks.

    • The first remark is some mathematical context. The argument given today is linked to some interesting mathematics called descriptive set theory. This field studies the properties of well-behaved subsets (such as Borel subsets) of topological spaces. Descriptive set theory incorporates logic, topology and ergodic theory. I don’t know much about the field but in Persi’s Halloween talk he said that one “monster” was that few people are interested in the subject.
    • The next remark is a better way to think about our “spooky result”. The result is really saying something about cardinality. When we no longer use the axiom of choice, cardinality becomes a subtle concept. The statement \(|A|\le |B|\) no longer corresponds to \(A\) being “smaller” than \(B\) but rather that \(A\) is “less complex” than \(B\). This is perhaps analogous to some statistical models which may be “large” but do not overfit due to subtle constraints on the model complexity.
    • In light of the previous remark, I would invite you to think about whether the example I gave is truly spookier than non-measurable sets. It might seem to you that it is simply a reasonable consequence of removing the axiom of choice and restricting ourselves to functions we could actually write down or understand. I’ll let you decide.

    Footnotes

    1. Technically Solovay proved that there exists a model of set theory in which every subset of \(\mathbb{R}\) is Lebesgue measurable. To get the result for binary sequences we have to restrict to \([0,1)\) and use the binary expansion of \(x \in [0,1)\) to define a function \([0,1) \to \Omega\). Solovay’s paper is available here https://www.jstor.org/stable/1970696?seq=1