Category: Probability

  • The field with one element in a probability seminar

    Something very exciting happened this afternoon. Professor Persi Diaconis was presenting at the Stanford probability seminar and the field with one element made an appearance. The talk was about joint work with Mackenzie Simper and Arun Ram. They had developed a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. As an illustrative example, Persi discussed a random walk on \(GL_n(\mathbb{F}_q)\) given by multiplication by a random transvection (a map of the form \(v \mapsto v + ab^Tv\), where \(a\) and \(b\) are nonzero vectors with \(a^Tb = 0\)).
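
    As a rough illustration of the transvection step (a sketch of my own, not code from the talk), one can sample such a map over \(\mathbb{F}_q\) in a few lines of Python. The dimension \(n\), the field size \(q\) and the rejection sampler are all arbitrary choices here, and the sampler is only a crude stand-in for whatever distribution on transvections the talk actually used.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_transvection(n, q):
        """Return the matrix I + a b^T mod q with a, b nonzero and a^T b = 0 (mod q)."""
        while True:
            a = rng.integers(0, q, size=n)
            b = rng.integers(0, q, size=n)
            if a.any() and b.any() and (a @ b) % q == 0:
                return (np.eye(n, dtype=int) + np.outer(a, b)) % q

    # A few steps of the walk on GL_n(F_q): repeatedly multiply by a random transvection.
    n, q = 4, 3
    g = np.eye(n, dtype=int)
    for _ in range(10):
        g = (random_transvection(n, q) @ g) % q
    print(g)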

    The Bruhat decomposition can be used to match double cosets of \(GL_n(\mathbb{F}_q)\) (with respect to the subgroup of upper triangular matrices) with elements of the symmetric group \(S_n\). So by collapsing the random walk on \(GL_n(\mathbb{F}_q)\) we get a random walk on \(S_n\) for every prime power \(q\). As Professor Diaconis said, you can’t stop him from taking \(q \to 1\) and asking what the resulting random walk on \(S_n\) is. The answer? Multiplication by a random transposition. As pointed sets are vector spaces over the field with one element and the symmetric groups are the corresponding general linear groups, this all fits with what’s expected of the field with one element.

    This was just one small part of a very enjoyable seminar. There was plenty of group theory, probability, some general theory and engaging examples.

    Update: I have written another post about some of the group theory from the seminar! You can read it here: Double cosets and contingency tables.

  • A not so simple conditional expectation

    It is winter 2022 and my PhD cohort has moved on to the second quarter of our first year statistics courses. This means we’ll be learning about generalised linear models in our applied course, asymptotic statistics in our theory course and conditional expectations and martingales in our probability course.

    In the first week of our probability course we’ve been busy defining and proving the existence of the conditional expectation. Our approach has been similar to how we constructed the Lebesgue integral in the previous course. Last quarter, we first defined the Lebesgue integral for simple functions, then we used a limiting argument to define the Lebesgue integral for non-negative functions and then finally we defined the Lebesgue integral for arbitrary functions by considering their positive and negative parts.

    Our approach to the conditional expectation has been similar but the journey has been different. We again started with simple random variables, then progressed to non-negative random variables and then proved the existence of the conditional expectation of any arbitrary integrable random variable. Unlike the Lebesgue integral, the hardest step was proving the existence of the conditional expectation of a simple random variable. Progressing from simple random variables to arbitrary random variables was a straightforward application of the monotone convergence theorem and linearity of expectation. But to prove the existence of the conditional expectation of a simple random variable we needed to work with projections in the Hilbert space \(L^2(\Omega, \mathcal{F}, \mathbb{P})\).

    Unlike the Lebesgue integral, defining the conditional expectation of a simple random variable is not straightforward. One reason for this is that the conditional expectation of a simple random variable need not itself be a simple random variable. This comment was made offhand by our professor and sparked my curiosity. The following example is what I came up with. Below I first go over some definitions and then we dive into the example.

    A simple random variable with a conditional expectation that is not simple

    Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and let \(\mathcal{G} \subseteq \mathcal{F}\) be a sub-\(\sigma\)-algebra. The conditional expectation of an integrable random variable \(X\) is a random variable \(\mathbb{E}(X|\mathcal{G})\) that satisfies the following two conditions:

    1. The random variable \(\mathbb{E}(X|\mathcal{G})\) is \(\mathcal{G}\)-measurable.
    2. For all \(B \in \mathcal{G}\), \(\mathbb{E}[X1_B] = \mathbb{E}[\mathbb{E}(X|\mathcal{G})1_B]\), where \(1_B\) is the indicator function of \(B\).

    The conditional expectation of an integrable random variable always exists and is unique up to almost sure equality. One can think of \(\mathbb{E}(X|\mathcal{G})\) as the expected value of \(X\) given the information in \(\mathcal{G}\).

    A simple random variable is a random variable \(X\) that takes only finitely many values. Simple random variables are always integrable and so \(\mathbb{E}(X|\mathcal{G})\) always exists, but we will see that \(\mathbb{E}(X|\mathcal{G})\) need not be simple.

    Consider a random vector \((U,V)\) uniformly distributed on the square \([-1,1]^2 \subseteq \mathbb{R}^2\). Let \(D\) be the unit disc \(D = \{(u,v) \in \mathbb{R}^2:u^2+v^2 \le 1\}\). The random variable \(X = 1_D(U,V)\) is a simple random variable since \(X\) equals \(1\) if \((U,V) \in D\) and \(X\) equals \(0\) otherwise. Let \(\mathcal{G} = \sigma(U)\) be the \(\sigma\)-algebra generated by \(U\). It turns out that

    \(\mathbb{E}(X|\mathcal{G}) = \sqrt{1-U^2}\).

    Thus \(\mathbb{E}(X|\mathcal{G})\) is not a simple random variable. Let \(Y = \sqrt{1-U^2}\). Since \(Y\) is a continuous function of \(U\), the random variable \(Y\) is \(\mathcal{G}\)-measurable. Thus \(Y\) satisfies condition 1. Furthermore, if \(B \in \mathcal{G}\), then \(B = \{U \in A\}\) for some measurable set \(A\subseteq [-1,1]\). Thus \(X1_B\) equals \(1\) if and only if \(U \in A\) and \(V \in [-\sqrt{1-U^2}, \sqrt{1-U^2}]\). Since \((U,V)\) is uniformly distributed we thus have

    \(\mathbb{E}[X1_B] = \int_A \int_{-\sqrt{1-u^2}}^{\sqrt{1-u^2}} \frac{1}{4}dvdu = \int_A \frac{1}{2}\sqrt{1-u^2}du\).

    The random variable \(U\) is uniformly distributed on \([-1,1]\) and thus has density \(\frac{1}{2}1_{[-1,1]}\). Therefore,

    \(\mathbb{E}[Y1_B] = \mathbb{E}[\sqrt{1-U^2}1_{\{U \in A\}}] = \int_A \frac{1}{2}\sqrt{1-u^2}du\).

    Thus \(\mathbb{E}[X1_B] = \mathbb{E}[Y1_B]\) and therefore \(Y = \sqrt{1-U^2}\) equals \(\mathbb{E}(X|\mathcal{G})\). Intuitively we can see this because given \(U=u\), we know that \(X\) is \(1\) when \(V \in [-\sqrt{1-u^2},\sqrt{1-u^2}]\) and that \(X\) is \(0\) otherwise. Since \(V\) is uniformly distributed on \([-1,1]\), the probability that \(V\) is in \([-\sqrt{1-u^2},\sqrt{1-u^2}]\) is \(\sqrt{1-u^2}\). Thus given \(U=u\), the expected value of \(X\) is \(\sqrt{1-u^2}\).
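
    As a quick sanity check (my own addition, not part of the argument above), a short Monte Carlo simulation in Python agrees with the formula: binning on \(U\), the empirical mean of \(X\) is close to \(\sqrt{1-u^2}\).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000_000
    U = rng.uniform(-1, 1, N)
    V = rng.uniform(-1, 1, N)
    X = (U**2 + V**2 <= 1).astype(float)       # X = 1_D(U, V)

    # Estimate E[X | U close to u] by averaging X over a thin bin around u.
    for u in [0.0, 0.5, 0.9]:
        in_bin = np.abs(U - u) < 0.01
        print(u, round(X[in_bin].mean(), 3), round(np.sqrt(1 - u**2), 3))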

    An extension

    The previous example suggests an extension that shows just how “complicated” the conditional expectation of a simple random variable can be. I’ll state the extension as an exercise:

    Let \(f:[-1,1]\to \mathbb{R}\) be any continuous function with \(f(x) \in [0,1]\) for all \(x\). With \((U,V)\) and \(\mathcal{G}\) as above, show that there exists a measurable set \(A \subseteq [-1,1]^2\) such that \(\mathbb{E}(1_A(U,V)|\mathcal{G}) = f(U)\).

  • Extremal couplings

    This post is inspired by an assignment question I had to answer for STATS 310A – a probability course at Stanford for first year students in the statistics PhD program. In the question we had to derive a few results about couplings. I found myself thinking and talking about the question long after submitting the assignment and decided to put my thoughts on paper. I would like to thank our lecturer Prof. Diaconis for answering my questions and pointing me in the right direction.

    What are couplings?

    Given two distribution functions \(F\) and \(G\) on \(\mathbb{R}\), a coupling of \(F\) and \(G\) is a distribution function \(H\) on \(\mathbb{R}^2\) such that the marginals of \(H\) are \(F\) and \(G\). Couplings can be used to give probabilistic proofs of analytic statements about \(F\) and \(G\) (see here). Couplings are also studied in their own right in the theory of optimal transport.

    We can think of \(F\) and \(G\) as being the cumulative distribution functions of some random variables \(X\) and \(Y\). A coupling \(H\) of \(F\) and \(G\) thus corresponds to a random vector \((\widetilde{X},\widetilde{Y})\) where \(\widetilde{X}\) has the same distribution as \(X\), \(\widetilde{Y}\) has the same distribution as \(Y\) and \((\widetilde{X},\widetilde{Y}) \sim H\).

    The independent coupling

    For two given distribution functions \(F\) and \(G\) there exist many possible couplings. For example we could take \(H = H_I\) where \(H_I(x,y) = F(x)G(y)\). This coupling corresponds to a random vector \((\widetilde{X}_I,\widetilde{Y}_I)\) where \(\widetilde{X}_I\) and \(\widetilde{Y}_I\) are independent and (as is required for all couplings) \(\widetilde{X}_I \stackrel{\text{dist}}{=} X\), \(\widetilde{Y}_I \stackrel{\text{dist}}{=} Y\).

    In some sense the coupling \(H_I\) is in the “middle” of all couplings. This is because \(\widetilde{X}_I\) and \(\widetilde{Y}_I\) are independent and so \(\widetilde{X}_I\) doesn’t carry any information about \(\widetilde{Y}_I\). As the title of the post suggests, there are couplings where this isn’t the case and \(\widetilde{X}\) carries “as much information as possible” about \(\widetilde{Y}\).

    The two extremal couplings

    Define two functions \(H_L, H_U :\mathbb{R}^2 \to [0,1]\) by

    \(H_U(x,y) = \min\{F(x), G(y)\}\) and \(H_L(x,y) = \max\{F(x)+G(y) - 1, 0\}\).

    With some work, one can show that \(H_L\) and \(H_U\) are distribution functions on \(\mathbb{R}^2\) and that they have the correct marginals. In this post I would like to talk about how to construct random vectors \((\widetilde{X}_U, \widetilde{Y}_U) \sim H_U\) and \((\widetilde{X}_L, \widetilde{Y}_L) \sim H_L\).

    Let \(F^{-1}\) and \(G^{-1}\) be the quantile functions of \(F\) and \(G\). That is,

    \(F^{-1}(c) = \inf\{ x \in \mathbb{R} : F(x) \ge c\}\) and \(G^{-1}(c) = \inf\{ x \in \mathbb{R} : G(x) \ge c\}\).

    Now let \(V\) be a random variable that is uniformly distributed on \([0,1]\) and define

    \(\widetilde{X}_U = F^{-1}(V)\) and \(\widetilde{Y}_U = G^{-1}(V)\).

    Since \(F^{-1}(V) \le x\) if and only if \(V \le F(x)\), we have \(\widetilde{X}_U \stackrel{\text{dist}}{=} X\) and likewise \(\widetilde{Y}_U \stackrel{\text{dist}}{=} Y\). Furthermore \(\widetilde{X}_U \le x, \widetilde{Y}_U \le y\) occurs if and only if \(V \le F(x), V \le G(y)\) which is equivalent to \(V \le \min\{F(x),G(y)\}\). Thus

    \(\mathbb{P}(\widetilde{X}_U \le x, \widetilde{Y}_U \le y) = \mathbb{P}(V \le \min\{F(x),G(y)\})= \min\{F(x),G(y)\}.\)

    Thus \((\widetilde{X}_U,\widetilde{Y}_U)\) is distributed according to \(H_U\). We see that under the coupling \(H_U\), \(\widetilde{X}_U\) and \(\widetilde{Y}_U\) are closely related as they are both increasing functions of a common random variable \(V\).

    We can follow a similar construction for \(H_L\). Define

    \(\widetilde{X}_L = F^{-1}(V)\) and \(\widetilde{Y}_L = G^{-1}(1-V)\).

    Thus \(\widetilde{X}_L\) and \(\widetilde{Y}_L\) are again functions of a common random variable \(V\) but \(\widetilde{X}_L\) is an increasing function of \(V\) and \(\widetilde{Y}_L\) is a decreasing function of \(V\). Note that \(1-V\) is also uniformly distributed on \([0,1]\). Thus \(\widetilde{X}_L \stackrel{\text{dist}}{=} X\) and \(\widetilde{Y}_L \stackrel{\text{dist}}{=} Y\).

    Now \(\widetilde{X}_L \le x, \widetilde{Y}_L \le y\) occurs if and only if \(V \le F(x)\) and \(1-V \le G(y)\) which occurs if and only if \(1-G(y) \le V \le F(x)\). If \(1-G(y) \le F(x)\), then \(F(x)+G(y)-1 \ge 0\) and \(\mathbb{P}(1-G(y) \le V \le F(x)) =F(x)+G(y)-1\). On the other hand, if \(1 - G(y) > F(x)\), then \(F(x)+G(y)-1< 0\) and \(\mathbb{P}(1-G(y) \le V \le F(x))=0\). Thus

    \(\mathbb{P}(\widetilde{X}_L \le x, \widetilde{Y}_L \le y) = \mathbb{P}(1-G(y) \le V \le F(x)) = \max\{F(x)+G(y)-1,0\}\),

    and so \((\widetilde{X}_L,\widetilde{Y}_L)\) is distributed according to \(H_L\).
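
    Both constructions are easy to simulate. The following Python sketch (my own illustration, using scipy and taking exponential and standard normal marginals as arbitrary choices) draws \((\widetilde{X}_U, \widetilde{Y}_U)\) and \((\widetilde{X}_L, \widetilde{Y}_L)\) from a common uniform \(V\) and compares the empirical joint CDFs with \(H_U\) and \(H_L\) at one point.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    N = 1_000_000
    V = rng.uniform(0, 1, N)

    F, G = stats.expon, stats.norm             # arbitrary choice of marginals
    XU, YU = F.ppf(V), G.ppf(V)                # (X_U, Y_U) ~ H_U
    XL, YL = F.ppf(V), G.ppf(1 - V)            # (X_L, Y_L) ~ H_L

    x, y = 1.0, 0.5
    print(np.mean((XU <= x) & (YU <= y)), min(F.cdf(x), G.cdf(y)))
    print(np.mean((XL <= x) & (YL <= y)), max(F.cdf(x) + G.cdf(y) - 1, 0))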

    What makes \(H_U\) and \(H_L\) extreme?

    Now that we know that \(H_U\) and \(H_L\) are indeed couplings, it is natural to ask what makes them “extreme”. What we would like to say is that \(\widetilde{Y}_U\) is an increasing function of \(\widetilde{X}_U\) and \(\widetilde{Y}_L\) is a decreasing function of \(\widetilde{X}_L\). Unfortunately this isn’t always the case as can be seen by taking \(X\) to be constant and \(Y\) to be continuous.

    However the intuition that \(\widetilde{Y}_U\) is increasing in \(\widetilde{X}_U\) and \(\widetilde{Y}_L\) is decreasing in \(\widetilde{X}_L\) is close to correct. Given a coupling \((\widetilde{X},\widetilde{Y}) \sim H\), we can look at the quantity

    \(C(x,y) = \mathbb{P}(\widetilde{Y} \le y \mid \widetilde{X} \le x) - \mathbb{P}(\widetilde{Y} \le y) = \frac{H(x,y)}{F(x)}-G(y)\), defined whenever \(F(x) > 0\).

    This quantity tells us something about how \(\widetilde{Y}\) changes with \(\widetilde{X}\). For instance, if \(\widetilde{X}\) and \(\widetilde{Y}\) are positively associated, we would expect \(C(x,y)\) to be positive, and if \(\widetilde{X}\) and \(\widetilde{Y}\) are negatively associated, we would expect \(C(x,y)\) to be negative.

    For the independent coupling \((\widetilde{X}_I,\widetilde{Y}_I) \sim H_I\), the quantity \(C(x,y)\) is constantly \(0\). It turns out that \(C(x,y)\) is maximised (for every \(x\) and \(y\)) by the coupling \((\widetilde{X}_U, \widetilde{Y}_U) \sim H_U\) and minimised by \((\widetilde{X}_L,\widetilde{Y}_L) \sim H_L\), and it is in this sense that they are extremal. This final claim is the two-dimensional version of the Fréchet-Hoeffding theorem and checking it is a good exercise.
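
    A quick numerical check of this claim can be run directly from the closed forms (again my own sketch with arbitrary marginals; it only inspects a grid of points and is no substitute for the proof).

    from scipy import stats

    F, G = stats.expon, stats.norm             # arbitrary choice of marginals

    def C(H, x, y):
        # C(x, y) = H(x, y)/F(x) - G(y) for a coupling with joint CDF H.
        return H(x, y) / F.cdf(x) - G.cdf(y)

    H_I = lambda x, y: F.cdf(x) * G.cdf(y)
    H_U = lambda x, y: min(F.cdf(x), G.cdf(y))
    H_L = lambda x, y: max(F.cdf(x) + G.cdf(y) - 1, 0)

    for x in [0.5, 1.0, 2.0]:
        for y in [-1.0, 0.0, 1.0]:
            cL, cI, cU = (C(H, x, y) for H in (H_L, H_I, H_U))
            assert cL <= cI <= cU              # H_L minimises C, H_U maximises it
    print("C was smallest under H_L and largest under H_U at every grid point")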

  • Finitely Additive Measures

    I am again tutoring the course MATH3029. The content is largely the same but the name has changed from “Probability Modelling and Applications” to “Probability Theory and Applications” to better reflect the material taught. There was a good question on the first assignment that leads to some interesting mathematics.

    The Question

    The assignment question is as follows. Let \(\Omega\) be a set and let \(\mathcal{F} \subseteq \mathcal{P}(\Omega)\) be a \(\sigma\)-algebra on \(\Omega\). Let \(\mathbb{P} : \mathcal{F} \to [0,\infty)\) be a function with the following properties

    1. \(\mathbb{P}(\Omega) = 1\).
    2. For any finite sequence of pairwise disjoint sets \((A_k)_{k=1}^n\) in \(\mathcal{F}\), we have \(\mathbb{P}\left(\bigcup_{k=1}^n A_k \right) = \sum_{k=1}^n \mathbb{P}(A_k)\).
    3. If \((B_n)_{n=1}^\infty\) is a sequence of sets in \(\mathcal{F}\) such that \(B_{n+1} \subseteq B_n\) for all \(n \in \mathbb{N}\) and \(\bigcap_{n=1}^\infty B_n = \emptyset\), then, as \(n\) tends to infinity, \(\mathbb{P}(B_n) \to 0\).

    Students were then asked to show that the function \(\mathbb{P}\) is a probability measure on \((\Omega, \mathcal{F})\). This amounts to showing that \(\mathbb{P}\) is countably additive. That is, if \((A_k)_{k=1}^\infty\) is a sequence of pairwise disjoint sets in \(\mathcal{F}\), then \(\mathbb{P}\left(\cup_{k=1}^\infty A_k\right) = \sum_{k=1}^\infty \mathbb{P}(A_k)\). One way to do this is to define \(B_n = \bigcup_{k=n+1}^\infty A_k\). Since the sets \((A_k)_{k=1}^\infty\) are pairwise disjoint, the sets \((B_n)_{n=1}^\infty\) satisfy the assumptions of the third property of \(\mathbb{P}\). Thus we can conclude that \(\mathbb{P}(B_n) \to 0\) as \(n \to \infty\).

    We also have that for every \(n \in \mathbb{N}\) we have \(\bigcup_{k=1}^\infty A_k = \left(\bigcup_{k=1}^n A_k\right) \cup B_n\). Thus by applying the second property of \(\mathbb{P}\) twice we get

    \(\mathbb{P}\left(\bigcup_{k=1}^\infty A_k \right) = \mathbb{P}\left( \bigcup_{k=1}^n A_k\right) + \mathbb{P}(B_n) = \sum_{k=1}^n \mathbb{P}(A_k) + \mathbb{P}(B_n)\).

    If we let \(n\) tend to infinity, then we get the desired result.

    A Follow Up Question

    A natural follow up question is whether all three of the assumptions in the question are necessary. It is particularly interesting to ask if there is an example of a function \(\mathbb{P}\) that satisfies the first two properties but is not a probability measure. It turns out the answer is yes but coming up with an example involves some serious mathematics.

    Let \(\Omega\) be the set of natural numbers \(\mathbb{N} = \{1,2,3,\ldots\}\) and let \(\mathcal{F}\) be the power set of \(\mathbb{N}\).

    One way in which people talk about the size of a subset of natural numbers \(A \subseteq \mathbb{N}\) is to look at the proportion of elements in \(A\) and take a limit. That is we could define

    \(\mathbb{P}(A) = \lim_{n \to \infty}\frac{|\{k \in A \mid k \le n \}|}{n}.\)

    This function \(\mathbb{P}\) has some nice properties. For instance, if \(A\) is the set of all even numbers, then \(\mathbb{P}(A) = 1/2\). More generally, if \(A\) is the set of all numbers divisible by \(k\), then \(\mathbb{P}(A) = 1/k\). The function \(\mathbb{P}\) gets used a lot. When people say that almost all natural numbers satisfy a property, they normally mean that if \(A\) is the subset of all numbers satisfying the property, then \(\mathbb{P}(A)=1\).
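
    As a quick illustration (my own, not from the assignment), a couple of lines of Python show the finite proportions settling down to \(1/k\) for the set of multiples of \(k\).

    # Proportion of {1, ..., n} lying in A, where A is given by a membership test.
    def proportion(in_A, n):
        return sum(in_A(k) for k in range(1, n + 1)) / n

    for k in [2, 3, 7]:
        print(k, [round(proportion(lambda m: m % k == 0, n), 4)
                  for n in [100, 10_000, 100_000]])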

    However, the function \(\mathbb{P}\) is not a probability measure, although it is finitely additive. To see this, let \((A_i)_{i=1}^m\) be a finite collection of disjoint subsets of \(\mathbb{N}\) and let \(A = \bigcup_{i=1}^m A_i\). Then for every natural number \(n\),

    \(\{k \in A \mid k \le n \} = \bigcup_{i=1}^m \{k \in A_i \mid k \le n\}\).

    Since the sets \((A_i)_{i=1}^m\) are disjoint, the union on the right is a disjoint union. Thus we have

    \(\frac{|\{k \in A \mid k \le n \}|}{n} = \sum_{i=1}^m \frac{|\{k \in A_i \mid k \le n \}|}{n}\).

    Taking limits on both sides gives \(\mathbb{P}(A)=\sum_{i=1}^m \mathbb{P}(A_i)\), as required. However, the function \(\mathbb{P}\) is not countably additive. For instance, let \(A_i = \{i\}\) for each \(i \in \mathbb{N}\). Then \(\bigcup_{i=1}^\infty A_i = \mathbb{N}\) and \(\mathbb{P}(\mathbb{N})=1\). On the other hand \(\mathbb{P}(A_i)=0\) for every \(i \in \mathbb{N}\) and hence \(\sum_{i=1}^\infty \mathbb{P}(A_i)=0\neq \mathbb{P}(\mathbb{N})\).

    Thus it would appear that we have an example of a finitely additive measure that is not countably additive. However, there is a big problem with the above definition of \(\mathbb{P}\). Namely, the limit of \(\frac{|\{k \in A \mid k \le n \}|}{n}\) does not always exist. Consider the set \(A = \{3,4,9,10,11,12,13,14,15,16,33,\ldots\}\), i.e. a number \(k \in \mathbb{N}\) is in \(A\) if and only if \(2^{m} < k \le 2^{m+1}\) for some odd number \(m\ge 1\). The idea is that the set \(A\) is made up of alternating blocks of consecutive numbers.

    There are chunks of numbers that alternate between being in \(A\) and not being in \(A\) and as we move further along, these chunks double in size. Let \(a_n\) represent the sequence of numbers \(\frac{|\{k \in A \mid k \le n \}|}{n}\). We can see that \(a_n\) increases while \(n\) is in a chunk that belongs to \(A\) and decreases when \(n\) is in a chunk not in \(A\). More specifically if \(2^{2m-1} < n \le 2^{2m}\), then \(a_n\) is increasing but if \(2^{2m} < n \le 2^{2m+1}\), then \(a_n\) is decreasing.

    At the turning points \(n = 2^{2m}\) or \(n = 2^{2m+1}\) we can calculate exactly what \(a_n\) is equal to. Note that

    \(|\{k \in A \mid k \le 2^{2m} \}| = 2+8+32+\ldots+2\cdot 4^{m-1} = 2\cdot \frac{4^m-1}{4-1} = \frac{2}{3}(4^m-1)\).

    Furthermore since there are no elements of \(A\) between \(2^{2m}\) and \(2^{2m+1}\) we have

    \(|\{k \in A \mid k \le 2^{2m+1}\}| = \frac{2}{3}(4^m-1)\).

    Thus we have

    \(a_{2^{2m}} = \frac{\frac{2}{3}(4^m-1)}{2^{2m}}=\frac{2}{3}\cdot\frac{4^m-1}{4^m}\) and \(a_{2^{2m+1}} = \frac{\frac{2}{3}(4^m-1)}{2^{2m+1}}=\frac{1}{3}\cdot\frac{4^m-1}{4^m}.\)

    Hence the values \(a_n\) oscillate, with one subsequence approaching \(\frac{2}{3}\) and another approaching \(\frac{1}{3}\). Thus the limit of \(a_n\) does not exist and hence \(\mathbb{P}\) is not well-defined.
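
    The oscillation is easy to see numerically. The short computation below (my own sketch) evaluates \(a_n\) at the turning points \(n = 2^{2m}\) and \(n = 2^{2m+1}\), and the two subsequences visibly approach \(\frac{2}{3}\) and \(\frac{1}{3}\).

    # a_n for the set A of k with 2^m < k <= 2^(m+1) for some odd m >= 1.
    def in_A(k):
        m = (k - 1).bit_length() - 1           # the m with 2^m < k <= 2^(m+1)
        return m >= 1 and m % 2 == 1

    def a(n):
        return sum(in_A(k) for k in range(1, n + 1)) / n

    for m in range(1, 9):
        print(m, round(a(2 ** (2 * m)), 4), round(a(2 ** (2 * m + 1)), 4))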

    There is a workaround using something called a Banach limit. Banach limits are a way of extending the notion of a limit from the space of convergent sequences to the space of bounded sequences. Banach limits aren’t uniquely defined and don’t have a formula describing them. Indeed, to prove the existence of Banach limits one has to rely on non-constructive mathematics such as the Hahn-Banach extension theorem. So if we take for granted the existence of Banach limits we can define

    \(\mathbb{P}(A) = L\left( \left(\frac{|\{k \in A \mid k \le n \}|}{n}\right)_{n=1}^\infty \right)\),

    where \(L\) is now a Banach limit. This new definition of \(\mathbb{P}\) is defined on all subsets of \(\mathbb{N}\) and is an example of a finitely additive measure that is not countably additive. However, the definition of \(\mathbb{P}\) is very non-constructive. Indeed, there are models of ZF set theory where the Hahn-Banach theorem does not hold and we cannot prove the existence of Banach limits.

    This raises the question of whether or not there exist constructive examples of measures that are finitely additive but not countably additive. A bit of Googling reveals that non-principal ultrafilters provide another way of defining finitely additive measures that are not countably additive. However, the existence of a non-principal ultrafilter on \(\mathbb{N}\) again requires a weak form of the axiom of choice. Thus it seems that the existence of such a measure may be inherently non-constructive. This discussion on Math Overflow goes into more detail.

  • A minimal counterexample in probability theory

    Last semester I tutored the course Probability Modelling with Applications. In this course the main objects of study are probability spaces. A probability space is a triple \((\Omega, \mathcal{F}, \mathbb{P})\) where:

    1. \(\Omega\) is a set.
    2. \(\mathcal{F}\) is a \(\sigma\)-algebra on \(\Omega\). That is, \(\mathcal{F}\) is a collection of subsets of \(\Omega\) such that \(\Omega \in \mathcal{F}\) and \(\mathcal{F}\) is closed under set complements and countable unions. The elements of \(\mathcal{F}\) are called events and they are precisely the subsets of \(\Omega\) that we can assign probabilities to. We will denote the power set of \(\Omega\) by \(2^\Omega\) and hence \(\mathcal{F} \subseteq 2^\Omega\).
    3. \(\mathbb{P}\) is a probability measure. That is, \(\mathbb{P}\) is a function \(\mathbb{P}: \mathcal{F} \rightarrow [0,1]\) such that \(\mathbb{P}(\Omega)=1\) and for all countable collections \(\{A_i\}_{i=1}^\infty \subseteq \mathcal{F}\) of pairwise disjoint subsets we have that \(\mathbb{P} \left(\bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty \mathbb{P}(A_i) \).

    It’s common for students to find probability spaces, and in particular \(\sigma\)-algebras, confusing. Unfortunately Vitali showed that \(\sigma\)-algebras can’t be avoided if we want to study probability spaces built on \(\mathbb{R}\) or on an infinite sequence of coin tosses. One of the main reasons why \(\sigma\)-algebras can be so confusing is that it can be very hard to give concrete descriptions of all the elements of a \(\sigma\)-algebra.

    We often have a collection \(\mathcal{G}\) of subsets of \(\Omega\) that we are interested in but this collection fails to be a \(\sigma\)-algebra. For example, we might have \(\Omega = \mathbb{R}^n\) and \(\mathcal{G}\) is the collection of open subsets. In this situation we take our \(\sigma\)-algebra \(\mathcal{F}\) to be \(\sigma(\mathcal{G})\) which is the smallest \(\sigma\)-algebra containing \(\mathcal{G}\). That is

    \(\sigma(\mathcal{G}) = \bigcap \mathcal{F}'\)

    where the above intersection is taken over all \(\sigma\)-algebras \(\mathcal{F}'\) that contain \(\mathcal{G}\). In this setting we will say that \(\mathcal{G}\) generates \(\sigma(\mathcal{G})\). When we have such a collection of generators, we might have an idea for what probability we would like to assign to sets in \(\mathcal{G}\). That is, we have a function \(\mathbb{P}_0 : \mathcal{G} \rightarrow [0,1]\) and we want to extend this function to create a probability measure \(\mathbb{P} : \sigma(\mathcal{G}) \rightarrow [0,1]\). A famous theorem due to Caratheodory shows that we can do this in many cases.

    An interesting question is whether the extension \(\mathbb{P}\) is unique. That is, does there exist a probability measure \(\mathbb{P}'\) on \(\sigma(\mathcal{G})\) such that \(\mathbb{P} \neq \mathbb{P}'\) but \(\mathbb{P}_{\mid \mathcal{G}} = \mathbb{P}'_{\mid \mathcal{G}}\)? The following theorem gives a criterion that guarantees no such \(\mathbb{P}'\) exists.

    Theorem: Let \(\Omega\) be a set and let \(\mathcal{G}\) be a collection of subsets of \(\Omega\) that is closed under finite intersections. If \(\mathbb{P},\mathbb{P}' : \sigma(\mathcal{G}) \rightarrow [0,1]\) are two probability measures such that \(\mathbb{P}_{\mid \mathcal{G}} = \mathbb{P}'_{\mid \mathcal{G}}\), then \(\mathbb{P} = \mathbb{P}'\).

    The above theorem is very useful for two reasons. Firstly it can be combined with Caratheodory’s extension theorem to uniquely define probability measures on a \(\sigma\)-algebra by specifying the values on a collection of simple subsets \(\mathcal{G}\). Secondly if we ever want to show that two probability measures are equal, the above theorem tells us we can reduce the problem to checking equality on the simpler subsets in \(\mathcal{G}\).

    The condition that \(\mathcal{G}\) must be closed under finite intersections is somewhat intuitive. Suppose we had \(A,B \in \mathcal{G}\) but \(A \cap B \notin \mathcal{G}\). We will however have \(A \cap B \in \sigma(\mathcal{G})\) and thus we might be able to find two probability measures \(\mathbb{P},\mathbb{P}' : \sigma(\mathcal{G}) \rightarrow [0,1]\) such that \(\mathbb{P}(A) = \mathbb{P}'(A)\) and \(\mathbb{P}(B)=\mathbb{P}'(B)\) but \(\mathbb{P}(A \cap B) \neq \mathbb{P}'(A \cap B)\). The following counterexample shows that this intuition is indeed well-founded.

    When looking for examples and counterexamples, it’s good to try to keep things as simple as possible. With that in mind we will try to find a counterexample when \(\Omega\) is a finite set with as few elements as possible and \(\sigma(\mathcal{G})\) is equal to the power set of \(\Omega\). In this setting, a probability measure \(\mathbb{P}: \sigma(\mathcal{G}) \rightarrow [0,1]\) can be defined by specifying the values \(\mathbb{P}(\{\omega\})\) for each \(\omega \in \Omega\).

    We will now try to find a counterexample when \(\Omega\) is as small as possible. Unfortunately we won’t be able to find a counterexample when \(\Omega\) only contains one or two elements. This is because we want to find \(A,B \subseteq \Omega\) such that \(A \cap B\) is not equal to \(A\), \(B\) or \(\emptyset\).

    Thus we will start our search with a three element set \(\Omega = \{a,b,c\}\). Up to relabelling the elements of \(\Omega\), the only interesting choice we have for \(\mathcal{G}\) is \(\{ \{a,b\} , \{b,c\} \}\). This has a chance of working since \(\mathcal{G}\) is not closed under intersection. However, any probability measure \(\mathbb{P}\) on \(\sigma(\mathcal{G}) = 2^{\{a,b,c\}}\) must satisfy the equations

    1. \(\mathbb{P}(\{a\})+\mathbb{P}(\{b\})+\mathbb{P}(\{c\}) = 1\),
    2. \(\mathbb{P}(\{a\})+\mathbb{P} (\{b\}) = \mathbb{P}(\{a,b\})\),
    3. \(\mathbb{P}(\{b\})+\mathbb{P}(\{c\}) = \mathbb{P}(\{b,c\})\).

    Thus \(\mathbb{P}(\{a\}) = 1- \mathbb{P}(\{b,c\})\), \(\mathbb{P}(\{c\}) = 1-\mathbb{P}(\{a,b\})\) and \(\mathbb{P}(\{b\})=\mathbb{P}(\{a,b\})+\mathbb{P}(\{b,c\})-1\). Thus \(\mathbb{P}\) is determined by its values on \(\{a,b\}\) and \(\{b,c\}\).

    However, a four element set \(\{a,b,c,d\}\) is sufficient for our counterexample! We can let \(\mathcal{G} = \{\{a,b\},\{b,c\}\}\). Then \(\sigma(\mathcal{G})=2^{\{a,b,c,d\}}\) and we can define \(\mathbb{P} , \mathbb{P}' : \sigma (\mathcal{G}) \rightarrow [0,1]\) by

    • \(\mathbb{P}(\{a\}) = 0\), \(\mathbb{P}(\{b\})=0.5\), \(\mathbb{P}(\{c\})=0\) and \(\mathbb{P}(\{d\})=0.5\).
    • \(\mathbb{P}'(\{a\})=0.5\), \(\mathbb{P}'(\{b\})=0\), \(\mathbb{P}'(\{c\})=0.5\) and \(\mathbb{P}'(\{d\})=0\).

    Clearly \(\mathbb{P} \neq \mathbb{P}'\); however, \(\mathbb{P}(\{a,b\})=\mathbb{P}'(\{a,b\})=0.5\) and \(\mathbb{P}(\{b,c\})=\mathbb{P}'(\{b,c\})=0.5\). Thus we have our counterexample! In general, for any \(\lambda \in [0,1)\) we can define the probability measure \(\mathbb{P}_\lambda = \lambda\mathbb{P}+(1-\lambda)\mathbb{P}'\). The measure \(\mathbb{P}_\lambda\) is not equal to \(\mathbb{P}\) but agrees with \(\mathbb{P}\) on \(\mathcal{G}\). In general, if we have two probability measures that agree on \(\mathcal{G}\) but not on \(\sigma(\mathcal{G})\), then we can produce uncountably many such measures by taking convex combinations as done above.
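
    To finish, here is a small Python check (my own sketch, not part of the original argument) that \(\mathcal{G}\) really does generate the full power set of \(\{a,b,c,d\}\), and that \(\mathbb{P}\) and \(\mathbb{P}'\) agree on \(\mathcal{G}\) while disagreeing on \(\{b\}\).

    omega = frozenset("abcd")
    G = [frozenset("ab"), frozenset("bc")]

    # Close G under complements and unions; on a finite set this yields sigma(G).
    sigma = {frozenset(), omega, *G}
    while True:
        new = {omega - A for A in sigma} | {A | B for A in sigma for B in sigma}
        if new <= sigma:
            break
        sigma |= new
    print(len(sigma))                                          # 16, the full power set

    P  = {"a": 0.0, "b": 0.5, "c": 0.0, "d": 0.5}
    P2 = {"a": 0.5, "b": 0.0, "c": 0.5, "d": 0.0}
    measure = lambda p, A: sum(p[w] for w in A)

    print([(measure(P, A), measure(P2, A)) for A in G])        # agree on G
    print(measure(P, {"b"}), measure(P2, {"b"}))               # differ on {b}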