Recent Posts

  • Double cosets and contingency tables

    Like my previous post, this post is also motivated by a comment made by Professor Persi Diaconis in his recent Stanford probability seminar. The seminar was about a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. In this post, I’ll first define double cosets and then go over the example Professor Diaconis used to make us probabilists and statisticians more comfortable with all the group theory he was discussing.

    Double cosets

    Let \(G\) be a group and let \(H\) and \(K\) be two subgroups of \(G\). For each \(g \in G\), the \((H,K)\)-double coset containing \(g\) is defined to be the set

    \(HgK = \{hgk : h \in H, k \in K\}\)

    To simplify notation, we will simply write double coset instead of \((H,K)\)-double coset. The double coset of \(g\) can also be defined as the equivalence class of \(g\) under the relation

    \(g \sim g' \Longleftrightarrow g' = hgk\) for some \(h \in H\) and \(k \in K\)

    As with regular cosets, the relation above is indeed an equivalence relation. Thus, the group \(G\) can be written as a disjoint union of double cosets. The set of all double cosets of \(G\) is denoted by \(H\backslash G / K\). That is,

    \(H\backslash G /K = \{HgK : g \in G\}\)

    Note that if we take \(H = \{e\}\), the trivial subgroup, then the double cosets are simply the left cosets of \(K\), \(G / K\). Likewise if \(K = \{e\}\), then the double cosets are the right cosets of \(H\), \(H \backslash G\). Thus, double cosets generalise both left and right cosets.

    Double cosets in \(S_n\)

    Fix a natural number \(n > 0\). A partition of \(n\) is a finite sequence \(a = (a_1,a_2,\ldots, a_I)\) such that \(a_1,a_2,\ldots, a_I \in \mathbb{N}\), \(a_1 \ge a_2 \ge \ldots \ge a_I > 0\) and \(\sum_{i=1}^I a_i = n\). For each partition \(a\) of \(n\), we can form a subgroup \(S_a\) of the symmetric group \(S_n\). The subgroup \(S_a\) contains all permutations \(\sigma \in S_n\) that fix the sets \(A_1 = \{1,\ldots, a_1\}, A_2 = \{a_1+1,\ldots, a_1 + a_2\},\ldots, A_I = \{n - a_I +1,\ldots, n\}\), meaning that \(\sigma(A_i) = A_i\) for all \(1 \le i \le I\). Thus, a permutation \(\sigma \in S_a\) must separately permute the elements of \(A_1\), the elements of \(A_2\) and so on. This means that, in a natural way,

    \(S_a \cong S_{a_1} \times S_{a_2} \times \ldots \times S_{a_I}\)

    If we have two partitions \(a = (a_1,a_2,\ldots, a_I)\) and \(b = (b_1,b_2,\ldots,b_J)\), then we can form two subgroups \(H = S_a\) and \(K = S_b\) and consider the double cosets \(H \backslash G / K = S_a \backslash S_n / S_b\). The claim made in the seminar was that the double cosets are in one-to-one correspondence with \(I \times J\) contingency tables with row sums equal to \(a\) and column sums equal to \(b\). Before we explain this correspondence and properly define contingency tables, let’s first consider the cases when either \(H\) or \(K\) is the trivial subgroup.

    Left cosets in \(S_n\)

    Note that if \(a = (1,1,\ldots,1)\), then \(S_a\) is the trivial subgroup and, as noted above, \(S_a \backslash S_n / S_b\) is simply equal to \(S_n / S_b\). We will see that the cosets in \(S_n/S_b\) can be described by forgetting something about the permutations in \(S_n\).

    We can think of the permutations in \(S_n\) as all the ways of drawing without replacement \(n\) balls labelled \(1,2,\ldots, n\). We can think of the partition \(b = (b_1,b_2,\ldots,b_J)\) as a colouring of the \(n\) balls by \(J\) colours. We colour balls \(1,2,\ldots, b_1\) by the first colour \(c_1\), then we colour \(b_1+1,b_1+2,\ldots, b_1+b_2\) the second colour \(c_2\) and so on until we colour \(n-b_J+1, n-b_J+2,\ldots, n\) the final colour \(c_J\). Below is an example when \(n\) is equal to 6 and \(b = (3,2,1)\).

    The first three balls are coloured green, the next two are coloured red and the last ball is coloured blue.

    Note that a permutation \(\sigma \in S_n\) is in \(S_b\) if and only if we draw the balls by colour groups, i.e. we first draw all the balls with colour \(c_1\), then we draw all the balls with colour \(c_2\) and so on. Thus, continuing with the previous example, the permutation \(\sigma_1\) below is in \(S_b\) but \(\sigma_2\) is not in \(S_b\).

    The permutation \(\sigma_1\) is in \(S_b\) because the colours are in their original order but \(\sigma_2\) is not in \(S_b\) because the colours are rearranged.

    It turns out that we can think of the cosets in \(S_n / S_b\) as what happens when we “forget” the labels \(1,2,\ldots,n\) and only remember the colours of the balls. By “forgetting” the labels we mean only paying attention to the list of colours. That is, for all \(\sigma_1,\sigma_2 \in S_n\), \(\sigma_1 S_b = \sigma_2 S_b\) if and only if the list of colours from the draw \(\sigma_1\) is the same as the list of colours from the draw \(\sigma_2\). Thus, the two permutations below define the same left coset of \(S_b\).

    When we forget the labels and only remember the colours, the permutations \(\sigma_1\) and \(\sigma_2\) look the same and thus are in the same left coset of \(S_b\).

    To see why this is true, note that \(\sigma_1 S_b = \sigma_2 S_b\) if and only if \(\sigma_2 = \sigma_1 \circ \tau\) for some \(\tau \in S_b\). Furthermore, \(\tau \in S_b\) if and only if \(\tau\) maps each colour group to itself. Recall that function composition is read right to left. Thus, the equation \(\sigma_2 = \sigma_1 \circ \tau\) means that if we first relabel the balls according to \(\tau\) and then draw the balls according to \(\sigma_1\), then we get the same result as just drawing by \(\sigma_2\). That is, \(\sigma_2 = \sigma_1 \circ \tau\) for some \(\tau \in S_b\) if and only if drawing by \(\sigma_2\) is the same as first relabelling the balls within each colour group and then drawing the balls according to \(\sigma_1\). Thus, \(\sigma_1 S_b = \sigma_2 S_b\) if and only if, when we forget the labels of the balls and only look at the colours, \(\sigma_1\) and \(\sigma_2\) give the same list of colours. This is illustrated with our running example below.

    If we permute the balls according to \(\tau \in S_b\) and then draw according to \(\sigma_1\), then the resulting draw is the same as if we had not permuted and drawn according to \(\sigma_2\). That is, \(\sigma_2 = \sigma_1 \circ \tau\).
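    If you would like to experiment with this yourself, here is a small R sketch of the “forgetting” map (the helper function and the two example draws are mine, not taken from the figures above). A draw is stored as the sequence of ball labels in the order they come out, and forgetting the labels means keeping only the sequence of colours.

    # Colouring for n = 6 and b = (3, 2, 1): balls 1-3 are green,
    # balls 4-5 are red and ball 6 is blue.
    colour_of <- rep(c("green", "red", "blue"), times = c(3, 2, 1))

    # "Forget" the labels of a draw and keep only the list of colours.
    colour_sequence <- function(draw) colour_of[draw]

    sigma1 <- c(2, 6, 3, 5, 1, 4)   # two hypothetical draw orders
    sigma2 <- c(3, 6, 1, 4, 2, 5)

    colour_sequence(sigma1)
    # "green" "blue"  "green" "red"   "green" "red"
    identical(colour_sequence(sigma1), colour_sequence(sigma2))
    # TRUE, so sigma1 and sigma2 represent the same left coset of S_b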

    Right cosets in \(S_n\)

    Typically, the subgroup \(S_a\) is not a normal subgroup of \(S_n\). This means that the right coset \(S_a \sigma\) need not equal the left coset \(\sigma S_a\). Thus, colouring the balls and forgetting the labelling won’t describe the right cosets \(S_a \backslash S_n\). We’ll see that a different type of forgetting can be used to describe \(S_a \backslash S_n = \{S_a\sigma : \sigma \in S_n\}\).

    Fix a partition \(a = (a_1,a_2,\ldots,a_I)\) and now, instead of considering \(I\) colours, think of \(I\) different people \(p_1,p_2,\ldots,p_I\). As before, a permutation \(\sigma \in S_n\) can be thought of as drawing \(n\) balls labelled \(1,\ldots,n\) without replacement. We can imagine giving the first \(a_1\) balls drawn to person \(p_1\), then giving the next \(a_2\) balls to the person \(p_2\) and so on until we give the last \(a_I\) balls to person \(p_I\). An example with \(n = 6\) and \(a = (2,2,2)\) is drawn below.

    Person \(p_1\) receives the ball labelled by 6 followed by the ball labelled 3, person \(p_2\) receives ball 2 and then ball 1 and finally person \(p_3\) receives ball 4 followed by ball 5.

    Note that \(\sigma \in S_a\) if and only if each person \(p_i\) receives the balls with labels \(\sum_{k=0}^{i-1}a_k+1,\sum_{k=0}^{i-1}a_k+2,\ldots, \sum_{k=0}^i a_k\) in any order (with the convention \(a_0 = 0\)). Thus, in the below example \(\sigma_1 \in S_a\) but \(\sigma_2 \notin S_a\).

    When the balls are drawn according to \(\sigma_1\), person \(p_i\) receives the balls with labels \(2i-1\) and \(2i\), and thus \(\sigma_1 \in S_a\). On the other hand, if the balls are drawn according to \(\sigma_2\), the people receive different balls and thus \(\sigma_2 \notin S_a\).

    It turns out the cosets \(S_a \backslash S_n\) are exactly determined by “forgetting” the order in which each person \(p_i\) received their balls and only remembering which balls they received. Thus, the two permutations below belong to the same coset in \(S_a \backslash S_n\).

    When we forget the order in which each person received their balls, the permutations \(\sigma_1\) and \(\sigma_2\) become the same and thus \(S_a \sigma_1 = S_a \sigma_2\). Note that if we coloured the balls according to the partition \(a = (a_1,\ldots,a_I)\), then we could see that \(\sigma_1 S_a \neq \sigma_2 S_a\).

    To see why this is true in general, consider two permutations \(\sigma_1,\sigma_2\). The permutations \(\sigma_1,\sigma_2\) result in each person \(p_i\) receiving the same balls if and only if after \(\sigma_1\) we can apply a permutation \(\tau\) that fixes each subset \(A_i = \left\{\sum_{k=0}^{i-1}a_k+1,\ldots, \sum_{k=0}^i a_k\right\}\) and get \(\sigma_2\). That is, \(\sigma_1\) and \(\sigma_2\) result in each person \(p_i\) receiving the same balls if and only if \(\sigma_2 = \tau \circ \sigma_1\) for some \(\tau \in S_a\). Thus, \(\sigma_1,\sigma_2\) are the same after forgetting the order in which each person received their balls if and only if \(S_a \sigma_1 = S_a \sigma_2\). This is illustrated below.

    If we first draw the balls according to \(\sigma_1\) and then permute the balls according to \(\tau\), then the resulting draw is the same as if we had drawn according to \(\sigma_2\) and not permuted afterwards. That is, \(\sigma_2 = \tau \circ \sigma_1 \).

    We can thus see why \(S_a \backslash S_n \neq S_n / S_a\). A left coset \(\sigma S_a\) corresponds to pre-composing \(\sigma\) with elements of \(S_a\), and a right coset \(S_a\sigma\) corresponds to post-composing \(\sigma\) with elements of \(S_a\).
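    The analogous check for right cosets only cares about which balls each person ends up with, not the order in which they were received. Here is a short R sketch of that check (again, the helper is mine; \(\sigma_1\) is the draw from the example above and \(\sigma_2\) is a hypothetical reshuffle within each person’s share).

    # Partition a = (2, 2, 2): person p1 gets the first two balls drawn,
    # p2 the next two and p3 the last two.
    person_of <- rep(1:3, times = c(2, 2, 2))

    # "Forget" the order within each person's share by sorting it.
    balls_per_person <- function(draw) lapply(split(draw, person_of), sort)

    sigma1 <- c(6, 3, 2, 1, 4, 5)
    sigma2 <- c(3, 6, 1, 2, 5, 4)

    identical(balls_per_person(sigma1), balls_per_person(sigma2))
    # TRUE: each person receives the same balls, so S_a sigma1 = S_a sigma2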

    Contingency tables

    With the last two sections under our belts, describing the double cosets \(S_a \backslash S_n / S_b\) is straightforward. We simply have to combine our two types of forgetting. That is, we first colour the \(n\) balls with \(J\) colours according to \(b = (b_1,b_2,\ldots,b_J)\). We then draw the balls without replacement and give the balls to \(I\) different people according to \(a = (a_1,a_2,\ldots,a_I)\). We then forget both the original labels and the order in which each person received their balls. That is, we only remember the number of balls of each colour each person receives. Describing the double cosets by “double forgetting” is illustrated below with \(a = (2,2,2)\) and \(b = (3,2,1)\).

    The permutations \(\sigma_1\) and \(\sigma_2\) both result in person \(p_1\) receiving one green ball and one blue ball. The two permutations also result in \(p_2\) and \(p_3\) both receiving one green ball and one red ball. Thus, \(\sigma_1\) and \(\sigma_2\) are both in the same \((S_a,S_b)\)-double coset. Note however that \(\sigma_1 S_b \neq \sigma_2 S_b\) and \(S_a\sigma_1 \neq S_a \sigma_2\).

    The proof that double forgetting does indeed describe the double cosets is simply a combination of the two arguments given above. After double forgetting, the number of balls given to each person can be recorded in an \(I \times J\) table. The \(N_{ij}\) entry of the table is simply the number of balls person \(p_i\) receives of colour \(c_j\). Two permutations are the same after double forgetting if and only if they produce the same table. For example, \(\sigma_1\) and \(\sigma_2\) above both produce the following table

                        Green (\(c_1\))   Red (\(c_2\))   Blue (\(c_3\))   Total
    Person \(p_1\)            1                 0                1            2
    Person \(p_2\)            1                 1                0            2
    Person \(p_3\)            1                 1                0            2
    Total                     3                 2                1            6

    By the definition of how the balls are coloured and distributed to each person we must have for all \(1 \le i \le I\) and \(1 \le j \le J\)

    \(\sum_{j=1}^J N_{ij} = a_i \) and \(\sum_{i=1}^I N_{ij} = b_j\)

    An \(I \times J\) table with entries \(N_{ij} \in \{0,1,2,\ldots\}\) satisfying the above conditions is called a contingency table. Given such a contingency table with entries \(N_{ij}\) where the rows sum to \(a\) and the columns sum to \(b\), there always exists at least one permutation \(\sigma\) such that \(N_{ij}\) is the number of balls received by person \(p_i\) of colour \(c_j\). We have already seen that two permutations produce the same table if and only if they are in the same double coset. Thus, the double cosets \(S_a \backslash S_n /S_b\) are in one-to-one correspondence with such contingency tables.
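    As a sanity check of this correspondence, here is a brute-force R sketch (the helper functions are mine) that runs over every draw of six balls with \(a = (2,2,2)\) and \(b = (3,2,1)\), records the table each draw produces and counts the distinct tables. The count should come out to 15, matching a direct enumeration of the contingency tables with these margins.

    # All permutations of 1:n as the rows of a matrix.
    all_perms <- function(n) {
      if (n == 1) return(matrix(1, 1, 1))
      smaller <- all_perms(n - 1)
      do.call(rbind, lapply(1:n, function(k) {
        rest <- (1:n)[-k]
        cbind(k, matrix(rest[smaller], nrow = nrow(smaller)))
      }))
    }

    a <- c(2, 2, 2)                              # balls per person
    b <- c(3, 2, 1)                              # balls per colour
    person_of <- rep(seq_along(a), times = a)
    colour_of <- rep(seq_along(b), times = b)

    # The table N_ij for one draw: balls of colour j received by person i.
    table_of <- function(draw) table(person_of, colour_of[draw])

    draws <- all_perms(6)
    keys <- apply(draws, 1, function(d) paste(table_of(d), collapse = ","))
    length(unique(keys))
    # 15 distinct tables, one for each double coset in S_a \ S_6 / S_b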

    The hypergeometric distribution

    I would like to end this blog post with a little bit of probability and relate the contingency tables above to the hypergeometric distribution. If \(a = (m, n-m)\) for some \(m \in \{0,1,\ldots,n\}\), then the contingency tables described above have two rows and are determined by the values \(N_{11}, N_{12},\ldots, N_{1J}\) in the first row. Each \(N_{1j}\) is the number of balls of colour \(c_j\) that the first person receives. Since the balls are drawn without replacement, this means that if we put the uniform distribution on \(S_n\), then the vector \(Y=(N_{11}, N_{12},\ldots, N_{1J})\) follows the multivariate hypergeometric distribution. Thus, if we have a random walk on \(S_n\) that quickly converges to the uniform distribution on \(S_n\), then we could use the double cosets to get a random walk that converges to the multivariate hypergeometric distribution (although there are smarter ways to do such sampling).
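    To see the hypergeometric connection concretely, here is a quick R simulation (a sketch only; the helper function is mine). We take \(n = 6\), \(m = 2\) and \(b = (3,2,1)\), draw uniform permutations and record how many balls of each colour the first person receives; the empirical means should be close to \(m b_j / n = (1, 2/3, 1/3)\).

    set.seed(1)
    n <- 6; m <- 2                    # a = (2, 4): person p1 gets the first two balls
    b <- c(3, 2, 1)
    colour_of <- rep(seq_along(b), times = b)

    # One uniform draw: count the colours among the first m balls drawn.
    first_row <- function() {
      draw <- sample(n)
      tabulate(colour_of[draw[1:m]], nbins = length(b))
    }

    sims <- t(replicate(10000, first_row()))
    colMeans(sims)                    # approximately 1.000 0.667 0.333
    m * b / n                         # the multivariate hypergeometric means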

  • The field with one element in a probability seminar

    Something very exciting happened this afternoon. Professor Persi Diaconis was presenting at the Stanford probability seminar and the field with one element made an appearance. The talk was about joint work with Mackenzie Simper and Arun Ram. They had developed a way of “collapsing” a random walk on a group to a random walk on the set of double cosets. As an illustrative example, Persi discussed a random walk on \(GL_n(\mathbb{F}_q)\) given by multiplication by a random transvection (a map of the form \(v \mapsto v + ab^Tv\), where \(a^Tb = 0\)).

    The Bruhat decomposition can be used to match double cosets of \(GL_n(\mathbb{F}_q)\) with elements of the symmetric group \(S_n\). So by collapsing the random walk on \(GL_n(\mathbb{F}_q)\) we get a random walk on \(S_n\) for all prime powers \(q\). As Professor Diaconis said, you can’t stop him from taking \(q \to 1\) and asking what the resulting random walk on \(S_n\) is. The answer? Multiplication by a random transposition. As pointed sets are the vector spaces over the field with one element and the symmetric groups are the corresponding matrix groups, this all fits with what’s expected of the field with one element.

    This was just one small part of a very enjoyable seminar. There was plenty of group theory, probability, some general theory and engaging examples.

    Update: I have written another post about some of the group theory from the seminar! You can read it here: Double cosets and contingency tables.

  • Working with ABC Radio’s API in R

    This post is about using ABC Radio’s API to create a record of all the songs played on Triple J in a given time period. An API (application programming interface) is a connection between two computer programs. Many websites have APIs that allow users to access data from that website. ABC Radio has an API that lets users access a record of the songs played on the station Triple J since around May 2014. Below I’ll show how to access this information and how to transform it into a format that’s easier to work with. All code can be found on GitHub here.

    Packages

    To access the API in R I used the packages “httr” and “jsonlite”. To transform the data from the API I used the packages “tidyverse” and “lubridate”.

    library(httr)
    library(jsonlite)
    library(tidyverse)
    library(lubridate)

    ABC Radio’s API

    The URL for ABC Radio’s API is:

    https://music.abcradio.net.au/api/v1/plays/search.json.

    If you type this into a web browser, you can see a bunch of text which actually lists information about the ten most recently played songs on ABC’s different radio stations. To see only the songs played on Triple J, you can go to:

    https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej.

    The extra text “station=triplej” is called a parameter. We can add more parameters such as “from” and “to”. The link:

    https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&from=2021-01-16T00:00:00.000Z&to=2021-01-16T00:30:00.000Z

    shows information about the songs played in the 30 minutes after midnight UTC on the 16th of January last year. The last parameter that we will use is limit which specifies the number of songs to include. The link:

    https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=1

    includes information about the most recently played song. The default for limit is 10 and the largest possible value is 100. We’ll see later that this cut-off at 100 makes downloading a lot of songs a little tricky. But first let’s see how we can access the information from ABC Radio’s API in R.

    Accessing an API in R

    We now know how to use ABC Radio’s API to get information about the songs played on Triple J, but how do we access this information in R and how is it stored? The function GET from the package “httr” enables us to access the API in R. The input to GET is simply the same sort of URL that we’ve been using to view the API online. The below code stores information about the 5 songs played on Triple J just before 5 am on the 5th of May 2015:

    res <- GET("https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=5&to=2015-05-05T05:00:00.000Z")
    res
    # Response [https://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&limit=5&to=2015-05-05T05:00:00.000Z]
    # Date: 2022-03-02 23:25
    # Status: 200
    # Content-Type: application/json
    # Size: 62.9 kB
    

    The line “Status: 200” tells us that we have successfully grabbed some data from the API. There are many other status codes that mean different things, but for now we’ll be happy remembering that “Status: 200” means everything has worked.
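    As an aside, if you want your code to check the status rather than relying on the printed summary, httr exposes it directly; for example:

    status_code(res)       # 200 when the request succeeded
    stop_for_status(res)   # throws an informative error for 4xx/5xx responses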

    The information from the API is stored within the object res. To change it from JSON to a list we can use the function “fromJSON” from the package “jsonlite”. This is done below:

    data <- fromJSON(rawToChar(res$content))
    names(data)
    # "total"  "offset" "items"

    The information about the individual songs is stored under items. There is a lot of information about each song including song length, copyright information and a link to an image of the album cover. For now, we’ll just try to find the name of the song, the name of the artist and the time the song played. We can find each of these under “items”. Finding the song title and the played time is pretty straightforward:

    data$items$recording$title
    # "First Light" "Dog"         "Begin Again" "Rot"         "Back To You"
    data$items$played_time
    # "2015-05-05T04:55:56+00:00" "2015-05-05T04:52:21+00:00" "2015-05-05T04:47:52+00:00" "2015-05-05T04:41:00+00:00" "2015-05-05T04:38:36+00:00"

    Finding the artist name is a bit trickier.

    data$items$recording$artists
    # [[1]]
    # entity       arid          name artwork
    # 1 Artist maQeOWJQe1 Django Django    NULL
    # links
    # 1 Link, mlD5AM2R5J, http://musicbrainz.org/artist/4bfce038-b1a0-4bc4-abe1-b679ab900f03, 4bfce038-b1a0-4bc4-abe1-b679ab900f03, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
    # is_australian    type role
    # 1            NA primary   NA
    # 
    # [[2]]
    # entity       arid      name artwork
    # 1 Artist maXE59XXz7 Andy Bull    NULL
    # links
    # 1 Link, mlByoXMP5L, http://musicbrainz.org/artist/3a837db9-0602-4957-8340-05ae82bc39ef, 3a837db9-0602-4957-8340-05ae82bc39ef, MusicBrainz artist, NA, NA, NA, service, musicbrainz, TRUE
    # is_australian    type role
    # 1            NA primary   NA
    # ....
    # ....

    We can see that each song actually has a lot of information about each artist, but we’re only interested in the artist name. By using “map_chr()” from the package “purrr” (loaded as part of the “tidyverse”) we can grab each song’s artist name.

    map_chr(data$items$recording$artists, ~.$name[1])
    # [1] "Django Django" "Andy Bull"     "Purity Ring"   "Northlane"     "Twerps"

    Taking “.$name[1]” means that if there are multiple artists, then we only select the first one. With all of these in place we can create a tibble with the information about these five songs.

    list <- data$items
    tb <- tibble(
         song_title = list$recording$title,
         artist = map_chr(list$recording$artists, ~.$name[1]),
         played_time = ymd_hms(list$played_time)
    ) %>% 
      arrange(played_time)
    tb
    # # A tibble: 5 x 3
    # song_title  artist        played_time        
    # <chr>       <chr>         <dttm>             
    # 1 Back To You Twerps        2015-05-05 04:38:36
    # 2 Rot         Northlane     2015-05-05 04:41:00
    # 3 Begin Again Purity Ring   2015-05-05 04:47:52
    # 4 Dog         Andy Bull     2015-05-05 04:52:21
    # 5 First Light Django Django 2015-05-05 04:55:56

    Downloading lots of songs

    The maximum number of songs we can access from the API at one time is 100. This means that if we want to download a lot of songs we’ll need to use some sort of loop. Below is some code which takes in two dates and times in UTC and fetches all the songs played between the two times. The idea is simply to grab all the songs played in a five-hour interval and then move on to the next five-hour interval. I found including an optional argument “progress” useful for debugging. This code is from the file get_songs.r on my GitHub.

    download <- function(from, to){
      # Build the API request for all plays between "from" and "to" (UTC).
      base <- "https://music.abcradio.net.au/api/v1/plays/search.json?limit=100&offset=0&page=0&station=triplej&from="
      from_char <- format(from, "%Y-%m-%dT%H:%M:%S.000Z")
      to_char <- format(to, "%Y-%m-%dT%H:%M:%S.000Z")
      url <- paste0(base,
                    from_char,
                    "&to=",
                    to_char)
      res <- GET(url)
      if (status_code(res) == 200) {
        # Keep only the song title, primary artist and played time.
        data <- fromJSON(rawToChar(res$content))
        list <- data$items
        tb <- tibble(
          song_title = list$recording$title,
          artist = map_chr(list$recording$artists, ~.$name[1]),
          played_time = ymd_hms(list$played_time)
        ) %>% 
          arrange(played_time)
        return(tb)
      }
    }
    
    get_songs <- function(start, end, progress = FALSE){
      from <- ymd_hms(start)
      to <- from + dhours(5)
      end <- ymd_hms(end)
      
      songs <- NULL
      
      # Work through the period in five-hour blocks, since the API returns
      # at most 100 songs per request.
      while (to < end) {
        if (progress) {
          print(from)
          print(to)
        }
        tb <- download(from, to)
        songs <- bind_rows(songs, tb)
        from <- to + dseconds(1)
        to <- to + dhours(5)
      }
      # Grab the final, possibly shorter, block.
      tb <- download(from, end)
      songs <- bind_rows(songs, tb)
      return(songs)
    }

    An example

    Using the functions defined above, we can download all the songs played in the year 2021 (measured in AEDT).

    songs <- get_songs(ymd_hms("2020-12-31 13:00:01 UTC"),
                       ymd_hms("2021-12-31 12:59:59 UTC"),
                       progress = TRUE)
    

    This code takes a little while to run so I have saved the output as a .csv file. There are lots of things I hope to do with this data such as looking at the popularity of different songs over time. For example, here’s a plot showing the cumulative plays of Genesis Owusu’s top six songs.
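    The plot itself isn’t reproduced here, but roughly speaking it can be built from the songs tibble with dplyr and ggplot2 along the following lines (a sketch; the artist filter and the styling are illustrative, not the code used for the original figure).

    top_songs <- songs %>%
      filter(artist == "Genesis Owusu") %>%
      count(song_title, sort = TRUE) %>%
      slice_head(n = 6)

    songs %>%
      filter(artist == "Genesis Owusu", song_title %in% top_songs$song_title) %>%
      arrange(played_time) %>%
      group_by(song_title) %>%
      mutate(cumulative_plays = row_number()) %>%    # running count of plays
      ggplot(aes(played_time, cumulative_plays, colour = song_title)) +
      geom_step() +
      labs(x = "Date", y = "Cumulative plays", colour = "Song")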

    I’ve also looked at using the data from last year to get a list of Triple J’s most played artists. Unfortunately, my list doesn’t line up with the official list. The issue is that I’m only recording the primary artist on each song and not any of the secondary artists. Hopefully I’ll address this in a later blog post! I would also like to do an analysis of how plays on Triple J correlate with song rankings in the Hottest 100.
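    One possible fix, sketched below on the small five-song example from earlier (this is not what the code above does), is to keep every credited artist by pasting the names together rather than taking only the first entry.

    # Collapse all credited artists for each song into a single string.
    map_chr(data$items$recording$artists, ~ paste(.$name, collapse = ", "))
    # e.g. "Django Django", or "Artist A, Artist B" when a song has several artists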

  • Maximum likelihood estimation and the method of moments

    Maximum likelihood and the method of moments are two ways to estimate parameters from data. In general, the two methods can differ but for one-dimensional exponential families they produce the same estimates.

    Suppose that \(\{P_\theta\}_{\theta \in \Omega}\) is a one-dimensional exponential family written in canonical form. That is, \(\Omega \subseteq \mathbb{R}\) and there exists a reference measure \(\mu\) such that each distribution \(P_\theta\) has a density \(p_\theta\) with respect to \(\mu\) and,

    \(p_\theta(x) = h(x)\exp\left(\theta T(x)-A(\theta)\right).\)

    The random variable \(T(X)\) is a sufficient statistic for the model \(X \sim P_\theta\). The function \(A(\theta)\) is the log-partition function for the family \(\{P_\theta\}_{\theta \in \Omega}\). The condition, \(\int p_\theta(x)\mu(dx)=1\) implies that

    \(A(\theta) = \log\left(\int h(x)\exp(\theta T(x))\mu(dx) \right).\)

    It turns out that the function \(A(\theta)\) is differentiable and that we may differentiate under the integral sign. This implies that

    \(A'(\theta) = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\int h(x)\exp(\theta T(x))\mu(dx)} = \frac{\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx)}{\exp(A(\theta))}\)

    Note that \(\int h(x)\frac{d}{d\theta}\exp(\theta T(x))\mu(dx) = \int T(x)h(x)\exp(\theta T(x))\mu(dx)\). Thus,

    \(A'(\theta) = \int T(x)h(x) \exp(\theta T(x)-A(\theta))\mu(dx) = \int T(x)p_\theta(x)\mu(dx).\)

    This means that \(A'(\theta) = \mathbb{E}_\theta[T(X)]\), the expectation of \(T(X)\) under \(X \sim P_\theta\).

    Now suppose that we have an i.i.d. sample \(X_1,\ldots, X_n \sim P_\theta\) and we want to use this sample to estimate \(\theta\). One way to estimate \(\theta\) is by maximum likelihood. That is, we choose the value of \(\theta\) that maximises the likelihood,

    \(L(\theta|X_1,\ldots,X_n) = \prod_{i=1}^n p_\theta(X_i).\)

    When using the maximum likelihood estimator, it is often easier to work with the log-likelihood. The log-likelihood is,

    \(\log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log\left(p_\theta(X_i)\right) = \sum_{i=1}^n \left(\log(h(X_i))+\theta T(X_i) - A(\theta)\right)\).

    Maximising the likelihood is equivalent to maximising the log-likelihood. For exponential families, the log-likelihood is a concave function of \(\theta\). Thus the maximisers can be found by differentiating and solving the first order equations. Note that,

    \(\frac{d}{d\theta} \log L(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \left(T(X_i)-A'(\theta)\right) = -nA'(\theta) + \sum_{i=1}^n T(X_i).\)

    Thus the maximum likelihood estimate (MLE) \(\widehat{\theta}\) solves the equation,

    \(-nA'(\widehat{\theta}) + \sum_{i=1}^n T(X_i) = 0.\)

    But recall that \(A'(\widehat{\theta}) = \mathbb{E}_{\widehat{\theta}}[T(X)]\). Thus the MLE is the solution to the equation,

    \(\mathbb{E}_{\widehat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(X_i)\).

    Thus the MLE is the value of \(\theta\) for which the expectation of \(T(X)\) matches the empirical average from our sample. That is, the maximum likelihood estimator for an exponential family is a method of moments estimator. Specifically, the maximum likelihood estimator matches the moments of the sufficient statistic \(T(X)\).
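    As a concrete check, the Poisson family in canonical form has \(T(x) = x\), \(h(x) = 1/x!\) and \(A(\theta) = e^\theta\) with \(\theta = \log \lambda\), so the fitted mean at the MLE should match the sample mean. The R sketch below (my own, just for illustration) maximises the log-likelihood numerically and compares the two.

    set.seed(1)
    x <- rpois(100, lambda = 3)

    # Poisson log-likelihood in the canonical parameter theta = log(lambda).
    loglik <- function(theta) sum(theta * x - exp(theta) - lfactorial(x))

    theta_mle <- optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum

    exp(theta_mle)   # A'(theta) at the MLE, i.e. the fitted mean of T(X) = X
    mean(x)          # the empirical average of T(X_i); the two agree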

    A counter example

    It is a special property of exponential families that the MLE is a method of moments estimator for the sufficient statistic. When we leave the nice world of exponential families, the two estimators may differ.

    Suppose that we have data \(X_1,\ldots,X_n \sim P_\theta\) where \(P_\theta\) is the uniform distribution on \([0,\theta]\). A minimal sufficient statistic for this model is \(X_{(n)}\), the maximum of \(X_1,\ldots, X_n\). Given what we saw before, we might imagine that the MLE for this model would be a method of moments estimator for \(X_{(n)}\), but this isn’t the case.

    The likelihood for \(X_1,\ldots,X_n\) is,

    \(L(\theta|X_1,\ldots,X_n) = \begin{cases} \frac{1}{\theta^n} & \text{if } X_{(n)} \le \theta,\\ 0 & \text{if } X_{(n)} > \theta. \end{cases}\)

    Thus the MLE is \(\widehat{\theta} = X_{(n)}\). However, under \(P_\theta\), \( \frac{1}{\theta}X_{(n)}\) has a \(\text{Beta}(n,1)\) distribution. Thus, \(\mathbb{E}_\theta[X_{(n)}] = \theta \frac{n}{n+1}\), so the method of moments estimator would be \( \widehat{\theta}' = \frac{n+1}{n}X_{(n)} \neq X_{(n)} = \widehat{\theta}\).
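    A small simulation makes the difference visible: the MLE \(X_{(n)}\) is biased low, while the method of moments estimator \(\frac{n+1}{n}X_{(n)}\) is unbiased. (The R code is mine, just to illustrate the calculation above.)

    set.seed(1)
    theta <- 2
    n <- 10

    sims <- replicate(10000, {
      x <- runif(n, 0, theta)
      c(mle = max(x), mom = (n + 1) / n * max(x))
    })

    rowMeans(sims)
    # mle is about theta * n / (n + 1) = 1.818..., mom is about theta = 2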

  • A not so simple conditional expectation

    It is winter 2022 and my PhD cohort has moved on to the second quarter of our first-year statistics courses. This means we’ll be learning about generalised linear models in our applied course, asymptotic statistics in our theory course and conditional expectations and martingales in our probability course.

    In the first week of our probability course we’ve been busy defining and proving the existence of the conditional expectation. Our approach has been similar to how we constructed the Lebesgue integral in the previous course. Last quarter, we first defined the Lebesgue integral for simple functions, then we used a limiting argument to define the Lebesgue integral for non-negative functions and then finally we defined the Lebesgue integral for arbitrary functions by considering their positive and negative parts.

    Our approach to the conditional expectation has been similar but the journey has been different. We again started with simple random variables, then progressed to non-negative random variables and then proved the existence of the conditional expectation of any arbitrary integrable random variable. Unlike for the Lebesgue integral, the hardest step was proving the existence of the conditional expectation of a simple random variable. Progressing from simple random variables to arbitrary random variables was a straightforward application of the monotone convergence theorem and linearity of expectation. But to prove the existence of the conditional expectation of a simple random variable we needed to work with projections in the Hilbert space \(L^2(\Omega, \mathbb{P},\mathcal{F})\).

    Unlike for the Lebesgue integral, defining the conditional expectation of a simple random variable is not straightforward. One reason for this is that the conditional expectation of a simple random variable need not be a simple random variable. This comment was made offhand by our Professor and sparked my curiosity. The following example is what I came up with. Below I first go over some definitions and then we dive into the example.

    A simple random variable with a conditional expectation that is not simple

    Let \((\Omega, \mathbb{P}, \mathcal{F})\) be a probability space and let \(\mathcal{G} \subseteq \mathcal{F}\) be a sub-\(\sigma\)-algebra. The conditional expectation of an integrable random variable \(X\) is a random variable \(\mathbb{E}(X|\mathcal{G})\) that satisfies the following two conditions:

    1. The random variable \(\mathbb{E}(X|\mathcal{G})\) is \(\mathcal{G}\)-measurable.
    2. For all \(B \in \mathcal{G}\), \(\mathbb{E}[X1_B] = \mathbb{E}[\mathbb{E}(X|\mathcal{G})1_B]\), where \(1_B\) is the indicator function of \(B\).

    The conditional expectation of an integrable random variable is unique and always exists. One can think of \(\mathbb{E}(X|\mathcal{G})\) as the expected value of \(X\) given the information in \(\mathcal{G}\).

    A simple random variable is a random variable \(X\) that takes only finitely many values. Simple random variables are always integrable and so \(\mathbb{E}(X|\mathcal{G})\) always exists, but we will see that \(\mathbb{E}(X|\mathcal{G})\) need not be simple.

    Consider a random vector \((U,V)\) uniformly distributed on the square \([-1,1]^2 \subseteq \mathbb{R}^2\). Let \(D\) be the unit disc \(D = \{(u,v) \in \mathbb{R}^2:u^2+v^2 \le 1\}\). The random variable \(X = 1_D(U,V)\) is a simple random variable since \(X\) equals \(1\) if \((U,V) \in D\) and \(X\) equals \(0\) otherwise. Let \(\mathcal{G} = \sigma(U)\) the \(\sigma\)-algebra generated by \(U\). It turns out that

    \(\mathbb{E}(X|\mathcal{G}) = \sqrt{1-U^2}\).

    Thus \(\mathbb{E}(X|\mathcal{G})\) is not a simple random variable. Let \(Y = \sqrt{1-U^2}\). Since \(Y\) is a continuous function of \(U\), the random variable \(Y\) is \(\mathcal{G}\)-measurable. Thus \(Y\) satisfies condition 1. Furthermore, if \(B \in \mathcal{G}\), then \(B = \{U \in A\}\) for some measurable set \(A\subseteq [-1,1]\). Thus \(X1_B\) equals \(1\) if and only if \(U \in A\) and \(V \in [-\sqrt{1-U^2}, \sqrt{1-U^2}]\). Since \((U,V)\) is uniformly distributed we thus have

    \(\mathbb{E}[X1_B] = \int_A \int_{-\sqrt{1-u^2}}^{\sqrt{1-u^2}} \frac{1}{4}dvdu = \int_A \frac{1}{2}\sqrt{1-u^2}du\).

    The random variable \(U\) is uniformly distributed on \([-1,1]\) and thus has density \(\frac{1}{2}1_{[-1,1]}\). Therefore,

    \(\mathbb{E}[Y1_B] = \mathbb{E}[\sqrt{1-U^2}1_{\{U \in A\}}] = \int_A \frac{1}{2}\sqrt{1-u^2}du\).

    Thus \(\mathbb{E}[X1_B] = \mathbb{E}[Y1_B]\) and therefore \(Y = \sqrt{1-U^2}\) equals \(\mathbb{E}(X|\mathcal{G})\). Intuitively we can see this because given \(U=u\), we know that \(X\) is \(1\) when \(V \in [-\sqrt{1-u^2},\sqrt{1-u^2}]\) and that \(X\) is \(0\) otherwise. Since \(V\) is uniformly distributed on \([-1,1]\), the probability that \(V\) is in \([-\sqrt{1-u^2},\sqrt{1-u^2}]\) is \(\sqrt{1-u^2}\). Thus given \(U=u\), the expected value of \(X\) is \(\sqrt{1-u^2}\).
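    A quick Monte Carlo check of this calculation in R (my own sketch): simulate \((U,V)\) and compare the average of \(X\) over a thin slice of values of \(U\) near a point \(u\) with \(\sqrt{1-u^2}\).

    set.seed(1)
    N <- 100000
    U <- runif(N, -1, 1)
    V <- runif(N, -1, 1)
    X <- as.numeric(U^2 + V^2 <= 1)

    u <- 0.6
    slice <- abs(U - u) < 0.01        # observations with U close to u
    mean(X[slice])                    # roughly 0.8
    sqrt(1 - u^2)                     # 0.8, the claimed conditional expectation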

    An extension

    The previous example suggests an extension that shows just how “complicated” the conditional expectation of a simple random variable can be. I’ll state the extension as an exercise:

    Let \(f:[-1,1]\to \mathbb{R}\) be any continuous function with \(f(x) \in [0,1]\) for all \(x \in [-1,1]\). With \((U,V)\) and \(\mathcal{G}\) as above, show that there exists a measurable set \(A \subseteq [-1,1]^2\) such that \(\mathbb{E}(1_A(U,V)|\mathcal{G}) = f(U)\).