Author: michaelnhowes

  • Sport climbing at the 2020 Olympics

    Last year sport climbing made its Olympic debut. My partner Claire is an avid climber and we watched some of the competition together. The method used to rank the competitors had some surprising consequences, which are the subject of this blog post.

    In the finals, the climbers had to compete in three disciplines – speed climbing, bouldering and lead climbing. Each climber was given a rank for each discipline. For speed climbing, the rank was given based on how quickly the climbers scaled a set route. For bouldering and lead climbing, the rank was based on how far they climbed on routes none of them had seen before.

    To get a final score, the ranks from the three disciplines are multiplied together. The best climber is the one with the lowest final score. For example, if a climber was 5th in speed climbing, 2nd in bouldering and 3rd in lead climbing, then they would have a final score of \(5\times 2 \times 3 = 30\). A climber who came 8th in speed, 1st in bouldering and 2nd in lead climbing would have a lower (and thus better) final score of \(8 \times 1 \times 2 = 16\).
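
    If you want to play with the scoring rule, here is a tiny R snippet (my own illustration, nothing official) that reproduces the two examples above. The final score is just the product of the three discipline ranks, and a lower score is better.

    # Final score = product of the three discipline ranks (lower is better)
    final_score <- function(speed, boulder, lead) speed * boulder * lead
    
    final_score(5, 2, 3)  # 30
    final_score(8, 1, 2)  # 16, the better (lower) score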

    One thing that makes this scoring system interesting is that your final score is very dependent on how well your competitors perform. This was most evident during Jakob Schubert’s lead climb in the men’s final.

    Lead climbing is the final discipline and Schubert was the last climber. This meant that ranks for speed climbing and bouldering were already decided. However, the final rankings of the climbers fluctuated hugely as Schubert scaled the 15 metre lead wall. This was because if a climber had done really well in the boulder and speed rounds, being overtaken by Schubert didn’t increase their final score by that much. However, if a climber did poorly in the previous rounds, then being overtaken by Schubert meant their score increased by a lot and they plummeted down the rankings. This can be visualised in the following plot:

    Along the x-axis we have how far up the lead wall Schubert climbed (measured by the “hold” he had reached). On the y-axis we have the finalists’ ranking at different stages of Schubert’s climb. We can see that as Schubert climbed and overtook the other climbers the whole ranking fluctuated. Here is a similar plot which also shows when Schubert overtook each competitor on the lead wall:

    The volatility of the rankings can really be seen by following Adam Ondra’s ranking (in purple). When Schubert started his climb, Ondra was ranked second. After Schubert passed Alberto Ginés López, Ondra was ranked first. But then Schubert passed Ondra, Ondra finished the event in 6th place and Ginés López came first. If you go to the 4 hour and 27 minute mark here you can watch Schubert’s climb and hear the commentator explain how both Ginés López and Ondra are in the running to win gold.

    Similar things happened in the women’s final, where Janja Garnbret was the last lead climber. Here is the same plot, which shows the climbers’ final rankings and lead rankings.

    Garnbret was a favourite at the Olympics and had come fifth in speed climbing and first in bouldering. This meant that as long as she didn’t come last in the lead climbing, she was guaranteed at least a silver medal, if not the gold. Garnbret ended up coming first in the lead climbing, and we can see that as she overtook the last few climbers, their rankings fluctuated wildly.

    Here is one more plot which shows each competitor’s final score at different points of Garnbret’s climb.

    In the plot you can really see that, depending on how they performed in the previous two events, each climber’s score changed by a different amount once they were overtaken. It also shows just how far ahead of the competition Garnbret was, especially when you compare the men’s and women’s finals. Here is the same plot for the men. You can see that the men’s final scores were a lot closer together.

    Before I end this post, I would like to make one comment about the culture of sport climbing. I wanted to highlight how tumultuous and complicated the sport climbing rankings were, but if you watched the athletes you’d have no idea the stakes were so high. The climbers celebrated their personal bests as if no one was watching, they traded ideas on how to tackle the lead wall, and the day after the final they returned to the bouldering wall to try the routes together. Sport climbing is such a friendly and welcoming sport and I would hate for my analysis to give anyone the wrong idea.

  • Visualising Strava data with R

    I’ve recently had some fun downloading and displaying my running data from Strava. I’ve been tracking my runs on Strava for the last five years and I thought it would be interesting to make a map showing where I run. Here is one of the plots I made. Each circle is a place in south east Australia where I ran in the given year. The size of the circle corresponds to how many hours I ran there.

    I think the plot is a nice visual diary. Looking at the plot, you can see that most of my running took place in my hometown Canberra and that the time I spend running has been increasing. The plot also shows that most years I’ve spent some time running on the south coast and on my grandfather’s farm near Kempsey. You can also see my trips in 2019 and 2020 to the AMSI Summer School. In 2019, the summer school was hosted by UNSW in Sydney and in 2020 it was hosted by La Trobe University in Melbourne. You can also see a circle from this year at Mt Kosciuszko where I ran the Australian Alpine Ascent with my good friend Sarah.

    I also made a plot of all my runs on a world map which shows my recent move to California. In this plot all the circles are the same size and I grouped all the runs across the five different years.

    I learnt a few things creating these plots and so I thought I would document how I made them.


    Creating the plots

    • Strava lets you download all your data by doing a bulk export. The export includes a zipped folder with all your activities in their original file format.
    • My activities were saved as .gpx files and I used this handy Python library to convert them to .csv files which I could read into R. For the R code I used the packages “tidyverse”, “maps” and “ozmaps”.
    • Now I had a .csv file for each run. In these files each row corresponded to my location at a given second during the run. What I wanted was a single data frame where each row corresponded to a different run. I found the following way to read in and edit each .csv file:
    # Load the tidyverse for read_csv() and the dplyr verbs used below
    library(tidyverse)
    
    files <- list.files(path = ".", pattern = "*.csv")
    listcsv <- lapply(files, function(x) read_csv(x) %>% 
                        select(lat, lon, time) %>% 
                        mutate(hours = n()/3600) %>%   # one row per second, so n()/3600 hours
                        filter(row_number() == 1)      # keep only the run's starting point
                      )
    

    The call to list.files() creates a list of all the .csv files in the working directory. The lapply() call then goes through the list of file names and converts each .csv file into a tibble. I then selected the columns with the time and my location and added a new column with the duration of the run in hours. Finally, I removed all the rows except the first row, which contains the information about where my run started.

    • Next I combined these separate tibbles into a single tibble using rbind(). I then added some new columns for grouping the runs. I added a column with the year and columns with the longitude and latitude rounded to the nearest whole number.
    runs <- do.call(rbind, listcsv) %>% 
      mutate(year = format(time,"%Y"),
             approx_lat = round(lat),
             approx_lon = round(lon))
    • To create the plot where you can see where I ran each year, I grouped the runs by the approximate location and by year. I then calculated the total time spent running at each location each year and calculated the average longitude and latitude. I also removed the runs in the USA by only keeping the runs with a negative latitude.
    run_counts_year <- runs %>% 
      group_by(approx_lat, approx_lon, year) %>% 
      summarise(hours = sum(hours),
                lat = mean(lat),
                lon = mean(lon),
                .groups = "drop") %>% 
      select(!approx_lat & !approx_lon)
    
    oz_counts_year <- run_counts_year %>% 
      filter(lat < 0)
    • I then used the package “ozmaps” to plot my running locations on a map of the ACT, New South Wales and Victoria.
    oz_states <- ozmaps::ozmap_states %>% 
      filter(NAME == "New South Wales" |
             NAME == "Victoria" |
             NAME == "Australian Capital Territory")
    
    ggplot() + 
      geom_sf(data = oz_states) +
      coord_sf() +
      geom_point(data = oz_counts_year,
                 mapping = aes(x = lon, 
                               y = lat,
                               size = hours),
                color = "blue",
                shape = 21) +
      facet_wrap(~ year) +
      theme_bw() 
    • Creating the world map was similar, except I didn’t group by year and I kept the runs with positive latitude as well.
    run_counts <- runs %>% 
      group_by(approx_lat, approx_lon) %>% 
      summarise(hours = sum(hours),
                lat = mean(lat),
                lon = mean(lon),
                .groups = "drop") %>% 
      select(!approx_lat & !approx_lon)
    
    world <- map_data("world")
    ggplot() +
      geom_polygon(data = world,
                   mapping = aes(x = long, y = lat, group = group),
                   fill = "lightgrey",
                   color = "black",
                   size = 0.1) +
      geom_point(data = run_counts,
                 mapping = aes(x = lon, 
                               y = lat),
                 size = 2,
                 shape = 1,
                 color = "blue") +
      theme_bw()
  • Extremal couplings

    This post is inspired by an assignment question I had to answer for STATS 310A – a probability course at Stanford for first year students in the statistics PhD program. In the question we had to derive a few results about couplings. I found myself thinking and talking about the question long after submitting the assignment and decided to put my thoughts on paper. I would like to thank our lecturer Prof. Diaconis for answering my questions and pointing me in the right direction.

    What are couplings?

    Given two distribution functions \(F\) and \(G\) on \(\mathbb{R}\), a coupling of \(F\) and \(G\) is a distribution function \(H\) on \(\mathbb{R}^2\) such that the marginals of \(H\) are \(F\) and \(G\). Couplings can be used to give probabilistic proofs of analytic statements about \(F\) and \(G\) (see here). Couplings are also studied in their own right in the theory of optimal transport.

    We can think of \(F\) and \(G\) as being the cumulative distribution functions of some random variables \(X\) and \(Y\). A coupling \(H\) of \(F\) and \(G\) thus corresponds to a random vector \((\widetilde{X},\widetilde{Y})\) where \(\widetilde{X}\) has the same distribution as \(X\), \(\widetilde{Y}\) has the same distribution as \(Y\) and \((\widetilde{X},\widetilde{Y}) \sim H\).

    The independent coupling

    For two given distribution functions \(F\) and \(G\) there exist many possible couplings. For example, we could take \(H = H_I\) where \(H_I(x,y) = F(x)G(y)\). This coupling corresponds to a random vector \((\widetilde{X}_I,\widetilde{Y}_I)\) where \(\widetilde{X}_I\) and \(\widetilde{Y}_I\) are independent and (as is required for all couplings) \(\widetilde{X}_I \stackrel{\text{dist}}{=} X\), \(\widetilde{Y}_I \stackrel{\text{dist}}{=} Y\).

    In some sense the coupling \(H_I\) is in the “middle” of all couplings. This is because \(\widetilde{X}_I\) and \(\widetilde{Y}_I\) are independent and so \(\widetilde{X}_I\) doesn’t carry any information about \(\widetilde{Y}_I\). As the title of the post suggests, there are couplings where this isn’t the case and \(\widetilde{X}\) carries “as much information as possible” about \(\widetilde{Y}\).

    The two extremal couplings

    Define two functions \(H_L, H_U :\mathbb{R}^2 \to [0,1]\) by

    \(H_U(x,y) = \min\{F(x), G(y)\}\) and \(H_L(x,y) = \max\{F(x)+G(y)-1, 0\}\).

    With some work, one can show that \(H_L\) and \(H_U\) are distribution functions on \(\mathbb{R}^2\) and that they have the correct marginals. In this post I would like to talk about how to construct random vectors \((\widetilde{X}_U, \widetilde{Y}_U) \sim H_U\) and \((\widetilde{X}_L, \widetilde{Y}_L) \sim H_L\).

    Let \(F^{-1}\) and \(G^{-1}\) be the quantile functions of \(F\) and \(G\). That is,

    \(F^{-1}(c) = \inf\{ x \in \mathbb{R} : F(x) \ge c\}\) and \(G^{-1}(c) = \inf\{ x \in \mathbb{R} : G(x) \ge c\}\).

    Now let \(V\) be a random variable that is uniformly distributed on \([0,1]\) and define

    \(\widetilde{X}_U = F^{-1}(V)\) and \(\widetilde{Y}_U = G^{-1}(V)\).

    Since \(F^{-1}(V) \le x\) if and only if \(V \le F(x)\), we have \(\widetilde{X}_U \stackrel{\text{dist}}{=} X\) and likewise \(\widetilde{Y}_U \stackrel{\text{dist}}{=} Y\). Furthermore \(\widetilde{X}_U \le x, \widetilde{Y}_U \le y\) occurs if and only if \(V \le F(x), V \le G(y)\) which is equivalent to \(V \le \min\{F(x),G(y)\}\). Thus

    \(\mathbb{P}(\widetilde{X}_U \le x, \widetilde{Y}_U \le y) = \mathbb{P}(V \le \min\{F(x),G(y)\})= \min\{F(x),G(y)\}.\)

    Thus \((\widetilde{X}_U,\widetilde{Y}_U)\) is distributed according to \(H_U\). We see that under the coupling \(H_U\), \(\widetilde{X}_U\) and \(\widetilde{Y}_U\) are closely related as they are both increasing functions of a common random variable \(V\).

    We can follow a similar construction for \(H_L\). Define

    \(\widetilde{X}_L = F^{-1}(V)\) and \(\widetilde{Y}_L = G^{-1}(1-V)\).

    Thus \(\widetilde{X}_L\) and \(\widetilde{Y}_L\) are again functions of a common random variable \(V\) but \(\widetilde{X}_L\) is an increasing function of \(V\) and \(\widetilde{Y}_L\) is a decreasing function of \(V\). Note that \(1-V\) is also uniformly distributed on \([0,1]\). Thus \(\widetilde{X}_L \stackrel{\text{dist}}{=} X\) and \(\widetilde{Y}_L \stackrel{\text{dist}}{=} Y\).

    Now \(\widetilde{X}_L \le x, \widetilde{Y}_L \le y\) occurs if and only if \(V \le F(x)\) and \(1-V \le G(y)\), which occurs if and only if \(1-G(y) \le V \le F(x)\). If \(1-G(y) \le F(x)\), then \(F(x)+G(y)-1 \ge 0\) and \(\mathbb{P}(1-G(y) \le V \le F(x)) = F(x)+G(y)-1\). On the other hand, if \(1-G(y) > F(x)\), then \(F(x)+G(y)-1< 0\) and \(\mathbb{P}(1-G(y) \le V \le F(x))=0\). Thus

    \(\mathbb{P}(\widetilde{X}_L \le x, \widetilde{Y}_L \le y) = \mathbb{P}(1-G(y) \le V \le F(x)) = \max\{F(x)+G(y)-1,0\}\),

    and so \((\widetilde{X}_L,\widetilde{Y}_L)\) is distributed according to \(H_L\).
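
    As a quick sanity check on these constructions, here is a short R sketch that simulates from \(H_U\) and \(H_L\). I use a standard normal \(F\) and a standard exponential \(G\) purely as illustrative choices; the particular distributions play no role in the argument above.

    set.seed(1)
    V <- runif(1e5)                  # one uniform random variable drives both coordinates
    
    # Upper coupling: both coordinates are increasing functions of V
    X_U <- qnorm(V); Y_U <- qexp(V)
    
    # Lower coupling: the second coordinate uses 1 - V instead
    X_L <- qnorm(V); Y_L <- qexp(1 - V)
    
    cor(X_U, Y_U)   # strongly positive
    cor(X_L, Y_L)   # strongly negative

    Both pairs have the same (correct) marginals; the difference is that the first pair always moves together while the second always moves in opposite directions.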

    What makes \(H_U\) and \(H_L\) extreme?

    Now that we know that \(H_U\) and \(H_L\) are indeed couplings, it is natural to ask what makes them “extreme”. What we would like to say is that \(\widetilde{Y}_U\) is an increasing function of \(\widetilde{X}_U\) and \(\widetilde{Y}_L\) is a decreasing function of \(\widetilde{X}_L\). Unfortunately this isn’t always the case as can be seen by taking \(X\) to be constant and \(Y\) to be continuous.

    However the intuition that \(\widetilde{Y}_U\) is increasing in \(\widetilde{X}_U\) and \(\widetilde{Y}_L\) is decreasing in \(\widetilde{X}_L\) is close to correct. Given a coupling \((\widetilde{X},\widetilde{Y}) \sim H\), we can look at the quantity

    \(C(x,y) = \mathbb{P}(\widetilde{Y} \le y | \widetilde{X} \le x) -\mathbb{P}(\widetilde{Y} \le y) = \frac{H(x,y)}{F(x)}-G(y)\)

    This quantity tells us something about how \(\widetilde{Y}\) changes with \(\widetilde{X}\). For instance if \(\widetilde{X}\) and \(\widetilde{Y}\) were positively correlated, then \(C(x,y)\) would be positive and if \(\widetilde{X}\) and \(\widetilde{Y}\) were negatively correlated, then \(C(x,y)\) would be negative.

    For the independent coupling \((\widetilde{X}_I,\widetilde{Y}_I) \sim H_I\), the quantity \(C(x,y)\) is constantly \(0\). It turns out that this quantity is maximised by the coupling \((\widetilde{X}_U, \widetilde{Y}_U) \sim H_U\) and minimised by \((\widetilde{X}_L,\widetilde{Y}_L) \sim H_L\), and it is in this sense that they are extremal. This final claim is the two-dimensional version of the Fréchet-Hoeffding Theorem and checking it is a good exercise.
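
    If you would like to see the ordering numerically before attempting the exercise, here is a small R sketch that evaluates \(C(x,y)\) under \(H_L\), \(H_I\) and \(H_U\) at a fixed point. The normal \(F\) and exponential \(G\) are again my own illustrative choices.

    # C(x, y) = H(x, y)/F(x) - G(y) for each of the three couplings
    C_at <- function(x, y) {
      Fx <- pnorm(x); Gy <- pexp(y)            # illustrative marginals F and G
      c(lower       = max(Fx + Gy - 1, 0) / Fx - Gy,
        independent = (Fx * Gy) / Fx - Gy,     # always zero
        upper       = min(Fx, Gy) / Fx - Gy)
    }
    
    C_at(0.5, 1)   # lower <= independent (= 0) <= upper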

  • You could have proved the Neyman-Pearson lemma

    The Neyman-Pearson lemma is a foundational and important result in the theory of hypothesis testing. When presented in class the proof seemed magical and I had no idea where the ideas came from. I even drew a face like this 😲 next to the usual \(\square\) in my book when the proof was finished. Later in class we learnt the method of undetermined multipliers and suddenly I saw where the Neyman-Pearson lemma came from.

    In this blog post, I’ll first give some background and set up notation for the Neyman-Pearson lemma. Then I’ll talk about the method of undetermined multipliers and show how it can be used to derive and prove the Neyman-Pearson lemma. Finally, I’ll write about why I still think the Neyman-Pearson lemma is magical despite the demystified proof.

    Background

    In the set up of the Neyman-Pearson lemma we have data \(x\) which is a realisation of some random variable \(X\). We wish to conclude something about the distribution of \(X\) from our data \(x\) by doing a hypothesis test.

    In the Neyman-Pearson lemma we have simple hypotheses. That is, our data either comes from the distribution \(\mathbb{P}_0\) or from the distribution \(\mathbb{P}_1\). Thus, our null hypothesis is \(H_0 : X \sim \mathbb{P}_0\) and our alternative hypothesis is \(H_1 : X \sim \mathbb{P}_1\).

    A test of \(H_0\) against \(H_1\) is a function \(\phi\) that takes in data \(x\) and returns a number \(\phi(x) \in [0,1]\). The value \(\phi(x)\) is the probability under the test \(\phi\) of rejecting \(H_0\) given the observed data \(x\). That is, if \(\phi(x) = 1\), we always reject \(H_0\) and if \(\phi(x)=0\) we never reject \(H_0\). For in-between values \(\phi(x) = \gamma \in (0,1)\), we reject \(H_0\) with probability \(\gamma\).

    An ideal test would have two desirable properties. We would like a test that rejects \(H_0\) with a low probability when \(H_0\) is true but we would also like the test to reject \(H_0\) with a high probability when \(H_1\) is true. To state this more formally, let \(\mathbb{E}_0[\phi(X)]\) and \(\mathbb{E}_1[\phi(X)]\) be the expectation of \(\phi(X)\) under \(H_0\) and \(H_1\) respectively. The quantity \(\mathbb{E}_0[\phi(X)]\) is the probability we reject \(H_0\) when \(H_0\) is true. Likewise, the quantity \(\mathbb{E}_1[\phi(X)]\) is the probability we reject \(H_0\) when \(H_1\) is true. An optimal test would be one that minimises \(\mathbb{E}_0[\phi(X)]\) and maximises \(\mathbb{E}_1[\phi(X)]\).

    Unfortunately the goals of minimising \(\mathbb{E}_0[\phi(X)]\) and maximising \(\mathbb{E}_1[\phi(X)]\) are at odds with one another. This is because we want \(\phi\) to be small in order to minimise \(\mathbb{E}_0[\phi(X)]\) and we want \(\phi\) to be large to maximise \(\mathbb{E}_1[\phi(X)]\). In nearly all cases we have to trade off between these two goals and there is no single test that simultaneously achieves both.

    To work around this, a standard approach is to focus on maximising \(\mathbb{E}_1[\phi(X)]\) while requiring that \(\mathbb{E}_0[\phi(X)]\) remains below some threshold. The quantity \(\mathbb{E}_1[\phi(X)]\) is called the power of the test \(\phi\). If \(\alpha\) is a number between \(0\) and \(1\), we will say that \(\phi\) has level-\(\alpha\) if \(\mathbb{E}_0[\phi(X)] \le \alpha\). A test \(\phi\) is said to be most powerful at level-\(\alpha\) if

    • The test \(\phi\) is level-\(\alpha\).
    • For all level-\(\alpha\) tests \(\phi'\), the test \(\phi\) is at least as powerful as \(\phi'\). That is,

    \(\mathbb{E}_1[\phi'(X)] \le \mathbb{E}_1[\phi(X)]\).

    Thus we can see that finding a most powerful level-\(\alpha\) test is a constrained optimisation problem. We wish to maximise the quantity

    \(\mathbb{E}_1[\phi(X)]\)

    subject to the constraint

    \(\mathbb{E}_0[\phi(X)] \le \alpha\)

    With this in mind, we turn to the method of undetermined multipliers.

    The method of undetermined multipliers

    The method of undetermined multipliers (also called the method of Lagrange multipliers) is a very general optimisation tool. Suppose that we have a set \(U\) and two functions \(f,g : U \to \mathbb{R}\), and we wish to maximise \(f(u)\) subject to the constraint \(g(u) \le 0\).

    In the context of hypothesis testing, the set \(U\) is the set of all tests \(\phi\). The objective function \(f\) is given by \(f(\phi) = \mathbb{E}_1[\phi(X)]\). That is, \(f(\phi)\) is the power of the test \(\phi\). The constraint function \(g\) is given by \(g(\phi)=\mathbb{E}_0[\phi(X)]-\alpha\) so that \(g(\phi) \le 0\) if and only if \(\phi\) has level-\(\alpha\).

    The method of undetermined multipliers allows us to reduce this constrained optimisation problem to a (hopefully easier) unconstrained optimisation problem. More specifically we have the following result:

    Proposition: Suppose that \(u^* \in U\) is such that:

    • \(g(u^*) = 0\),
    • There exists \(k \ge 0\), such that \(u^*\) maximises \(f(u)-kg(u)\) over all \(u \in U\).

    Then \(u^*\) maximises \(f(u)\) under the constraint \(g(u) \le 0\).

    Proof: Suppose that \(u^*\) satisfies the above two dot points. We need to show that for all \(u \in U\), if \(g(u) \le 0\), then \(f(u) \le f(u^*)\). By assumption we know that \(g(u^*)=0\) and that \(u^*\) maximises \(f(u)-kg(u)\). Thus, for all \(u \in U\),

    \(f(u^*)=f(u^*)-kg(u^*) \ge f(u)-kg(u)\).

    Now suppose \(g(u) \le 0\). Then, \(kg(u) \le 0\) and so \(f(u)-kg(u)\ge f(u)\) and so \(f(u^*) \ge f(u)\) as needed. \(\square\)

    The constant \(k\) is the undetermined multiplier. The way to use the method of undetermined multipliers is to find values \(u_k^*\) that maximise \(h_k(u) = f(u)-kg(u)\) for each \(k \ge 0\). The multiplier \(k\) is then varied so that the constraint \(g(u^*_k) = 0\) is satisfied.
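
    For a quick toy example of this recipe (my own, not from the course), suppose we want to maximise \(f(u) = 1-(u-2)^2\) over \(u \in \mathbb{R}\) subject to \(g(u) = u-1 \le 0\). For fixed \(k \ge 0\), the function \(h_k(u) = 1-(u-2)^2-k(u-1)\) is maximised at \(u_k^* = 2-k/2\). Choosing \(k = 2\) gives \(g(u_k^*) = 0\), so by the proposition \(u^* = 1\) maximises \(f\) subject to the constraint. This matches what we see directly: \(f\) is increasing on \((-\infty,2]\), so the best we can do with \(u \le 1\) is \(u = 1\).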

    Proving the Neyman-Pearson lemma

    Now let’s use the method of undetermined multipliers to find most powerful tests. Recall the set \(U\) which we are optimising over is the set of all tests \(\phi\). Recall also that we wish to optimise \(f(\phi) = \mathbb{E}_1[\phi(X)]\) subject to the constraint \(g(\phi) = \mathbb{E}_0[\phi(X)] – \alpha \le 0\). The method of undetermined multipliers says that we should consider maximising the function

    \(h_k(\phi) = \mathbb{E}_1[\phi(X)] – k\mathbb{E}_0[\phi(X)]\),

    where \(k \ge 0\). Suppose that both \(\mathbb{P}_0\) and \(\mathbb{P}_1\) have densities[1] \(p_0\) and \(p_1\) with respect to some measure \(\mu\). We can then write the above expectations as integrals. That is,

    \(\mathbb{E}_0[\phi(X)] = \int \phi(x)p_0(x)\mu(dx)\) and \(\mathbb{E}_1[\phi(X)] = \int \phi(x)p_1(x)\mu(dx)\).

    Thus the function \(h_k\) is equal to

    \(h_k(\phi) = \int \phi(x)p_1(x)\mu(dx)- k\int \phi(x)p_0(x)\mu(dx)=\int \phi(x)(p_1(x)-kp_0(x))\mu(dx)\).

    We now wish to maximise the function \(h_k(\phi)\). Recall that \(\phi\) is a function that takes values in \([0,1]\). Thus, the integral

    \(\int \phi(x)(p_1(x)-kp_0(x))\mu(dx)\),

    is maximised if and only if \(\phi(x)=1\) when \(p_1(x)-kp_0(x) > 0\) and \(\phi(x)=0\) when \(p_1(x)-kp_0(x) < 0\). Note that \(p_1(x)-kp_0(x) > 0\) if and only if \(\frac{p_1(x)}{p_0(x)} > k\). Thus for each \(k \ge 0\), a test \(\phi_k\) maximises \(h_k(\phi)\) if and only if

    \(\phi_k(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 &\text{if } \frac{p_1(x)}{p_0(x)} < k. \end{cases}\)

    The method of undetermined multipliers says that if we can find \(k\) so that the above is satisfied and \(g(\phi_k) = 0\), then \(\phi_k\) is a most powerful test. Recall that \(g(\phi_k)=0\) is equivalent to \(\mathbb{E}_0[\phi_k(X)]=\alpha\). By summarising the above argument, we arrive at the Neyman-Pearson lemma.

    Neyman-Pearson Lemma[2]: Suppose that \(\phi\) is a test such that

    • \(\mathbb{E}_0[\phi(X)] = \alpha\), and
    • For some \(k \ge 0\), \(\phi(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > k,\\ 0 & \text{if } \frac{p_1(x)}{p_0(x)}< k.\end{cases}\)

    then \(\phi\) is most powerful at level-\(\alpha\) for testing \(H_0 : X \sim \mathbb{P}_0\) against \(H_1 : X \sim \mathbb{P}_1\).
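
    To see the lemma in action, here is a small R sketch for the toy problem of testing \(H_0 : X \sim N(0,1)\) against \(H_1 : X \sim N(1,1)\) with a single observation (the example is mine, not part of the lemma). The likelihood ratio is \(\frac{p_1(x)}{p_0(x)} = \exp(x-1/2)\), which is increasing in \(x\), so the most powerful level-\(\alpha\) test rejects when \(x\) exceeds the \(1-\alpha\) quantile of the standard normal distribution.

    alpha <- 0.05
    c_alpha <- qnorm(1 - alpha)          # rejection threshold calibrated under H0
    
    phi <- function(x) as.numeric(x > c_alpha)   # the Neyman-Pearson test
    
    # Monte Carlo check of the level and the power
    set.seed(1)
    mean(phi(rnorm(1e5, mean = 0)))   # roughly alpha = 0.05
    mean(phi(rnorm(1e5, mean = 1)))   # the power, 1 - pnorm(c_alpha - 1), about 0.26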

    The magic of Neyman-Pearson

    By learning about undetermined multipliers I’ve been able to better understand the proof of the Neyman-Pearson lemma. I now view it as a clever solution to a constrained optimisation problem rather than something that comes out of nowhere.

    There is, however, a different aspect of Neyman-Pearson that continues to surprise me. This aspect is the usefulness of the lemma. At first glance the Neyman-Pearson lemma seems to be a very specialised result because it is about simple hypothesis testing. In reality most interesting hypothesis tests have composite nulls or composite alternatives or both. It turns out that Neyman-Pearson is still incredibly useful for composite testing. Through ideas like monotone likelihood ratios, least favourable distributions and unbiasedness, the Neyman-Pearson lemma or similar ideas can be used to find optimal tests in a variety of settings.

    Thus I must admit that the title of this blog post is a little inaccurate and deceptive. I do believe that, given the tools of undetermined multipliers and the set up of simple hypothesis testing, one is naturally led to the Neyman-Pearson lemma. However, I don’t believe that many could have realised how useful and interesting simple hypothesis testing would be.

    Footnotes

    1. The assumption that \(\mathbb{P}_0\) and \(\mathbb{P}_1\) have densities with respect to a common measure \(\mu\) is not restrictive since one can always take \(\mu = \mathbb{P}_0+\mathbb{P}_1\) and then apply the Radon-Nikodym theorem. However, there is often a more natural choice of \(\mu\), such as Lebesgue measure on \(\mathbb{R}^d\) or the counting measure on \(\mathbb{N}\).
    2. What I call the Neyman-Pearson lemma is really only a third of the Neyman-Pearson lemma. There are two other parts: one that guarantees the existence of a most powerful test and one that is a partial converse to the statement in this post.
  • An art and maths collaboration

    Over the course of the past year I have had the pleasure of working with the artist Sanne Carroll on her honours project at the Australian National University. I was one of two mathematics students who collaborated with Sanne. Throughout the project, Sanne drew patterns and would ask Ciaran and me to recreate them using some mathematical or algorithmic ideas. You can see the final version of the project here: https://www.sannecarroll.com/ (best viewed on a computer).

    I always loved the patterns Sanne drew and the final project is so well put together. Sanne does a great job of incorporating her drawings, the mathematical descriptions and the communication between her, Ciaran and me. Her website building skills also far surpass anything I’ve done on this blog!

    It was also a lot of fun to work with Sanne. Hearing about her patterns and talking about maths with her was always fun. I also learnt a few things about GeoGebra which made the animations in my previous post a lot quicker to make. Sanne has told me that she’ll be starting a PhD soon and I’m looking forward to any future collaborations that might arise.