Logistic regression with aggregate data

Logistic regression is a technique for analyzing data where the dependent variable is binary. In this post, we will introduce logistic regression and then discuss how to fit a logistic regression model using only aggregated count data.

Logistic regression overview

Logistic regression can be used when the dependent variable \(Y_i\) is binary. For example, in a medical study the dependent variable might be whether or not a patient developed a disease. In this case \(Y_i=1\) means that patient \(i\) did develop the disease and \(Y_i=0\) means that patient \(i\) did not. In logistic regression, the independent variables can be either categorical or quantitative. In the medical example, the type of treatment received could be a categorical independent variable and the age of the patient could be a quantitative independent variable.

Logistic regression models the probability that \(Y_i=1\) given the independent variables. If \(X_{i,1},\ldots,X_{i,p}\) are the values of the independent variables for patient \(i\), then logistic regression assumes that

\(\mathrm{Pr}(Y_i=1) = \sigma(\beta_0 + \beta_1 X_{i,1} + \cdots +\beta_p X_{i,p}) \),

where \(\mathrm{Pr}(Y_i=1)\) is the probability that \(Y_i = 1\), \(\sigma\) is the logistic function \(\sigma(x) = \frac{1}{1+\exp(-x)}\) and \(\beta_0,\beta_1,\ldots,\beta_p\) are parameters that will be estimated from the data. The interpretation of the parameter \(\beta_j\) is that if \(X_{i,j}\) increases by 1 unit and all the other independent variables are fixed, then the odds \(\frac{\mathrm{Pr}(Y_i=1)}{\mathrm{Pr}(Y_i=0)}\) will increase by a factor of \(\exp(\beta_j)\).
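To make the odds-ratio interpretation concrete, here is a small Python sketch (the coefficient values are made up for illustration):

```python
import math

def sigmoid(x):
    # Logistic function: sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def odds(p):
    # Odds Pr(Y = 1) / Pr(Y = 0) for a probability p
    return p / (1.0 - p)

# Hypothetical coefficients for a model with a single age variable.
beta_0, beta_age = -2.0, 0.1

# Probability of disease at age 50 and at age 51 (one unit higher).
p_50 = sigmoid(beta_0 + beta_age * 50)
p_51 = sigmoid(beta_0 + beta_age * 51)

# A one-unit increase in age multiplies the odds by exp(beta_age).
ratio = odds(p_51) / odds(p_50)
print(ratio, math.exp(beta_age))
```

The two printed numbers agree up to floating point, illustrating that each coefficient acts multiplicatively on the odds scale.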

Logistic regression models can be fit to data using the function glm() in the programming language R. For example, to fit a logistic regression model with dependent variable disease and independent variables treatment and age, you can use the code

logistic_model <- glm(
          disease ~ treatment + age, 
          data = data, 
          family = binomial)

To view the estimates of the parameters of the model (as well as standard errors and p-values), you can run the code summary(logistic_model).

Logistic regression with aggregate data

If all of the independent variables are categorical, then it is possible to fit a logistic model with the aggregate count data. For example, the table below is from a study from the 1980s/1990s on the effect of AZT treatment on slowing the development of AIDS.1

Race     AZT use   Symptoms   No symptoms
Black    Yes       11         52
Black    No        12         43
White    Yes       14         93
White    No        32         81

The study featured 338 veterans who were positive for HIV and were beginning to show signs of faltering immune systems. The veterans were randomly assigned either to receive AZT immediately or to receive AZT once their T-cells showed signs of severe immune weakness. The table records whether the veterans received AZT immediately, the race of the veterans and whether they developed AIDS symptoms during the three-year study period.

The table above shows aggregated data. The table reports the number of people for each combination of treatment, race and disease status. It does not report individual patient data. However, it is possible to fit a logistic regression model to the aggregate data.2 We just need to make some small changes to the R code.

To describe the changes, let’s first review what we would do with the patient level data. Suppose we had a dataset individual_data that had a row for each veteran and had columns race, azt_use and developed_symptoms that record the veteran’s race, AZT use and whether they developed symptoms. Our goal is to fit a logistic regression model where dependent variable \(Y_i\) is a binary variable with \(Y_i=1\) meaning that the patient did develop symptoms. The independent variables are race and AZT use. With the individual data, we could use the following:

individual_model <- glm(
    developed_symptoms ~ race + azt_use, 
    family = binomial, 
    data = individual_data)

Now suppose that we only have the aggregate data in a table called aggregate_data. This table has columns race, azt_use, symptoms and no_symptoms like the table presented above. To fit the logistic model with this data, there are two options.

  • The first option is to replace the dependent variable developed_symptoms in the individual model with cbind(symptoms, no_symptoms). The function cbind creates a matrix with two columns recording the number of people with and without symptoms for each combination of race and AZT use. We also need to tell R to use the aggregate data. Here is the R code:
aggregate_model_1 <- glm(
    cbind(symptoms, no_symptoms) ~ race + azt_use, 
    family = binomial, 
    data = aggregate_data)
  • The other option is to replace the dependent variable developed_symptoms with the proportion of people who developed symptoms. To use this option, it is also necessary to compute the total number of people for each combination of race and AZT use. These totals are used as weights in the logistic regression. In the code below, we first compute the proportions and totals and then fit the model.
aggregate_data$total <- aggregate_data$symptoms + aggregate_data$no_symptoms
aggregate_data$proportion <- aggregate_data$symptoms / aggregate_data$total

aggregate_model_2 <- glm(
    proportion ~ race + azt_use, 
    family = binomial,
    weights = total,
    data = aggregate_data)

Results

Both aggregate_model_1 and aggregate_model_2 produce the exact same parameter estimates, standard errors and p-values. More importantly, if we had the individual data, individual_model would also have the same results. We can actually check this by “de-aggregating” the data. This involves making a dataset with, for example, 11 rows with race = "Black", azt_use = "Yes" and developed_symptoms = 1 so that these 11 rows represent the 11 veterans in the first category of the table above.

All three logistic regression models give the same final results:

             Estimate   Standard error   p-value
Intercept    -1.07      0.263            4e-05
raceWhite    0.05       0.289            0.848
azt_useYes   -0.719     0.279            0.01

The parameter estimates show that Black and White veterans had approximately the same chance of developing symptoms (assuming that either both received AZT treatment immediately or neither did). On the other hand, the parameter estimate for AZT use is significant. The estimated value of -0.719 can be interpreted as follows. If the race of the veteran is held constant, then the odds of developing symptoms were about half as high for the veterans who immediately took AZT as for those who waited. In mathematical symbols:

\(\frac{\mathrm{Pr}(\text{developed symptoms} \mid \text{AZT immediately})}{\mathrm{Pr}(\text{did not develop symptoms} \mid \text{AZT immediately})}\)

is roughly equal to

\( 0.5 \times \frac{\mathrm{Pr}(\text{developed symptoms} \mid \text{waited for AZT})}{\mathrm{Pr}(\text{did not develop symptoms} \mid \text{waited for AZT})} \)

The specific value of 0.5 comes from the fact that \(\exp(-0.719) \approx 0.5\). Section 5.4.2 of Categorical Data Analysis by Alan Agresti discusses the interpretation of the parameter estimates for this example in more detail.

R code for fitting all three models is available at the end of this post.3

Footnotes

  1. The data on AZT is from Table 5.6 of Alan Agresti’s Categorical Data Analysis (3rd edition) https://onlinelibrary.wiley.com/doi/book/10.1002/0471249688, https://mybiostats.wordpress.com/wp-content/uploads/2015/03/3rd-ed-alan_agresti_categorical_data_analysis.pdf
  2. In statistical jargon, the aggregate data are sufficient statistics for the logistic regression model.
  3. R code:

# Aggregate data from Agresti Table 5.6
aggregate_data <- data.frame(
  race = c("Black", "Black", "White", "White"),
  azt_use = c("Yes", "No", "Yes", "No"),
  symptoms = c(11, 12, 14, 32),
  no_symptoms = c(52, 43, 93, 81)
)

# Aggregate model 1 using cbind()
aggregate_model_1 <- glm(
  cbind(symptoms, no_symptoms) ~ race + azt_use,
  data = aggregate_data,
  family = binomial)

# Aggregate model 2 using proportions and weights
aggregate_data$total <- aggregate_data$symptoms + aggregate_data$no_symptoms
aggregate_data$proportion <- aggregate_data$symptoms / aggregate_data$total

aggregate_model_2 <- glm(
  proportion ~ race + azt_use,
  data = aggregate_data,
  weights = total,
  family = binomial)

# Individual model

# Function to "de-aggregate" the counts into one row per veteran
deaggregate_data <- function(aggregate_data) {
  # Create an empty list to store rows
  rows_list <- list()
  
  # Iterate through each row of the aggregate data
  for (i in 1:nrow(aggregate_data)) {
    race_val <- aggregate_data$race[i]
    azt_val <- aggregate_data$azt_use[i]
    n_symptoms <- aggregate_data$symptoms[i]
    n_no_symptoms <- aggregate_data$no_symptoms[i]
    
    # Create rows for those with symptoms
    if (n_symptoms > 0) {
      rows_list[[length(rows_list) + 1]] <- data.frame(
        race = rep(race_val, n_symptoms),
        azt_use = rep(azt_val, n_symptoms),
        developed_symptoms = rep(1, n_symptoms)
      )
    }
    
    # Create rows for those without symptoms
    if (n_no_symptoms > 0) {
      rows_list[[length(rows_list) + 1]] <- data.frame(
        race = rep(race_val, n_no_symptoms),
        azt_use = rep(azt_val, n_no_symptoms),
        developed_symptoms = rep(0, n_no_symptoms)
      )
    }
  }
  
  # Combine all rows into one data frame
  do.call(rbind, rows_list)
}

individual_data <- deaggregate_data(aggregate_data)

individual_model <- glm(
  developed_symptoms ~ race + azt_use,
  data = individual_data,
  family = binomial
)

# Compare models
summary(aggregate_model_1)
summary(aggregate_model_2)
summary(individual_model)

The beta-binomial distribution

The beta-binomial model is a Bayesian model used to analyze rates. For a great derivation and explanation of this model, I highly recommend watching the second lecture from Richard McElreath’s course Statistical Rethinking. In this model, the data \(X\) is assumed to be binomially distributed with a fixed number of trials \(N\) but an unknown rate \(\rho \in [0,1]\). The rate \(\rho\) is given a \(\text{Beta}(a,b)\) prior. That is, the prior distribution of \(\rho\) has density

\(p(\rho) = \frac{1}{B(a,b)} \rho^{a-1}(1-\rho)^{b-1},\)

where \(B(a,b) =\int_0^1 \rho^{a-1}(1-\rho)^{b-1}d\rho\) is a normalizing constant. The model can thus be written as

\(\rho \sim \text{Beta}(a,b),\)
\(X | \rho \sim \text{Binom}(N,\rho).\)

This is a conjugate model, meaning that the posterior distribution of \(\rho\) is again a beta distribution. This can be seen by using Bayes’ rule:

\(p(\rho | X) \propto p(X| \rho)p(\rho) \propto \rho^{X+a-1}(1-\rho)^{(N-X)+b-1}.\)

The last expression is proportional to a beta density; specifically, \(\rho | X \sim \text{Beta}(X+a, N-X+b)\).
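The conjugate update is short enough to state in code. Here is a minimal Python sketch with made-up prior parameters and counts:

```python
# Conjugate update for the beta-binomial model: a Beta(a, b) prior and
# X successes in N trials give a Beta(X + a, N - X + b) posterior.
def beta_posterior(a, b, successes, failures):
    return successes + a, failures + b

# Hypothetical example: Beta(2, 2) prior, X = 7 successes in N = 10 trials.
a_post, b_post = beta_posterior(2, 2, 7, 3)
print(a_post, b_post)  # 9 5

# The posterior mean shrinks the raw rate 7/10 toward the prior mean 1/2.
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)  # 9/14, about 0.643
```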

The marginal distribution of \(X\)

In the above model we are given the distribution of \(\rho\) and the conditional distribution of \(X|\rho\). To calculate the distribution of \(X\), we thus need to marginalize over \(\rho\). Specifically,

\(\displaystyle{p(X) = \int_0^1 p(X,\rho)d\rho = \int_0^1 p(X| \rho)p(\rho)d\rho.}\)

The term inside the above integral is

\(\displaystyle{p(X| \rho)p(\rho) = \binom{N}{X}\rho^X(1-\rho)^{N-X}\frac{1}{B(a,b)}\rho^{a-1}(1-\rho)^{b-1} },\)

which is equal to \(\frac{\binom{N}{X}}{B(a,b)}\rho^{X+a-1}(1-\rho)^{N-X+b-1}\) and

\(\displaystyle{\int_0^1 \rho^{X+a-1}(1-\rho)^{N-X+b-1}d\rho=B(X+a, N-X+b)}.\)

Thus,

\(\displaystyle{p(X) = \binom{N}{X}\frac{B(X+a, N-X+b)}{B(a,b)}}.\)

This distribution is called the beta-binomial distribution. Below is an image from Wikipedia showing a graph of \(p(X)\) for \(N=10\) and a number of different values of \(a\) and \(b\). You can see that, especially for small values of \(a\) and \(b\), the distribution is a lot more spread out than the binomial distribution. This is because there is randomness coming from both \(\rho\) and the binomial conditional distribution.

[Figure: the beta-binomial distribution for different values of the parameters \(a\) and \(b\). For small values of \(a\) and \(b\), the distribution is very spread out.]
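The extra spread is easy to check numerically. The following Python sketch computes the beta-binomial pmf directly from the formula above and compares its variance to the binomial variance at the matching success probability (the parameter values are arbitrary):

```python
import math

def beta_fn(a, b):
    # Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_binomial_pmf(x, n, a, b):
    # p(X = x) = C(n, x) B(x + a, n - x + b) / B(a, b)
    return math.comb(n, x) * beta_fn(x + a, n - x + b) / beta_fn(a, b)

n, a, b = 10, 2, 2
pmf = [beta_binomial_pmf(x, n, a, b) for x in range(n + 1)]
print(sum(pmf))  # 1.0 up to rounding, so this really is a distribution

# Compare to a binomial with the same mean, using p = a / (a + b).
p = a / (a + b)
mean = sum(x * q for x, q in zip(range(n + 1), pmf))
var = sum((x - mean) ** 2 * q for x, q in zip(range(n + 1), pmf))
print(mean)                  # n * p = 5.0
print(var, n * p * (1 - p))  # 7.0 vs 2.5: overdispersion
```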

Two sample tests as correlation tests

Suppose we have two samples \(Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}\) and \(Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}\) and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.

The equal variance t-test

In the equal variance t-test, we assume that \(Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)\) and \(Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)\), where \(\sigma^2\) is unknown. Our hypothesis that \(Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}\) and \(Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}\) are from the same distribution becomes the hypothesis \(\mu_0 = \mu_1\). The test statistic is

\(t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}\),

where \(\overline{Y}^{(0)}\) and \(\overline{Y}^{(1)}\) are the two sample means. The variable \(\hat{\sigma}\) is the pooled estimate of the standard deviation and is given by

\(\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right)\).

Under the null hypothesis, \(t\) follows the T-distribution with \(n_0+n_1-2\) degrees of freedom. We thus reject the null \(\mu_0=\mu_1\) when \(|t|\) exceeds the \(1-\alpha/2\) quantile of the T-distribution.

Pooling the data

We can turn this two sample test into a correlation test by pooling the data and using a linear model. Let \(Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1}\) be the pooled data and for \(i = 1,\ldots, n_0+n_1\), define \(x_i \in \{0,1\}\) by

\(x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}\)

The assumptions that \(Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)\) and \(Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)\) can be rewritten as

\(Y_i = \beta_0+\beta_1x_i + \varepsilon_i,\)

where \(\varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)\). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \(\mu_0 = \mu_1\) is equivalent to \(\beta_1 = 0\). To test \(\beta_1 = 0\) we can use the standard t-test for a coefficient in linear model. The test statistic in this case is

\(t’ = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},\)

where \(\hat{\beta}_1\) is the ordinary least squares estimate of \(\beta_1\), \(X \in \mathbb{R}^{(n_0+n_1)\times 2}\) is the design matrix and \(\hat{\sigma}_{OLS}\) is an estimate of \(\sigma\) given by

\(\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,\)

where \(\hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1x_i\) is the fitted value of \(Y_i\).

It turns out that \(t’\) is exactly equal to \(t\). We can see this by writing out the design matrix and calculating everything above. The design matrix has rows \([1,x_i]\) and is thus equal to

\(X = \begin{bmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_{n_0}\\ 1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\ \vdots&\vdots\\ 1&0\\ 1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.\)

This implies that

\(X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},\)

and therefore,

\((X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .\)

Thus, \((X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}\). So,

\(t’ = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},\)

which is starting to look like \(t\) from the two-sample test. Now

\(X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}.\)

And so

\(\hat{\beta} = (X^TX)^{-1}X^TY =\begin{bmatrix} \overline{Y}^{(0)}\\ \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.\)

Thus, \(\hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)}\) and

\(t’ = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.\)

This means that to show \(t’ = t\), we only need to show that \(\hat{\sigma}_{OLS}^2=\hat{\sigma}^2\). To do this, note that the fitted values \(\hat{Y}_i\) are equal to

\(\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.\)

Thus, \(\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2\) is equal to

\(\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),\)

which is exactly \(\hat{\sigma}^2\). Therefore, \(t’=t\) and the two-sample t-test is equivalent to a correlation test.
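The identity \(t’ = t\) can also be checked numerically. The Python sketch below computes both statistics on two small made-up samples, solving the \(2 \times 2\) normal equations by hand rather than calling a regression routine:

```python
from statistics import mean

# Two small hypothetical samples.
y0 = [1.2, 0.8, 1.5, 1.1, 0.9]   # sample 0
y1 = [1.9, 2.1, 1.7, 2.4]        # sample 1
n0, n1 = len(y0), len(y1)

# The two-sample t statistic with the pooled variance estimate.
ybar0, ybar1 = mean(y0), mean(y1)
ss = sum((v - ybar0) ** 2 for v in y0) + sum((v - ybar1) ** 2 for v in y1)
sigma2 = ss / (n0 + n1 - 2)
t = (ybar1 - ybar0) / (sigma2 * (1 / n0 + 1 / n1)) ** 0.5

# The regression t statistic: pool the data, add the dummy variable x,
# and solve the normal equations X^T X beta = X^T Y by hand.
x = [0] * n0 + [1] * n1
y = y0 + y1
N = n0 + n1
sx, sxx = sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
det = N * sxx - sx * sx
beta0_hat = (sxx * sy - sx * sxy) / det
beta1_hat = (N * sxy - sx * sy) / det
fitted = [beta0_hat + beta1_hat * v for v in x]
sigma2_ols = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (N - 2)
inv11 = N / det  # the entry of (X^T X)^{-1} for beta_1
t_prime = beta1_hat / (sigma2_ols * inv11) ** 0.5

print(t, t_prime)
```

The two printed values coincide, as the derivation above predicts.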

The Friedman-Rafsky test

In the above example, we saw that the two sample t-test was a special case of the t-test for regressions. This is neat but both tests make very strong assumptions about the data. However, the same thing happens in a more interesting non-parametric setting.

In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two-sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued: they can come from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.

I plan to write a follow up post explaining these procedures but you can also read about it in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.

References

Persi Diaconis, Group Representations in Probability and Statistics, pp. 104-106, Hayward, CA: Institute of Mathematical Statistics (1988).

Jerome H. Friedman and Lawrence C. Rafsky, “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics 7(4), 697-717 (1979).

Jerome H. Friedman and Lawrence C. Rafsky, “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics 11(2), 377-391 (1983).

MCMC for hypothesis testing

A Monte Carlo significance test of the null hypothesis \(X_0 \sim H\) requires creating independent samples \(X_1,\ldots,X_B \sim H\). The idea is that if \(X_0 \sim H\) and, independently, \(X_1,\ldots,X_B\) are i.i.d. from \(H\), then for any test statistic \(T\), the rank of \(T(X_0)\) among \(T(X_0), T(X_1),\ldots, T(X_B)\) is uniformly distributed on \(\{1,\ldots, B+1\}\). This means that if \(T(X_0)\) is among the \(k\) largest values of \(T(X_0),\ldots,T(X_B)\), then we can reject the hypothesis \(X_0 \sim H\) at the significance level \(k/(B+1)\).
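Here is a minimal Python sketch of a plain Monte Carlo significance test; the null distribution and test statistic are arbitrary choices for illustration:

```python
import random

def mc_p_value(x0, null_samples, T):
    # p-value k / (B + 1), where k counts the samples (including x0
    # itself) whose statistic is at least T(x0).
    t0 = T(x0)
    k = 1 + sum(1 for x in null_samples if T(x) >= t0)
    return k / (len(null_samples) + 1)

random.seed(0)
B = 999
null_samples = [random.gauss(0.0, 1.0) for _ in range(B)]

# An unremarkable observation gets a large p-value...
print(mc_p_value(0.1, null_samples, abs))
# ...while an extreme observation gets a small one.
print(mc_p_value(4.0, null_samples, abs))
```

This only works because the samples are exact i.i.d. draws from the null distribution, which is precisely the assumption that fails for MCMC output.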

The advantage of Monte Carlo significance tests is that we do not need an analytic expression for the distribution of \(T(X_0)\) under \(X_0 \sim H\). By generating the i.i.d. samples \(X_1,\ldots,X_B\), we are building an empirical distribution that approximates the theoretical one. However, sometimes sampling \(X_1,\ldots, X_B\) is just as intractable as studying the distribution of \(T(X_0)\) theoretically. In such cases, approximate samples based on Markov chain Monte Carlo (MCMC) are often used instead. However, these samples are not independent and may not come from the true distribution. This means that a test using MCMC may not be statistically valid.

In the 1989 paper Generalized Monte Carlo significance tests, Besag and Clifford propose two methods that solve this exact problem. Their methods can be used in the same settings as MCMC, but they are statistically valid and correctly control the Type I error. In this post, I will describe just one of the methods: the serial test.

Background on Markov chains

To describe the serial test we will need to introduce some notation. Let \(P\) denote a transition matrix for a Markov chain on a discrete state space \(\mathcal{X}.\) A Markov chain with transition matrix \(P\) thus satisfies,

\(\mathbb{P}(X_n = x_n \mid X_0 = x_0,\ldots, X_{n-1} = x_{n-1}) = P(x_{n-1}, x_n).\)

Suppose that the transition matrix \(P\) has a stationary distribution \(\pi\). This means that if \((X_0, X_1,\ldots)\) is a Markov chain with transition matrix \(P\) and \(X_0\) is distributed according to \(\pi\), then \(X_1\) is also distributed according to \(\pi\). This implies that all \(X_i\) are distributed according to \(\pi\).

We can construct a new transition matrix \(Q\) from \(P\) and \(\pi\) by \(Q(y,x) = P(x,y) \frac{\pi(x)}{\pi(y)}\). The transition matrix \(Q\) is called the reversal of \(P\). This is because for all \(x\) and \(y\) in \(\mathcal{X}\), \(\pi(x)P(x,y) = \pi(y)Q(y,x)\). That is, the chance of drawing \(x\) from \(\pi\) and then transitioning to \(y\) according to \(P\) is equal to the chance of drawing \(y\) from \(\pi\) and then transitioning to \(x\) according to \(Q\).

The new transition matrix \(Q\) also allows us to reverse longer runs of the Markov chain. Fix \(n \ge 0\) and let \((X_0,X_1,\ldots,X_n)\) be a Markov chain with transition matrix \(P\) and initial distribution \(\pi\). Also let \((Y_0,Y_1,\ldots,Y_n)\) be a Markov chain with transition matrix \(Q\) and initial distribution \(\pi\), then

\((X_0,X_1,\ldots,X_{n-1}, X_n) \stackrel{dist}{=} (Y_n,Y_{n-1},\ldots, Y_1,Y_0) \),

where \(\stackrel{dist}{=}\) means equal in distribution.
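These properties are easy to verify numerically. The Python sketch below uses a small made-up transition matrix, approximates \(\pi\) by iterating the chain, builds the reversal \(Q\), and checks both that \(Q\) is a transition matrix and that \(\pi(x)P(x,y) = \pi(y)Q(y,x)\):

```python
# A made-up 3-state transition matrix (rows sum to 1).
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
n = len(P)

def step(dist, P):
    # One step of the chain: the distribution at time t + 1.
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]

# Approximate the stationary distribution by running the chain to convergence.
pi = [1 / 3] * n
for _ in range(500):
    pi = step(pi, P)

# The reversal: Q(y, x) = P(x, y) * pi(x) / pi(y).
Q = [[P[x][y] * pi[x] / pi[y] for x in range(n)] for y in range(n)]

# Rows of Q sum to 1 (this uses stationarity of pi), and the
# identity pi(x) P(x, y) = pi(y) Q(y, x) holds entrywise.
row_err = max(abs(sum(row) - 1.0) for row in Q)
rev_err = max(abs(pi[x] * P[x][y] - pi[y] * Q[y][x])
              for x in range(n) for y in range(n))
print(row_err, rev_err)  # both tiny
```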

The serial test

Suppose we want to test the hypothesis \(x \sim \pi\) where \(x \in \mathcal{X}\) is our observed data and \(\pi\) is some distribution on \(\mathcal{X}\). To conduct the serial test, we need to construct a Markov chain \(P\) for which \(\pi\) is a stationary distribution. We then also need to construct the reversal \(Q\) described above. There are many possible ways to construct \(P\) such as the Metropolis-Hastings algorithm.

We also need a test statistic \(T\). This is a function \(T:\mathcal{X} \to \mathbb{R}\) which we will use to detect outliers. This function is the same function we would use in a regular Monte Carlo test. Namely, we want to reject the null hypothesis when \(T(x)\) is much larger than we would expect under \(x \sim \pi\).

The serial test then proceeds as follows. First we pick \(\xi\) uniformly in \(\{0,1,\ldots,B\}\) and set \(X_\xi = x\). We then generate \((X_\xi, X_{\xi+1},\ldots,X_B)\) as a Markov chain with transition matrix \(P\) that starts at \(X_\xi = x\). Likewise, we generate \((X_\xi, X_{\xi-1},\ldots, X_0)\) as a Markov chain with transition matrix \(Q\) that starts at \(X_\xi = x\).

We then apply \(T\) to each of \((X_0,X_1,\ldots,X_\xi,\ldots,X_B)\) and count the number of \(b \in \{0,1,\ldots,B\}\) such that \(T(X_b) \ge T(X_\xi) = T(x)\). If there are \(k\) such \(b\), then the reported p-value of our test is \(k/(B+1)\).
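Here is a Python sketch of the whole procedure. For simplicity it uses a chain that is reversible with respect to \(\pi\) (a symmetric random walk on a cycle, whose stationary distribution is uniform), so the reversal \(Q\) is the same kernel as \(P\); the state space, chain and test statistic are all arbitrary choices for illustration:

```python
import random

random.seed(1)
M = 20    # state space {0, ..., M - 1}, arranged in a cycle
B = 499   # the chain will have B + 1 states in total

def rw_step(x):
    # Symmetric random walk on the cycle: pi is uniform and the chain
    # is reversible, so the reversal Q is this same kernel.
    return (x + random.choice([-1, 1])) % M

def serial_test(x, T, B=B):
    # Besag-Clifford serial test of the hypothesis x ~ pi.
    xi = random.randint(0, B)   # place x at a uniformly random position
    chain = [x]
    cur = x
    for _ in range(B - xi):     # forward run with P
        cur = rw_step(cur)
        chain.append(cur)
    cur = x
    for _ in range(xi):         # backward run with Q (= P here)
        cur = rw_step(cur)
        chain.insert(0, cur)
    k = sum(1 for y in chain if T(y) >= T(x))
    return k / (B + 1)

p = serial_test(random.randrange(M), T=lambda y: y)
print(p)
```

Since \(x\) itself is counted, the smallest possible p-value is \(1/(B+1)\).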

We will now show that this test produces a valid p-value. That is, when \(x \sim \pi\), the probability that \(k/(B+1)\) is less than \(\alpha\) is at most \(\alpha\). In symbols,

\(\mathbb{P}\left( \frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) \le \alpha\)

Under the null hypothesis \(x \sim \pi\), \((X_\xi, X_{\xi-1},\ldots, X_0)\) is equal in distribution to generating \(X_0\) from \(\pi\) and using the transition matrix \(P\) to go from \(X_i\) to \(X_{i+1}\).

Thus, under the null hypothesis, the distribution of \((X_0,X_1,\ldots,X_B)\) does not depend on \(\xi\). The entire procedure is equivalent to generating a Markov chain \(\mathbf{X} = (X_0,X_1,\ldots,X_B)\) with initial distribution \(\pi\) and transition matrix \(P\), and then choosing \(\xi \in \{0,1,\ldots,B\}\) independently of \(\mathbf{X}\). This is enough to show that the serial method produces valid p-values. The idea is that since the distribution of \(\mathbf{X}\) does not depend on \(\xi\) and \(\xi\) is uniformly distributed on \(\{0,1,\ldots,B\}\), the probability that \(T(X_\xi)\) is in the top \(\alpha\) proportion of \(T(X_b)\) should be at most \(\alpha\). This is proved more formally below.

For each \(i \in \{0,1,\ldots,B\}\), let \(A_i\) be the event that \(T(X_i)\) is in the top \(\alpha\) proportion of \(T(X_0),\ldots,T(X_B)\). That is,

\(A_i = \left\{ \frac{\# \{b : T(X_b) \ge T(X_i)\} }{B+1} \le \alpha \right\}\).

Let \(I_{A_i}\) be the indicator function for the event \(A_i\). Since at most \(\alpha(B+1)\) values of \(i\) can be in the top \(\alpha\) fraction of \(T(X_0),\ldots,T(X_B)\), we have that

\(\sum_{i=0}^B I_{A_i} \le \alpha(B+1)\).

Therefore, by linearity of expectation,

\(\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha(B+1)\).

By the law of total probability we have that \(\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right)\) is equal to

\(\sum_{i=0}^B \mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right)\mathbb{P}(\xi = i)\).

Since \(\xi\) is uniform on \(\{0,\ldots,B\}\), \(\mathbb{P}(\xi = i)\) is \(\frac{1}{B+1}\) for all \(i\). Furthermore, by independence of \(\xi\) and \(\mathbf{X}\), we have

\(\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha | \xi = i\right) = \mathbb{P}(A_i)\).

Thus, by our previous bound,

\(\mathbb{P}\left(\frac{\#\{b: T(X_b) \ge T(X_\xi)\}}{B+1} \le \alpha \right) = \frac{1}{B+1}\sum_{i=0}^B \mathbb{P}(A_i) \le \alpha\).

Applications

In the original paper, Besag and Clifford discuss how this procedure can be used to perform goodness-of-fit tests. They construct Markov chains that can test the Rasch model or the Ising model. More broadly, the method can be used to test goodness of fit for any exponential family by using the Markov chains developed by Diaconis and Sturmfels.

A similar method has also been applied more recently to detect gerrymandering. In this setting, the null hypothesis is the uniform distribution on all valid redistrictings of a state, and the test statistic \(T\) measures the political advantage of a given districting.

References

  1. Julian Besag and Peter Clifford, “Generalized Monte Carlo Significance Tests,” Biometrika 76(4) (1989)
  2. Persi Diaconis and Bernd Sturmfels, “Algebraic algorithms for sampling from conditional distributions,” The Annals of Statistics 26(1), 363-397 (1998)
  3. Maria Chikina et al., “Assessing significance in a Markov chain without mixing,” Proceedings of the National Academy of Sciences 114(11) (2017)

Sampling from the non-central chi-squared distribution

The non-central chi-squared distribution is a generalisation of the regular chi-squared distribution. The chi-squared distribution turns up in many statistical tests as the (approximate) distribution of a test statistic under the null hypothesis. Under alternative hypotheses, those same statistics often have approximate non-central chi-squared distributions.

This means that the non-central chi-squared distribution is often used to study the power of said statistical tests. In this post I give the definition of the non-central chi-squared distribution, discuss an important invariance property and show how to efficiently sample from this distribution.

Definition

Let \(Z\) be a normally distributed random vector with mean \(0\) and covariance \(I_n\). Given a vector \(\mu \in \mathbb{R}^n\), the non-central chi-squared distribution with \(n\) degrees of freedom and non-centrality parameter \(\Vert \mu\Vert_2^2\) is the distribution of the quantity

\(\Vert Z+\mu \Vert_2^2 = \sum\limits_{i=1}^n (Z_i+\mu_i)^2. \)

This distribution is denoted by \(\chi^2_n(\Vert \mu \Vert_2^2)\). As this notation suggests, the distribution of \(\Vert Z+\mu \Vert_2^2\) depends only on \(\Vert \mu \Vert_2^2\), the norm of \(\mu\). The first few times I heard this fact, I had no idea why it would be true (and even found it a little spooky). But, as we will see below, the result is actually a simple consequence of the fact that standard normal vectors are invariant under rotations.

Rotational invariance

Suppose that we have two vectors \(\mu, \nu \in \mathbb{R}^n\) such that \(\Vert \mu\Vert_2^2 = \Vert \nu \Vert_2^2\). We wish to show that if \(Z \sim \mathcal{N}(0,I_n)\), then

\(\Vert Z+\mu \Vert_2^2\) has the same distribution as \(\Vert Z + \nu \Vert_2^2\).

Since \(\mu\) and \(\nu\) have the same norm there exists an orthogonal matrix \(U \in \mathbb{R}^{n \times n}\) such that \(U\mu = \nu\). Since \(U\) is orthogonal and \(Z \sim \mathcal{N}(0,I_n)\), we have \(Z’=UZ \sim \mathcal{N}(U0,UU^T) = \mathcal{N}(0,I_n)\). Furthermore, since \(U\) is orthogonal, \(U\) preserves the norm \(\Vert \cdot \Vert_2^2\). This is because, for all \(x \in \mathbb{R}^n\),

\(\Vert Ux\Vert_2^2 = (Ux)^TUx = x^TU^TUx=x^Tx=\Vert x\Vert_2^2.\)

Putting all these pieces together we have

\(\Vert Z+\mu \Vert_2^2 = \Vert U(Z+\mu)\Vert_2^2 = \Vert UZ + U\mu \Vert_2^2 = \Vert Z’+\nu \Vert_2^2\).

Since \(Z\) and \(Z’\) have the same distribution, we can conclude that \(\Vert Z’+\nu \Vert_2^2\) has the same distribution as \(\Vert Z + \nu \Vert_2^2\). Since \(\Vert Z + \mu \Vert_2^2 = \Vert Z’+\nu \Vert_2^2\), we are done.

Sampling

Above we showed that the non-central chi-squared distribution \(\chi^2_n(\Vert \mu\Vert_2^2)\) depends only on the norm of the vector \(\mu\). We will now use this fact to give an algorithm that efficiently generates samples from \(\chi^2_n(\Vert \mu \Vert_2^2)\).

A naive way to sample from \(\chi^2_n(\Vert \mu \Vert_2^2)\) would be to sample \(n\) independent standard normal random variables \(Z_i\) and then return \(\sum_{i=1}^n (Z_i+\mu_i)^2\). But for large values of \(n\) this would be very slow as we have to simulate \(n\) auxiliary random variables \(Z_i\) for each sample from \(\chi^2_n(\Vert \mu \Vert_2^2)\). This approach would not scale well if we needed many samples.

An alternative approach uses the rotation invariance described above. The distribution \(\chi^2_n(\Vert \mu \Vert_2^2)\) depends only on \(\Vert \mu \Vert_2^2\) and not directly on \(\mu\). Thus, given \(\mu\), we could instead work with \(\nu = \Vert \mu \Vert_2 e_1\) where \(e_1\) is the vector with a \(1\) in the first coordinate and \(0\)s in all other coordinates. If we use \(\nu\) instead of \(\mu\), we have

\(\sum\limits_{i=1}^n (Z_i+\nu_i)^2 = (Z_1+\Vert \mu \Vert_2)^2 + \sum\limits_{i=2}^{n}Z_i^2.\)

The sum \(\sum_{i=2}^n Z_i^2\) follows the regular chi-squared distribution with \(n-1\) degrees of freedom and is independent of \(Z_1\). The regular chi-squared distribution is a special case of the gamma distribution and can be sampled efficiently with rejection sampling when the shape parameter is large.

The shape parameter for \(\sum_{i=2}^n Z_i^2\) is \(\frac{n-1}{2}\), so for large values of \(n\) we can efficiently sample a value \(Y\) that follows the same distribution as \(\sum_{i=2}^n Z_i^2 \sim \chi^2_{n-1}\). Finally, to get a sample from \(\chi^2_n(\Vert \mu \Vert_2^2)\), we independently sample \(Z_1\) and then return the sum \((Z_1+\Vert \mu\Vert_2)^2 +Y\).
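This two-part recipe is short in code. Here is a Python sketch using the standard library's gamma sampler for the central \(\chi^2_{n-1}\) part; the parameter values at the bottom are arbitrary:

```python
import math
import random

def sample_ncx2(n, lam):
    # One draw from the non-central chi-squared chi^2_n(lam): a single
    # normal for the first coordinate, shifted by ||mu||_2 = sqrt(lam),
    # plus a central chi^2_{n-1} draw, generated as a
    # Gamma(shape = (n - 1) / 2, scale = 2) variate.
    z1 = random.gauss(0.0, 1.0)
    y = random.gammavariate((n - 1) / 2.0, 2.0)
    return (z1 + math.sqrt(lam)) ** 2 + y

random.seed(42)
n, lam = 50, 9.0
draws = [sample_ncx2(n, lam) for _ in range(200_000)]
emp_mean = sum(draws) / len(draws)

# The non-central chi-squared has mean n + lam (and variance 2n + 4 lam).
print(emp_mean, n + lam)
```

The empirical mean lands close to the theoretical mean \(n + \Vert\mu\Vert_2^2\), while only two random variates are drawn per sample instead of \(n\).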

Conclusion

In this post, we saw that the rotational invariance of the standard normal distribution gives a similar invariance for the non-central chi-squared distribution.

This invariance allowed us to efficiently sample from the non-central chi-squared distribution. The sampling procedure worked by reducing the problem to sampling from the regular chi-squared distribution.

The same invariance property is also used to calculate the cumulative distribution function and density of the non-central chi-squared distribution, although the resulting formulas are not for the faint of heart.