Logistic regression with aggregate data

Logistic regression is a technique for analyzing data where the dependent variable is binary. In this post, we will introduce logistic regression and then discuss how to fit a logistic regression model using only aggregated count data.

Logistic regression overview

Logistic regression can be used when the dependent variable \(Y_i\) is binary. For example, in a medical study the dependent variable might be whether or not a patient developed a disease. In this case \(Y_i=1\) will mean that patient \(i\) did develop the disease and \(Y_i=0\) will mean that patient \(i\) did not develop the disease. In logistic regression, the independent variables can either be categorical or quantitative. In the medical example, the type of treatment received could be a categorical independent variable and the age of the patient could be a quantitative independent variable.

Logistic regression models the probability that \(Y_i=1\) given the independent variables. If \(X_{i,1},\ldots,X_{i,p}\) are the values of the independent variables for patient \(i\), then logistic regression assumes that

\(\mathrm{Pr}(Y_i=1) = \sigma(\beta_0 + \beta_1 X_{i,1} + \cdots +\beta_p X_{i,p}) \),

where \(\mathrm{Pr}(Y_i=1)\) is the probability that \(Y_i = 1\), \(\sigma\) is the logistic function \(\sigma(x) = \frac{1}{1+\exp(-x)}\) and \(\beta_0,\beta_1,\ldots,\beta_p\) are parameters that will be estimated from the data. The interpretation of the parameter \(\beta_j\) is that if \(X_{i,j}\) increases by 1 unit and all the other independent variables are fixed, then the odds \(\frac{\mathrm{Pr}(Y_i=1)}{\mathrm{Pr}(Y_i=0)}\) will be multiplied by \(\exp(\beta_j)\).
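As a quick numerical sanity check of this odds-ratio interpretation, here is a short Python sketch. The coefficient and covariate values are made up purely for illustration:

```python
import math

def sigma(x):
    return 1 / (1 + math.exp(-x))

beta0, beta1 = -1.0, 0.4   # hypothetical parameter values
x1 = 2.0                   # hypothetical value of the independent variable

p_before = sigma(beta0 + beta1 * x1)
p_after = sigma(beta0 + beta1 * (x1 + 1))   # increase X by one unit

odds_ratio = (p_after / (1 - p_after)) / (p_before / (1 - p_before))
print(odds_ratio, math.exp(beta1))  # the two values agree
```

Whatever the starting value of \(x_1\), the ratio of the odds is exactly \(\exp(\beta_1)\).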

Logistic regression models can be fit to data using the function glm() in the programming language R. For example, to fit a logistic regression model with dependent variable disease and independent variables treatment and age, you can use the code

logistic_model <- glm(
          disease ~ treatment + age, 
          data = data, 
          family = binomial)

To view the estimates of the parameters of the model (as well as standard errors and p-values), you can run the code summary(logistic_model).

Logistic regression with aggregate data

If all of the independent variables are categorical, then it is possible to fit a logistic model with the aggregate count data. For example, the table below is from a study from the 1980s/1990s on the effect of AZT treatment on slowing the development of AIDS.1

Race    AZT use    Symptoms    No symptoms
Black   Yes        11          52
Black   No         12          43
White   Yes        14          93
White   No         32          81

The study featured 338 veterans who were positive for HIV and were beginning to show signs of faltering immune systems. The veterans were randomly assigned to either immediately receive AZT or to receive AZT once their T-cells showed signs of severe immune weakness. The table records whether the veterans received AZT immediately, the race of the veterans, and whether they developed AIDS symptoms during the three-year study period.

The table above shows aggregated data: it reports the number of people for each combination of treatment, race and disease status, not individual patient data. However, it is possible to fit a logistic regression model to the aggregate data.2 We just need to make some small changes to the R code.

To describe the changes, let’s first review what we would do with the patient-level data. Suppose we had a dataset individual_data that had a row for each veteran and had columns race, azt_use and developed_symptoms that record the veteran’s race, AZT use and whether they developed symptoms. Our goal is to fit a logistic regression model where the dependent variable \(Y_i\) is a binary variable with \(Y_i=1\) meaning that the patient did develop symptoms. The independent variables are race and AZT use. With the individual data, we could use the following:

individual_model <- glm(
    developed_symptoms ~ race + azt_use, 
    family = binomial, 
    data = individual_data)

Now suppose that we only have the aggregate data in a table called aggregate_data. This table has columns race, azt_use, symptoms and no_symptoms like the table presented above. To fit the logistic model with this data, there are two options.

  • The first option is to replace the dependent variable developed_symptoms in the individual model with cbind(symptoms, no_symptoms). The function cbind creates a matrix with two columns recording the number of people with and without symptoms for each combination of race and AZT use. We also need to tell R to use the aggregate data. Here is the R code:
aggregate_model_1 <- glm(
    cbind(symptoms, no_symptoms) ~ race + azt_use, 
    family = binomial, 
    data = aggregate_data)
  • The other option is to replace the dependent variable developed_symptoms with the proportion of people who developed symptoms. To use this option, it is also necessary to compute the total number of people for each combination of race and AZT use. These totals are used as weights in the logistic regression. In the code below, we first compute the proportion and the total number of people and then fit the model.
aggregate_data$total <- aggregate_data$symptoms + aggregate_data$no_symptoms
aggregate_data$proportion <- aggregate_data$symptoms / aggregate_data$total

aggregate_model_2 <- glm(
    proportion ~ race + azt_use, 
    family = binomial,
    weights = total,
    data = aggregate_data)

Results

Both aggregate_model_1 and aggregate_model_2 produce the exact same parameter estimates, standard errors and p-values. More importantly, if we had the individual data, individual_model would also have the same results. We can actually check this by “de-aggregating” the data. This involves making a dataset with, for example, 11 rows with race = "Black", azt_use = "Yes" and developed_symptoms = 1 so that these 11 rows represent the 11 veterans in the first category of the table above.

All three logistic regression models give the same final results:

              Estimate    Standard error    p-value
Intercept     -1.07       0.263             4e-05
raceWhite     0.05        0.289             0.848
azt_useYes    -0.719      0.279             0.01

The parameter estimates show that Black and White veterans had approximately the same chance of developing symptoms (assuming that either both received AZT treatment or neither of them received AZT treatment). On the other hand, the parameter estimate for AZT use is significant. The estimated value of -0.719 can be interpreted as follows. If the race of the veteran is held constant, then the odds of developing symptoms were half as high for the veterans who immediately took AZT as for those who waited. In mathematical symbols:

\(\frac{\mathrm{Pr}(\text{developed symptoms} \mid \text{AZT immediately})}{\mathrm{Pr}(\text{did not develop symptoms} \mid \text{AZT immediately})}\)

is roughly equal to

\( 0.5 \times \frac{\mathrm{Pr}(\text{developed symptoms} \mid \text{waited for AZT})}{\mathrm{Pr}(\text{did not develop symptoms} \mid \text{waited for AZT})} \)

The specific value of 0.5 comes from the fact that \(\exp(-0.719) \approx 0.5\). Section 5.4.2 of Categorical Data Analysis by Alan Agresti discusses the interpretation of the parameter estimates for this example in more detail.

R code for fitting all three models is available at the end of this post.3

Footnotes

  1. The data on AZT is from Table 5.6 of Alan Agresti’s Categorical Data Analysis (3rd edition) https://onlinelibrary.wiley.com/doi/book/10.1002/0471249688, https://mybiostats.wordpress.com/wp-content/uploads/2015/03/3rd-ed-alan_agresti_categorical_data_analysis.pdf ↩︎
  2. In statistical jargon, the aggregate data are sufficient statistics for the logistic regression model. ↩︎
  3. R code: ↩︎

# Aggregate data from Agresti Table 5.6
aggregate_data <- data.frame(
  race = c("Black", "Black", "White", "White"),
  azt_use = c("yes", "no", "yes", "no"),
  symptoms = c(11, 12, 14, 32),
  no_symptoms = c(52, 43, 93, 81)
)

# Aggregate model 1 using cbind()
aggregate_model_1 <- glm(
  cbind(symptoms, no_symptoms) ~ race + azt_use,
  data = aggregate_data,
  family = binomial)

# Aggregate model 2 using proportions and weights
aggregate_data$total <- aggregate_data$symptoms + aggregate_data$no_symptoms
aggregate_data$proportion <- aggregate_data$symptoms / aggregate_data$total

aggregate_model_2 <- glm(
  proportion ~ race + azt_use,
  data = aggregate_data,
  weights = total,
  family = binomial)

# Individual model

# Function to "de-aggregate"
deaggregate_data <- function(aggregate_data) {
  # Create an empty list to store rows
  rows_list <- list()
  
  # Iterate through each row of aggregate data
  for (i in 1:nrow(aggregate_data)) {
    race_val <- aggregate_data$race[i]
    azt_val <- aggregate_data$azt_use[i]
    n_symptoms <- aggregate_data$symptoms[i]
    n_no_symptoms <- aggregate_data$no_symptoms[i]
    
    # Create rows for those with symptoms
    if (n_symptoms > 0) {
      rows_list[[length(rows_list) + 1]] <- data.frame(
        race = rep(race_val, n_symptoms),
        azt_use = rep(azt_val, n_symptoms),
        developed_symptoms = rep(1, n_symptoms)
      )
    }
    
    # Create rows for those without symptoms
    if (n_no_symptoms > 0) {
      rows_list[[length(rows_list) + 1]] <- data.frame(
        race = rep(race_val, n_no_symptoms),
        azt_use = rep(azt_val, n_no_symptoms),
        developed_symptoms = rep(0, n_no_symptoms)
      )
    }
  }
  
  # Combine all rows into one data frame
  do.call(rbind, rows_list)
}

individual_data <- deaggregate_data(aggregate_data)

individual_model <- glm(
  developed_symptoms ~ race + azt_use,
  data = individual_data,
  family = binomial
)

# Compare models
summary(aggregate_model_1)
summary(aggregate_model_2)
summary(individual_model)

Poisson approximations to the negative binomial distribution

This post is an introduction to the negative binomial distribution and a discussion of different ways of approximating the negative binomial distribution.

The negative binomial distribution describes the number of times a coin lands on tails before a certain number of heads are recorded. The distribution depends on two parameters \(p\) and \(r\). The parameter \(p\) is the probability that the coin lands on heads and \(r\) is the number of heads. If \(X\) has the negative binomial distribution, then \(X = x\) means that in the first \(x+r-1\) tosses of the coin there were \(r-1\) heads and that toss number \(x+r\) was a head. This means that the probability that \(X=x\) is given by

\(\displaystyle{f(x) = \binom{x+r-1}{r-1}p^{r}\left(1-p\right)^x}\)

Here is a plot of the function \(f(x)\) for different values of \(r\) and \(p\).
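The pmf above is easy to check numerically. As a sketch, the snippet below compares a direct implementation of \(f(x)\) against scipy.stats.nbinom, which uses the same convention (number of failures before the \(r\)-th success); the values of \(r\) and \(p\) are arbitrary:

```python
from math import comb
from scipy.stats import nbinom

def nb_pmf(x, r, p):
    """P(X = x): probability of x tails before the r-th head."""
    return comb(x + r - 1, r - 1) * p**r * (1 - p)**x

r, p = 5, 0.6
for x in range(20):
    assert abs(nb_pmf(x, r, p) - nbinom.pmf(x, r, p)) < 1e-12

print("pmf matches scipy")
```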

Poisson approximations

When the parameter \(r\) is large and \(p\) is close to one, the negative binomial distribution can be approximated by a Poisson distribution. More formally, suppose that \(r(1-p)=\lambda\) for some positive real number \(\lambda\). If \(r\) is large, then the negative binomial random variable with parameters \(p\) and \(r\) converges to a Poisson random variable with parameter \(\lambda\). This is illustrated in the picture below where three negative binomial distributions with \(r(1-p)=5\) approach the Poisson distribution with \(\lambda =5\).

Total variation distance is a common way to measure the distance between two discrete probability distributions. The log-log plot below shows that the error from the Poisson approximation is on the order of \(1/r\) and that the error is bigger if the limiting value of \(r(1-p)\) is larger.

It turns out that it is possible to get a more accurate approximation by using a different Poisson distribution. In the first approximation, we used a Poisson random variable with mean \(\lambda = r(1-p)\). However, the mean of the negative binomial distribution is \(r(1-p)/p\). This suggests that we can get a better approximation by setting \(\lambda = r(1-p)/p\).

The change from \(\lambda = r(1-p)\) to \(\lambda = r(1-p)/p\) is small because \(p \approx 1\). However, this small change gives a much better approximation, especially for larger values of \(r(1-p)\). The plot below shows that both approximations have errors on the order of \(1/r\), but the constant for the second approximation is much better.
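The comparison between the two Poisson approximations can be sketched numerically. The snippet below computes the total variation distance for one hypothetical choice of \(r\) and \(p\), chosen so that \(r(1-p)/p = 5\):

```python
import numpy as np
from scipy.stats import nbinom, poisson

def tv_distance(pmf1, pmf2):
    """Total variation distance between two pmfs on the same grid."""
    return 0.5 * np.abs(pmf1 - pmf2).sum()

x = np.arange(2000)    # covers essentially all of the probability mass
r = 100
p = r / (r + 5)        # hypothetical: chosen so that r(1-p)/p = 5

nb = nbinom.pmf(x, r, p)
err1 = tv_distance(nb, poisson.pmf(x, r * (1 - p)))      # lambda = r(1-p)
err2 = tv_distance(nb, poisson.pmf(x, r * (1 - p) / p))  # lambda = r(1-p)/p
print(err2 < err1)  # the mean-matched approximation is closer
```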

Second order accurate approximation

It is possible to further improve the Poisson approximation by using a Gram–Charlier expansion. A Gram–Charlier approximation for the Poisson distribution is given in this paper.1 The approximation is

\(\displaystyle{f_{GC}(x) = P_\lambda(x) - \frac{1}{2}q\left((x-\lambda)P_\lambda(x)-(x-1-\lambda)P_\lambda(x-1)\right)},\)

where \(\lambda = \frac{r(1-p)}{p}\), \(q=1-p\), and \(P_\lambda(x)\) is the Poisson pmf evaluated at \(x\).

The Gram–Charlier expansion is considerably more accurate than either Poisson approximation. The errors are on the order of \(1/r^2\). This higher accuracy means that the error curves for the Gram–Charlier expansion have a steeper slope.
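Here is a sketch of the Gram–Charlier correction implemented directly from the equation above, compared against the mean-matched Poisson approximation (the values of \(r\) and \(p\) are illustrative):

```python
import numpy as np
from scipy.stats import nbinom, poisson

def gram_charlier_pmf(x, r, p):
    """Corrected Poisson pmf from the equation above, with lambda = r(1-p)/p."""
    lam = r * (1 - p) / p
    q = 1 - p
    return (poisson.pmf(x, lam)
            - 0.5 * q * ((x - lam) * poisson.pmf(x, lam)
                         - (x - 1 - lam) * poisson.pmf(x - 1, lam)))

r = 200
p = r / (r + 5)        # hypothetical: chosen so that r(1-p)/p = 5
x = np.arange(2000)

nb = nbinom.pmf(x, r, p)
tv_gc = 0.5 * np.abs(nb - gram_charlier_pmf(x, r, p)).sum()
tv_poisson = 0.5 * np.abs(nb - poisson.pmf(x, r * (1 - p) / p)).sum()
print(tv_gc < tv_poisson)  # the expansion should be more accurate
```

Note that the correction term telescopes to zero when summed over \(x\), so \(f_{GC}\) still sums to one.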

  1. The approximation is given in equation (4) of the paper and is stated in terms of the CDF instead of the PMF. The equation also contains a small typo, it should say \(\frac{1}{2}q\) instead of \(\frac{1}{2}p\). ↩︎

Jigsaw puzzles and the isoperimetric inequality

The 2025 World Jigsaw Puzzle Championship recently took place in Spain. In the individual heats, competitors had to choose their puzzle. Each competitor could either complete a 500-piece rectangular puzzle or a 500-piece circular puzzle. For example, in Round F, puzzlers could choose one of the following two puzzles:

There are many things to consider when picking a puzzle. First, the puzzles had already been released. This meant a puzzler could get an advantage by choosing to do a puzzle they had already completed. The image on the puzzle is also important, as is the puzzle’s style. In the example above, most puzzlers picked the circular puzzle because it had a clear colour gradient and distinct sections.

The commentators mentioned that puzzlers might also consider the number of edge pieces. The circular puzzles have fewer edge pieces than the rectangular puzzles. This could make the rectangular puzzle easier because some puzzlers like to do the edge first and then build inwards.

The difference in edge pieces isn’t a coincidence. Circular puzzles have fewer edge pieces than any other 500-piece puzzle. This fact is a consequence of a piece of mathematics called the isoperimetric inequality.

The isoperimetric inequality

The isoperimetric inequality is a statement about 2D shapes. The word ‘isoperimetric’ literally translates to ‘having the same perimeter.’ The isoperimetric inequality says that if a shape has the same perimeter as a particular circle, then the circle has a larger (or equal) area. For example, the square and circle below have the same perimeter but the circle has a larger area:

The isoperimetric inequality can also be flipped around to give a statement about shapes with the same area. This ‘having the same area’ inequality states that if a shape has the same area as a circle, then the circle has a smaller (or equal) perimeter. For example, the square and circle below have the same area but the circle has a smaller perimeter:

The ‘having the same area’ inequality applies to the jigsaw puzzles above. In a jigsaw puzzle, the area is proportional to the number of pieces and the perimeter is proportional to the number of edge pieces (assuming the pieces are all roughly the same size and shape). This means that if one puzzle has the same number of pieces as a circular puzzle (for example 500), then the circular puzzle will have fewer edge pieces! This is exactly what the commentator said about the puzzles in the competition!
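A rough back-of-the-envelope sketch makes the comparison concrete. Assuming every piece has unit area and the number of edge pieces scales with the perimeter, we can compare a 500-piece square puzzle with a 500-piece circular puzzle:

```python
import math

pieces = 500   # treat every piece as having unit area

# for a fixed area A: square perimeter = 4*sqrt(A), circle perimeter = 2*sqrt(pi*A)
square_perimeter = 4 * math.sqrt(pieces)
circle_perimeter = 2 * math.sqrt(math.pi * pieces)

print(round(square_perimeter, 1))  # 89.4
print(round(circle_perimeter, 1))  # 79.3 (fewer edge-piece units for the circle)
```

The circle's perimeter is shorter by a factor of \(\sqrt{\pi}/2 \approx 0.886\), exactly as the isoperimetric inequality predicts.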

I don’t think many of the puzzlers thought about the isoperimetric inequality when picking their puzzles. But if you ever do a circular puzzle, remember that you’re doing a puzzle with as few edge pieces as possible!

The Multivariate Hypergeometric Distribution

The hypergeometric distribution describes the number of white balls in a sample of \(n\) balls drawn without replacement from an urn with \(m_W\) white balls and \(m_B\) black balls. The probability of having \(x\) white balls in the sample is:

\(\displaystyle{f(x; m_W, m_B, n) = \frac{\binom{m_W}{x} \binom{m_B}{n-x}}{\binom{m_W+m_B}{n}}}\)

This is because there are \(\binom{m_W+m_B}{n}\) possible ways to sample \(n\) balls from the urn, which has \(m_W+m_B\) balls in total. Of these possibilities, there are \(\binom{m_W}{x} \binom{m_B}{n-x}\) ways of selecting exactly \(x\) white balls.
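As a quick numerical check of this pmf with a small hypothetical urn (note that scipy's hypergeom is parametrized by the total number of balls, the number of white balls, and the sample size, in that order):

```python
from math import comb
from scipy.stats import hypergeom

m_W, m_B, n = 7, 5, 6   # hypothetical urn: 7 white balls, 5 black balls, sample of 6

def hyper_pmf(x):
    return comb(m_W, x) * comb(m_B, n - x) / comb(m_W + m_B, n)

# scipy parametrizes as hypergeom(M = total balls, n = white balls, N = sample size)
for x in range(n + 1):
    assert abs(hyper_pmf(x) - hypergeom.pmf(x, m_W + m_B, m_W, n)) < 1e-12

print("pmf matches scipy")
```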

The multivariate hypergeometric distribution is a generalization of the hypergeometric distribution. In the multivariate hypergeometric distribution, there are \(k\) colours instead of \(2\). Suppose that the urn contains \(m_1\) balls of the first colour, \(m_2\) balls of the second colour and so on. The total number of balls in the urn is \(M = m_1+m_2+\cdots + m_k\). Next we draw \(n\) balls from the urn. Let \(X_i\) be the number of balls of colour \(i\) in our sample. The vector \(X=(X_1,X_2,\ldots, X_k)\) follows the multivariate hypergeometric distribution. The vector \(X\) satisfies the constraints \(X_i \in \{0,1,\ldots,m_i\}\) and \(\sum_{i=1}^k X_i = n\). The probability of observing \(X=x\) is

\(\displaystyle{f(x; m, n) = \frac{\binom{m_1}{x_1} \binom{m_2}{x_2}\cdots \binom{m_k}{x_k}}{\binom{M}{n} }, \quad x_1+x_2+\cdots+x_k = n}\)

This is because there are \(\binom{M}{n}\) possible ways of sampling \(n\) balls from the urn that has \(M\) balls in total. Likewise, the number of ways of choosing \(x_i\) balls of colour \(i\) is \(\binom{m_i}{x_i}\).
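The multivariate pmf can be implemented directly from this formula. As a sanity check, with \(k=2\) colours it should reduce to the ordinary hypergeometric pmf (the urn counts below are made up):

```python
from math import comb
from scipy.stats import hypergeom

def mv_hypergeom_pmf(x, m):
    """P(X = x) for a multivariate hypergeometric urn with colour counts m."""
    M, n = sum(m), sum(x)
    numerator = 1
    for xi, mi in zip(x, m):
        numerator *= comb(mi, xi)
    return numerator / comb(M, n)

# with k = 2 colours this is the ordinary hypergeometric pmf
assert abs(mv_hypergeom_pmf([2, 4], [7, 5]) - hypergeom.pmf(2, 12, 7, 6)) < 1e-12

print(mv_hypergeom_pmf([2, 3, 1], [5, 4, 3]))  # 10*4*3/924, roughly 0.1299
```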

Sampling

There are two common methods for sampling from the multivariate hypergeometric distribution. The first option is to sample \(n\) numbers from \(1,\ldots,M\) without replacement. The numbers \(1,\ldots,M\) represent the balls in the urn: we can let \(1,\ldots,m_1\) represent the balls with colour 1, let \(m_1+1,\ldots,m_1+m_2\) represent the balls with colour 2, and so on. Suppose that \(Y_1,\ldots,Y_n\) is the sample of size \(n\) drawn without replacement from \(1,\ldots,M\). Next define:

\(\displaystyle{X_i = \#\left\{j :\sum_{k=1}^{i-1}m_k < Y_j \le \sum_{k=1}^i m_k\right\}}\)

Then the vector \(X=(X_1,\ldots,X_k)\) will be distributed according to the multivariate hypergeometric distribution. This method of sampling has complexity on the order of \(n\log(M)\). This is because sampling a random number from \(1,\ldots,M\) has complexity \(\log(M)\) and we need to sample \(n\) of these numbers.
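This first method can be sketched in a few lines of numpy: label each ball with its colour, draw \(n\) labels without replacement, and count how many fall in each colour. (Storing each ball's colour directly as its label avoids the cumulative-sum bookkeeping in the definition above.)

```python
import numpy as np

def mv_hypergeom_sample_counts(m, n, rng=None):
    """Sample by drawing n labelled balls without replacement and counting colours."""
    rng = np.random.default_rng() if rng is None else rng
    m = np.asarray(m)
    colours = np.repeat(np.arange(len(m)), m)    # one entry per ball, storing its colour
    sample = rng.choice(colours, size=n, replace=False)
    return np.bincount(sample, minlength=len(m))

counts = mv_hypergeom_sample_counts([10, 5, 3], n=6, rng=np.random.default_rng(0))
print(counts.sum())  # always equals the sample size, 6
```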

There is a second method of sampling that is more efficient when \(n\) is large and \(k\) is small. This is a sequential method that samples each \(X_i\) from a (univariate) hypergeometric distribution. The idea is that when sampling \(X_1\) we can treat the balls of colour \(1\) as “white” and all the other colours as “black”. This means that marginally, \(X_1\) is hypergeometric with parameters \(m_W = m_1\), \(m_B = m_2+\cdots + m_k\) and sample size \(n\). Furthermore, conditional on \(X_1\), the second entry, \(X_2\), is also hypergeometric. The parameters for \(X_2\) are \(m_W = m_2\), \(m_B = m_3+\cdots+m_k\) and sample size \(n-X_1\). In fact, all of the conditionals of \(X\) are hypergeometric random variables and so can be sampled sequentially. The Python code at the end of this post implements the sequential sampler.

The complexity of the sequential method is bounded above by \(k\) times the complexity of sampling from the hypergeometric distribution. By using the ratio-of-uniforms method, this can be done in complexity on the order of \(\log(m_W+m_B)\). The overall complexity of the sequential method is therefore on the order of \(k\log(M)\), which can be much smaller than \(n\log(M)\).

import numpy as np

def multivariate_hypergeometric_sample(m, n):
    """Sequentially sample a multivariate hypergeometric vector.

    m is the vector of ball counts per colour and n is the sample size.
    """
    x = np.zeros_like(m)
    k = len(m)
    for i in range(k-1):
        # Colour i is "white"; all remaining colours together are "black".
        x[i] = np.random.hypergeometric(m[i], np.sum(m[(i+1):]), n)
        n -= x[i]  # the remaining colours share the rest of the sample
    x[k-1] = n
    return x

For more about the multivariate hypergeometric distribution see this post about contingency tables and double cosets.

The maximum of geometric random variables

I was working on a problem involving the maximum of a collection of geometric random variables. To state the problem, let \((X_i)_{i=1}^n\) be i.i.d. geometric random variables with success probability \(p=p(n)\). Next, define \(M_n = \max\{X_i : 1 \le i \le n\}\). I wanted to know the limiting distribution of \(M_n\) as \(p \to 0\) and \(n \to \infty\). I was particularly interested in the case when \(p\) was on the order of \(1/n\).

It turns out that this problem has a very nice solution: If \(p\to 0\) as \(n \to \infty\), then for all real numbers \(c\)

\(\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}\)

That is, \(M_n\) is on the order of \(\frac{\log n}{p}\) and has fluctuations of order \(1/p\). Furthermore, these fluctuations are distributed according to the Gumbel distribution. Importantly, this result places no restrictions on how quickly \(p\) goes to \(0\).
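This limit is easy to check by simulation. The sketch below estimates \(\mathbb{P}\left(M_n \le \frac{\log n}{p}\right)\), i.e. the case \(c=0\), and compares it with \(\exp(-e^{0}) = e^{-1} \approx 0.368\). The values of \(n\) and the number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5_000, 1_000
p = 1 / n                       # p on the order of 1/n

# maximum of n i.i.d. geometric(p) variables, repeated over many trials
M = rng.geometric(p, size=(trials, n)).max(axis=1)

# centre by log(n)/p and scale by p; the limit says P(c <= 0) -> exp(-1)
c = M * p - np.log(n)
print(np.mean(c <= 0), np.exp(-1))  # the two values should be close
```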

In the rest of this post, I’ll derive the above identity. The first step relates \(M_n\) to the maximum of a collection of exponential random variables. The second step uses a coupling to prove that the exponential and geometric random variables are close to each other.

An easier version

When \(p\) goes to \(0\), the scaled geometric random variables \(p X_i\) converge to exponential random variables. Exponential random variables are nicer to work with because they have a continuous distribution.

This suggests that we could try replacing all the geometric random variables with exponential random variables. More precisely, we will replace \(X_i\) with \(Y_i/p\) where \(Y_i\) are exponentially distributed. Next, we will derive the main result for the exponential random variables. Then, in the next section, we will prove that \(X_i\) is sufficiently close to \(Y_i/p\).

Let \(\widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}\) where \(Y_i\) are i.i.d and exponentially distributed. This means that for \(y \ge 0\), \(\mathbb{P}(Y_i \le y)=1-e^{-y}\). The condition \(\widetilde{M}_n \le y\) is equivalent to \(Y_i \le py\) for all \(i =1,\ldots,n\). Since the \(Y_i\)’s are independent, this implies that for \(y \ge 0\)

\(\displaystyle{\mathbb{P}(\widetilde{M}_n \le y) = \prod_{i=1}^n \mathbb{P}(Y_i \le py) = \left(1-e^{-py}\right)^n}\)

Now fix a real number \(c\) and let \(y = \frac{\log n}{p}+\frac{c}{p}\). Since \(n \to \infty\), we know that \(y \ge 0\) for large enough \(n\). Thus, for sufficiently large \(n\)

\(\displaystyle{\mathbb{P}\left(\widetilde{M}_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \left(1-e^{-\log n - c}\right)^n = \left(1-\frac{e^{-c}}{n}\right)^n},\)

and \(\left(1-\frac{e^{-c}}{n}\right)^n \to \exp(-e^{-c})\) as \(n \to \infty\). So the main result holds for \(\widetilde{M}_n = \max\{Y_i/p : 1 \le i \le n\}\). We want to show that it holds for \(M_n = \max\{X_i : 1 \le i \le n\}\). The next section proves a close relationship between \(X_i\) and \(Y_i\).

Exponential and geometric coupling

It is true that as \(p \to 0\), \(pX_i\) converges in distribution to an exponential random variable. This is a result that holds for a single random variable. For an approximation like \(M_n \approx \widetilde{M}_n\), we need a result that works for every \(X_i\) at once. This seems tricky since it involves the convergence of an infinite collection of random variables. Fortunately, there is a simple approach based on coupling.

For each geometric random variable \(X_i\) it is possible to produce an exponential random variable \(Y_i\) with

\(\displaystyle{\frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},\)

where \(\lambda = -\log(1-p)\). This is because \(\mathbb{P}(X_i \le x) = 1-(1-p)^{\lfloor x \rfloor}\) and \(\mathbb{P}(Y_i/\lambda \le x) = 1-e^{-\lambda x} = 1-(1-p)^{x}\). This means that

\( \displaystyle{\mathbb{P}(Y_i/\lambda \le x-1) \le \mathbb{P}(X_i \le x)\le \mathbb{P}(Y_i/\lambda \le x) }.\)

If we sample both \(X_i\) and \(Y_i\) with inverse transform sampling, and use the same uniform random variables, then we will have

\(\displaystyle{ \frac{1}{\lambda} Y_i \le X_i \le \frac{1}{\lambda} Y_i + 1},\)

as claimed. If we use this inequality for every single pair \((X_i,Y_i)\) we will get

\(\displaystyle{\max\{Y_i/ \lambda\} \le M_n \le \max\{Y_i/\lambda\} +1}\)

As \(p \to 0\), we have that \(\lambda \to 0\) and \(\frac{p}{\lambda}=\frac{p}{-\log(1-p)} \to 1\). This means that the random variable \(\max\{Y_i/\lambda\} \) has the same limit as \(\widetilde{M}_n = \max\{ Y_i/p : 1 \le i \le n\}\). Furthermore, \(\max\{Y_i/\lambda\} +1\) also has the same limit as \(\widetilde{M}_n\). Since \( M_n\) is “sandwiched” between \(\max\{Y_i/\lambda\}\) and \(\max\{Y_i/\lambda\} +1\), the result must also hold for \(M_n\). And so

\(\displaystyle{\lim_{n \to \infty}\mathbb{P}\left(M_n \le \frac{\log n}{p}+\frac{c}{p}\right) = \exp(-e^{-c})}\)
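The coupling itself can be verified numerically. With inverse transform sampling from the same uniforms, the geometric variable is exactly \(\lceil Y_i/\lambda \rceil\), so the sandwich inequality holds for every pair. A quick sketch (the value of \(p\) is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.01
lam = -np.log(1 - p)

U = rng.uniform(size=100_000)
Y = -np.log(1 - U)     # exponential(1) via inverse transform
X = np.ceil(Y / lam)   # geometric(p) via the same uniforms

# the sandwich inequality Y/lambda <= X <= Y/lambda + 1
assert np.all(Y / lam <= X) and np.all(X <= Y / lam + 1)
print("coupling inequality holds")
```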