Logistic regression with aggregate data

Logistic regression is a technique for analyzing data where the dependent variable is binary. In this post, we will introduce logistic regression and then discuss how to fit a logistic regression model using only aggregated count data.

Logistic regression overview

Logistic regression can be used when the dependent variable \(Y_i\) is binary. For example, in a medical study the dependent variable might be whether or not a patient developed a disease. In this case \(Y_i=1\) will mean that patient \(i\) did develop the disease and \(Y_i=0\) will mean that patient \(i\) did not. In logistic regression, the independent variables can be either categorical or quantitative. In the medical example, the type of treatment received could be a categorical independent variable and the age of the patient could be a quantitative independent variable.

Logistic regression models the probability that \(Y_i=1\) given the independent variables. If \(X_{i,1},\ldots,X_{i,p}\) are the values of the independent variables for patient \(i\), then logistic regression assumes that

\(\mathrm{Pr}(Y_i=1) = \sigma(\beta_0 + \beta_1 X_{i,1} + \cdots +\beta_p X_{i,p}) \),

where \(\mathrm{Pr}(Y_i=1)\) is the probability that \(Y_i = 1\), \(\sigma\) is the logistic function \(\sigma(x) = \frac{1}{1+\exp(-x)}\) and \(\beta_0,\beta_1,\ldots,\beta_p\) are parameters that will be estimated from the data. The interpretation of the parameter \(\beta_j\) is that if \(X_{i,j}\) increases by 1 unit and all the other independent variables are fixed, then the odds \(\frac{\mathrm{Pr}(Y_i=1)}{\mathrm{Pr}(Y_i=0)}\) will increase by a factor of \(\exp(\beta_j)\).
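To see where this interpretation comes from, note that the logistic function satisfies \(\frac{\sigma(x)}{1-\sigma(x)} = \exp(x)\). Applying this identity to the model gives the odds directly:

\(\displaystyle{\frac{\mathrm{Pr}(Y_i=1)}{\mathrm{Pr}(Y_i=0)} = \exp(\beta_0 + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p})}.\)

If \(X_{i,j}\) increases by 1 unit with the other independent variables fixed, the exponent increases by \(\beta_j\), so the right-hand side is multiplied by \(\exp(\beta_j)\).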

Logistic regression models can be fit to data using the function glm() in the programming language R. For example, to fit a logistic regression model with dependent variable disease and independent variables treatment and age, you can use the code

logistic_model <- glm(
    disease ~ treatment + age, 
    data = data, 
    family = binomial)

To view the estimates of the parameters of the model (as well as standard errors and p-values), you can run the code summary(logistic_model).

Logistic regression with aggregate data

If all of the independent variables are categorical, then it is possible to fit a logistic regression model using only the aggregate count data. For example, the table below is from a 1980s/1990s study on the effect of AZT treatment on slowing the development of AIDS.1

Race    AZT use   Symptoms   No Symptoms
Black   Yes          11          52
Black   No           12          43
White   Yes          14          93
White   No           32          81

The study featured 338 veterans who were positive for HIV and were beginning to show signs of faltering immune systems. The veterans were randomly assigned either to receive AZT immediately or to receive AZT once their T-cells showed signs of severe immune weakness. The table records whether the veterans received AZT immediately, the race of the veterans and whether they developed AIDS symptoms during the three-year study period.

The table above shows aggregated data: it reports the number of people for each combination of treatment, race and disease status, rather than individual patient data. However, it is possible to fit a logistic regression model to the aggregate data.2 We just need to make some small changes to the R code.

To describe the changes, let’s first review what we would do with the patient-level data. Suppose we had a dataset individual_data that had a row for each veteran and columns race, azt_use and developed_symptoms that record the veteran’s race, AZT use and whether they developed symptoms. Our goal is to fit a logistic regression model where the dependent variable \(Y_i\) is a binary variable with \(Y_i=1\) meaning that the patient did develop symptoms. The independent variables are race and AZT use. With the individual data, we could use the following:

individual_model <- glm(
    developed_symptoms ~ race + azt_use, 
    family = binomial, 
    data = individual_data)

Now suppose that we only have the aggregate data in a table called aggregate_data. This table has columns race, azt_use, symptoms and no_symptoms like the table presented above. To fit the logistic model with this data, there are two options.

  • The first option is to replace the dependent variable developed_symptoms in the individual model with cbind(symptoms, no_symptoms). The function cbind creates a matrix with two columns recording the number of people with and without symptoms for each combination of race and AZT use. We also need to tell R to use the aggregate data. Here is the R code:
aggregate_model_1 <- glm(
    cbind(symptoms, no_symptoms) ~ race + azt_use, 
    family = binomial, 
    data = aggregate_data)
  • The other option is to replace the dependent variable developed_symptoms with the proportion of people who developed symptoms. To use this option, it is also necessary to compute the total number of people for each combination of race and AZT use. These totals are used as weights in the logistic regression. In the code below, we first compute the proportion and the total number of people and then fit the model.
aggregate_data$total <- aggregate_data$symptoms + aggregate_data$no_symptoms
aggregate_data$proportion <- aggregate_data$symptoms / aggregate_data$total

aggregate_model_2 <- glm(
    proportion ~ race + azt_use, 
    family = binomial,
    weights = total,
    data = aggregate_data)

Results

Both aggregate_model_1 and aggregate_model_2 produce exactly the same parameter estimates, standard errors and p-values. More importantly, if we had the individual data, individual_model would also give the same results. We can check this by “de-aggregating” the data: for example, making a dataset with 11 rows with race = "Black", azt_use = "Yes" and developed_symptoms = 1, so that these 11 rows represent the 11 veterans in the first category of the table above.

All three logistic regression models give the same final results:

             Estimate   Standard error   p-value
Intercept      -1.07        0.263         4e-05
raceWhite       0.05        0.289         0.848
azt_useYes     -0.719       0.279         0.01

The parameter estimates show that Black and White veterans had approximately the same chance of developing symptoms (assuming that either both received AZT treatment or neither did). On the other hand, the parameter estimate for AZT use is significant. The estimated value of -0.719 can be interpreted as follows. If the race of the veteran is held constant, then the odds of developing symptoms were about half as high for the veterans who immediately took AZT as for those who waited. In mathematical symbols:

\(\frac{\mathrm{Pr}(\text{developed symptoms} \mid \text{AZT immediately})}{\mathrm{Pr}(\text{did not develop symptoms} \mid \text{AZT immediately})}\)

is roughly equal to

\( 0.5 \times \frac{\mathrm{Pr}(\text{developed symptoms} \mid \text{waited for AZT})}{\mathrm{Pr}(\text{did not develop symptoms} \mid \text{waited for AZT})} \)

The specific value of 0.5 comes from the fact that \(\exp(-0.719) \approx 0.5\). Section 5.4.2 of Categorical Data Analysis by Alan Agresti discusses the interpretation of the parameter estimates for this example in more detail.
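We can verify this factor numerically. The sketch below rebuilds the aggregate table from above (the object names azt_data and azt_model are my own) and exponentiates the fitted coefficients to get odds ratios:

```r
# Rebuild the aggregate data and fit the logistic regression
azt_data <- data.frame(
  race = c("Black", "Black", "White", "White"),
  azt_use = c("Yes", "No", "Yes", "No"),
  symptoms = c(11, 12, 14, 32),
  no_symptoms = c(52, 43, 93, 81))

azt_model <- glm(cbind(symptoms, no_symptoms) ~ race + azt_use,
                 family = binomial, data = azt_data)

# Odds ratios: exponentiate the coefficients.
# The entry for azt_useYes should be exp(-0.719), roughly 0.49.
exp(coef(azt_model))
```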

R code for fitting all three models is available at the end of this post.3

Footnotes

  1. The data on AZT is from Table 5.6 of Alan Agresti’s Categorical Data Analysis (3rd edition) https://onlinelibrary.wiley.com/doi/book/10.1002/0471249688, https://mybiostats.wordpress.com/wp-content/uploads/2015/03/3rd-ed-alan_agresti_categorical_data_analysis.pdf ↩︎
  2. In statistical jargon, the aggregate data are sufficient statistics for the logistic regression model. ↩︎
  3. R code: ↩︎

# Aggregate data from Agresti Table 5.6
aggregate_data <- data.frame(
  race = c("Black", "Black", "White", "White"),
  azt_use = c("Yes", "No", "Yes", "No"),
  symptoms = c(11, 12, 14, 32),
  no_symptoms = c(52, 43, 93, 81)
)

# Aggregate model 1 using cbind()
aggregate_model_1 <- glm(
  cbind(symptoms, no_symptoms) ~ race + azt_use,
  data = aggregate_data,
  family = binomial)

# Aggregate model 2 using proportions and weights
aggregate_data$total <- aggregate_data$symptoms + aggregate_data$no_symptoms
aggregate_data$proportion <- aggregate_data$symptoms / aggregate_data$total

aggregate_model_2 <- glm(
  proportion ~ race + azt_use,
  data = aggregate_data,
  weights = total,
  family = binomial)

# Individual model

# Function to "de-aggregate"
deaggregate_data <- function(aggregate_data) {
  # Create an empty list to store rows
  rows_list <- list()
  
  # Iterate through each row of aggregate data
  for (i in 1:nrow(aggregate_data)) {
    race_val <- aggregate_data$race[i]
    azt_val <- aggregate_data$azt_use[i]
    n_symptoms <- aggregate_data$symptoms[i]
    n_no_symptoms <- aggregate_data$no_symptoms[i]
    
    # Create rows for those with symptoms
    if (n_symptoms > 0) {
      rows_list[[length(rows_list) + 1]] <- data.frame(
        race = rep(race_val, n_symptoms),
        azt_use = rep(azt_val, n_symptoms),
        developed_symptoms = rep(1, n_symptoms)
      )
    }
    
    # Create rows for those without symptoms
    if (n_no_symptoms > 0) {
      rows_list[[length(rows_list) + 1]] <- data.frame(
        race = rep(race_val, n_no_symptoms),
        azt_use = rep(azt_val, n_no_symptoms),
        developed_symptoms = rep(0, n_no_symptoms)
      )
    }
  }
  
  # Combine all rows into one data frame
  do.call(rbind, rows_list)
}

individual_data <- deaggregate_data(aggregate_data)

individual_model <- glm(
  developed_symptoms ~ race + azt_use,
  data = individual_data,
  family = binomial
)

# Compare models
summary(aggregate_model_1)
summary(aggregate_model_2)
summary(individual_model)

Auxiliary variables sampler

The auxiliary variables sampler is a general Markov chain Monte Carlo (MCMC) technique for sampling from probability distributions with unknown normalizing constants [1, Section 3.1]. Specifically, suppose we have \(n\) functions \(f_i : \mathcal{X} \to (0,\infty)\) and we want to sample from the probability distribution \(P(x) \propto \prod_{i=1}^n f_i(x).\) That is

\(\displaystyle{ P(x) = \frac{1}{Z}\prod_{i=1}^n f_i(x)},\)

where \(Z = \sum_{x \in \mathcal{X}} \prod_{i=1}^n f_i(x)\) is a normalizing constant. If the set \(\mathcal{X}\) is very large, then it may be difficult to compute \(Z\) or sample from \(P(x)\). To approximately sample from \(P(x)\), we can run an ergodic Markov chain with \(P(x)\) as a stationary distribution. Adding auxiliary variables is one way to create such a Markov chain. For each \(i \in \{1,\ldots,n\}\), we add an auxiliary variable \(U_i \ge 0\) such that

\(\displaystyle{P(U \mid X) = \prod_{i=1}^n \mathrm{Unif}[0,f_i(X)]}.\)

That is, conditional on \(X\), the auxiliary variables \(U_1,\ldots,U_n\) are independent and \(U_i\) is uniformly distributed on the interval \([0,f_i(X)]\). If \(X\) is distributed according to \(P\) and \(U_1,\ldots,U_n\) have the above auxiliary variable distribution, then

\(\displaystyle{P(X,U) = P(X)P(U\mid X) \propto \prod_{i=1}^n f_i(X) \frac{1}{f_i(X)} I[U_i \le f_i(X)] = \prod_{i=1}^n I[U_i \le f_i(X)]}.\)

This means that the joint distribution of \((X,U)\) is uniform on the set

\(\widetilde{\mathcal{X}} = \{(x,u) \in \mathcal{X} \times (0,\infty)^n : u_i \le f_i(x) \text{ for all } i\}.\)

Put another way, suppose we could jointly sample \((X,U)\) from the uniform distribution on \(\widetilde{\mathcal{X}}\). Then, the above calculation shows that if we discard \(U\) and only keep \(X\), then \(X\) will be sampled from our target distribution \(P\).

The auxiliary variables sampler approximately samples from the uniform distribution on \(\widetilde{\mathcal{X}}\) by using a Gibbs sampler. The Gibbs sampler alternates between sampling \(U’\) from \(P(U \mid X)\) and then sampling \(X’\) from \(P(X \mid U’)\). Since the joint distribution of \((X,U)\) is uniform on \(\widetilde{\mathcal{X}}\), the conditional distributions \(P(X \mid U)\) and \(P(U \mid X)\) are also uniform. The auxiliary variables sampler thus transitions from \(X\) to \(X’\) according to the two-step process

  1. For each \(i\), independently sample \(U_i\) uniformly from \([0,f_i(X)]\).
  2. Sample \(X’\) uniformly from the set \(\{x \in \mathcal{X} : f_i(x) \ge U_i \text{ for all } i\}\).

Since the auxiliary variables sampler is based on a Gibbs sampler, we know that the joint distribution of \((X,U)\) will converge to the uniform distribution on \(\widetilde{\mathcal{X}}\). So when we discard \(U\), the distribution of \(X\) will converge to the target distribution \(P\)!
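To make the two steps concrete, here is a minimal sketch in R on a toy example of my own devising (the factors f1, f2 and the set x_space are illustrative, not from [1]). On a small finite set both steps can be carried out exactly by enumeration:

```r
# Auxiliary variables sampler for P(x) proportional to f1(x) * f2(x)
# on the finite set x_space = {1, ..., 10} (a toy example).
x_space <- 1:10
f1 <- function(x) x
f2 <- function(x) 11 - x

aux_sampler <- function(n_steps, x0 = 1) {
  x <- x0
  samples <- integer(n_steps)
  for (t in seq_len(n_steps)) {
    # Step 1: sample each U_i uniformly from [0, f_i(x)]
    u1 <- runif(1, 0, f1(x))
    u2 <- runif(1, 0, f2(x))
    # Step 2: sample x' uniformly from {x : f_i(x) >= U_i for all i}
    candidates <- x_space[f1(x_space) >= u1 & f2(x_space) >= u2]
    x <- candidates[sample.int(length(candidates), 1)]
    samples[t] <- x
  }
  samples
}

set.seed(1)
draws <- aux_sampler(20000)
# Empirical frequencies should approach P(x) = x * (11 - x) / 220
```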

Auxiliary variables in practice

To perform step 2 of the auxiliary variables sampler we have to be able to sample uniformly from the sets

\(\mathcal{X}_u = \{x \in \mathcal{X}:f_i(x) \ge u_i \text{ for all } i\}. \)

Depending on the nature of the set \(\mathcal{X}\) and the functions \(f_i\), it might be difficult to do this. Fortunately, there are some notable examples where this step has been worked out. The very first example of auxiliary variables is the Swendsen-Wang algorithm for sampling from the Ising model [2]. In this model it is possible to sample uniformly from \(\mathcal{X}_u\). Another setting where we can sample exactly is when \(\mathcal{X}\) is the real numbers \(\mathbb{R}\) and each \(f_i\) is an increasing function of \(x\). This is explored in [3], where auxiliary variables are applied to sampling from Bayesian posteriors.
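To get a feel for the monotone case, here is a minimal toy sketch of my own (not the construction in [3]): take \(n=1\) and \(f_1(x) = \exp(-x)\) on \(\mathcal{X} = [0,\infty)\), so the target \(P\) is the Exponential(1) distribution. Since \(f_1\) is monotone, the set \(\mathcal{X}_u = \{x : \exp(-x) \ge u_1\} = [0, -\log u_1]\) is an interval we can sample from directly:

```r
# Auxiliary variables sampler targeting Exp(1): f1(x) = exp(-x) on [0, Inf).
# The slice {x : exp(-x) >= u} is the interval [0, -log(u)].
exp_sampler <- function(n_steps, x0 = 1) {
  x <- x0
  samples <- numeric(n_steps)
  for (t in seq_len(n_steps)) {
    u <- runif(1, 0, exp(-x))   # Step 1: auxiliary variable
    x <- runif(1, 0, -log(u))   # Step 2: uniform on the slice
    samples[t] <- x
  }
  samples
}

set.seed(1)
draws <- exp_sampler(20000)
mean(draws)  # should be close to 1, the mean of Exp(1)
```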

There is an alternative to sampling exactly from the uniform distribution on \(\mathcal{X}_u\). Instead of sampling \(X’\) uniformly from \(\mathcal{X}_u\), we can run a Markov chain from the old \(X\) that has the uniform distribution as a stationary distribution. This approach leads to another special case of auxiliary variables, called “slice sampling” [4].

References

[1] Andersen HC, Diaconis P. Hit and run as a unifying device. Journal de la Société Française de Statistique & Revue de Statistique Appliquée. 2007;148(4):5-28. http://www.numdam.org/item/JSFS_2007__148_4_5_0/
[2] Swendsen RH, Wang JS. Nonuniversal critical dynamics in Monte Carlo simulations. Physical review letters. 1987 Jan 12;58(2):86. https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.58.86
[3] Damien P, Wakefield J, Walker S. Gibbs sampling for Bayesian non‐conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999 Apr;61(2):331-44. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00179
[4] Neal RM. Slice sampling. The annals of statistics. 2003 Jun;31(3):705-67. https://projecteuclid.org/journals/annals-of-statistics/volume-31/issue-3/Slice-sampling/10.1214/aos/1056562461.full

Finding Australia’s youngest electorates with R

My partner recently wrote an article for Changing Times, a grassroots newspaper that focuses on social change. Her article, Who’s not voting? Engaging with First Nations voters and young voters, is about voter turnout in Australia and is an important read.

While doing research for the article, she wanted to know which electorates in Australia had the highest proportion of young voters. Fortunately, the Australian Electoral Commission (AEC) keeps a detailed record of the number of electors in each electorate, available here. The records list the number of voters of a given sex and age bracket in each of Australia’s 151 electorates. To calculate and sort the proportion of young voters (18-24) in each electorate using the most recent records, I wrote the below R code for her:

library(tidyverse)

voters <- read_csv("elector-count-fe-2022.csv", 
                   skip = 8,
                   col_names = c("Description",
                                 "18-19", 
                                 "20-24", 
                                 "25-29", 
                                 "30-34", 
                                 "35-39", 
                                 "40-44", 
                                 "45-49", 
                                 "50-54", 
                                 "55-59", 
                                 "60-64", 
                                 "65-69", 
                                 "70+",  
                                 "Total"),
                   col_select = (1:14),
                   col_types = cols(Description = "c",
                                    .default = "n"))


not_electorates = c("NSW", "VIC", "ACT", "WA", "SA", "TAS", 
                    "QLD", "NT", "Grand Total","Female", 
                    "Male", "Indeterminate/Unknown")

electorates <- voters %>% 
  filter(!(Description %in% not_electorates),
         !is.na(Description))  

young_voters <- electorates %>% 
  mutate(Young = `18-19` + `20-24`,
         Proportion = Young/Total,
         rank = min_rank(desc(Proportion))) %>% 
  arrange(rank) %>% 
  select(Electorate = Description, Total, Young, Proportion, rank) 

young_voters

To explain what I did, I’ll first describe the format of the data set from the AEC. The first column contained a description of each row, and all the other columns contained the number of voters in a given age bracket. Each row corresponded to either an electorate or a state, and counted either all voters or voters of a given sex.

For the article, my partner wanted to know which electorate had the highest proportion of young voters (18-24) overall, so we removed the rows for each state and the rows that counted a specific sex. The next step was to calculate the number of young voters across the age brackets 18-19 and 20-24 and then calculate the proportion of such voters. Once ranked, the five electorates with the highest proportion of young voters were

Electorate   Proportion of young voters
Ryan         0.146
Brisbane     0.132
Griffith     0.132
Canberra     0.131
Kooyong      0.125

In Who’s not Voting?, my partner points out that the three seats with the highest proportion of young voters all swung to the Greens at the most recent election. To view the proportion of young voters in every electorate, you can download this spreadsheet.

Visualising Strava data with R

I’ve recently had some fun downloading and displaying my running data from Strava. I’ve been tracking my runs on Strava for the last five years and I thought it would be interesting to make a map showing where I run. Here is one of the plots I made. Each circle is a place in south east Australia where I ran in the given year. The size of the circle corresponds to how many hours I ran there.

I think the plot is a nice visual diary. Looking at the plot, you can see that most of my running took place in my hometown Canberra and that the time I spend running has been increasing. The plot also shows that most years I’ve spent some time running on the south coast and on my grandfather’s farm near Kempsey. You can also see my trips in 2019 and 2020 to the AMSI Summer School. In 2019, the summer school was hosted by UNSW in Sydney and in 2020 it was hosted by La Trobe University in Melbourne. You can also see a circle from this year at Mt Kosciuszko where I ran the Australian Alpine Ascent with my good friend Sarah.

I also made a plot of all my runs on a world map which shows my recent move to California. In this plot all the circles are the same size and I grouped all the runs across the five different years.

I learnt a few things creating these plots and so I thought I would document how I made them.


Creating the plots

  • Strava lets you download all your data by doing a bulk export. The export includes a zipped folder with all your activities in their original file format.
  • My activities were saved as .gpx files and I used this handy Python library to convert them to .csv files which I could read into R. For the R code I used the packages “tidyverse”, “maps” and “ozmaps”.
  • Now I had a .csv file for each run. In these files each row corresponded to my location at a given second during the run. What I wanted was a single data frame where each row corresponded to a different run. I found the following way to read in and edit each .csv file:
files <- list.files(path = ".", pattern = "*.csv")
listcsv <- lapply(files, function(x) read_csv(x) %>% 
                    select(lat, lon, time) %>% 
                    mutate(hours = n()/3600) %>% 
                    filter(row_number() == 1)
                  )

The first line creates a list of all the .csv files in the working directory. The second line then goes through the list of file names and converts each .csv file into a tibble. I then selected the columns with the time and my location and added a new column with the duration of the run in hours. Finally, I removed all the rows except the first row, which contains the information about where my run started.

  • Next I combined these separate tibbles into a single tibble using rbind(). I then added some new columns for grouping the runs. I added a column with the year and columns with the longitude and latitude rounded to the nearest whole number.
runs <- do.call(rbind, listcsv) %>% 
  mutate(year = format(time,"%Y"),
         approx_lat = round(lat),
         approx_lon = round(lon))
  • To create the plot where you can see where I ran each year, I grouped the runs by the approximate location and by year. I then calculated the total time spent running at each location each year and calculated the average longitude and latitude. I also removed the runs in the USA by only keeping the runs with a negative latitude.
run_counts_year <- runs %>% 
  group_by(approx_lat, approx_lon, year) %>% 
  summarise(hours = sum(hours),
            lat = mean(lat),
            lon = mean(lon),
            .groups = "drop") %>% 
  select(!approx_lat & !approx_lon)

oz_counts_year <- run_counts_year %>% 
  filter(lat < 0)
  • I then used the package “ozmaps” to plot my running locations on a map of the ACT, New South Wales and Victoria.
oz_states <- ozmaps::ozmap_states %>% 
  filter(NAME == "New South Wales" |
         NAME == "Victoria" |
         NAME == "Australian Capital Territory")

ggplot() + 
  geom_sf(data = oz_states) +
  coord_sf() +
  geom_point(data = oz_counts_year,
             mapping = aes(x = lon, 
                           y = lat,
                           size = hours),
            color = "blue",
            shape = 21) +
  facet_wrap(~ year) +
  theme_bw() 
  • Creating the world map was similar except I didn’t group by year and I kept the runs with positive latitude.
run_counts <- runs %>% 
  group_by(approx_lat, approx_lon) %>% 
  summarise(hours = sum(hours),
            lat = mean(lat),
            lon = mean(lon),
            .groups = "drop") %>% 
  select(!approx_lat & !approx_lon)

world <- map_data("world")
ggplot() +
  geom_polygon(data = world,
               mapping = aes(x = long, y = lat, group = group),
               fill = "lightgrey",
               color = "black",
               size = 0.1) +
  geom_point(data = run_counts,
             mapping = aes(x = lon, 
                           y = lat),
             size = 2,
             shape = 1,
             color = "blue") +
  theme_bw()