Suppose we have two samples \(Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}\) and \(Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}\) and we want to test if they are from the same distribution. Many popular tests can be reinterpreted as correlation tests by pooling the two samples and introducing a dummy variable that encodes which sample each data point comes from. In this post we will see how this plays out in a simple t-test.
The equal variance t-test
In the equal variance t-test, we assume that \(Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)\) and \(Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)\), where \(\sigma^2\) is unknown. Our hypothesis that \(Y_1^{(0)}, Y_2^{(0)},\ldots, Y_{n_0}^{(0)}\) and \(Y_1^{(1)},Y_2^{(1)},\ldots, Y_{n_1}^{(1)}\) are from the same distribution becomes the hypothesis \(\mu_0 = \mu_1\). The test statistic is
\(t = \frac{\displaystyle \overline{Y}^{(1)} - \overline{Y}^{(0)}}{\displaystyle \hat{\sigma}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}\),
where \(\overline{Y}^{(0)}\) and \(\overline{Y}^{(1)}\) are the two sample means. The variable \(\hat{\sigma}\) is the pooled estimate of the standard deviation and is given by
\(\hat{\sigma}^2 = \displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right)\).
Under the null hypothesis, \(t\) follows the t-distribution with \(n_0+n_1-2\) degrees of freedom. We thus reject the null \(\mu_0=\mu_1\) when \(|t|\) exceeds the \(1-\alpha/2\) quantile of this distribution.
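As a quick numerical sanity check (not part of the derivation above), the following Python sketch computes \(t\) directly from the formula and compares it with scipy's ttest_ind, which performs the equal variance t-test when called with equal_var=True. The sample sizes and simulated data are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n0, n1 = 8, 12                                  # arbitrary sample sizes
y0 = rng.normal(loc=0.0, scale=1.0, size=n0)    # sample from N(mu_0, sigma^2)
y1 = rng.normal(loc=0.5, scale=1.0, size=n1)    # sample from N(mu_1, sigma^2)

# Pooled variance estimate with n0 + n1 - 2 degrees of freedom
sigma2_hat = (np.sum((y0 - y0.mean()) ** 2)
              + np.sum((y1 - y1.mean()) ** 2)) / (n0 + n1 - 2)

# Two-sample t statistic from the formula above
t = (y1.mean() - y0.mean()) / np.sqrt(sigma2_hat * (1 / n0 + 1 / n1))

# scipy's equal-variance two-sample t-test (note the argument order)
t_scipy, p_value = stats.ttest_ind(y1, y0, equal_var=True)
print(t, t_scipy)  # the two statistics agree
```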
Pooling the data
We can turn this two-sample test into a correlation test by pooling the data and using a linear model. Let \(Y_1,\ldots,Y_{n_0}, Y_{n_0+1},\ldots,Y_{n_0+n_1}\) be the pooled data and for \(i = 1,\ldots, n_0+n_1\), define \(x_i \in \{0,1\}\) by
\(x_i = \begin{cases} 0 & \text{if } 1 \le i \le n_0,\\ 1 & \text{if } n_0+1 \le i \le n_0+n_1.\end{cases}\)
The assumptions that \(Y_i^{(0)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_0,\sigma^2)\) and \(Y_i^{(1)} \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu_1,\sigma^2)\) can be rewritten as
\(Y_i = \beta_0+\beta_1x_i + \varepsilon_i,\)
where \(\varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)\). That is, we have expressed our modelling assumptions as a linear model. When working with this linear model, the hypothesis \(\mu_0 = \mu_1\) is equivalent to \(\beta_1 = 0\). To test \(\beta_1 = 0\) we can use the standard t-test for a coefficient in a linear model. The test statistic in this case is
\(t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{(X^TX)^{-1}_{11}}},\)
where \(\hat{\beta}_1\) is the ordinary least squares estimate of \(\beta_1\), \(X \in \mathbb{R}^{(n_0+n_1)\times 2}\) is the design matrix and \(\hat{\sigma}_{OLS}\) is an estimate of \(\sigma\) given by
\(\hat{\sigma}_{OLS}^2 = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1} (Y_i-\hat{Y}_i)^2,\)
where \(\hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1x_i\) is the fitted value of \(Y_i\).
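In practice we would not compute \(t'\) by hand. As a hedged sketch (the variable names and simulated data are my own), here is how the same statistic can be read off a fitted model in statsmodels: pool the samples, build the dummy variable \(x_i\) and look at the t statistic reported for \(\hat{\beta}_1\).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n0, n1 = 8, 12
y0 = rng.normal(0.0, 1.0, n0)                   # same simulated data as before
y1 = rng.normal(0.5, 1.0, n1)

# Pool the data and build the dummy variable x_i
y = np.concatenate([y0, y1])
x = np.concatenate([np.zeros(n0), np.ones(n1)])

# Fit Y_i = beta_0 + beta_1 x_i + eps_i by ordinary least squares
X = sm.add_constant(x)                          # design matrix with rows [1, x_i]
fit = sm.OLS(y, X).fit()

print(fit.params)        # [beta_0_hat, beta_1_hat]
print(fit.tvalues[1])    # t', the t statistic for beta_1
```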
It turns out that \(t'\) is exactly equal to \(t\). We can see this by writing out the design matrix and calculating everything above. The design matrix has rows \([1,x_i]\) and is thus equal to
\(X = \begin{bmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_{n_0}\\ 1&x_{n_0+1}\\ \vdots&\vdots\\ 1&x_{n_0+n_1}\end{bmatrix} = \begin{bmatrix} 1&0\\ 1&0\\ \vdots&\vdots\\ 1&0\\ 1&1\\ \vdots&\vdots\\ 1&1\end{bmatrix}.\)
This implies that
\(X^TX = \begin{bmatrix} n_0+n_1 &n_1\\n_1&n_1 \end{bmatrix},\)
and therefore,
\((X^TX)^{-1} = \frac{1}{(n_0+n_1)n_1-n_1^2}\begin{bmatrix} n_1 &-n_1\\-n_1&n_0+n_1 \end{bmatrix} = \frac{1}{n_0n_1}\begin{bmatrix} n_1&-n_1\\-n_1&n_0+n_1\end{bmatrix} =\begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix} .\)
Thus, \((X^TX)^{-1}_{11} = \frac{1}{n_0}+\frac{1}{n_1}\). So,
\(t' = \displaystyle\frac{\hat{\beta}_1}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}},\)
which is starting to look like \(t\) from the two-sample test. Now
\(X^TY = \begin{bmatrix} \displaystyle\sum_{i=1}^{n_0+n_1} Y_i\\ \displaystyle \sum_{i=n_0+1}^{n_0+n_1} Y_i \end{bmatrix} = \begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}.\)
And so
\(\hat{\beta} = (X^TX)^{-1}X^TY = \begin{bmatrix} \frac{1}{n_0}&-\frac{1}{n_0}\\-\frac{1}{n_0}&\frac{1}{n_0}+\frac{1}{n_1}\end{bmatrix}\begin{bmatrix} n_0\overline{Y}^{(0)} + n_1\overline{Y}^{(1)}\\ n_1\overline{Y}^{(1)} \end{bmatrix}=\begin{bmatrix} \overline{Y}^{(0)}\\ \overline{Y}^{(1)} -\overline{Y}^{(0)}\end{bmatrix}.\)
Thus, \(\hat{\beta}_1 = \overline{Y}^{(1)} -\overline{Y}^{(0)}\) and
\(t' = \displaystyle\frac{\overline{Y}^{(1)}-\overline{Y}^{(0)}}{\hat{\sigma}_{OLS}\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}}.\)
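These matrix calculations are easy to confirm numerically. The short sketch below (again with arbitrary simulated data) checks that \((X^TX)^{-1}\) has the form derived above and that \(\hat{\beta}_0 = \overline{Y}^{(0)}\) and \(\hat{\beta}_1 = \overline{Y}^{(1)}-\overline{Y}^{(0)}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 8, 12
y0 = rng.normal(0.0, 1.0, n0)
y1 = rng.normal(0.5, 1.0, n1)

y = np.concatenate([y0, y1])
X = np.column_stack([np.ones(n0 + n1),
                     np.concatenate([np.zeros(n0), np.ones(n1)])])

XtX_inv = np.linalg.inv(X.T @ X)
print(XtX_inv)                              # [[1/n0, -1/n0], [-1/n0, 1/n0 + 1/n1]]

beta_hat = XtX_inv @ X.T @ y                # ordinary least squares estimate
print(beta_hat)                             # [Y0_bar, Y1_bar - Y0_bar]
print(y0.mean(), y1.mean() - y0.mean())     # the same two numbers
```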
This means that to show \(t' = t\), we only need to show that \(\hat{\sigma}_{OLS}^2=\hat{\sigma}^2\). To do this, note that the fitted values \(\hat{Y}_i\) are equal to
\(\displaystyle\hat{Y}_i=\hat{\beta}_0+x_i\hat{\beta}_1 = \begin{cases} \overline{Y}^{(0)} & \text{if } 1 \le i \le n_0,\\ \overline{Y}^{(1)} & \text{if } n_0+1\le i \le n_0+n_1\end{cases}.\)
Thus,
\(\hat{\sigma}^2_{OLS} = \displaystyle\frac{1}{n_0+n_1-2}\sum_{i=1}^{n_0+n_1}\left(Y_i-\hat{Y}_i\right)^2=\displaystyle\frac{1}{n_0+n_1-2}\left(\sum_{i=1}^{n_0}\left(Y_i^{(0)}-\overline{Y}^{(0)}\right)^2 + \sum_{i=1}^{n_1}\left(Y_i^{(1)}-\overline{Y}^{(1)}\right)^2\right),\)
which is exactly \(\hat{\sigma}^2\). Therefore, \(t'=t\) and the two-sample t-test is equivalent to a correlation test.
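Putting the pieces together, the following check (same arbitrary simulated set-up as in the earlier sketches) confirms that the OLS residual sum of squares equals the pooled within-group sum of squares, so that \(\hat{\sigma}_{OLS}^2 = \hat{\sigma}^2\) and \(t' = t\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n0, n1 = 8, 12
y0 = rng.normal(0.0, 1.0, n0)
y1 = rng.normal(0.5, 1.0, n1)

y = np.concatenate([y0, y1])
x = np.concatenate([np.zeros(n0), np.ones(n1)])
X = np.column_stack([np.ones(n0 + n1), x])

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# OLS residual sum of squares equals the pooled within-group sum of squares
rss_ols = np.sum(resid ** 2)
rss_pooled = np.sum((y0 - y0.mean()) ** 2) + np.sum((y1 - y1.mean()) ** 2)
print(rss_ols, rss_pooled)                  # equal

# Hence t' from the regression equals t from the two-sample test
sigma_ols = np.sqrt(rss_ols / (n0 + n1 - 2))
t_prime = beta_hat[1] / (sigma_ols * np.sqrt(1 / n0 + 1 / n1))
t_two_sample = stats.ttest_ind(y1, y0, equal_var=True).statistic
print(t_prime, t_two_sample)                # equal
```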
The Friedman-Rafsky test
In the above example, we saw that the two-sample t-test is a special case of the t-test for a regression coefficient. This is neat, but both tests make very strong assumptions about the data. The same thing, however, happens in a more interesting non-parametric setting.
In their 1979 paper, Jerome Friedman and Lawrence Rafsky introduced a two-sample test that makes no assumptions about the distribution of the data. The two samples do not even have to be real-valued and can instead come from any metric space. It turns out that their test is a special case of another procedure they devised for testing for association (Friedman & Rafsky, 1983). As with the t-tests above, this connection comes from pooling the two samples and introducing a dummy variable.
I plan to write a follow-up post explaining these procedures, but you can also read about them in Chapter 6 of Group Representations in Probability and Statistics by Persi Diaconis.
References
Persi Diaconis, Group Representations in Probability and Statistics, pp. 104-106, Institute of Mathematical Statistics, Hayward, CA, 1988.
Jerome H. Friedman and Lawrence C. Rafsky, “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests,” The Annals of Statistics, 7(4), 697-717, 1979.
Jerome H. Friedman and Lawrence C. Rafsky, “Graph-Theoretic Measures of Multivariate Association and Prediction,” The Annals of Statistics, 11(2), 377-391, 1983.