When performing multiple hypothesis tests, we should rightly be concerned that by repeated testing, we might reject null hypotheses more frequently than our \(\alpha\)-levels would guarantee. In particular, the family-wise error rate, the probability of rejecting at least one true null hypothesis, might be very different from the \(\alpha\)-level used in each individual test. In general, if we were to conduct \(k\) independent hypothesis tests and all the null hypotheses were true, we would reject at least one null hypothesis with probability \(1 - (1 - \alpha)^k\). For \(k = 2\) this is \(2 \alpha - \alpha^2\). As we run more tests, we risk believing that we reject a true null hypothesis only, say, 5% of the time, because that is the rate for any individual test, when in fact we reject at least one true null far more often. XKCD says it better than I ever could.
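To get a feel for how quickly this grows, here is a minimal sketch of the formula \(1 - (1 - \alpha)^k\). The function name `familywise_error_rate` is my own choice for illustration, and the calculation assumes, as above, that the \(k\) tests are independent and every null hypothesis is true.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false rejection among k independent
    level-alpha tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    alpha = 0.05
    # Print the family-wise error rate for a few values of k.
    for k in (1, 2, 5, 20):
        print(f"k = {k:2d}: FWER = {familywise_error_rate(alpha, k):.4f}")
```

With \(\alpha = 0.05\) and \(k = 2\) this gives 0.0975, which agrees with \(2\alpha - \alpha^2\).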

Let's say you wish to conduct a hypothesis test based on some test statistic \(Z\) that is Normally distributed with mean zero and variance one under the null hypothesis (in the following, \(\Phi^{-1}\) is the quantile function for \(Z\)). In general, you care about alternatives that imply that \(Z\) still has unit variance, but is now centered somewhere else. You could either test a two-sided hypothesis, where you would reject the null if \(Z\) is less than \(\Phi^{-1}(\alpha/2)\) or \(Z\) is greater than \(\Phi^{-1}(1 - \alpha/2)\), or a one-sided hypothesis, where you would reject the null if \(Z\) exceeds \(\Phi^{-1}(1 - \alpha)\). Both of these tests would have proper size, rejecting a true null with probability exactly \(\alpha\). Individually, either of these tests would meet the usual basic requirement we place on tests.
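As a quick check of these rejection regions, the following sketch computes the critical values and the size of each test using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The variable names and the choice \(\alpha = 0.05\) are assumptions made for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1
alpha = 0.05

# Critical values; inv_cdf plays the role of the quantile function Phi^{-1}.
lower_two_sided = Z.inv_cdf(alpha / 2)
upper_two_sided = Z.inv_cdf(1 - alpha / 2)
upper_one_sided = Z.inv_cdf(1 - alpha)

# Size of each test: probability of rejecting when the null is true.
size_two_sided = Z.cdf(lower_two_sided) + (1 - Z.cdf(upper_two_sided))
size_one_sided = 1 - Z.cdf(upper_one_sided)

print(f"two-sided: reject if Z < {lower_two_sided:.3f} or Z > {upper_two_sided:.3f}")
print(f"one-sided: reject if Z > {upper_one_sided:.3f}")
print(f"sizes: two-sided = {size_two_sided:.4f}, one-sided = {size_one_sided:.4f}")
```

For \(\alpha = 0.05\) the two-sided cutoffs are roughly \(\pm 1.960\) and the one-sided cutoff is roughly \(1.645\). Both sizes come out to \(\alpha\), up to floating-point error.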

But what about multiple testing? Should we be concerned? What is the probability of rejecting one or both tests when \(Z\) is distributed as specified under the null hypothesis (i.e., standard Normal)? For the one-sided test, we would reject when \(Z > \Phi^{-1}(1 - \alpha)\). For the two-sided test, we reject when \(Z > \Phi^{-1}(1- \alpha/2)\) or \(Z < \Phi^{-1}(\alpha/2)\). Putting these together:

\begin{align} & P([Z > \Phi^{-1}(1 - \alpha)] \cup [Z > \Phi^{-1}(1 - \alpha/2)] \cup [Z < \Phi^{-1}(\alpha/2)])\\ &= 1 - P(\Phi^{-1}(\alpha/2) < Z < \Phi^{-1}(1 - \alpha)) \\ &= 1 - (1 - \alpha - \alpha/2) = \frac{3\alpha}{2} \end{align}

So we would reject at least one of the tests more often than \(\alpha\), but the result is not as bad as with independent tests. For example, with two independent tests and \(\alpha = 0.05\), we would reject at least one of the two true nulls 9.75% of the time, but we would reject at least one of the two tests of the same null only 7.5% of the time. This difference comes from the strong dependence introduced by testing the same null with the same test statistic. Since rejecting the two-sided test in the upper tail implies rejecting the one-sided test, the rejection regions overlap, and we have fewer opportunities to mislead ourselves with respect to error rates than when considering independent tests. For independent tests, we can do little better than applying Bonferroni corrections across the multiple tests, but with known dependence, we can greatly improve testing procedures to maintain family-wise error rates, for example by using a closed testing procedure.
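The \(3\alpha/2\) result can be checked numerically. The sketch below computes the probability of the combined rejection region directly from the Normal CDF, then confirms it by simulating draws of \(Z\) under the null. The function names, the seed, and the number of draws are arbitrary choices of mine, not part of any standard procedure.

```python
import random
from statistics import NormalDist

Z = NormalDist()  # standard Normal under the null


def exact_reject_either(alpha: float) -> float:
    """P(reject the one-sided test or the two-sided test) under the null."""
    lower = Z.inv_cdf(alpha / 2)   # two-sided lower cutoff
    upper = Z.inv_cdf(1 - alpha)   # one-sided cutoff (below the two-sided upper cutoff)
    return 1 - (Z.cdf(upper) - Z.cdf(lower))


def simulated_reject_either(alpha: float, draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    lower_two = Z.inv_cdf(alpha / 2)
    upper_two = Z.inv_cdf(1 - alpha / 2)
    upper_one = Z.inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(draws):
        z = rng.gauss(0.0, 1.0)
        one_sided = z > upper_one
        two_sided = z > upper_two or z < lower_two
        if one_sided or two_sided:
            rejections += 1
    return rejections / draws


if __name__ == "__main__":
    alpha = 0.05
    print(f"exact:     {exact_reject_either(alpha):.4f}")
    print(f"simulated: {simulated_reject_either(alpha):.4f}")
    print(f"two independent tests: {1 - (1 - alpha) ** 2:.4f}")
```

For \(\alpha = 0.05\) the exact value is 0.075, and the simulated estimate should land close to it, compared with 0.0975 for two independent tests.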

Finally, it is worth noting that if the researcher first does a two-tailed test and then unconditionally throws away the results (that is, regardless of what they showed) in order to perform a one-tailed test, all of the usual guarantees about the testing procedure hold. The probability of rejecting the one-tailed test would be \(P(Z > \Phi^{-1}(1 - \alpha)) = \alpha\), which is exactly what we would want.

published 2015-09-17