I am a 2nd year PhD student in the Statistics Department of the University of Illinois at Urbana-Champaign. My research interests are experimental design and randomization inference, causal inference, non-parametric statistics, and computational methods. My current research expands upon my published work on evaluating spillover hypotheses in randomized experiments. I am currently focused on finding optimal test statistics for experiments with spillover and writing models that best capture the scientific theories explaining spillover and contagion. I am also working on connecting randomization inference with empirical likelihood methods and Bayesian approaches.

My background spans statistics, computer science, and social science. I began my graduate studies at UIUC in the Political Science Department studying American politics and political methodology. During this time, I also pursued a concurrent master’s degree in statistics (completed August, 2012). In 2014, I transferred to the Statistics Department to focus on my statistical training and research. I still maintain a strong interest in social science research, particularly political science, and draw upon substantive social science questions to motivate my statistical work. With an undergraduate degree in computer science and several years as a professional programmer, I continue to look for better ways to organize, implement, and disseminate research through improved tools. I am especially concerned about reproducible research and how the tools we use shape the questions we ask.


On performing two hypothesis tests with the same null.

When performing multiple hypothesis tests, we should rightly be concerned that, by repeated testing, we might reject null hypotheses more frequently than our \(\alpha\)-levels (or, equivalently, \(p\)-values) would guarantee. In particular, the family-wise error rate, the probability of rejecting at least one true null hypothesis, might be very different from the \(\alpha\) level used in each individual test. In general, if we were to conduct \(k\) independent hypothesis tests and all of the null hypotheses were true, we would reject at least one null hypothesis with probability \(1 - (1 - \alpha)^k\). For \(k = 2\) this is \(2\alpha - \alpha^2\). The more tests we run, the greater the risk: we might believe we would reject the truth in any individual test only, say, 5% of the time, while actually rejecting at least one true null far more often. XKCD says it better than I ever could.
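As a quick back-of-the-envelope check, here is a minimal Python sketch (my own illustration, plain Python) evaluating \(1 - (1 - \alpha)^k\) for a few values of \(k\):

```python
# Family-wise error rate for k independent level-alpha tests when all nulls are true:
# P(reject at least one) = 1 - (1 - alpha)^k
alpha = 0.05
for k in (1, 2, 3, 5, 10):
    print(k, round(1 - (1 - alpha) ** k, 4))
# k = 2 gives 2*alpha - alpha^2 = 0.0975; by k = 10 the rate already exceeds 0.40
```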

Let’s say you wish to conduct a hypothesis test based on some test statistic \(Z\) that is Normally distributed with mean zero and variance one under the null hypothesis (in what follows, \(\Phi^{-1}\) is the quantile function of the standard Normal distribution). In general, you care about alternatives under which \(Z\) still has unit variance but is now centered somewhere else. You could either test a two-sided hypothesis, where you would reject the null if \(Z\) is less than \(\Phi^{-1}(\alpha/2)\) or greater than \(\Phi^{-1}(1 - \alpha/2)\), or a one-sided hypothesis, where you would reject the null if \(Z\) exceeds \(\Phi^{-1}(1 - \alpha)\). Both of these tests have proper size, rejecting a true null only with probability \(\alpha\). Individually, either of these tests would meet the usual basic requirement we place on tests.
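Concretely, the two rejection rules look like the following sketch, using scipy.stats.norm.ppf for \(\Phi^{-1}\) (the function names are just illustrative):

```python
import numpy as np
from scipy.stats import norm

def reject_one_sided(z, alpha=0.05):
    # Reject when Z exceeds Phi^{-1}(1 - alpha)
    return z > norm.ppf(1 - alpha)

def reject_two_sided(z, alpha=0.05):
    # Reject when Z is below Phi^{-1}(alpha/2) or above Phi^{-1}(1 - alpha/2)
    return (z < norm.ppf(alpha / 2)) | (z > norm.ppf(1 - alpha / 2))

# Under the null, Z ~ N(0, 1), and each test alone has size alpha
rng = np.random.default_rng(1)
z = rng.standard_normal(1_000_000)
print(reject_one_sided(z).mean())   # roughly 0.05
print(reject_two_sided(z).mean())   # roughly 0.05
```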

But what about multiple testing? Should we be concerned? What is the probability of rejecting one or both tests when \(Z\) is distributed as specified under the null hypothesis (i.e., standard Normal)? For the one-sided test, we would reject when \(Z > \Phi^{-1}(1 - \alpha)\). For the two-sided test, we reject when \(Z > \Phi^{-1}(1 - \alpha/2)\) or \(Z < \Phi^{-1}(\alpha/2)\). Putting these together:

\begin{align}
& P([Z > \Phi^{-1}(1 - \alpha)] \cup [Z > \Phi^{-1}(1 - \alpha/2)] \cup [Z < \Phi^{-1}(\alpha/2)])\\
&= P([Z > \Phi^{-1}(1 - \alpha)] \cup [Z < \Phi^{-1}(\alpha/2)])\\
&= 1 - P(\Phi^{-1}(\alpha/2) < Z < \Phi^{-1}(1 - \alpha)) \\
&= 1 - (1 - \alpha - \alpha/2) = \frac{3\alpha}{2}
\end{align}

The second line uses the fact that \(\Phi^{-1}(1 - \alpha) < \Phi^{-1}(1 - \alpha/2)\), so the event \([Z > \Phi^{-1}(1 - \alpha/2)]\) is contained in \([Z > \Phi^{-1}(1 - \alpha)]\).

So we would reject at least one of the tests more often than \(\alpha\), but the result is not as bad as with independent tests. For example, with two independent tests and \(\alpha = 0.05\), we would reject at least one of the two true nulls 9.75% of the time, but we would reject at least one of the two tests of the same null only 7.5% of the time. This difference comes from the strong dependence introduced by testing the same null with the same test statistic. Since an upper-tail rejection of the two-sided test implies a rejection of the one-sided test, the two rejection regions overlap substantially, and we have fewer opportunities to mislead ourselves with respect to error rates than when considering independent tests. For independent tests, we can do little better than applying Bonferroni corrections across the multiple tests, but with known dependence we can greatly improve the testing procedure while maintaining the family-wise error rate, for example by using a closed testing procedure.
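A quick Monte Carlo check of both figures (again only a sketch, mirroring the rejection rules above):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
rng = np.random.default_rng(2)

# Two tests of the same null, applied to the SAME statistic Z ~ N(0, 1)
z = rng.standard_normal(1_000_000)
one_sided = z > norm.ppf(1 - alpha)
two_sided = (z < norm.ppf(alpha / 2)) | (z > norm.ppf(1 - alpha / 2))
print((one_sided | two_sided).mean())   # roughly 3*alpha/2 = 0.075

# Two one-sided tests applied to INDEPENDENT statistics
z2 = rng.standard_normal((1_000_000, 2))
print((z2 > norm.ppf(1 - alpha)).any(axis=1).mean())  # roughly 1 - (1 - alpha)^2 = 0.0975
```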

Finally, it is worth noting that if the researcher first performs a two-tailed test and then decides, unconditionally on its outcome, to throw away the result and perform a one-tailed test instead, all of the usual guarantees about the testing procedure hold. The probability of rejecting the one-tailed test would be \(P(Z > \Phi^{-1}(1 - \alpha)) = \alpha\), which is exactly what we would want.
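One last sketch of that point: because the decision to switch does not depend on the data, the reported one-tailed test keeps its nominal size. For contrast, the final line shows the rate for a researcher who reports a rejection whenever either test rejects, which is just the \(3\alpha/2\) union event from above.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
rng = np.random.default_rng(3)
z = rng.standard_normal(1_000_000)

# Step 1: the two-tailed test, whose outcome is then discarded unconditionally
two_tailed = (z < norm.ppf(alpha / 2)) | (z > norm.ppf(1 - alpha / 2))

# Step 2: only the one-tailed test is reported, regardless of step 1's outcome
one_tailed = z > norm.ppf(1 - alpha)
print(one_tailed.mean())                  # roughly alpha = 0.05, as claimed

# Contrast: rejecting whenever either test rejects is the union event from above
print((two_tailed | one_tailed).mean())   # roughly 3*alpha/2 = 0.075, not alpha
```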
