When performing multiple hypothesis tests, we should rightly be concerned that by repeated testing we might reject null hypotheses more frequently than our \(\alpha\)-levels (or equivalently, \(p\)-values) would guarantee. In particular, the family-wise error rate, the probability of rejecting at least one true null hypothesis, might be very different from the \(\alpha\)-level used in each individual test. In general, if we were to conduct \(k\) independent hypothesis tests and all the null hypotheses were true, we would reject at least one null hypothesis with probability \(1 - (1 - \alpha)^k\). For \(k = 2\) this is \(2\alpha - \alpha^2\). As we run more tests, we risk believing we would reject the truth in any individual test only, say, 5% of the time, when in fact we would be rejecting a true null far more often. XKCD says it better than I ever could.
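To see how quickly this compounds, we can tabulate \(1 - (1 - \alpha)^k\) for a few values of \(k\); a quick sketch in Python (purely illustrative):

```python
# Family-wise error rate for k independent tests, each at level alpha:
# P(at least one false rejection) = 1 - (1 - alpha)^k
alpha = 0.05

for k in [1, 2, 5, 10, 20]:
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:2d}: FWER = {fwer:.4f}")
```

Already at \(k = 2\) the family-wise rate is 9.75%, nearly double the nominal 5%.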

Let’s say you wish to conduct a hypothesis test based on some test statistic \(Z\) that is Normally distributed with mean zero and variance one under the null hypothesis (in the following, \(\Phi^{-1}\) is the quantile function for \(Z\)). In general, you care about alternatives under which \(Z\) still has unit variance but is now centered somewhere else. You could either test a two-sided hypothesis, where you would reject the null if \(Z\) is less than \(\Phi^{-1}(\alpha/2)\) or greater than \(\Phi^{-1}(1 - \alpha/2)\), or a one-sided hypothesis, where you would reject the null if \(Z\) exceeds \(\Phi^{-1}(1 - \alpha)\). Both of these tests would have proper size, rejecting the truth only \(100\alpha\%\) of the time. Individually, either of these tests would meet the usual basic requirement we place on tests.
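These cutoffs and sizes are easy to check numerically. A standard-library Python sketch (the variable names are mine, purely illustrative):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard Normal under the null

# One-sided test: reject when Z exceeds the (1 - alpha) quantile.
one_sided_cut = z.inv_cdf(1 - alpha)       # approx. 1.645

# Two-sided test: reject when Z falls outside the central (1 - alpha) region.
two_sided_lo = z.inv_cdf(alpha / 2)        # approx. -1.960
two_sided_hi = z.inv_cdf(1 - alpha / 2)    # approx.  1.960

# Each test, taken on its own, has size alpha:
size_one = 1 - z.cdf(one_sided_cut)
size_two = z.cdf(two_sided_lo) + (1 - z.cdf(two_sided_hi))
print(size_one, size_two)  # both approx. 0.05
```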

But what about multiple testing? Should we be concerned? What is the probability of rejecting one or both tests when \(Z\) is distributed as specified under the null hypothesis (i.e., standard Normal)? For the one-sided test, we would reject when \(Z > \Phi^{-1}(1 - \alpha)\). For the two-sided test, we reject when \(Z > \Phi^{-1}(1 - \alpha/2)\) or \(Z < \Phi^{-1}(\alpha/2)\). Putting these together:

\begin{align} & P([Z > \Phi^{-1}(1 - \alpha)] \cup [Z > \Phi^{-1}(1 - \alpha/2)] \cup [Z < \Phi^{-1}(\alpha/2)])\\ &= P([Z > \Phi^{-1}(1 - \alpha)] \cup [Z < \Phi^{-1}(\alpha/2)])\\ &= 1 - P(\Phi^{-1}(\alpha/2) < Z < \Phi^{-1}(1 - \alpha)) \\ &= 1 - (1 - \alpha - \alpha/2) = \frac{3\alpha}{2} \end{align}

(The event \([Z > \Phi^{-1}(1 - \alpha/2)]\) drops out of the union because it is contained in \([Z > \Phi^{-1}(1 - \alpha)]\), since \(\Phi^{-1}(1 - \alpha) < \Phi^{-1}(1 - \alpha/2)\).) So we would reject at least one of the tests more often than \(\alpha\), but the result is not as bad as with independent tests. For example, with two independent tests and \(\alpha = 0.05\), we would reject at least one of the two true nulls 9.75% of the time, but we would reject at least one of the two tests of the same null only 7.5% of the time. This difference comes from the strong dependence introduced by testing the same null with the same test statistic. Since rejecting the two-sided test in the upper tail implies rejecting the one-sided test, we have fewer opportunities to mislead ourselves with respect to error rates than when considering independent tests. For independent tests, we can do little better than applying Bonferroni corrections across the multiple tests, but with known dependence, we can greatly improve testing procedures while maintaining family-wise error rates, for example by using a closed testing procedure.
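The \(3\alpha/2\) result can be verified numerically against the independent-test rate; a short Python sketch (illustrative):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()

# Union of the rejection regions when the SAME Z is used for both tests:
# reject if Z < Phi^{-1}(alpha/2) or Z > Phi^{-1}(1 - alpha).
# (The two-sided upper cutoff Phi^{-1}(1 - alpha/2) lies inside the
# one-sided rejection region, so it drops out of the union.)
p_union = z.cdf(z.inv_cdf(alpha / 2)) + (1 - z.cdf(z.inv_cdf(1 - alpha)))

print(p_union)               # approx. 0.075 = 3*alpha/2
print(1 - (1 - alpha) ** 2)  # independent-test rate, 0.0975
```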

Finally, it is worth noting that if the researcher first conducts a two-sided test and then unconditionally discards the result in order to perform a one-sided test, all of the usual guarantees about the testing procedure hold. The probability of rejecting the one-sided test would be \(P(Z > \Phi^{-1}(1 - \alpha)) = \alpha\), which is exactly what we would want.

published 2015-09-17

Comments »
I’ve added two new working papers on the use of covariance adjustment in causal and randomization inference, particularly using machine learning techniques (with Jake Bowers, Ben B. Hansen, and Costa Panagopoulos), and on improving test statistic selection for inference in spillover models (with Jake Bowers and Peter M. Aronow). Abstracts and PDFs can be found on my Academics page.

published 2015-09-08

Comments »
*Update: The meeting is Friday, 9/18 at 4pm at the statistics annex (909 Nevada). Sign up for emails about future meetings.*

For the September meeting of the W. S. Gosset Society, we’ll pay honor to our namesake and investigate Gosset’s 1908 paper and its continuing influence:

- Student (1908). The Probable Error of a Mean. *Biometrika*, 6: 1-25
- Zabell, S. L. (2008). On Student’s 1908 Article ‘The Probable Error of a Mean.’ *Journal of the American Statistical Association*, 103(481): 1-7
- Delaigle, A.; Hall, P. & Jin, J. (2011). Robustness and accuracy of methods for high dimensional data analysis based on Student’s t-statistic. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 73: 283-301
- Chatterjee, A. & Lahiri, S. N. (2013). Rates of convergence of the Adaptive LASSO estimators to the Oracle distribution and higher order refinements by the bootstrap. *Ann. Statist.*, 41(3): 1232-1259

I’d like to hold the meeting during the week of 9/13 – 9/19. If you are interested in joining the group, please send me your availability that week (`mark.m.fredrickson@gmail.com`).

*The W. S. Gosset Society is an informal reading group for UIUC statistics students. The group focuses on connecting modern research to seminal works in the field of statistics. Whenever possible, we serve as a venue for students to present their own research and get feedback.*

published 2015-09-05

Comments »
2012-04-29

Notes from the 5th St. Louis Area Methods Meeting

2011-08-20

Citizen responses under election and sortition

2011-08-02

Brief thoughts on Polmeth 2011 and a notice of a working paper/poster.

2011-07-16

A few thoughts after a month of using the "test_that" package.

2011-06-03

Announcing the public development of optmatch and RItools as well as the publication of the complete blocking, balancing, and analysis article.

2011-05-22

Highlights from ACIC 2011 at UMich

2011-04-29

Theory can play a useful role in designing experiments, but keep it out of the analysis.

2011-04-26

The first in a series of posts on using Optmatch and RItools for experimental studies.

2011-04-20

A brief summary of SLAMM 2011.

2011-04-16

How do young researchers find partners for field experiments in the developing world?

2011-04-14

A formal model of party platforms with an intuitive result.

2011-04-03

My article "Collaboration for Social Scientists" published in The Political Methodologist

2011-03-29

Are juries democratic? Schwartzberg tackles this question.

2011-03-18

Do elections destabilize African nations?

2011-02-27

Tracing voters moves within and across states.

2011-02-20

Participant observation in Wisconsin.

2011-02-19

Can twin studies tell us about mediating factors between genetics and behavior?

2011-02-06

Exploring R function environments

2011-01-30

An interesting pair of puzzles concerning opinions on capital punishment.

2011-01-26

A travel guide to creating a dissertation.

2011-01-22

The Political Science department invited Kevin Clarke of Rochester for a talk on his upcoming book, "A Model Discipline."

2010-12-06

See the "Academics" section for some new working papers

2010-11-12

Stratification of large matching problems has many benefits.

2010-08-06

How to generate combinations one at a time.

2010-08-02

Assessing balance adjustment using xBalance and the MatchIt package

2010-07-30

A soup-to-nuts propensity score matching analysis.

2010-07-07

Why not use more than the ASCII character set when programming?

2010-06-23

In which the author considers similar bugs in R and JavaScript.

2010-06-17

Dropping MacPorts, picking up a tall one

2010-06-07