So, what can be done about this problem? We will look at three methods that account for multiple testing.

The first one is called the Bonferroni correction. It simply means that if you do M tests, you multiply the p-values by M. What that correction does is make sure that the probability of a type I error is still at most five percent; that is, the probability that any of the M tests rejects in error is at most five percent. On the other hand, the price you have to pay is that the p-values get bigger; after all, you multiply them by M. Note that this is a very restrictive condition: if you do a large number of tests, say M is a thousand, you guard against having even one false positive. A consequence is that it may be difficult to claim a significant finding even if a noticeable effect is present. For example, if you do a thousand tests and one p-value is, say, 0.1 percent, which would look quite significant on its own, we have to multiply it by a thousand, and then it doesn't look significant at all anymore. So the Bonferroni correction probably works well if you don't have a large number of tests to look at. (A short code sketch of this correction follows the picture example below.)

There is an alternative approach that's much more promising when the number of tests is large, and it has to do with the false discovery proportion, abbreviated FDP. That is the proportion of false discoveries among the total number of discoveries, where a discovery simply means that a test rejects. This is best explained with an example.

Suppose we test 1,000 hypotheses. The picture shows 1,000 squares, each corresponding to one hypothesis. Now, let's assume 900 of these hypotheses are null, meaning there's nothing going on, and 100 of them are alternative hypotheses, so there is an effect there. In the picture, the alternative hypotheses are the blue squares in the upper right corner. Ideally, we would do a test for each of these 1,000 hypotheses and find that the tests reject for the blue squares but not for the green ones. If we do that, we might get an outcome like this: the red squares are tests that reject, so those result in discoveries, and the gray squares are tests that don't reject, so those are the non-discoveries.

Now, remember a test can make two kinds of errors. It can reject the null hypothesis even though it's true; those type I errors are the red squares among the null hypotheses. The second type of error is that it can fail to reject the null even though the alternative is true; those are the gray squares in the upper right corner. What you see in this picture is that among the alternative hypotheses in the upper right corner, the tests do quite well: there are a few errors, but overall they find most of the alternatives. On the other hand, among the null hypotheses there are quite a number of red squares, so the tests make quite a number of errors. But keep in mind that if we use a test with a type I error probability of five percent, which is our usual threshold, then we would expect about five percent of the null hypotheses to be rejected in error. Since we have 900 null hypotheses, five percent of that is about 45 false discoveries, and that's roughly what's going on in this picture. So, that's not at all unusual.
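Both of these points, the Bonferroni adjustment and the roughly 45 expected false rejections among 900 true nulls, can be checked with a few lines of code. Here is a minimal sketch in Python with NumPy; the p-values are simulated and the variable names are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Bonferroni correction: M = 1000 tests, simulated p-values ---
M = 1000
p_values = rng.uniform(size=M)                 # placeholder p-values
p_values[0] = 0.001                            # the "0.1 percent" p-value from the example
p_bonferroni = np.minimum(p_values * M, 1.0)   # multiply by M, cap at 1
print(p_bonferroni[0])                         # 1.0 -- no longer significant

# --- Expected false discoveries without any correction ---
# 900 true nulls tested at the 5% level: each rejects in error with
# probability 0.05, so we expect about 0.05 * 900 = 45 false discoveries.
null_p = rng.uniform(size=900)                 # p-values under the null are uniform
print((null_p <= 0.05).sum())                  # roughly 45, varies with the seed
```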
So, if we tally up the results of this procedure, we see that in the area where the alternative is true, in the upper right corner, we made 80 discoveries and we missed 20. And among the null hypotheses, we made 41 discoveries, which are false discoveries. The false discovery proportion is the number of false discoveries, which is 41, divided by the total number of discoveries, which is 80 + 41 = 121. So the false discovery proportion is 41 over 121, which is about 34 percent. The procedure we discuss next tries to control this proportion, for example, to keep it at five percent. That means we may well allow some red squares, but we want to make sure that those false discoveries are a small fraction of the total number of discoveries.

Here is how the procedure works. It's called a false discovery rate procedure, and it controls the expected proportion of discoveries that are false, just as we explained on the previous slide. The procedure was published in 1995 by Benjamini and Hochberg, so it carries their name. It consists of three steps. First, you sort the p-values, from the smallest, P_(1), up to the largest, P_(m); this is simply an ordering of the p-values by size. Next, you look for the largest number k such that the k-th smallest p-value is not more than k over m times alpha, that is, P_(k) <= (k/m) * alpha. And of course, you would simply write a small piece of computer code to do that for you (a sketch of such code is given at the end of this section). That's pretty much all, because in the third step you declare discoveries for all tests i from 1 to k; that is, you declare the k most significant tests to be discoveries. It turns out that this very simple procedure controls the expected proportion of false discoveries at level alpha.

The third approach we are looking at is called the validation set approach. The idea is that you split your data set into two parts: one is a model-building set, and the other is a validation set. You use the first part, the model-building set, to find something interesting, so you may well do data snooping on that one. After you come up with a hypothesis, you test that hypothesis on the validation set. This is no longer multiple testing, because the test you do on the validation set is only one test, and the validation set is completely separate from the model-building set. However, that approach requires strict discipline: you have to be sure that you never peek at the validation set until you are ready to evaluate your test.
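Here is the small piece of code mentioned above for the Benjamini-Hochberg procedure: a minimal sketch in Python with NumPy that follows the three steps. The function name and the simulated p-values are illustrative, not from the lecture.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of discoveries, controlling the FDR at level alpha."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                            # step 1: sort the p-values
    sorted_p = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m     # (k/m) * alpha for k = 1..m
    below = np.nonzero(sorted_p <= thresholds)[0]
    discoveries = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below[-1]                                # step 2: largest k with P_(k) <= (k/m) * alpha
        discoveries[order[: k + 1]] = True           # step 3: declare the k smallest p-values discoveries
    return discoveries

# Hypothetical example: 10 very small p-values planted among 990 uniform ones.
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 1e-4, size=10), rng.uniform(size=990)])
print(benjamini_hochberg(p, alpha=0.05).sum())       # number of discoveries, typically the 10 planted signals
```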
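The validation set approach can also be sketched in a few lines. The following is a hypothetical example with made-up data: 50 candidate variables, a split into two halves, free snooping on the first half, and then a single correlation test from SciPy on the second half. The data, the split, and the choice of test are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: 200 subjects, 50 candidate variables, one outcome.
X = rng.normal(size=(200, 50))
y = rng.normal(size=200)

# Split the rows once, before looking at anything.
build, valid = X[:100], X[100:]
y_build, y_valid = y[:100], y[100:]

# Model-building set: snoop freely, e.g. pick the variable most correlated with y.
correlations = [abs(stats.pearsonr(build[:, j], y_build)[0]) for j in range(50)]
best_j = int(np.argmax(correlations))

# Validation set: one pre-specified test of the single selected hypothesis.
r, p_value = stats.pearsonr(valid[:, best_j], y_valid)
print(best_j, p_value)   # a single, honest p-value; with pure-noise data it is
                         # usually not significant, which is exactly the point
```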