Severe Testing

Does severe testing overcome underdetermination?

Tetsuji Iseda

"The manner in which the severity criterion eliminates such gellerized alternatives is important for it hinges on the distinctive feature of error statistical approaches -- the centrality of error probabilities.... They [gellarized alternatives] are deemed poorly tested because in gellerized constructions the tactic yields a low or 0 severity test. A primary hypothesis that fits the outcome less well than this type of rigged alternative may actually be better, more severely, tested." (Mayo 1996, 203; emphasis in original)

In chapter 6 of her book, Deborah Mayo formulates one of the key notions of her theory, namely the severity of a test. In the same chapter she applies the notion to the problem of underdetermination of theories by evidence. The above quotation appears as the conclusion of her argument against a version of the underdetermination problem. It seems to me, however, that she has overlooked a line of argument which can overturn her conclusion here.

Before I can raise the criticism, several terms need to be explained -- the severity criterion, gellerized alternatives, the 'fit' of a hypothesis, etc. In the first section, I explain what she means by these terms. In the second section, I briefly look at her argument that gellerized alternatives are not actually severely tested. In the third section, I show why her strategy fails. In the fourth section, I explore a consequence of my argument which is even more disastrous to her overall enterprise.

1. Mayo's severity criterion

There are several different versions of the underdetermination thesis proposed by philosophers, but the one Mayo discusses here is "the thesis of methodological underdetermination (MUD)" (Mayo 1996, 176; hereafter citations refer to Mayo's book unless otherwise noted). MUD claims that, for any hypothesis H and any evidence e supporting H, there is another hypothesis H' which is equally well supported by e. If this thesis is true, evidential factors cannot single out a hypothesis to accept, and we need to appeal to extra-evidential factors. This is quite an uncomfortable consequence for Mayo's project in the book, namely to establish error statistics as a basis of our knowledge.

Mayo tries to deal with this problem by introducing the notion of the severity of a test. It may be true that there are many different alternatives supported by e, but the severity of the test (therefore the degree of support by e) can be different from one hypothesis to another. She articulates characteristics of her notion of severity as follows. First, when a hypothesis H passes the test with result e, the result e must 'fit' H (178). She is open about the exact character of this 'fit', but at least "H does not fit e if e is improbable under H" (179). Second, e's fitting H must constitute a "severe" test of H (179-180). Then, what is a severe test? Mayo proposes a variety of severity criteria slightly different from one another, but the differences are not significant for the purposes of this paper. So let us take the one that she recommends keeping in mind, because of its simplicity (181):

(SC*) The severity criterion for a "pass - fail" test: There is a very high probability that the test procedure T fails H, given that H is false.

For example, suppose that the tested hypothesis H is that a given coin is not 'fair' (fifty-fifty probability of heads and tails), and also suppose that we decide that H passes the test when we get at least 60 heads out of 100 coin-tossing trials. If H is false (i.e., if the coin is fair), there is a very high probability (.97) that the test procedure fails H. This is a severe test. The value of the probability of failure is called the severity of the test, and ranges from 0 to 1. Needless to say, the greater the severity in this sense is, the more severe the test is.
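As a rough check on the .97 figure, the following sketch computes the probability that a fair coin fails the pass rule. It assumes that the rule is one-sided and reads "60 heads" as "at least 60 heads"; the function name severity_of_pass_rule is introduced here only for illustration.

```python
from math import comb


def severity_of_pass_rule(n_tosses: int, pass_threshold: int, p_heads: float = 0.5) -> float:
    """Probability that the rule 'pass H when heads >= pass_threshold' fails H,
    computed for a coin whose probability of heads is p_heads."""
    p_pass = sum(
        comb(n_tosses, k) * p_heads**k * (1 - p_heads) ** (n_tosses - k)
        for k in range(pass_threshold, n_tosses + 1)
    )
    return 1.0 - p_pass


# H: "the coin is not fair". If H is false, the coin is fair (p_heads = 0.5).
severity = severity_of_pass_rule(n_tosses=100, pass_threshold=60)
print(round(severity, 2))
```

Under the stricter reading on which H passes only with exactly 60 heads, the probability of failure would be even higher, so the qualitative point does not depend on this assumption.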

2. Mayo's argument against 'gellerization'

Now it is time to look at the way the severity criterion is used to dismiss MUD. Mayo considers several versions of MUD, but here we concentrate on one of her arguments, the argument against "maximally likely alternatives" (200). A maximally likely alternative is "one constructed after the fact to perfectly fit the data in hand" (ibid.). More precisely, evidence e makes a hypothesis H maximally likely when P(e|H) = 1. For example, when we do curve-fitting with some sample points as data, a curve drawn deliberately to connect all the points fits the data perfectly. Of course we are not impressed with most of these alternatives, but the question is whether Mayo's severity criterion can be used to rule out these maximally likely alternatives. Here she concentrates on a specific procedure to construct a maximally likely alternative: gellerization (200 - 201). The core of Mayo's reply is that the problem is "not with the hypothesis itself, but with the unreliability (lack of severity) of the test procedure as a whole" (201).

Let us take her own example of a gellerized hypothesis test to illustrate her point. Her example is coin-tossing trials (201-202). Suppose that we are testing a hypothesis H0 that a certain coin is fair. Suppose also that we obtain the result H,T,T,H from four trials. We can gellerize a hypothesis from the result, namely that the probability of heads is 1 on trials one and four, and 0 on trials two and three. Of course this hypothesis is maximally likely. In general, we can construct a maximally likely alternative hypothesis G(e) for any result (202):

G(e): the probability of heads equals 1 on just those trials that result in heads, 0 on the others.

Is this not a better hypothesis than H0? According to Mayo, it is not. It is true that G(e) fits better than H0, but there is a significant difference in the severity of the test between H0 and G(e). If we construct G(e) looking at the result and say that G(e) passes the test on the basis of e, then the G(e) thus constructed never fails the test, whether G(e) is false or not. Thus the severity of the test of G(e) is 0. To generalize, Mayo refines her severity criterion as follows (202):

SC with hypothesis construction: There is a very high probability that test procedure T would not pass the hypothesis it tests, given that the hypothesis is false.

Since gellerized hypotheses have low or zero probability of failing in this way, they do not meet the severity criterion.
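The zero severity of the gellerized procedure can be illustrated with a small enumeration over all sixteen outcomes of four tosses. This is only a sketch: the helper names gellerize and passes are hypothetical, and "passing" is taken to mean that the hypothesis assigns probability 1 to the observed sequence.

```python
from itertools import product


def gellerize(outcome):
    """Build G(e) after the fact: P(heads) = 1 on trials that came up heads, 0 on the others."""
    return tuple(1.0 if toss == "H" else 0.0 for toss in outcome)


def passes(hypothesis, outcome):
    """A per-trial hypothesis passes when it gives the observed sequence probability 1."""
    likelihood = 1.0
    for p_heads, toss in zip(hypothesis, outcome):
        likelihood *= p_heads if toss == "H" else 1.0 - p_heads
    return likelihood == 1.0


outcomes = list(product("HT", repeat=4))

# Run the whole procedure (observe e, construct G(e), test G(e) on e) for every outcome.
failures = sum(not passes(gellerize(e), e) for e in outcomes)

# No outcome ever fails G(e), so the probability of failure is 0 however the
# outcomes are distributed, i.e. whatever the true hypothesis about the coin is.
prob_fail = failures / len(outcomes)
print(prob_fail)
```

Because the count of failing outcomes is zero, the result does not depend on which probability distribution over outcomes is assumed.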

The passages quoted at the beginning of this paper appear right after this argument. Now we are in a position to appreciate what she means by these passages. The severity criteria are arranged so as to avoid an erroneous acceptance of a false hypothesis (so-called type II error), and this must be what she calls "the centrality of error probabilities" (203). Mayo does not offer a definition of gellerization. Judging from her two examples of gellerization (one is the coin-tossing trials case discussed here, and the other is the case of a Texas sharpshooter who draws the target after he shoots), gellerization seems to mean that the hypothesis is constructed after the experimental result to provide an excellent fit.

3. A critical analysis of Mayo's argument

Mayo describes the gellerization procedure as "a classically erroneous procedure for constructing maximally likely rivals" (200). Curiously enough, she does not consider any other test procedure which yields maximally likely alternatives, and actually the above quoted passages conclude the section on maximally likely alternatives. If she really wants to refute MUD, this is not excusable. For the existence of one procedure which yields a maximally likely hypothesis, and which can be applied generally, is enough to establish MUD. What I am going to show is that there is a candidate for such a procedure.

According to Mayo's definition, a maximally likely hypothesis is a hypothesis H for which P(e|H) = 1. But to construct such a hypothesis, we do not have to wait until we get the test result. To illustrate the point, take the coin-tossing example used above. Suppose we conduct only four coin-tossing trials. In this case, instead of using G(e) as an alternative, we can construct 16 alternative hypotheses before the test:

H1: P(HHHH)=1
H2: P(HHHT)=1
H3: P(HHTH)=1
H4: P(HTHH)=1
H5: P(THHH)=1
H6: P(HHTT)=1
H7: P(HTHT)=1
H8: P(THHT)=1
H9: P(HTTH)=1
H10: P(THTH)=1
H11: P(TTHH)=1
H12: P(HTTT)=1
H13: P(THTT)=1
H14: P(TTHT)=1
H15: P(TTTH)=1
H16: P(TTTT)=1

For example, H1 says that the probability of 4 straight heads is 1, H2 says that the probability of three successive heads followed by a tail is 1, and so forth. How severe is the test for these alternatives? For any of H1~H16, the probability that the hypothesis will pass the test is 1/16, given H0 is true. Therefore the severity is 15/16 (.9375).[1] This is a fairly severe test in Mayo's sense, and actually more severe than the test of H0.[2] Moreover, the process of the hypothesis construction does not increase the probability that Hi will pass the test, since the construction is done before the test. Therefore none of the 16 hypotheses can be ruled out by Mayo's criterion. But still one of the hypotheses will pass the test (let us call it Hi), and once Hi passes the test, it becomes a maximally likely alternative. This means that Hi fits the evidence better than H0.
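The two claims in this paragraph (each pre-constructed hypothesis has severity 15/16, yet exactly one of them passes whatever happens) can be checked by enumeration. The sketch below follows the simplification used in the text, computing the pass probability on the supposition that H0 (a fair coin) is true; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Each of H1..H16 is identified by the one sequence it says is certain.
outcomes = ["".join(seq) for seq in product("HT", repeat=4)]
hypotheses = list(outcomes)

# Under H0 (fair coin) every sequence of four tosses has probability 1/16.
p_outcome_under_h0 = Fraction(1, 16)

# Hi passes only if the observed sequence is exactly the one it predicts.
severity = {
    h: 1 - sum(p_outcome_under_h0 for e in outcomes if e == h)
    for h in hypotheses
}

# For each possible result, list the pre-constructed hypotheses that pass on it.
passers_per_outcome = {e: [h for h in hypotheses if h == e] for e in outcomes}

print(severity["HTTH"])                                     # severity of H9
print({len(p) for p in passers_per_outcome.values()})       # number of passers per outcome
```

Every hypothesis gets the same severity, and every outcome has exactly one passer, which is the combination the argument relies on.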

If this problem occurred only for coin tossing, it would not be a serious threat to Mayo's argument, but basically the same argument is applicable to curve fitting. Suppose that we are going to obtain n sample points from an experiment. The number of possible combinations of the n sample points is finite.[3] We can draw a curve for each combination which connects the points in it. By this procedure, we can obtain a finite number of alternative curves before the experiment. For example, if we are going to obtain n sample points and each of the points has r possible measurement outcomes, the number of the alternatives is rⁿ. Most of the alternatives have an extremely low probability of passing the test, but at least one of them passes the test, and when it passes, it becomes a maximally likely alternative. In general, the same maneuver can be used whenever (1) the range of the possible outcomes is known before the test; (2) the outcomes do not entail the tested hypothesis.

Of course there is something wrong with this test procedure (by the way, I am looking for a good name for this procedure. Do you have any good ideas?). The test is designed so that one of the absurd alternatives passes the test, come what may. But the problem is whether we can rule out this procedure by refining Mayo's severity criterion.
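A toy version of the curve-fitting case makes the counting concrete. The sketch assumes a hypothetical instrument with r = 3 distinguishable readings and n = 4 sample points, and represents each pre-drawn "curve" simply by the tuple of readings it passes through; none of these particulars come from Mayo.

```python
from itertools import product


def pre_built_alternatives(levels, n_points):
    """Enumerate, before the experiment, one 'curve' per possible combination of readings."""
    return list(product(levels, repeat=n_points))


levels = [0, 1, 2]   # r = 3 possible readings per sample point (hypothetical)
n_points = 4         # n = 4 sample points (hypothetical)

curves = pre_built_alternatives(levels, n_points)
print(len(curves), len(levels) ** n_points)   # both equal r**n

# Whatever the experiment yields, exactly one pre-built curve fits it perfectly.
observed = (2, 0, 1, 1)   # an arbitrary illustrative result
perfect_fits = [c for c in curves if c == observed]
print(perfect_fits)
```

The number of alternatives grows exponentially in n, so each individual curve is very unlikely to pass, while the existence of a perfect fit is guaranteed.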

Here is a possible reply: "the test procedure always passes one of the 16 alternative hypotheses, therefore this is not a severe test." We need to clearly distinguish two different kinds of hypotheses, namely H1~H16 and

H17: One of H1~H16 passes the test.

The above reply says that H17 is not severely tested, but the fact that H17 is not severely tested does not mean that each of H1~H16 is not tested severely. They are just different hypotheses.

Another possible reply: "After we get the test result and pick out the maximally likely hypothesis Hi, Hi cannot fail the test. Therefore Hi is not severely tested." This reply can destroy the severity criterion itself. Of course, after we know that a hypothesis has passed the test, it cannot fail the test. If this can be a reason to say the test is not severe, no hypothesis which passes the test meets the criterion.

One more possible reply: "If we describe the maximally likely hypothesis as

Hj: one of H1~H16 that will pass the test,

then Hj always exists and will always pass the test. Thus Hj is not severely tested. Since Hj coincides with Hi, Hi is not severely tested either." This argument does not work either. Even though Hj and Hi coincide after the fact, the test procedures for Hj and Hi are significantly different. When we speak of any one of H1~H16 before the test, we cannot say that that one will pass the test. On the other hand, we can surely say that Hj will pass the test, whichever it is. Picking out Hj before the test and picking out one of H1~H16 after the test are totally different procedures. Therefore the fact that Hj is not severely tested does not mean that Hi is not severely tested.

Maybe Mayo can avoid the problem by a minor change of her criterion. One possibility is to redefine the 'severity criterion with hypothesis construction' so that it deals with not only a single hypothesis, but also a group of hypotheses, as follows:

SC with hypotheses construction: There is a very high probability that test procedure T would not pass any of the hypotheses it tests, given that all the hypotheses are false.

There are several unfavorable consequences of this criterion, however. First, to judge the severity of a hypothesis, according to this criterion we need to know about alternative hypotheses and the probability that they will pass the test. But if Popper is right, there are always infinitely many alternatives (see Mayo 1996, 209). This makes the evaluation of severity extremely difficult. Second, this criterion mechanically rules out the possibility that H1~H16 are tested severely. But this sounds too strict. For example, according to this criterion, even if we get 100 straight heads, this does not mean that H1, the hypothesis that the probability of 100 straight heads is 1, has passed a severe test. This conclusion follows from the revised version of the criterion because H1 is a part of a set of hypotheses one of which always passes the test (whether they are true or false). Thus I think that this revised version does not work. (Of course this conclusion does not rule out the possibility that Mayo can come up with another minor revision which can deal with the problem.)

By the way, how do Bayesians deal with this problem? Bayesians assign extremely low prior probabilities to the hypotheses gellerized with my procedure. The maximal fit may increase the posterior probability of whichever hypothesis happens to fit the obtained result, but the significant difference in the prior probability will allow us to prefer the main hypothesis (in the coin-tossing trials, H0) to the maximally likely hypothesis.[4] Of course this maneuver is not available to Mayo, since she rejects assignment of prior probabilities to hypotheses.
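The Bayesian calculation summarized in note [4] can be sketched with exact fractions. The sketch assumes, as the note does, that the total prior p is spread evenly over the n constructed hypotheses, that the expectedness of the fitting evidence is 1/n, and that the fitting hypothesis entails its evidence; the value p = 1/1000 is purely illustrative.

```python
from fractions import Fraction


def posterior_of_fitting_hypothesis(p_total: Fraction, n: int) -> Fraction:
    """Posterior P(Hi|Ei) by Bayes's theorem for the hypothesis that happens to fit."""
    prior_hi = p_total / n            # average prior of a single constructed hypothesis
    likelihood = 1                    # Hi entails Ei, so P(Ei|Hi) = 1
    expectedness = Fraction(1, n)     # average P(Ei)
    return prior_hi * likelihood / expectedness


p_total = Fraction(1, 1000)           # hypothetical total prior for the whole family
posterior = posterior_of_fitting_hypothesis(p_total, n=16)
print(posterior)
```

The posterior of the fitting hypothesis equals the small total prior p, whatever n is, so it never overtakes a main hypothesis that starts with a substantial prior.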

For now, I do not have any better idea about how to refine the severity criterion, so my tentative verdict is that Mayo failed to rule out maximally likely hypotheses, thus she failed to reject MUD.

4. An even more disastrous consequence

In the last section, I showed that there are absurd alternatives which are severely tested (in Mayo's sense) and fit the evidence very well. Failure to rule out these alternatives has an even more disastrous consequence to Mayo's notion of severity. In this last section I would like to explore this problem briefly.

Let us look at her severity criterion again:

(SC*) The severity criterion for a "pass - fail" test: There is a very high probability that the test procedure T fails H, given that H is false.

Please notice that the probability is calculated under the condition that H is false, i.e., one of H's rivals is true. This means that the calculation should take into account infinitely many rival hypotheses. Moreover, we should assign prior probabilities to the rivals to calculate the probability of the failing result.[5] How can Mayo assign such probabilities, especially when she accuses Bayesians of assigning subjective prior probabilities (75)? Mayo acknowledges this problem, and proposes a solution: "it [the severity criterion] requires that the severity be high against each such alternative" (195, emphases in original). Thus what she proposes is a revision of the severity criterion in the following way:

(SC**) The severity criterion for a "pass - fail" test of H: For any rival hypothesis H', there is a very high probability that the test procedure T fails H, given H' is true.

It is true that Mayo can avoid assigning prior probabilities in this way, but the result is that almost no hypotheses are severely tested. Some maximally likely hypotheses entail results which also fit the tested hypothesis very well. Just imagine that one such hypothesis, Hi, is true. Given Hi is true, since Hi entails a result fitting the tested hypothesis, the probability that the test procedure T fails H is very low. Therefore H is not severely tested according to (SC**).
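The effect of (SC**) can be illustrated on the four-toss example. The sketch assumes the pass rule for H0 described in note [2] (H0 passes when the sample mean lies between .4 and .6, which with four tosses means exactly two heads) and takes the sixteen deterministic hypotheses as the rivals; the names used are illustrative.

```python
from itertools import product


def h0_passes(outcome: str) -> bool:
    """Pass rule for H0 (fair coin): sample mean of heads within [.4, .6], i.e. two heads in four."""
    return outcome.count("H") == 2


# Each rival says that one particular sequence occurs with probability 1.
rivals = ["".join(seq) for seq in product("HT", repeat=4)]

# If a rival is true, its sequence occurs for certain, so the probability that
# T fails H0 given that rival is 1 when the sequence fails H0 and 0 otherwise.
severity_against = {r: 0.0 if h0_passes(r) else 1.0 for r in rivals}

worst_case = min(severity_against.values())
blocking_rivals = sorted(r for r, s in severity_against.items() if s == 0.0)

print(worst_case)
print(blocking_rivals)
```

Six of the sixteen rivals entail a result with exactly two heads, and against each of them the severity of the test of H0 is 0, so H0 does not meet (SC**).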

Of course she is not claiming that we can construct a severe test for any hypothesis. But if paradigmatic cases of good experimental design do not meet her severity criterion, this is certainly an unfavorable consequence. And as I argued above, these maximally likely hypotheses can be constructed whenever (1) the range of the possible outcomes is known before the test; and (2) the outcomes do not entail the tested hypothesis. I believe that typical experimental settings usually meet these conditions. Therefore if Mayo wants to claim that her severity criterion is a realistic one, then she needs to revise her severity criterion to rule out this problem.

Again, this problem can be easily solved if she admits that she needs to assign prior probabilities to hypotheses. Of course by this move she loses her advantage over Bayesians, but this seems to me a cost worth paying to save her severity criterion.

5. Conclusion

In this paper, I have argued that Mayo's argument to rule out maximally likely hypotheses is not sufficient for her purpose. I need to add a caveat here. I have not established MUD in Mayo's original sense. My argument does not apply to certain experimental situations (e.g. those in which the range of experimental results is totally unpredictable). But I believe that the scope of my argument is wide enough to urge Mayo to revise her severity criterion.[6]

Reference

Mayo, D. (1996) Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.

Notes

[1] Actually there is a problem in this reasoning, but this is a problem of her severity criterion itself. I discuss the problem in section 4 below.

[2] In our four-trial case, if we decide to reject H0 when the sample mean is more than .6 or less than .4, the probability that H0 will pass the test is 6/16, given H0 is true.

[3] The number of possible outcomes is finite because (1) any specific measurement procedure has a limit of resolution, and (2) any specific measurement procedure has upper and lower limits of values obtainable by the measurement. Therefore whenever we choose an experimental design, we also pick up a finite set of possible outcomes of the experiment.

[4] Technical details: if the prior probability that one of the maximally likely hypotheses is true is p, and the number of the gellerized hypotheses is n, then the average prior probability that a maximally likely hypothesis Hi is true is p/n. The average expectedness of evidence Ei that fits Hi is 1/n. If Hi entails Ei, then P(Ei|Hi) = 1, so by Bayes's theorem the average posterior probability is P(Hi|Ei) = (p/n) / (1/n) = p. Since p is supposed to be very small, usually we do not have to worry about these hypotheses. When we do have to, there may be good reasons for that.

[5] In the ordinary Bayesian way, the probability of the result E under hypotheses H1~Hn is given by the formula

Σ P(Hi) P(E|Hi), where the sum is taken over i = 1, ..., n.

[6] I am thankful to Professor Suppe and students in his seminar for their really helpful comments on the first draft of this paper.