Committee on the History and Philosophy of Science
Department of Philosophy
University of Maryland, College Park
This paper analyzes Deborah Mayo's recent criticism of the use-novelty requirement. She claims that her severity criterion captures actual scientific practice better than use-novelty, and that use-novelty is not a necessary condition for severity. Although she is right that there are certain cases in which evidence used for the construction of a hypothesis can test the hypothesis severely, I do not think that her severity criterion fits our intuitions about good tests better than use-novelty does. I argue for this by showing a parallelism in terms of severity between the confidence interval case and what she calls 'gellerization'. To account for the difference between these cases, we need to take into account certain additional considerations, such as a systematic neglect of relevant alternatives.
1. Introduction. Deborah Mayo's recent analysis of the notion of use-novelty (1996, especially chs. 6, 8 and 9) adds interesting new insights on the issue of desirable requirements for hypothesis testing. Her conclusion is that sometimes evidence used to construct a theory can also be used to test the theory severely. The purpose of this paper is to examine her analysis. First, we look at these two notions, use-novelty and severity. Then I summarize Mayo's criticism of use-novelty. I also show that Mayo is right about this claim in the most extreme cases. Then I proceed to a more realistic case, confidence interval estimation. Even here, there is one way to interpret severity and use-construction so that some evidence used to construct a theory tests the theory severely, but if she takes this interpretation, even what she calls 'gellerized' hypotheses (the worst kind of use-constructed hypotheses) should be considered as severely tested. In the last section, I discuss briefly the implications of my analyses.
2. Use-novelty and Severity. The notion of bold conjecture as a criterion of good testing has a long history, but Popper was the one who brought the notion into contemporary debate (e.g., Popper 1962, 241; see also Giere 1983 for the historical background of this idea). Philosophers influenced by Popper have tried many different ways to refine this idea, like Lakatos's requirement of temporal-novelty. Among them, the notion of use-novelty, proposed by Zahar (1973) and Worrall (1978a, 1978b, 1989), seems to be the most promising one.1
The basic idea of use-novelty is simple: if evidence e is used in the construction of a hypothesis H, i.e. if H is deliberately constructed so that H fits with e, then e does not count as support for H. Even though this idea has some intuitive plausibility, there have been several attacks on it from various directions.2 Among them, Mayo's criticism seems particularly forceful because she shares most of the background intuitions with those who advocate use-novelty.
Mayo claims that her severity criterion captures Popper's intuition and scientists' practice better than use-novelty, and that use-novelty is not a necessary condition for severity in her sense. Mayo proposes several different formulations of her severity criterion (Mayo 1996, 180-181), but the differences between them are not important for our present purposes. Let us pick a simple formulation which suits our purposes here (181):
(SC*) The severity criterion for a "pass-fail" test: There is a very high probability that the test procedure T fails H, given that H is false.
For example, suppose that the tested hypothesis H is that a given coin is not 'fair' (fifty-fifty probability of heads and tails), and also suppose that we decide that H passes the test when we get more than 60 heads or fewer than 40 heads out of 100 coin-tossing trials. If H is false (i.e., if the coin is fair), there is a very high probability that the test procedure fails H. This is a severe test in Mayo's sense. The value of the probability of failure is called the severity of the test, and ranges from 0 to 1. The greater the severity in this sense, the more severe the test is. In terms of the relationship with evidence e, H passes the test when it fits well with e, but it is important to note that two hypotheses which fit equally well with e can be totally different in terms of the severity of the test.
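The severity of this coin test can be computed directly from the binomial distribution. The following is only an illustrative sketch: the function names are my own, and the calculation assumes 100 independent tosses, with H failing exactly when the number of heads lies between 40 and 60 inclusive.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n independent tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_pass(n: int, p: float, low: int, high: int) -> float:
    """Probability that H ("the coin is not fair") passes:
    fewer than `low` heads or more than `high` heads."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k < low or k > high)

# Severity in the sense of (SC*): the probability that the test fails H,
# given that H is false, i.e. given that the coin is fair (p = 0.5).
severity = 1 - prob_pass(n=100, p=0.5, low=40, high=60)
print(f"severity = {severity:.3f}")
```

Under these assumptions the severity comes out at about 0.96, which is the sense in which the test fails H with very high probability when the coin is in fact fair.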
In her criticism of the use-novelty requirement, Mayo first shows that logically speaking use-novelty and severity are independent. Then she proceeds to some concrete cases (presumably examples of good scientific reasoning) in which use-novelty seems violated while the severity criterion is met. So let us look at her conceptual argument first.
Through analyses of Worrall's (1989) and Giere's (1983) arguments for use-novelty, Mayo finds that in both cases there is a background intuition that a test should be reliable or severe behind the use-novelty requirement (Mayo 1996, 263-271). Mayo's strategy is to show that this background intuition does not necessarily require use-novelty. Mayo summarizes Giere's reasons why use-novelty is required in the following way (268-269; I paraphrase Mayo's summary):
1. A successful fit does not count as a good test of a hypothesis if such a success is highly probable even if the hypothesis is incorrect. (That is, a test of H is poor if its severity is low.)
2. If H is use-constructed, then a successful fit is assured, no matter what. Therefore the probability of its success is high ("near unity") even if it is false.
Proposition (1) is virtually the same as Mayo's severity criterion, and naturally Mayo has no complaint with it. But, according to Mayo, the first sentence of (2) does not support the conclusion. The successful fit is assured "no matter what the data are," but this is not the same as "no matter if H is true or false," which is required to draw the conclusion (269).3 More formally, Mayo distinguishes the following two probabilities (270-271)4:
A. The probability that (use-constructed) test T passes the hypothesis it tests
B. The probability that (use-constructed) test T passes the hypothesis it tests, given that it is false.
If the hypothesis is use-constructed, probability A is 1 (or at least close to 1). However, according to Mayo, probability B can be less than 1 (or even small) when probability A is 1. Since severity is calculated by probability B, not probability A, a use-constructed hypothesis can be severely tested. Here we should note that it is provable that if probability A is 1, probability B is also 1 unless the probability that the hypothesis is false is 0. There are four categories in the sample space (Table 1).
Table 1. The four categories in the sample space.
              H is correct    H is false
T passes H    C1              C2
T fails H     C3              C4
A is calculated by [C1+C2] / [C1+C2+C3+C4]. B is calculated by C2 / [C2+C4]. Now, A is unity only when both C3 and C4 are empty. Therefore, if A is unity, B reduces to C2 / C2 (because C4 is empty). Thus B is also unity. This calculation breaks down only when both C2 and C4 are empty, in which case the result depends on how we define 0 / 0 in probability calculus. This sounds like a very limited case, but there is a way for Mayo to get around this problem, as we will see soon. But let us first consider a case that does illustrate the conceptual difference successfully.
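The relation between A and B can be checked mechanically. In the sketch below, C1 to C4 are treated as non-negative integer weights of the four cells of Table 1; the particular numbers are hypothetical and serve only as illustration, and the 0 / 0 case is returned as None rather than given a value.

```python
from fractions import Fraction

def prob_A(c1: int, c2: int, c3: int, c4: int) -> Fraction:
    """Probability A: the test passes the hypothesis, (C1+C2) / (C1+C2+C3+C4)."""
    return Fraction(c1 + c2, c1 + c2 + c3 + c4)

def prob_B(c2: int, c4: int):
    """Probability B: the test passes the hypothesis given that it is false,
    C2 / (C2+C4). Returns None in the undefined 0 / 0 case."""
    if c2 + c4 == 0:
        return None
    return Fraction(c2, c2 + c4)

# Hypothetical weights: A = 1 forces C3 = C4 = 0, so B = C2 / C2 = 1.
print(prob_A(7, 3, 0, 0), prob_B(3, 0))

# A below 1 leaves room for B below 1.
print(prob_A(6, 1, 2, 1), prob_B(1, 1))

# The exceptional case: the hypothesis cannot be false (C2 = C4 = 0).
print(prob_A(10, 0, 0, 0), prob_B(0, 0))
```

The third call corresponds to the situation in which the hypothesis is guaranteed to be true by construction, where B is simply undefined unless a convention for 0 / 0 is adopted.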
The example Mayo uses is the average SAT score of the students in a given class (Mayo 1996, 271-272). When you sum up all the SAT scores of the students in the class and divide the sum by the number of the students, you obtain a number, say, 1121. From this, you can construct a hypothesis that the average SAT score of the students in the class is 1121. Here you used the data to construct the hypothesis, and therefore probability A, the probability that the hypothesis fits with the data about the individual scores, is 1. But probability B, namely the probability that the hypothesis passes the test given that the average is not 1121, depends on the way we define 0 / 0 in probability calculus. If we are allowed to appeal to intuition here, probability B seems to be 0; for, if the average is not 1121, then the result of the calculation cannot be 1121, so the hypothesis that the average is 1121 never passes the test (it is not even constructed).5
Therefore, Mayo is right in that the violation of the use-novelty requirement does not entail a violation of the severity criterion. But one question arises here: is this result widely applicable? In the average SAT score case, the hypothesis is assured to be true by virtue of the construction procedure. This is certainly a very exceptional case, and if the difference matters only in this kind of case, the distinction between use-novelty and severity is not a big deal in real scientific practice.
3. Problems with the Severity Criterion. In this section, our argument concentrates on two cases: one in which the use-constructed hypothesis is intuitively admissible, and claimed to be tested severely; the other in which the use-constructed hypothesis is intuitively inadmissible and claimed to be not severely tested. What I am going to show is that these two cases are actually parallel with respect to the severity of the test. This implies that the severity criterion cannot account for the difference in the intuitive admissibility between these cases.
3-1. Confidence Interval. Mayo's second example of severe testing of a use-constructed hypothesis is the confidence interval (Mayo 1996, 272). Suppose that a random sample of the U.S. population is polled and the result shows that 45 percent of the sample approves of the President with a margin of error of 3 percent (let us call this observed sample proportion e). We can construct a hypothesis from e.
H(e): the population proportion of approval of the President is in the interval between 42 percent and 48 percent.
If the margin of error is set for 95 percent confidence, this result means that the probability that the true population proportion is inside this interval is 95 percent.6 Let us call this procedure CI. CI includes details about the random sampling, data analysis and decision rules.
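For concreteness, the procedure CI can be sketched as follows. The text does not give a sample size, so N = 1056 and 475 approvals are hypothetical values chosen only because they reproduce roughly 45 percent with a 3-point margin; the normal-approximation interval is likewise my assumption about how the margin is computed. The 95 percent figure is read here as the long-run coverage of the procedure, computed for an assumed true proportion of 0.45.

```python
from math import exp, lgamma, log, sqrt

Z_95 = 1.96  # standard normal quantile for a two-sided 95 percent interval

def wald_interval(successes: int, n: int, z: float = Z_95):
    """Normal-approximation interval around the observed sample proportion."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability, computed in log space to avoid overflow for large n."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def coverage(n: int, p_true: float) -> float:
    """Probability, over repeated samples, that the constructed interval
    contains the true proportion p_true."""
    total = 0.0
    for k in range(n + 1):
        low, high = wald_interval(k, n)
        if low <= p_true <= high:
            total += binom_pmf(k, n, p_true)
    return total

N = 1056          # hypothetical sample size
SUCCESSES = 475   # hypothetical number of approvals (about 45 percent)

low, high = wald_interval(SUCCESSES, N)
print(f"H(e): between {low:.3f} and {high:.3f}")
print(f"coverage at p = 0.45: {coverage(N, 0.45):.3f}")
```

The interval is a function of the observed proportion alone, which is what makes H(e) use-constructed; the coverage figure is a property of the procedure, not of any single interval.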
This is a widely used method, and, intuitively speaking, this sounds admissible. At the same time, this is a case in which the most relevant evidence is used in the construction of the hypothesis. Then the question is whether the evidence constitutes a severe test. Given the above proof about probability A and probability B, the answer should be no, because in this case the probability that H(e) is false is not 0, which necessitates that probability B be 1, given that probability A is 1. But, as I mentioned above, this is not what Mayo means to say. Mayo claims that this case is a severe test: "the fit with the resulting hypothesis (the interval estimate H(e) ) is given by the specified margin of error d, say 2 standard deviation units. It is rare for so good a fit with H(e) to occur unless the interval estimate H(e) is true (i.e. unless the population proportion really is within 2 standard deviation units of e). So the severity in passing H(e) is high" (1996, 273).
To understand her argument better, it is convenient to distinguish three different interpretations of the severity criterion applied to this case:
(SC*-1) There is a very high probability that the test procedure CI fails H(e), given that H(e) is false.
(SC*-2) There is a very high probability that the test procedure CI fails the H(x) constructed in each case, given that the H(x) is false in that case.
(SC*-3) There is a very high probability that the test procedure CI fails H(e) when the obtained sample proportion is e, given that H(e) is false.
x refers to the observed proportions generated through the same procedure in various cases (and e is a particular instance of x in the present case). H(e) refers to the concrete hypothesis in the present case, namely the hypothesis that the population proportion of approval of the President is in the interval between 42 percent and 48 percent, while H(x) refers to the hypothesis constructed using the sample proportion obtained in each case (which may be different from the present one). Let us look at SC*-1 first. If H(e) is false, the same observed sample proportion as the present one, e, is highly unlikely, and H(e) will not pass the test (H(e) will not even be constructed without e). Thus the test seems severe, as Mayo argues. But there is a problem with SC*-1, given her purpose for introducing this case. The corresponding probability A, namely the probability that the test procedure CI passes H(e), is not unity. It is easy to imagine a case in which we follow exactly the same procedure, and still construct (and pass) a different H(x) due to a different sample proportion x. Because the sampling is supposed to be random (this is a part of the definition of the situation, see above), the possibility of different sample proportions seems almost inevitable. Thus, even though we need further background information to calculate such a probability, probability A is almost certainly not unity, and therefore this is not the case of use-construction Giere was discussing.7
There are (at least) two interpretations of the severity criterion which make probability A unity, and SC*-2 and SC*-3 represent them. SC*-2 refers not to the present H(e), but to the H(x) to be constructed in each case. The probability that H(x) fits with the observed sample proportion x in that case is unity, given the testing procedure. SC*-3 tries to solve the problem by adding one more constraint, namely the constraint that the obtained sample proportion is e (this may sound like a very ad hoc modification, but if we see this as a part of procedure CI, it sounds more natural). In either case, the corresponding probability A is unity, but, as a result, B is also unity, thus the severity of the test calculated by either of them becomes zero (the good fit between the constructed hypothesis and the observed proportion is already assured whether the hypothesis is false or not). Whether we compare H(x) in each case with x, or we compare the present H(e) with e, the very good fit is guaranteed. Again, Mayo has trouble here.
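The contrast between SC*-1 and SC*-2 can be made numerically explicit. The sketch below rests on several assumptions of mine that are not in the text: a hypothetical sample size of 1056, a true proportion of 0.50 as the case in which H(e) is false, and the convention that the fixed H(e) is constructed (and so passed) only when the observed proportion lies within half a percentage point of 45 percent.

```python
from math import exp, lgamma, log, sqrt

N = 1056             # hypothetical sample size
P_FALSE_CASE = 0.50  # assumed true proportion; outside 42-48 percent, so H(e) is false
E = 0.45             # the observed proportion of the present case

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability, computed in log space to avoid overflow."""
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))

def wald_interval(k: int, n: int, z: float = 1.96):
    """The H(x) that procedure CI constructs from k approvals out of n."""
    p_hat = k / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# SC*-1 reading: the fixed H(e) passes only if the sample again yields about e.
pass_fixed = sum(binom_pmf(k, N, P_FALSE_CASE)
                 for k in range(N + 1) if abs(k / N - E) < 0.005)
severity_sc1 = 1 - pass_fixed

# SC*-2 reading: CI passes whichever H(x) it has just constructed, so among
# the cases in which that H(x) is false, the pass probability is still 1.
false_mass = 0.0
passed_and_false = 0.0
for k in range(N + 1):
    low, high = wald_interval(k, N)
    if not (low <= P_FALSE_CASE <= high):   # H(x) is false in this case
        mass = binom_pmf(k, N, P_FALSE_CASE)
        false_mass += mass
        passed_and_false += mass            # the constructed H(x) always passes
severity_sc2 = 1 - passed_and_false / false_mass

print(f"SC*-1 severity: {severity_sc1:.4f}")
print(f"SC*-2 severity: {severity_sc2:.4f}")
```

On these assumptions the first reading gives a severity close to 1 and the second gives exactly 0, which is the tension described above.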
What is the best strategy for Mayo to get out of this trouble? The reply I would choose is the following. The judgment of use-construction is made by means of either
A': The probability that the test procedure CI passes the H(x) constructed in each case, or
A'': The probability that the test procedure CI passes H(e) when the obtained sample proportion is e.
But the calculation of severity has nothing to do with these probabilities, that is, severity is calculated through SC*-1. Thus, still, the severity criterion can account for a case of use-construction which is intuitively admissible. By adopting this reply, Mayo has to give up her original line of argument based on the contrast between probability A and probability B, but she can retain the substantial point, namely the superiority of her severity criterion over the use-novelty criterion.
3-2. Gellerization. When we interpret Mayo in the way I suggested at the end of the last section, however, a further problem occurs. Many other intuitively inadmissible use-construction procedures can be severely tested according to this view! Let me show this using what Mayo calls 'gellerization' as an example.
Gellerization is a typical erroneous way of creating "maximally likely alternatives" (Mayo 1996, 200). Mayo's example of gellerization is coin-tossing trials (201-202). Suppose that we are testing a hypothesis H0 that a certain coin is fair. Suppose also that we obtain the results of H,T,T,H from four trials (let us call this obtained sequence e). We can gellerize a hypothesis that the probability of heads is 1 on trials one and four, 0 on two and three. In general, we can construct a maximally likely alternative hypothesis G(e) for any result (202):
G(e): the probability of heads equals 1 on just those trials that result in heads, 0 on the others.
G(e) fits better with e than H0 does, because G(e) entails e, while H0 merely makes e plausible. Is G(e) not a better hypothesis than H0? According to Mayo, it is not. Think about the following test procedure (202): "Fail (null) hypothesis H0 and pass the maximally likely hypothesis G(e) on the basis of data e." Let us call this procedure GL. Mayo claims: "There is no probability that test [GL] would not pass G(e), even if G(e) were false. Hence the severity is minimal (i.e., 0)" (ibid.; emphasis in original). Thus, even though G(e) fits better with e, H0 is a better hypothesis in the sense that H0 is tested more severely than G(e).
So, Mayo arrives at a radically different conclusion from the case of the confidence interval. But are they that different in terms of severity? Let us introduce three alternative interpretations of the severity criterion parallel to the above SC*-1 to SC*-3:
(SC*-4) There is a very high probability that the test procedure GL fails G(e), given that G(e) is false.
(SC*-5) There is a very high probability that the test procedure GL fails the G(x) constructed in each case, given that the G(x) is false in that case.
(SC*-6) There is a very high probability that the test procedure GL fails G(e) when the obtained sequence is e, given that G(e) is false.
Given the above interpretation of Mayo's severity criterion, SC*-4, parallel to SC*-1, is the natural candidate to be used for the calculation. But the result of such a calculation does not match with her account of the severity of the test. Suppose that G(e) is false and H0 is true.8 Then the probability that G(e) passes the test is 1/16 (roughly .06), because G(e) passes the test only when the obtained sequence is H,T,T,H. In fact, in other cases, the same G(e) as the present one is not even constructed, because of the different sequences. Thus the severity is high according to SC*-4. Probability A for G(e) corresponding to SC*-4 seems fairly low, though, again, we need more background information for the calculation. On the other hand, we obtain a zero severity if we use SC*-5 or SC*-6 for the calculation. But if Mayo wants to use one of them in the case of gellerization, she needs to explain why she does not use the corresponding SC*-2 or SC*-3 in the case of the confidence interval. Also, the probability A corresponding to SC*-5 or SC*-6 is unity. To sum up: the case of confidence interval and the case of gellerization have a strong parallelism in terms of the assessment of use-construction and the estimation of severity. If one is use-constructed, so is the other. If one is tested severely, so is the other. If one is not tested severely, the other is not either.
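The two readings can be compared by enumerating all sixteen possible sequences of four tosses. This is only a sketch under the supposition made above, namely that G(e) is false because H0 (a fair coin) is true; the variable names are mine.

```python
from fractions import Fraction
from itertools import product

TRIALS = 4
observed = ("H", "T", "T", "H")   # the sequence e of the present case

# Under H0 (fair coin) all 2**4 sequences are equally probable.
sequences = list(product("HT", repeat=TRIALS))
prob_each = Fraction(1, len(sequences))

# SC*-4 reading: the fixed G(e) passes only when the sequence obtained is e itself.
pass_fixed = sum(prob_each for s in sequences if s == observed)
severity_sc4 = 1 - pass_fixed

# SC*-5 reading: GL passes whichever G(x) it constructs from the sequence x
# obtained, so the constructed hypothesis passes for every possible sequence.
pass_constructed = sum(prob_each for s in sequences)
severity_sc5 = 1 - pass_constructed

print(f"P(G(e) passes | H0) = {pass_fixed}")
print(f"severity under SC*-4 = {severity_sc4}")
print(f"severity under SC*-5 = {severity_sc5}")
```

The enumeration reproduces the 1/16 figure, giving a severity of 15/16 under SC*-4 and 0 under SC*-5, in parallel with the two readings of the confidence interval case.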
This result suggests that neither the use-novelty nor the severity criterion can account for our intuition that the confidence interval estimation is a good method and gellerization is not. What Mayo needs to do is find a good reason for the differential applications of the severity criterion. I do not think she provides this in her book, however, and in this respect her treatment of the use-novelty issue is deeply unsatisfactory.
4. Systematic Neglect of Relevant Alternatives. If my assessment is valid and I can rely on the intuition that CI estimation is a good testing procedure and gellerization is a bad one, then it seems to me that neither severity nor use-novelty provides a satisfactory account of a good test as it stands. Then, how can we account for the difference between the confidence interval and gellerization? Let me conclude this paper with one attempt at such an account. In the case of the confidence interval, the class of all hypotheses that can be use-constructed (the class of all hypotheses of the form "The population proportion of approval of the President is in the interval between x percent and y percent" with concrete numbers in x and y) is exhaustive in the sense that at least one of the hypotheses is true. Thus, simply eliminating implausible alternatives can lead us to a reliable hypothesis. The same thing does not apply to gellerization. As Mayo recognizes, the test procedure rejects H0 come what may (Mayo 1996, 202), and this makes the class of possible use-constructed hypotheses non-exhaustive. I would like to call this feature of gellerization a systematic neglect of relevant alternatives. I think this may be the key to refining the notions of severity and use-novelty: namely, a theory construction and testing procedure is not justified if the procedure involves a systematic neglect of relevant alternatives. Though I do not have enough space to develop this idea here, certain initial thoughts may help to clarify it.
Let me illustrate the idea with certain historical cases. The acceptance of the general theory of relativity based on the anomalous advance of the perihelion of Mercury seems justified even if the phenomenon is used in the construction of the theory (see Brush 1989 and Earman 1992). Why? Because the general theory of relativity was the only available theory which could account for the phenomenon. No neglect of relevant alternatives was involved. On the other hand, sticking to geocentrism in the face of heliocentrism seems unjustified, if it is done by systematic ad hoc changes to save geocentrism; such ad hoc changes seem to involve a systematic neglect of a relevant alternative, namely heliocentrism. I think that this is the true reason why use-construction is considered a bad scientific method in ordinary cases.
A few more words are in order about the qualifications 'systematic' and 'relevant'. The problem with gellerization is that we know we can reject H0 come what may, even before we obtain the data. Similarly, if we can tell how to update geocentric theory so that we can reject heliocentric theory come what may even before we look at the data, something is going wrong. The problem is the systematic procedure for accommodating any new data, which makes empirical data useless. I agree with Lakatos (1970) that sometimes ad hoc changes to save the hard core of a research programme are desirable, but I disagree with him on where to draw the line between admissible and inadmissible ad hoc changes.
The reason I talk about 'relevant' alternatives is that there are alternatives which are rightly neglected because of their irrelevance. For example, alternatives of the form G(x) in the gellerization case are usually neglected in ordinary coin-tossing trials. Such alternatives call for a very strange mechanism involved in the coin tossing, and we usually have no reason to suspect the existence of such a mechanism. If we did consider G(x) as relevant alternatives, we would conduct a different kind of test (rather than mere coin tossing) which can differentiate H0 from alternatives of the form G(x).
Can we formalize the idea of systematic neglect of relevant alternatives into a criterion, like Mayo's severity criterion? I am not sure if such a formalization is possible. The idea may be represented by an unfairly low probability of accepting a member of a certain class of relevant alternatives given any possible outcome. But 'unfairly low probability' sounds too subjective to be incorporated into a formal rule. Still, it seems to me that any attempt to refine the ideas of use-novelty and severity should take into account this kind of consideration.
Brush, Stephen G. (1989), "Prediction and Theory Evaluation: The Case of Light Bending", Science 246: 1124-1129.
Earman, John (1992), Bayes or Bust?: A Critical Examination of Bayesian Confirmation Theory. Cambridge: The MIT Press.
Giere, Ronald N. (1983), "Testing theoretical hypotheses", in John Earman, (ed.), Testing Scientific Theories, Minnesota Studies in the Philosophy of Science vol. X. Minneapolis: University of Minnesota Press, 269-298.
Howson, Colin and Peter Urbach (1993), Scientific Reasoning: The Bayesian Approach, second edition. La Salle: Open Court.
Lakatos, Imre (1970) "Falsification and the methodology of scientific research programmes", in Imre Lakatos and Alan Musgrave, (eds.) Criticism and the Growth of Knowledge. Cambridge: Cambridge University Press, 91-196.
Mayo, Deborah G. (1996), Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.
Musgrave, Alan (1978), "Evidential Support, Falsification, Heuristics, and Anarchism", in Gerard Radnitzky and Gunnar Andersson, (eds.), Progress and Rationality in Science, Boston Studies in the Philosophy of Science vol. LVIII. Dordrecht: Reidel, 181-201.
Popper, Karl (1962), Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Worrall, John (1978a), "The Ways in Which the Methodology of Scientific Research Programmes Improves on Popper's Methodology", in Gerard Radnitzky and Gunnar Andersson, (eds.), Progress and Rationality in Science, Boston Studies in the Philosophy of Science vol. LVIII. Dordrecht: Reidel, 45-70.
Worrall, John (1978b), "Research Programmes, Empirical Support, and the Duhem Problem: Replies to Criticism", in Gerard Radnitzky and Gunnar Andersson, (eds.), Progress and Rationality in Science, Boston Studies in the Philosophy of Science vol. LVIII. Dordrecht: Reidel, 321-338.
Worrall, John (1989), "Fresnel, Poisson and the White Spot: The Role of Successful Predictions in the Acceptance of Scientific Theories", in David Gooding, Trevor Pinch and Simon Schaffer (eds.), The Uses of Experiment: Studies in the Natural Sciences. Cambridge: Cambridge University Press, 135-157.
Zahar, Elie G. (1973), "Why Did Einstein's Programme Supersede Lorentz's?", The British Journal for the Philosophy of Science 24: 93-123 and 223-262.
+ I am very thankful to Professor Mayo for her detailed comments on earlier versions of this paper, and for her patience in replying to my questions. I would also like to thank faculty members and graduate students at the University of Maryland, especially Rob Skipper and Nancy Hall, for their comments.
1 Actually 'use-novelty' is not the term these authors use to describe their position. For example, Worrall seems to prefer 'heuristic account' or 'Zahar-Worrall criterion' as the name of their position (1989, 148-155). I use the terms 'use-novelty' and 'use-construction' because these are the terms Mayo uses in her analyses.
I do not have enough space to discuss other proposed notions of novelty and to compare the relative merits of these notions. See, for example, the exchange between Musgrave (1978) and Worrall (1978b) for more about this.
2 For example, Brush (1989) revealed a historical case (acceptance of the general theory of relativity) in which scientists preferred non-novel evidence (the advance of the perihelion of Mercury, which, as Earman 1992 shows, was used in the construction of the theory) to novel evidence (star-light bending by the sun). Another attack comes from Bayesians who argue that use-novelty is not a necessary condition for good evidence from the Bayesian point of view (Howson and Urbach 1993, 408-409).
3 Actually Giere does claim that the evidence used for the construction does not constitute a good test because the test passes the hypothesis "no matter whether the corresponding hypothesis is true or false" (Giere 1983, 286). This suggests that Mayo misunderstands Giere's argument, but maybe it is Giere who misstates his own position. Here, I assume that Mayo correctly characterizes Giere for the sake of the argument.
4 To quote exactly, Mayo's own formulation of B is "..., even if it is false" (270). I replaced it with "given that" because (1) "even if" is a very ambiguous expression, and (2) Mayo herself uses the "given that" version in a general (not restricted to use-construction) formulation of the probability in B (269).
5 Here, actually I am exploiting an equivocation in the severity criterion to interpret Mayo's argument most charitably. The equivocation will show up as a serious problem later in this paper.
6 Actually this is a bit inaccurate. Strictly speaking, a 95 percent confidence interval means that whatever the true population value is, 95% of the estimates to be produced with this procedure include the true value. Once the estimate is produced, either the true value is inside or not; namely the probability is either 1 or 0. See Mayo 1996, 273 n.14. By the way, this 95% has nothing to do with Mayo's severity. To use the categories in Table 1, this 95% corresponds to C1 / [C1+C2], which has no direct relationship with C2 / [C2+C4].
7 Of course, even when we understand H(e) in this way, H(e) is in some sense use-constructed. In the draft version of this paper, I used a distinction between 'accidentally use-constructed' and 'genuinely use-constructed' hypotheses, but the matter is more with the way we look at the hypotheses than with the hypotheses themselves (Mayo pointed this out to me in personal communication).
8 Obviously there are many other alternative hypotheses of the form "probability of heads is p (≠ .5)," or even other possible maximally likely hypotheses like "The probability of heads is 1 on trials one and two, 0 on three and four." How can we calculate probability B when "G(e) is false" means that one of these alternatives is true? This is particularly problematic for Mayo because she rejects assigning prior probabilities to hypotheses (1996, 75). Mayo's answer is that her severity criterion "requires that severity be high against each such alternative" (195; emphasis in original). This sounds like too strict a requirement, but she does not mean all logically possible alternatives. As her argument in chapter 5 of the book shows (and as she pointed out to me in her private communication), it is enough for her if severity is high against all relevant alternatives specified by the research design. Therefore, since the question here is the comparison of H0 and G(e), I think I am warranted in concentrating on H0 where the severity criterion (formally) requires us to consider all other alternatives.
By the way, if Mayo does not add this qualification about the relevant alternatives, then it can be shown that almost always there is at least one maximally likely alternative which is incompatible with the tested hypothesis and still assures a high probability of a passing result for the hypothesis (though I do not have enough space to prove it here). This makes the severity criterion almost vacuous.