Committee on the History and Philosophy of Science

Department of Philosophy

University of Maryland, College Park

Abstract

This paper analyzes Deborah Mayo's recent criticism of the use-novelty requirement. She claims that her severity criterion captures actual scientific practice better than use-novelty does, and that use-novelty is not a necessary condition for severity. Even though she is right that there are certain cases in which evidence used in the construction of a hypothesis can test that hypothesis severely, I do not think that her severity criterion fits our intuitions about good tests better than use-novelty does. I argue for this by showing a parallelism, in terms of severity, between the confidence interval case and what she calls 'gellerization'. To account for the difference between these cases, we need to take into account additional considerations, such as a systematic neglect of relevant alternatives.

**1. Introduction.** Deborah Mayo's recent analysis of the notion of
use-novelty (1996, especially chs. 6, 8 and 9) adds interesting new
insights on
the issue of desirable requirements for hypothesis testing. Her
conclusion is
that sometimes evidence used to construct a theory can also be used to
test the
theory severely. The purpose of this paper is to examine her analysis. First, we look at these two notions, use-novelty and severity. Then I summarize Mayo's criticism of use-novelty, and show that Mayo is right about this claim in the most extreme cases. Then I proceed to a more realistic case, confidence interval estimation. Even here, there is one way to interpret severity and use-construction so that some evidence used to construct a theory tests the theory severely; but if Mayo takes this interpretation, even what she calls 'gellerized' hypotheses (the worst kind of use-constructed hypotheses) should be counted as severely tested. In the last section, I briefly discuss the implications of my analyses.

**2. Use-novelty and Severity.** The notion of bold conjecture as a
criterion of good testing has a long history, but Popper was the one who
brought the notion into contemporary debate (e.g., Popper 1962, 241; see
also
Giere 1983 for the historical background of this idea). Philosophers
influenced
by Popper have tried many different ways to refine this idea, like
Lakatos's
requirement of temporal-novelty. Among them, the notion of use-novelty,
proposed by Zahar (1973) and Worrall (1978a, 1978b, 1989), seems to be
the most
promising one.^{1}

The basic idea of use-novelty is simple: *if evidence e is used in the
construction of a hypothesis H, i.e. if H is deliberately constructed so
that H
fits with e, then e does not count as support for H.* Even though this idea has some intuitive plausibility, there have been several attacks on use-novelty from various directions.^{2} Among them,
Mayo's
criticism seems particularly forceful because she shares most of the
background
intuitions with those who advocate use-novelty.

Mayo claims that her severity criterion captures Popper's intuition and scientists' practice better than use-novelty does, and that use-novelty is not a necessary condition for severity in her sense. Mayo proposes several different formulations of her severity criterion (Mayo 1996, 180-181), but the differences between them are not important for our present purposes. Let us pick a simple formulation which suits our purposes here (181):

(SC*) The severity criterion for a "pass-fail" test: There is a very high probability that the test procedure T fails H, given that H is false.

For example, suppose that the tested hypothesis H is that a given coin is not 'fair' (fifty-fifty probability of heads and tails), and also suppose that we decide that H passes the test when we get more than 60 heads or less than 40 heads out of 100 coin-tossing trials. If H is false (i.e., if the coin is fair), there is a very high probability that the test procedure fails H. This is a severe test in Mayo's sense. The value of the probability of failure is called the severity of the test, and ranges from 0 to 1. The greater the severity in this sense, the more severe the test is. In terms of the relationship with evidence e, H passes the test when it fits well with e, but it is important to note that two hypotheses which fit equally well with e can be totally different in terms of the severity of the test.
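The numbers in the coin example can be checked with a short computation. The following is a minimal Python sketch, not part of Mayo's text; the pass rule and trial count come from the example above, and the helper names are mine:

```python
from math import comb

# Pass rule from the text: H ("the coin is not fair") passes the test
# when 100 tosses yield more than 60 or fewer than 40 heads.
def passes(heads: int) -> bool:
    return heads > 60 or heads < 40

# Binomial probability of k heads in n tosses when the coin IS fair
# (i.e., when H is false).
def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100
p_pass_given_false = sum(binom_pmf(k, n) for k in range(n + 1) if passes(k))
severity = 1 - p_pass_given_false  # P(test fails H | H is false)

print(f"P(pass | H false) = {p_pass_given_false:.4f}")  # small
print(f"severity          = {severity:.4f}")            # close to 1
```

Since the sampling distribution under "H is false" is exactly binomial here, no simulation is needed; the severity comes out well above 0.9.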

In her criticism of the use-novelty requirement, Mayo first shows that logically speaking use-novelty and severity are independent. Then she proceeds to some concrete cases (presumably examples of good scientific reasoning) in which use-novelty seems violated while the severity criterion is met. So let us look at her conceptual argument first.

Through analyses of Worrall's (1989) and Giere's (1983) arguments for use-novelty, Mayo finds that in both cases there is a background intuition that a test should be reliable or severe behind the use-novelty requirement (Mayo 1996, 263-271). Mayo's strategy is to show that this background intuition does not necessarily require use-novelty. Mayo summarizes Giere's reasons why use-novelty is required in the following way (268-269; I paraphrase Mayo's summary):

1. A successful fit does not count as a good test of a hypothesis if such a success is highly probable even if the hypothesis is incorrect. (That is, a test of H is poor if its severity is low.)

2. If H is use-constructed, then a successful fit is assured, no matter what. Therefore the probability of its success is high ("near unity") even if H is false.

Proposition (1) is virtually the same as Mayo's severity criterion, and
naturally Mayo has no complaint with it. But, according to Mayo, the
first
sentence of (2) does not support the conclusion. The successful fit is
assured
"no matter what the data are," but this is not the same as "no matter if
H is
true or false," which is required to draw the conclusion
(269).^{3}
More formally, Mayo distinguishes the following two probabilities
(270-271)^{4}:

A. The probability that (use-constructed) test T passes the hypothesis it tests

B. The probability that (use-constructed) test T passes the hypothesis it tests, given that it is false.

If the hypothesis is use-constructed, probability A is 1 (or at least close to 1). However, according to Mayo, probability B can be less than 1 (or even small) when probability A is 1. Since severity is calculated from probability B, not probability A, a use-constructed hypothesis can be severely tested. Here we should note that it is provable that if probability A is 1, probability B is also 1, unless the probability that the hypothesis is false is 0. There are four categories in the sample space (Table 1).

|            | H is correct | H is false |
|------------|--------------|------------|
| T passes H | C1           | C2         |
| T fails H  | C3           | C4         |

**Table 1**

A is calculated as [C1+C2] / [C1+C2+C3+C4]. B is calculated as [C2] / [C2+C4]. Now, A is unity only when both C3 and C4 are empty. Therefore, if A is unity, B is equivalent to C2 / C2 (because C4 is empty), so B is also unity. This calculation breaks down only when both C2 and C4 are empty, in which case the result depends on how we define 0 / 0 in the probability calculus. This sounds like a very limited case, but there is a way for Mayo to get around this problem, as we will see soon. But let us first consider a case that does illustrate the conceptual difference successfully.
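The calculation above can be sketched in a few lines of Python; the cell masses passed in are illustrative, and the function names are mine:

```python
from fractions import Fraction

# Cell masses from Table 1:
# C1: T passes H, H correct   C2: T passes H, H false
# C3: T fails H, H correct    C4: T fails H, H false
def prob_A(c1, c2, c3, c4):
    # P(T passes H)
    return Fraction(c1 + c2, c1 + c2 + c3 + c4)

def prob_B(c1, c2, c3, c4):
    # P(T passes H | H is false); 0/0 (undefined) when C2 = C4 = 0
    if c2 + c4 == 0:
        return None
    return Fraction(c2, c2 + c4)

# If A = 1 (so C3 = C4 = 0) and H is sometimes false (C2 > 0), B = 1 too.
assert prob_A(6, 2, 0, 0) == 1 and prob_B(6, 2, 0, 0) == 1
# Only when C2 = C4 = 0 (H is never false) does B reduce to 0/0.
assert prob_A(8, 0, 0, 0) == 1 and prob_B(8, 0, 0, 0) is None
```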

The example Mayo uses is the average SAT score of the students in a given
class
(Mayo 1996, 271-272). When you sum up all the SAT scores of the students
in the
class and divide the sum by the number of the students, you obtain a
number,
say, 1121. From this, you can construct a hypothesis that the average SAT
score
of the students in the class is 1121. Here you used the data to construct
the
hypothesis, and therefore probability A, the probability that the
hypothesis
fits with the data about the individual scores, is 1. But probability B, namely the probability that the hypothesis passes the test given that the average is not 1121, depends on how we define 0 / 0 in the probability calculus. If we are allowed to appeal to intuition here, probability B seems to be 0; for, if the average is not 1121, then the result of the calculation cannot be 1121, so the hypothesis that the average is 1121 never passes the test (it is not even constructed).^{5}
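The SAT case can be rendered as a toy computation; this is my own illustration (the scores are invented), not an example from Mayo's text:

```python
import random

# Use-construct the hypothesis "the average is m" from the data itself,
# then 'test' it against the same data.
def use_construct_and_test(scores):
    m = sum(scores) / len(scores)          # construction step uses the data
    return m == sum(scores) / len(scores)  # the fit is guaranteed

random.seed(0)
trials = [use_construct_and_test([random.randint(400, 1600) for _ in range(30)])
          for _ in range(1000)]
print(all(trials))  # True: probability A is 1, the fit never fails
# The condition "H passes although H is false" never occurs, so
# probability B is the undefined ratio 0/0.
```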

Therefore, Mayo is right that a violation of the use-novelty requirement does not entail a violation of the severity criterion. But one question arises here: is this result widely applicable? In the average SAT score case, the hypothesis is assured to be true by virtue of the construction procedure. This is certainly a very exceptional case, and if the difference matters only in this kind of case, the distinction between use-novelty and severity is not a big deal in real scientific practice.

**3. Problems with the Severity Criterion.** In this section, our argument concentrates on two cases: one in which the use-constructed hypothesis is intuitively admissible and is claimed to be severely tested; the other in which the use-constructed hypothesis is intuitively inadmissible and is claimed not to be severely tested. What I am going to show is that these two cases are actually parallel with respect to the severity of the test. This implies that the severity criterion cannot account for the difference in intuitive admissibility between the cases.

*3-1. Confidence Interval.* Mayo's second example of severe testing
of a
use-constructed hypothesis is the confidence interval (Mayo 1996, 272).
Suppose
that a random sample of the U.S. population is polled and the result
shows that
45 percent of the sample approves of the President with a margin of error
of 3
percent (let us call this observed sample proportion e). We can construct
a
hypothesis from e.

H(e): the population proportion of approval of the President is in the interval between 42 percent and 48 percent.

If the margin of error is set at 95 percent confidence, this result means that the probability that the true population proportion is inside this interval is 95 percent.^{6} Let us call this procedure CI. CI includes details about the random sampling, data analysis, and decision rules.
This is a widely used method, and, intuitively speaking, this sounds admissible. At the same time, this is a case in which the most relevant evidence is used in the construction of the hypothesis. Then the question is whether the evidence constitutes a severe test. Given the above proof about probability A and probability B, the answer should be no, because in this case the probability that H(e) is false is not 0, which necessitates that probability B be 1, given that probability A is 1. But, as I mentioned above, this is not what Mayo means to say. Mayo claims that this case is a severe test: "the fit with the resulting hypothesis (the interval estimate H(e) ) is given by the specified margin of error d, say 2 standard deviation units. It is rare for so good a fit with H(e) to occur unless the interval estimate H(e) is true (i.e. unless the population proportion really is within 2 standard deviation units of e). So the severity in passing H(e) is high" (1996, 273).

To understand her argument better, it is convenient to distinguish three different interpretations of the severity criterion applied to this case:

(SC*-1) There is a very high probability that the test procedure CI fails H(e), given that H(e) is false.

(SC*-2) There is a very high probability that the test procedure CI fails the H(x) constructed in each case, given that the H(x) is false in that case.

(SC*-3) There is a very high probability that the test procedure CI fails H(e) when the obtained sample proportion is e, given that H(e) is false.

x refers to the observed proportions generated through the same procedure
in
various cases (and e is a particular instance of x in the present case).
H(e)
refers to the concrete hypothesis in the present case, namely the
hypothesis
that the population proportion of approval of the President is in the
interval
between 42 percent and 48 percent, while H(x) refers to the hypothesis
constructed using the sample proportion obtained in each case (which may
be
different from the present one). Let us look at SC*-1 first. If H(e) is
false,
the same observed sample proportion as the present one, e, is highly
unlikely,
and H(e) will not pass the test (H(e) will not be even constructed
without e).
Thus the test seems severe, as Mayo argues. But there is a problem with
SC*-1,
given her purpose for introducing this case. The corresponding probability A, namely the probability that the test procedure CI passes H(e), is not unity. It
is easy to imagine a case in which we follow exactly the same procedure,
and
still construct (and pass) a different H(x) due to a different sample
proportion x. Because the sampling is supposed to be random (this is a
part of
the definition of the situation, see above), the possibility of different
sample proportions seems almost inevitable. Thus, even though we need
further
background information to calculate such a probability, probability A is
almost
certainly not unity, therefore this is not the case of use-construction
Giere
was discussing.^{7}

There are (at least) two interpretations of the severity criterion which make probability A unity, and SC*-2 and SC*-3 represent them. SC*-2 refers not to the present H(e), but to the H(x) to be constructed in each case. The probability that H(x) fits with the observed sample proportion x in that case is unity, given the testing procedure. SC*-3 tries to solve the problem by adding one more constraint, namely the constraint that the obtained sample proportion is e (this may sound like a very ad hoc modification, but if we see it as a part of procedure CI, it sounds more natural). In either case, the corresponding probability A is unity, but, as a result, B is also unity, and thus the severity of the test calculated by either of them becomes zero (the good fit between the constructed hypothesis and the observed proportion is already assured whether the hypothesis is false or not). Whether we compare the H(x) in each case with x, or we compare the present H(e) with e, the very good fit is guaranteed. Again, Mayo is in trouble here.

What is the best strategy for Mayo to get out of this trouble? The reply I would choose is the following. The judgment of use-construction is done by either

A': The probability that the test procedure CI fails the H(x) constructed in each case,

or

A'': The probability that the test procedure CI fails H(e) when the obtained sample proportion is e.

But the calculation of severity has nothing to do with these probabilities, that is, severity is calculated through SC*-1. Thus, still, the severity criterion can account for a case of use-construction which is intuitively admissible. By adopting this reply, Mayo has to give up her original line of argument based on the contrast between probability A and probability B, but she can retain the substantial point, namely the superiority of her severity criterion over the use-novelty criterion.

*3-2. Gellerization.* When we interpret Mayo in the way I suggested
at the
end of the last section, however, a further problem occurs. Many other
intuitively inadmissible use-construction procedures can be severely
tested
according to this view! Let me show this using what Mayo calls
'gellerization'
as an example.

Gellerization is a typical erroneous way of creating "maximally likely alternatives" (Mayo 1996, 200). Mayo's example of gellerization is coin-tossing trials (201-202). Suppose that we are testing a hypothesis H0 that a certain coin is fair. Suppose also that we obtain the results of H,T,T,H from four trials (let us call this obtained sequence e). We can gellerize a hypothesis that the probability of heads is 1 on trials one and four, 0 on two and three. In general, we can construct a maximally likely alternative hypothesis G(e) for any result (202):

G(e): the probability of heads equals 1 on just those trials that result in heads, 0 on the others.

G(e) fits better with e than H0 does, because G(e) entails e, while H0
merely
makes e plausible. Is G(e) not a better hypothesis than H0? According to
Mayo,
it is not. Think about the following test procedure (202): "Fail (null)
hypothesis H0 and pass the maximally likely hypothesis G(e) on the basis
of
data e." Let us call this procedure GL. Mayo claims: "There is no
probability
that test [GL] would *not* pass G(e), even if G(e) were false. Hence
the
severity is minimal (i.e., 0)" (ibid.; emphasis in original). Thus, even
though
G(e) fits better with e, H0 is a better hypothesis in the sense that H0
is
tested more severely than G(e).
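The contrast between fit and severity in the gellerization case can be spelled out in a few lines. The sequence follows the text; the function names are my own shorthand:

```python
from itertools import product

e = ('H', 'T', 'T', 'H')   # the observed sequence from the text

def lik_H0(seq):
    # Likelihood of any sequence under H0 (fair coin): (1/2)^len(seq)
    return 0.5 ** len(seq)

def lik_G(seq, g):
    # Likelihood under the gellerized G(g): 1 if seq matches g, else 0
    return 1.0 if seq == g else 0.0

# G(e) fits e maximally well, while H0 merely makes e one of 16 outcomes.
print(lik_G(e, e), lik_H0(e))   # 1.0 0.0625

# But procedure GL constructs and passes some G(x) for EVERY possible
# outcome x, so a maximal fit is assured no matter what the data are.
always_passes = all(lik_G(x, x) == 1.0 for x in product('HT', repeat=4))
print(always_passes)   # True
```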

So Mayo arrives at a radically different conclusion from the one in the case of the confidence interval. But are the two cases really that different in terms of severity? Let us introduce three alternative interpretations of the severity criterion, parallel to SC*-1 through SC*-3 above:

(SC*-4) There is a very high probability that the test procedure GL fails G(e), given that G(e) is false.

(SC*-5) There is a very high probability that the test procedure GL fails the G(x) constructed in each case, given that the G(x) is false in that case.

(SC*-6) There is a very high probability that the test procedure GL fails G(e) when the obtained sequence is e, given that G(e) is false.

Given the above interpretation of Mayo's severity criterion, SC*-4,
parallel to
SC*-1, is the natural candidate to be used for the calculation. But the
result of such a calculation does not match her account of the severity of the test. Suppose that G(e) is false and H0 is true.^{8} Then the
probability that G(e) passes the test is 1/16 (roughly .06), because G(e)
passes the test only when the obtained sequence is H,T,T,H. In fact, in
other
cases, the same G(e) as the present one is not even constructed, because
of the
different sequences. Thus the severity is high according to SC*-4.
Probability
A for G(e) corresponding to SC*-4 seems fairly low, though, again, we
need more
background information for the calculation. On the other hand, we obtain
a zero
severity if we use SC*-5 or SC*-6 for the calculation. But if Mayo wants
to use
one of them in the case of gellerization, she needs to explain why she
does not
use the corresponding SC*-2 or SC*-3 in the case of the confidence
interval.
Also, the probability A corresponding to SC*-5 or SC*-6 is unity. To sum up: the case of the confidence interval and the case of gellerization exhibit a strong parallelism in terms of the assessment of use-construction and the estimation of severity. If one is use-constructed, so is the other. If one is tested severely, so is the other. If one is not tested severely, neither is the other.
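The parallelism can also be checked numerically. Under the SC*-4 reading we fix the particular G(e) and ask how often a fair coin (H0 true, so G(e) false) produces exactly the passing sequence. This Monte Carlo sketch is my own illustration of that reading:

```python
import random

random.seed(2)
e = ('H', 'T', 'T', 'H')   # the particular gellerized pattern from the text

def toss4():
    return tuple(random.choice('HT') for _ in range(4))

# SC*-4 reading: only the exact sequence e passes this fixed G(e).
N = 200_000
hits = sum(toss4() == e for _ in range(N))
print(f"P(pass G(e) | H0 true) ≈ {hits / N:.4f}")   # close to 1/16 = 0.0625
# So severity under SC*-4 is high (about 1 - 1/16), just as H(e) is
# severely tested under SC*-1; under SC*-5 or SC*-6 the constructed G(x)
# always fits, and the severity is 0, just as under SC*-2 or SC*-3.
```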

This result suggests that neither the use-novelty nor the severity criterion can account for our intuition that the confidence interval estimation is a good method and gellerization is not. What Mayo needs to do is find a good reason for the differential applications of the severity criterion. I do not think she provides this in her book, however, and in this respect her treatment of the use-novelty issue is deeply unsatisfactory.

**4. Systematic Neglect of Relevant Alternatives.** If my assessment
is
valid and I can rely on the intuition that CI estimation is a good
testing
procedure and gellerization is a bad one, then it seems to me that neither severity nor use-novelty provides a satisfactory account of a good test as it stands. Then, how can we account for the difference between the confidence
interval and gellerization? Let me conclude this paper with one attempt
at such
an account. In the case of the confidence interval, the class of all
hypotheses
that can be use-constructed (the class of all hypotheses of the form "The
population proportion of approval of the President is in the interval
between x
percent and y percent" with concrete numbers in x and y) is exhaustive in
the
sense that at least one of the hypotheses is true. Thus, simply
eliminating
implausible alternatives can lead us to a reliable hypothesis. The same
thing
does not apply to gellerization. As Mayo recognizes, the test procedure
rejects
H0 come what may (Mayo 1996, 202), and this makes the class of possible
use-constructed hypotheses non-exhaustive. I would like to call this
feature of
gellerization a *systematic neglect of relevant alternatives*. I
think
this may be the key to refining the notions of severity and use-novelty:
namely, a theory construction and testing procedure is not justified if
the
procedure involves a systematic neglect of relevant alternatives. Though
I do
not have enough space to develop this idea here, certain initial thoughts
may
be helpful to understand the idea.

Let me illustrate the idea with some historical cases. The acceptance of the general theory of relativity on the basis of the anomalous advance of the perihelion of Mercury seems justified even if the phenomenon was used in the construction of the theory (see Brush 1989 and Earman 1992). Why? Because the general theory of relativity was the only available theory which could account for the phenomenon. No neglect of relevant alternatives was involved. On the other hand, sticking to geocentrism in the face of heliocentrism seems unjustified if it is done by systematic ad hoc changes to save geocentrism; such ad hoc changes seem to involve a systematic neglect of a relevant alternative, namely heliocentrism. I think that this is the true reason why use-construction is considered a bad scientific method in ordinary cases.

A few more words are in order about the qualifications 'systematic' and 'relevant'. The problem with gellerization is that we know we can reject H0 come what may, even before we obtain the data. Similarly, if we can tell, even before we look at the data, how to update geocentric theory so that we can reject heliocentric theory come what may, something is wrong. The problem is the systematic procedure for accommodating any new data, which makes empirical data useless. I agree with Lakatos (1970) that sometimes ad hoc changes to save the hard core of a research programme are desirable, but I disagree with him on where to draw the line between admissible and inadmissible ad hoc changes.

The reason I talk about 'relevant' alternatives is that there are alternatives which are rightly neglected because of their irrelevance. For example, alternatives of the form G(x) in the gellerization case are usually neglected in ordinary coin-tossing trials. Such alternatives would require a very strange mechanism in the coin tossing, and we usually have no reason to suspect the existence of such a mechanism. If we did consider hypotheses of the form G(x) as relevant alternatives, we would conduct a different kind of test (rather than mere coin tossing), one which could differentiate H0 from alternatives of the form G(x).

Can we formalize the idea of systematic neglect of relevant alternatives into a criterion, like Mayo's severity criterion? I am not sure if such a formalization is possible. The idea may be represented by an unfairly low probability of accepting a member of a certain class of relevant alternatives given any possible outcome. But 'unfairly low probability' sounds too subjective to be incorporated into a formal rule. Still, it seems to me that any attempt to refine the ideas of use-novelty and severity should take into account this kind of consideration.

REFERENCES

Brush, Stephen G. (1989), "Prediction and Theory Evaluation: The Case of
Light
Bending", __Science__ 246: 1124-1129.

Earman, John (1992), __Bayes or Bust?: A Critical Examination of
Bayesian
Confirmation Theory__. Cambridge: The MIT Press.

Giere, Ronald N. (1983), "Testing theoretical hypotheses", in John
Earman,
(ed.), __Testing Scientific Theories__, Minnesota Studies in the
Philosophy
of Science vol. X. Minneapolis: University of Minnesota Press,
269-298.

Howson, Colin and Peter Urbach (1993), __Scientific Reasoning: The
Bayesian
Approach__, second edition. La Salle: Open Court.

Lakatos, Imre (1970) "Falsification and the methodology of scientific
research
programmes", in Imre Lakatos and Alan Musgrave, (eds.) __Criticism and
the
Growth of Knowledge__. Cambridge: Cambridge University Press,
91-196.

Mayo, Deborah G. (1996), __Error and the Growth of Experimental
Knowledge__.
Chicago: The University of Chicago Press.

Musgrave, Alan (1978), "Evidential Support, Falsification, Heuristics,
and
Anarchism", in Gerard Radnitzky and Gunner Andersson, (eds.), __Progress
and
Rationality in Science__, Boston Studies in the Philosophy of Science
vol.
LVIII. Dordrecht: Reidel, 181-201.

Popper, Karl (1962), __Conjectures and Refutations: The Growth of
Scientific
Knowledge__. New York: Basic Books.

Worrall, John (1978a), "The Ways in Which the Methodology of Scientific
Research Programmes Improves on Popper's Methodology", in Gerard
Radnitzky and
Gunner Andersson, (eds.), __Progress and Rationality in Science__,
Boston
Studies in the Philosophy of Science vol. LVIII. Dordrecht: Reidel,
45-70.

Worrall, John (1978b), "Research Programmes, Empirical Support, and the
Duhem
Problem: Replies to Criticism", in Gerard Radnitzky and Gunner Andersson,
(eds.), __Progress and Rationality in Science__, Boston Studies in the
Philosophy of Science vol. LVIII. Dordrecht: Reidel, 321-338.

Worrall, John (1989), "Fresnel, Poisson and the White Spot: The Role of
Successful Predictions in the Acceptance of Scientific Theories", in
David
Gooding, Trevor Pinch and Simon Schaffer (eds.), __The Uses of
Experiment:
Studies in the Natural Sciences__. Cambridge: Cambridge University
Press,
135-157.

Zahar, Elie G. (1973), "Why Did Einstein's Programme Supersede Lorentz's?", __The British Journal for the Philosophy of Science__ 24: 93-123 and 223-262.

FOOTNOTES

^{+} I am very thankful to Professor Mayo for her detailed comments on earlier versions of this paper, and for her patience in replying to my questions. I would also like to thank the faculty members and graduate students at the University of Maryland, especially Rob Skipper and Nancy Hall, for their comments.

^{1} Actually, 'use-novelty' is not the term these authors use to describe their position. For example, Worrall seems to prefer 'heuristic account' or 'Zahar-Worrall criterion' as the name of the position (Worrall 1989, 148-155). I use the terms 'use-novelty' and 'use-construction' simply because these are the terms Mayo uses in her analyses.

I do not have enough space to discuss other proposed notions of novelty
and to
compare the relative merits of these notions. See, for example, the
exchange
between Musgrave (1978) and Worrall (1978b) for more about this.

^{2} For example, Brush (1989) revealed a historical case (the acceptance of the general theory of relativity) in which scientists preferred non-novel evidence (the advance of the perihelion of Mercury, which, as Earman 1992 shows, was used in the construction of the theory) to novel evidence (star-light bending by the sun). Another attack comes from Bayesians who argue that use-novelty is not a necessary condition for good evidence from the Bayesian point of view (Howson and Urbach 1993, 408-409).

^{3} Actually Giere does claim that the evidence used for the
construction does not constitute a good test because the test passes the
hypothesis "no matter whether the corresponding hypothesis is true or
false"
(Giere 1983, 286). This suggests that Mayo misunderstands Giere's
argument, but
maybe it is Giere who misstates his own position. Here, I assume that
Mayo
correctly characterizes Giere for the sake of the argument.
^{4} To quote exactly, Mayo's own formulation of B is "..., even if it is false" (270). I replaced this with "given that" because (1) "even if" is a very ambiguous expression, and (2) Mayo herself uses the "given that" version in a general (not restricted to use-construction) formulation of probability B (269).

^{5} Here, actually I am exploiting an equivocation in the
severity
criterion to interpret Mayo's argument most charitably. The equivocation
will
show up as a serious problem later in this paper.

^{6} Actually this is a bit inaccurate. Strictly speaking, a 95 percent confidence interval means that, whatever the true population value is, 95% of the estimates produced with this procedure include the true value. Once the estimate is produced, either the true value is inside the interval or it is not; that is, the probability is either 1 or 0. See Mayo 1996, 273 n.14. By the way, this 95% has nothing to do with Mayo's severity. In terms of the categories in Table 1, this 95% corresponds to C1 / [C1+C2], which has no direct relationship with C2 / [C2+C4].

^{7} Of course, even when we understand H(e) in this way, H(e) is
in
some sense use-constructed. In the draft version of this paper, I used a
distinction between 'accidentally use-constructed' and 'genuinely
use-constructed' hypotheses, but the matter is more with the way we look
at the
hypotheses than with the hypotheses themselves (Mayo pointed this out to
me in
personal communication).

^{8} Obviously there are many other alternative hypotheses of the form "the probability of heads is p (p != .5)," or even other possible maximally likely hypotheses like "the probability of heads is 1 on trials one and two, 0 on three and four." How can we calculate probability B when "G(e) is false" means that one of these alternatives is true? This is particularly problematic for Mayo because she rejects assigning prior probabilities to hypotheses (Mayo 1996, 75). Mayo's answer is that her severity criterion "requires that the severity be high against *each such alternative*" (195; emphasis in original). This sounds like too strict a requirement, but she does not mean all logically possible alternatives. As her argument in chapter 5 of the book shows (and as she pointed out to me in private communication), it is enough for her if severity is high against all relevant alternatives specified by the research design. Therefore, since the question here is the comparison of H0 and G(e), I think I am warranted in concentrating on H0 where the severity criterion (formally) requires us to consider all the other alternatives.

By the way, if Mayo does not add this qualification about relevant alternatives, then it can be shown that almost always there is at least one maximally likely alternative which is incompatible with the tested hypothesis and still assures the hypothesis a high probability of passing (though I do not have enough space to prove this here). This makes the severity criterion almost vacuous.