Problems with Hypothesis Testing


The importance of falsification

Scientists - like everyone else - use inductive reasoning all the time. We all assume the sun will rise tomorrow morning, because it has risen every morning so far. If we examine a large number of mammals and find that all of them are viviparous (ie. they don't lay eggs), we might conclude: Mammals don't lay eggs.

However, inductive reasoning is not logically watertight. We can't prove that mammals don't lay eggs by looking at lots of mammals and seeing that none of them lays eggs. On the other hand, if we find a single mammal that does lay eggs (maybe a duck-billed platypus), we have disproved the hypothesis that mammals don't lay eggs.

Thus arose the idea that "proper science" proceeds by falsification: setting up hypotheses and then disproving (falsifying) them. This view was argued strongly by Karl Popper (1959), but is not accepted by all scientists (or philosophers of science).

Falsification in biology

If you are interested in the effect of salt on the boiling point of water, you take two identical beakers of water, add salt to one, heat both until they boil and check the temperatures. If you are interested in the effect of fertilizer on crop yields, you take two identical plots of land... and here the problems start! No two plots of land are identical, nor are two living organisms identical. Even if fertilizer has no effect, the plot with fertilizer may have a higher yield than the one without. So biologists can't do proper science!

Fisher looked at the probability of getting the observed result (eg. increase in yield) if the null hypothesis (fertilizer doesn't increase yields) is true. If the probability is sufficiently small, the result is deemed significant, and is seen as evidence that the null hypothesis is wrong. Neyman and Pearson took null hypothesis significance testing (NHST) a step further by talking of "accepting" or "rejecting" the null hypothesis. It seemed that NHST enabled biologists too to do proper science, and respectable journals insisted on it.
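Fisher's reasoning can be illustrated with a small exact permutation (randomisation) test. This is only one of several ways of getting such a probability, and the yields, the number of plots and the names (fertilized, control, permutation_p_value) are invented for illustration, not taken from any study. If fertilizer has no effect, the labels "fertilized" and "control" are arbitrary, so we count how many of the possible relabellings give the fertilized plots a total yield at least as high as the one observed.

```python
from itertools import combinations

# Hypothetical yields (kg per plot); invented numbers for illustration only.
fertilized = [52, 58, 61, 55]
control = [49, 50, 56, 51]


def permutation_p_value(treated, untreated):
    """One-sided exact permutation P-value for 'treated yields are higher'.

    Because the group sizes are fixed, comparing group totals is
    equivalent to comparing group means.
    """
    pooled = treated + untreated
    observed_total = sum(treated)
    n_treated = len(treated)
    as_extreme = 0
    n_relabellings = 0
    # Every way of choosing which plots carry the "fertilized" label.
    for indices in combinations(range(len(pooled)), n_treated):
        n_relabellings += 1
        if sum(pooled[i] for i in indices) >= observed_total:
            as_extreme += 1
    return as_extreme / n_relabellings


p = permutation_p_value(fertilized, control)
print(f"P-value = {p:.3f}")
```

With these invented numbers, 4 of the 70 possible relabellings are at least as extreme as the observed one, giving a P-value of 0.057, just short of the conventional 0.05 cut-off.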

But NHST has its critics, and criticism among wildlife biologists came to a head at the 1998 annual conference of the Wildlife Society (Johnson, 1999).

What's wrong with NHST?

The more important criticisms of NHST are:

1. Logical operations are not valid for probabilistic propositions (Stephens et al, 2007; Cohen, 1994). Deduction depends on syllogisms, with a conclusion based on two unarguable premises, eg:

  • If I don't have a lottery ticket, I cannot win the lottery.
  • I won the lottery.
  • Therefore, I have a lottery ticket.

Of course, if one of the premises is wrong, the conclusion cannot be relied on:

  • If I do have a lottery ticket, I cannot win the lottery. (Wrong!)
  • I won the lottery.
  • Therefore, I don't have a lottery ticket.

Now see what happens if we slip in probability:

  • If I have a lottery ticket, I am unlikely to win the lottery. (True!)
  • I won the lottery.
  • Therefore, I am unlikely to have a lottery ticket.

Obvious nonsense! But is it any different from this:

  • If the null hypothesis is true, I'm unlikely to observe data Y.
  • I observed data Y.
  • Therefore, the null hypothesis is unlikely to be true.

2. The P-value is not the probability that the hypothesis is true given the data (Cohen, 1994). For a frequentist statistician, such a probability is meaningless: the hypothesis is either true or false. To talk about it at all, we have to use probability in its subjective or Bayesian sense, ie. the everyday sense. And any Bayesian will tell you that Prob(H | data) is not the same as Prob(data | H).

Of course, what we want to know is whether the hypothesis is true or false. As Cohen (1994:997) puts it, "[NHST] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" Hence the progression: a small P-value is read as significance, significance as evidence that the hypothesis is false ... and then as a low probability that the hypothesis is true.
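The gap between Prob(data | H) and Prob(H | data) can be seen with Bayes' theorem for two competing hypotheses. The sketch below is a minimal illustration: the function name posterior_null, the 50:50 prior and the probabilities of the data under each hypothesis are all hypothetical values chosen for the example. Note also that the inputs are probabilities of the observed data, not P-values, which also count "more extreme" data.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """Prob(null | data) from Bayes' theorem with one alternative hypothesis."""
    prior_alt = 1.0 - prior_null
    joint_null = prior_null * p_data_given_null
    joint_alt = prior_alt * p_data_given_alt
    return joint_null / (joint_null + joint_alt)


# Hypothetical inputs: the data have probability 0.05 under the null,
# and we try two different probabilities under the alternative.
for p_alt in (0.5, 0.06):
    post = posterior_null(0.5, 0.05, p_alt)
    print(f"Prob(data | alt) = {p_alt}: Prob(null | data) = {post:.3f}")
```

With these assumed values, Prob(data | null) is 0.05 in both cases, but Prob(null | data) is about 0.09 in the first and about 0.45 in the second. What the data say about the null depends on how well the alternative explains them, which a P-value does not take into account.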

3. Significance depends on the probability of observing "more extreme data", and that depends in turn on the design of the study (Berger and Berry, 1988; Johnson, 1999, whose example we've adapted below).

For example, we may be concerned about the sex ratio of turtle hatchlings in view of global warming. We have data for 13 hatchlings, 10 females, 3 males. The null hypothesis is that the sex ratio is balanced. If the null hypothesis is true, the probability of exactly 3 males out of 13 hatchlings is 0.035. But to this we must add the probability of more extreme results which we might have observed but did not. This depends on the "stopping rule":

  • Stop at 13 total: more extreme = 2 males/13 hatchlings, 1 male/13, 0 males/13, p = 0.046, significant.
  • Stop at 10 females: more extreme = 2 males/12, 1 male/11, 0 males/10, p = 0.057, not significant.
  • Stop at 3 males: more extreme = 3 males/14, 3 males/15, 3 males/16, 3 males/17, etc etc, p = 0.092, not significant.
  • Stop when the difference is 7: more extreme = 0 males/7, 1 male/9, 2 males/11, p = 0.087, not significant.
  • ... and you can probably think of more!

It seems unsatisfactory that our conclusion depends not only on the facts of the case, but also on the intentions of the researcher.
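The figures for the simplest stopping rule, "stop at 13 total", can be checked with the binomial distribution. The sketch below covers only that fixed-sample-size case and is one-tailed; the other stopping rules each need their own sampling distribution and are not reproduced here. The function names binom_pmf and lower_tail are our own.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k males among n hatchlings when Prob(male) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_tail(k, n, p=0.5):
    """Probability of k or fewer males among n hatchlings (one-tailed)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


print(f"Prob(exactly 3 males of 13) = {binom_pmf(3, 13):.3f}")
print(f"Prob(3 or fewer males of 13) = {lower_tail(3, 13):.3f}")
```

This gives 0.035 for exactly 3 males and 0.046 once the more extreme results (2, 1 or 0 males out of 13) are added.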

4. The results can be inadequate or misleading for management decision-making (Wade, 2001; McGarvey, 2007). NHST and the conventional cut-off of p = 0.05 reflect scientific caution; we need good evidence against the null hypothesis before publishing our results. Managers, however, work on the precautionary principle; they should act if there is evidence that a problem is emerging.

Wade gives the example of data for beluga whale sightings in Cook Inlet, Alaska. After six years of monitoring, the estimated trend was a decline of 9% per annum. However, NHST gave a P-value of 0.06, so the null hypothesis of "no change" was not rejected. (The corresponding 95% confidence interval was from -18% to +0.5%, so it (just) included 0.) In contrast, a Bayesian posterior based on the same data (with a uniform prior) indicated an 86% probability that the decline was greater than 5% per annum.

5. Most null hypotheses are false anyway (Anderson et al, 2000; Johnson, 1999). Null hypotheses which specify zero effect or a point effect are never true in biology, if measured sufficiently precisely; these have been termed trivial null hypotheses or nil hypotheses. With a sufficiently large sample size, they will always be rejected. If they are not rejected, that merely indicates a lack of power in the study.
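The effect of sample size can be sketched with a simple two-sided z-test for a proportion. Everything in the example is hypothetical: we suppose the true proportion is 0.51 rather than the nil value of 0.5, and, to keep the arithmetic deterministic, that each sample reproduces that proportion exactly. The function name two_sided_p and the normal approximation are our choices for illustration.

```python
import math


def two_sided_p(observed_prop, null_prop, n):
    """Two-sided P-value from a normal-approximation z-test for a proportion."""
    se = math.sqrt(null_prop * (1 - null_prop) / n)  # standard error under the null
    z = (observed_prop - null_prop) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# The same tiny (hypothetical) departure from 0.5 at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n}: P-value = {two_sided_p(0.51, 0.5, n):.3g}")
```

The departure from the null is the same in every row; only the sample size changes. At n = 100 the result is nowhere near significant, at n = 10,000 it falls just below 0.05, and at n = 1,000,000 the P-value is vanishingly small.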

6. The focus on a null and a single alternative hypothesis limits scientific advance (Stephens et al, 2007; Chamberlain, 1890). Ecological systems are typically complex, with many possible explanations. Focusing on falsifying one explanation limits progress and gives a false sense of certainty in the explanations retained.

7. It is commonly assumed that a null hypothesis which is not rejected is therefore true. This mistake is increasingly prevalent in the wildlife literature (Fidler et al, 2006). It could be argued that a method should not be faulted if it is misused, but Johnson (1999) considers the tendency to "encourage misuse" an intrinsic fault of NHST.

For more, see Bill Thompson's "402 citations questioning the indiscriminate use of null hypothesis significance tests in observational studies".

What next?

Well, after that diatribe against NHST, we'd better move on to Alternatives to NHST.

 



Page updated 27 March 2010 by Mike Meredith