Fisher's approach to statistics

Ronald A. Fisher

Sir Ronald Fisher (1890-1962) was employed as statistician at the Rothamsted Agricultural Experiment Station from 1919 to 1933. Rothamsted had been established in 1843 and had a large archive of data needing proper analysis.

While at Rothamsted, Fisher developed the statistical techniques which dominated biostatistics for most of the 20th century. He introduced the concept of likelihood in 1921 and the distinction between statistics and parameters in 1922. His textbook, Statistical Methods for Research Workers, was published in 1925 and remained in print until 1958.

After leaving Rothamsted, he worked on genetics, first at London University then at Cambridge.

More on Fisher at St Andrews University
More on Rothamsted at Rothamsted Research


Effects of fertilizer on oats

To illustrate Fisher's approach, we'll use data from an experiment at Rothamsted on the effect of fertilizer on the yield of oats. The data were published by Yates (1935) shortly after Fisher left Rothamsted, and have become a classic data set for teachers.

The calculations are shown in a spreadsheet which you can download here.

The experiment

The area of the experiment was divided into plots of 50 m² each. Of 12 plots planted with the variety called "Marvellous", 6 were randomly allocated to receive 75kg/ha nitrogen fertilizer (ammonium sulphate) and 6 plots received no fertilizer. Each plot was harvested separately. The yields (expressed as tonnes/ha) are shown below:

| No fertilizer (control) | 75kg/ha ammonium sulphate |
|---|---|
| 2.33 | 3.66 |
| 2.59 | 4.33 |
| 3.29 | 4.48 |
| 3.55 | 4.59 |
| 3.59 | 5.33 |
| 3.88 | 5.77 |
| mean = 3.20 | mean = 4.69 |
| sd = 0.612 | sd = 0.751 |

The difference in the sample means is 1.49 tonnes/ha.
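The summary figures can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and not part of the original spreadsheet. The fertilized group mean should come out at about 4.69 tonnes/ha.

```python
from statistics import mean, stdev

# Yields in tonnes/ha, copied from the table above.
control = [2.33, 2.59, 3.29, 3.55, 3.59, 3.88]
fertilized = [3.66, 4.33, 4.48, 4.59, 5.33, 5.77]

mean_control = mean(control)
mean_fertilized = mean(fertilized)
difference = mean_fertilized - mean_control

# stdev() is the sample standard deviation (divides by n - 1).
print(f"control:    mean = {mean_control:.2f}, sd = {stdev(control):.3f}")
print(f"fertilized: mean = {mean_fertilized:.2f}, sd = {stdev(fertilized):.3f}")
print(f"difference in means = {difference:.2f} tonnes/ha")
```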

Precision of the difference in means

The figure of 1.49 tonnes/ha is an estimate based on samples, so is not the true value. We need to give some indication of how precise we think this estimate is.

Just as with the "squirrels" example (see here), we can calculate a standard error; if you're not familiar with that, please look at the "squirrels" example again.

Calculating the standard error of the difference in means of two samples is similar to calculating the standard error for the mean of a sample (this is done in the spreadsheet):

  1. Calculate the deviation for each observation; this is the difference between the observed value and the mean for that group of values, 3.20 or 4.69 as appropriate.
  2. Square each of the deviations and add them up, then divide by the number of degrees of freedom. [Remember that degrees of freedom = number of observations (12 in this case) - number of parameters estimated from the data (2 in this case, the means 3.20 and 4.69).] This is the pooled variance of all the observations.
  3. Take the square root of the pooled variance to get the pooled standard deviation.
  4. The last step is less intuitive! For a single sample we divide the s.d. by $\sqrt{n}$. But now we need to use:

$$SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where $s_p$ is the pooled standard deviation and $n_1$ and $n_2$ the sample sizes.

The standard error of the difference in means is 0.40, so we could report:

"The difference in yields due to fertilizer is 1.49 ± 0.40 tonnes/ha."
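The four steps for the standard error of the difference can be sketched in Python as follows. The helper name `sum_sq_dev` and the variable names are illustrative assumptions, not taken from the spreadsheet.

```python
from math import sqrt

# Yields in tonnes/ha, copied from the table above.
control = [2.33, 2.59, 3.29, 3.55, 3.59, 3.88]
fertilized = [3.66, 4.33, 4.48, 4.59, 5.33, 5.77]

def sum_sq_dev(values):
    """Steps 1 and 2 (first half): squared deviations from the group's own mean, summed."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

n1, n2 = len(control), len(fertilized)
df = n1 + n2 - 2  # 12 observations minus 2 estimated means

# Step 2 (second half): pooled variance.
pooled_var = (sum_sq_dev(control) + sum_sq_dev(fertilized)) / df
# Step 3: pooled standard deviation.
pooled_sd = sqrt(pooled_var)
# Step 4: standard error of the difference in means.
se_diff = pooled_sd * sqrt(1 / n1 + 1 / n2)

print(f"df = {df}, pooled sd = {pooled_sd:.3f}, SE of difference = {se_diff:.3f}")
```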

Is there evidence of a real difference?

The difference in the means is 1.49 tonnes/ha. But the values within each group vary widely, as can be seen in the graph below: the difference between the highest and lowest in each group is greater than 1.49. There are big differences in fertility among plots, even without fertilizer.

Is it possible that fertilizer did not cause an increase in yield, and we have just by chance chosen the most fertile plots to apply fertilizer?

We'll use null hypothesis significance testing (NHST) to investigate this.

Permutation test

Suppose fertilizer makes no difference to the yield (this is called the null hypothesis). Then it wouldn't matter which six plots we selected for the fertilizer: we'd get exactly the same 12 numbers shown in the table above, except that they would be in different groups.

Now there are 924 ways of dividing 12 plots into two groups of six; these are called permutations. We could try all 924 permutations and see how many times we get a difference of 1.49 tonnes/ha or more - this is easy to do in R. If you look at the diagram above, you'll probably see that only one permutation would give a bigger difference: switch the smallest "with N" plot (3.66) for the biggest "no N" plot (3.88) and the difference in means is 1.56 tonnes/ha.

So out of 924 possible ways of arranging the plots, only 2 have a difference of 1.49 or more. Provided the selection of plots really was random and independent (eg. pulling numbered balls out of a bag), the probability of choosing one of these two just by chance is 2/924 = 0.002. This is the P-value. A result is called statistically significant if it is unlikely to have occurred by chance, and our results are statistically significant.
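The text suggests doing this in R; the same enumeration can be sketched in Python with the standard library. The yields are stored in hundredths of a tonne/ha (an implementation choice, not part of the original analysis) so that the "or more" comparison is exact. Because both groups have six plots, comparing differences in sums is equivalent to comparing differences in means.

```python
from itertools import combinations

# Yields in hundredths of a tonne/ha, so all sums are exact integers.
yields = [233, 259, 329, 355, 359, 388,   # plots that got no fertilizer
          366, 433, 448, 459, 533, 577]   # plots that got fertilizer
total = sum(yields)

def diff_in_sums(treated_idx):
    """Treated sum minus control sum for one way of choosing 6 treated plots."""
    treated = sum(yields[i] for i in treated_idx)
    return treated - (total - treated)

observed = diff_in_sums(range(6, 12))  # the allocation actually used

arrangements = list(combinations(range(12), 6))
n_arrangements = len(arrangements)
n_extreme = sum(1 for idx in arrangements if diff_in_sums(idx) >= observed)
p_one_sided = n_extreme / n_arrangements

print(f"{n_extreme} of {n_arrangements} arrangements, P = {p_one_sided:.3f}")
```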

 We have two possible explanations for this tiny P-value:

  • either a rare event has happened,
  • or the assumption that fertilizer makes no difference is wrong.

If we consider the improbable event to be implausible, we can interpret small P-values as evidence that the hypothesis is wrong. The diagram below (Ramsey & Schafer 2002:47) shows the usual interpretation of P-values:

So our experiment provides convincing evidence that fertilizer has an effect on the yield of oats.

Two-sided tests

So far we have only considered permutations which give differences of 1.49 tonnes/ha or more; this seems reasonable, as we're mainly interested in increases in yields. But large decreases in yields would also provide evidence that the effect of fertilizer is not zero. (With too much nitrogen, oats produce lots of green leaves but hardly any grain.) There are 2 permutations with differences of -1.49 and -1.56; if we include these, we have a P-value of 4/924 = 0.004. This is a two-sided test; the test above is a one-sided test, as we only considered possible increases in yields.
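A two-sided version of the count can be sketched in the same way, by comparing the absolute difference with the observed one. As before, this is an illustrative Python sketch with yields stored in hundredths of a tonne/ha to keep the comparison exact.

```python
from itertools import combinations

# Yields in hundredths of a tonne/ha; the last six plots got fertilizer.
yields = [233, 259, 329, 355, 359, 388,
          366, 433, 448, 459, 533, 577]
total = sum(yields)

def diff_in_sums(treated_idx):
    """Treated sum minus control sum for one choice of 6 treated plots."""
    treated = sum(yields[i] for i in treated_idx)
    return treated - (total - treated)

observed = diff_in_sums(range(6, 12))

# Count arrangements at least as extreme as observed in either direction.
n_two_sided = sum(
    1 for idx in combinations(range(12), 6)
    if abs(diff_in_sums(idx)) >= abs(observed)
)
p_two_sided = n_two_sided / 924

print(f"{n_two_sided} of 924 arrangements, two-sided P = {p_two_sided:.3f}")
```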

Student's t-test

With bigger samples, the number of permutations quickly becomes enormous, and permutation tests were rarely practical until electronic computers became available.

Fisher used "Student's" t-test to compare sample means (Fisher 1946:122). This uses the t-statistic:

t = difference in means / standard error of the difference
t = 1.49 / 0.396 = 3.76

Now we can restate our question: If the true value is t = 0, what is the probability of observing t ≥ 3.76?

We can calculate this if we make one more assumption: the value of t comes from a t-distribution with 10 degrees of freedom; that is, if we did the experiment many, many times and plotted the values, we would get a t-distribution with a mean (according to our null hypothesis) of 0. That would look like the distribution below:

If this is the case, then the probability of observing t ≥ 3.76 is the area under the curve to the right of the red line in the graph. It works out at 0.002, ie. 0.2% or 1 in 500 (you can do this with TDIST() in Excel or with pt in R). This is the P-value for a one-sided test. For a two-sided test, we add in the probability of observing t ≤ -3.76, which is also 0.002, to get a P-value of 0.004.
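For readers without Excel or R, the tail area can be approximated in plain Python by integrating the t density numerically. This is a sketch only: the function names `t_pdf` and `upper_tail` are made up for illustration, and Simpson's rule stands in for the library routines the text mentions.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t].

    The density is symmetric about 0, so the area to the right of t
    is 0.5 minus the area between 0 and t. `steps` must be even.
    """
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

t_stat = 1.49 / 0.396          # difference in means / standard error
p_one = upper_tail(t_stat, 10)  # one-sided P-value, 10 degrees of freedom
p_two = 2 * p_one               # two-sided P-value, by symmetry

print(f"t = {t_stat:.2f}, one-sided P = {p_one:.3f}, two-sided P = {p_two:.3f}")
```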

The P-value from the t-test has the same interpretation as for the permutation test, ie. we have convincing evidence that fertilizer affects the yield of oats.

Testing other hypotheses

So far we've only considered the hypothesis: fertilizer makes no difference, ie. the true difference in yields is 0. We can also try other hypotheses, such as: fertilizer increases yields by 1 tonne/ha. (Perhaps a farmer has decided that the extra work and cost of spreading fertilizer is only worthwhile if the increase is more than one tonne/ha.)

If fertilizer increases yields by only 1 tonne/ha, the additional increase we observed, 1.49 - 1.00 = 0.49, must be due to chance.

We can do this easily with the t-test. The standard error hasn't changed, so we can calculate a new value for t:

t = (1.49 - 1.00) / 0.396 = 1.23

The 2-sided P-value, the probability of getting t ≥ 1.23 or t ≤ -1.23 just by chance, is 0.25. That is not a rare event, so we have no reason to suspect that the hypothesis is wrong. Looked at the other way, a true value of +1 tonne/ha is consistent with the data we observed.

The confidence set

Clearly, we could consider lots of hypotheses of the form: fertilizer increases yields by x. We'd calculate t = (1.49 - x) / 0.396 for each and find the P-value.
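Trying a range of hypothetical increases x is easy to automate. The sketch below is in Python, with a numerical approximation to the t-distribution tail area; the function names and the particular values of x are illustrative assumptions.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def two_sided_p(x, diff=1.49, se=0.396, df=10):
    """Two-sided P-value for the hypothesis 'fertilizer increases yields by x'."""
    t = (diff - x) / se
    return 2 * upper_tail(abs(t), df)

# Hypothetical increases in yield (tonnes/ha) chosen for illustration.
for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]:
    print(f"x = {x:.1f}  P = {two_sided_p(x):.3f}")
```

Values of x close to the observed difference give large P-values; values far from it, in either direction, give small ones.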

But we can make life easier for ourselves, provided we agree on a P-value for a rare event. Fisher wrote "We shall not often be astray if we draw a conventional line at .05" (Fisher, 1946:80) and 0.05 or 5% has indeed become the convention.

Now we can look at the t-distribution with 10 degrees of freedom and see what values correspond to a two-sided P-value of 0.05. The result is 2.23 (you can calculate this in Excel with  =TINV(0.05,10) ); this is called the critical value for t. If we turn the t formula around we get:

x = 1.49 - 0.396 × t

Plugging in the value of t = 2.23, we get x = 0.61 tonnes/ha. Any value for x < 0.61 will give a P-value < 0.05, and we'll interpret this as evidence that it's wrong.

There is an upper limit too: if the true increase in yield was, say, 10 tonnes/ha, our observed value of 1.49 would again be a rare occurrence. The upper limit corresponds to t = -2.23, which gives x = 1.49 + 0.396 × 2.23 = 2.37. Any value for x > 2.37 will also give a P-value < 0.05, and we'll interpret this too as evidence that it's wrong.

Hypothetical values for the increase in yield between 0.61 and 2.37 tonnes/ha are consistent with the data, in the sense that the data provide no evidence that they are wrong. Because of the 5% cut-off, this is called the 95% confidence set. (95% of the area under the t-curve with 10 degrees of freedom lies between ‑2.23 and +2.23.)
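The limits of the confidence set can be reproduced in Python without Excel. In this sketch the critical value is found by bisection on a numerical approximation to the t-distribution tail area; that search is an illustrative substitute for TINV(), and the function names are assumptions.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_value(alpha, df):
    """Find t such that the two-sided P-value equals alpha, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 2 * upper_tail(mid, df) > alpha:
            lo = mid   # P still too big: move further into the tail
        else:
            hi = mid
    return (lo + hi) / 2

diff, se = 1.49, 0.396
t_crit = critical_value(0.05, 10)
lower = diff - se * t_crit
upper = diff + se * t_crit

print(f"critical t = {t_crit:.2f}")
print(f"95% confidence set: {lower:.2f} to {upper:.2f} tonnes/ha")
```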

Confidence set or confidence interval?

If you are familiar with confidence intervals, you will recognize that the method used above for a confidence set (ie. 1.49 ± 0.396 × 2.23) is the same as that for a confidence interval. If you want to say, "The 95% confidence interval is 0.61 to 2.37 tonnes/ha", go ahead, you're right. The difference is in the meaning:

  • 95% confidence set: The values inside the confidence set result in P-values > 0.05 when we do our statistical test, so we have no evidence that they are wrong on the basis of our data. The true value may lie outside the confidence set and we have by chance got some weird data; we can only report what we infer from the data.
     
  • 95% confidence interval: If we repeat the experiment many, many times and calculate a 95% confidence interval each time, then 95% of the time the true value will be inside the confidence interval. We've done the experiment once and calculated a 95% confidence interval; we have no idea whether the true value is inside this confidence interval or not.

What about the alternative hypothesis?

Around 1930, Jerzy Neyman and Egon Pearson took a somewhat different approach. They presented the problem as a decision, a choice between two hypotheses. For example, if δ  (the Greek letter delta) is the difference in yields due to fertilizer, they might have:

  • H0 : δ = 0
  • HA : δ = 5 tonnes/ha

They were interested, for example, in the power of the experiment, that is, how often H0 would be rejected if HA was true, for many repetitions of the experiment. To do this, they need an actual value for the alternative hypothesis (HA) ie. it must be a point hypothesis.
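Power in this sense can be estimated by simulating many repetitions of the experiment. The Python sketch below rests on assumptions that are not in the original text: yields are normally distributed, the standard deviation is 0.685 tonnes/ha (the pooled value from the oats data), the control mean is 3.2, there are 6 plots per group, and H0 is rejected by a two-sided t-test at the 5% level (critical value 2.228). The function name and defaults are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance

def simulated_power(delta, sd=0.685, n=6, t_crit=2.228, reps=2000, seed=1):
    """Fraction of simulated experiments in which H0 (delta = 0) is rejected.

    Assumes normal yields with a common sd and a control mean of 3.2;
    these are illustrative choices, not values fixed by the theory.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        control = [rng.gauss(3.2, sd) for _ in range(n)]
        treated = [rng.gauss(3.2 + delta, sd) for _ in range(n)]
        # With equal group sizes the pooled variance is the average of the two.
        pooled_var = (variance(control) + variance(treated)) / 2
        se = sqrt(pooled_var * 2 / n)
        t = (mean(treated) - mean(control)) / se
        if abs(t) >= t_crit:
            rejections += 1
    return rejections / reps

for delta in [0.0, 1.0, 5.0]:
    print(f"true difference = {delta:.1f}  estimated power = {simulated_power(delta):.2f}")
```

With a true difference of 0 the rejection rate should be close to the 5% significance level, and it should rise towards 1 as the assumed true difference grows.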

You'll often see the hypotheses set up like this:

  • H0 : δ = 0
  • HA : δ ≠ 0 or δ > 0

Here HA is not a point hypothesis, and is useless for Neyman and Pearson's methods.

In practice, it seems, standard Null Hypothesis Significance Testing (NHST) has become a mish-mash of Fisher's model fitting approach and the Neyman-Pearson decision method which none of them would approve of! (See Christensen 2005 for a discussion of Fisher vs Neyman-Pearson vs Bayes)

Summary

  • Fisher introduced modern designs for experiments with multiple plots (replication) and random assignment of plots to control and treatment (randomization). This allows probability theory to be used to analyse the results.
     
  • Fisher is concerned with a single hypothesis, calculating the probability of observing the actual or more extreme data if the hypothesis is right (the P-value).
     
  • A small P-value (ie. statistical significance) means either a rare event has occurred or the hypothesis is wrong. Thus small P-values are interpreted as evidence that the hypothesis is wrong. The P-value does not tell us the probability that the null hypothesis is true; this is a common misunderstanding.
     
  • Fisher's 95% confidence set is a range of hypothetical values for which a statistical test (eg. t-test) results in a P-value > 0.05, so we have no evidence that they are wrong. A 95% confidence interval has the same values but a different interpretation.

See Ramsey & Schafer (2002) for a good, 21st century introduction to the sort of statistics Fisher used.

Warning!

Although NHST dominated biostatistics for most of the 20th century, it has been increasingly criticized, both on theoretical grounds and because it is used inappropriately. New methods have been developed in the last 20 years which are preferable to NHST in most circumstances. Please check the pages on:



Text by Mike Meredith, updated 6 April 2010