Which bag? Likelihood and AIC

Objectives

Likelihoods can be used to compare the evidence for competing hypotheses. This avoids many of the criticisms levelled at hypothesis testing.

Here we try a game to see how likelihood ratios can be interpreted as "strength of evidence" and we will extend this to include AIC (Akaike's Information Criterion).

This assumes that you are familiar with the idea of 'likelihood'; if not, you should look at Frog calls 2 - likelihood and MLE. You will also need spreadsheet software, such as OpenOffice Calc or Microsoft Excel.

The final section assumes you know about AIC: see Frog calls 3 - modelling.

This is adapted from the "canonical experiment" described by Richard Royall (1997:11-12).


The experiment

We have a collection of identical small bags, each containing Go stones.

Some of the bags contain only white stones; in others, 9 of the stones are white and 1 is black. You take a bag at random and don't look inside.

The aim is to decide whether the bag you took contains all white stones or mixed stones by pulling out one stone at random. Shake the bag, put your hand in, and pull out one stone.

  • If the first stone is black, you can be certain that you have a 'mixed' bag, as it is impossible to pull a black stone out of an 'all white' bag.
  • But if the first stone is white, you are less certain...

Do you think drawing 1 white stone is good evidence for an 'all white' bag? Or is the 'mixed' hypothesis still plausible?

Replace the stone you took out, shake the bag, and draw again. Replace the second stone and draw a third time.

  • If any of the three stones is black, you can be certain you have a mixed bag.
  • If all three are white, you now have stronger evidence that it is an all-white bag.

How strong do you think the evidence is with three white stones drawn? How many would you have to draw before you decided that you had moderate evidence? How many for strong evidence? How many for convincing evidence?

Likelihood ratios

Our two hypotheses can be turned into mathematical models with a few extra assumptions, notably that all the stones are identical, with equal probability of being drawn, and draws are independent. The probability of drawing white or black stones in a single draw is then:

  • all-white model: p(white) = 1, p(black) = 0
  • mixed model: p(white) = 0.9, p(black) = 0.1

Set up a spreadsheet as shown below (assuming for the moment you drew 3 white stones):

Draw:       1      2      3
Colour:     white  white  white
all-white   1      1      1
mixed       0.9    0.81   0.73

Rows 3 and 4 correspond to our two models. Cells B3 and B4 show the probability of drawing 1 white stone corresponding to each model. Cells C3 and C4 are for 2 successive white stones, and so on.

Hint: In cell C4, I typed  = B4 * 0.9  and then copied it across to the right, so that cell D4 became  = C4 * 0.9 .
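If you would rather check the spreadsheet values in code, the two model rows can be sketched in Python. This is only an illustration under the same assumptions as above (independent draws with replacement); the names P_WHITE and likelihood_of_white_run are our own, not part of the original exercise.

```python
# Probability of drawing a white stone in a single draw, per model (from the text).
P_WHITE = {"all-white": 1.0, "mixed": 0.9}


def likelihood_of_white_run(p_white, n_draws):
    """Probability of n_draws white stones in a row, drawing with replacement."""
    return p_white ** n_draws


# One line per spreadsheet column: draws 1, 2 and 3.
for n in (1, 2, 3):
    row = {model: round(likelihood_of_white_run(p, n), 2)
           for model, p in P_WHITE.items()}
    print(n, row)
```

Multiplying by 0.9 for each extra white stone, as in the spreadsheet hint, is the same as raising 0.9 to the number of draws.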

Likelihood:
When comparing models, we use the term likelihood. After drawing 1 white stone the likelihood of the all-white model is 1 and the likelihood of the mixed model is 0.9.

Likelihood ratio:
After drawing 1 white stone the ratio of the likelihoods = 1/0.9 = 1.111. We can use the likelihood ratio as a measure of the strength of evidence in favour of the all-white model compared with the mixed model.

In row 5 of the spreadsheet, add the likelihood ratio for each stage. In the example below, the fifth stone was black:

Draw:       1      2      3      4      5
Colour:     white  white  white  white  black
all-white   1      1      1      1      0
mixed       0.9    0.81   0.73   0.66   0.07
Llh ratio   1.11   1.23   1.37   1.52   0

As we draw out more white stones, the evidence for the all-white model increases steadily (a likelihood ratio of 1 means no evidence either way, and <1 means evidence against the all-white model). As soon as we draw a black stone, we are certain that the all-white model is wrong, and the likelihood ratio plummets to zero.
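The likelihood-ratio row can be sketched in Python as well. The function names below are hypothetical helpers, and the sketch assumes, as in the text, independent draws with replacement.

```python
# Probability of drawing a white stone in a single draw, per model (from the text).
P_WHITE = {"all-white": 1.0, "mixed": 0.9}


def sequence_likelihood(p_white, colours):
    """Likelihood of an ordered sequence of draws under one model."""
    likelihood = 1.0
    for colour in colours:
        likelihood *= p_white if colour == "white" else (1.0 - p_white)
    return likelihood


def likelihood_ratio(colours):
    """Likelihood of the all-white model divided by that of the mixed model."""
    return (sequence_likelihood(P_WHITE["all-white"], colours)
            / sequence_likelihood(P_WHITE["mixed"], colours))


draws = ["white", "white", "white", "white", "black"]
for n in range(1, len(draws) + 1):
    # Ratio after the first n draws, matching one spreadsheet column each.
    print(n, round(likelihood_ratio(draws[:n]), 2))
```

The ratio creeps upward with each white stone and drops to 0 at the black one, because the all-white likelihood is multiplied by 0 at that step.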

(The formula in cell F4 is  = E4 * 0.1 , since the probability of drawing a black stone from the mixed bag is only 0.1. This is the probability of drawing 4 consecutive white stones and then a black one. If you didn't know the order - just that someone had drawn 4 white stones out of 5 - you could use the binomial distribution,  = BINOMDIST( 4, 5, 0.9, FALSE) , which gives 0.33, but the likelihood ratio is still 0.)
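For the unordered case, the binomial probability can be checked with a short Python sketch. The function binomial_pmf is our own helper, intended to mirror BINOMDIST with its last argument set to FALSE.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# 4 white stones out of 5 draws under the mixed model, order unknown.
print(round(binomial_pmf(4, 5, 0.9), 2))

# The same data under the all-white model: one black stone is impossible.
print(binomial_pmf(4, 5, 1.0))
```

The all-white likelihood is 0 whether or not the order is known, which is why the likelihood ratio stays at 0.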

Finally, see how the likelihood ratio changes if you keep on pulling out white stones: copy the columns further across to the right - if you pulled out a black stone, delete that column first. What values of the likelihood ratio correspond to moderate evidence for the all-white hypothesis? What values to strong evidence? What values to convincing evidence?

Akaike's Information Criterion (AIC)

We can calculate Akaike's Information Criterion (AIC) for the two models. Since neither model involves parameters estimated from the data,

AIC = -2log(likelihood)

Add two more rows to the spreadsheet with the AIC for each model; remember to use LN() in Excel, rather than LOG(), to get the natural logarithm. Add another row with the difference in AICs, deltaAIC (here, the AIC of the mixed model minus the AIC of the all-white model):

Draw:       1      2      3      4      5
Colour:     white  white  white  white  black
all-white   1      1      1      1      0
mixed       0.9    0.81   0.73   0.66   0.07
Llh ratio   1.11   1.23   1.37   1.52   0
AIC:
all-white   0      0      0      0      #NUM!
mixed       0.21   0.42   0.63   0.84   5.45
deltaAIC    0.21   0.42   0.63   0.84   #NUM!

(If you pull out a black stone, the likelihood for the all-white model becomes 0, and log(0) = -∞; Excel doesn't do infinity, so you'll see a #NUM! error. deltaAIC in this case is ... really huge!)
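The AIC rows can be sketched in Python too. Unlike Excel, Python can represent infinity, so the black-stone column needs no error value. The function name is our own, and treating a zero likelihood as an infinite AIC is our assumption about how best to show that case; deltaAIC is taken as mixed minus all-white, as in the table.

```python
import math


def aic_no_free_parameters(likelihood):
    """AIC = -2 * ln(likelihood) when no parameters are estimated from the data."""
    if likelihood == 0:
        # math.log(0) raises ValueError, so return infinity explicitly.
        return math.inf
    # Adding 0.0 avoids printing "-0.00" when the likelihood is exactly 1.
    return -2.0 * math.log(likelihood) + 0.0


# Likelihoods for four white stones followed by a black one.
all_white = [1.0, 1.0, 1.0, 1.0, 0.0]
mixed = [0.9 ** n for n in range(1, 5)] + [0.9 ** 4 * 0.1]

for draw, (l_white, l_mixed) in enumerate(zip(all_white, mixed), start=1):
    aic_white = aic_no_free_parameters(l_white)
    aic_mixed = aic_no_free_parameters(l_mixed)
    delta_aic = aic_mixed - aic_white  # minus infinity once a black stone appears
    print(f"draw {draw}: all-white {aic_white:.2f}, "
          f"mixed {aic_mixed:.2f}, deltaAIC {delta_aic:.2f}")
```

With this sign convention, deltaAIC for the final column comes out as minus infinity: the data rule out the all-white model completely.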

Now see what values of deltaAIC correspond to moderate evidence for the all-white hypothesis. What values correspond to strong evidence? What values to convincing evidence?

Conclusions

  • Likelihood ratios and deltaAIC allow us to compare the evidence for pairs of models.
     
  • We can extend to >2 models by using one model as the standard and comparing each of the others with that (deltaAIC is always calculated with the model with the lowest AIC as the standard).
     
  • Likelihoods depend only on the data observed; data which might have been observed but weren't have no effect. This avoids a major criticism of p-values.
     
  • Interpretation as "strength of evidence" is still subjective, but it can be based on the experiment we just conducted (or Royall's (1997) canonical urns). You can make your own assessment, but it will probably look rather like this:
                        Number of white stones   Likelihood ratio   deltaAIC
  Moderate evidence:    at least 10              >3 or <0.33        >2
  Strong evidence:      at least 20              >8 or <0.125       >4
  Convincing evidence:  at least 50              >200 or <0.005     >10

Keep these values in mind (or save your spreadsheet where you can find it) and use them to interpret deltaAIC values or likelihood ratios.
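To see where thresholds of this kind come from, here is a rough Python sketch that computes the likelihood ratio and deltaAIC after a run of white stones. The function name is hypothetical. Since neither model has estimated parameters, deltaAIC (mixed minus all-white) is simply 2 * ln(likelihood ratio).

```python
import math

P_WHITE_MIXED = 0.9  # probability of a white stone under the mixed model


def evidence_after_white_draws(n):
    """Return (likelihood ratio, deltaAIC) after n consecutive white draws."""
    ratio = 1.0 / P_WHITE_MIXED ** n    # all-white likelihood is always 1
    delta_aic = 2.0 * math.log(ratio)   # AIC(mixed) - AIC(all-white)
    return ratio, delta_aic


for n in (10, 20, 50):
    ratio, delta_aic = evidence_after_white_draws(n)
    print(f"{n:>2} white stones: likelihood ratio {ratio:.1f}, "
          f"deltaAIC {delta_aic:.1f}")
```

The ratios come out at roughly 2.9, 8.2 and 194, with deltaAIC of about 2.1, 4.2 and 10.5, so the thresholds in the table are convenient round numbers rather than exact values.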


What next?

We will return to this experiment to see how likelihoods can be turned into probabilities, so that we can state probabilities for the models.


Text by Mike Meredith, updated 24 March 2010