Frogs 3 : AIC and model selection

Data file (xls, 46kb)

Objectives

In "Frogs 1 : binomial distribution" we simulated data for detections of frogs at ponds where we knew frogs were present. Then, in "Frogs 2 : likelihood", we used those data to introduce the concept of likelihood and maximum likelihood estimation.

Here we will simulate another set of detection data to represent the results of a second visit to the same ponds.

If the results of the two visits are different, there are two possibilities:

  1. the true detection probability was the same on both visits, and the observed difference is due to sampling error; or
  2. the true detection probability was different on the two visits.

We will formulate these as two different models and use likelihood and AIC (Akaike's Information Criterion) to judge which is the better model.

We will also see how uncertainty about the right model can be allowed for through model averaging.

Data simulation and the likelihood calculations are in the Excel file "Frogs3_models.xls".


Simulating the second set of data

The first set of data was generated by rolling a 10-sided die with spots on some of the sides. If the die landed with a spot on top, the frogs were detected.

The first time, I recorded x1 = 6 detections out of 10.

The second time, we used a different die, so the actual number of spots could be the same or different.

I rolled the die 10 times and got the following results (1=detected, 0=not detected):

0 0 0 0 0 1 0 1 1 0

for a total of x2 = 3 detections out of n = 10.

The rest of the calculations on this page use these simulated data, x1 = 6 and x2 = 3. If you have different simulated observations, you will get different results all the way through!
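
The die rolling can also be imitated in code. The following is a minimal sketch, not the method used for this page: the function name `simulate_visit` and the choice of 4 spots are hypothetical, and the rolls listed above are simply typed in as data.

```python
import random

def simulate_visit(n_spots, n_rolls=10, seed=None):
    """Roll a 10-sided die n_rolls times; 1 = spot on top (frogs detected)."""
    rng = random.Random(seed)
    # A roll lands on one of sides 1..10; sides 1..n_spots carry a spot.
    return [1 if rng.randint(1, 10) <= n_spots else 0 for _ in range(n_rolls)]

# The rolls recorded above for the second visit.
second_visit = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
x2 = sum(second_visit)   # number of detections
n = len(second_visit)    # number of rolls

# A fresh simulation with a hypothetical 4 spots; results depend on the seed.
new_rolls = simulate_visit(n_spots=4, seed=1)
print(x2, n, new_rolls, sum(new_rolls))
```

Changing `n_spots` changes the true detection probability (`n_spots / 10`), which is the quantity the two models below try to estimate.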

Two different models

We now consider two possible models:

Model 1 : the true detection probability was the same on both occasions, and the observed difference is due to sampling error.

For my data, I estimate p̂ = (x1 + x2) / (2n) = 9 / 20 = 0.45

Model 2 : the true detection probability was different on the two occasions.

For my data, I estimate p̂1 = x1 / n = 0.6 and p̂2 = x2 / n = 0.3

We'll calculate the likelihood for each of these models.

Model 1

To find the likelihood of p(detect) with x1 = 6 and x2 = 3, we multiply together the individual likelihoods:

L(p=0.45 | x1=6, x2=3) = L(p=0.45 | x1=6) * L(p=0.45 | x2=3)

We use the binomial distribution to get the values for the likelihoods, which are calculated in the "Likelihoods" tab of the spreadsheet "Frogs3_models.xls". For my data, I look along the p=0.45 row and note the values in the columns for x = 6 and x = 3:

L(p=0.45 | x1=6, x2=3) = 0.160 * 0.166 = 0.027
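
The same numbers can be checked without the spreadsheet. This is a sketch in Python that assumes each visit is a binomial sample of n = 10; the helper name `binom_lik` is my own and is not part of the spreadsheet.

```python
from math import comb

def binom_lik(p, x, n=10):
    """Binomial probability of x detections in n trials, read as the likelihood of p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

x1, x2, n = 6, 3, 10
p_hat = (x1 + x2) / (2 * n)             # pooled estimate for Model 1

lik_visit1 = binom_lik(p_hat, x1, n)    # likelihood of p_hat given x1
lik_visit2 = binom_lik(p_hat, x2, n)    # likelihood of p_hat given x2
lik_model1 = lik_visit1 * lik_visit2    # joint likelihood for Model 1

print(round(lik_visit1, 3), round(lik_visit2, 3), round(lik_model1, 3))
```

Rounded to three decimal places, the three values agree with those read from the spreadsheet.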

Model 2

This model has two different values of p(detect), one for the first occasion (p1) and another for the second occasion (p2).

My estimates were p̂1 = 0.6 and p̂2 = 0.3, so for this model:

L(p1=0.6, p2=0.3 | x1=6, x2=3) = L(p1=0.6 | x1=6) * L(p2=0.3 | x2=3)

Referring to the likelihoods calculated in the spreadsheet, I get:

L(p1=0.6, p2=0.3 | x1=6, x2=3) = 0.251 * 0.267 = 0.067
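
As a cross-check on the spreadsheet values, here is a short Python sketch of the Model 2 calculation, again assuming binomial sampling with n = 10. The helper `binom_lik` is a name chosen for illustration.

```python
from math import comb

def binom_lik(p, x, n=10):
    """Binomial probability of x detections in n trials, read as the likelihood of p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

x1, x2, n = 6, 3, 10
p1_hat = x1 / n                          # estimate for the first visit
p2_hat = x2 / n                          # estimate for the second visit

lik_visit1 = binom_lik(p1_hat, x1, n)
lik_visit2 = binom_lik(p2_hat, x2, n)
lik_model2 = lik_visit1 * lik_visit2     # joint likelihood for Model 2

print(round(lik_visit1, 3), round(lik_visit2, 3), round(lik_model2, 3))
```

Each visit gets its own best-fitting probability here, which is why the joint likelihood is higher than for a single shared p.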

Using log-likelihood

You will see that the likelihood gets smaller and smaller each time we add more data. With lots of data, the likelihood is extremely small. For example, the likelihood for the psi(.) p(.) model for the golden cats analysis in PRESENCE is:

0.000...(70 zeros in here!)...00000000002185

So we prefer to work with the (natural) log of the likelihood, which in this case is:

-201.86

Using logs makes no difference to the process of maximum likelihood estimation, as the maximum of the log likelihood always corresponds to the maximum likelihood.

The log likelihoods for the two models using my data are:

log L(Model 1) = log(0.027) = -3.612

log L(Model 2) = log(0.067) = -2.703

(Working from the unrounded likelihoods, 0.02656 and 0.06693, gives -3.628 and -2.704.)

You need to be careful when comparing negative numbers: the log likelihood for Model 2 is bigger than the log likelihood for Model 1.
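
A useful property of logs is that log likelihoods add where likelihoods multiply. The sketch below illustrates this in Python, assuming binomial sampling with n = 10 (the helper `binom_lik` is a hypothetical name). It works from the unrounded binomial probabilities, so the third decimal place can differ from logs taken of the rounded likelihoods.

```python
from math import comb, exp, log

def binom_lik(p, x, n=10):
    """Binomial probability of x detections in n trials, read as the likelihood of p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Sum of the per-visit log likelihoods = log of the product of likelihoods.
loglik_model1 = log(binom_lik(0.45, 6)) + log(binom_lik(0.45, 3))
loglik_model2 = log(binom_lik(0.6, 6)) + log(binom_lik(0.3, 3))
print(round(loglik_model1, 3), round(loglik_model2, 3))

# The less negative value is the bigger one.
print(loglik_model2 > loglik_model1)

# Converting a log likelihood of -201.86 back gives a number of order 1e-88.
print(exp(-201.86))
```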

Comparing the models

Comparing the two models applied to my simulated data, we see that the likelihood for model 2 is higher than for model 1 (0.067 vs 0.027), which means that it is a closer fit to the data we collected. But model 2 is also more complex, having two parameters (p1 and p2) instead of just one.

The model which is the best fit to the data is not necessarily the best representation of reality. To see the difference, let’s look at a different situation.

[Figure: graph of a hypothetical population with different models]

The graph shows the numbers for a hypothetical population of elephants over eight years. The crosses indicate the true population, which has an underlying upward trend. The black dots are the estimates from annual surveys of elephants. The lines represent two models calculated from the survey data. The straight line is a simple model with just 2 parameters; the wiggly line has 8 parameters and fits the data perfectly. In fact, the wiggliness of the wiggly line is due mainly to errors in estimating the population, and the simpler model is closer to the true situation. The wiggly model is described as “over-fitted” or “over-parameterized”.
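
The idea can be tried out with made-up numbers. In the sketch below, the true population, the survey errors and all function names are hypothetical and are not the values behind the graph. A 2-parameter straight line is fitted by least squares, and an 8-parameter polynomial is made to pass through every survey estimate.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope): 2 parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def interpolate(xs, ys, x):
    """Polynomial through all the points (Lagrange form); 8 points give 8 parameters."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def sse(a, b):
    """Sum of squared differences between two series."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

years = list(range(1, 9))
truth = [100 + 5 * t for t in years]        # hypothetical true population
errors = [6, -8, 7, -5, 9, -7, 4, -6]       # hypothetical survey errors
surveys = [t + e for t, e in zip(truth, errors)]

intercept, slope = fit_line(years, surveys)
line = [intercept + slope * t for t in years]
wiggly = [interpolate(years, surveys, t) for t in years]

print("misfit to surveys:", sse(line, surveys), sse(wiggly, surveys))
print("error vs truth:   ", sse(line, truth), sse(wiggly, truth))
```

The wiggly model has zero misfit to the survey estimates, but it reproduces every survey error, so it ends up further from the true population than the straight line.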

In reality, of course, we don’t know the true population; all we have are estimates, which are never quite exact. We see the dots but not the crosses in the diagram. How do we decide which is the best model?

Akaike’s Information Criterion (AIC) balances number of parameters and fit to the data (likelihood):

AIC = 2 * No. of Parameters – 2 * log(Likelihood)

and a smaller value of AIC indicates a better combination of simplicity and fit to the data.

The table below, which was calculated in the "Model comparison" tab of the spreadsheet "Frogs3_models.xls", shows the calculation for AIC for the two models for my data:

Model                   No. of Parameters            Likelihood   log(Llh)   AIC     ΔAIC
Model 1 (p1 = p2 = p)   1 (p̂ = 0.45)                 0.027        -3.6119    9.224   0
Model 2 (p1 ≠ p2)       2 (p̂1 = 0.60; p̂2 = 0.30)     0.067        -2.7031    9.406   0.182

Model 1 has a smaller AIC value, and is the better model by this criterion.
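
The AIC arithmetic is simple enough to reproduce directly. The following Python sketch takes the log likelihoods from the table as given; the function name `aic` and the dictionary layout are my own choices, not part of the spreadsheet.

```python
def aic(n_params, log_lik):
    """AIC = 2 * (number of parameters) - 2 * log likelihood."""
    return 2 * n_params - 2 * log_lik

# (number of parameters, log likelihood) as listed in the table.
models = {"Model 1": (1, -3.6119), "Model 2": (2, -2.7031)}

aic_values = {name: aic(k, ll) for name, (k, ll) in models.items()}
best = min(aic_values.values())

for name, value in aic_values.items():
    # Print each model's AIC and its difference from the best (lowest) AIC.
    print(name, round(value, 3), round(value - best, 3))
```

The extra parameter costs Model 2 two AIC units, which slightly outweighs its gain in log likelihood for these data.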

Again, note that if you are using your own simulated data, you will get different results, and you may find that Model 2 is better than Model 1. Also note that you can't compare AIC values for different data sets: the data used to calculate AIC must be the same.

Going further with AIC

The AIC values enable us to answer more questions about the models. Model 1 is better than Model 2, but is it a lot better, or only a little better? The difference in AIC ("ΔAIC" or "deltaAIC") is only 0.182, which means that Model 1 is only a little better. Burnham and Anderson (2002, page 70) suggest that models with deltaAIC < 2 are all plausible, those with deltaAIC of 4-7 are considerably less so, and deltaAIC > 10 means that the models miss some important explanatory variables.

We can also ask, "How likely is it that Model 2 is in fact the correct model?" We can only calculate the relative likelihoods of the models, but if we take the best model as having Model Likelihood = 1, then

Model Likelihood = e^(-deltaAIC/2)

For Model 2, that works out to 0.913 (see table below), so Model 2 is almost as likely as Model 1 to be the right one.

Model                   AIC     deltaAIC   Model likelihood   AIC weight
Model 1 (p1 = p2 = p)   9.224   0          1                  1/1.913 = 0.523
Model 2 (p1 ≠ p2)       9.406   0.182      0.913              0.913/1.913 = 0.477
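
The columns of this table follow from the AIC values alone. Here is a small Python sketch, taking the two AIC values as given; the variable names are illustrative only.

```python
from math import exp

aic_values = {"Model 1": 9.224, "Model 2": 9.406}

best = min(aic_values.values())
delta = {m: a - best for m, a in aic_values.items()}       # deltaAIC
model_lik = {m: exp(-d / 2) for m, d in delta.items()}     # relative to best model
total = sum(model_lik.values())
weight = {m: ml / total for m, ml in model_lik.items()}    # AIC weights sum to 1

for m in aic_values:
    print(m, round(delta[m], 3), round(model_lik[m], 3), round(weight[m], 3))
```

Because the weights are the model likelihoods rescaled to sum to 1, they can be read as the relative support for each model within the set being compared.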
 

If we were quite sure that Model 1 was right, we’d be confident in our estimates of detection probabilities, p̂1 = p̂2 = p̂ = 0.45. But Model 2 is almost as likely to be the right one, so the right values might be p̂1 = 0.60 and p̂2 = 0.30. We really need an estimate which takes into account the uncertainty about the best model.

The solution is 'model averaging': we take a weighted average of the estimates from the different models, using weights proportional to the Model Likelihood. As shown in the last column of the table above, AIC weights are simply the Model Likelihood divided by the total of all the Model Likelihoods. For our example, this gives:

p̂1 = 0.45 x 0.523 + 0.60 x 0.477 = 0.52

p̂2 = 0.45 x 0.523 + 0.30 x 0.477 = 0.38
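
The weighted averages above can be written as a short Python sketch. It takes the AIC weights and the per-model estimates as given; the function name `model_average` is hypothetical.

```python
# AIC weights and the estimates each model gives for p1 and p2.
weights = {"Model 1": 0.523, "Model 2": 0.477}
estimates = {
    "Model 1": {"p1": 0.45, "p2": 0.45},   # one shared detection probability
    "Model 2": {"p1": 0.60, "p2": 0.30},   # separate probabilities per visit
}

def model_average(param):
    """Weighted average of one parameter's estimates across the models."""
    return sum(weights[m] * estimates[m][param] for m in weights)

p1_avg = model_average("p1")
p2_avg = model_average("p2")
print(round(p1_avg, 2), round(p2_avg, 2))
```

The averaged estimates fall between the values from the two models, pulled towards whichever model carries more weight.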

AIC is a powerful way of working with models and is used by PRESENCE, DISTANCE and MARK. It can be used to compare models which are completely different in form. But you must use the same data set for all the comparisons.

For more on modeling and model fitting with AIC, see Burnham and Anderson (2002).


Main points

  • A model is a simplified description of a real-world phenomenon, often expressed as a mathematical relationship.
     
  • Once we have the maximum likelihood estimates of the model parameters, we can use the value of the likelihood to compare models.
     
  • Because the likelihood is usually extremely small, we use the logarithm, the log likelihood, which is also used to calculate AIC.
     
  • The model with the highest likelihood is the best fit to the data, but a model with fewer parameters may be a better fit to the underlying phenomenon.
     
  • AIC balances number of parameters and log likelihood; the best model has the lowest AIC.
     
  • Differences between the models' AIC values (ΔAIC or delta AIC) allow us to compare models and combine the results.

Text by Mike Meredith, updated 11 May 2009