Bootstrapping biodiversity indices

 R   Biodiversity scripts (zip, 22KB)

Bias and confidence limits for indices

Although the better diversity indices are less sensitive to rare species missed from the sample, calculations based on the sample tend to underestimate the true value for the community: estimates are biased low. (Simpson's index is an exception to this, as it has a built-in small-sample correction.) In addition, most indices lack a way to estimate the precision of the result.

The concept of bootstrapping

The idea of the bootstrap is shown in the diagram on the right. In the real world, the true diversity index for the population is unknown. We take a sample and estimate the index based on the sample.

In the bootstrap world we set up a model population based on what we know of the real population. We take samples from this model population and calculate values for the index from the bootstrap samples.

The bootstrap world gives us two major advantages:

  1. We know exactly what the true value of the index is for the model population, because we created it. We can compare the values for the samples with the true value and estimate the bias introduced by the sampling process.
  2. We can repeat the sampling process many hundreds, even thousands, of times. Each sample will be different, and the scatter we find enables us to estimate the standard error or confidence interval of the result.

We can then use the results from the bootstrap world to calculate an unbiased estimate and confidence interval for the real world population.
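To make the two advantages concrete, here is a minimal Python sketch of the whole loop. It is not taken from the R scripts in the bundle, and it rests on several assumptions: the abundances are invented for illustration, Hill's N2 is computed with the plain 1/sum(p^2) formula for both the model population and the bootstrap samples, and the confidence interval is a simple percentile interval shifted by the estimated bias, which is only one of several ways to build a bootstrap interval. The function names are mine.

```python
import random
import statistics
from collections import Counter

def hill_n2(abundances):
    """Hill's N2 as 1 / sum(p_i^2), with no small-sample correction."""
    total = sum(abundances)
    return 1.0 / sum((n / total) ** 2 for n in abundances)

def bootstrap_index(abundances, index, n_boot=2000, seed=1):
    """Estimate bias and a shifted percentile interval for a diversity index."""
    rng = random.Random(seed)
    # One entry per individual; the value says which species it belongs to.
    individuals = [sp for sp, n in enumerate(abundances) for _ in range(n)]
    # "True" value for the model population, which we know because we built it.
    true_model = index(abundances)
    boot = []
    for _ in range(n_boot):
        tally = Counter(rng.choices(individuals, k=len(individuals)))
        boot.append(index([tally.get(sp, 0) for sp in range(len(abundances))]))
    # Bias seen in the bootstrap world: average sample value minus true value.
    bias = statistics.mean(boot) - true_model
    cuts = statistics.quantiles(boot, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    sample_estimate = index(abundances)
    return {
        "sample_estimate": sample_estimate,
        "bias": bias,
        "corrected": sample_estimate - bias,
        "ci_low": cuts[0] - bias,
        "ci_high": cuts[-1] - bias,
    }

# Invented data: 170 individuals in 20 species.
abundances = [40, 30, 20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1]
result = bootstrap_index(abundances, hill_n2)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The estimated bias should come out negative here, matching the point that sample values tend to underestimate the true value, so subtracting it pushes the corrected estimate upwards.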

There are plenty of sources of information on bootstrapping in general: I would recommend Efron and Tibshirani's (1993) book An Introduction to the Bootstrap, which meshes with the 'bootstrap' package in R. Many other software packages also include a bootstrap option. So here are just a few points specific to bootstrapping diversity indices.

Bootstrapping diversity indices

The bootstrap population is modeled on the sample we have, in that the proportions of the various species are the same. (That is, this is a nonparametric bootstrap method, as trying to fit an equation to the population would be difficult.) But the bootstrap population is large, much larger than the sample: we draw samples with replacement, so the population never 'runs out'. It behaves as an infinite population. As a result, when we come to calculate the true index for the bootstrap population we should not use a small-sample correction. If you bootstrap Simpson's index, do not use the n*(n-1) form for the true value, but do use it for the estimates from the bootstrap samples.
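The two forms of Simpson's index can be put side by side in a short Python sketch. I am assuming the 'dominance' version of the index here, the sum of squared proportions; if you use 1 - D or 1/D the same distinction applies after the transformation. The function names and the tiny data set are made up for illustration.

```python
def simpson_plugin(abundances):
    """Sum of p_i^2: the form to use for the 'true' value of the model population."""
    total = sum(abundances)
    return sum((n / total) ** 2 for n in abundances)

def simpson_corrected(abundances):
    """Sum of n_i*(n_i-1) / (N*(N-1)): the small-sample form, for finite samples."""
    total = sum(abundances)
    return sum(n * (n - 1) for n in abundances) / (total * (total - 1))

counts = [5, 3, 2]  # invented sample: 10 individuals, 3 species

# True value for the (effectively infinite) bootstrap population:
# (25 + 9 + 4) / 100 = 0.38
print(simpson_plugin(counts))

# Estimate from a finite sample, here applied to the same counts:
# (20 + 6 + 2) / 90 = 0.3111...
print(simpson_corrected(counts))
```

In a bootstrap run, `simpson_plugin` would be applied once to the original counts to get the true value for the model population, and `simpson_corrected` to each bootstrap sample.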

The bootstrap must be set up so that we are sampling individuals, not species. If the real world sample is, say, 170 birds from 20 different species, we might summarize the data as 20 numbers, the abundances for each of the species. But picking 20 values at random from those 20 numbers (with replacement) is not what we want. Instead, we need to represent the population as a string of 170 numbers, each number representing one bird, with the value indicating which species it belongs to. Then we take 170 individuals from that population, with replacement, and see how many we have of each species.
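A single resampling step of this kind might look like the following Python sketch. The abundances are invented to match the 170 birds and 20 species of the example, and the function names are mine rather than anything from the bundled scripts.

```python
import random
from collections import Counter

def expand_to_individuals(abundances):
    """Turn per-species counts into one entry per individual (value = species number)."""
    individuals = []
    for species, count in enumerate(abundances):
        individuals.extend([species] * count)
    return individuals

def bootstrap_sample_counts(abundances, rng):
    """Draw one bootstrap sample of individuals and tabulate it by species."""
    individuals = expand_to_individuals(abundances)
    # Sample as many individuals as we had originally, with replacement.
    resample = rng.choices(individuals, k=len(individuals))
    tally = Counter(resample)
    # A species missed by this resample gets a count of 0.
    return [tally.get(species, 0) for species in range(len(abundances))]

# Invented abundances: 170 birds in 20 species.
abundances = [40, 30, 20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1]

rng = random.Random(2010)  # fixed seed so the run can be repeated
print(bootstrap_sample_counts(abundances, rng))
```

Each call returns a new set of 20 abundances that still add up to 170, and the rarest species will often drop out of a resample, which is just what happens with real samples.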

If that all sounds a bit convoluted, you'll probably find it becomes clear if you look at an example. The bundle of R scripts and data sets here includes examples of running bootstraps with both Hill's N2 and Simpson's index.


Text by Mike Meredith, updated 7 April 2010