Biodiversity

  Richness Lab guide (pdf, 314KB)
  Richness Data file (zip, 1KB)
  R Biodiversity scripts (zip, 22KB)

Work in progress on this page!

Our staff seminar at the end of November 2006 looked at biological diversity concepts and measures. This came out of recent work by Jason and his team trapping bats in Maludam and Loagan Bunut National Parks, as well as Melvin's earlier data on tree species in areas where flying foxes forage.

We used R to analyze the examples, and some R functions and R scripts are provided in the links below.


The meaning of biological diversity

The concept of diversity is slippery! Some biologists simply say, "I know it when I see it"; others dismiss it as a 'non-concept'. So we began by looking at some simple diagrams of groups of animals and discussing which groups were more diverse than others, and why.

Biodiversity is usually interpreted as 'species diversity', though other taxa could be used, and within-species genetic diversity could also be included. Diversity of ecosystems or habitats was also brought up: although important, it is difficult to define habitats so that different kinds can be counted.

The relationship of abundance and evenness to diversity also came up. This is discussed in a paper in Nature by Purvis & Hector. Taxonomic distinctness should really be included in the concept, but few measures of diversity take it into account.

Species richness seems a simple and intuitively sensible measure of species diversity.


Species richness

Defining the population

To use species richness in practice, we first need to define the group of individual organisms included. We usually limit it to a particular taxonomic group (eg. birds) and a particular place (eg. Bako NP). Often we specify a trophic level (eg. insectivores) or a guild (eg. understorey gleaners). Richness estimation means counting the number of species, treating all species equally, whether they are endangered endemics or invasive weeds, top predators or primary producers. This is only reasonable if the population is defined so that the species in it are reasonably similar.

Invariably there will be population boundaries defined by your sampling method (eg. susceptible to capture in a 40mm mesh mist net extending from 1m to 3m above the ground from 7am to 7pm); results can only be compared if the same sampling methods are used.

Many of the "rare" species we encounter are actually "edge species", on the boundary of our defined groups of organisms, eg. birds that rarely enter Bako NP, or are rarely captured in mist nets. Deciding which to include can be difficult, as it implies knowing what species ought to be present!

Species accumulation curves

Once the population is defined, we want to know the number of species within it. Sampling is unlikely to capture all the species present, so the number of species observed, Sobs, will usually be too low. We can plot some graphs which will give us an idea of how much too low.

You could draw a simple "collector's curve" by plotting the number of species vs the number of individuals as you collected more samples: the black line in the graph on the left is such a curve for bats trapped in Loagan Bunut NP in Malaysia.

Provided the samples are independent, the order in which they were collected carries no information, yet our collector's curve would look different if the samples had been taken in a different order. The smooth red curve was drawn by shuffling the order of the samples 100 times and averaging the curves obtained. We used computer software to do the shuffling, and also to calculate the 95% confidence intervals indicated by the dashed lines.
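The shuffling and averaging can be sketched in a few lines. This is only an illustration in Python (the scripts linked from this page are in R), the trapping data are made up, and the function names are my own. It plots species against number of samples rather than individuals and leaves out the confidence intervals.

```python
import random

def accumulation_curve(samples):
    """Cumulative number of species after each sample, in the order given.

    `samples` is a list of dicts mapping species name -> individuals caught.
    """
    seen = set()
    curve = []
    for sample in samples:
        seen.update(sp for sp, count in sample.items() if count > 0)
        curve.append(len(seen))
    return curve

def smoothed_curve(samples, n_shuffles=100, seed=1):
    """Average the accumulation curve over random orderings of the samples."""
    rng = random.Random(seed)
    totals = [0] * len(samples)
    order = list(samples)
    for _ in range(n_shuffles):
        rng.shuffle(order)
        for i, s in enumerate(accumulation_curve(order)):
            totals[i] += s
    return [t / n_shuffles for t in totals]

# Hypothetical data: one dict per trap-night (NOT the Loagan Bunut data).
nights = [
    {"A": 3, "B": 1},
    {"A": 2},
    {"B": 2, "C": 1},
    {"A": 1, "D": 1},
    {"A": 4, "B": 1, "E": 1},
]
collector = accumulation_curve(nights)  # [2, 2, 3, 4, 5] for this order
smooth = smoothed_curve(nights)         # averaged over 100 random orders
```

Both curves end at the same total (5 species here); only the route they take to get there differs.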

You can use EstimateS to calculate smoothed species accumulation curves and estimates of species richness (see below). A Lab Guide to a worked example using the Loagan Bunut bat data is here, and you can download the data file here.

The species accumulation curve starts climbing rapidly, then flattens out. If we collected enough samples so that we had picked up all the species present, it would level off. For the Loagan Bunut bats we still have a way to go. Estimating where it would level off is an extrapolation problem, and those are always difficult.

Estimating richness

A number of methods have been suggested to estimate the true species richness:

  1. Jackknife and bootstrap methods examine the sampling process, asking, "How many species which we know are there would we have missed if we had taken fewer samples, or different samples?" This is then used to estimate the number missing from the actual set of samples.
     
  2. Anne Chao's methods consider the number of species which are so rare that they only occur once or twice in the sample, and try to estimate how many are even rarer, so didn't turn up in the sample at all. ACE and ICE take the same approach but use species which occur 1-10 times in the sample.
     
  3. Some people try to fit a mathematical equation to the species accumulation curve, and use this to predict the number of species where the curve levels off. A favourite is the Michaelis-Menten equation, which describes an enzyme-catalyzed chemical reaction: it's not clear what that might correspond to when sampling animals or plants!
     
  4. Rennolls and Laumonier (2006), working with tree species in Sumatra, assume the rare species in their plots are a random sample of equally rare species in the whole forest, and try to estimate how many 'shadow' species there are.
     
  5. Recently Royle and Dorazio (2008) have developed a method which allows detection probabilities to vary among species and they use Bayesian software to compute the estimated total of species.
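To make the first two groups concrete, here is a rough Python sketch of two of the simplest estimators: the first-order jackknife (based on the species found in only one sample) and Chao1 (based on the species represented by one or two individuals). The formulas are the standard published ones, but the function names and data are made up for illustration, and the code is no substitute for EstimateS.

```python
from collections import Counter

def chao1(abundances):
    """Chao1 estimate from a list of per-species abundances.

    f1 = species seen exactly once, f2 = species seen exactly twice.
    Falls back to the bias-corrected form when f2 is zero.
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)
    f1 = sum(1 for n in counts if n == 1)
    f2 = sum(1 for n in counts if n == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2

def jackknife1(samples):
    """First-order jackknife from a list of samples (dicts species -> count).

    q1 = number of species occurring in exactly one sample.
    """
    m = len(samples)
    occurrences = Counter(
        sp for sample in samples for sp, n in sample.items() if n > 0
    )
    s_obs = len(occurrences)
    q1 = sum(1 for k in occurrences.values() if k == 1)
    return s_obs + q1 * (m - 1) / m

# Hypothetical abundances: 8 species observed, 4 singletons, 2 doubletons.
estimate = chao1([10, 5, 2, 2, 1, 1, 1, 1])  # 8 + 16/4 = 12.0
```

Both estimators add a correction to Sobs that grows with the number of very rare species, which is why they can never give an answer below the number of species actually observed.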

Various methods in Groups 1-3 are implemented in EstimateS and the Lab Guide explains how to use this with data for bat trapping in Loagan Bunut. More details for these estimators are in the documentation for EstimateS, and I won't repeat them here. Many authors have recommendations for the choice of estimator, and a survey is here.

Species accumulation gives us a way to judge how well the various species richness estimators work for our data. The ideal estimator would be a straight horizontal line (green in the graph on the right) at the level where the species accumulation curve (black) will level off. In practice, we might expect a good estimator to give crazy answers when we have only a few samples, but to settle down to a steady value as we get more samples: the blue line in the graph would be fine. The red curve, which gradually goes up and up as we get more samples - and record more species - is not much help, but estimators often give a result like this.

Comparing species richness between sites

Very often the important question is not "How many species…?" but "Are there more species at… than…", either comparing two sites or the same site at two points in time. Here we can make inferences based on interpolation, which is much safer than extrapolation.

Some harp-trapping has been done in peat swamp forests at Maludam National Park, about 450 km from Loagan Bunut. Unlike Loagan Bunut, Maludam was used for timber production before being established as a national park.

In Maludam, 81 bats from 11 species were caught. In Loagan Bunut NP we caught 174 bats from 22 species. Does that mean that Loagan Bunut has more species of bats which can be trapped in harp-traps?

We can’t compare the species totals – 11 vs 22 – directly, because the number of bats caught in Maludam is less than half the number caught in Loagan Bunut. If we had carried on trapping at Maludam, we would almost certainly have found more species but, as we’ve seen, it’s difficult to estimate how many more. However, we can estimate how many species we would expect to find at Loagan Bunut if we only trapped 81 individuals. We use a subset of the data we collected at Loagan Bunut, with a process known as “rarefaction”. This involves taking samples at random from the Loagan Bunut set until we have approximately 81 individuals, and noting how many species we have found. We do this many times, and average the number of species.
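A minimal Python sketch of this procedure follows. The data are invented, the function name is my own, and EstimateS or the R scripts may differ in detail. It adds whole samples in random order until the target number of individuals is reached, so the final count can overshoot the target slightly.

```python
import random

def rarefied_richness(samples, target_individuals, n_runs=1000, seed=1):
    """Mean number of species found when samples are added in random
    order until at least `target_individuals` have been accumulated.

    `samples` is a list of dicts mapping species name -> individuals caught.
    """
    rng = random.Random(seed)
    order = list(samples)
    species_counts = []
    for _ in range(n_runs):
        rng.shuffle(order)
        seen, individuals = set(), 0
        for sample in order:
            if individuals >= target_individuals:
                break
            for sp, n in sample.items():
                if n > 0:
                    seen.add(sp)
                    individuals += n
        species_counts.append(len(seen))
    return sum(species_counts) / n_runs

# Hypothetical trap-nights from the richer site (17 individuals, 5 species).
rich_site = [
    {"A": 3, "B": 1},
    {"A": 2},
    {"B": 2, "C": 1},
    {"A": 1, "D": 1},
    {"A": 4, "B": 1, "E": 1},
]
# Expected species if only about 8 individuals had been caught there.
expected_at_8 = rarefied_richness(rich_site, target_individuals=8)
```

The rarefied value is then compared with the number of species actually observed at the site with the smaller catch.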

In fact, the smoothing process for the species accumulation curve also uses random selections from the samples, and ‘Randomization Curve’ is an alternative name for the smoothed species accumulation curve. The graph on the left shows the species accumulation curves for Loagan Bunut (black) and Maludam NP (red). If we only collected 82 bats at Loagan Bunut we'd expect to record 17 species. So Loagan Bunut appears richer than Maludam in harp-trappable bats, with 17 species expected against the 11 observed at Maludam. However, if we look at the confidence intervals, we see that they overlap: the difference is not statistically significant and could have arisen by chance.

To summarize...

Species richness is intuitively the "right" way to measure species diversity.

However, it is almost impossible to measure, and estimates based on extrapolation are often unreliable unless we have huge samples.

Although we may not be able to estimate the true species richness of our sites, we can use interpolation (rarefaction) to compare richness at two or more sites.

 


Making sense of biodiversity indices

Biodiversity indices purport to be a combined measure of species richness and species evenness. As we've seen, richness is often impossible to estimate, and evenness is difficult to define (more on that later). A large number of formulae have been proposed - see a partial list here.

An R script to try out a range of diversity indices, together with the functions and data sets you need, is here.

At one point I concluded that indices were just a mathematical carpet under which the problems with estimation of richness and evenness had been swept. However, Mark Hill has ideas which make sense of some of the indices.

Hill's diversity numbers

Hill (1973) noted that rare species - the ones that are so difficult to count - are ecologically less important than the common ones. (If that observation doesn't apply to the situation you have in mind, you need to (a) redefine the population so that it does apply, or (b) abandon the idea of counting species.)

Hill's 'diversity numbers' can be thought of as the number of common species in the population. But which count as 'common' and which are 'rare'?

Suppose we were brutal, and decided to look only at the most common species. The Berger-Parker index does that: it's just the proportion of the commonest species. A patch of coniferous forest in Canada may be 90% white spruce, and the BP index would be 0.9. Hill's Ninf = 1/BP = 1/0.9 = 1.1. This forest is almost a monoculture, and the value just over 1 reflects this. It doesn't matter much what the remaining 10% are, nor how many rare species it might contain.

Would that work in our Bornean forests? An inventory of a 1ha plot in Temburong (Brunei) identified 1011 trees; the most common, Fordia splendidissima, accounted for just 6%. Hill's Ninf = 1/0.06 = 16.7, which is rather low, considering that 276 species were actually recorded! Looking only at the most common species, ignoring the rest, is a bit too extreme.

Let's try Hill's N2. This involves the proportional abundances of each species (pi), ie. for each species (i), we take the number of individuals of that species in the sample (ni) and divide by the total number of individuals in the sample (N = Σni). We square the pi's, add them up, and take the reciprocal:

N2 = 1 / Σ pi^2

Squaring the pi's means that the common species have greater weight than rare ones: a species with 50% in the sample has pi^2 = 0.25, but for a species with only 1% it's 0.0001.

Hill's N2 for the Temburong forest works out at 75.1. If we look at the data, we'll see that 75 species account for 68% of the trees in the plot, the other 201 species only 32%. That looks like a reasonable split between 'common' and 'rare' species.

A whole range of Hill's diversity numbers exists. The general rule is:

Na = (Σ pi^a)^(1/(1-a))

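So to calculate N3 we cube all the pi's, add them up, then take the square root before inverting. The main ones of interest are:

  • N0 = the number of species in the sample (species richness),
  • N1 = exp(H'), where H' is the Shannon index,
  • N2 = 1 / Simpson's index,
  • Ninf = 1 / Berger-Parker index.

The numbers go down steadily as a increases: N0 is the biggest, Ninf the smallest.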
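Here is a small Python sketch of the whole family, using Na = (Σ pi^a)^(1/(1-a)). The two awkward cases are handled as limits: a = 1, where the formula becomes the exponential of the Shannon index, and a = infinity, where it becomes 1 over the proportion of the commonest species. The function name and the abundances are invented for illustration.

```python
import math

def hill_number(abundances, a):
    """Hill's diversity number of order a for a list of species abundances."""
    counts = [n for n in abundances if n > 0]
    total = sum(counts)
    p = [n / total for n in counts]
    if a == 1:
        # Limit as a -> 1: exponential of the Shannon index.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    if a == math.inf:
        # Limit as a -> infinity: reciprocal of the Berger-Parker index.
        return 1 / max(p)
    return sum(pi ** a for pi in p) ** (1 / (1 - a))

# Hypothetical near-monoculture: 90% of trees belong to one species.
spruce_forest = [90, 4, 3, 2, 1]
numbers = [hill_number(spruce_forest, a) for a in (0, 1, 2, 3, math.inf)]
# numbers[0] is 5 (all species counted); numbers[-1] is about 1.11
```

If all species are equally abundant, every Hill number equals the number of species. The more uneven the abundances, the faster the numbers fall as a increases.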

To sum up:

  • Hill's numbers are related to well-known indices,
  • Rare species have decreasing weight from N0 to Ninf
  • Missing out rare species has less and less effect as we move from N0 to Ninf, which reflects the greater ecological importance of common species,
  • N2 or 1/Simpson's index seems a reasonable compromise.

Evenness revisited

In principle, diversity is a combination of richness and evenness. In practice however, evenness measures are defined in terms of diversity. If I is an index of diversity, the corresponding evenness index, E, is defined as:

E = I / Imax

where Imax is the value I would take if the abundances in the sample were all equal. Unfortunately, Imax is usually highly sensitive to the number of species in the sample. We have managed to devise a diversity index which is not overly sensitive to the number of rare species captured in our sample, but the trade-off is an evenness index which is more sensitive to missing rare species.

Hill points out that the ratio of any pair of Hill diversity numbers can be used as an evenness measure, but he prefers to avoid N0 because it reflects the sampling event rather than the actual value for the community. He suggests using N2 / N1 both of which are reasonably well insulated from the effects of sampling.
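The effect of a single rare species on the two kinds of evenness ratio can be checked with a quick Python sketch. The abundances are invented and the function names are my own: we compare N2/N0 and N2/N1 for a sample with and without one extra singleton species.

```python
import math

def hill(abundances, a):
    """Hill's diversity number of order a (a = 1 handled as the limiting case)."""
    total = sum(abundances)
    p = [n / total for n in abundances if n > 0]
    if a == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** a for pi in p) ** (1 / (1 - a))

def evenness(abundances, top=2, bottom=1):
    """Evenness as a ratio of two Hill numbers; the default is N2 / N1."""
    return hill(abundances, top) / hill(abundances, bottom)

without_rare = [50, 30, 20]
with_rare = [50, 30, 20, 1]  # same sample plus one singleton species

# How far does each evenness measure move when the singleton turns up?
shift_n2_n0 = abs(evenness(with_rare, 2, 0) - evenness(without_rare, 2, 0))
shift_n2_n1 = abs(evenness(with_rare, 2, 1) - evenness(without_rare, 2, 1))
```

For these numbers the N2/N0 ratio moves far more than N2/N1 when the singleton is added, which is the sensitivity to rare species that Hill wanted to avoid.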

Hill's 1973 paper is well worth reading: 'Diversity and evenness: a unifying notation and its consequences'. Ecology 54:427-431


Bias and confidence limits for indices

Although the better diversity indices are less sensitive to rare species missed from the sample, calculations based on the sample tend to underestimate the true value for the community: estimates are biased low. (Simpson's index is an exception to this, as it has a built-in small-sample correction.) In addition, most indices lack a way to estimate the precision of the result.

The concept of bootstrapping

The idea of the bootstrap is shown in the diagram on the right. In the real world, the true diversity index for the population is unknown. We take a sample and estimate the index based on the sample.

In the bootstrap world we set up a model population based on what we know of the real population. We take samples from this model population and calculate values for the index from the bootstrap samples.

The bootstrap world gives us two major advantages:

  1. We know exactly what the true value of the index is for the model population, because we created it. We can compare the values for the samples with the true value and estimate the bias introduced by the sampling process.
  2. We can repeat the sampling process many hundreds, even thousands, of times. Each sample will be different, and the scatter we find enables us to estimate the standard error or confidence intervals of the result.

We can then use the results from the bootstrap world to calculate an unbiased estimate and confidence interval for the real world population.

There are plenty of sources of information on bootstrapping in general: I would recommend Efron and Tibshirani's (1993) book An introduction to the bootstrap, which meshes with the 'bootstrap' package in R. Many other software packages also include a bootstrap option. So here are just a few points specific to bootstrapping diversity indices.

Bootstrapping diversity indices

The bootstrap population is modeled on the sample we have, in that the proportions of the various species are the same. (This is a nonparametric bootstrap method, as trying to fit an equation to the population would be difficult.) But the bootstrap population is large, much larger than the sample: we draw samples with replacement, so the population never 'runs out'. It behaves as an infinite population. As a result, when we come to calculate the true index for the bootstrap population we should not use a small-sample correction. If you bootstrap Simpson's index, you should not use the n*(n-1) form for the true value, but do use it for the estimates from the bootstrap samples.

The bootstrap must be set up so that we are sampling individuals not species. If the real world sample is, say, 170 birds from 20 different species, we might summarize the data as 20 numbers, the abundances for each of the species. But picking 20 values at random from those 20 numbers (with replacement) is not what we want. Instead, we need to represent the population as a string of 170 numbers, each number representing one bird, with the value indicating which species it belongs to. Then we take 170 individuals from that population and see how many we have of each species.
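Both points can be seen in a short Python sketch. It is only an illustration with invented abundances (170 birds from 20 species), and it uses a simple percentile interval, which is my assumption rather than necessarily what the R scripts do. The function names are my own.

```python
import random

def simpson_corrected(counts):
    """Simpson's index with the small-sample correction: sum n(n-1) / (N(N-1))."""
    n_total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))

def simpson_plain(counts):
    """Simpson's index without correction: sum of squared proportions."""
    n_total = sum(counts)
    return sum((n / n_total) ** 2 for n in counts)

def bootstrap_simpson(abundances, n_boot=2000, seed=1):
    """Bootstrap bias and 95% percentile interval for Simpson's index."""
    rng = random.Random(seed)
    # One entry per individual; the value says which species it belongs to.
    individuals = [sp for sp, n in enumerate(abundances) for _ in range(n)]
    # The model population behaves as infinite: no small-sample correction.
    true_value = simpson_plain(abundances)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(individuals, k=len(individuals))
        counts = [0] * len(abundances)
        for sp in resample:
            counts[sp] += 1
        # Bootstrap samples are finite: use the corrected form.
        estimates.append(simpson_corrected(counts))
    estimates.sort()
    bias = sum(estimates) / n_boot - true_value
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return bias, (lower, upper)

# Hypothetical sample: 170 birds from 20 species.
birds = [40, 30, 20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1]
bias, (lower, upper) = bootstrap_simpson(birds)
```

The estimated bias is subtracted from the real-world estimate to correct it. For Simpson's index with its built-in correction the bias should come out close to zero, whereas an uncorrected index would show a clear bias.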

If that all sounds a bit convoluted, you'll probably find it becomes clear if you look at an example. The bundle of R scripts and data sets here includes examples of running bootstraps with both Hill's N2 and Simpson's index.


Text by Mike Meredith, updated 4 Jan 2009