Sampling squirrels

Objectives

We will draw samples from a simulated population of "squirrels" to review concepts such as random variable and expected value. We address the following questions:

  • What's the difference between the true and estimated values?
  • How do we quantify the precision of our estimate?
  • What's the difference between standard deviation and standard error?
  • Why are bigger samples better than small samples?
  • Why do we divide by (n - 1) when we calculate SD, although we use n to calculate the mean?

Drawing a random sample

The "squirrels" are plastic chips with values written on them. We use a simulated population of 100 "squirrels", which means that we know the values for the entire population. The weight of each squirrel is written on a chip in a bag, from which each participant draws a random sample of four chips and notes the values. The results from all participants are then used to investigate the properties of the samples.

If you are working on your own, you can create a large number of samples in R and plot histograms very easily: download the R script here; for hints on using scripts in R, go here. Alternatively, it is possible but tedious to do it in MS Excel®: download a spreadsheet with the data here.
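
For readers without the R script, drawing one sample can be sketched in Python. This is a minimal illustration, not the course script: the population below is invented (100 weights drawn from a normal curve with an assumed mean of 1100 and SD of 150), because the real chip values are only available in the downloadable data.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the bag of 100 chips; the mean (1100) and
# SD (150) used to generate it are assumptions, not the real data.
population = [round(random.gauss(1100, 150)) for _ in range(100)]

def draw_sample(pop, n=4):
    """Draw n chips without replacement, like pulling chips from the bag."""
    return random.sample(pop, n)

sample = draw_sample(population)
print(sample)
```

`random.sample` draws without replacement, which matches taking four different chips out of the bag at once.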

Sample means and the population mean

Everyone calculated the mean of their sample of four "squirrels", denoted by x̄ ("x bar").

My four numbers were 1100, 1147, 940 and 1238; the mean = 1106.25

The values of x̄ varied, and all differed from the true mean, μ ("mu"), which we know in this case. The difference between the sample mean and the true value for the population is the error; this is not the same as a mistake in sampling, but is directly due to the scatter in the values in the population sampled.

Nevertheless, the values of x̄ clustered around μ, as shown in the histogram (generated in R), where the red line shows the true mean, μ, and the dashed blue line is the mean of 500 sample means for sample size n = 4.

If we took a huge number of samples of n = 4, calculated the mean of each, then took the mean of the means, the result would be μ. This is the expected value of x̄, written as

E(x̄) = μ

Because of this relationship, we can use x̄ as an estimate of μ, designated as μ̂ ("mu hat") to distinguish it from the true (but usually unknown) value of μ. The histogram shows the distribution of the sample means, known in statistical parlance as the sampling distribution of the mean.
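
The idea that the mean of many sample means settles on μ can be checked with a short simulation. This is a sketch under stated assumptions: the population is invented (the same hypothetical normal weights, mean 1100 and SD 150), and 500 samples of n = 4 are used to echo the histogram described above.

```python
import random
from statistics import mean

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 weights; not the actual squirrel data.
population = [round(random.gauss(1100, 150)) for _ in range(100)]
mu = mean(population)  # true mean, known because we hold the whole population

# 500 samples of n = 4, each reduced to its sample mean
sample_means = [mean(random.sample(population, 4)) for _ in range(500)]
mean_of_means = mean(sample_means)

print("true mean:", mu)
print("mean of 500 sample means:", mean_of_means)
```

Individual sample means scatter widely, but their average should land close to the true mean, and closer still as the number of samples grows.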

We then got participants to pair up and do the same with their combined sample of n = 8. This produced a similar histogram, but values were clustered closer to the true population mean.
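
The tighter clustering for n = 8 can also be simulated. The sketch below again uses an invented population (assumed mean 1100, SD 150) and measures the spread of 2000 sample means for each sample size; the function name `sample_means` is introduced here only for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(4)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 weights; not the actual squirrel data.
population = [round(random.gauss(1100, 150)) for _ in range(100)]

def sample_means(n, reps=2000):
    """Means of `reps` random samples of size n, drawn without replacement."""
    return [mean(random.sample(population, n)) for _ in range(reps)]

spread_4 = pstdev(sample_means(4))  # scatter of the means for n = 4
spread_8 = pstdev(sample_means(8))  # scatter of the means for n = 8

print("spread of sample means, n = 4:", round(spread_4, 1))
print("spread of sample means, n = 8:", round(spread_8, 1))
```

The spread for n = 8 should come out clearly smaller than for n = 4.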

The tightness of the clustering - the width of the sampling distribution - will tell us the precision of our estimate, μ̂ = x̄, but first we need to look at measures of scatter.

Sample variance and standard deviation

Using the samples of n = 4, we tried different measures of scatter:

  • deviation for each observation in the sample is the difference from the sample mean, x - x̄; these are difficult to summarize as their mean is always zero.

My values were -6.25, 40.75, -166.25 and 131.75

  • absolute deviation is the same, but ignoring the sign and treating all as positive; this gave us the mean absolute deviation, often just called the mean deviation.

My value was 86.25.

  • a common way of dealing with these problems in statistics is to square the values, average them, then take the square root: we tried the root-mean-square (RMS) deviation.

My value for this was 108.05.
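
The three measures above can be reproduced for the four values quoted in the text (1100, 1147, 940 and 1238). The variable names are chosen here for illustration.

```python
from math import sqrt

x = [1100, 1147, 940, 1238]  # the four values quoted in the text
n = len(x)
x_bar = sum(x) / n           # sample mean

# Deviations from the sample mean; they always sum to zero
deviations = [xi - x_bar for xi in x]

# Mean absolute deviation: ignore the signs, then average
mean_abs_dev = sum(abs(d) for d in deviations) / n

# Root-mean-square deviation: square, average, take the square root
rms_dev = sqrt(sum(d ** 2 for d in deviations) / n)

print(deviations)          # [-6.25, 40.75, -166.25, 131.75]
print(sum(deviations))     # 0.0
print(mean_abs_dev)        # 86.25
print(round(rms_dev, 2))   # 108.05
```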

We calculated the RMS deviation for the population and for our samples, and found that the sample values were lower than the value for the population (145.74), as shown in the histogram on the right, where the true value is shown by the red line and the mean of the sample values by the blue line. RMS deviation for the sample is not a good estimator for the population value: it is biased low.

The values are too small because they're calculated from x̄, and x̄ itself is calculated to be bang in the middle of the sample values! So we tried doing the calculation with x - μ instead of x - x̄; the results were consistently higher (the sample x's are closer to x̄ than to μ) and very close to the true value for the population.
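
The low bias of the RMS deviation about x̄ can be seen in a simulation. This sketch uses an invented population (assumed mean 1100, SD 150), so the numbers will not match the 145.74 quoted above; only the pattern is of interest. The helper `rms_about` is a name introduced here.

```python
import random
from math import sqrt

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 weights; not the actual squirrel data.
population = [random.gauss(1100, 150) for _ in range(100)]
mu = sum(population) / len(population)
# RMS deviation of the whole population (dividing by N)
sigma = sqrt(sum((v - mu) ** 2 for v in population) / len(population))

def rms_about(values, centre):
    """Root-mean-square deviation of values about a chosen centre."""
    return sqrt(sum((v - centre) ** 2 for v in values) / len(values))

about_xbar, about_mu = [], []
for _ in range(2000):
    sample = random.sample(population, 4)
    x_bar = sum(sample) / len(sample)
    about_xbar.append(rms_about(sample, x_bar))  # uses the sample mean
    about_mu.append(rms_about(sample, mu))       # uses the true mean

avg_about_xbar = sum(about_xbar) / len(about_xbar)
avg_about_mu = sum(about_mu) / len(about_mu)

print("population RMS deviation:", round(sigma, 1))
print("average sample RMS about x-bar:", round(avg_about_xbar, 1))
print("average sample RMS about mu:", round(avg_about_mu, 1))
```

For any single sample the RMS deviation about x̄ can never exceed the RMS deviation about μ, so its average is bound to be the lower of the two.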

For this I got 150.49, close to the true value.

But we rarely know the true mean, μ; how do we calculate scatter without it? The key is the concept of degrees of freedom. Although we have four values for x, the deviations really only contain three bits of information, as they always add up to zero; the missing 'bit of information' has been hijacked by x̄. So it would be more honest to divide the sum of squared deviations by 3 instead of 4 when calculating the degree of scatter. The standard calculation for the standard deviation of a sample, s, is

s = √[ Σ(x - x̄)² / (n - 1) ]

In general, if you calculate a value from n observations and k parameters which have already been calculated from the same n observations, the degrees of freedom = n - k. (Note that the RMS value, dividing by n not n - 1, is correct for population standard deviation, σ ("sigma").)

  • standard deviations (SD) of the samples using n - 1 degrees of freedom instead of n cluster around the true value for the population, σ. So s is an (almost) unbiased estimator of σ, σ̂ = s. (In fact, s is still slightly biased low, but not enough to cause problems.)

My result was 124.76, not as good as the value using μ, but better than when dividing by n.

  • variance is the square of the standard deviation, symbolized by s² (for samples) or σ² (for population parameters).

My result: 15,566.
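
The effect of dividing by n - 1 rather than n can be checked on the four values quoted in the text. This sketch uses Python's standard `statistics` module, where `stdev` and `variance` divide by n - 1 and `pstdev` divides by n.

```python
from statistics import pstdev, stdev, variance

x = [1100, 1147, 940, 1238]  # the four values quoted in the text

rms = pstdev(x)    # divides by n: the RMS deviation
s = stdev(x)       # divides by n - 1: the sample standard deviation
s2 = variance(x)   # divides by n - 1: the sample variance

print(round(rms, 2))  # 108.05
print(round(s, 2))    # 124.76
print(round(s2))      # 15566
```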

Variance and standard error of the sample mean

Now let's go back to the histogram of the sample means and the problem of measuring their scatter. If we had a large number of samples, we could calculate the SD in the same way as we did for the observations within a sample. But we usually have only one sample. Statistical theory comes to the rescue in most cases, as it can be shown that the SD of the sampling distribution of the mean can be derived from the SD of the population sampled (in practice, s) and the sample size, n:

SE = s / √n

My sample gave the value of 62.38.
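
The standard error of the mean is conventionally estimated as the sample SD divided by the square root of the sample size. Applied to the four values quoted in the text, the calculation can be sketched as:

```python
from math import sqrt
from statistics import stdev

x = [1100, 1147, 940, 1238]  # the four values quoted in the text

s = stdev(x)             # sample SD, using n - 1 degrees of freedom
se = s / sqrt(len(x))    # estimated SD of the sampling distribution of the mean

print(round(se, 2))      # 62.38
```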

"In most cases" includes cases where the variable follows a normal (bell-shaped) curve, or sample size is large, or intermediate cases with an approximately normal curve and medium-sized samples. See the demonstration in R of the Central Limit Theorem to get a feel for this.

The SD of a sampling distribution gets a special name: standard error (SE). If you repeated the sampling and calculations many times, the true value of the mean would lie between x̄ - SE and x̄ + SE about 2/3 of the time for large samples; for small samples that proportion goes down, and for n = 4 it is only about 60%.

I would state my estimate of the squirrel mean as 1106.25 ± 62.38.
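
How often the interval x̄ ± SE captures the true mean can be explored by simulation. The sketch below assumes a normally distributed population with invented parameters (mean 1100, SD 150) rather than the squirrel data, and compares n = 4 with n = 50; the function names are introduced here for illustration.

```python
import random
from math import sqrt

random.seed(5)  # fixed seed so the sketch is repeatable

MU, SIGMA = 1100, 150  # hypothetical normal population, not the squirrel data

def covers(sample, mu):
    """True if the interval x-bar +/- SE for this sample contains mu."""
    n = len(sample)
    x_bar = sum(sample) / n
    s = sqrt(sum((v - x_bar) ** 2 for v in sample) / (n - 1))
    se = s / sqrt(n)
    return x_bar - se <= mu <= x_bar + se

def coverage(n, reps=10000):
    """Proportion of `reps` simulated samples whose interval contains MU."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(MU, SIGMA) for _ in range(n)]
        if covers(sample, MU):
            hits += 1
    return hits / reps

cov_4 = coverage(4)
cov_50 = coverage(50)

print("coverage for n = 4:", cov_4)
print("coverage for n = 50:", cov_50)
```

Under these assumptions the proportion for n = 4 should come out near 60%, and the proportion for n = 50 near 2/3.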

Finally, the coefficient of variation (CV) of an estimate is the ratio between the SE and the actual estimate, often expressed as a percentage.

For my sample this came to 62.38/1106.25 = 0.056 or 5.6%, so I could give the result as 1106.25 ± 5.6%.

The CV is useful for comparing uncertainty in estimates: for example, when adding estimates of the number of animals in different zones of a national park.


What we have covered

  • The sample mean is a good estimator of the mean of the population.
     
  • The sample variance and standard deviation (when calculated with n - 1 degrees of freedom) are good estimators of population parameters.
     
  • The sample means themselves come from a distribution - the sampling distribution of the mean - with its own mean and standard deviation (SD).
     
  • The SD of the sampling distribution indicates the precision of the estimated parameter, and is given the special name 'standard error' (SE).

... plus the following terminology and widely-used symbols:

  • the terms mean, error, deviation, variance, standard deviation (SD), standard error (SE), coefficient of variation (CV), and expected value (E(...));
     
  • the use of Greek letters (μ, σ) for true population parameters (which we don’t know in reality) and the “hat” notation (μ̂, σ̂) for estimates of the parameters based on samples;
     
  • the “bar” notation for sample means (e.g. x̄), and the use of n (sample size), s (sample SD) and SE (standard error of the sample mean).

What next?

Confidence intervals are the best way to summarize uncertainty about an estimate: to calculate these for small samples (n < 20) you need to use Student's t-distribution.


Page updated 10 May 2009 by Mike Meredith