Objectives
We will draw samples from a simulated
population of "squirrels" to review concepts such as
random variable and expected value. We address questions
such as:
- What's the difference between the true and estimated values?
- How do we quantify the precision of our estimate?
- What's the difference between standard deviation and standard error?
- Why are bigger samples better than small samples?
- Why do we divide by (n - 1) when we calculate SD, although we use n to calculate the mean?
Drawing a random sample
We use a simulated population of 100 "squirrels", which means
that we know the values of the entire population. The
weights of the squirrels are on plastic chips in a bag, from
which each participant draws a random sample of four
chips and notes the values. The results from all
participants are then used to investigate the properties of
the samples.
If you are working on your own, you can create a large
number of samples in R and plot histograms very easily:
download the R script here; for hints on using scripts in R,
go here. Alternatively,
it is possible but tedious to do it in MS Excel®:
download a spreadsheet with the data
here.
Sample means and the population mean
Everyone calculated the mean of their sample of four
"squirrels", denoted by x̄ ("x bar").
My four numbers were 1100, 1147, 940 and 1238; the mean
x̄ = 1106.25.
The values of x̄ varied, and all differed from the true mean,
μ ("mu"), which we know in this case.
The
difference between the sample mean and the true value for
the population is the error; this is not the same as
a mistake in sampling, but is directly due to the
scatter in the values in the population sampled.
Nevertheless, the values of x̄ clustered around μ, as shown in
the histogram (generated in R), where the red line shows the
true mean, μ, and the
dashed blue line is the mean of 500 sample means for sample
size n = 4.
If we took a huge number of samples of n = 4,
calculated the mean of each, then took the mean of the
means, the result would be μ.
This is the expected value of x̄, written as E(x̄) = μ.
Because of this relationship, we can use x̄ as an estimate of μ,
designated as μ̂ ("mu hat") to distinguish it from the true (but usually
unknown) value of μ. The
histogram shows the distribution of the sample means, known
in statistical parlance as the sampling distribution of the mean.
We then got participants to
pair up and do the same with their combined sample of n
= 8. This produced a similar histogram, but values were
clustered closer to the true population mean.
The tightness of the clustering
- the width of the sampling distribution - will tell us the
precision of our estimate, μ̂ = x̄,
but first we need to look at measures of scatter.
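The behaviour of the sample means can be reproduced with a short simulation. The sketch below is in Python rather than the R script mentioned earlier, and it builds its own hypothetical population of 100 weights (the mean of 1200 and SD of 150 used to generate it are assumptions, not the workshop data). It draws 500 samples of n = 4 and of n = 8 and compares the mean and spread of the sample means.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of 100 "squirrel" weights (NOT the workshop data).
population = [random.gauss(1200, 150) for _ in range(100)]
mu = statistics.mean(population)  # true mean, known because we built the population


def sample_means(n, n_samples=500):
    """Draw n_samples random samples of size n and return their means."""
    return [statistics.mean(random.sample(population, n)) for _ in range(n_samples)]


means_4 = sample_means(4)
means_8 = sample_means(8)

print(f"true mean mu         = {mu:.1f}")
print(f"mean of means, n = 4 = {statistics.mean(means_4):.1f}")
print(f"mean of means, n = 8 = {statistics.mean(means_8):.1f}")
print(f"SD of means, n = 4   = {statistics.stdev(means_4):.1f}")
print(f"SD of means, n = 8   = {statistics.stdev(means_8):.1f}")
```

Both means of means should land close to the true mean, while the SD of the means should be noticeably smaller for n = 8 than for n = 4.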
Sample variance and standard deviation
Using the samples of n = 4, we tried
different measures of scatter:
- deviation for each observation in the sample is the difference from the mean, x - x̄; these are difficult to summarize as their mean is always zero. My values were -6.25, 40.75, -166.25 and 131.75.
- absolute deviation is the same, but ignoring the sign and treating all as positive; this gave us the mean absolute deviation, often just called the mean deviation. My value was 86.25.
- a common way of dealing with these problems in statistics is to square the values, average them, then take the square root: we tried the root-mean-square (RMS) deviation. My value for this was 108.05.
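As a check on the arithmetic, the three measures above can be computed for the four values quoted in the text. This is a minimal Python sketch; the variable names are illustrative only.

```python
import math

sample = [1100, 1147, 940, 1238]  # the four weights quoted in the text
n = len(sample)
x_bar = sum(sample) / n

# Deviation of each observation from the sample mean
deviations = [x - x_bar for x in sample]

# Mean absolute deviation: ignore the signs, then average
mean_abs_dev = sum(abs(d) for d in deviations) / n

# Root-mean-square deviation: square, average, take the square root
rms_dev = math.sqrt(sum(d ** 2 for d in deviations) / n)

print(x_bar)              # 1106.25
print(deviations)         # [-6.25, 40.75, -166.25, 131.75]
print(sum(deviations))    # 0.0
print(mean_abs_dev)       # 86.25
print(round(rms_dev, 2))  # 108.05
```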
We
calculated the RMS deviation for the population and for our
samples, and found that the sample values were lower than
the value for the population (145.74), as shown in the histogram on
the right, where the true value is shown by the red line and
the mean of the sample values by the blue line. RMS
deviation for the sample is not a good estimator for the
population value: it is
biased
low.
The values are too small because they're
calculated from x̄, and x̄
itself is calculated to be bang in the middle of the sample
values! So we tried doing the calculation with
x - μ instead of x - x̄;
the results were consistently higher (the sample x's
are closer to x̄ than to μ) and very close to
the true value for the population.
For this I got
150.49, close to the true value.
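The bias can be seen in a simulation. The Python sketch below uses a hypothetical population (generated with an assumed mean of 1200 and SD of 150, not the workshop data) and works on the squared scale, comparing the average mean-square deviation about x̄ and about μ with the population variance.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical population of 100 weights (NOT the workshop data).
population = [random.gauss(1200, 150) for _ in range(100)]
mu = statistics.mean(population)
sigma2 = statistics.pvariance(population)  # population variance, divisor N

n = 4
ms_about_xbar = []  # mean-square deviation measured from the sample mean
ms_about_mu = []    # mean-square deviation measured from the true mean
for _ in range(20000):
    sample = random.sample(population, n)
    x_bar = sum(sample) / n
    ms_about_xbar.append(sum((x - x_bar) ** 2 for x in sample) / n)
    ms_about_mu.append(sum((x - mu) ** 2 for x in sample) / n)

ratio_xbar = statistics.fmean(ms_about_xbar) / sigma2
ratio_mu = statistics.fmean(ms_about_mu) / sigma2

print(f"average about x-bar / sigma^2 = {ratio_xbar:.2f}")
print(f"average about mu    / sigma^2 = {ratio_mu:.2f}")
```

The first ratio should come out well below 1 (roughly three quarters for n = 4), while the second should be close to 1.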
But we rarely know the true
mean, μ; how do we calculate scatter without it? The key is
the concept of degrees of freedom. Although we have
four values for x, the
deviations really only contain three bits of information, as
they always add up to zero; the missing 'bit of information'
has been hijacked by x̄.
So it would be more honest to divide the squared deviations
by 3 instead of 4 when calculating the degree of scatter.
The standard calculation for the
standard deviation
of a sample, s, is

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

In general, if you calculate a value from n
observations and k parameters which have already been
calculated from the same n observations, the degrees
of freedom = n - k. (Note that the RMS value,
dividing by n not n - 1, is correct for
population standard deviation,
σ ("sigma").)
- standard deviations (SD) of the samples using n - 1 degrees of freedom instead of n cluster around the true value for the population, σ. So s is an (almost) unbiased estimator of σ, σ̂ = s. (In fact, s is still slightly biased low, but not enough to cause problems.) My result was 124.76, not as good as the value using μ, but better than when dividing by n.
- variance is the square of the standard deviation, symbolized by s² (for samples) or σ² (for population parameters). My result: 15,566.
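For the four values quoted in the text, the n - 1 and n versions can be compared using Python's standard `statistics` module, where `stdev` and `variance` divide by n - 1 and `pstdev` divides by n. A minimal sketch:

```python
import statistics

sample = [1100, 1147, 940, 1238]  # the four weights quoted in the text

s = statistics.stdev(sample)      # sample SD, divides by n - 1
s2 = statistics.variance(sample)  # sample variance, s squared
rms = statistics.pstdev(sample)   # RMS deviation, divides by n

print(round(s, 2))    # 124.76
print(round(s2))      # 15566
print(round(rms, 2))  # 108.05
```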
Variance and standard error of the sample mean
Now let's go back to the histogram of the
sample means and the problem of measuring their scatter. If
we had a large number of samples, we could calculate the SD
in the same way as we did for the observations within a
sample. But we usually have only one sample. Statistical
theory comes to the rescue in most cases, as it can be shown
that the SD of the sampling distribution of the mean can be
derived from the SD of the population sampled (in practice,
s) and the sample size:

$$ \mathrm{SE} = \frac{s}{\sqrt{n}} $$

My sample gave the value of 62.38.
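The value quoted follows from dividing the sample SD by the square root of the sample size. A minimal Python check using the four values from the text:

```python
import math
import statistics

sample = [1100, 1147, 940, 1238]  # the four weights quoted in the text

s = statistics.stdev(sample)      # sample SD with n - 1 degrees of freedom
se = s / math.sqrt(len(sample))   # SD of the sampling distribution of the mean

print(round(se, 2))  # 62.38
```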
"In most cases" includes cases where the variable follows a
normal (bell-shaped) curve, or sample size is large, or
intermediate cases with an approximately normal curve and
medium-sized samples. See the
demonstration in R of
the Central Limit Theorem to get a feel for this.
The SD of a sampling distribution gets a special name:
standard error
(SE). If you repeated the sampling
and calculations many times, the true value of the mean
would lie between x̄ - SE and x̄
+ SE about 2/3 of the time; for small samples, that goes
down, and for n = 4 it's only about 60%.
I would state
my estimate of the squirrel mean as 1106.25 ± 62.38.
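The coverage figures can be checked by simulation. The Python sketch below assumes a normally distributed population with a hypothetical mean of 1200 and SD of 150 (not the workshop data), and counts how often the true mean falls within x̄ ± SE for n = 4 and, for comparison, n = 30.

```python
import math
import random

random.seed(3)  # fixed seed so the run is repeatable

MU, SIGMA = 1200.0, 150.0  # hypothetical normal population, NOT the workshop data


def coverage(n, reps=20000):
    """Fraction of samples for which MU lies within x_bar +/- SE."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(MU, SIGMA) for _ in range(n)]
        x_bar = sum(sample) / n
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        se = s / math.sqrt(n)
        if abs(x_bar - MU) <= se:
            hits += 1
    return hits / reps


cov_4 = coverage(4)
cov_30 = coverage(30)

print(f"coverage of x-bar +/- SE, n = 4:  {cov_4:.3f}")
print(f"coverage of x-bar +/- SE, n = 30: {cov_30:.3f}")
```

Under these assumptions the n = 4 figure should come out near 0.6 and the n = 30 figure near 2/3.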
Finally, the coefficient of variation (CV) of an
estimate is the ratio between the SE and the actual
estimate, often expressed as a percentage.
For my sample
this came to 62.38/1106.25 = 0.056 or 5.6%, so I could give
the result as 1106.25 ± 5.6%.
The CV is useful for
comparing uncertainty in estimates: for example, when adding
estimates of the number of animals in different zones of a
national park.
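The CV for the worked example is a single division; the Python sketch below simply reuses the estimate and SE quoted in the text.

```python
estimate = 1106.25  # sample mean quoted in the text
se = 62.38          # standard error quoted in the text

cv = se / estimate  # coefficient of variation of the estimate

print(f"{cv:.3f}")  # 0.056
print(f"{cv:.1%}")  # 5.6%
```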
What we have covered
- The sample mean is a good estimator of
the mean of the population.
- The sample variance and standard deviation
(when calculated with n - 1 degrees of freedom)
are good estimators of population parameters.
- The sample means themselves come from a distribution
- the sampling distribution of the mean - with
its own mean and standard deviation (SD).
- The SD of the sampling distribution indicates the
precision of the estimated parameter, and is given the
special name 'standard error' (SE).
... plus the following terminology and widely-used symbols:
- the terms mean, error, deviation, variance, standard deviation (SD), standard error
(SE), coefficient of variation (CV), and expected value (E(...));
- the use of Greek letters (μ, σ) for true population parameters (which we don’t
know in reality) and the “hat” notation (μ̂, σ̂) for estimates of the parameters
based on samples;
- the “bar” notation for sample means (e.g. x̄), and the use of
n (sample size), s (sample SD) and SE (standard error of the sample mean).
What next?
Confidence intervals are the best way to summarize
uncertainty about an estimate: to calculate these for small
samples (n < 20) you need to use
Student's t-distribution.