############################################################################## # A Note on using R scripts: Start R, select "File > Open script..." from # # the pull-down menus, and browse to this file. You can now place the cursor # # in a line and press Ctrl-R to run the line. (Or select "Edit > Run line # # or selection" from the pull-down menus, or right-click and select "Run # # line or selection" from the context menu.) # ############################################################################## # EXPLORING THE DIFFERENCE BETWEEN TWO SETS OF DATA # This script uses randomization tests in R to see if the difference between # two sets of data are likely to be due to chance or if they reflect real # differences. # See Manly, B F J 1997. Randomization, bootstrap and Monte Carlo methods # in biology. Chapman and Hall / CRC pages 4-8 for details. # THE DATA # The data are the lengths in mm of the jaws of 10 male and 10 female golden # jackals from specimens in the Natural History Museum: jaw.m <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112) jaw.f <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111) # Use a box plot to compare the two: boxplot(jaw.m, jaw.f, names=c("male","female")) # It looks like the male jaws are longer. Check the means, and store the # difference in the object 'jaw.diff': mean(jaw.m) mean(jaw.f) (jaw.diff <- mean(jaw.m) - mean(jaw.f)) # The mean of 'jaw.m' is 4.8 mm bigger. We want to test the null hypothesis # that the difference between 'jaws.m' and 'jaws.f' is really only due to # chance. # USING A T-TEST # We can easily do a t-test in R: t.test(jaw.m, jaw.f) # It works fine and gives a P-value of 0.00336. The problem is that the t-test # is based on the assumption that the data are drawn at random from a normally- # distributed population and we can’t assume that for our museum specimens. # A RANDOMIZATION TEST # If the difference between 'jaws.m' and 'jaws.f' is really only due to # chance, we could shuffle the 20 jaws, pull out 10 at random, compare these # 10 and the 10 left behind, and get a similar difference in means as for # males vs females. What's the probability of getting a value > 4.8mm just # by chance? # Combine the vectors 'jaw.m' and 'jaw.f' into a new vector 'jaws': jaws <- c(jaw.m, jaw.f) # We will use the function 'sample' to generate 10 numbers at random # between 1 and 20 – try it out first: sample(20,10) # Run that line several times and you’ll get a new set of 10 random numbers # each time you do it. # Assign a set of 10 random numbers to the variable 'samp': samp <- sample(20,10) # We use 'samp' to pull out 10 elements from the 'jaws' vector: jaws[samp] # the 10 elements indexed by the 10 random numbers in 'samp', jaws[-samp] # the other 10 elements, those not indexed by 'samp'. # The difference in means between the 10 measurements indexed by samp and # the other 10 is: mean(jaws[samp]) - mean(jaws[-samp]) # Now we’re going to do that 999 times and see how many times we get a # difference in means which is bigger than or equal to 4.8 mm. # If the null hypothesis is true, the males vs females split is random too, so # we include this in the set as number 1 (ie. as 'jaw.diff[1]'). for(index in 2:1000) { samp <- sample(20,10) jaw.diff[index] <- mean(jaws[samp]) - mean(jaws[-samp]) } ############################################################################## # An explanation of loops: the command "for(index in 2:1000) {...}" tells R # # to start with index = 2 and then do everything between the {curly braces}; # # so a sample is drawn and 'jaw.diff[2]' gets the difference in means. Then # # R sets index = 3 and does {...} again; a new sample is drawn and # # 'jaw.diff[3]' gets the difference in means of this one. This continues for # # index = 4, 5, ... up 1000. # ############################################################################## # Draw a histogram with these values: hist(jaw.diff) # The exact shape depends on the actual random numbers drawn, but you should # get a roughly bell-shaped histogram centered around zero and with most # values between -2 and +2. Draw a vertical blue line at +4.8 : abline(v=4.8, col='blue') # Now see how many of the random values are > or = the observed difference: sum(jaw.diff >= 4.8) # The "jaw.diff >= 4.8" bit checks each element of 'jaw.diff', producing TRUE # if it is >= 4.8 and FALSE otherwise. The "sum()" bit adds it up, treating # TRUE as 1 and FALSE as 0, telling you how many TRUEs there are altogether. # The answer will vary somewhat each time you run this. I got 3, ie. only # 3 times out of 1000 was the difference in means as big as the actual value # for 'jaw.m' and 'jaw.f'. The probability that we’d get a difference of # 4.8mm just by dividing the 20 jaws into two piles at random (ie. the # p-value) is: sum(jaw.diff >= 4.8) / 1000 # CONCLUSION # There is strong support for the idea that the male jaws in the museum # collection are longer than the female jaws, and this is not just chance. # But we can't simply infer that in general male jackals have longer jaws # than females, since the specimens measured are not a proper random sample # of a larger population of jackals and may not be representative. # Mike Meredith, updated 18 Sep 2007