Using
spreadsheet packages
   stats main page

 

Spreadsheets are good tools for entering and tabulating raw data. However, you do need to be careful:

They can also be useful to manipulate data - calculating totals or counts, regrouping data with pivot tables, and for preliminary data exploration - scatterplots, simple graphs, means and SDs. Usually, you will need to export your data to a purpose-built statistical package for final analysis, because spreadsheet packages offer only a limited range of statistical options, and some of those are flawed.

In some of these workshops, we have used MS Excel® for more sophisticated analysis, including likelihood calculations and model fitting. The results for these examples have been checked against other software and are sufficiently accurate. However, we do not recommend spreadsheets for serious statistical analysis of wildlife data.

The bottom line is that statistical analysis results from spreadsheet programs should not be trusted until they have been corroborated by a purpose-built statistical package.


Date formats in spreadsheets

The following background applies to MS Excel® and Calc (OpenOffice.org), and probably to other spreadsheet packages too, but I haven't experimented with all of them. The two rules below apply to all packages.

Dates are stored internally in Excel and Calc as integer numbers. The default on Windows platforms has Day 61 = 1 March 1900, so 30 April 2007 was Day 39202.

(In Excel, Day 1 = 1 Jan 1900 and Day 60 corresponds to non-existent 29 Feb 1900: non-existent because 1900 was not a leap year. In Calc, Day 1 = 31 Dec 1899 and Day 60 = 28 Feb 1900. To add to the confusion, Mac versions of Excel have Day 1 = 2 Jan 1904.)

What is displayed - and how the program interprets what you type in - depends on the format setting of the cell.

If you don't select a specific format, the default is used: in Excel that's 'General' and in Calc, 'Number > General'. In either case, the spreadsheet package will try to guess whether you mean a number, or a date, or text.

  • If you type '1/5', it is interpreted as a date, either 1 May or January 5th in the current year depending on the 'Locale' setting (in Excel) or the 'Language' setting (in Calc). It will be stored as a Day number, 39203 or 39087.
     
  • '30/4' or '4/30' will be interpreted either as 30 April 2007 (Day 39202) or as text, or possibly even as 1 April or 31 March 1930!

In addition, if the display format is  '--/--/----', you may end up with a mixture of text and dates, and even a mixture of dd/mm/yyyy and mm/dd/yyyy formats in the same column.

You may not see the difference, but you will get unexpected results if you:

  • sort by date (1/5/2007 before 30/4/2007 ?);
     
  • calculate the difference between dates, eg. the time (in days) between two trapping sessions (116 days between 30/4/2007 and 1/5/2007 ?);
     
  • pass the file to someone whose locale is set differently (eg. yours is USA, theirs is UK); the entries stored as dates will be switched around (1/5/2007 becoming 5/1/2007), but those stored as text will not. Again the upshot is a mixture of dd/mm/yyyy and mm/dd/yyyy formats in the same column.

So ...

Two rules when entering dates in spreadsheets:

  1. Before you start entering data, always format all the cells in the date column as 'Date' with an unambiguous display format such as '1 May 2007' (or 'May 1, 2007' if you wish) or '2007-05-01'.

    (In Excel: Click on the letter at the top of the column to select the whole column, then go to Format > Cell... and the Number tab, then select 'Date' and your chosen format. Calc is similar. This will not prevent you putting in text as column headings, etc.)
     
  2. Type in dates in a short format, not the same format they are displayed: eg. type '1/5' or '1/5/07', not '1 May 2007', then check that the program has interpreted it correctly. This will also catch mistakes such as typing '31/4'.

Exporting data to .TXT or .CSV files

Numbers

Spreadsheets allow you to adjust the number of decimal places displayed in cells. For example, typing '3.14159265358979' in Excel results in '3.141592654' appearing, and in Calc, '3.14'; the final digits are not displayed. They are there, however, and appear if you increase the number of decimal places shown (go to Format > Cell..., the Numbers tab, select 'Numbers' and then Decimal Places:).

If you save the file in .XLS or .ODS format, the full precision will be retained, whatever the number of decimal places displayed. However...

When you save in .CSV or .TXT format, the full precision number is not saved, and the number in the file will be as displayed in Excel or Calc.

In general, you will want to export data with the maximum number of decimal places; eg. if you weighed an animal to the nearest gram and recorded 4.356 kg, you need to export 4.356 to your statistics package for analysis, not 4.36 or 4.4 (and definitely not 4.35 or 4.3). So...

Ensure that numbers are displayed with all decimal places before saving as .CSV or .TXT format.

Dates

Dates are also saved in .TXT or .CSV format as displayed, not in the spreadsheet package's internal 'Day number' format (see above for details).

Avoid the --/--/---- format, as nothing will indicate whether this is dd/mm/yyyy or mm/dd/yyyy - or even a mixture in the same column! The yyyy-mm-dd format (eg. 2007-04-30) is unambiguous and can be interpreted by most software packages.

Text

Your spreadsheet package may allow you to put long passages of text into cells. But many statistical packages will have trouble reading files with fields more than 255 characters in length. So keep text items short.

Avoid mixing text and numbers in the same column, as some packages will then treat the whole column as text or a categorical variable; so don't write "escaped" in the weights column if a trapped animal escaped before being weighed. Leave the cell in the 'Weights' column blank and put the note in a separate 'Comments' column. Similarly, don't use '>50' or '5?' or '25cm' in columns of numbers.

Keep column headings short. Relevant metadata can go in a separate row at the top of the file. Eg. head the column "SV length" and put "SV length = snout-vent length measured in cm on sedated animal with Acme SnakeOmeter Mk II" in a row above the table. Only use the cells in the first column for this metadata. If your statistical package can't cope with initial rows of non-data, you can later 'cut' them in a text editor such as Notepad and 'paste' into a separate file.


Statistical calculations

Precision

Remember that spreadsheets are not designed for the analysis of wildlife data. In business applications, calculation to 2 or 3 decimal places is quite adequate, and if you're working with millions of dollars you probably don't care about the cents.

So be wary if you are working with very large or very small numbers, or if you are interested in small differences between large numbers. This often happens in intermediate calculations, rather than actual data, eg. you calculate the area of the USA in square meters, or the proportion of the Sundaland population of geometrid moths included in your sample, or the area of your park including vs excluding water bodies.

Statistical functions and add-ins

Most statistical functions in spreadsheets give correct and adequately precise answers for most data sets. Some data sets, however, cause problems, the most common of which are:

  • Loss of precision in intermediate results. This affects in particular the calculation of variance (and hence SD, SE, confidence intervals and significance) in Excel.
     
  • Poor precision in modeling the tails of distributions, affecting significance tests, (and sometimes even the use of inappropriate distributions, eg. the normal distribution when a t-distribution would be appropriate).
     
  • Inconsistent (and often inappropriate) treatment of missing values in data sets; eg. a blank cell or '-' may be ignored (eg. by AVERAGE() in Excel), be equated to zero (eg. for arithmetical operations) or may generate an error. For more complex operations, eg. a t-test, it may not be clear what is happening.

Spreadsheet functions will often produce an erroneous result in such cases, when a statistical package would produce an error message or at least a warning.

Third-party add-ins are often better that the functions included in spreadsheet packages.

You can download a spreadsheet with examples here and check the performance of your own spreadsheet package and add-ins. More details of these problems are in papers by Pottel, Simonof, and Cryer.

wcsmalaysia.org home

Text by Mike Meredith, updated 17 May 2007