![]() ![]() ![]() |
Using spreadsheet packages |
|
| |
|
|
|
Spreadsheets are good tools for entering and tabulating raw data. However, you do need to be careful:
They can also be useful to manipulate data - calculating totals or counts, regrouping data with pivot tables, and for preliminary data exploration - scatterplots, simple graphs, means and SDs. Usually, you will need to export your data to a purpose-built statistical package for final analysis, because spreadsheet packages offer only a limited range of statistical options, and some of those are flawed. In some of these workshops, we have used MS Excel® for more sophisticated analysis, including likelihood calculations and model fitting. The results for these examples have been checked against other software and are sufficiently accurate. However, we do not recommend spreadsheets for serious statistical analysis of wildlife data. The bottom line is that statistical analysis results from spreadsheet programs should not be trusted until they have been corroborated by a purpose-built statistical package. Date formats in spreadsheetsThe following background applies to MS Excel® and Calc (OpenOffice.org), and probably to other spreadsheet packages too, but I haven't experimented with all of them. The two rules below apply to all packages. Dates are stored internally in Excel and Calc as integer numbers. The default on Windows platforms has Day 61 = 1 March 1900, so 30 April 2007 was Day 39202. (In Excel, Day 1 = 1 Jan 1900 and Day 60 corresponds to non-existent 29 Feb 1900: non-existent because 1900 was not a leap year. In Calc, Day 1 = 31 Dec 1899 and Day 60 = 28 Feb 1900. To add to the confusion, Mac versions of Excel have Day 1 = 2 Jan 1904.) What is displayed - and how the program interprets what you type in - depends on the format setting of the cell. If you don't select a specific format, the default is used: in Excel that's 'General' and in Calc, 'Number > General'. In either case, the spreadsheet package will try to guess whether you mean a number, or a date, or text.
In addition, if the display format is '--/--/----', you may end up with a mixture of text and dates, and even a mixture of dd/mm/yyyy and mm/dd/yyyy formats in the same column. You may not see the difference, but you will get unexpected results if you:
So ... Two rules when entering dates in spreadsheets:
Exporting data to .TXT or .CSV filesNumbersSpreadsheets allow you to adjust the number of decimal places displayed in cells. For example, typing '3.14159265358979' in Excel results in '3.141592654' appearing, and in Calc, '3.14'; the final digits are not displayed. They are there, however, and appear if you increase the number of decimal places shown (go to Format > Cell..., the Numbers tab, select 'Numbers' and then Decimal Places:). If you save the file in .XLS or .ODS format, the full precision will be retained, whatever the number of decimal places displayed. However... When you save in .CSV or .TXT format, the full precision number is not saved, and the number in the file will be as displayed in Excel or Calc. In general, you will want to export data with the maximum number of decimal places; eg. if you weighed an animal to the nearest gram and recorded 4.356 kg, you need to export 4.356 to your statistics package for analysis, not 4.36 or 4.4 (and definitely not 4.35 or 4.3). So... Ensure that numbers are displayed with all decimal places before saving as .CSV or .TXT format. DatesDates are also saved in .TXT or .CSV format as displayed, not in the spreadsheet package's internal 'Day number' format (see above for details). Avoid the --/--/---- format, as nothing will indicate whether this is dd/mm/yyyy or mm/dd/yyyy - or even a mixture in the same column! The yyyy-mm-dd format (eg. 2007-04-30) is unambiguous and can be interpreted by most software packages. TextYour spreadsheet package may allow you to put long passages of text into cells. But many statistical packages will have trouble reading files with fields more than 255 characters in length. So keep text items short. Avoid mixing text and numbers in the same column, as some packages will then treat the whole column as text or a categorical variable; so don't write "escaped" in the weights column if a trapped animal escaped before being weighed. Leave the cell in the 'Weights' column blank and put the note in a separate 'Comments' column. Similarly, don't use '>50' or '5?' or '25cm' in columns of numbers. Keep column headings short. Relevant metadata can go in a separate row at the top of the file. Eg. head the column "SV length" and put "SV length = snout-vent length measured in cm on sedated animal with Acme SnakeOmeter Mk II" in a row above the table. Only use the cells in the first column for this metadata. If your statistical package can't cope with initial rows of non-data, you can later 'cut' them in a text editor such as Notepad and 'paste' into a separate file. Statistical calculationsPrecisionRemember that spreadsheets are not designed for the analysis of wildlife data. In business applications, calculation to 2 or 3 decimal places is quite adequate, and if you're working with millions of dollars you probably don't care about the cents. So be wary if you are working with very large or very small numbers, or if you are interested in small differences between large numbers. This often happens in intermediate calculations, rather than actual data, eg. you calculate the area of the USA in square meters, or the proportion of the Sundaland population of geometrid moths included in your sample, or the area of your park including vs excluding water bodies. Statistical functions and add-insMost statistical functions in spreadsheets give correct and adequately precise answers for most data sets. Some data sets, however, cause problems, the most common of which are:
Spreadsheet functions will often produce an erroneous result in such cases, when a statistical package would produce an error message or at least a warning. Third-party add-ins are often better that the functions included in spreadsheet packages. You can download a spreadsheet with examples here and check the performance of your own spreadsheet package and add-ins. More details of these problems are in papers by Pottel, Simonof, and Cryer. |
||
|
|
Text by Mike Meredith, updated 17 May 2007 |
|
![]() ![]() |
||