Subsections


11.10 Writing Functions

It is quite straightforward to create new functions in R.

In this section, we will look at why it is useful and sometimes necessary to write our own functions.

x

11.10.1 Case Study: The Data Expo (continued)

The data for the 2006 JSM Data Expo (Section 7.5.6) were obtained from NASA's Live Access Server as a set of 505 text files (see Section 1.1).

Seventy-two of those files contain near-surface air temperature measurements, with one file for each month of recordings. Each file contains average temperatures for the relevant month at 576 different locations. Figure 11.15 shows the first few lines of the temperature file for the first month, January 1995.

Figure 11.15: The first few lines of output from the Live Access Server for the near-surface air temperature of the Earth for January 1995, over a coarse 24 by 24 grid of locations covering central America.
 

             VARIABLE : Mean Near-surface air temperature (kelvin)
             FILENAME : ISCCPMonthly_avg.nc
             FILEPATH : /usr/local/fer_data/data/
             SUBSET   : 24 by 24 points (LONGITUDE-LATITUDE)
             TIME     : 16-JAN-1995 00:00
              113.8W 111.2W 108.8W 106.2W 103.8W 101.2W 98.8W  ...
               27     28     29     30     31     32     33    ...
 36.2N / 51:  272.1  270.3  270.3  270.9  271.5  275.6  278.4  ...
 33.8N / 50:  282.2  282.2  272.7  272.7  271.5  280.0  281.6  ...
 31.2N / 49:  285.2  285.2  276.1  275.0  278.9  281.6  283.7  ...
 28.8N / 48:  290.7  286.8  286.8  276.7  277.3  283.2  287.3  ...
 26.2N / 47:  292.7  293.6  284.2  284.2  279.5  281.1  289.3  ...
 23.8N / 46:  293.6  295.0  295.5  282.7  282.7  281.6  285.2  ...
 ...

The complete set of 72 file names for the files containing temperature recordings can be obtained by the following code (only the first fifteen file names are shown):

> nasaAirTempFiles <- list.files(file.path("NASA", "Files"), 
                                  pattern="^temperature",
                                  full.names=TRUE)
> head(nasaAirTempFiles, n=15)

 [1] "NASA/Files/temperature10.txt"
 [2] "NASA/Files/temperature11.txt"
 [3] "NASA/Files/temperature12.txt"
 [4] "NASA/Files/temperature13.txt"
 [5] "NASA/Files/temperature14.txt"
 [6] "NASA/Files/temperature15.txt"
 [7] "NASA/Files/temperature16.txt"
 [8] "NASA/Files/temperature17.txt"
 [9] "NASA/Files/temperature18.txt"
[10] "NASA/Files/temperature19.txt"
[11] "NASA/Files/temperature1.txt" 
[12] "NASA/Files/temperature20.txt"
[13] "NASA/Files/temperature21.txt"
[14] "NASA/Files/temperature22.txt"
[15] "NASA/Files/temperature23.txt"

This code assumes that the files are stored in a local directory NASA/Files. Notice also that we are using a regular expression to specify the pattern of the filenames that we want; we have selected all file names that start with temperature.11.23

The file names are in an awkward order because of the default alphabetical ordering of the names;11.24 the data for the first month are in the file temperature1.txt, but this is the eleventh file name. We can ignore this problem for now, but we will come back to it later.

We will conduct a simple task with these data: calculating the near-surface air temperature for each month, averaged over all locations. In other words, we will calculate a single average temperature from each file.

The following code reads in the data using the first file name, temperature10.txt, and averages all of the temperatures in the file.

> mean(as.matrix(read.fwf(nasaAirTempFiles[1], 
                           skip=7, 
                           widths=c(-12, rep(7, 24)))))

[1] 298.0564

The call to read.fwf() ignores the first 7 lines of the file and the first 12 characters on the remaining lines of the file. This just leaves the temperature values, which are converted into a matrix so that the mean() function can calculate the average across all of the values.

We want to perform this calculation for each of the air temperature files. Conceptually, we want to perform a loop, once for each file name. Indeed, we can write code to perform the task with an explicit loop.

> avgTemp <- numeric(72)
> for (i in 1:72) {
       avgTemp[i] <- 
           mean(as.matrix(read.fwf(nasaAirTempFiles[i], 
                                   skip=7, 
                                   widths=c(-12, 
                                     rep(7, 24)))))
   }
> avgTemp

 [1] 298.0564 296.7019 295.9568 295.3915 296.1486 296.1087
 [7] 297.1007 298.3694 298.1970 298.4031 295.1849 298.0682
[13] 298.3148 297.3823 296.1304 295.5917 295.5562 295.6438
[19] 296.8922 297.0823 298.4793 295.3175 299.3575 299.7984
[25] 299.7314 299.6090 298.4970 297.9872 296.8453 296.9569
[31] 296.9354 297.0240 296.3335 298.0668 299.1821 300.7290
[37] 300.6998 300.3715 300.1036 299.2269 297.8642 297.2729
[43] 296.8823 296.9587 297.4288 297.5762 298.2859 299.1076
[49] 299.1938 299.0599 299.5424 298.9135 298.2849 297.0981
[55] 297.7286 296.2639 296.1943 296.5868 297.5510 298.6106
[61] 299.7425 299.5219 299.7422 300.3411 299.5781 298.5809
[67] 298.6965 297.0830 296.3813 299.1863 299.0660 298.4634

However, as shown in Section 11.7.4, it is more natural in R to use a function like sapply() to perform this sort of task.

The sapply() function calls a given function for each element of a list. It is not difficult to coerce the vector of file names into a list; in fact, sapply() can do that coercion automatically. The problem in this case is that there is no existing function to perform the task that we need to perform for each file name, namely: read in the file, discard all but the temperature values, generate a matrix of the values, and calculate an overall mean.

Fortunately, it is very easy to define such a function. Here is code that defines a function called monthAvg() to perform this task.

> monthAvg <- function(filename) {
       mean(as.matrix(read.fwf(filename, 
                               skip=7, 
                               widths=c(-12, rep(7, 24)))))
   }
The object monthAvg is a function object that can be used like any other R function. This function has a single argument called filename. When the monthAvg() function is called, the filename argument must be specified and, within the function, the symbol filename contains the value specified as the first argument in the function call. The value returned by a function is the value of the last expression within the function. In this case, the return value is the result of the call to the mean() function.

As a simple example, the following two pieces of code produce exactly the same result; they both calculate the overall average temperature from the contents of the file temperature10.txt.

> mean(as.matrix(read.fwf(nasaAirTempFiles[1], 
                           skip=7, 
                           widths=c(-12, rep(7, 24)))))

> monthAvg(nasaAirTempFiles[1])

[1] 298.0564

With this function defined, it is now possible to calculate the monthly averages from all of the temperature files using sapply(). We supply the vector of file names and sapply() automatically converts this to a list of file names. We also supply our new monthAvg() function and sapply() calls that function for each file name. We specify USE.NAMES=FALSE because otherwise each element of the result would have the corresponding filename as a label.

> sapply(nasaAirTempFiles, monthAvg, USE.NAMES=FALSE)

 [1] 298.0564 296.7019 295.9568 295.3915 296.1486 296.1087
 [7] 297.1007 298.3694 298.1970 298.4031 295.1849 298.0682
[13] 298.3148 297.3823 296.1304 295.5917 295.5562 295.6438
[19] 296.8922 297.0823 298.4793 295.3175 299.3575 299.7984
[25] 299.7314 299.6090 298.4970 297.9872 296.8453 296.9569
[31] 296.9354 297.0240 296.3335 298.0668 299.1821 300.7290
[37] 300.6998 300.3715 300.1036 299.2269 297.8642 297.2729
[43] 296.8823 296.9587 297.4288 297.5762 298.2859 299.1076
[49] 299.1938 299.0599 299.5424 298.9135 298.2849 297.0981
[55] 297.7286 296.2639 296.1943 296.5868 297.5510 298.6106
[61] 299.7425 299.5219 299.7422 300.3411 299.5781 298.5809
[67] 298.6965 297.0830 296.3813 299.1863 299.0660 298.4634

Unfortunately, these results are not actually in chronological order. Recall that the file names are ordered alphabetically, so the answer for the first month is actually the eleventh average in the result above. We will now develop a different function so that we can get the averages in the right order.

The idea of this new function is that it takes an integer as its argument and it calculates a file name based on that integer. In this way, we can control the order of the file names by controlling the order of the integers. The new function is called ithAvg().

> ithAvg <- function(i=1) {
       monthAvg(file.path("NASA", "Files",
                          paste("temperature", i,
                                ".txt", sep="")))
   }
This function has a single argument called i. The value of this argument is combined with the path to the file to produce a complete file name. This file name is then passed to the monthAvg() function.

The argument i is optional because a default value, 1, is provided. If the function is called with no arguments, then it will calculate the average temperature for the file temperature1.txt.

> ithAvg()

[1] 295.1849

With this function defined, we can calculate the monthly average temperatures in a chronological order.

> sapply(1:72, ithAvg, USE.NAMES=FALSE)

 [1] 295.1849 295.3175 296.3335 296.9587 297.7286 298.5809
 [7] 299.1863 299.0660 298.4634 298.0564 296.7019 295.9568
[13] 295.3915 296.1486 296.1087 297.1007 298.3694 298.1970
[19] 298.4031 298.0682 298.3148 297.3823 296.1304 295.5917
[25] 295.5562 295.6438 296.8922 297.0823 298.4793 299.3575
[31] 299.7984 299.7314 299.6090 298.4970 297.9872 296.8453
[37] 296.9569 296.9354 297.0240 298.0668 299.1821 300.7290
[43] 300.6998 300.3715 300.1036 299.2269 297.8642 297.2729
[49] 296.8823 297.4288 297.5762 298.2859 299.1076 299.1938
[55] 299.0599 299.5424 298.9135 298.2849 297.0981 296.2639
[61] 296.1943 296.5868 297.5510 298.6106 299.7425 299.5219
[67] 299.7422 300.3411 299.5781 298.6965 297.0830 296.3813

11.10.2 Flashback: The DRY Principle

This example demonstrates that it is useful to be able to define our own functions for use with functions like apply(), lapply(), and sapply(). However, there are many other good reasons for being able to write functions. In particular, functions are useful for organising code, simplifying code, and for making it easier to maintain code.

For example, with the monthAvg() and ithAvg() functions defined, the for loop solution to our task11.25 becomes much simpler and easier to read.

> avgTemp <- numeric(72)
> for (i in 1:72) {
       avgTemp[i] <- ithAvg(i)
   }
This is an example of just making our code tidier, which is just an extension of the ideas of laying out and documenting code for the benefit of human readers. A further advantage that we obtain from writing functions is the ability to safely and efficiently reuse our code.

The ithAvg() function demonstrates the idea of code reuse. Here is the function definition again:

> ithAvg <- function(i=1) {
       monthAvg(file.path("NASA", "Files",
                          paste("temperature", i,
                                ".txt", sep="")))
   }
The important feature of this function is that it calls our other function monthAvg(). By way of contrast, consider the following alternative way that we could define ithAvg().

> ithAvgBad <- function(i=1) {
       mean(
         as.matrix(
           read.fwf(file.path("NASA", "Files",
                              paste("temperature", i,
                                    ".txt", sep="")), 
                    skip=7, 
                    widths=c(-12, rep(7, 24)))))
   }
What is wrong with this function? The problem is that it repeats almost all of the code that is already in monthAvg() and this leads to a number of familiar issues: there is obvious inefficiency because it is wasteful to type all of that code again; what is worse, the code is harder to maintain because there are two copies of the code--if any changes need to be made, they must now be made in two places; worse still, we are now vulnerable to making mistakes because we can change one copy of the code without changing the other copy and thereby end up with two functions that behave differently even though we think they are the same.

The merits of reusing our monthAvg() function in the ithAvg() function should now be clear. The existence of the monthAvg() function means that we only have one copy of the code used to read data from the Data Expo files and that leads to greater efficiency and better accuracy.

Paul Murrell

Creative Commons License
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.