11.7 Data manipulation

R provides many different functions for extracting subsets of data and for rearranging data into different formats.


11.7.1 Subsetting

11.7.2 Case study: Counting candy (continued)

Recall the candy data set that was first introduced in Section 11.5.1. The full candy data frame is reproduced below.

> candy

   shape pattern shade count
1  round pattern light     2
2   oval pattern light     0
3   long pattern light     3
4  round   plain light     1
5   oval   plain light     3
6   long   plain light     2
7  round pattern  dark     9
8   oval pattern  dark     0
9   long pattern  dark     2
10 round   plain  dark     1
11  oval   plain  dark    11
12  long   plain  dark     2

We have previously seen that we can get a single variable from a data frame using the $ operator. For example, the count variable can be obtained using candy$count. Another way to do the same thing is to use the [[ operator and specify the variable of interest as a string.

> candyCounts <- candy[["count"]]
> candyCounts

 [1]  2  0  3  1  3  2  9  0  2  1 11  2

The advantage of the [[ operator is that it allows a number or an expression as the index. For example, the count variable is the fourth variable, so this code also works.

> candy[[4]]

 [1]  2  0  3  1  3  2  9  0  2  1 11  2

The following code evaluates an expression that determines which of the variables in the data frame is numeric and then selects just that variable from the data frame.

> sapply(candy, is.numeric)

  shape pattern   shade   count 
  FALSE   FALSE   FALSE    TRUE

> which(sapply(candy, is.numeric))

count 
    4

> candy[[which(sapply(candy, is.numeric))]]

 [1]  2  0  3  1  3  2  9  0  2  1 11  2

When working with a simple vector of values, like candyCounts, we can extract subsets of the vector using the [ operator. For example, the following code produces the first three counts (light-shaded candies with a pattern).

> candyCounts[1:3]

[1] 2 0 3

The indices can be any integer sequence; they can include repetitions and even negative numbers (to exclude specific values). The following two examples produce counts for all candies with a pattern and then all counts except the count for round plain dark candies.

> candyCounts[c(1:3, 7:9)]

[1] 2 0 3 9 0 2

> candyCounts[-10]

 [1]  2  0  3  1  3  2  9  0  2 11  2

As well as using integers for indices, we can use logical values. For example, a better way to express the idea that we want the counts for all candies with a pattern is to use an expression like this:

> hasPattern <- candy$pattern == "pattern"
> hasPattern

 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[10] FALSE FALSE FALSE

This vector of logical values can be used as an index to return all of the counts where hasPattern is TRUE.

> candyCounts[hasPattern]

[1] 2 0 3 9 0 2

Better still would be to work with the entire data frame and retain the pattern with the counts, so that we can check that we have the correct result. A data frame can be indexed using the [ operator too, though slightly differently because we have to specify both which rows and which columns we want. Here are two examples which extract the pattern and count variables from the data frame for all candies with a pattern:

> candy[hasPattern, c(2, 4)]

> candy[hasPattern, c("pattern", "count")]

  pattern count
1 pattern     2
2 pattern     0
3 pattern     3
7 pattern     9
8 pattern     0
9 pattern     2

In both cases, the index is of the form [<rows>, <columns>], but the first example uses column numbers and the second example uses column names. The result is still a data frame, just a smaller one.

The function subset() provides another way to perform this sort of subsetting, with a subset argument for specifying the rows and a select argument for specifying the columns.

> subset(candy, subset=hasPattern, 
          select=c("pattern", "count"))

  pattern count
1 pattern     2
2 pattern     0
3 pattern     3
7 pattern     9
8 pattern     0
9 pattern     2

It is possible to leave the row or column index completely empty, in which case all rows or columns are returned. For example, this code extracts all variables for the light-shaded candies with a pattern:

> candy[1:3, ]

  shape pattern shade count
1 round pattern light     2
2  oval pattern light     0
3  long pattern light     3

It is also possible to extract all rows for a selection of variables in a data frame by just specifying a single index with the [ operator.

> candy["count"]

   count
1      2
2      0
3      3
4      1
5      3
6      2
7      9
8      0
9      2
10     1
11    11
12     2

This result, which is a data frame, is quite different to the result from using the [[ operator, which gives a vector.

> candy[["count"]]

 [1]  2  0  3  1  3  2  9  0  2  1 11  2

11.7.3 Accessor functions

Some R functions produce complex results (e.g., the result of a linear regression returned by lm()). These results are often returned as list objects, which makes it tempting to obtain various subsets of the results, e.g., the model coefficients from a linear regression, using the basic subsetting syntax. However, this is usually not the correct approach.

Most functions which produce complex results are accompanied by a set of other functions which perform the extraction of useful subsets of the result. For example, the coefficients of a linear regression analysis should be obtained using the coef() function.

The reason for this is so that the people who write a function that produces complex results can change the structure of the result without breaking code that extracts subsets of the result. This idea is known as encapsulation.
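
As a small sketch of this idea (using toy data that is not part of any of the case studies in this chapter), the following code fits a linear model and then extracts the model coefficients with the accessor function coef(), rather than by reaching into the internal structure of the result.

> x <- 1:10
> y <- 2 * x + rnorm(10)
> fit <- lm(y ~ x)
> coef(fit)           # accessor function; the recommended approach
> fit$coefficients    # depends on the internal structure of the result

If the structure of the object returned by lm() ever changed, code based on coef() would continue to work, whereas code that subsets the result directly might not.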


11.7.4 Aggregation and reshaping

It is often necessary to expand, reduce, or generally “reshape” a data set and R provides many functions for these sorts of operations.

11.7.5 Case study: Counting candy (continued)

Recall the candy data set that was first introduced in Section 11.5.1. The full candy data frame is reproduced below.

> candy

   shape pattern shade count
1  round pattern light     2
2   oval pattern light     0
3   long pattern light     3
4  round   plain light     1
5   oval   plain light     3
6   long   plain light     2
7  round pattern  dark     9
8   oval pattern  dark     0
9   long pattern  dark     2
10 round   plain  dark     1
11  oval   plain  dark    11
12  long   plain  dark     2

This is a somewhat abbreviated form for the data, recording only the number of occurrences of each possible combination of candy characteristics. Another way to represent the data is as a case per candy, with three variables giving the pattern, shade, and shape of each piece of candy (see Figure 11.12).

For example, the following code produces what we want for the shape variable.

> rep(candy$shape, candy$count)

 [1] round round long  long  long  round oval  oval  oval 
[10] long  long  round round round round round round round
[19] round round long  long  round oval  oval  oval  oval 
[28] oval  oval  oval  oval  oval  oval  oval  long  long 
Levels: round oval long

We could perform this operation on each variable and explicitly glue the new variables back together (Figure 11.12 shows the full case-per-candy form of the data set).

> candyCases <- 
       data.frame(shape=rep(candy$shape, candy$count),
                  pattern=rep(candy$pattern, candy$count),
                  shade=rep(candy$shade, candy$count))

This is a feasible approach for a small number of variables, but for larger data sets, and for pure elegance, R provides a more general approach via functions such as apply() and lapply().

The apply() function works on matrices or arrays. We can specify a function, FUN, and apply() will call that function for each column of a matrix. For this example, we can treat the first three variables of the candy data set as a matrix (of strings) with three columns, and apply the rep() function to each column (see Figure 11.12).

> candyCasesMatrix <- apply(candy[,1:3], 2, rep, candy$count)
> candyCases <- data.frame(candyCasesMatrix, row.names=NULL)

The second argument to apply() specifies whether to call the function for each row of the matrix (1) or for each column of the matrix (2). Extra arguments to the function FUN can also be specified; in this case, we pass the count variable. The result of apply() is a matrix, so we use data.frame() to turn the final result back into a data frame.
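
For example, a small sketch (not part of the original example) of calling a function for each row instead would use 1 as the second argument; this pastes together the three characteristics of each row of the candy data frame.

> apply(candy[, 1:3], 1, paste, collapse="-")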

The lapply() function is similar, but it calls a function for each component of a list. The first three variables in the candy data frame can be treated as a list and if we use lapply() on these variables, the result is also a list, so again we use data.frame() to turn the final result back into a data frame (see Figure 11.12).

> candyCasesList <- lapply(candy[,1:3], rep, candy$count)
> candyCases <- data.frame(candyCasesList)

Figure 11.12: The candy data set in a case-per-candy format; each row describes the shape, pattern, and shade characteristics of a single piece of candy from the picture on page [*].
 

> candyCases

   shape pattern shade
1  round pattern light
2  round pattern light
3   long pattern light
4   long pattern light
5   long pattern light
6  round   plain light
7   oval   plain light
8   oval   plain light
9   oval   plain light
10  long   plain light
11  long   plain light
12 round pattern  dark
13 round pattern  dark
14 round pattern  dark
15 round pattern  dark
16 round pattern  dark
17 round pattern  dark
18 round pattern  dark
19 round pattern  dark
20 round pattern  dark
21  long pattern  dark
22  long pattern  dark
23 round   plain  dark
24  oval   plain  dark
25  oval   plain  dark
26  oval   plain  dark
27  oval   plain  dark
28  oval   plain  dark
29  oval   plain  dark
30  oval   plain  dark
31  oval   plain  dark
32  oval   plain  dark
33  oval   plain  dark
34  oval   plain  dark
35  long   plain  dark
36  long   plain  dark

11.7.6 Tables of Counts

We can collapse the case-per-candy format back into counts in a number of different ways. The table() function produces counts for each combination of the levels of the factors it is given.

> candyTable <- table(candyCases)
> candyTable

, , shade = light

       pattern
shape   pattern plain
  round       2     1
  oval        0     3
  long        3     2

, , shade = dark

       pattern
shape   pattern plain
  round       9     1
  oval        0    11
  long        2     2

The ftable() function produces the same result, but as a “flat” (2-dimensional) contingency table, which can be easier to read.

> candyFTable <- ftable(candyCases)
> candyFTable

              shade light dark
shape pattern                 
round pattern           2    9
      plain             1    1
oval  pattern           0    0
      plain             3   11
long  pattern           3    2
      plain             2    2

The results of these functions are arrays or matrices, but they can easily be converted back to a data frame, like the one we originally entered in Section 11.5.1.

> as.data.frame(candyTable)

   shape pattern shade Freq
1  round pattern light    2
2   oval pattern light    0
3   long pattern light    3
4  round   plain light    1
5   oval   plain light    3
6   long   plain light    2
7  round pattern  dark    9
8   oval pattern  dark    0
9   long pattern  dark    2
10 round   plain  dark    1
11  oval   plain  dark   11
12  long   plain  dark    2

The function xtabs() produces the same result as table(), but provides a formula interface for specifying the factors to tabulate.

> candyXTable <- xtabs(~ shape + pattern + shade, candyCases)

This function can also be used to create a frequency table from data which has been entered as counts.

> candyXTable <- xtabs(count ~ shape + pattern + shade, candy)

Another way to collapse groups of observations into summary values (in this case counts) is to use the aggregate() function.

> aggregate(candy["count"], 
             list(shape=candy$shape, pattern=candy$pattern), 
             sum)

  shape pattern count
1 round pattern    11
2  oval pattern     0
3  long pattern     5
4 round   plain     2
5  oval   plain    14
6  long   plain     4

One advantage of this function is that any summary function can be used (e.g., mean() instead of sum()). Another advantage is that the result of the function is a data frame, not a contingency table like from table() or xtabs().

> xtabs(count ~ shape + pattern, candy)

       pattern
shape   pattern plain
  round      11     2
  oval        0    14
  long        5     4

Other functions do similar work to aggregate(), but produce the result in a different form (e.g., by() and tapply()).
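
For example, a rough sketch of the same summary using tapply(), and of aggregate() with mean() as the summary function, might look like this (results not shown here): tapply() returns a matrix with one cell per combination of shape and pattern, while aggregate() still returns a data frame.

> tapply(candy$count, list(candy$shape, candy$pattern), sum)
> aggregate(candy["count"], 
             list(shape=candy$shape, pattern=candy$pattern), 
             mean)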

11.7.7 Merging data sets

The functions cbind() and rbind() can be used to append data frames together: cbind() adds two sets of variables on common cases and rbind() adds two sets of cases on common variables.

The merge() function can be used to perform the equivalent of a database join (see Section 9.2.3) with two data frames.
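
As a small illustrative sketch (using two hypothetical toy data frames, not data from the case studies), merge() matches rows on the columns that the data frames have in common.

> dfA <- data.frame(id=1:3, x=c("a", "b", "c"))
> dfB <- data.frame(id=2:4, y=c(20, 30, 40))
> merge(dfA, dfB)               # inner join on the common column, id
> merge(dfA, dfB, all.x=TRUE)   # outer join; keeps all rows of dfA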


11.7.8 Case study: Rothamsted moths


[Image: “Farfallo contorno” (butterfly silhouette) by Architetto Francesco Rollandin. Taxonomists do not officially differentiate between butterflies and moths, but the most obvious physical difference is that the antennae of moths are usually furrier.]

Rothamsted is a very large and very old agricultural research centre situated in Hertfordshire in the United Kingdom. One of its long-term research projects is the Rothamsted Insect Survey, part of which involves the daily monitoring of nearly 100 “light traps” in order to count the number of moths of different species at various locations around the UK. The oldest light traps date back to 1933.

A colleague obtained a subset of the Rothamsted moth data in the form of two CSV format text files. One file contained 14 years' worth of total yearly counts for a limited range of moth species. The first row of the file contains the species labels and each subsequent row contains the counts of each species for a single year:

sp117,sp120,sp121,sp125,sp126,sp139,sp145,sp148,sp154, ...
2,1,1,0,0,1,0,3,0,1,3,4,4,0,0,9,2,6,0,0,0,16,31,2,40, ...
3,1,1,0,0,1,0,1,0,2,11,4,8,0,0,19,2,1,0,0,0,20,11,1, ...
2,0,3,6,0,1,0,1,0,1,9,1,18,0,0,15,15,0,1,0,0,14,12,5, ...
5,0,2,2,0,2,1,2,1,0,3,2,17,0,0,17,23,1,0,2,1,14,22,2, ...
...

Another file contained the species labels for the full set of 263 moth species for which counts exist, with one label per row. Notice that there are species labels in this file that do not appear in the file of counts (e.g., sp108, sp109, sp110, and sp111).

"x"
"sp108"
"sp109"
"sp110"
"sp111"
"sp117"
"sp120"
...

We can read in the counts using read.csv() to produce a data frame and we can read in the names using scan() to produce a simple character vector.

> mothCounts <- read.csv("mothcounts.csv")
> mothCounts

  sp117 sp120 sp121 sp125 sp126 sp139 sp145 sp148 sp154 ...
1     2     1     1     0     0     1     0     3     0 ...
2     3     1     1     0     0     1     0     1     0 ...
3     2     0     3     6     0     1     0     1     0 ...
4     5     0     2     2     0     2     1     2     1 ...
...

> mothNames <- scan("mothnames.csv", 
                     what="character", skip=1)
> mothNames

[1] "sp108" "sp109" "sp110" "sp111" "sp117" "sp120" ...

Our goal in this data manipulation example is to produce a single data set containing the count data from the file mothcounts.csv plus empty columns for all of the species for which we do not have counts. The end result will be a data frame as shown below, with columns of NAs where a species label occurs in mothNames, but not in mothCounts, and all other columns filled with the appropriate counts from mothCounts:

  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA     2     1     1     0     0 ...
2    NA    NA    NA    NA     3     1     1     0     0 ...
3    NA    NA    NA    NA     2     0     3     6     0 ...
4    NA    NA    NA    NA     5     0     2     2     0 ...
...

We will look at solving this problem in two ways. The first approach involves building up a large empty data set and then inserting the counts we know.

The first step is to create an empty data frame of the appropriate size. The matrix() function can create a matrix of the appropriate size and the as.data.frame() function converts it to a data frame. The function colnames() is used to give the columns of the data frame the appropriate names.

> allMoths <- as.data.frame(matrix(NA, 
                                    ncol=length(mothNames), 
                                    nrow=14))
> colnames(allMoths) <- mothNames
> allMoths

  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
2    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
3    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
4    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
...

Now we just fill in the columns where we know the moth counts. This is done simply using indexing; each column in the mothCounts data frame is assigned to the column with the same name in the allMoths data frame.

> allMoths[, colnames(mothCounts)] <- mothCounts
> allMoths

  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA     2     1     1     0     0 ...
2    NA    NA    NA    NA     3     1     1     0     0 ...
3    NA    NA    NA    NA     2     0     3     6     0 ...
4    NA    NA    NA    NA     5     0     2     2     0 ...
...

An alternative approach to the problem is to treat it like a database join. This time we start with a data frame that has a variable for each moth species, but no rows. The list() function creates a list with a zero-length vector, and the rep() function replicates that an appropriate number of times. The data.frame() function turns the list into a data frame (with zero rows).

> emptyMoths <- data.frame(rep(list(numeric(0)), 
                                length(mothNames)))
> colnames(emptyMoths) <- mothNames
> emptyMoths

[1] sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
<0 rows> (or 0-length row.names) ...

Now we get the final data frame by joining this data frame with the data frame containing moth counts; we perform an outer join to retain rows where there is no match between the data frames (in this case, all rows in mothCounts). The merge() function does the join, automatically detecting the columns to match on by the columns with common names in the two data frames. There are no matching rows between the data frames (emptyMoths has no counts), but the outer join retains all rows of mothCounts and adds missing values in columns which are in emptyMoths, but not in mothCounts (all.x=TRUE). We have to be careful to retain the original order of the rows (sort=FALSE) and the original order of the columns ([,mothNames]).

> mergeMoths <- merge(mothCounts, emptyMoths, 
                       all.x=TRUE, sort=FALSE)[,mothNames]
> mergeMoths

  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA     2     1     1     0     0 ...
2    NA    NA    NA    NA     3     1     1     0     0 ...
3    NA    NA    NA    NA     2     0     3     6     0 ...
4    NA    NA    NA    NA     5     0     2     2     0 ...
...

11.7.9 Case study: Utilities


[Image: A compact fluorescent light bulb. This sort of light bulb lasts up to 16 times longer and consumes about one quarter of the power of a comparable incandescent light bulb.]

A colleague in Baltimore collected data from his residential gas and electricity power bills over 8 years. The data are in a text file called baltimore.txt and include the start date for the bill, the number of therms of gas used and the amount charged, the number of kilowatt hours of electricity used and the amount charged, the average daily outdoor temperature (as reported on the bill), and the number of days in the billing period.

Figure 11.13 shows the first few lines of the file. This sort of text file can be read conveniently using the read.table() function, with the header=TRUE argument specified to use the variable names on the first line of the file. We also use as.is=TRUE to keep the dates as strings for now.

Figure 11.13: The first few lines of the utilities data set.
 

start   therms  gas KWHs        elect   temp    days
10-Jun-98       9       16.84   613     63.80   75      40
20-Jul-98       6       15.29   721     74.21   76      29
18-Aug-98       7       15.73   597     62.22   76      29
16-Sep-98       42      35.81   460     43.98   70      33
19-Oct-98       105     77.28   314     31.45   57      29
17-Nov-98       106     77.01   342     33.86   48      30
17-Dec-98       200     136.66  298     30.08   40      33
19-Jan-99       144     107.28  278     28.37   37      30
18-Feb-99       179     122.80  253     26.21   39      29
 ...

> utilities <- read.table("baltimore.txt", 
                           header=TRUE, as.is=TRUE)
> head(utilities)

      start therms   gas KWHs elect temp days
1 10-Jun-98      9 16.84  613 63.80   75   40
2 20-Jul-98      6 15.29  721 74.21   76   29
3 18-Aug-98      7 15.73  597 62.22   76   29
4 16-Sep-98     42 35.81  460 43.98   70   33
5 19-Oct-98    105 77.28  314 31.45   57   29
6 17-Nov-98    106 77.01  342 33.86   48   30

The first thing we want to do is to convert the first variable into actual dates. This will allow us to perform calculations and comparisons on the date values.

> utilities$start <- as.Date(utilities$start,
                              "%d-%b-%y")
> head(utilities)

       start therms   gas KWHs elect temp days
1 1998-06-10      9 16.84  613 63.80   75   40
2 1998-07-20      6 15.29  721 74.21   76   29
3 1998-08-18      7 15.73  597 62.22   76   29
4 1998-09-16     42 35.81  460 43.98   70   33
5 1998-10-19    105 77.28  314 31.45   57   29
6 1998-11-17    106 77.01  342 33.86   48   30

Several events of interest occurred in the household over this time period and the aim of the analysis was to determine whether any of these events had any effect on the energy consumption of the household. The events were:

These events can be used to break the data set into five different time periods; we will be interested in the average daily charges for each of these periods.

The first problem is that none of these events coincide with the start of a billing period, so we need to split some of the bills into two separate pieces in order to obtain data for each of the time periods we are interested in.

We could do this by hand, but by now it should be clear that there are good reasons to write code to perform the task instead. This will reduce the chance of silly errors and make it a trivial task to repeat the calculations if, for example, we discover that one of the event dates needs to be corrected, or a new event is discovered. The code will also serve as documentation of the steps we have taken in preparing the data set for analysis.

The algorithm we will use to reshape the data is as follows:


1 find which billing periods the events occurred in          
2 split those billing periods in two                         
3 combine new billing periods with unsplit billing periods   

How do we determine which period an event occurred in? This provides a nice example of working with dates in R. The code below creates a vector of dates for the events.

> events <- as.Date(c("1999-07-31", "2004-04-22",
                       "2004-09-01", "2005-12-18"))

One thing we can do with dates is subtract one from another; the result is the number of days between the dates. We can do the subtraction for all events at once using the outer() function. This subtracts each event date from each billing period start date and produces a matrix of the differences, where column i of the matrix shows the differences between event i and the start of each billing period. A subset of the rows of this matrix is shown below. For example, the first column gives the number of days between the first event and the start of each billing period; negative values indicate that the billing period began before the first event.

> dayDiffs <- outer(utilities$start, events, "-")
> dayDiffs[10:15,]

Time differences in days
     [,1]  [,2]  [,3]  [,4]
[1,] -106 -1833 -1965 -2438
[2,]  -74 -1801 -1933 -2406
[3,]  -44 -1771 -1903 -2376
[4,]  -11 -1738 -1870 -2343
[5,]   18 -1709 -1841 -2314
[6,]   48 -1679 -1811 -2284

For each event, we want the billing period where the difference is closest to zero and non-positive. This is the billing period in which the event occurred. First, we set all of the positive differences to a large negative value; the problem then simply becomes finding the maximum of the differences. Rather than overwrite the original data, we will work with a new copy. This is generally a safer technique, if there is enough computer memory to allow it.

> nonNegDayDiffs <- dayDiffs
> nonNegDayDiffs[dayDiffs > 0] <- -9999
> nonNegDayDiffs[10:15,]

Time differences in days
      [,1]  [,2]  [,3]  [,4]
[1,]  -106 -1833 -1965 -2438
[2,]   -74 -1801 -1933 -2406
[3,]   -44 -1771 -1903 -2376
[4,]   -11 -1738 -1870 -2343
[5,] -9999 -1709 -1841 -2314
[6,] -9999 -1679 -1811 -2284

The function which.max() returns the index of the maximum value in a vector. We can use the apply() function to calculate this index for each event (i.e., for each column of the matrix of differences).

> eventPeriods <- apply(nonNegDayDiffs, 2, which.max)
> eventPeriods

[1] 13 68 72 79

The eventPeriods variable now contains the row numbers for the billing periods that we want to split. These billing periods are shown below.

> utilities[eventPeriods, ]

        start therms    gas KWHs elect temp days
13 1999-07-20      6  15.88  723 74.41   77   29
68 2004-04-16     36  45.58  327 29.99   65   31
72 2004-08-18      7  19.13  402 43.00   73   31
79 2005-12-15    234 377.98  514 43.55   39   33

This may seem a little too much work for something which at first sight appears much easier to do “by eye”, because the data appear to be in date order. The problem is that it is very easy to be seduced into thinking that the task is simple, whereas a code solution like the one we have developed is harder to fool. In this case, a closer inspection reveals that the data are not as orderly as we first thought. Shown below are the last 10 rows of the data set, which reveal a change in the ordering of the dates during 2005 (look at the start variable), so it was just as well that we used a code solution.11.14

> tail(utilities, n=10)

        start therms    gas KWHs  elect temp days
87 2005-04-18     65  80.85  220  23.15   58   29
88 2005-03-16    126 141.82  289  27.93   49   30
89 2005-02-15    226 251.89  257  25.71   36   29
90 2006-02-14    183 257.23  332  30.93   41   30
91 2006-03-16    115 146.31  298  28.56   51   33
92 2006-04-18     36  54.55  291  28.06   59   28
93 2006-05-17     22  37.98  374  40.55   68   34
94 2006-06-19      7  20.68  614  83.19   78   29
95 2006-07-18      6  19.94  746 115.47   80   30
96 2006-08-17     11  26.08  534  84.89   72   33
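
A quick programmatic check along the same lines (a small sketch, not part of the original analysis) asks whether the start dates ever decrease, and where; for these data the first expression should report TRUE, because of the reordered rows during 2005.

> is.unsorted(utilities$start)
> which(diff(utilities$start) < 0)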

Now that we have identified which billing periods the events occurred in, the next step is to split each billing period that contains an event into two new periods, with energy consumptions and charges divided proportionally between them. The proportion will depend on where the event occurred within the billing period, which is nicely represented by the differences in the matrix dayDiffs. The differences we want are in the rows corresponding to the billing periods containing events, as shown below. The difference on row 1 of column 1 is the number of days from the start of billing period 13 to the first event. The difference on row 2 of column 2 is the number of days between the start of billing period 68 and the second event.

> dayDiffs[eventPeriods,]

Time differences in days
     [,1]  [,2]  [,3]  [,4]
[1,]  -11 -1738 -1870 -2343
[2,] 1721    -6  -138  -611
[3,] 1845   118   -14  -487
[4,] 2329   602   470    -3

The differences we want lie on the diagonal of this sub-matrix and these can be extracted using the diag() function.

> eventDiffs <- diag(dayDiffs[eventPeriods,])
> eventDiffs

[1] -11  -6 -14  -3

We want to convert the differences in days into a proportion of the billing period. The proportions are these differences divided by the number of days in the billing period, which is stored in the days variable.

> proportions <- -eventDiffs/utilities$days[eventPeriods]
> proportions

[1] 0.3793103 0.1935484 0.4516129 0.0909091

The modified periods are the original billing periods multiplied by these proportions. We want to multiply all values in the data set except the start variable (hence the -1 below).

> modifiedPeriods <- utilities[eventPeriods, -1]*proportions
> modifiedPeriods$start <- utilities$start[eventPeriods]
> modifiedPeriods

      therms       gas      KWHs     elect      temp days      start
13  2.275862  6.023448 274.24138 28.224483 29.206897   11 1999-07-20
68  6.967742  8.821935  63.29032  5.804516 12.580645    6 2004-04-16
72  3.161290  8.639355 181.54839 19.419355 32.967742   14 2004-08-18
79 21.272727 34.361818  46.72727  3.959091  3.545455    3 2005-12-15

We also need new periods representing the remainder of the original billing periods. The start dates for these new periods are the dates on which the events occurred.

> newPeriods <- utilities[eventPeriods, -1]*(1 - proportions)
> newPeriods$start <- events
> newPeriods

       therms        gas     KWHs    elect     temp days      start
13   3.724138   9.856552 448.7586 46.18552 47.79310   18 1999-07-31
68  29.032258  36.758065 263.7097 24.18548 52.41935   25 2004-04-22
72   3.838710  10.490645 220.4516 23.58065 40.03226   17 2004-09-01
79 212.727273 343.618182 467.2727 39.59091 35.45455   30 2005-12-18

Finally, we combine the unchanged billing periods with the periods that we have split in two. When the data frames being combined have exactly the same set of variables, the rbind() function can be used to combine them.11.15

> periods <- rbind(utilities[-eventPeriods, ],
                    modifiedPeriods,
                    newPeriods)

We should check that our new data frame has the same basic properties as the original data frame. The code below simply calculates the sum of each variable in the data set (other than the start dates).

> apply(utilities[, -1], 2, sum)

  therms      gas     KWHs    elect     temp     days 
 8899.00  9342.40 41141.00  4058.77  5417.00  2929.00

> apply(periods[, -1], 2, sum)

  therms      gas     KWHs    elect     temp     days 
 8899.00  9342.40 41141.00  4058.77  5417.00  2929.00

Now that we have the data in a format where the significant events occur at the start of a billing period, the last step is to calculate average daily usage and costs in the time periods between the events. We need to create a new variable phase which will identify the “time phase” for each billing period (i.e., between which pair of events the billing period occurred). The algorithm we will use for this task is shown below:


1 phase = 5 for all periods                                  
2 for each event E                                           
3     subtract 1 from phase for periods that start before E  

This can be done in R using a for loop and the fact that dates can be compared using <, just like numbers.

> periods$phase <- 5
> for (i in events) { 
       periods$phase[periods$start < i] <- 
           periods$phase[periods$start < i] - 1 
   }
> periods$phase

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [55] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 5 4 4 4 4 4 4
 [82] 4 4 4 4 5 5 5 5 5 5 5 1 2 3 4 2 3 4 5

Now we can sum usage and costs for each phase, then divide by the number of days to obtain averages that can be compared between phases.

> phases <- aggregate(periods[c("therms", "gas", "KWHs", 
                                 "elect", "days")], 
                       list(phase=periods$phase), sum)
> phases

  phase     therms       gas      KWHs     elect days
1     1  871.27586  693.4134  5434.241  556.6545  388
2     2 5656.69188 5337.3285 24358.049 2291.1400 1664
3     3   53.19355  103.1574  1648.258  170.8748  132
4     4 1528.11144 2017.5025  5625.179  551.8997  470
5     5  789.72727 1190.9982  4075.273  488.2009  275

> phaseAvgs <- 
       phases[c("therms", "gas", "KWHs", "elect")]/phases$days
> phaseAvgs

     therms       gas     KWHs    elect
1 2.2455563 1.7871481 14.00578 1.434677
2 3.3994543 3.2075291 14.63825 1.376887
3 0.4029814 0.7814956 12.48680 1.294506
4 3.2513009 4.2925584 11.96847 1.174255
5 2.8717355 4.3309025 14.81917 1.775276

The last step above works because when a data frame is divided by a vector, each variable in the data frame gets divided by the vector. This is not necessarily obvious from the code; a more explicit way to perform the operation is to use the sweep() function, which forces us to explicitly state that, for each row of the data frame (MARGIN=1), we are dividing (FUN="/") by the corresponding value from a vector (STATS=phases$days).

> phaseSweep <- sweep(phases[c("therms", "gas", "KWHs", "elect")], 
                       MARGIN=1, STATS=phases$days, FUN="/")

The function identical() is useful for demonstrating that this approach produces exactly the same values as before.

> identical(phaseAvgs, phaseSweep)

[1] TRUE

The values that stand out amongst these daily averages are the gas usage and cost during phase 3 (after the first two storm windows were replaced, but before the second set of four storm windows were replaced). The naive interpretation is that the first two storm windows were incredibly effective, but the second four storm windows actually made things worse again!

At first sight this appears strange, but it is easily explained by the fact that phase 3 coincided with summer months, as shown below using the table() function.

> table(months(periods$start), periods$phase)[month.name, ]

            1 2 3 4 5
  January   1 5 0 1 1
  February  1 5 0 1 1
  March     0 5 0 1 1
  April     1 5 1 1 1
  May       1 3 1 1 1
  June      2 4 1 1 1
  July      2 5 1 1 1
  August    1 5 1 1 1
  September 1 5 0 3 0
  October   1 5 0 2 0
  November  1 4 0 2 0
  December  1 5 0 2 1

The function months() extracts the names of the months from the dates in the start variable. Subsetting the resulting table by month.name just rearranges the order of the rows of the table; this ensures that the months are reported in calendar order rather than alphabetical order.

We would expect the gas usage (for heating) to be a lot lower during summer. This is confirmed by calculating the daily average usages and costs for each month. Again we must take care to get the answer in calendar order; this time we do that by aggregating over a factor that is based on extracting the months from the start variable, with the order of the levels of the factor explicitly set to be month.name. These results are presented graphically in Figure 11.14.

> months <- aggregate(periods[c("therms", "gas", "KWHs",
                                 "elect", "days")],
                       list(month=factor(months(periods$start), 
                              levels=month.name)), 
                       sum)
> cbind(month=months$month,
         months[c("therms", "gas", "KWHs", "elect")]/months$days)

       month    therms       gas     KWHs    elect
1    January 7.7763713 7.5162025 12.62025 1.151646
2   February 5.9300412 5.8969136 11.02058 1.036337
3      March 3.9056604 3.9867453 11.65094 1.074009
4      April 1.6666667 1.9054852 10.93671 1.035907
5        May 0.6547085 0.9880269 12.47085 1.335202
6       June 0.2214533 0.5661592 16.29066 1.756298
7       July 0.2067669 0.5886090 21.85338 2.347444
8     August 0.2354594 0.6006794 17.01344 1.856210
9  September 1.1079020 1.2773281 14.07927 1.279034
10   October 3.4347826 3.4618261 11.83913 1.106087
11  November 5.3669725 5.5368807 13.55046 1.191651
12  December 7.2182540 7.1405159 13.12302 1.177976

Figure 11.14: The average daily energy usage and costs for each month from the Utilities data set.

Attempting to determine whether any of the significant events led to a significant change in energy consumption and cost for this household is clearly going to require careful analysis. Fortunately, having prepared the data, our work is done, so we will leave the difficult part to those who follow.
