R provides many different functions for extracting subsets of data and for rearranging data into different formats.
Recall the candy data set that was first introduced in Section 11.5.1. The full candy data frame is reproduced below.
> candy
   shape pattern shade count
1  round pattern light     2
2   oval pattern light     0
3   long pattern light     3
4  round   plain light     1
5   oval   plain light     3
6   long   plain light     2
7  round pattern  dark     9
8   oval pattern  dark     0
9   long pattern  dark     2
10 round   plain  dark     1
11  oval   plain  dark    11
12  long   plain  dark     2
We have previously seen that we can get a single variable from a data frame using the $ operator. For example, the count variable can be obtained using candy$count. Another way to do the same thing is to use the [[ operator and specify the variable of interest as a string.
> candyCounts <- candy[["count"]]
> candyCounts
[1] 2 0 3 1 3 2 9 0 2 1 11 2
The advantage of the [[ operator is that it allows a number or an expression as the index. For example, the count variable is the fourth variable, so this code also works.
> candy[[4]]
[1] 2 0 3 1 3 2 9 0 2 1 11 2
The following code evaluates an expression that determines which of the variables in the data frame is numeric and then selects just that variable from the data frame.
> sapply(candy, is.numeric)
  shape pattern   shade   count 
  FALSE   FALSE   FALSE    TRUE 
> which(sapply(candy, is.numeric))
count 
    4 
> candy[[which(sapply(candy, is.numeric))]]
[1] 2 0 3 1 3 2 9 0 2 1 11 2
When working with a simple vector of values, like candyCounts, we can extract subsets of the vector using the [ operator. For example, the following code produces the first three counts (light-shaded candies with a pattern).
> candyCounts[1:3]
[1] 2 0 3
The indices can be any integer sequence, they can include repetitions, and even negative numbers (to exclude specific values). The following two examples produce counts for all candies with a pattern and then all counts except the count for round plain dark candies.
> candyCounts[c(1:3, 7:9)]
[1] 2 0 3 9 0 2
> candyCounts[-10]
[1] 2 0 3 1 3 2 9 0 2 11 2
As well as using integers for indices, we can use logical values. For example, a better way to express the idea that we want the counts for all candies with a pattern is to use an expression like this:
> hasPattern <- candy$pattern == "pattern"
> hasPattern
 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[10] FALSE FALSE FALSE
This vector of logical values can be used as an index to return all of the counts where hasPattern is TRUE.
> candyCounts[hasPattern]
[1] 2 0 3 9 0 2
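Logical indices can also be combined with the standard logical operators. As a brief sketch (reconstructing the candy data frame from above so that the code is self-contained), the following selects the counts for candies that are both patterned and dark-shaded:

```r
# Reconstruct the candy data frame shown above
candy <- data.frame(shape=rep(c("round", "oval", "long"), 4),
                    pattern=rep(rep(c("pattern", "plain"), each=3), 2),
                    shade=rep(c("light", "dark"), each=6),
                    count=c(2, 0, 3, 1, 3, 2, 9, 0, 2, 1, 11, 2))

hasPattern <- candy$pattern == "pattern"
isDark <- candy$shade == "dark"

# Counts for candies that are patterned AND dark (rows 7 to 9)
candy$count[hasPattern & isDark]
# [1] 9 0 2
```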
Better still would be to work with the entire data frame and retain the pattern with the counts, so that we can check that we have the correct result. A data frame can be indexed using the [ operator too, though slightly differently because we have to specify both which rows and which columns we want. Here are two examples which extract the pattern and count variables from the data frame for all candies with a pattern:
> candy[hasPattern, c(2, 4)]
> candy[hasPattern, c("pattern", "count")]
  pattern count
1 pattern     2
2 pattern     0
3 pattern     3
7 pattern     9
8 pattern     0
9 pattern     2
In both cases, the index is of the form [<rows>, <columns>], but the first example uses column numbers and the second example uses column names. The result is still a data frame, just a smaller one.
The function subset() provides another way to perform this sort of subsetting, with a subset argument for specifying the rows and a select argument for specifying the columns.
> subset(candy, subset=hasPattern, select=c("pattern", "count"))
  pattern count
1 pattern     2
2 pattern     0
3 pattern     3
7 pattern     9
8 pattern     0
9 pattern     2
It is possible to leave the row or column index completely empty, in which case all rows or columns are returned. For example, this code extracts all variables for the light-shaded candies with a pattern:
> candy[1:3, ]
  shape pattern shade count
1 round pattern light     2
2  oval pattern light     0
3  long pattern light     3
It is also possible to extract all rows for a selection of variables in a data frame by just specifying a single index with the [ operator.
> candy["count"]
   count
1      2
2      0
3      3
4      1
5      3
6      2
7      9
8      0
9      2
10     1
11    11
12     2
This result, which is a data frame, is quite different to the result from using the [[ operator, which gives a vector.
> candy[["count"]]
[1] 2 0 3 1 3 2 9 0 2 1 11 2
Some R functions produce complex results (e.g., the result of a linear regression returned by lm()). These results are often returned as list objects, which makes it tempting to obtain various subsets of the results, e.g., the model coefficients from a linear regression, using the basic subsetting syntax. However, this is usually not the correct approach.
Most functions which produce complex results are accompanied by a set of other functions which perform the extraction of useful subsets of the result. For example, the coefficients of a linear regression analysis should be obtained using the coef() function.
The reason for this is so that the people who write a function that produces complex results can change the structure of the result without breaking code that extracts subsets of the result. This idea is known as encapsulation.
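As a brief illustration (with made-up data and a fixed seed, purely so the sketch is reproducible), the coefficients of a fitted model should be obtained via the accessor function rather than by indexing into the fitted object:

```r
set.seed(42)          # for reproducibility of this sketch only
x <- 1:10
y <- 2*x + rnorm(10)  # made-up response data
fit <- lm(y ~ x)

# The accessor function is the supported interface ...
coef(fit)
# ... whereas fit[["coefficients"]] relies on the current internal
# structure of "lm" objects, which the authors are free to change
```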
It is often necessary to expand, reduce, or generally “reshape” a data set and R provides many functions for these sorts of operations.
Recall the candy data set that was first introduced in Section 11.5.1. The full candy data frame is reproduced below.
> candy
   shape pattern shade count
1  round pattern light     2
2   oval pattern light     0
3   long pattern light     3
4  round   plain light     1
5   oval   plain light     3
6   long   plain light     2
7  round pattern  dark     9
8   oval pattern  dark     0
9   long pattern  dark     2
10 round   plain  dark     1
11  oval   plain  dark    11
12  long   plain  dark     2
This is a somewhat abbreviated form for the data, recording only the number of occurrences of each possible combination of candy characteristics. Another way to represent the data is as a case per candy, with three variables giving the pattern, shade, and shape of each piece of candy (see Figure 11.12).
For example, the following code produces what we want for the shape variable.
> rep(candy$shape, candy$count)
 [1] round round long  long  long  round oval  oval  oval 
[10] long  long  round round round round round round round
[19] round round long  long  round oval  oval  oval  oval 
[28] oval  oval  oval  oval  oval  oval  oval  long  long 
Levels: round oval long
We could perform this operation on each variable and explicitly glue the new variables back together (Figure 11.12 shows the full case-per-candy form of the data set).
> candyCases <- data.frame(shape=rep(candy$shape, candy$count),
                           pattern=rep(candy$pattern, candy$count),
                           shade=rep(candy$shade, candy$count))

This is a feasible approach for a small number of variables, but for larger data sets, and for pure elegance, R provides a more general approach via functions such as apply() and lapply().
The apply() function works on matrices or arrays. We can specify a function, FUN, and apply() will call that function for each column of a matrix. For this example, we can treat the first three variables of the candy data set as a matrix (of strings) with three columns, and apply the rep() function to each column (see Figure 11.12).
> candyCasesMatrix <- apply(candy[, 1:3], 2, rep, candy$count)
> candyCases <- data.frame(candyCasesMatrix, row.names=NULL)

The second argument to apply() specifies whether to call the function for each row of the matrix (1) or for each column of the matrix (2). Extra arguments to the function FUN can also be specified; in this case, we pass the count variable. The result of apply() is a matrix, so we use data.frame() to turn the final result back into a data frame.
The lapply() function is similar, but it calls a function for each component of a list. The first three variables in the candy data frame can be treated as a list and if we use lapply() on these variables, the result is also a list, so again we use data.frame() to turn the final result back into a data frame (see Figure 11.12).
> candyCasesList <- lapply(candy[, 1:3], rep, candy$count)
> candyCases <- data.frame(candyCasesList)
We can collapse the case-per-candy format back into counts in a number of different ways. The table() function produces counts for each combination of the levels of the factors it is given.
> candyTable <- table(candyCases)
> candyTable
, , shade = light

       pattern
shape   pattern plain
  round       2     1
  oval        0     3
  long        3     2

, , shade = dark

       pattern
shape   pattern plain
  round       9     1
  oval        0    11
  long        2     2
The ftable() function produces the same result, but as a “flat” (2-dimensional) contingency table, which can be easier to read.
> candyFTable <- ftable(candyCases)
> candyFTable
              shade light dark
shape pattern                 
round pattern           2    9
      plain             1    1
oval  pattern           0    0
      plain             3   11
long  pattern           3    2
      plain             2    2
The results of these functions are arrays or matrices, but they can easily be converted back to a data frame, like the one we originally entered in Section 11.5.1.
> as.data.frame(candyTable)
   shape pattern shade Freq
1  round pattern light    2
2   oval pattern light    0
3   long pattern light    3
4  round   plain light    1
5   oval   plain light    3
6   long   plain light    2
7  round pattern  dark    9
8   oval pattern  dark    0
9   long pattern  dark    2
10 round   plain  dark    1
11  oval   plain  dark   11
12  long   plain  dark    2
The function xtabs() produces the same result as table(), but provides a formula interface for specifying the factors to tabulate.
> candyXTable <- xtabs(~ shape + pattern + shade, candyCases)

This function can also be used to create a frequency table from data which has been entered as counts.
> candyXTable <- xtabs(count ~ shape + pattern + shade, candy)

Another way to collapse groups of observations into summary values (in this case counts) is to use the aggregate() function.
> aggregate(candy["count"], list(shape=candy$shape, pattern=candy$pattern), sum)
  shape pattern count
1 round pattern    11
2  oval pattern     0
3  long pattern     5
4 round   plain     2
5  oval   plain    14
6  long   plain     4
One advantage of this function is that any summary function can be used (e.g., mean() instead of sum()). Another advantage is that the result of the function is a data frame, not a contingency table like from table() or xtabs().
> xtabs(count ~ shape + pattern, candy)
       pattern
shape   pattern plain
  round      11     2
  oval        0    14
  long        5     4
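To illustrate the first point, here is a sketch that replaces sum() with mean(), giving the average count for each shape (averaging over pattern and shade); the candy data frame is reconstructed so the code stands alone:

```r
# Reconstruct the candy data frame shown earlier
candy <- data.frame(shape=rep(c("round", "oval", "long"), 4),
                    pattern=rep(rep(c("pattern", "plain"), each=3), 2),
                    shade=rep(c("light", "dark"), each=6),
                    count=c(2, 0, 3, 1, 3, 2, 9, 0, 2, 1, 11, 2))

# Average count per shape, e.g., round = (2 + 1 + 9 + 1)/4 = 3.25
candyMeans <- aggregate(candy["count"], list(shape=candy$shape), mean)
candyMeans
```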
Other functions do similar work to aggregate(), but produce the result in a different form (e.g., by() and tapply()).
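For example, tapply() produces a matrix (or array) of sums rather than a data frame; a rough equivalent of the aggregate() call above is sketched below (again reconstructing the candy data frame so the code stands alone):

```r
# Reconstruct the candy data frame shown earlier
candy <- data.frame(shape=rep(c("round", "oval", "long"), 4),
                    pattern=rep(rep(c("pattern", "plain"), each=3), 2),
                    shade=rep(c("light", "dark"), each=6),
                    count=c(2, 0, 3, 1, 3, 2, 9, 0, 2, 1, 11, 2))

# Sum counts for each shape-by-pattern combination; the result is
# a matrix with shapes as rows and pattern values as columns
tapply(candy$count, list(candy$shape, candy$pattern), sum)
```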
The functions cbind() and rbind() can be used to append data frames together: cbind() adds two sets of variables on common cases, while rbind() adds two sets of cases on common variables.
The merge() function can be used to perform the equivalent of a database join (see Section 9.2.3) with two data frames.
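A minimal sketch of a join with merge(), using two small hypothetical data frames that share an id column:

```r
# Two hypothetical data frames with a common "id" column
left  <- data.frame(id=c(1, 2, 3), x=c("a", "b", "c"))
right <- data.frame(id=c(2, 3, 4), y=c(20, 30, 40))

# Inner join: only the ids present in both data frames (2 and 3)
merge(left, right)

# Left outer join: all rows of 'left', with NA where 'right' has no match
merge(left, right, all.x=TRUE)
```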
[Margin image: “Farfallo contorno” (butterfly silhouette) by Architetto Francesco Rollandin.]

Footnote 11.12: Taxonomists do not officially differentiate between butterflies and moths, but the most obvious physical difference is that the antennae of moths are usually furrier.
Rothamsted is a very large and very old agricultural research centre situated in Hertfordshire in the United Kingdom. One of its long term research projects is the Rothamsted Insect Survey, part of which involves the daily monitoring of nearly 100 “light traps” in order to count the number of moths of different species at various locations around the UK. The oldest light traps date back to 1933.
A colleague obtained a subset of the Rothamsted moth data in the form of two CSV format text files. One file contained 14 years worth of total yearly counts for a limited range of moth species. The first row of the file contains the species label and each subsequent row contains the counts of each species for a single year:
sp117,sp120,sp121,sp125,sp126,sp139,sp145,sp148,sp154, ...
2,1,1,0,0,1,0,3,0,1,3,4,4,0,0,9,2,6,0,0,0,16,31,2,40, ...
3,1,1,0,0,1,0,1,0,2,11,4,8,0,0,19,2,1,0,0,0,20,11,1, ...
2,0,3,6,0,1,0,1,0,1,9,1,18,0,0,15,15,0,1,0,0,14,12,5, ...
5,0,2,2,0,2,1,2,1,0,3,2,17,0,0,17,23,1,0,2,1,14,22,2, ...
...
Another file contained the species labels for the full set of 263 moth species for which counts exist, with one label per row. Notice that there are species labels in this file that do not appear in the file of counts (e.g., sp108, sp109, sp110, and sp111).
"x"
"sp108"
"sp109"
"sp110"
"sp111"
"sp117"
"sp120"
...
We can read in the counts using read.csv() to produce a data frame and we can read in the names using scan() to produce a simple character vector.
> mothCounts <- read.csv("mothcounts.csv")
> mothCounts
  sp117 sp120 sp121 sp125 sp126 sp139 sp145 sp148 sp154 ...
1     2     1     1     0     0     1     0     3     0 ...
2     3     1     1     0     0     1     0     1     0 ...
3     2     0     3     6     0     1     0     1     0 ...
4     5     0     2     2     0     2     1     2     1 ...
...
> mothNames <- scan("mothnames.csv", what="character", skip=1)
> mothNames
[1] "sp108" "sp109" "sp110" "sp111" "sp117" "sp120" ...
Our goal in this data manipulation example is to produce a single data set containing the count data from the file mothCounts.csv plus empty columns for all of the species for which we do not have counts. The end result will be a data frame as shown below, with columns of NAs where a species label occurs in mothNames, but not in mothCounts, and all other columns filled with the appropriate counts from mothCounts:
  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA     2     1     1     0     0 ...
2    NA    NA    NA    NA     3     1     1     0     0 ...
3    NA    NA    NA    NA     2     0     3     6     0 ...
4    NA    NA    NA    NA     5     0     2     2     0 ...
...
We will look at solving this problem in two ways. The first approach involves building up a large empty data set and then inserting the counts we know.
The first step is to create an empty data frame of the appropriate size. The matrix() function can create a matrix of the appropriate size and the as.data.frame() function converts it to a data frame. The function colnames() is used to give the columns of the data frame the appropriate names.
> allMoths <- as.data.frame(matrix(NA, ncol=length(mothNames), nrow=14))
> colnames(allMoths) <- mothNames
> allMoths
  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
2    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
3    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
4    NA    NA    NA    NA    NA    NA    NA    NA    NA ...
...
Now we just fill in the columns where we know the moth counts. This is done simply using indexing; each column in the mothCounts data frame is assigned to the column with the same name in the allMoths data frame.
> allMoths[, colnames(mothCounts)] <- mothCounts
> allMoths
  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA     2     1     1     0     0 ...
2    NA    NA    NA    NA     3     1     1     0     0 ...
3    NA    NA    NA    NA     2     0     3     6     0 ...
4    NA    NA    NA    NA     5     0     2     2     0 ...
...
An alternative approach to the problem is to treat it like a database join. This time we start with a data frame that has a variable for each moth species, but no rows. The list() function creates a list with a zero-length vector, and the rep() function replicates that an appropriate number of times. The data.frame() function turns the list into a data frame (with zero rows).
> emptyMoths <- data.frame(rep(list(numeric(0)), length(mothNames)))
> colnames(emptyMoths) <- mothNames
> emptyMoths
[1] sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
<0 rows> (or 0-length row.names)
Now we get the final data frame by joining this data frame with the data frame containing moth counts; we perform an outer join to retain rows where there is no match between the data frames (in this case, all rows in mothCounts). The merge() function does the join, automatically detecting the columns to match on by the columns with common names in the two data frames. There are no matching rows between the data frames (emptyMoths has no counts), but the outer join retains all rows of mothCounts and adds missing values in columns which are in emptyMoths, but not in mothCounts (all.x=TRUE). We have to be careful to retain the original order of the rows (sort=FALSE) and the original order of the columns ([,mothNames]).
> mergeMoths <- merge(mothCounts, emptyMoths, all.x=TRUE,
                      sort=FALSE)[, mothNames]
> mergeMoths
  sp108 sp109 sp110 sp111 sp117 sp120 sp121 sp125 sp126 ...
1    NA    NA    NA    NA     2     1     1     0     0 ...
2    NA    NA    NA    NA     3     1     1     0     0 ...
3    NA    NA    NA    NA     2     0     3     6     0 ...
4    NA    NA    NA    NA     5     0     2     2     0 ...
...
[Margin image: A compact fluorescent light bulb.]

Footnote 11.13: This sort of light bulb lasts up to 16 times longer and consumes about one quarter of the power of a comparable incandescent light bulb.
A colleague in Baltimore collected data from his residential gas and electricity power bills over 8 years. The data are in a text file called baltimore.txt and include the start date for the bill, the number of therms of gas used and the amount charged, the number of kilowatt hours of electricity used and the amount charged, the average daily outdoor temperature (as reported on the bill), and the number of days in the billing period.
Figure 11.13 shows the first few lines of the file. This sort of text file can be read conveniently using the read.table() function, with the header=TRUE argument specified to use the variable names on the first line of the file. We also use as.is=TRUE to keep the dates as strings for now.
start therms gas KWHs elect temp days
10-Jun-98 9 16.84 613 63.80 75 40
20-Jul-98 6 15.29 721 74.21 76 29
18-Aug-98 7 15.73 597 62.22 76 29
16-Sep-98 42 35.81 460 43.98 70 33
19-Oct-98 105 77.28 314 31.45 57 29
17-Nov-98 106 77.01 342 33.86 48 30
17-Dec-98 200 136.66 298 30.08 40 33
19-Jan-99 144 107.28 278 28.37 37 30
18-Feb-99 179 122.80 253 26.21 39 29
...
> utilities <- read.table("baltimore.txt", header=TRUE, as.is=TRUE)
> head(utilities)
      start therms   gas KWHs elect temp days
1 10-Jun-98      9 16.84  613 63.80   75   40
2 20-Jul-98      6 15.29  721 74.21   76   29
3 18-Aug-98      7 15.73  597 62.22   76   29
4 16-Sep-98     42 35.81  460 43.98   70   33
5 19-Oct-98    105 77.28  314 31.45   57   29
6 17-Nov-98    106 77.01  342 33.86   48   30
The first thing we want to do is to convert the first variable into actual dates. This will allow us to perform calculations and comparisons on the date values.
> utilities$start <- as.Date(utilities$start, "%d-%b-%y")
> head(utilities)
       start therms   gas KWHs elect temp days
1 1998-06-10      9 16.84  613 63.80   75   40
2 1998-07-20      6 15.29  721 74.21   76   29
3 1998-08-18      7 15.73  597 62.22   76   29
4 1998-09-16     42 35.81  460 43.98   70   33
5 1998-10-19    105 77.28  314 31.45   57   29
6 1998-11-17    106 77.01  342 33.86   48   30
Several events of interest occurred in the household over this time period and the aim of the analysis was to determine whether any of these events had any effect on the energy consumption of the household. The events were:
These events can be used to break the data set into five different time periods; we will be interested in the average daily charges for each of these periods.
The first problem is that none of these events coincide with the start of a billing period, so we need to split some of the bills into two separate pieces in order to obtain data for each of the time periods we are interested in.
We could do this by hand, but by now it should be clear that there are good reasons to write code to perform the task instead. This will reduce the chance of silly errors and make it a trivial task to repeat the calculations if, for example, we discover that one of the event dates needs to be corrected, or a new event is discovered. The code will also serve as documentation of the steps we have taken in preparing the data set for analysis.
The algorithm we will use to reshape the data is as follows:
How do we determine which period an event occurred in? This provides a nice example of working with dates in R. The code below creates a vector of dates for the events.
> events <- as.Date(c("1999-07-31", "2004-04-22",
                      "2004-09-01", "2005-12-18"))

One thing we can do with dates is subtract one from another; the result in this case is a number of days between the dates. We can do the subtraction for all events at once using the outer() function. This will subtract each event date from each billing period start date and produce a matrix of the differences, where column i of the matrix shows the differences between event i and the start of each billing period. A subset of the rows of this matrix are shown below. For example, the first column gives the number of days between the first event and the start of each billing period; negative values indicate that the billing period began before the first event.
> dayDiffs <- outer(utilities$start, events, "-")
> dayDiffs[10:15,]
Time differences in days
     [,1]  [,2]  [,3]  [,4]
[1,] -106 -1833 -1965 -2438
[2,]  -74 -1801 -1933 -2406
[3,]  -44 -1771 -1903 -2376
[4,]  -11 -1738 -1870 -2343
[5,]   18 -1709 -1841 -2314
[6,]   48 -1679 -1811 -2284
For each event, we want the billing period where the difference is closest to zero and non-positive. This is the billing period in which the event occurred. First, we can set all of the positive differences to large negative ones, then the problem simply becomes finding the maximum of the differences. Rather than overwrite the original data, we will work with a new copy. This is generally a safer technique, if there is enough computer memory to allow it.
> nonNegDayDiffs <- dayDiffs
> nonNegDayDiffs[dayDiffs > 0] <- -9999
> nonNegDayDiffs[10:15,]
Time differences in days
      [,1]  [,2]  [,3]  [,4]
[1,]  -106 -1833 -1965 -2438
[2,]   -74 -1801 -1933 -2406
[3,]   -44 -1771 -1903 -2376
[4,]   -11 -1738 -1870 -2343
[5,] -9999 -1709 -1841 -2314
[6,] -9999 -1679 -1811 -2284
The function which.max() returns the index of the maximum value in a vector. We can use the apply() function to calculate this index for each event (i.e., for each column of the matrix of differences).
> eventPeriods <- apply(nonNegDayDiffs, 2, which.max)
> eventPeriods
[1] 13 68 72 79
The eventPeriods variable now contains the row numbers for the billing periods that we want to split. These billing periods are shown below.
> utilities[eventPeriods, ]
        start therms    gas KWHs elect temp days
13 1999-07-20      6  15.88  723 74.41   77   29
68 2004-04-16     36  45.58  327 29.99   65   31
72 2004-08-18      7  19.13  402 43.00   73   31
79 2005-12-15    234 377.98  514 43.55   39   33
This may seem like too much work for something which at first sight appears much easier to do “by eye”, because the data appear to be in date order. The problem is that it is very easy to be seduced into thinking that the task is simple, whereas a code solution like the one we have developed is harder to fool. In this case, a closer inspection reveals that the data are not as orderly as we first thought. Shown below are the last 10 rows of the data set, which reveal a change in the ordering of the dates during 2005 (look at the start variable), so it was just as well that we used a code solution.
> tail(utilities, n=10)
        start therms    gas KWHs  elect temp days
87 2005-04-18     65  80.85  220  23.15   58   29
88 2005-03-16    126 141.82  289  27.93   49   30
89 2005-02-15    226 251.89  257  25.71   36   29
90 2006-02-14    183 257.23  332  30.93   41   30
91 2006-03-16    115 146.31  298  28.56   51   33
92 2006-04-18     36  54.55  291  28.06   59   28
93 2006-05-17     22  37.98  374  40.55   68   34
94 2006-06-19      7  20.68  614  83.19   78   29
95 2006-07-18      6  19.94  746 115.47   80   30
96 2006-08-17     11  26.08  534  84.89   72   33
Now that we have identified which billing periods the events occurred in, the next step is to split each billing period that contains an event into two new periods, with energy consumptions and charges divided proportionally between them. The proportion will depend on where the event occurred within the billing period, which is nicely represented by the differences in the matrix dayDiffs. The differences we want are in the rows corresponding to the billing periods containing events, as shown below. The difference on row 1 of column 1 is the number of days from the start of billing period 13 to the first event. The difference on row 2 of column 2 is the number of days between the start of billing period 68 and the second event.
> dayDiffs[eventPeriods,]
Time differences in days
     [,1]  [,2]  [,3]  [,4]
[1,]  -11 -1738 -1870 -2343
[2,] 1721    -6  -138  -611
[3,] 1845   118   -14  -487
[4,] 2329   602   470    -3
The differences we want lie on the diagonal of this sub-matrix and these can be extracted using the diag() function.
> eventDiffs <- diag(dayDiffs[eventPeriods,])
> eventDiffs
[1] -11 -6 -14 -3
We want to convert the differences in days into a proportion of the billing period. The proportions are these differences divided by the number of days in the billing period, which is stored in the days variable.
> proportions <- -eventDiffs/utilities$days[eventPeriods]
> proportions
[1] 0.3793103 0.1935484 0.4516129 0.0909091
The modified periods are the original billing periods multiplied by these proportions. We want to multiply all values in the data set except the start variable (hence the -1 below).
> modifiedPeriods <- utilities[eventPeriods, -1]*proportions
> modifiedPeriods$start <- utilities$start[eventPeriods]
> modifiedPeriods
      therms       gas      KWHs     elect      temp days      start
13  2.275862  6.023448 274.24138 28.224483 29.206897   11 1999-07-20
68  6.967742  8.821935  63.29032  5.804516 12.580645    6 2004-04-16
72  3.161290  8.639355 181.54839 19.419355 32.967742   14 2004-08-18
79 21.272727 34.361818  46.72727  3.959091  3.545455    3 2005-12-15
We also need new periods representing the remainder of the original billing periods. The start dates for these new periods are the dates on which the events occurred.
> newPeriods <- utilities[eventPeriods, -1]*(1 - proportions)
> newPeriods$start <- events
> newPeriods
       therms        gas     KWHs    elect     temp days      start
13   3.724138   9.856552 448.7586 46.18552 47.79310   18 1999-07-31
68  29.032258  36.758065 263.7097 24.18548 52.41935   25 2004-04-22
72   3.838710  10.490645 220.4516 23.58065 40.03226   17 2004-09-01
79 212.727273 343.618182 467.2727 39.59091 35.45455   30 2005-12-18
Finally, we combine the unchanged billing periods with the periods that we have split in two. When the data frames being combined have exactly the same set of variables, the rbind() function can be used to combine them.
> periods <- rbind(utilities[-eventPeriods, ], modifiedPeriods, newPeriods)

We should check that our new data frame has the same basic properties as the original data frame. The code below simply calculates the sum of each variable in the data set (other than the start dates).
> apply(utilities[, -1], 2, sum)
 therms     gas    KWHs   elect    temp    days 
8899.00 9342.40 41141.00 4058.77 5417.00 2929.00 
> apply(periods[, -1], 2, sum)
 therms     gas    KWHs   elect    temp    days 
8899.00 9342.40 41141.00 4058.77 5417.00 2929.00 
Now that we have the data in a format where the significant events occur at the start of a billing period, the last step is to calculate average daily usage and costs in the time periods between the events. We need to create a new variable phase which will identify the “time phase” for each billing period (between which events did the billing period occur). The algorithm we will use for this task is shown below:
This can be done in R using a for loop and the fact that dates can be compared using <, just like numbers.
> periods$phase <- 5
> for (i in events) {
      periods$phase[periods$start < i] <-
          periods$phase[periods$start < i] - 1
  }
> periods$phase
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[55] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 5 4 4 4 4 4 4
[82] 4 4 4 4 5 5 5 5 5 5 5 1 2 3 4 2 3 4 5
Now we can sum usage and costs for each phase, then divide by the number of days to obtain averages that can be compared between phases.
> phases <- aggregate(periods[c("therms", "gas", "KWHs", "elect", "days")],
                      list(phase=periods$phase), sum)
> phases
  phase     therms       gas      KWHs     elect days
1     1  871.27586  693.4134  5434.241  556.6545  388
2     2 5656.69188 5337.3285 24358.049 2291.1400 1664
3     3   53.19355  103.1574  1648.258  170.8748  132
4     4 1528.11144 2017.5025  5625.179  551.8997  470
5     5  789.72727 1190.9982  4075.273  488.2009  275
> phaseAvgs <- phases[c("therms", "gas", "KWHs", "elect")]/phases$days
> phaseAvgs
     therms       gas     KWHs    elect
1 2.2455563 1.7871481 14.00578 1.434677
2 3.3994543 3.2075291 14.63825 1.376887
3 0.4029814 0.7814956 12.48680 1.294506
4 3.2513009 4.2925584 11.96847 1.174255
5 2.8717355 4.3309025 14.81917 1.775276
The last step above works because when a data frame is divided by a vector, each variable in the data frame gets divided by the vector. This is not necessarily obvious from the code; a more explicit way to perform the operation is to use the sweep() function, which forces us to explicitly state that, for each row of the data frame (MARGIN=1), we are dividing (FUN="/") by the corresponding value from a vector (STAT=phases$days).
> phaseSweep <- sweep(phases[c("therms", "gas", "KWHs", "elect")],
                      MARGIN=1, STAT=phases$days, FUN="/")

The function identical() is useful for demonstrating that this approach produces exactly the same values as before.
> identical(phaseAvgs, phaseSweep)
[1] TRUE
The values that stand out amongst the average daily energy values are the gas usage and cost during phase 3 (after the first two storm windows were replaced, but before the second set of four storm windows were replaced). The naive interpretation is that the first two storm windows were incredibly effective, but the second four storm windows actually made things worse again!
At first sight this appears strange, but it is easily explained by the fact that phase 3 coincided with summer months, as shown below using the table() function.
> table(months(periods$start), periods$phase)[month.name, ]
            1 2 3 4 5
  January   1 5 0 1 1
  February  1 5 0 1 1
  March     0 5 0 1 1
  April     1 5 1 1 1
  May       1 3 1 1 1
  June      2 4 1 1 1
  July      2 5 1 1 1
  August    1 5 1 1 1
  September 1 5 0 3 0
  October   1 5 0 2 0
  November  1 4 0 2 0
  December  1 5 0 2 1
The function months() extracts the names of the months from the dates in the start variable. Subsetting the resulting table by month.name just rearranges the order of the rows of the table; this ensures that the months are reported in calendar order rather than alphabetical order.
We would expect the gas usage (for heating) to be a lot lower during summer. This is confirmed by calculating the daily average usages and costs for each month. Again we must take care to get the answer in calendar order; this time we do that by aggregating over a factor that is based on extracting the months from the start variable, with the order of the levels of the factor explicitly set to be month.name. These results are presented graphically in Figure 11.14.
> months <- aggregate(periods[c("therms", "gas", "KWHs", "elect", "days")],
                      list(month=factor(months(periods$start),
                                        levels=month.name)),
                      sum)
> cbind(month=months$month,
        months[c("therms", "gas", "KWHs", "elect")]/months$days)
       month    therms       gas     KWHs    elect
1    January 7.7763713 7.5162025 12.62025 1.151646
2   February 5.9300412 5.8969136 11.02058 1.036337
3      March 3.9056604 3.9867453 11.65094 1.074009
4      April 1.6666667 1.9054852 10.93671 1.035907
5        May 0.6547085 0.9880269 12.47085 1.335202
6       June 0.2214533 0.5661592 16.29066 1.756298
7       July 0.2067669 0.5886090 21.85338 2.347444
8     August 0.2354594 0.6006794 17.01344 1.856210
9  September 1.1079020 1.2773281 14.07927 1.279034
10   October 3.4347826 3.4618261 11.83913 1.106087
11  November 5.3669725 5.5368807 13.55046 1.191651
12  December 7.2182540 7.1405159 13.12302 1.177976
Attempting to determine whether any of the significant events led to a significant change in energy consumption and cost for this household is clearly going to require careful analysis. Fortunately, having prepared the data, our work is done, so we will leave the difficult part to those who follow.
Paul Murrell
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.