12.4 Functions

A function call is an expression of the form:

functionName(arg1, arg2)

A function can have any number of arguments, including zero. Arguments can be specified by position or by name (name overrides position). Arguments are optional if they have a default value.
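The rules for matching arguments can be sketched with a few calls. This is a minimal illustration only; power() is a hypothetical function defined here for the example and is not part of R.

```r
# seq() has arguments named from, to, and by (among others).
byPosition <- seq(1, 10, 3)            # matched by position: from=1, to=10, by=3
byName     <- seq(by=3, to=10, from=1) # matched by name, so order does not matter
mixed      <- seq(10, from=1, by=3)    # from and by are named, so 10 is matched to 'to'

# power() is a hypothetical function; 'exponent' has a default value,
# so it is optional in a call.
power <- function(base, exponent=2) {
    base^exponent
}
squared <- power(3)              # uses the default exponent: 9
cubed   <- power(3, exponent=3)  # overrides the default: 27
```

All three calls to seq() produce the same sequence (1, 4, 7, 10).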

This section provides a list of some of the functions that are useful for working with data in R. The descriptions of these functions are very brief and only some of the arguments to each function are mentioned. For a complete description of a function and its arguments, the relevant function help page should be consulted.

12.4.1 Generating vectors

c(...): Concatenate or combine values (or vectors of values) to make a vector. All values must be of the same type (or they will be coerced to the same type). This function can be used to concatenate lists.
seq(from, to, by, length.out): Generate a sequence of values starting at from and ending at (not greater than) to, either in steps of by or for a total of length.out values.
rep(x, times, each, length.out): Repeat the whole vector x the number of times given by times, or repeat each value in the vector the number of times given by each, or repeat the values in the vector until the total number of values is length.out.
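The following short sketch shows these three functions in use; the values are made up purely for illustration.

```r
# Combine values into a vector; mixed types are coerced to a common type.
numbers <- c(1, 2, 3)
mixed   <- c(1, "a", TRUE)   # everything becomes a string: "1" "a" "TRUE"

# Two ways to describe the same sequence.
bySteps  <- seq(0, 1, by=0.25)        # 0.00 0.25 0.50 0.75 1.00
byLength <- seq(0, 1, length.out=5)   # the same five values

# Three ways to repeat the values 1, 2, 3.
wholeVector <- rep(1:3, times=2)        # 1 2 3 1 2 3
eachValue   <- rep(1:3, each=2)         # 1 1 2 2 3 3
fixedLength <- rep(1:3, length.out=5)   # 1 2 3 1 2
```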

12.4.2 Numeric functions

sum(..., na.rm=FALSE)

Sum the value of all arguments. Arguments should be vectors, but, for example, matrices will be accepted. If NA values are included, the result is NA (unless na.rm=TRUE). This function is generic.

min(..., na.rm=FALSE)
max(..., na.rm=FALSE)
range(..., na.rm=FALSE)

Calculate the minimum, maximum, or range of all values in all arguments.

floor(x)
ceiling(x)
round(x, digits)

Round a numeric value to a number of digits or to an integer value. floor() returns the largest integer not greater than x and ceiling() returns the smallest integer not less than x.
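As a brief sketch of the numeric functions described in this section, the code below works on a small made-up vector that contains a missing value.

```r
x <- c(3.14159, NA, -2.5, 10)   # example values, including one NA

total      <- sum(x)               # NA, because x contains an NA
totalNoNA  <- sum(x, na.rm=TRUE)   # NA removed first: 10.64159
smallest   <- min(x, na.rm=TRUE)   # -2.5
largest    <- max(x, na.rm=TRUE)   # 10
bothEnds   <- range(x, na.rm=TRUE) # minimum and maximum together

rounded    <- round(3.14159, 2)    # 3.14
roundDown  <- floor(-2.5)          # -3 (largest integer not greater than -2.5)
roundUp    <- ceiling(-2.5)        # -2 (smallest integer not less than -2.5)
```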

12.4.3 Comparisons

identical(x, y): Tests whether two objects are equivalent down to the binary storage level.
all.equal(target, current, tolerance): Tests whether two numeric values are effectively equal (i.e., only differ by a tiny amount, as specified by tolerance).
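The difference between the two functions shows up with floating-point arithmetic. In the sketch below, note that all.equal() returns a description of the difference, rather than FALSE, when the values are not equal, so it is usually wrapped in isTRUE().

```r
x <- 0.1 + 0.2
y <- 0.3

exact <- identical(x, y)            # FALSE: the stored values differ slightly
close <- isTRUE(all.equal(x, y))    # TRUE: equal within the default tolerance

# A larger difference is only accepted if the tolerance is widened.
strict <- isTRUE(all.equal(1, 1.1))                 # FALSE
loose  <- isTRUE(all.equal(1, 1.1, tolerance=0.2))  # TRUE
```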

12.4.4 Subsetting

Subsetting is generally performed via the [ operator (e.g., candyCounts[1:4]). In general, the result is of the same class as the original object that is being subsetted. The subset may be specified as numeric indices, string names, or a logical vector (of the same length as the original object).

When subsetting objects with more than one dimension, e.g., data frames, matrices or arrays, the subset may be several vectors, separated by commas (e.g., candy[1:4, 4]).

The [[ operator selects only one component of an object. This is typically used to extract a component from a list.

subset(x, subset, select): Extract the rows of the data frame x that satisfy the condition in subset and the columns that are named in select. The advantage of this over the normal subset syntax is that column names are searched for within the data frame (i.e., you can use just count; no need for candy$count).
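The different forms of subsetting can be sketched as follows. The candy data frame and candyCounts vector built here are small made-up stand-ins for the objects named in the text; their contents are assumptions for illustration only.

```r
# Hypothetical example data (the values are invented).
candy <- data.frame(shape=c("round", "oval", "long", "round"),
                    pattern=c("pattern", "plain", "pattern", "plain"),
                    count=c(2, 0, 3, 1),
                    stringsAsFactors=FALSE)
candyCounts <- candy$count

firstTwo <- candyCounts[1:2]               # numeric indices: 2 0
nonZero  <- candyCounts[candyCounts > 0]   # logical vector: 2 3 1

cell     <- candy[1:2, 3]                  # rows 1 to 2 of column 3: 2 0
oneCol   <- candy[["count"]]               # a single component (a column)

# Column names are looked up inside the data frame,
# so count can be used instead of candy$count.
some <- subset(candy, count > 0, select=c(shape, count))
```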

12.4.5 Merging

rbind(...): Create a new data frame by combining two or more data frames that have the same columns. The result is the union of the rows of the original data frames. This function also works for matrices.
cbind(...): Create a new data frame by combining two or more data frames that have the same number of rows. The result is the union of the columns of the original data frames. This function also works for matrices.
merge(x, y): Create a new data frame by combining two data frames in a database-join operation. The two data frames will usually have different columns, though they will typically share at least one column, which is used to match the rows. The default join is a natural join. Additional arguments allow for the equivalent of inner joins and outer joins.
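A minimal sketch of the three merging functions is shown below. The data frames are invented for the example, and the all argument to merge() is one of the additional arguments mentioned above (it requests an outer join).

```r
# Two data frames with the same columns: stack the rows.
a <- data.frame(id=1:2, x=c(10, 20))
b <- data.frame(id=3:4, x=c(30, 40))
stacked <- rbind(a, b)            # 4 rows, 2 columns

# Two data frames with the same number of rows: place the columns side by side.
extra <- data.frame(y=c("p", "q"), stringsAsFactors=FALSE)
widened <- cbind(a, extra)        # 2 rows, 3 columns

# Two data frames that share the id column: match rows on id.
left  <- data.frame(id=1:3, x=c(10, 20, 30))
right <- data.frame(id=2:4, z=c("b", "c", "d"), stringsAsFactors=FALSE)
inner <- merge(left, right)            # only ids found in both: 2 and 3
outer <- merge(left, right, all=TRUE)  # all ids, with NA where there is no match
```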

12.4.6 Summarizing data frames

aggregate(x, by, FUN): Call the function FUN for each subset of x defined by the grouping factors in the list by. It is possible to apply the function to multiple variables (x can be a data frame) and it is possible to group by multiple factors (the list by can have more than one component). The result is a data frame. The names used in the by list are used for the relevant columns in the result. If x is a data frame, then the names of the variables in the data frame are used for the relevant columns in the result.
sweep(x, MARGIN, STATS, FUN): Take an array and add or subtract (more generally, apply the function FUN) the STATS values from the rows or columns (depending on the value of MARGIN). For example, remove column means from all columns.
table(...): Generate a table of counts for one or more factors. The result is a "table" object, with as many dimensions as there were arguments.
xtabs(formula, data): Similar to table() except factors to cross-tabulate are expressed in a formula. Symbols in the formula will be searched for in the data frame given by the data argument.
ftable(...): Similar to table() except that the result is always a two-dimensional "ftable" object, no matter how many factors are cross-tabulated. This makes for a more readable display.
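The following sketch shows several of these functions applied to a small made-up data frame; the candy data here is invented for illustration.

```r
# Hypothetical example data (the values are invented).
candy <- data.frame(shape=c("round", "oval", "round", "oval"),
                    pattern=c("plain", "plain", "stripe", "stripe"),
                    count=c(2, 3, 4, 1),
                    stringsAsFactors=FALSE)

# Sum the count column within each shape; the result is a data frame
# with columns named shape and count.
totals <- aggregate(candy["count"], by=list(shape=candy$shape), FUN=sum)

# Count how many rows there are for each shape.
shapeCounts <- table(candy$shape)

# Add up count for each shape, using a formula.
shapeTotals <- xtabs(count ~ shape, data=candy)

# Remove the column means from each column of a matrix
# (subtraction is the default FUN for sweep()).
m <- matrix(1:6, nrow=2)
centred <- sweep(m, 2, colMeans(m))
```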

12.4.7 Looping over variables in a data frame

apply(X, MARGIN, FUN, ...)

Call a function on each row or each column of a data frame or matrix. The function FUN is called for each row of the matrix X (if MARGIN equals 1; if MARGIN is 2, the function is called for each column of X). All other arguments are passed as arguments to FUN.

The data structure that is returned depends on the value returned by FUN. In the simplest case, where FUN returns a single value, the result is a vector with one value per row (or column) of the original matrix X.

tapply(X, INDEX, FUN, ...)

Call a function once for each subset of the vector X, where the subsets correspond to unique values of the factor INDEX. The INDEX argument can be a list of factors, in which case the subsets are unique combinations of the levels of the factors.

The result depends on how many factors are given in INDEX. For the simple case, where there is only one factor, and FUN returns a single value, the result is a vector.

lapply(X, FUN, ...)

Call the function FUN once for each component of the list X. The result is a list.

sapply(X, FUN, ...)

Similar to lapply(), but will simplify the result to a vector if possible (e.g., if all components of X are vectors and FUN returns a single value).
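The four looping functions can be compared in a short sketch. The matrix, vector, factor, and list used here are invented for the example.

```r
# A 2-by-3 matrix, filled column by column: row 1 is 1 3 5, row 2 is 2 4 6.
m <- matrix(1:6, nrow=2)
rowTotals <- apply(m, 1, sum)   # one value per row: 9 12
colTotals <- apply(m, 2, sum)   # one value per column: 3 7 11

# Sum the counts separately for each level of a factor.
counts <- c(2, 3, 4, 1)
shape  <- factor(c("round", "oval", "round", "oval"))
byShape <- tapply(counts, shape, sum)   # oval: 4, round: 6

# Call mean() on each component of a list.
values <- list(a=1:3, b=4:6)
asList   <- lapply(values, mean)   # a list with components a and b
asVector <- sapply(values, mean)   # simplified to a named numeric vector
```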

12.4.8 Sorting

sort(x): Put a vector in order. For sorting by more than one factor, see order().
order(...): Calculate an ordering of one or more vectors (all the same length). The result is a numeric vector, which can be used, via subsetting, to reorder another vector.
with(data, expr): Run the code in expr and search within the variables of the data frame specified by data for any symbols used in expr.
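A small sketch of how these functions fit together is given below, again using a made-up candy data frame.

```r
# Hypothetical example data (the values are invented).
candy <- data.frame(shape=c("round", "oval", "long"),
                    count=c(2, 0, 3),
                    stringsAsFactors=FALSE)

sortedCounts <- sort(candy$count)    # 0 2 3

# order() gives the indices that would put the counts in order,
# which can then be used to reorder the rows of the data frame.
ordering    <- order(candy$count)    # 2 1 3
sortedCandy <- candy[ordering, ]

# with() lets the expression refer to shape and count directly.
shapesByCount <- with(candy, shape[order(count, decreasing=TRUE)])
```

The last expression produces the shapes from the largest count to the smallest ("long", "round", "oval").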

12.4.9 Data import/export

readLines(con)

Read the text file specified by the file name and/or path in con. The file can also be a URL. The result is a string vector with one element for each line in the file.

read.table(file, header, skip, sep)

Read the text file specified by the string value in file, treating each line of text as a case in a data set that contains values for each variable in the data set, with values separated by the string value in sep. Ignore the first skip lines in the file. If header is TRUE, treat the first line of the file as variable names.

The result is a data frame.

read.fwf(file, widths)

Read a text file in fixed-width format. The name of the file is specified by file and widths is a numeric vector specifying the width of each column of values. The result is a data frame.

read.csv(file)

A front end for read.table() with default argument settings designed for reading a text file in CSV format. The result is a data frame.

scan(file, what)

Read data from a text file and produce a vector of values. The type of the value provided for the argument what determines how the values in the text file are interpreted. If this argument is a list, then the result is a list of vectors, each of a type corresponding to the relevant component of what.

This function is faster than read.table() and its kin.
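The import functions can be tried out without any external data by first writing a small file. In this sketch, the file contents are invented and the file is a temporary one created with tempfile(); the quiet argument to scan() just suppresses a progress message.

```r
# Create a small CSV file to read (made-up contents).
path <- tempfile(fileext=".csv")
writeLines(c("shape,count", "round,2", "oval,3"), path)

# One string per line of the file.
rawLines <- readLines(path)

# A data frame, with the first line used for the variable names.
fromTable <- read.table(path, header=TRUE, sep=",")
fromCSV   <- read.csv(path)

# A list of two vectors (strings, then numbers), skipping the header line.
scanned <- scan(path, what=list("", 0), sep=",", skip=1, quiet=TRUE)

unlink(path)   # remove the temporary file
```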

12.4.10 Processing strings

grep(pattern, x): Search for the regular expression pattern in the string vector x and return a vector of numbers, where each number is the index to a string in x that matches pattern. If there are no matches, the result has length zero.
gsub(pattern, replacement, x): Search for the regular expression pattern in the character vector x and replace all matches with the string value in replacement. The result is a vector containing the modified strings.
substr(x, start, stop): For each string in x, return a substring consisting of the characters at positions start through stop inclusive. The first character is at position 1.
strsplit(x, split): For each string in x, break the string into separate strings, using split as the delimiter. The result is a list, with one component for each string in the original vector x.
paste(..., sep, collapse): Combine strings together, placing the string sep in between. The result is a string vector the same length as the longest of the arguments, so shorter arguments are recycled. If the collapse argument is not NULL, the result vector is collapsed to a single string, with the string collapse placed in between each element of the result.
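The string functions above can be sketched on a small made-up vector of strings.

```r
x <- c("apple pie", "banana split", "cherry pie")

matches  <- grep("pie", x)            # indices of matching strings: 1 3
replaced <- gsub("pie", "tart", x)    # "apple tart" "banana split" "cherry tart"
starts   <- substr(x, 1, 5)           # "apple" "banan" "cherr"
pieces   <- strsplit(x, " ")          # a list with one component per string

# The shorter argument ("item") is recycled to match 1:3.
labels <- paste("item", 1:3, sep="-")         # "item-1" "item-2" "item-3"
joined <- paste(c("a", "b", "c"), collapse="+")   # "a+b+c"
```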

12.4.11 Getting help

The help() function is special in that it provides information about other functions. This function displays a help page, which is online documentation that describes what a function does. This includes an explanation of all of the arguments to the function and a description of the return value for the function. Figure 12.1 shows the beginning of the help page for the Sys.sleep() function, which is obtained by typing help(Sys.sleep).

**Figure 12.1:** The help page for the function `Sys.sleep()` as displayed in a Linux system. This help page is displayed by the expression `help(Sys.sleep)`.

Sys.sleep                package:base            R Documentation

Suspend Execution for a Time Interval

Description:

     Suspend execution of R expressions for a given number of 
     seconds

Usage:

     Sys.sleep(time)

Arguments:

    time: The time interval to suspend execution for, in seconds.

Details:

     Using this function allows R to be given very low priority 
     and hence not to interfere with more important foreground 
     tasks. A typical use is to allow a process launched from R 
     to set itself up and read its input files before R execution 
     is resumed.

A special shorthand using the question mark character, ?, is provided for getting the help page for a function. Instead of typing help(Sys.sleep) it is also possible to simply type ?Sys.sleep.

Many help pages also have a set of examples to demonstrate the proper use of the function and these examples can be run using the example() function.

12.4.12 Packages

There are many thousands of R functions in existence. They are organised into collections of functions called packages. A number of packages are installed with R by default and several packages are loaded automatically in every R session. The search() function shows which packages are currently loaded, as shown below:

> search()

[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"

The top line of the help page for a function shows which package the function comes from. For example, Sys.sleep() comes from the base package (see Figure 12.1).

Other packages may be loaded using the library() function. For example, the foreign package provides functions for reading in data sets that have been stored in the native format of a different statistical software system. In order to use the read.spss() function from this package, the foreign package must be loaded as follows:

> library(foreign)

The search() function confirms that the foreign package is now loaded and all of the functions from that package are now available.

> search()

 [1] ".GlobalEnv"        "package:foreign"   "package:stats"
 [4] "package:graphics"  "package:grDevices" "package:utils"
 [7] "package:datasets"  "package:methods"   "Autoloads"
[10] "package:base"

There are usually 25 packages distributed with R. Over a thousand other packages are available for download from the web via the Comprehensive R Archive Network (CRAN).^12.2 These packages must be installed before they can be loaded. A new package can be installed using the install.packages() function.
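One possible way to combine these steps is to check whether a package is installed before trying to load it. This is only a sketch of one approach: it uses requireNamespace(), which is not mentioned above, and it prints a suggestion rather than calling install.packages() itself, because installation needs a network connection.

```r
pkg <- "foreign"   # the package used in the example above

# requireNamespace() returns TRUE if the package is installed.
isInstalled <- requireNamespace(pkg, quietly=TRUE)

if (isInstalled) {
    # character.only=TRUE tells library() that pkg holds the package name.
    library(pkg, character.only=TRUE)
} else {
    message("Package ", pkg, " is not installed; ",
            "it could be installed with install.packages(\"", pkg, "\")")
}

# Check the result against the output of search().
isAttached <- paste("package", pkg, sep=":") %in% search()
```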

12.4.13 Searching for functions

Given the name of a function, it is not difficult to find out what that function does and how to use the function by reading the function's help page. A more difficult job is to find the name of a function that will perform a particular task.

The help.search() function can be used to search for functions relating to a keyword within the current R installation and the RSiteSearch() function performs a more powerful and comprehensive web-based search of functions in almost all known R packages, R mailing list archives, and the main R manuals.^12.3 There is also a Google customised search available^12.4 that provides a convenient categorisation of the search results.

Another problem that arises is that, while information on a single function is easy to obtain, it can be harder to discover how several related functions work together. One way to get a broader overview of functions in a package is to read a package vignette (see the vignette() function). There are also overviews of certain areas of research or application provided by CRAN Task Views (see http://cran.r-project.org) and there is a growing list of books on R.

Paul Murrell

This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.