11.9 Regular expressions

Two of the tasks we looked at when working with the long Maori place name involved treating both `ng' and `wh' as if they were a single letter, either counting the number of occurrences of these character pairs, or replacing them both with full stops. In each case, we performed the task in two steps, one for `ng' and one for `wh'. For example, when converting to full stops, we performed the following two steps: convert all occurrences of `ng' to a full stop; convert all occurrences of `wh' to a full stop. Conceptually, it would be simpler, and more efficient, to perform the task in a single step: convert all occurrences of `ng' or `wh' to a full stop. Regular expressions allow us to do this.

With the place name in the variable called placename, converting both `ng' and `wh' to full stops in a single step is achieved as follows:

> gsub("ng|wh", ".", placename)

[1] "Taumata.akata.iha.akoauauotamateaturipukakapikimau.ahoronukupokai.enuakitanatahu"

A similar approach allows us to count the number of occurrences of either `ng' or `wh' in the place name in a single step.

> gregexpr("ng|wh", placename)[[1]]

[1]  8 15 20 54 70
attr(,"match.length")
[1] 2 2 2 2 2

The regular expression we are using, ng|wh, describes a pattern: the character `n' followed by the character `g', or the character `w' followed by the character `h'. The vertical bar, |, is a metacharacter. It does not have its normal meaning, but instead denotes alternation between two patterns; a match will occur if the string contains either the pattern to the left of the vertical bar or the pattern to the right of the vertical bar. The characters `n', `g', `w', and `h' are all literals; they have their normal meaning.
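To make the count itself explicit, the following sketch reconstructs the place name from the gsub() output shown above (an assumption, because the variable is created earlier in the chapter) and counts the matches. It also checks that the single-step replacement agrees with the two-step approach.

```r
# Place name reconstructed from the gsub() output above (an assumption;
# the variable is created earlier in the chapter).
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# gregexpr() returns the start position of every match; when nothing
# matches it returns -1, so we count only the positive positions.
matches <- gregexpr("ng|wh", placename)[[1]]
nmatch <- sum(matches > 0)

# The single-step replacement agrees with handling `ng' and `wh' separately.
onestep <- gsub("ng|wh", ".", placename)
twostep <- gsub("wh", ".", gsub("ng", ".", placename))
same <- identical(onestep, twostep)
```

Counting positive positions, rather than calling length() on the result, avoids reporting one match for a string that contains no matches at all.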


11.9.1 Search and replace

A message to the R-help mailing list[11.18] posed the following problem: given a set of strings consisting of one or more digits followed by one or more letters (plus possibly some more digits), how can the initial digits be split from the remainder of each string? Here are some example strings:

Initial strings:

> strings <- c("123abc", "12cd34", "1e23")

It is easy to specify a regular expression that matches one or more digits at the start of a string, ^[[:digit:]]+, but how do we separate that part from the rest of the string? We will use two regular expression features: the special meaning of parentheses and the notion of a backreference.

When parentheses are used in a regular expression, they designate sub-expressions within the regular expression. For example, the following regular expression ^([[:digit:]]+)(.*) has two sub-expressions because it contains two pairs of parentheses. The first sub-expression is the one associated with the first opening parenthesis, ([[:digit:]]+) and matches the digits at the start of the string. The second sub-expression matches the remainder of the string, (.*).
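One way to see what each sub-expression matches is sketched below. It uses the base R functions regexec() and regmatches(), which are not otherwise used in this section, so treat it as an illustration rather than as part of the solution.

```r
strings <- c("123abc", "12cd34", "1e23")

# regexec() records the match for the whole pattern and for each
# parenthesised sub-expression; regmatches() extracts the matched text.
m <- regexec("^([[:digit:]]+)(.*)", strings)
parts <- regmatches(strings, m)

# Each element holds: the whole match, sub-expression 1, sub-expression 2.
second <- parts[[2]]
```

For the string "12cd34", the three values are the whole string, the leading digits "12", and the remainder "cd34".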

When a regular expression contains sub-expressions, it is possible to refer to the text matched by each sub-expression when specifying the replacement text. A sub-expression is referred to by specifying the escape sequence \\<n>, where <n> is the number of the sub-expression; for example, \\1 refers to the first sub-expression. The backslash is doubled because the replacement text is written as an R string.

The following code uses these features to replace the original strings with strings that have the initial digits separated from the rest of the string by three spaces.

Split strings into two pieces:

> pieces <- gsub("^([[:digit:]]+)(.*)", "\\1   \\2", strings)
> pieces

[1] "123   abc" "12   cd34" "1   e23"

> read.table(textConnection(pieces), 
              col.names=c("Digits", "Remainder"))

  Digits Remainder
1    123       abc
2     12      cd34
3      1       e23
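The round trip through read.table() relies on the remainder of each string containing no white space. A possible alternative, sketched here under that caveat, uses one backreference at a time and builds the data frame directly. The names pattern, digits, remainder, and splitStrings are introduced only for this illustration.

```r
strings <- c("123abc", "12cd34", "1e23")
pattern <- "^([[:digit:]]+)(.*)"

# Keep only the first sub-expression, then only the second.
digits <- sub(pattern, "\\1", strings)
remainder <- sub(pattern, "\\2", strings)

# Assemble the two pieces without writing and re-reading text.
splitStrings <- data.frame(Digits=as.integer(digits),
                           Remainder=remainder,
                           stringsAsFactors=FALSE)
```

Because the pattern is anchored at the start of the string, sub() and gsub() behave identically here; at most one match is possible per string.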

11.9.2 Case study: Crohn's disease


Figure: The digestive tract.[11.19] Crohn's disease is an inflammation of the small intestine.

Genetic data consist of (usually large amounts of) information on the genotypes of individuals, that is, which alleles people have at particular loci on their chromosomes. Genetics is a very fast-moving field, with many new methods being developed for collecting genetic data and a number of specialized software systems for performing analyses. One of the problems that genetics researchers face is the difficulty of dealing with many different data formats. The various methods for collecting genetic data produce a variety of raw formats and some of the analysis software requires the data to be in a very specific format for processing.

We will consider an example of this problem, using data from a study of Crohn's disease (an inflammatory bowel disease).[11.20] The data were originally obtained in a format appropriate for analysis using the LINKAGE software,[11.21] but my colleague wanted to use the PHASE software[11.22] instead.

The original format looks like this:

PED054  430  0  0  1  0  1  3  3  1  4  1  4  2  2 ...
PED054  412  430  431  2  2  1  3  1  3  4  1  4   ...
PED054  431  0  0  2  0  3  3  3  3  1  1  2  2  1 ...
PED058  438  0  0  1  0  3  3  3  3  1  1  2  2  1 ...
PED058  470  438  444  2  2  3  3  3  3  1  1  2   ...
PED058  444  0  0  2  0  3  3  3  3  1  1  2  2  1 ...
...

Each line in the file represents one individual. On each line, the first value is a pedigree label (all individuals who are related to each other are grouped into a single pedigree), the second value is the individual's unique identifier, and the third and fourth values identify the individual's genetic parents (if they exist within the data set). The fifth value on each line indicates gender (1 is male, 2 is female) and the sixth value indicates whether the individual has Crohn's disease (1 is no disease, 2 is disease, 0 is unknown). From the first three lines of the data file we can see that individual 412 is the child of individuals 430 and 431, she is female, and she has Crohn's disease. We do not know whether either of her parents has the disease.

The remainder of each line consists of pairs of values, where each pair gives the alleles for the individual at a particular locus. For example, individual 412 has alleles 1 and 3 at locus 1, 1 and 3 at locus 2, and 4 and 1 at locus 3.
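The layout of a single line can be checked with a short sketch. The string below is the visible start of the line for individual 412, cut off after the first three loci, so it is an abbreviated illustration and not a complete record. The names fields, id, parents, and alleles are introduced only for this example.

```r
# Start of the line for individual 412 (first three loci only).
line <- "PED054  412  430  431  2  2  1  3  1  3  4  1"

# Split on runs of white space to get the individual values.
fields <- strsplit(line, "[[:space:]]+")[[1]]

id <- fields[2]          # unique identifier
parents <- fields[3:4]   # identifiers of the genetic parents

# Everything after the sixth value is allele pairs: one row per locus.
alleles <- matrix(fields[-(1:6)], ncol=2, byrow=TRUE,
                  dimnames=list(NULL, c("allele1", "allele2")))
```

The third row of alleles holds "4" and "1", which agrees with the description of locus 3 above.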

We want to convert the data to the following format:

430 
1 3 4 4 2 3 2 3 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 1 2  ...
3 1 1 2 1 1 4 2 3 2 2 1 1 1 2 2 3 2 1 3 1 2 3 1 2  ...
412 
1 1 4 4 2 3 4 3 3 2 2 2 1 1 2 2 3 ? 1 3 1 2 3 1 2  ...
3 3 1 2 1 1 2 2 3 4 4 1 2 3 2 2 3 ? 1 3 1 2 3 1 2  ...
431 
3 3 1 2 1 1 2 2 3 4 4 ? 2 3 2 2 3 3 1 3 1 2 3 1 2  ...
3 3 1 2 1 1 2 2 3 4 4 ? 2 3 2 2 3 3 1 3 1 2 3 1 2  ...
...

In this format, the information for each individual is stored on three lines. The first line gives the individual's unique identifier, the second line gives the first allele at each locus, and the third line gives the second allele at each locus. Instead of alleles being in pairs of columns, they are in pairs of rows. Furthermore, any zeroes in the original allele information, which indicate missing values, must be encoded as question marks (e.g., individual 412 has missing values at the 18th locus).

Performing this transformation will involve a number of the file handling, data manipulation and text processing tools that we have discussed.

The first step is to read the original file into R. We keep all values as strings so that we can work with the data as one large matrix. The read.table() function conveniently splits the data into separate values for us. We also calculate the number of individuals in the data set (there are 387).

> crohn <- as.matrix(read.table("Dalydata.txt", 
                                 colClasses="character"))
> ncase <- dim(crohn)[1]
> crohn

     V1       V2    V3    V4    V5  V6  V7  V8  V9  V10 ...
[1,] "PED054" "430" "0"   "0"   "1" "0" "1" "3" "3" "1" ...
[2,] "PED054" "412" "430" "431" "2" "2" "1" "3" "1" "3" ...
[3,] "PED054" "431" "0"   "0"   "2" "0" "3" "3" "3" "3" ...
[4,] "PED058" "438" "0"   "0"   "1" "0" "3" "3" "3" "3" ...
[5,] "PED058" "470" "438" "444" "2" "2" "3" "3" "3" "3" ...
[6,] "PED058" "444" "0"   "0"   "2" "0" "3" "3" "3" "3" ...
...
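The effect of colClasses="character" can be seen in a small sketch. It reads two abbreviated lines (only the first eight values of each) through the text argument of read.table() instead of from the real data file; the names sampleLines, asStrings, and asGuessed are introduced only for this example.

```r
# Two abbreviated lines in the original format (first eight values only).
sampleLines <- c("PED054  430  0    0    1  0  1  3",
                 "PED054  412  430  431  2  2  1  3")

# With colClasses="character" every value stays a string, so
# as.matrix() produces a character matrix.
asStrings <- as.matrix(read.table(text=sampleLines,
                                  colClasses="character"))

# Without it, read.table() guesses a type for each column, and the
# numeric-looking columns become integers.
asGuessed <- read.table(text=sampleLines)
guessedClasses <- sapply(asGuessed, class)
```

Keeping everything as strings means that the later steps, which paste values together and substitute ? for 0, need no type conversion.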

It is a simple matter to extract the unique identifiers for the individuals from this matrix. These are just the second column of the matrix.

> ids <- crohn[, 2]
> ids

[1] "430" "412" "431" "438" "470" "444" "543" "516" "513" ...

These identifiers represent the first, fourth, seventh, etc. lines of the final format. We can generate an empty object with the appropriate number of lines and start to fill in the lines that we know.

> crohnPHASE <- vector("character", 3*ncase) 
> crohnPHASE[seq(by=3, length.out=ncase)] <- ids
> crohnPHASE

[1] "430" ""    ""    "412" ""    ""    "431" ""    ""    ...
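The indexing pattern is easier to see with a small number of individuals. This sketch assumes just three individuals and shows which lines of the final format each call to seq() selects; the variable names are introduced only for this example.

```r
ncase <- 3   # a toy value; the real data set has 387 individuals

idLines      <- seq(1, by=3, length.out=ncase)  # identifier lines
allele1Lines <- seq(2, by=3, length.out=ncase)  # first-allele lines
allele2Lines <- seq(3, by=3, length.out=ncase)  # second-allele lines

# Together the three sets cover every line exactly once.
allLines <- sort(c(idLines, allele1Lines, allele2Lines))
```

When the starting value is omitted, as in seq(by=3, length.out=ncase), the sequence starts at 1.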

The genotype information (the pairs of alleles) requires considerable rearrangement. To make it easy to see what we are doing, we will just extract that part of the data set and take a note of how many genotypes we have (there are 103).

> genotypes <- crohn[, -(1:6)]
> ngenotype <- dim(genotypes)[2]/2
> genotypes

     V7  V8  V9  V10 V11 V12 V13 V14 V15 V16 ...
[1,] "1" "3" "3" "1" "4" "1" "4" "2" "2" "1" ...
[2,] "1" "3" "1" "3" "4" "1" "4" "2" "2" "1" ...
[3,] "3" "3" "3" "3" "1" "1" "2" "2" "1" "1" ...
...

What we want to do is take the first allele of each pair (the odd-numbered columns) for an individual and put them together in a single row. We can extract these columns using simple indexing:

> allele1 <- genotypes[, 2*(1:ngenotype) - 1]
> allele1

     V7  V9  V11 V13 V15 V17 V19 V21 V23 V25 ...
[1,] "1" "3" "4" "4" "2" "3" "2" "3" "3" "4" ...
[2,] "1" "1" "4" "4" "2" "3" "4" "3" "3" "2" ...
[3,] "3" "3" "1" "2" "1" "1" "2" "2" "3" "4" ...
...

Each row of this matrix contains the information we need for one row of the final format. We can combine all of the strings on each row of the matrix into a single string by using apply() to call the paste() function on each row of the matrix.

> alleleLine1 <- apply(allele1, 1, paste, collapse=" ")
> alleleLine1

[1] "1 3 4 4 2 3 2 3 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 ...
[2] "1 1 4 4 2 3 4 3 3 2 2 2 1 1 2 2 3 0 1 3 1 2 3 ...
[3] "3 3 1 2 1 1 2 2 3 4 4 0 2 3 2 2 3 3 1 3 1 2 3 ...
...
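The same two steps can be tried on a toy matrix. This sketch uses the first three loci of the first two individuals shown above; it is an abbreviated illustration, not the full data set.

```r
# First three loci (six columns) for individuals 430 and 412.
genotypes <- rbind(c("1", "3", "3", "1", "4", "1"),
                   c("1", "3", "1", "3", "4", "1"))
ngenotype <- dim(genotypes)[2]/2

# Odd-numbered columns hold the first allele of each pair.
allele1 <- genotypes[, 2*(1:ngenotype) - 1]

# apply() with margin 1 calls paste() once per row, collapsing the
# values in that row into a single space-separated string.
alleleLine1 <- apply(allele1, 1, paste, collapse=" ")
```

The result has one string per individual, "1 3 4" and "1 1 4", which matches the start of the lines shown above.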

These strings now represent the second, fifth, eighth, etc. lines of the final format, so we can fill in more of the crohnPHASE object. At this point, we also do the conversion of 0 values to ? symbols.

> crohnPHASE[seq(2, by=3, length.out=ncase)] <- 
       gsub("0", "?", alleleLine1)
> crohnPHASE

[1] "430"                                          
[2] "1 3 4 4 2 3 2 3 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 ...
[3] ""                                             
[4] "412"                                          
[5] "1 1 4 4 2 3 4 3 3 2 2 2 1 1 2 2 3 ? 1 3 1 2 3 ...
[6] ""                                             
...
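Replacing every 0 character is safe only if no allele code contains a zero as one of several digits, which appears to be the case for these data. The sketch below uses hypothetical allele codes, including a made-up code of 10, to show the difference, and an alternative that replaces whole values in the matrix before pasting.

```r
# Hypothetical allele codes; "10" is invented to show the problem.
alleles <- rbind(c("1", "0", "10"),
                 c("0", "2", "3"))

# Replacing after pasting also changes the zero inside "10".
naive <- gsub("0", "?", "1 0 10")

# Replacing whole values in the matrix first leaves "10" intact.
alleles[alleles == "0"] <- "?"
safeLines <- apply(alleles, 1, paste, collapse=" ")
```

Here naive is "1 ? 1?", whereas safeLines contains "1 ? 10" and "? 2 3".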

The same series of steps can be carried out for the second allele of each pair (the even-numbered columns) to generate the third, sixth, ninth, etc. lines of the final format, and the final step is to write the new lines to a file.

> allele2 <- genotypes[, 2*(1:ngenotype)]
> alleleLine2 <- apply(allele2, 1, paste, collapse=" ")
> crohnPHASE[seq(3, by=3, length.out=ncase)] <- 
       gsub("0", "?", alleleLine2)
> crohnPHASE

[1] "430"                                          
[2] "1 3 4 4 2 3 2 3 3 4 4 2 2 3 2 2 3 3 1 3 1 2 3 ...
[3] "3 1 1 2 1 1 4 2 3 2 2 1 1 1 2 2 3 2 1 3 1 2 3 ...
[4] "412"                                          
[5] "1 1 4 4 2 3 4 3 3 2 2 2 1 1 2 2 3 ? 1 3 1 2 3 ...
[6] "3 3 1 2 1 1 2 2 3 4 4 1 2 3 2 2 3 ? 1 3 1 2 3 ...
...

> writeLines(crohnPHASE, "DalydataPHASE.txt")
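The whole conversion can also be collected into a single function. The following is a sketch only: the function name linkageToPHASE and the toy input are hypothetical, and the toy input includes one invented missing value (0) so that the ? conversion is exercised. Instead of filling every third element, it interleaves the three kinds of line with rbind().

```r
# Hypothetical helper: convert a character matrix in the original
# format to a vector of lines in the three-lines-per-individual format.
linkageToPHASE <- function(ped) {
    ids <- ped[, 2]
    genotypes <- ped[, -(1:6), drop=FALSE]
    odd <- 2*seq_len(ncol(genotypes)/2) - 1
    # One string per individual for each allele of the pair.
    line1 <- apply(genotypes[, odd, drop=FALSE], 1, paste, collapse=" ")
    line2 <- apply(genotypes[, odd + 1, drop=FALSE], 1, paste, collapse=" ")
    # rbind() gives one column per individual; as.vector() reads the
    # columns in order, giving id, first alleles, second alleles, ...
    as.vector(rbind(ids,
                    gsub("0", "?", line1, fixed=TRUE),
                    gsub("0", "?", line2, fixed=TRUE)))
}

# Toy input: three loci per individual; the final 0 in the first row
# is an invented missing value.
toy <- rbind(c("PED054", "430", "0",   "0",   "1", "0",
               "1", "3", "3", "1", "4", "0"),
             c("PED054", "412", "430", "431", "2", "2",
               "1", "3", "1", "3", "4", "1"))
result <- linkageToPHASE(toy)
```

The result could then be passed to writeLines() in the same way as crohnPHASE.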


11.9.3 Flashback: Regular expressions in HTML Forms

11.9.4 Flashback: Regular expressions in SQL

Paul Murrell

This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.