
11.8 Text processing


11.8.1 Case study: The longest placename

One of the longest place names in the world is attributed to a hill in the Hawke's Bay region of New Zealand. The name (in Maori) is:

Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu

which means “The hilltop where Tamatea with big knees, conqueror of mountains, eater of land, traveller over land and sea, played his koauau [flute] to his beloved.”

My primary-school-aged son was set a homework assignment that included counting the number of letters in this name. This task of counting the number of characters in a string is a simple example of what we will call text processing and is the sort of task that often comes up when working with data that has been stored in a text format.

Counting the number of characters in a string is something that any general-purpose language can do. Assuming that the name has been saved into a text file called placename.txt, here is how to read the name into R with scan() and count its characters with the nchar() function.

> placename <- scan("placename.txt", "character") 
> nchar(placename)

[1] 85
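To follow along without creating placename.txt, the name can be assigned directly to a variable. The following is a minimal sketch that assumes only base R; the variable nameLength is introduced here purely for illustration.

```r
# Assign the place name directly instead of reading it from placename.txt
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# nchar() counts the characters in each element of a character vector
nameLength <- nchar(placename)
print(nameLength)
```

This should report the same count of 85 as the file-based version above.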

Counting characters is a very simple text processing task, but even for something this simple, a computer is much more likely to get the right answer than a person counting by hand. We will now look at some more complex text processing tasks.

The homework assignment went on to say that, in Maori, the combinations 'ng' and 'wh' can each be treated as a single letter. Given this, how many letters are in the place name? My son started to have serious difficulty doing this by hand, trying to treat every 'ng' or 'wh' as a single letter, until my wife cleverly pointed out that he should count the number of 'ng's and 'wh's and subtract that from his previous answer.

Here is how we could carry out both approaches in R. For the first approach, we count each 'ng' and 'wh' as a single letter. One way to do this is to search through the string, convert every 'ng' and 'wh' into a single character, and then redo the count. This search-and-replace task is another standard text processing operation. In R, we can perform this task using the gsub() function, which takes three arguments: a pattern to search for, a replacement value, and the string to search within. The result is a string with every occurrence of the pattern replaced. Because we are only counting letters, it does not matter what character we choose as a replacement. First, we replace occurrences of 'ng' with a full stop.

> gsub("ng", ".", placename)

[1] "Taumatawhakata.iha.akoauauotamateaturipukakapikimau.ahoronukupokaiwhenuakitanatahu"

Next, we replace the occurrences of 'wh' with a full stop.

> replacengs <- gsub("ng", ".", placename)
> gsub("wh", ".", replacengs)

[1] "Taumata.akata.iha.akoauauotamateaturipukakapikimau.ahoronukupokai.enuakitanatahu"

Finally, we count the number of letters in the resulting string.

> replacewhs <- gsub("wh", ".", replacengs)
> nchar(replacewhs)

[1] 80
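The two replacement steps can also be combined. The sketch below relies on the fact that gsub() treats its pattern as a regular expression by default, so the pattern "ng|wh" matches either combination; the variable name collapsed is an assumption made for this example.

```r
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# "ng|wh" is a regular expression that matches either 'ng' or 'wh',
# so a single call replaces both combinations with a full stop
collapsed <- gsub("ng|wh", ".", placename)

# Each two-character combination is now one character
print(nchar(collapsed))
```

Because the pattern is a regular expression, characters such as "." or "|" have special meaning in the pattern argument; gsub() accepts fixed = TRUE when a pattern should be taken literally.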

The alternative approach involves just finding out how many 'ng's and 'wh's are in the string and subtracting that number from the original count. This simple step of searching within a string for a pattern is yet another common text processing task. There are several R functions that perform variations on this task, but for this example we need the function gregexpr() because it returns all of the matches within a string. This function takes two arguments: a pattern to search for and the string to search within. The return value gives a vector of the starting positions of the pattern within the string, plus an attribute that gives the length of each match.

> gregexpr("ng", placename)

[[1]]
[1] 15 20 54
attr(,"match.length")
[1] 2 2 2

This shows that the pattern 'ng' occurs three times in the place name, starting at character positions 15, 20, and 54, respectively, and that the length of the match is 2 characters in each case. Here is the result of searching for occurrences of 'wh':

> gregexpr("wh", placename)

[[1]]
[1]  8 70
attr(,"match.length")
[1] 2 2
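The starting positions and the "match.length" attribute together are enough to pull the matched text back out of the string. The following sketch shows one way to do that with attr() and substring(); the variable names starts, lens, and found are illustrative only.

```r
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# gregexpr() returns a list with one element per input string;
# [[1]] selects the result for our single string
ngmatches <- gregexpr("ng", placename)[[1]]

# Starting positions (as.integer() drops the attributes)
starts <- as.integer(ngmatches)

# Length of each match, stored as an attribute on the result
lens <- attr(ngmatches, "match.length")

# Extract the text of each match from its start and length
found <- substring(placename, starts, starts + lens - 1)
print(found)
```

Each extracted piece should be the two-character string "ng", which confirms how the positions and lengths fit together.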

The return value of gregexpr() is a list to allow for more than one string to be searched at once. In this case, we are only searching a single string, so we just need the first component of the result. We can use the length() function to count how many matches there were in the string.

> ngmatches <- gregexpr("ng", placename)[[1]]
> length(ngmatches)

[1] 3

> whmatches <- gregexpr("wh", placename)[[1]]
> length(whmatches)

[1] 2
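One caveat with length() is worth noting: when a pattern does not occur at all, gregexpr() reports a single value of -1, so length() would give 1 rather than 0. The sketch below wraps the count in a hypothetical helper, countMatches(), that counts only positive positions and also accepts several strings at once.

```r
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# countMatches() is a hypothetical helper, not part of base R.
# It returns the number of matches of 'pattern' in each string.
countMatches <- function(pattern, strings) {
    matchList <- gregexpr(pattern, strings)
    # A "no match" result is a single -1, so count only positive positions
    vapply(matchList, function(m) sum(m > 0), integer(1))
}

print(countMatches("ng", placename))
print(countMatches("zz", placename))

# Several strings can be searched in one call
print(countMatches("ng", c(placename, "Hawke's Bay")))
```

For the place name, this helper gives the same counts as length() because both patterns do occur; the difference only matters for patterns with no matches.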

The final answer is simple arithmetic.

> nchar(placename) - 
     (length(ngmatches) + length(whmatches))

[1] 80
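As a cross-check on both approaches, the name can be broken directly into "letters" in which 'ng' and 'wh' stay together. This is only a sketch: it assumes that the regular expression "ng|wh|." (either combination, otherwise any single character) captures the counting rule, and it uses the base R function regmatches() to extract the matched pieces. The variable name tokens is illustrative.

```r
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# Match 'ng' or 'wh' where they occur, otherwise any single character
pieces <- gregexpr("ng|wh|.", placename)

# regmatches() returns the matched substrings, one list element per string
tokens <- regmatches(placename, pieces)[[1]]

# Number of letters when 'ng' and 'wh' each count as one
print(length(tokens))
```

The number of tokens should agree with the answer of 80 obtained above.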

For the final question in the homework assignment, my son had to count how many times each letter appeared in the place name (treating 'wh' and 'ng' as two letters each again). Doing this by hand requires scanning the place name multiple times and the error rate increases alarmingly.

One way to do this in R is by breaking the place name into individual characters and creating a table of counts. Once again, we have a standard text processing task: breaking a single string into multiple pieces. The strsplit() function performs this task in R. It takes two arguments: the string to break up and a pattern that is used to decide where to split the string. If we give a zero-length pattern (here, NULL), the string is split at each character.

> strsplit(placename, NULL)

[[1]]
 [1] "T" "a" "u" "m" "a" "t" "a" "w" "h" "a" "k" "a" "t" "a"
[15] "n" "g" "i" "h" "a" "n" "g" "a" "k" "o" "a" "u" "a" "u"
[29] "o" "t" "a" "m" "a" "t" "e" "a" "t" "u" "r" "i" "p" "u"
[43] "k" "a" "k" "a" "p" "i" "k" "i" "m" "a" "u" "n" "g" "a"
[57] "h" "o" "r" "o" "n" "u" "k" "u" "p" "o" "k" "a" "i" "w"
[71] "h" "e" "n" "u" "a" "k" "i" "t" "a" "n" "a" "t" "a" "h"
[85] "u"

Again, the result is a list to allow for breaking up multiple strings at once. In this case, we are only interested in the first component of the list. One minor complication is that we want the uppercase 'T' to be counted as a lowercase 't'. The function tolower() performs this task.

> nameLetters <- strsplit(placename, NULL)[[1]]
> tolower(nameLetters)

 [1] "t" "a" "u" "m" "a" "t" "a" "w" "h" "a" "k" "a" "t" "a"
[15] "n" "g" "i" "h" "a" "n" "g" "a" "k" "o" "a" "u" "a" "u"
[29] "o" "t" "a" "m" "a" "t" "e" "a" "t" "u" "r" "i" "p" "u"
[43] "k" "a" "k" "a" "p" "i" "k" "i" "m" "a" "u" "n" "g" "a"
[57] "h" "o" "r" "o" "n" "u" "k" "u" "p" "o" "k" "a" "i" "w"
[71] "h" "e" "n" "u" "a" "k" "i" "t" "a" "n" "a" "t" "a" "h"
[85] "u"

Now it is a simple matter of calling the table() function to produce a table of counts of the letters.

> lowerNameLetters <- tolower(nameLetters)
> table(lowerNameLetters)

lowerNameLetters
 a  e  g  h  i  k  m  n  o  p  r  t  u  w 
22  2  3  5  6  8  3  6  5  3  2  8 10  2
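The three steps (convert to lowercase, split, tabulate) can be combined into a single expression, and the resulting table can be sorted to show the most common letters first. This is a minimal sketch using only base R; the names letterCounts and sortedCounts are assumptions made for the example.

```r
placename <- "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"

# Lowercase the name, split it into single characters, and count each letter
letterCounts <- table(strsplit(tolower(placename), NULL)[[1]])

# Order the counts from most to least frequent
sortedCounts <- sort(letterCounts, decreasing = TRUE)
print(sortedCounts)
```

Converting to lowercase before splitting gives the same result as converting afterwards, because tolower() works on whole strings as well as on single characters.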

Paul Murrell

Creative Commons License
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.