11.1 Case study: The Population Clock

As early as the 1970s, Isaac Asimov was attempting to draw the eye of the general public to the looming problem of the overpopulation of Planet Earth. In several books and speeches [11.1], he dramatically pointed out that the world population at the time stood at around 4 billion, but with the increasing rate of growth, it would be around 7 billion by the turn of the millennium. Asimov also pointed out that, one way or another, the growth of the human population would slow; it was just a matter of how messy the deceleration was. Nature and the limits of natural resources would do the job if necessary, but the least messy solution, he suggested, was voluntary birth control.

Asimov did not live to see the turn of the millennium, but he may have been pleased to see that his prediction was slightly pessimistic, not because his calculations were wrong, but because population growth had begun to slow, and because it was slowing for non-messy reasons.

Overall estimates of the population of the world can be obtained from the U.S. Census Bureau [11.2]. Figure 11.1 shows estimates dating from 1900 and extending until 2050. This clearly shows the upward curve during the first three quarters of the 20th century, but already a straightening and tailing off beginning as we pass the turn of the millennium.

Figure 11.1: The population of the world, based on estimates by the U.S. Census Bureau. The shaded area to the right indicates projected estimates of world population.
Image script-worldpop

This slowing in the growth of the world's population is mainly thanks to lower fertility rates (the non-messy solution), which are due to cultural changes such as people marrying later, and the greater availability and use of contraceptives. Even longer-term projections by the United Nations suggest that, assuming trends in lower fertility continue, the world's population may actually stabilize at around 9 or 10 billion before 2500. Asimov, not to mention his descendants, should be thrilled.

Another population-related service offered by the U.S. Census Bureau is the World Population Clock (see Figure 11.2).

Figure 11.2: The World Population Clock shows an up-to-the-minute snapshot of the current estimate of the world's population (based on estimates by the U.S. Census Bureau).
Image popclockgray

This web site provides an up-to-the-minute snapshot of the current estimate of the world's population, based on estimates by the U.S. Census Bureau. It is updated every few seconds.

In this section, we will use this clock to generate a rough estimate of the current rate of growth of the world's population.

We will do this by looking at the steps involved, how we might perform this task “by hand”, and how we might use the computer to do the work instead. The steps involved are these:

Copy the current value of the population clock.
 
This is pretty easy to do by simply navigating to the population clock web page and typing out or cutting-and-pasting the current population value.

What about getting the computer to do the work?

Navigating to a web page and downloading the information is not actually very difficult. The following code will do this data import task:

> clockHTML <- 
    readLines("http://www.census.gov/ipc/www/popclockworld.html")
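
As a minimal sketch of what readLines() returns, the following code reads a small file rather than the live web page, so that it runs without a network connection. The file contents and the names sampleFile and sampleHTML are made up for illustration; the point is only that the result is a character vector with one element per line of text.

```r
# Hypothetical stand-in for the downloaded page: three lines of HTML
# written to a temporary file (not the real Census Bureau page).
sampleFile <- tempfile(fileext = ".html")
writeLines(c("<html>",
             "<div id=\"worldnumber\">6,617,746,521</div>",
             "</html>"),
           sampleFile)

# readLines() accepts a file name in the same way that it accepts a URL.
sampleHTML <- readLines(sampleFile)

class(sampleHTML)   # the type of the result
length(sampleHTML)  # one element per line of the file
```

Because each line of the page becomes one element of the vector, a single line can later be picked out by its position.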

Getting the population estimate from the downloaded information is a bit more difficult, but not much. The first thing to realise is that we do not have a nice picture of the web page like we see in a browser. This is actually a good thing because it would be incredibly difficult for the computer to extract the information from a picture. What we have instead is the HTML code behind the web page (see Figure 11.3).

Figure 11.3: HTML code for the World Population Clock (see Figure 11.2). Line numbers cited in the text count from the first line of the listing.
 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" 
      xml:lang="en" lang="en">
<head>
    <title>World POPClock Projection</title>
    <link rel="stylesheet" 
          href="popclockworld%20Files/style.css" 
          type="text/css">
    <meta name="author" content="Population Division">
    <meta http-equiv="Content-Type" 
          content="text/html; charset=iso-8859-1">
    <meta name="keywords" content="world, population">
    <meta name="description" 
          content="current world population estimate">
    <style type="text/css">
        #worldnumber {
	    text-align: center;
	    font-weight: bold;
	    font-size: 400%;
	    color: #ff0000;
        }
    </style>
</head>
<body>
    <div id="cb_header">
    <a href="http://www.census.gov/">
    <img src="popclockworld%20Files/cb_head.gif" 
         alt="U.S. Census Bureau" 
         border="0" height="25" width="639">
    </a>
    </div>

    <h1>World POPClock Projection</h1>

    <p></p>
    According to the <a href="http://www.census.gov/ipc/www/">
    International Programs Center</a>, U.S. Census Bureau, 
    the total population of the World, projected to 09/12/07
    at 07:05 GMT (EST+5) is<br><br>
    <div id="worldnumber">6,617,746,521</div>
    <p></p>
    <hr>

    ...

This is better than a picture because there is structure to this information and we can use that structure to get the computer to extract the information for us. The current population value is within an HTML div tag with an id attribute (line 41 in Figure 11.3). This makes it very easy to find the line that contains the population estimate; this is just a text search task. The following code is one way to perform this task:

> popLine <- grep('id="worldnumber"', clockHTML)
> popLine

[1] 41
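
To make the text search a little more concrete, here is a small sketch that runs grep() on a made-up three-line vector (not the real page). It assumes nothing beyond base R; the names sampleHTML, matchIndex, and matchText are just for illustration.

```r
# A made-up three-line "page" (not the real population clock HTML).
sampleHTML <- c("<h1>World POPClock Projection</h1>",
                "<div id=\"worldnumber\">6,617,746,521</div>",
                "<hr>")

# By default, grep() returns the positions of the matching elements ...
matchIndex <- grep('id="worldnumber"', sampleHTML)

# ... while value = TRUE returns the matching text itself.
matchText <- grep('id="worldnumber"', sampleHTML, value = TRUE)
```

The position is what we want here, because it can be used to pull the whole line out of the original vector.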

It is also easy to extract the population estimate from the line by deleting all of the bits of the line that we do not want. This is a text search-and-replace task and can be performed using the following code:

> popString <- gsub('^.+id="worldnumber">', "",
                     gsub("</div>.*", "",
                          clockHTML[popLine]))
> popString

[1] "6,617,746,521"
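
The nested calls to gsub() can be easier to follow when written as two separate steps. The sketch below does this for a made-up line in the same form as the one found by the text search; the names sampleLine, step1, and step2 are hypothetical.

```r
# A made-up line in the same form as the line containing the estimate.
sampleLine <- "    <div id=\"worldnumber\">6,617,746,521</div>"

# Step 1 (the inner call): delete "</div>" and anything after it.
step1 <- gsub("</div>.*", "", sampleLine)

# Step 2 (the outer call): delete everything up to and including
# the opening tag.
step2 <- gsub('^.+id="worldnumber">', "", step1)
```

After the first step only the closing tag has gone; after the second step only the digits and commas remain.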

Finally, we need to turn the text of the population estimate into a number so that we can later carry out mathematical operations. This is called data coercion and appropriate code is shown below (notice that we have to remove the commas that are so useful for human viewers, but a complete distraction for computers):

> pop <- as.numeric(gsub(",", "", popString))
> pop

[1] 6617746521
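
The following sketch shows why the commas have to go, and why as.numeric() rather than as.integer() is a sensible choice for a value of this size. The variable names other than popString are made up for illustration.

```r
popString <- "6,617,746,521"   # the text value extracted above

# With the commas still in place, the coercion fails: the result is NA
# (R also issues a warning, which is suppressed here).
withCommas <- suppressWarnings(as.numeric(popString))

# With the commas removed, the coercion succeeds.
withoutCommas <- as.numeric(gsub(",", "", popString))

# The value is larger than the biggest integer that R can store
# (.Machine$integer.max), so coercing to integer gives NA.
asInteger <- suppressWarnings(as.integer(gsub(",", "", popString)))
```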

This example provides a classic demonstration of the difference between performing a task by hand and writing code to get a computer to do the work. The manual method is simple, requires no new skills, and takes very little time. The computer code approach requires learning new information (it will take substantial chunks of this chapter to explain just the code we have used so far), so it is slower and harder. However, the computer code approach will pay off, as we are about to see.

Wait about ten minutes.
 
This is about as simple as it gets for a do-it-yourself task. However, it highlights two of the major advantages of automating tasks by computer. Computers will perform boring tasks without complaining or falling asleep and their accuracy will not degrade as a function of the boredom of the task.

The following code will make the computer wait for 10 minutes:

> Sys.sleep(600)

Copy the new value of the population clock.
 
This is the same as the first task. If we do it by hand, it is just as easy as it was before, though the boredom issue pretty rapidly comes into play. What about doing it by computer code? Here we see the third major benefit of writing computer code: once code has been written to perform a task, repetitions of the task become essentially free. All of the pain of writing the code in the first place starts to pay off very rapidly once a task has to be repeated. Almost exactly the same code as before will produce the new population clock estimate.

> clockHTML2 <- 
       readLines("http://www.census.gov/ipc/www/popclockworld.html")
> popLine2 <- grep('id="worldnumber"', clockHTML2)
> popString2 <- gsub('^.+id="worldnumber">', "",
                     gsub("</div>.*", "",
                          clockHTML2[popLine2]))
> pop2 <- as.numeric(gsub(",", "", popString2))

Calculate the growth rate.
 
This is a very simple calculation that is, again, easy to do by hand. Computer code still provides an advantage because there is less chance of making an error in the calculation. There is the usual cost to pay in terms of writing the code in the first place, but in this case, that is fairly small. All we need to do is divide the change in population by the elapsed time (10 minutes):

> rateEstimate <- (pop2 - pop)/10
> rateEstimate

[1] 146.6
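
As a rough sanity check, the per-minute rate can be converted to larger units. The sketch below uses the first reading from the text and a hypothetical second reading chosen to be 1466 people larger, which is consistent with a rate of 146.6 per minute; the names ratePerDay and ratePerYear are made up for illustration.

```r
pop  <- 6617746521          # first reading, from the text
pop2 <- pop + 1466          # hypothetical second reading, 10 minutes later

# People per minute: change in population divided by elapsed minutes.
rateEstimate <- (pop2 - pop) / 10

# The same rate expressed per day and per (365-day) year.
ratePerDay  <- rateEstimate * 60 * 24
ratePerYear <- ratePerDay * 365
```

Under these assumptions the rate works out to roughly 211 thousand people per day, or about 77 million people per year.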

Repeat several times to get a decent sample.
 
Because we are unaware of the process going on behind the scenes at the population clock web site, it would be unwise to trust a single point estimate of the population growth rate using this technique. A safer approach would be to generate a sample of several estimates and that means that we should repeat the whole process.

As mentioned previously, computers are world champions when it comes to mindlessly repeating tasks, so the computer code approach will now pay off handsomely.

The computer code that will generate 10 population growth rate estimates is shown in Figure 11.4. As usual, the details of how this code works are not important at this stage. However, there are several important features that we should highlight.

Figure 11.4: R code for estimating world population growth by downloading the World Population Clock web site and processing it at 10 minute intervals. Line numbers cited in the text count from the first line of the listing.
 

checkTheClock <- function() {
    clockHTML <-
     readLines("http://www.census.gov/ipc/www/popclockworld.html")
    popLine <- grep('id="worldnumber"', clockHTML)
    popString <- gsub('^.+id="worldnumber">', "",
                       gsub("</div>.*", "",
                            clockHTML[popLine]))
    as.numeric(gsub(",", "", popString))
}

rateEstimates <- rep(0, 10)

for (i in 1:10) {
    pop1 <- checkTheClock()
    Sys.sleep(600) # Wait 10 minutes
    pop2 <- checkTheClock()
    rateEstimates[i] <- (pop2 - pop1) / 10
}

writeLines(as.character(rateEstimates),
           paste("popGrowthEstimates",
                 as.Date(Sys.time()), sep=""))

The core task in this example involves downloading the World Population Clock and processing the information to extract a population estimate. For each estimate of the population growth rate, this core task must be performed twice. A naive approach would suggest writing out two copies of the code to perform the task. However, that would violate the DRY principle (see Section 2.11) because it would create two copies of an important piece of information, the information in this case being computer code to perform a certain task. As can be seen from Figure 11.4, the code can be written so that only one copy is required (lines 2 to 8) and that single copy can be referred to from elsewhere in the code (lines 14 and 16).

At a slightly higher level, the task of calculating an estimate of the population growth is also repeated, in this case, 10 times. Again, rather than having 10 copies of the code to calculate an estimate, there is only one copy (lines 14 to 17), with other code to express the fact that this sub-task needs to be repeated 10 times (lines 13 and 18).
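
The idea of writing a piece of code once and referring to it from several places can be tried out without a network connection. The sketch below is a variation on Figure 11.4, not the code from the figure: extractPopulation() is a hypothetical helper that takes lines of HTML as an argument instead of downloading them, and the two snapshots are made-up data.

```r
# Hypothetical helper: the text-processing half of checkTheClock(),
# taking the lines of HTML as an argument instead of downloading them.
extractPopulation <- function(html) {
    popLine <- grep('id="worldnumber"', html)
    popString <- gsub('^.+id="worldnumber">', "",
                      gsub("</div>.*", "", html[popLine]))
    as.numeric(gsub(",", "", popString))
}

# Two made-up snapshots of the page, taken 10 minutes apart.
snapshot1 <- c("<body>", "<div id=\"worldnumber\">6,617,746,521</div>")
snapshot2 <- c("<body>", "<div id=\"worldnumber\">6,617,747,987</div>")

# The single definition is reused for both readings.
rate <- (extractPopulation(snapshot2) - extractPopulation(snapshot1)) / 10
rate
```

If the format of the web page changed, only the one function definition would need to be updated.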

Write the answer down.
 
The final step in this exercise is to record the results of all of our work. This will be useful if, for example, we want to compare the current population growth rate with the rate next month, or next year. This is the purpose of lines 20 to 22 in Figure 11.4. This code creates a plain text file containing our estimates and includes the current date in the name of the file so that we know when it was generated.
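
As a small sketch of this final step, the code below writes some made-up estimates to a dated file in a temporary directory and then reads them back in. The values in rateEstimates are invented for illustration, and Sys.Date() is used here simply as a short way to get the current date.

```r
# Made-up growth rate estimates (people per minute), for illustration.
rateEstimates <- c(146.6, 146.5, 146.7)

# Build a dated file name in a temporary directory.
outFile <- file.path(tempdir(),
                     paste("popGrowthEstimates", Sys.Date(), sep=""))
writeLines(as.character(rateEstimates), outFile)

# Reading the file back recovers the numbers, one per line.
savedEstimates <- as.numeric(readLines(outFile))
mean(savedEstimates)
```

Because the file is plain text with one number per line, it can be read again later by R or by any other software.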

This chapter is concerned with writing code like this, using the R language, to perform general data handling tasks: importing and exporting data, manipulating the shape of the data, and processing data into new forms.

Paul Murrell

Creative Commons License
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.