Subsections

9.2 SQL

SQL consists of three components:

The Data Definition Language
(DDL)
This is concerned with the creation of databases and the specification of the structure of tables and of constraints between tables.
The Data Control Language
(DCL)
This is concerned with controlling access to the database--who is allowed to do what to which tables.
The Data Manipulation Language
(DML)
This is concerned with getting data into and out of a database.

In this chapter, we will only be concerned with the DML part of SQL, and only then with the extraction of data from a database.

9.2.1 The SELECT command

Everything we do will be a variation on the SELECT command of SQL, which has the following basic form:

SELECT columns
    FROM
tables
    WHERE
row_condition

This will extract the specified columns from the specified tables, but only the rows for which the row_condition is true.


9.2.2 Case study: The Data Expo (continued)

The Data Expo was a data analysis competition run at the JSM 2006 conference. The Data Expo data set consists of seven atmospheric measurements at locations on a 24 by 24 grid averaged over each month for six years (72 time points). The elevation (height above sea level) at each location is also included in the data set (see Section 7.5.6 for more details).

The data set was originally provided as 505 plain text files, but the data can also be stored in a database with the following structure (see Section 7.9.7).

location_table ( ID [PK], longitude, latitude, elevation )

date_table ( date [PK], month, year )

measure_table ( date [PK] [FK date_table.date],
                location [PK] [FK location_table.ID],
                cloudhigh, cloudlow, cloudmid,
                ozone, pressure,
                surftemp, temperature )

The location_table contains all of the geographic locations at which measurements were taken, and includes the elevation at each location. The date_table contains all of the dates at which measurements were taken. This table also includes the text form of each month and the numeric form of the year. These have been split out to make it easier to perform queries based on months or years. The full dates are stored using the ISO 8601 format so that alphabetical ordering gives chronological order.

The measure_table contains all of the atmospheric measurements for all dates and locations. Locations are represented by simple ID numbers, referring to the appropriate complete information in the location_table.

The goal for contestants in the Data Expo was to summarize the important features of the data set. In this section, we will use SQL statements to extract subsets of the data in order to produce some basic summaries.

A simple first step is just to view the univariate distribution of each measurement variable. For example, the following code extracts all air pressure values from the database by extracting all rows from the pressure column of the measure_table.

> SELECT pressure FROM measure_table;
pressure
--------
835.0   
810.0   
810.0   
775.0   
795.0   
915.0   
...

The result contains 41472 ( 24 x 24 x 72) values, so only the first few are shown here. Figure 9.3 shows a plot of all of the values.

Figure 9.3: All of the air pressure measurements from the 2006 JSM Data Expo.
\begin{figure}% the gap below seems to be important!??
\par
\includegraphics[width=\textwidth]{extract-onetable}\end{figure}

The resolution of the data is immediately apparent; the pressure is only recorded to the nearest multiple of 5. However, the more striking feature is the change in the spread of the second half of the data. NASA have confirmed that this change is real, but unfortunately have not been able to give an explanation for why it occurred. Perhaps something happened to the collecting satellite.

An entire column of data from the measure_table in the Data Expo database represents measurements of a single variable at all locations for all time periods. One useful way to “slice” the Data Expo data is to look at the values for a particular location over all time periods. For example, how does surface temperature vary over time at a particular location? Here is how we could extract all of the surface temperature values for location 1. This uses a WHERE clause to specify which rows of the surftemp column we want to extract.

> SELECT surftemp 
         FROM measure_table
         WHERE location = 1;
surftemp
--------
272.7   
282.2   
282.2   
289.8   
293.2   
301.4   
...

Again, the result is too large to show all values, so only the first few are shown. Figure 9.4 shows a plot of all of the values.

Figure 9.4: All of the surface temperature measurements from the 2006 JSM Data Expo for location 1. Vertical grey bars mark the change of years.
\begin{figure}% the gap below seems to be important!??
\par
\includegraphics[width=\textwidth]{extract-onetablewhere}\end{figure}

The interesting feature here is that we can see a cyclic change in temperature, as we might expect, with the change of seasons.

Recall that the order of the rows in a database table is not guaranteed. This means that, whenever we extract information from a table, we should be explicit about the order that we want for the results. This is achieved by specifying an ORDER BY clause in the query. For example, the following SQL command extends the previous one to ensure that the temperatures for location 1 are returned in chronological order.

> SELECT surftemp 
         FROM measure_table
         WHERE location = 1
         ORDER BY date;

The WHERE clause can use other comparison operators besides equality. For example, the following query selects all locations where the air pressure is less than 1000 millibars. Figure 9.5 displays these locations in terms of longitude and latitude, which shows that these locations, as expected, are mostly on land (the air pressure at sea level is roughly 1000 millibars). As well as requesting the location, we also request the pressure; this allows us to check that the result is producing what we want (all pressures should be less than 1000).

> SELECT location, pressure
         FROM measure_table
         WHERE pressure < 1000;
location  pressure
--------  --------
1         835.0   
2         810.0   
3         810.0   
4         775.0   
5         795.0   
6         915.0   
...

Figure 9.5: The locations for which air pressure is less than 1000 millibars (on average) for at least one month during 1995 to 2000. (Data from the the 2006 JSM Data Expo.)
\begin{figure}% the gap below seems to be important!??
\par
\includegraphics[width=\textwidth]{extract-expoland}\end{figure}

As well as extracting raw data values, it is possible to calculate derived values by combining columns with simple arithmetic operators or by using a function to produce the sum or average of the values in a column.

An SQL function will produce a single overall value for each column of a table. What is usually more interesting is the value of the function for subgroups within a column, so these sorts of operations are commonly combined with a GROUP BY clause, which means that the result is computed for subsets of the column.

For example, instead of investigating the change in surface temperature over time for a single location, we could look at the change in the surface temperature averaged across all locations. The following code performs this query and Figure 9.6 plots the result.

> SELECT date, AVG(surftemp) avgtemp
         FROM measure_table
         GROUP BY date
         ORDER BY date;
date        avgtemp
----------  -------
1995-01-16  294.985
1995-02-16  295.486
1995-03-16  296.315
1995-04-16  297.119
1995-05-16  297.244
1995-06-16  296.976
...

Overall, it appears that 1997 and 1998 were generally warmer years.

Figure 9.6: The surface temperature measurements from the 2006 JSM Data Expo averaged across all locations for each time point. Vertical grey bars mark the change of years.
\begin{figure}% the gap below seems to be important!??
\par
\includegraphics[width=\textwidth]{extract-onetablegroup}\end{figure}

One other feature to notice about this example SQL query is that it defines a column alias, avgtemp, for the column of averages. This alias can be used within the SQL query, which can make the query easier to type and easier to read. The alias is also used in the presentation of the result.


9.2.3 Querying several tables: Joins

As demonstrated in the previous section, database queries from a single table are quite straightforward. However, most databases consist of more than one table, and most interesting database queries involve extracting information from more than one table. In database terminology, most queries involve some sort of join between two or more tables.

9.2.4 Case study: Commonwealth swimming

New Zealand sent a team of 18 swimmers to the Melbourne 2006 Commonwealth Games, 10 women and 8 men. The results from their races are recorded in a database with the following structure.

\usebox{\swimmerbox}
 
This table has one row for each swimmer and contains the first and last name of each swimmer. Each swimmer also has a unique numeric identifier.

\usebox{\distancebox}
 
This table defines the set of valid swim distances: 50, 100, 200, 400.

\usebox{\strokebox}
 
This table defines the set of valid swim strokes: breaststroke (Br), freestyle (Fr), butterfly (Bu), backstroke (Ba), and individual medley (IM).

\usebox{\genderbox}
 
This table defines the valid genders: male (M) and female (F).

\usebox{\stagebox}
 
This table defines the valid types of race that can occur: heats (heat), semifinals (semi), and finals (final).

\usebox{\resultbox}
 
This table contains information on the races swum by individual swimmers. Each row specifies a swimmer and the type of race (distance, stroke, gender, and stage). In addition, the swimmer's time and position in the race (place) are recorded.

As an example of the information stored in this database, the following code shows that the swimmer with an ID of 1 is called Zoe Baker.

> SELECT * FROM swimmer_table 
         WHERE ID = 1;
ID  first  last 
--  -----  -----
1   Zoe    Baker

The following code shows that Zoe Baker swam in three races, a heat, a semifinal and the final of the women's 50m breaststroke, and she came 4th in the final in a time of 31 minutes and 27 seconds.

> SELECT * FROM result_table 
         WHERE swimmer = 1;
swimmer  distance  stroke  gender  stage  time   place
-------  --------  ------  ------  -----  -----  -----
1        50        Br      F       Final  31.45  4    
1        50        Br      F       Heat   31.7   4    
1        50        Br      F       Semi   31.84  5


9.2.5 Cross joins

The most basic type of database join, upon which all other types of join are based, is a cross join. The result of a cross join is the cartesian product of the rows of one table with the rows of another table. In other words, row 1 of table 1 is paired with each row of table 2, then row 2 of table 1 is paired with each row of table 2, and so on. If the first table has n1 rows and the second table has n2 rows, the result of a cross join is a table with n1 x n2 rows.

The simplest way to create a cross join is simply to perform an SQL query on more than one table. As an example, the following code performs a cross join on the distance_table and stroke_table in the swimming database to generate all possible swimming events.

> SELECT length, stroke 
         FROM distance_table, stroke_table;
length  stroke          
------  ----------------
50      Breaststroke    
50      Freestyle       
50      Butterfly       
50      Backstroke      
50      IndividualMedley
100     Breaststroke    
100     Freestyle       
100     Butterfly       
100     Backstroke      
100     IndividualMedley
200     Breaststroke    
200     Freestyle       
200     Butterfly       
200     Backstroke      
200     IndividualMedley
400     Breaststroke    
400     Freestyle       
400     Butterfly       
400     Backstroke      
400     IndividualMedley

A cross join can also be obtained more explicitly using the CROSS JOIN syntax as shown below (the result is exactly the same as for the code above).

SELECT length, stroke 
    FROM distance_table CROSS JOIN stroke_table;

9.2.6 Inner joins

An inner join is the most common way of combining two tables. In this sort of join, only matching rows are extracted from two tables. Typically, a foreign key in one table is matched to the primary key in another table.

Conceptually, an inner join is a cross join, with only the desired rows retained. In practice, DBMS software analyses queries and obtains the result more directly in order to use less time and less computer memory.

9.2.7 Case study: The Data Expo (continued)

In order to demonstrate inner joins, we will return to the Data Expo database (see 9.2.2) and consider the problem of determining which geographic locations have pressure less than 1000 (at least once), but have an elevation equal to zero. In other words, which locations have pressure which is below the normal sea-level pressure, but are actually at sea-level.

In order to answer this question, we need information from both the measure_table (to provide air pressure) and the location_table (to provide elevation). The following SQL command produces the information we need (see Figure 9.7).

> SELECT longitude, latitude, pressure, elevation
         FROM measure_table mt, location_table lt
         WHERE mt.location = lt.ID AND
             pressure < 1000 AND elevation = 0
         ORDER BY latitude, longitude;
longitude  latitude  pressure  elevation
---------  --------  --------  ---------
-76.25     -21.25    995.0     0.0      
-76.25     -21.25    995.0     0.0      
-76.25     -21.25    995.0     0.0      
-76.25     -21.25    990.0     0.0      
-76.25     -21.25    995.0     0.0      
-76.25     -21.25    995.0     0.0      
...

Figure 9.7: The locations for which air pressure is less than 1000 millibars (on average) for at least one month during 1995 to 2000 and elevation is equal to 0. (Data from the the 2006 JSM Data Expo.)
\begin{figure}% the gap below seems to be important!??
\par
\includegraphics[width=\textwidth]{extract-expoodd}\end{figure}

The most important feature of this code is the fact that it obtains information from two tables.

    FROM measure_table mt, location_table lt

In order to merge the information from the two tables in a sensible fashion, we must specify how rows from one table are matched up with rows from the other table. In most cases, this means specifying that a foreign key from one table matches the primary key in the other table, which is precisely what has been done in this case.

    WHERE mt.location = lt.ID

Another feature of this code is that it makes use of table aliases. For example, mt is defined as an alias for the measure_table. This makes it easier to type the code and can also make it easier to read the code.

Another way to specify this join uses a different syntax that places all of the information about the join in the FROM clause of the query. The following code produces exactly the same result as before, but uses the key words INNER JOIN between the tables that are being joined and follows that with a specification of the columns to match ON.

> SELECT longitude, latitude, pressure, elevation
         FROM measure_table mt INNER JOIN location_table lt
             ON mt.location = lt.ID
         WHERE pressure < 1000 AND elevation = 0
         ORDER BY latitude, longitude;

One thing to notice about this example is that we are not actually selecting the locations with lower pressure. What we are really doing is selecting the pressures that are less than 1000 and reporting the location information for each of those pressures. This means that several locations are repeated in the result; at these locations the pressure dropped below 100 for more than one month. We could show this effect by counting how many months the pressure was below 1000, for each location.

> SELECT longitude, latitude, COUNT(*) numMonths
         FROM measure_table mt INNER JOIN location_table lt
             ON mt.location = lt.ID
         WHERE pressure < 1000 AND elevation = 0
         GROUP BY location
         ORDER BY latitude, longitude;
longitude  latitude  numMonths
---------  --------  ---------
-76.25     -21.25    24       
-73.75     -21.25    1        
-78.75     -18.75    6        
-73.75     -18.75    17       
-81.25     -16.25    3        
-78.75     -16.25    22       
...

Now consider a major summary of temperature values: what is the average temperature per year, across all locations on land (above sea level)?

In order to answer this question, we need to know the temperatures from the measure_table, the elevation from the location_table, and the years from the date_table. In other words, we need to combine all three tables together.

This situation is one reason for using the INNER JOIN syntax shown above, because it naturally extends to joining more than two tables, and provides a way for us to control the order in which the tables are joined. The following code performs the desired query (see Figure 9.8).

> SELECT year, AVG(surftemp) avgtemp
         FROM measure_table mt 
              INNER JOIN location_table lt
                  ON mt.location = lt.ID
              INNER JOIN date_table dt
                  ON mt.date = dt.date
         WHERE elevation > 0
         GROUP BY year;
year  avgtemp
----  -------
1995  295.380
1996  295.006
1997  295.383
1998  296.416
1999  295.258
2000  295.315

Figure 9.8: The average temperature over land per year. (Data from the the 2006 JSM Data Expo.)
\begin{figure}% the gap below seems to be important!??
\par
\includegraphics[width=\textwidth]{extract-expoyeartemp}\end{figure}

This result shows only 1998 as warmer than other years; the higher temperatures for 1997 that we saw in Figure 9.6 must be due to higher temperatures over water.

9.2.8 Sub-queries

It is possible to use an SQL query within another SQL query. The nested query is called a sub-query.

As a simple example, consider the problem of extracting the date at which the maximum surface temperature occurred. It is simple enough to determine the maximum surface temperature.

> SELECT MAX(surftemp) max FROM measure_table;
max  
-----
314.9

However, it is not so easy to report the date along with the maximum temperature because it is not valid to mix aggregated columns with non-aggregated columns. For example, the following SQL code will either trigger an error message or produce an incorrect result.

SELECT date, MAX(surftemp) max FROM measure_table;

The column date returns 41472 values, but the column MAX(surftemp) only returns 1.

The solution is to use a subquery as shown below.

> SELECT date, surftemp temp
         FROM measure_table
         WHERE surftemp = ( SELECT MAX(surftemp)
                                FROM measure_table );
date        temp 
----------  -----
1998-07-16  314.9
1998-07-16  314.9

The query that calculates the maximum surface temperature is inserted within brackets as a subquery within the WHERE clause. The outer query returns only the rows of the measure_table where the surface temperature is equal to the maximum.

The maximum temperature occured in July 1998 at two different locations.

9.2.9 Outer Joins

9.2.10 Case study: Commonwealth swimming (continued)

In Section 9.2.5 we saw how to generate all possible combinations of distance and stroke in the swimming database using a cross join. There are four distances and five strokes, so the cross join produces 20 different combinations.

We will now take that cross join and combine it with the table of race results using an inner join. We will summarize the result of all races for a particular distance/stroke combination by calculating the average time from such races. The following code performs this inner join.

> SELECT length, st.stroke style, AVG(time) avg
         FROM distance_table dt 
             CROSS JOIN stroke_table st 
             INNER JOIN result_table rt
                 ON dt.length = rt.distance AND
                    st.ID = rt.stroke
         GROUP BY length, st.stroke;
length  style             avg  
------  ----------------  -----
50      Backstroke        28.04
50      Breaststroke      31.29
50      Butterfly         26.40
50      Freestyle         26.16
100     Backstroke        60.55
100     Breaststroke      66.07
100     Butterfly         56.65
100     Freestyle         57.10
200     Backstroke        129.7
200     Butterfly         119.0
200     Freestyle         118.6
200     IndividualMedley  129.5
400     IndividualMedley  275.2

This result has only 13 rows. What has happened to the remaining 7 combinations of distance and stroke? They have been dropped from the result because some combinations of distance and stroke do not appear in the result_table. For example, no New Zealand swimmer competed in the 50m individual medley (which is hardly surprising given that that event never took place).

This feature of inner joins is not always desirable and can produce misleading results, which is why an outer join is sometimes necessary. The following code is the same as before, except that it performs a left outer join so that all distance/stroke combinations are reported, even though there is no average time information available for some combinations.

> SELECT length, st.stroke style, AVG(time) avg
         FROM distance_table dt 
             CROSS JOIN stroke_table st 
             LEFT JOIN result_table rt
                 ON dt.length = rt.distance AND
                    st.ID = rt.stroke
         GROUP BY length, st.stroke;
length  style             avg  
------  ----------------  -----
50      Backstroke        28.04
50      Breaststroke      31.29
50      Butterfly         26.40
50      Freestyle         26.16
50      IndividualMedley  NULL 
100     Backstroke        60.55
100     Breaststroke      66.07
100     Butterfly         56.65
100     Freestyle         57.10
100     IndividualMedley  NULL 
200     Backstroke        129.7
200     Breaststroke      NULL 
200     Butterfly         119.0
200     Freestyle         118.6
200     IndividualMedley  129.5
400     Backstroke        NULL 
400     Breaststroke      NULL 
400     Butterfly         NULL 
400     Freestyle         NULL 
400     IndividualMedley  275.2

9.2.11 Self joins

It is useful to remember that database joins always begin with a cartesian product of the rows of the tables being joined (conceptually at least). The different sorts of database join are all just different subsets of a cross join. This makes it possible to answer questions that, at first sight, may not appear to be database queries.

For example, it is possible to join a table with itself--a so-called self join. The result is all possible combinations of the rows of a table, which can be used to answer questions that require comparing a column within a table to itself or to other columns within the same table.

9.2.12 Case study: The Data Expo (continued)

Consider the following question: at what locations and dates did the surface temperature at location 1 for January 1995 reoccur?

This question requires a comparison of one row of the surftemp column in the measure_table with the other rows in that column. The code below performs the query using a self join.

> SELECT mt1.surftemp temp1, mt2.surftemp temp2, 
            mt2.location loc, mt2.date date 
         FROM measure_table mt1, measure_table mt2 
         WHERE mt1.surftemp = mt2.surftemp AND 
               mt1.date = '1995-01-16' AND 
               mt1.location = 1;
temp1  temp2  loc  date      
-----  -----  ---  ----------
272.7  272.7  1    1995-01-16
272.7  272.7  2    1995-12-16
272.7  272.7  3    1995-12-16
272.7  272.7  2    1996-01-16
272.7  272.7  3    1996-01-16
272.7  272.7  5    1996-12-16
272.7  272.7  5    1997-02-16
272.7  272.7  27   1997-12-16
272.7  272.7  29   1997-12-16
272.7  272.7  7    2000-12-16
272.7  272.7  8    2000-12-16
272.7  272.7  13   2000-12-16
272.7  272.7  14   2000-12-16

The temperature occurred again in neighbouring locations in December and January of 1995/1996 and again in other locations in later years.

Paul Murrell

Creative Commons License
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.