SQL consists of three components:
In this chapter, we will only be concerned with the DML part of SQL, and only then with the extraction of data from a database.
Everything we do will be a variation on the SELECT command of SQL, which has the following basic form:
This will extract the specified columns from the specified tables, but only the rows for which the row_condition is true.
The Data Expo was a data analysis competition run at the JSM 2006 conference. The Data Expo data set consists of seven atmospheric measurements at locations on a 24 by 24 grid averaged over each month for six years (72 time points). The elevation (height above sea level) at each location is also included in the data set (see Section 7.5.6 for more details).
The data set was originally provided as 505 plain text files, but the data can also be stored in a database with the following structure (see Section 7.9.7).
location_table ( ID [PK], longitude, latitude, elevation ) date_table ( date [PK], month, year ) measure_table ( date [PK] [FK date_table.date], location [PK] [FK location_table.ID], cloudhigh, cloudlow, cloudmid, ozone, pressure, surftemp, temperature )
The location_table contains all of the geographic locations at which measurements were taken, and includes the elevation at each location. The date_table contains all of the dates at which measurements were taken. This table also includes the text form of each month and the numeric form of the year. These have been split out to make it easier to perform queries based on months or years. The full dates are stored using the ISO 8601 format so that alphabetical ordering gives chronological order.
The measure_table contains all of the atmospheric measurements for all dates and locations. Locations are represented by simple ID numbers, referring to the appropriate complete information in the location_table.
The goal for contestants in the Data Expo was to summarize the important features of the data set. In this section, we will use SQL statements to extract subsets of the data in order to produce some basic summaries.
A simple first step is just to view the univariate distribution of each measurement variable. For example, the following code extracts all air pressure values from the database by extracting all rows from the pressure column of the measure_table.
> SELECT pressure FROM measure_table;
pressure -------- 835.0 810.0 810.0 775.0 795.0 915.0 ...
The result contains 41472 ( 24 x 24 x 72) values, so only the first few are shown here. Figure 9.3 shows a plot of all of the values.
The resolution of the data is immediately apparent; the pressure is only recorded to the nearest multiple of 5. However, the more striking feature is the change in the spread of the second half of the data. NASA have confirmed that this change is real, but unfortunately have not been able to give an explanation for why it occurred. Perhaps something happened to the collecting satellite.
An entire column of data from the measure_table in the Data Expo database represents measurements of a single variable at all locations for all time periods. One useful way to “slice” the Data Expo data is to look at the values for a particular location over all time periods. For example, how does surface temperature vary over time at a particular location? Here is how we could extract all of the surface temperature values for location 1. This uses a WHERE clause to specify which rows of the surftemp column we want to extract.
> SELECT surftemp FROM measure_table WHERE location = 1;
surftemp -------- 272.7 282.2 282.2 289.8 293.2 301.4 ...
Again, the result is too large to show all values, so only the first few are shown. Figure 9.4 shows a plot of all of the values.
The interesting feature here is that we can see a cyclic change in temperature, as we might expect, with the change of seasons.
Recall that the order of the rows in a database table is not guaranteed. This means that, whenever we extract information from a table, we should be explicit about the order that we want for the results. This is achieved by specifying an ORDER BY clause in the query. For example, the following SQL command extends the previous one to ensure that the temperatures for location 1 are returned in chronological order.
> SELECT surftemp FROM measure_table WHERE location = 1 ORDER BY date;
The WHERE clause can use other comparison operators besides equality. For example, the following query selects all locations where the air pressure is less than 1000 millibars. Figure 9.5 displays these locations in terms of longitude and latitude, which shows that these locations, as expected, are mostly on land (the air pressure at sea level is roughly 1000 millibars). As well as requesting the location, we also request the pressure; this allows us to check that the result is producing what we want (all pressures should be less than 1000).
> SELECT location, pressure FROM measure_table WHERE pressure < 1000;
location pressure -------- -------- 1 835.0 2 810.0 3 810.0 4 775.0 5 795.0 6 915.0 ...
As well as extracting raw data values, it is possible to calculate derived values by combining columns with simple arithmetic operators or by using a function to produce the sum or average of the values in a column.
An SQL function will produce a single overall value for each column of a table. What is usually more interesting is the value of the function for subgroups within a column, so these sorts of operations are commonly combined with a GROUP BY clause, which means that the result is computed for subsets of the column.
For example, instead of investigating the change in surface temperature over time for a single location, we could look at the change in the surface temperature averaged across all locations. The following code performs this query and Figure 9.6 plots the result.
> SELECT date, AVG(surftemp) avgtemp FROM measure_table GROUP BY date ORDER BY date;
date avgtemp ---------- ------- 1995-01-16 294.985 1995-02-16 295.486 1995-03-16 296.315 1995-04-16 297.119 1995-05-16 297.244 1995-06-16 296.976 ...
Overall, it appears that 1997 and 1998 were generally warmer years.
One other feature to notice about this example SQL query is that it defines a column alias, avgtemp, for the column of averages. This alias can be used within the SQL query, which can make the query easier to type and easier to read. The alias is also used in the presentation of the result.
As demonstrated in the previous section, database queries from a single table are quite straightforward. However, most databases consist of more than one table, and most interesting database queries involve extracting information from more than one table. In database terminology, most queries involve some sort of join between two or more tables.
New Zealand sent a team of 18 swimmers to the Melbourne 2006 Commonwealth Games, 10 women and 8 men. The results from their races are recorded in a database with the following structure.
As an example of the information stored in this database, the following code shows that the swimmer with an ID of 1 is called Zoe Baker.
> SELECT * FROM swimmer_table WHERE ID = 1;
ID first last -- ----- ----- 1 Zoe Baker
The following code shows that Zoe Baker swam in three races, a heat, a semifinal and the final of the women's 50m breaststroke, and she came 4th in the final in a time of 31 minutes and 27 seconds.
> SELECT * FROM result_table WHERE swimmer = 1;
swimmer distance stroke gender stage time place ------- -------- ------ ------ ----- ----- ----- 1 50 Br F Final 31.45 4 1 50 Br F Heat 31.7 4 1 50 Br F Semi 31.84 5
The most basic type of database join, upon which all other types of join are based, is a cross join. The result of a cross join is the cartesian product of the rows of one table with the rows of another table. In other words, row 1 of table 1 is paired with each row of table 2, then row 2 of table 1 is paired with each row of table 2, and so on. If the first table has n1 rows and the second table has n2 rows, the result of a cross join is a table with n1 x n2 rows.
The simplest way to create a cross join is simply to perform an SQL query on more than one table. As an example, the following code performs a cross join on the distance_table and stroke_table in the swimming database to generate all possible swimming events.
> SELECT length, stroke FROM distance_table, stroke_table;
length stroke ------ ---------------- 50 Breaststroke 50 Freestyle 50 Butterfly 50 Backstroke 50 IndividualMedley 100 Breaststroke 100 Freestyle 100 Butterfly 100 Backstroke 100 IndividualMedley 200 Breaststroke 200 Freestyle 200 Butterfly 200 Backstroke 200 IndividualMedley 400 Breaststroke 400 Freestyle 400 Butterfly 400 Backstroke 400 IndividualMedley
A cross join can also be obtained more explicitly using the CROSS JOIN syntax as shown below (the result is exactly the same as for the code above).
SELECT length, stroke FROM distance_table CROSS JOIN stroke_table;
An inner join is the most common way of combining two tables. In this sort of join, only matching rows are extracted from two tables. Typically, a foreign key in one table is matched to the primary key in another table.
Conceptually, an inner join is a cross join, with only the desired rows retained. In practice, DBMS software analyses queries and obtains the result more directly in order to use less time and less computer memory.
In order to demonstrate inner joins, we will return to the Data Expo database (see 9.2.2) and consider the problem of determining which geographic locations have pressure less than 1000 (at least once), but have an elevation equal to zero. In other words, which locations have pressure which is below the normal sea-level pressure, but are actually at sea-level.
In order to answer this question, we need information from both the measure_table (to provide air pressure) and the location_table (to provide elevation). The following SQL command produces the information we need (see Figure 9.7).
> SELECT longitude, latitude, pressure, elevation FROM measure_table mt, location_table lt WHERE mt.location = lt.ID AND pressure < 1000 AND elevation = 0 ORDER BY latitude, longitude;
longitude latitude pressure elevation --------- -------- -------- --------- -76.25 -21.25 995.0 0.0 -76.25 -21.25 995.0 0.0 -76.25 -21.25 995.0 0.0 -76.25 -21.25 990.0 0.0 -76.25 -21.25 995.0 0.0 -76.25 -21.25 995.0 0.0 ...
The most important feature of this code is the fact that it obtains information from two tables.
FROM measure_table mt, location_table lt
In order to merge the information from the two tables in a sensible fashion, we must specify how rows from one table are matched up with rows from the other table. In most cases, this means specifying that a foreign key from one table matches the primary key in the other table, which is precisely what has been done in this case.
WHERE mt.location = lt.ID
Another feature of this code is that it makes use of table aliases. For example, mt is defined as an alias for the measure_table. This makes it easier to type the code and can also make it easier to read the code.
Another way to specify this join uses a different syntax that places all of the information about the join in the FROM clause of the query. The following code produces exactly the same result as before, but uses the key words INNER JOIN between the tables that are being joined and follows that with a specification of the columns to match ON.
> SELECT longitude, latitude, pressure, elevation FROM measure_table mt INNER JOIN location_table lt ON mt.location = lt.ID WHERE pressure < 1000 AND elevation = 0 ORDER BY latitude, longitude;
One thing to notice about this example is that we are not actually selecting the locations with lower pressure. What we are really doing is selecting the pressures that are less than 1000 and reporting the location information for each of those pressures. This means that several locations are repeated in the result; at these locations the pressure dropped below 100 for more than one month. We could show this effect by counting how many months the pressure was below 1000, for each location.
> SELECT longitude, latitude, COUNT(*) numMonths FROM measure_table mt INNER JOIN location_table lt ON mt.location = lt.ID WHERE pressure < 1000 AND elevation = 0 GROUP BY location ORDER BY latitude, longitude;
longitude latitude numMonths --------- -------- --------- -76.25 -21.25 24 -73.75 -21.25 1 -78.75 -18.75 6 -73.75 -18.75 17 -81.25 -16.25 3 -78.75 -16.25 22 ...
Now consider a major summary of temperature values: what is the average temperature per year, across all locations on land (above sea level)?
In order to answer this question, we need to know the temperatures from the measure_table, the elevation from the location_table, and the years from the date_table. In other words, we need to combine all three tables together.
This situation is one reason for using the INNER JOIN syntax shown above, because it naturally extends to joining more than two tables, and provides a way for us to control the order in which the tables are joined. The following code performs the desired query (see Figure 9.8).
> SELECT year, AVG(surftemp) avgtemp FROM measure_table mt INNER JOIN location_table lt ON mt.location = lt.ID INNER JOIN date_table dt ON mt.date = dt.date WHERE elevation > 0 GROUP BY year;
year avgtemp ---- ------- 1995 295.380 1996 295.006 1997 295.383 1998 296.416 1999 295.258 2000 295.315
This result shows only 1998 as warmer than other years; the higher temperatures for 1997 that we saw in Figure 9.6 must be due to higher temperatures over water.
It is possible to use an SQL query within another SQL query. The nested query is called a sub-query.
As a simple example, consider the problem of extracting the date at which the maximum surface temperature occurred. It is simple enough to determine the maximum surface temperature.
> SELECT MAX(surftemp) max FROM measure_table;
max ----- 314.9
However, it is not so easy to report the date along with the maximum temperature because it is not valid to mix aggregated columns with non-aggregated columns. For example, the following SQL code will either trigger an error message or produce an incorrect result.
SELECT date, MAX(surftemp) max FROM measure_table;
The column date returns 41472 values, but the column MAX(surftemp) only returns 1.
The solution is to use a subquery as shown below.
> SELECT date, surftemp temp FROM measure_table WHERE surftemp = ( SELECT MAX(surftemp) FROM measure_table );
date temp ---------- ----- 1998-07-16 314.9 1998-07-16 314.9
The query that calculates the maximum surface temperature is inserted within brackets as a subquery within the WHERE clause. The outer query returns only the rows of the measure_table where the surface temperature is equal to the maximum.
The maximum temperature occured in July 1998 at two different locations.
In Section 9.2.5 we saw how to generate all possible combinations of distance and stroke in the swimming database using a cross join. There are four distances and five strokes, so the cross join produces 20 different combinations.
We will now take that cross join and combine it with the table of race results using an inner join. We will summarize the result of all races for a particular distance/stroke combination by calculating the average time from such races. The following code performs this inner join.
> SELECT length, st.stroke style, AVG(time) avg FROM distance_table dt CROSS JOIN stroke_table st INNER JOIN result_table rt ON dt.length = rt.distance AND st.ID = rt.stroke GROUP BY length, st.stroke;
length style avg ------ ---------------- ----- 50 Backstroke 28.04 50 Breaststroke 31.29 50 Butterfly 26.40 50 Freestyle 26.16 100 Backstroke 60.55 100 Breaststroke 66.07 100 Butterfly 56.65 100 Freestyle 57.10 200 Backstroke 129.7 200 Butterfly 119.0 200 Freestyle 118.6 200 IndividualMedley 129.5 400 IndividualMedley 275.2
This result has only 13 rows. What has happened to the remaining 7 combinations of distance and stroke? They have been dropped from the result because some combinations of distance and stroke do not appear in the result_table. For example, no New Zealand swimmer competed in the 50m individual medley (which is hardly surprising given that that event never took place).
This feature of inner joins is not always desirable and can produce misleading results, which is why an outer join is sometimes necessary. The following code is the same as before, except that it performs a left outer join so that all distance/stroke combinations are reported, even though there is no average time information available for some combinations.
> SELECT length, st.stroke style, AVG(time) avg FROM distance_table dt CROSS JOIN stroke_table st LEFT JOIN result_table rt ON dt.length = rt.distance AND st.ID = rt.stroke GROUP BY length, st.stroke;
length style avg ------ ---------------- ----- 50 Backstroke 28.04 50 Breaststroke 31.29 50 Butterfly 26.40 50 Freestyle 26.16 50 IndividualMedley NULL 100 Backstroke 60.55 100 Breaststroke 66.07 100 Butterfly 56.65 100 Freestyle 57.10 100 IndividualMedley NULL 200 Backstroke 129.7 200 Breaststroke NULL 200 Butterfly 119.0 200 Freestyle 118.6 200 IndividualMedley 129.5 400 Backstroke NULL 400 Breaststroke NULL 400 Butterfly NULL 400 Freestyle NULL 400 IndividualMedley 275.2
It is useful to remember that database joins always begin with a cartesian product of the rows of the tables being joined (conceptually at least). The different sorts of database join are all just different subsets of a cross join. This makes it possible to answer questions that, at first sight, may not appear to be database queries.
For example, it is possible to join a table with itself--a so-called self join. The result is all possible combinations of the rows of a table, which can be used to answer questions that require comparing a column within a table to itself or to other columns within the same table.
Consider the following question: at what locations and dates did the surface temperature at location 1 for January 1995 reoccur?
This question requires a comparison of one row of the surftemp column in the measure_table with the other rows in that column. The code below performs the query using a self join.
> SELECT mt1.surftemp temp1, mt2.surftemp temp2, mt2.location loc, mt2.date date FROM measure_table mt1, measure_table mt2 WHERE mt1.surftemp = mt2.surftemp AND mt1.date = '1995-01-16' AND mt1.location = 1;
temp1 temp2 loc date ----- ----- --- ---------- 272.7 272.7 1 1995-01-16 272.7 272.7 2 1995-12-16 272.7 272.7 3 1995-12-16 272.7 272.7 2 1996-01-16 272.7 272.7 3 1996-01-16 272.7 272.7 5 1996-12-16 272.7 272.7 5 1997-02-16 272.7 272.7 27 1997-12-16 272.7 272.7 29 1997-12-16 272.7 272.7 7 2000-12-16 272.7 272.7 8 2000-12-16 272.7 272.7 13 2000-12-16 272.7 272.7 14 2000-12-16
The temperature occurred again in neighbouring locations in December and January of 1995/1996 and again in other locations in later years.
Paul Murrell
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.