When a data set becomes very large, or even just very complex in its structure, the ultimate storage solution is a database.
The term “database” can be used generally to describe any collection of information. In this section, the term “database” means a relational database, which is a collection of data that is organised in a particular way.
Other types of database exist, such as hierarchical databases, network databases, and, more recently, object-oriented databases and XML databases, but relational databases are by far the most common.
The actual physical storage mechanism for a database--whether binary files or text files are used, whether one file or many files are used--will not concern us. We will only be concerned with the high-level, conceptual organisation of the data and will rely on software to decide how best to store the information in files.
The software that handles the physical representation, and allows us to work at a conceptual level, is called a database management system (DBMS), or in our case, more specifically a relational database management system (RDBMS).
A relational database consists of a set of tables, where a table is conceptually just like a plain text file or a spreadsheet: a set of values arranged in rows and columns. The difference is that there are usually several tables in a single database, and the tables in a database have a much more formal structure than a plain text file or a spreadsheet.
For example, below we have a simple database table containing information on books. The table has three columns--the ISBN of the book, the title of the book, and the author of the book--and four rows, with each row representing one book.
ISBN       title                      author
---------- -------------------------- ----------------
0395193958 The Hobbit                 J. R. R. Tolkien
0836827848 Slinky Malinki             Lynley Dodd
0393310728 How to Lie with Statistics Darrell Huff
0908783116 Mechanical Harry           Bob Kerr
Each table in a database has a unique name and each column in a table has a unique name.
Each column in a database table also has a data type associated with it, so all values in a single column are the same sort of data. In the book database example, all three columns are text or character values. The ISBN is not stored as an integer because it is a sequence of 10 digits rather than a numeric value; for example, if we stored the ISBN as an integer, we would lose the leading 0.
Each table in a database has a primary key. The primary key must be unique for every row in a table. In the book table, the ISBN provides a perfect primary key because every book has a different ISBN.
It is possible to create a primary key by combining the values of two or more columns. This is called a composite primary key. A table can only have one primary key, but the primary key may be composed from more than one column.
A database containing information on books might also contain information on book publishers. Below we show another table in the same database containing information on publishers.
ID name              city       country
-- ----------------- ---------- -----------
1  Mallinson Rendel  Wellington New Zealand
2  W. W. Norton      New York   USA
3  Houghton Mifflin  Boston     USA
Tables within the same database are related to each other using foreign keys. A foreign key is a column in one table that specifies a value from the primary key in another table. For example, we can relate each book in the book_table to a publisher in the publisher_table by adding a foreign key to the book_table. This foreign key consists of a column, pub, containing the appropriate publisher ID. The book_table now looks like this:
ISBN       title                      author           pub
---------- -------------------------- ---------------- ---
0395193958 The Hobbit                 J. R. R. Tolkien 3
0836827848 Slinky Malinki             Lynley Dodd      1
0393310728 How to Lie with Statistics Darrell Huff     2
0908783116 Mechanical Harry           Bob Kerr         1
Two of the books have the same publisher.
A database compared to a flat text file is like a Ferrari compared to a horse and cart: much more expensive!
A database is of course also far more sophisticated than a flat text file, not to mention faster, more agile, and so on, but the cost is worth keeping in mind because a database is not always the best option. It is also worth noting that the cost is not just the software--there are several open source (free) database management systems--there is also the cost of acquiring or hiring the expertise necessary to create, maintain, and interact with data stored in a database.
Databases tend to be used for large data sets because, for most DBMS, there is effectively no limit on the size of a database. However, even when a data set is not enormous, there are advantages to using a database because the organisation of the data can improve accuracy and efficiency. In particular, databases allow the data to be organised in a variety of ways so that, for example, data with a hierarchical structure can be stored in an efficient and natural way. These issues will be discussed further in Section 7.9.5.
Databases are also advantageous because most DBMS provide advanced features that are far beyond what is provided by the software that is used to work with data in other formats (e.g., text editors and spreadsheet programs). These features include the ability for multiple people to access and even modify the data at once, as well as fine control over who has access to the data and who is able to modify it.
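For example, in a client-server DBMS that implements SQL privilege commands, such as PostgreSQL or MySQL (SQLite has no notion of users), access control might be expressed something like this; the user names readonly_user and data_admin are hypothetical:

-- Allow one user to view the data, but not change it
GRANT SELECT ON book_table TO readonly_user;

-- Allow another user to view and modify the data
GRANT SELECT, INSERT, UPDATE, DELETE ON book_table TO data_admin;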
In the examples so far, we have only been seeing the contents of a database table. In the next section, on Database Design, it will be more important to describe the structure of a database table--the table schema. For this purpose, the contents of each row are not important; instead we are interested in the names of tables, the names of columns, which columns are primary keys, and which columns are foreign keys. The notation we will use is a simple text description, with primary keys and foreign keys indicated in square brackets. For example, these are the schema for the publisher_table and the book_table in the book database:
publisher_table (ID [PK], name, city, country)

book_table (ISBN [PK], title, author, pub [FK publisher_table.ID])
For a foreign key, the name of the table and the name of the column that the foreign key references are also described.
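To make the schema notation concrete, here is one way that these two tables might be declared in SQL (SQLite syntax is assumed; data types and constraint syntax vary between DBMS):

CREATE TABLE publisher_table
    (ID INTEGER PRIMARY KEY,
     name TEXT,
     city TEXT,
     country TEXT);

CREATE TABLE book_table
    (ISBN TEXT PRIMARY KEY,   -- text, not integer, so the leading 0 is kept
     title TEXT,
     author TEXT,
     pub INTEGER,
     FOREIGN KEY (pub) REFERENCES publisher_table (ID));

Declaring pub as a foreign key means that the DBMS will only accept a pub value that matches an ID value in the publisher_table (or is NULL).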
In this section, we will look at some issues relating to the design of databases. Although designing a database is not a very common task for a scientist, having an understanding of the concepts and tasks involved can be useful for several reasons.
This section provides neither an exhaustive nor a completely rigorous discussion of database design. Its purpose is to provide a basic introduction to some useful ideas and ways to think about data.
Data modelling is a more general task than designing a database. The aim is to produce a model of a system (a conceptual data model), which can then be converted to a database representation, but which could alternatively be converted into an object-oriented program or something else (all of these converted forms are logical or physical data models).
A conceptual data model is usually constructed using some sort of pictorial or diagram language, such as Entity-Relationship Diagrams (ERDs) or the Unified Modeling Language (UML). These languages are beyond the scope of this book, but we will use some of the building blocks of these conceptual models to help guide us in our brief discussion of database design.
One way to approach database design is to think in terms of entities and the relationships between them.
An entity is most easily thought of as a person, place, or physical object (e.g., a book), an event, or a concept (e.g., a publisher). An attribute is a piece of information about the entity. For example, the title, author, and ISBN are all attributes of a book entity.
The simple rule for starting to design a database is that there should be a table for each entity and a column in that table for each attribute of the entity.
Rather than storing a data set as one big table of information, this rule suggests that we should use several tables, with information about different entities in separate tables.
A relationship is an association between entities. For example, a publisher publishes books and a book is published by a publisher. Relationships are represented in a database by foreign key-primary key pairs, but the details depend on the cardinality of the relationship--whether the relationship is one-to-one (1:1), many-to-one (M:1), or many-to-many (M:N).
A book is published by exactly one publisher, but a publisher publishes many books, so the relationship between books and publishers is many-to-one. This sort of relationship can be represented by placing a foreign key in the table for books (the “many” side), which refers to the primary key in the table for publishers (the “one” side).
One-to-one relationships can be handled similarly to many-to-one relationships (it does not matter which table gets the foreign key), but many-to-many relationships are more complex.
In our book database example, we can identify another sort of entity: authors. This suggests that there should be another table for author information. For now, the table only contains the author's name, but other information, such as the author's age and nationality, could be added.
author_table (ID [PK], name)
What is the relationship between books and authors? An author can write several books and a book can have more than one author, so this is an example of a many-to-many relationship.
A many-to-many relationship can only be represented by creating a new table. For example, we can create a table that contains the relationship between authors and books. This table contains a foreign key that refers to the author table and a foreign key that refers to the book table. The representation of book entities, author entities, and the relationship between them now consists of three tables like this:
author_table (ID [PK], name)

book_table (ISBN [PK], title, pub [FK publisher_table.ID])

book_author_table (ID [PK], book [FK book_table.ISBN],
                   author [FK author_table.ID])
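As a sketch of how the linking table might be declared in SQL (SQLite syntax assumed), note that it has two foreign keys, one for each side of the many-to-many relationship:

CREATE TABLE book_author_table
    (ID INTEGER PRIMARY KEY,
     book TEXT,
     author INTEGER,
     FOREIGN KEY (book) REFERENCES book_table (ISBN),
     FOREIGN KEY (author) REFERENCES author_table (ID));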
The contents of these tables for several books are shown below. The author table just lists the authors for whom we have information:
ID name
-- ----------------
2  Lynley Dodd
5  Eve Sutton
The book table just lists the books that are in the database:
ISBN       title
---------- ------------------------------------
0908606664 Slinky Malinki
0908606206 Hairy Maclary from Donaldson's Dairy
0908606273 My Cat Likes to Hide in Boxes
The association between books and authors is stored in the book_author_table:
ID book       author
-- ---------- ------
2  0908606664 2
3  0908606206 2
6  0908606273 2
7  0908606273 5
Notice that author 2 (Lynley Dodd) has written more than one book and book 0908606273 has more than one author (rows 6 and 7).
Another reason for creating an additional table in a database is for the purpose of constraining the set of possible values for an attribute. For example, if the table of authors records the nationality of the author, it can be useful to have a separate table that contains the possible nationalities. The column in the author table then becomes a foreign key referring to the nationality table and, because a foreign key must match the value of the corresponding primary key (or be NULL), we have a check on the validity of the nationality in the author table.
The redesigned author table now looks like this:
author_table (ID [PK], name,
              nationality [FK nationality_table.ID])

nationality_table (ID [PK], nationality)
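The validity check can be demonstrated with a small SQL sketch (SQLite syntax assumed; note that SQLite only enforces foreign keys once PRAGMA foreign_keys = ON has been issued):

PRAGMA foreign_keys = ON;

CREATE TABLE nationality_table
    (ID INTEGER PRIMARY KEY,
     nationality TEXT);

CREATE TABLE author_table
    (ID INTEGER PRIMARY KEY,
     name TEXT,
     nationality INTEGER,
     FOREIGN KEY (nationality) REFERENCES nationality_table (ID));

-- This insertion is rejected unless nationality_table has a row with ID 99
INSERT INTO author_table (name, nationality) VALUES ('Lynley Dodd', 99);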
Normalisation is a formal process of ensuring that a database satisfies a set of rules called normal forms. There are several of these rules, but we will only mention the first three. The proper definition of normalisation depends on more advanced relational database concepts that are beyond the scope of this book, so the descriptions below are just to give a feel for how the process works.
The first rule, known as first normal form, has three parts: the columns of a table must be atomic, there must be no duplicative columns, and every table must have a primary key. The first part of this rule says that a column in a database table must only contain a single value. As an example, consider the following table for storing information about books:
title                         authors
----------------------------- -----------------------
Slinky Malinki                Lynley Dodd
My Cat Likes to Hide in Boxes Eve Sutton, Lynley Dodd
The first column of this table is acceptable because it just contains one piece of information: the title of the book. However, the second column is not atomic because it contains a list of authors for each book. For example, the book on the second row has two authors.
The second part of the rule says that a table cannot have two columns containing the same information. For example, in the following table, the columns author1 and author2 are duplicative columns.
title                         author1     author2
----------------------------- ----------- -----------
Slinky Malinki                Lynley Dodd NULL
My Cat Likes to Hide in Boxes Eve Sutton  Lynley Dodd
The final part of the rule says that there must be a column, or combination of columns, in the table that has a unique value in every row; in other words, every table must have a primary key.
The second rule, known as second normal form, formalises the idea that there should be a table for each entity in the data set. Consider the following table for storing information about books:
ISBN       author      price
---------- ----------- -----
0908606664 Lynley Dodd 30.00
0908606206 Lynley Dodd 30.00
0908606273 Lynley Dodd 25.00
0908606273 Eve Sutton  25.00
The primary key for this table is a combination of ISBN and author (each row of the table carries information about one author of a book). The price column relates to the ISBN; this is the price of the book. However, the price column does not relate to the author; this is not the price of the author!
The table needs to be split into two tables, one with the information about books and one with the information about authors.
When a new table is created for author information, it is vital that the new table has a link to the book table via some sort of foreign key (see the earlier discussion of the relationships between entities).
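For example, using the schema notation from earlier, the split might look something like this (the price stays with the book because it relates to the ISBN, while the link to authors is recorded in a separate table):

book_table (ISBN [PK], price)

book_author_table (ID [PK], book [FK book_table.ISBN], author)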
The third rule, known as third normal form, further emphasises the idea that there should be a separate table for each entity in the data set. Consider the following table for storing information about books:
ISBN       title            publisher        country
---------- ---------------- ---------------- -------
0395193958 The Hobbit       Houghton Mifflin USA
0836827848 Slinky Malinki   Mallinson Rendel NZ
0908783116 Mechanical Harry Mallinson Rendel NZ
The primary key of this table is the ISBN, which uniquely identifies a book. The title column relates to the book; this is the title of the book. The publisher column also relates to the book; this is the publisher of the book. However, the country column does not relate directly to the book; this is the country of the publisher. That obviously is information about the book--it is the country of the publisher of the book--but the relationship is indirect, through the publisher.
There is a simple heuristic that makes it easy to spot this sort of problem in a database table. Notice that the information in the publisher and country columns is identical for the books published by Mallinson Rendel. When two or more columns repeat the same information over and over, it is a sure sign that either second or third normal form is not being met.
In this case, the analysis of the table suggests that there should be a separate table for information about the publisher.
Again, it is important to link the new table with a foreign key.
Normalising a database is an effective and formalised way of achieving the design goals that we outlined previously: memory efficiency, improved accuracy (data integrity), and ease of maintenance.
Applying the rules of normalisation usually results in the creation of more tables in a database. The previous discussion of relationships should be consulted to make sure that any new tables are linked to at least one other table in the database.
A well designed database will have the feature that each piece of information is stored only once. Less repetition of data values means that a well-designed database will usually require less memory than storing an entire data set in a naive single-table format. Less repetition also means that a well-designed database is easier to maintain and update, because a change only needs to be made in one location. Furthermore, there is less chance of errors creeping into the data set. If there are multiple copies of information, then it is possible for the copies to disagree, but with only one copy there can be no disagreements.
These ideas are an expression of the DRY principle from Section 2.11. A well-designed database is the ultimate embodiment of the DRY principle for data storage.
The Data Expo data set consists of seven atmospheric variables recorded at 576 locations for 72 time points (every month for 6 years), plus elevation data for each location (see Section 7.5.6).
The data were originally stored as 505 plain text files, where each file contains the data for one variable for one month. Figure 7.11 shows the first few lines from one of the plain text files.
As we have discussed earlier in this chapter, this simple format makes the data very accessible. However, this is an example where a plain text format is quite inefficient, because many values are repeated. For example, the longitude and latitude information for each location in the data set is stored in every single file, which means that this information is repeated over 500 times! That not only takes up more storage space than is necessary, but it also violates the DRY principle, with all of the negative consequences that follow.
VARIABLE : Mean TS from clear sky composite (kelvin)
FILENAME : ISCCPMonthly_avg.nc
FILEPATH : /usr/local/fer_dsets/data/
SUBSET   : 24 by 24 points (LONGITUDE-LATITUDE)
TIME     : 16-JAN-1995 00:00

             113.8W 111.2W 108.8W 106.2W 103.8W 101.2W  98.8W ...
              27     28     29     30     31     32     33    ...
36.2N / 51:  272.7  270.9  270.9  269.7  273.2  275.6  277.3  ...
33.8N / 50:  279.5  279.5  275.0  275.6  277.3  279.5  281.6  ...
31.2N / 49:  284.7  284.7  281.6  281.6  280.5  282.2  284.7  ...
28.8N / 48:  289.3  286.8  286.8  283.7  284.2  286.8  287.8  ...
26.2N / 47:  292.2  293.2  287.8  287.8  285.8  288.8  291.7  ...
23.8N / 46:  294.1  295.0  296.5  286.8  286.8  285.2  289.8  ...
...
In this section, we will consider how the Data Expo data set could be stored as a relational database.
To start with, we will consider what entities there are in the data set. In this case, the different entities that are being measured are relatively easy to identify. There are measurements on the atmosphere, and the measurements are taken at different locations and at different times. We have information about each time point (i.e., a date), we have information about each location (longitude and latitude and elevation), and we have several measurements on the atmosphere. This suggests that we should have three tables: one for locations, one for time points, and one for atmospheric measures.
We could also look at the data set from a normalisation perspective. We start with a single table containing all columns (only 7 rows shown):
date       lon    lat   elv chi  cmid clo  ozone press  stemp temp
---------- ------ ----- --- ---- ---- ---- ----- ------ ----- -----
1995-01-16 -56.25 36.25 0.0 25.5 17.5 38.5 298.0 1000.0 289.8 288.8
1995-01-16 -56.25 33.75 0.0 23.5 17.5 36.5 290.0 1000.0 290.7 289.8
1995-01-16 -56.25 31.25 0.0 20.5 17.0 36.5 286.0 1000.0 291.7 290.7
1995-01-16 -56.25 28.75 0.0 12.5 17.5 37.5 280.0 1000.0 293.6 292.2
1995-01-16 -56.25 26.25 0.0 10.0 14.0 35.0 272.0 1000.0 296.0 294.1
1995-01-16 -56.25 23.75 0.0 12.5 11.0 32.0 270.0 1000.0 297.4 295.0
1995-01-16 -56.25 21.25 0.0  7.0 10.0 31.0 260.0 1000.0 297.8 296.5
In terms of first normal form, all columns are atomic and there are no duplicative columns, and we can, with a little effort, find a primary key: we need a combination of date, lon (longitude), and lat (latitude) to get a unique value for all rows.
Moving on to second normal form, the column elv (elevation) immediately fails. The elevation at a particular location clearly relates to the longitude and latitude of the location, but it has very little to do with the date. We need a new table to hold the longitude, latitude, and elevation data.
The new table looks like this (only 7 rows shown):
lon    lat   elv
------ ----- ---
-56.25 36.25 0.0
-56.25 33.75 0.0
-56.25 31.25 0.0
-56.25 28.75 0.0
-56.25 26.25 0.0
-56.25 23.75 0.0
-56.25 21.25 0.0
This “location” table is in third normal form. It has a primary key (a combination of longitude and latitude), and the elv column relates directly to that primary key.
Going back to the original table, the remaining columns of atmospheric measurements are all related to the primary key; the data in these columns represents an observation at a particular location at a particular time point.
Having split the data set into separate tables, we must make sure that the tables are linked to each other (at least indirectly), and in order to achieve this, we need to determine the relationships between the tables.
We have two tables, one representing atmospheric measurements, at various locations and times, and one representing information about the locations. What is the relationship between these tables? Each location (each row of the location table) corresponds to several measurements, but each individual measurement (each row of the measurement table) corresponds to only one location, so the relationship is many-to-one.
This means that the table of measurements should have a foreign key that references the primary key in the location table. The design could be expressed like this:
location_table ( longitude [PK], latitude [PK], elevation )

measure_table ( date [PK],
                longitude [PK] [FK location_table.longitude],
                latitude [PK] [FK location_table.latitude],
                cloudhigh, cloudlow, cloudmid, ozone,
                pressure, surftemp, temperature )
Both tables have composite primary keys, the measure_table also has a composite foreign key (to match the composite primary key), and the longitude and latitude columns of the measure_table have roles in both the primary key and the foreign key.
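Expressed in SQL (SQLite syntax, with REAL columns assumed for the measurements), the composite primary keys and the composite foreign key might be declared like this:

CREATE TABLE location_table
    (longitude REAL,
     latitude REAL,
     elevation REAL,
     PRIMARY KEY (longitude, latitude));

CREATE TABLE measure_table
    (date TEXT,
     longitude REAL,
     latitude REAL,
     cloudhigh REAL, cloudlow REAL, cloudmid REAL, ozone REAL,
     pressure REAL, surftemp REAL, temperature REAL,
     PRIMARY KEY (date, longitude, latitude),
     FOREIGN KEY (longitude, latitude)
         REFERENCES location_table (longitude, latitude));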
This design is a reasonable one, but we will go a bit further because the date column deserves a little more consideration.
As mentioned elsewhere, dates can be tricky to work with. The dates have been entered into the database as text. They are in the ISO 8601 format, so that alphabetical order is chronological order. This makes it easy to sort or extract information from a contiguous set of dates (e.g., all dates after December 1998). However, it would be difficult to extract non-contiguous subsets of the data (e.g., all data from December for all years). This sort of task would be much easier if we had separate columns of month and year information. If we add these columns to the data set, we get a table like this (only 7 rows shown; not all atmospheric variables shown):
date       lon    lat   month   year chi  cmid clo  ozone
---------- ------ ----- ------- ---- ---- ---- ---- -----
1995-01-16 -56.25 36.25 January 1995 25.5 17.5 38.5 298.0
1995-01-16 -56.25 33.75 January 1995 23.5 17.5 36.5 290.0
1995-01-16 -56.25 31.25 January 1995 20.5 17.0 36.5 286.0
1995-01-16 -56.25 28.75 January 1995 12.5 17.5 37.5 280.0
1995-01-16 -56.25 26.25 January 1995 10.0 14.0 35.0 272.0
1995-01-16 -56.25 23.75 January 1995 12.5 11.0 32.0 270.0
1995-01-16 -56.25 21.25 January 1995  7.0 10.0 31.0 260.0
With these extra columns added, the table violates second normal form again. The month and year columns relate to the date, but have nothing to do with longitude and latitude. We must create a new table for the date information.
This new table consists of date, month, and year, and we can use date as the primary key. The relationship between this table and the original table is many-to-one (each date corresponds to many measurements, but each individual measurement was taken on a single date), so another foreign key is added to the original table to link the tables together. The new date table looks like this (only 7 rows shown):
date       month    year
---------- -------- ----
1995-01-16 January  1995
1995-02-16 February 1995
1995-03-16 March    1995
1995-04-16 April    1995
1995-05-16 May      1995
1995-06-16 June     1995
1995-07-16 July     1995
One possible final adjustment to the database design is to consider a surrogate auto-increment key as the primary key for the location table, because the natural primary key is quite large and cumbersome. This leads to a final design that can be expressed like this:
date_table ( date [PK], month, year )

location_table ( ID [PK], longitude, latitude, elevation )

measure_table ( date [PK] [FK date_table.date],
                location [PK] [FK location_table.ID],
                cloudhigh, cloudlow, cloudmid, ozone,
                pressure, surftemp, temperature )
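As a sketch, the final design might be declared in SQL like this (SQLite syntax and data types assumed):

CREATE TABLE date_table
    (date TEXT PRIMARY KEY,
     month TEXT,
     year INTEGER);

CREATE TABLE location_table
    (ID INTEGER PRIMARY KEY,
     longitude REAL,
     latitude REAL,
     elevation REAL);

CREATE TABLE measure_table
    (date TEXT,
     location INTEGER,
     cloudhigh REAL, cloudlow REAL, cloudmid REAL, ozone REAL,
     pressure REAL, surftemp REAL, temperature REAL,
     PRIMARY KEY (date, location),
     FOREIGN KEY (date) REFERENCES date_table (date),
     FOREIGN KEY (location) REFERENCES location_table (ID));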
The final database, stored as an SQLite file, is a little over 2 MB in size, compared to 4 MB for the original plain text files.
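As an illustration of the benefit of the separate date table, extracting the non-contiguous subset of all December measurements becomes a simple join (a sketch using the table and column names from the design above):

SELECT m.location, m.temperature
    FROM measure_table m
        INNER JOIN date_table d ON m.date = d.date
    WHERE d.month = 'December';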
Atlantic cod (Gadus morhua). Illustration by carolewalsh.com.
One of the research projects conducted by Pêches et Océans Canada, the Canadian Department of Fisheries and Oceans (DFO), involves collecting data on the diet of Atlantic cod (Gadus morhua) in the Gulf of St. Lawrence, Eastern Canada.
Large quantities of cod are collected by a combination of research vessels and contracted fishing vessels and the contents of the cod stomachs are analysed to determine which species the cod have eaten.
Fish are collected when one or more ships set out on a fishing “trip”. On a single trip, each ship performs several “sets”, where each set consists of either a hooked line or a net being placed in the water and then subsequently being recovered (and checked for fish).
The primary experimental unit is a lump of something (a “prey item”) found in a cod stomach. For each lump, the following variables are measured:
region ship_type ship_id trip set fish_id fish_length prey_mass prey_type
"SGSL" "2"       NA      "95" 3   30      530         27.06     "Other"
"SGSL" "2"       NA      "95" 3   30      530          1.47     "Other"
"SGSL" "2"       NA      "95" 3   30      530          4.77     "Other"
"SGSL" "2"       NA      "95" 3   31      490         34.11     "Other"
"SGSL" "2"       NA      "95" 3   31      490          0.17     "Other"
"SGSL" "2"       NA      "95" 3   31      490          2.27     "Other"
"SGSL" "2"       NA      "95" 3   32      470          0.52     "Other"
"SGSL" "2"       NA      "95" 3   32      470          0.21     "Other"
"SGSL" "2"       NA      "95" 3   32      470          1.7      "Other"
"SGSL" "2"       NA      "95" 3   33      480          1.97     "Other"
...
We can start off by identifying a few simple entities within this data set. For example, thinking about the physical objects involved, there is clearly information about individual ships and information about individual fish, so we will have a fish_table and a ship_table.
It is abundantly clear that there are going to be several tables in the database. From a normalisation point of view, a single table would not work: there would be columns that do not relate to the primary key, or that relate not only to the primary key but also to each other; indeed, we would have trouble finding a primary key for one big table in the first place.
For each ship we have one or two identification numbers. We need both numbers in order to uniquely identify each ship (different commercial vessels share the same ship_type), so we cannot use either variable on its own as a primary key. Furthermore, the ship_id for research vessels is missing (in database terms, the value will be NULL), which means that the ship_id variable cannot be used as part of the primary key. We will use an artificial, auto-increment key for this table.
ship_table (ID [PK], ship_type, ship_id)
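In SQLite, for example, such an artificial key might be declared like this (a sketch; the ship_id column is left nullable because the value is missing for research vessels):

CREATE TABLE ship_table
    (ID INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial, auto-assigned key
     ship_type TEXT,
     ship_id TEXT);                         -- NULL for research vessels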
For each fish, we have a numeric label and the length of the fish; we also have lots of information about what was in the stomach of the fish, but we will leave that until later. The fish_id label is not unique for each fish, so again we will use an auto-increment primary key.
fish_table (ID [PK], fish_id, fish_length)
Another important physical entity in the data set is a prey item (a lump found in a fish stomach). We could have a lump_table where, for each lump, we have information about the species of the prey item and the weight of the lump. We will add an auto-increment variable to provide a unique identifier for each lump in the entire data set.
lump_table (ID [PK], prey_mass, prey)
We will develop this table further a little later.
The prey_type variable is a good example where we might like to create a new table for validating the species entered in the lump_table. Another reason for considering this approach is the possibility that we might be interested in a species that does not occur in the data set we have (but could conceivably occur in future studies). We could also use a prey_table to provide a mapping between the generic Other species category and individual species which have been grouped therein. We could use the species name itself as the primary key for the prey_table, but in terms of efficiency of storage, having a single integer identifier for each species requires less storage in the (large) lump_table than storing the full species label.
prey_table (ID [PK], prey_type)
The relationship between the lump_table and the prey_table is many-to-one, so we place a prey foreign key in the lump_table.
Lumps are also fairly obviously related to fish; each lump comes from exactly one fish and each fish can have several lumps. We also place a fish foreign key in the lump_table.
lump_table (ID [PK], prey_mass, prey [FK prey_table.ID], fish [FK fish_table.ID])
It is worth pointing out that, through the lump_table, we have resolved the many-to-many relationship between fish and prey.
Now we come to the more complicated part of modelling the cod data set. How are fish and ships related to each other? And how do we bring in the other information in the data set (region, trip, and set)? At this point, thinking about physical entities does not help us much; we need to think in terms of the events involved in the data collection instead.
Initially, the situation does not look too bad; each fish was caught by exactly one ship (and each ship caught many fish). However, the process was not that simple. Each fish was caught in exactly one set (one check of the net or hooks) and each set occurred on exactly one trip. However, some trips involved several ships and some ships conducted more than one trip. There is another many-to-many relationship lurking within the data. To resolve this, we will focus on the fishing sets.
For each set we have a label, set_num. The set occurred on exactly one trip, so we can include the label for the trip. The set was performed by exactly one ship, so we can include a foreign key to the ship table. This resolves the many-to-many relationship between ships and trips. Finally, we include information about the region. This is expanded into a separate table, which allows us to provide a more expansive description of each region. The original region code makes a nice primary key and, because it is fairly compact, will not result in too much inefficiency in terms of space. An auto-increment variable provides a unique identifier for each set.
region_table (region [PK], name)

set_table (ID [PK], set_num, trip, ship [FK ship_table.ID],
           region [FK region_table.region])

Because each fish was caught in exactly one set, the fish_table also gains a set foreign key, which links each fish to the set in which it was caught:

fish_table (ID [PK], fish_id, fish_length, set [FK set_table.ID])
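A hedged SQL sketch of the region and set tables (SQLite syntax, with data types assumed from the data extract above):

CREATE TABLE region_table
    (region TEXT PRIMARY KEY,
     name TEXT);

CREATE TABLE set_table
    (ID INTEGER PRIMARY KEY,
     set_num INTEGER,
     trip TEXT,
     ship INTEGER,
     region TEXT,
     FOREIGN KEY (ship) REFERENCES ship_table (ID),
     FOREIGN KEY (region) REFERENCES region_table (region));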