Figure 7.7 shows the surface temperature data at the Pacific Pole of Inaccessibility (see Section 1.1) in two different formats: the original plain text and an XML format.
<?xml version="1.0"?>
<temperatures>
    <variable>Mean TS from clear sky composite (kelvin)</variable>
    <filename>ISCCPMonthly_avg.nc</filename>
    <filepath>/usr/local/fer_dsets/data/</filepath>
    <subset>93 points (TIME)</subset>
    <longitude>123.8W(-123.8)</longitude>
    <latitude>48.8S</latitude>
    <case date="16-JAN-1994" temperature="278.9" />
    <case date="16-FEB-1994" temperature="280" />
    <case date="16-MAR-1994" temperature="278.9" />
    <case date="16-APR-1994" temperature="278.9" />
    <case date="16-MAY-1994" temperature="277.8" />
    <case date="16-JUN-1994" temperature="276.1" />
    ...
</temperatures>

VARIABLE : Mean TS from clear sky composite (kelvin)
FILENAME : ISCCPMonthly_avg.nc
FILEPATH : /usr/local/fer_dsets/data/
SUBSET : 93 points (TIME)
LONGITUDE: 123.8W(-123.8)
LATITUDE : 48.8S
123.8W
23
16-JAN-1994 00 / 1: 278.9
16-FEB-1994 00 / 2: 280.0
16-MAR-1994 00 / 3: 278.9
16-APR-1994 00 / 4: 278.9
16-MAY-1994 00 / 5: 277.8
16-JUN-1994 00 / 6: 276.1
...
One fundamental similarity between these formats is that they are both just text. This is an important and beneficial property of XML; we can read it and manipulate it without any special skills or any specialized software.
There are many advantages to storing information in a plain text format, mostly related to the simplicity of plain text files (see Section 7.5.3). However, that same simplicity also creates problems. XML is a storage format that is still based on plain text, but does not suffer from many of the problems of plain text files because it adds a level of rigour and standardization.
The XML format of the data consists of two parts: XML markup and the actual data itself. For example, the information about the latitude at which these data were recorded is stored with XML tags, <latitude> and </latitude>, surrounding the latitude value. The combination of tags and content is described as an XML element.
<latitude>48.8S</latitude>
Each temperature measurement is contained within a case element, with the date and temperature data recorded as attributes of the element.
<case date="16-JAN-1994" temperature="278.9" />
This should look very familiar because these are exactly the same notions of elements and attributes that we saw in HTML documents (see Chapter 2).
In what ways is the XML format better or worse than the plain text format?
The core advantage of an XML document is that it is self-describing.
The tags in an XML document provide information about where the information is stored within the document. This is an advantage because it means that humans can find information within the file easily. That is true of any plain text file, but it is especially true of XML files because the tags essentially provide a level of documentation for the human reader. For example, a line like this ...
<latitude>48.8S</latitude>
... not only makes it easy to determine that the value 48.8S constitutes a single data value within the file, but it also makes it clear that this value is a north-south geographic location.
The fact that an XML document is self-describing is also a huge advantage from the perspective of the computer. An XML document provides enough information for software to determine how to read the file, without any further human intervention. Looking again at the line containing latitude information ...
<latitude>48.8S</latitude>
... there is enough information for the computer to be able to detect the value 48.8S as a single data value, and the computer can also record the latitude label so that if a human user requests the information on latitude, the computer knows what to provide.
One consequence of this feature that may not be immediately obvious is that it is much easier to modify the structure of data within an XML document compared to a plain text file. The location of information within an XML document depends not so much on where it occurs within the file as on where the tags occur within the file. As a trivial example, consider swapping the following lines in an XML file ...
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
... so that they look like this instead ...
<latitude>48.8S</latitude>
<longitude>123.8W(-123.8)</longitude>
The information is now at a different location within the file, but the task of retrieving the information on latitude is exactly the same. This can be a huge advantage if larger modifications need to be made to a data set, such as adding an entire new variable.
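As a small illustration of this point, the following Python sketch retrieves the latitude from both orderings with exactly the same code. It uses the standard xml.etree.ElementTree module; the function name get_latitude and the enclosing temperatures wrapper around the two lines are choices made for this example only.

```python
import xml.etree.ElementTree as ET

# Two versions of the same fragment; only the order of the elements differs.
# The <temperatures> wrapper gives each string a single root element.
original = """<temperatures>
    <longitude>123.8W(-123.8)</longitude>
    <latitude>48.8S</latitude>
</temperatures>"""

swapped = """<temperatures>
    <latitude>48.8S</latitude>
    <longitude>123.8W(-123.8)</longitude>
</temperatures>"""


def get_latitude(xml_text):
    """Return the content of the latitude element, wherever it appears."""
    root = ET.fromstring(xml_text)
    return root.findtext("latitude")


print(get_latitude(original))  # 48.8S
print(get_latitude(swapped))   # 48.8S
```

The lookup is by tag name rather than by line or character position, which is why the swap makes no difference.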
The second main advantage of the XML format is that it can accommodate complex data structures. Consider the hierarchical data set in Figure 7.4. Because XML elements can be nested within each other, this sort of data set can be stored in a sensible fashion with families grouped together to make parent-child relations implicit and avoid repetition of the parent data. The plain text representation of these data is reproduced below, along with a possible XML representation.
John,33,male,Julia,32,female,Jack,6,male
John,33,male,Julia,32,female,Jill,4,female
John,33,male,Julia,32,female,John jnr,2,male
David,45,male,Debbie,42,female,Donald,16,male
David,45,male,Debbie,42,female,Dianne,12,female
<family>
    <parent gender="male" name="John" age="33" />
    <parent gender="female" name="Julia" age="32" />
    <child gender="male" name="Jack" age="6" />
    <child gender="female" name="Jill" age="4" />
    <child gender="male" name="John jnr" age="2" />
</family>
<family>
    <parent gender="male" name="David" age="45" />
    <parent gender="female" name="Debbie" age="42" />
    <child gender="male" name="Donald" age="16" />
    <child gender="female" name="Dianne" age="12" />
</family>
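To see how the nesting carries the parent-child relations, here is a minimal Python sketch that reads the XML representation and lists each family's children against its parents. The enclosing families element is an assumption added for this example, because a complete XML document needs a single root element; the function name children_by_parents is also hypothetical.

```python
import xml.etree.ElementTree as ET

# The <families> root is an assumed wrapper, not part of the original example.
doc = """<families>
    <family>
        <parent gender="male" name="John" age="33" />
        <parent gender="female" name="Julia" age="32" />
        <child gender="male" name="Jack" age="6" />
        <child gender="female" name="Jill" age="4" />
        <child gender="male" name="John jnr" age="2" />
    </family>
    <family>
        <parent gender="male" name="David" age="45" />
        <parent gender="female" name="Debbie" age="42" />
        <child gender="male" name="Donald" age="16" />
        <child gender="female" name="Dianne" age="12" />
    </family>
</families>"""


def children_by_parents(xml_text):
    """Map each family's parent names to the list of its children's names."""
    result = {}
    for family in ET.fromstring(xml_text).findall("family"):
        parents = tuple(p.get("name") for p in family.findall("parent"))
        result[parents] = [c.get("name") for c in family.findall("child")]
    return result


families = children_by_parents(doc)
for parents, children in families.items():
    print(parents, children)
```

Each parent is recorded once, and membership of a family is read directly from which family element a person sits inside.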
Another important advantage of the XML format is that it provides some level of checking on the correctness of the data file (a check on the data integrity). We will discuss this more when we look at XML schema (see Section 7.6.6), but even just the fact that an XML document must obey the rules of XML means that we can use a computer to check that an XML document at least has a sensible structure.
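The most basic form of this checking can be sketched in a few lines of Python: an XML parser simply refuses a document that breaks the rules of XML. The function name is_well_formed and the deliberately broken examples are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text obeys the basic rules of XML syntax."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = '<latitude>48.8S</latitude>'
# Elements must nest properly: the closing tags here are in the wrong order.
bad_nesting = '<temperatures><latitude>48.8S</temperatures></latitude>'
# Attribute values must be quoted: the date value here is not.
bad_quotes = '<case date=16-JAN-1994 temperature="278.9" />'

for text in (good, bad_nesting, bad_quotes):
    print(is_well_formed(text), text)
```

This only checks structure; it says nothing about whether 48.8S is a sensible latitude.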
The major disadvantage of XML is that it generates large files. With it being a plain text format, it is not memory efficient to start with, then with all of the additional tags around the actual data, files can become extremely large. In many cases, the tags can take up more room than the actual data!
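A rough way to see this overhead is to count characters in one of the case elements from Figure 7.7. The sketch below treats the attribute values as the data and everything else on the line as markup; that split is a simplification chosen for this example.

```python
import xml.etree.ElementTree as ET

line = '<case date="16-JAN-1994" temperature="278.9" />'
case = ET.fromstring(line)

# Characters that carry actual data: the two attribute values.
data_chars = sum(len(value) for value in case.attrib.values())
# Everything else is tag name, attribute names, quotes, and punctuation.
markup_chars = len(line) - data_chars

print("data characters:  ", data_chars)
print("markup characters:", markup_chars)
```

For this line the markup accounts for more characters than the date and temperature values combined.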
These issues can be particularly acute for scientific data sets, where the structure of the data may be quite straightforward. For example, geographic data sets containing many observations at fixed locations naturally form a 3-dimensional array of values, which can be represented very simply and efficiently in a plain text or binary format. In such cases, having highly repetitive XML tags around all values can be very inefficient indeed.
The verbosity of XML is also a problem for entering data into an XML format. It is just too laborious for a human to enter all of the tags by hand, so, in practice, it is only sensible to have a computer generate XML documents.
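As a sketch of what having a computer generate the XML might look like, the following Python code turns a few lines of the plain text temperature data into case elements. It assumes every data line has the layout shown in Figure 7.7 (date first, temperature last); the function name lines_to_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

# A few raw data lines in the plain text layout of Figure 7.7.
lines = [
    "16-JAN-1994 00 / 1: 278.9",
    "16-FEB-1994 00 / 2: 280.0",
    "16-MAR-1994 00 / 3: 278.9",
]


def lines_to_xml(data_lines):
    """Build a temperatures element with one case element per data line."""
    root = ET.Element("temperatures")
    for line in data_lines:
        fields = line.split()
        # Assumption: the date is the first field, the temperature the last.
        ET.SubElement(root, "case", date=fields[0], temperature=fields[-1])
    return root


root = lines_to_xml(lines)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

No tags are typed by hand, and the library takes care of quoting attribute values correctly.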
It should also be acknowledged that the additional sophistication of XML creates additional costs. Users have to be more educated and the software has to be more complex (which makes compatible software more rare).
The fact that computers can read XML easily and effectively, plus the fact that computers can produce XML rapidly (verbosity is less of an issue for a computer), means that XML is an excellent format for transferring information between different software programs. XML is a good language for computers to use to talk to each other, with the added bonus that humans can still easily eavesdrop on the conversation.
As mentioned in the previous section, one of the advantages of XML is that the computer can perform checks on the correctness of an XML document, which provides at least some checks on the correctness of the data that are stored in the document (also see Section 7.6.6). This section provides a bit more information on the basic rules that an XML document must obey.
<?xml version="1.0"?>
As with HTML, the characters <, >, and & (among others) are special and must be replaced with special escape sequences, &lt;, &gt;, and &amp; respectively.
These escape sequences can be very inconvenient when storing data values, so it is also possible to mark an entire section of an XML document as “plain text” by placing it within a CDATA section, as follows:
<myxmlelement>
    <![CDATA[
        Lots of "<"s, ">"s, "&"s, "'"s and """s
    ]]>
</myxmlelement>
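Both mechanisms can be tried out with Python's standard library. In this sketch, xml.sax.saxutils.escape produces the escape sequences for <, >, and &, and the parser turns them back into the original characters; a CDATA section is read back as ordinary text. The element name value and the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Escaping: special characters are replaced before the text goes into XML.
raw = 'a < b & c > d'
escaped = escape(raw)
print(escaped)  # a &lt; b &amp; c &gt; d

# Parsing reverses the escaping, so the original value is recovered.
element = ET.fromstring("<value>" + escaped + "</value>")
print(element.text)

# CDATA: the special characters can be written as they are.
cdata_doc = '<myxmlelement><![CDATA[Lots of "<"s, ">"s and "&"s]]></myxmlelement>'
cdata_text = ET.fromstring(cdata_doc).text
print(cdata_text)
```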
Though it is important to understand why XML documents are used, designing an XML document may be a rare event for a scientist. Nevertheless, we have something to gain from a brief consideration of how data might be organised in an XML file.
The first point is that there are many ways that a data set could be stored within an XML document. XML is actually a meta-language; it is a language for defining languages. In other words, we get to decide the structure for an XML document and XML is a language for describing the structure that we choose.
So what structure should we choose? We will look at some issues that might influence the design of an XML document. These will be useful in understanding why an XML document that you encounter has a particular structure and they will be useful as an introduction to similar ideas that we will encounter when we discuss relational databases (Section 7.9).
The first XML design issue is to make sure that each value within a data set can be clearly identified. In other words, it should be trivial for a computer to extract each individual value. This means that all values should be either the content of an element or the value of an attribute. The XML document shown in Figure 7.7 demonstrates this idea.
Figure 7.8 shows two other possible XML representations of the Pacific Pole of Inaccessibility temperature data. The example at the top of the figure demonstrates that it is very easy to create an XML document that follows the rules of XML but provides no benefits over the original plain text format.
<?xml version="1.0"?>
<temperatures>
VARIABLE : Mean TS from clear sky composite (kelvin)
FILENAME : ISCCPMonthly_avg.nc
FILEPATH : /usr/local/fer_dsets/data/
SUBSET : 93 points (TIME)
LONGITUDE: 123.8W(-123.8)
LATITUDE : 48.8S
123.8W
23
16-JAN-1994 00 / 1: 278.9
16-FEB-1994 00 / 2: 280.0
16-MAR-1994 00 / 3: 278.9
16-APR-1994 00 / 4: 278.9
16-MAY-1994 00 / 5: 277.8
16-JUN-1994 00 / 6: 276.1
...
</temperatures>

<?xml version="1.0"?>
<temperatures>
    <variable>Mean TS from clear sky composite (kelvin)</variable>
    <filename>ISCCPMonthly_avg.nc</filename>
    <filepath>/usr/local/fer_dsets/data/</filepath>
    <subset>93 points (TIME)</subset>
    <longitude>123.8W(-123.8)</longitude>
    <latitude>48.8S</latitude>
    <cases>
16-JAN-1994 00 / 1: 278.9
16-FEB-1994 00 / 2: 280.0
16-MAR-1994 00 / 3: 278.9
16-APR-1994 00 / 4: 278.9
16-MAY-1994 00 / 5: 277.8
16-JUN-1994 00 / 6: 276.1
...
    </cases>
</temperatures>
The example at the bottom of Figure 7.8 is more interesting. In this case, the irregular and one-off metadata values are individually identified within elements or attributes, but the regular and repetitive raw data values are not. This is not ideal from the point of view of the file being self-describing, but it may be a viable option when the raw data values have a very simple format (e.g., comma-delimited) and the data set is very large (so avoiding lengthy tags and attribute names is a major saving).
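The trade-off in this hybrid design shows up when reading the file. In the Python sketch below, the metadata is retrieved by element name, but the raw data has to be split apart by hand, using knowledge of the line layout that the XML itself does not provide. The shortened document and the assumption that the date is the first field and the temperature the last are for illustration only.

```python
import xml.etree.ElementTree as ET

# A shortened version of the hybrid layout: metadata in elements,
# raw data as a block of plain text inside a single cases element.
doc = """<temperatures>
    <latitude>48.8S</latitude>
    <cases>
16-JAN-1994 00 / 1: 278.9
16-FEB-1994 00 / 2: 280.0
    </cases>
</temperatures>"""

root = ET.fromstring(doc)

# The metadata is self-describing: we ask for it by name.
latitude = root.findtext("latitude")

# The raw data is not: we must know the layout of each line in advance.
cases = []
for line in root.findtext("cases").splitlines():
    fields = line.split()
    if fields:  # skip the blank lines around the data block
        cases.append((fields[0], float(fields[-1])))

print(latitude)
print(cases)
```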
When a data set has a non-rectangular structure, such as the family tree in Figure 7.4, an XML document can be designed to store the information more efficiently and more appropriately. The main idea here is to avoid repeating values.
When presented with a data set, the following questions should guide the design of the XML format:
The rule of thumb is then to have an element for each object in the data set (and a different type of element for each different type of object) and then have an attribute for each measurement in the data set. Simple relationships between objects can sometimes be expressed by nesting elements.
For example, in the family tree data set, there are obviously measurements taken on people, those measurements being names and ages and genders. We could distinguish between parent objects and child objects, so we have elements like:
<parent gender="female" name="Julia" age="32" />
<child gender="male" name="Jack" age="6" />
There are two distinct families of people, so we could have elements to represent the different families and nest the relevant people within the appropriate family element to represent membership of a family.
<family>
    <parent gender="male" name="John" age="33" />
    <parent gender="female" name="Julia" age="32" />
    <child gender="male" name="Jack" age="6" />
    <child gender="female" name="Jill" age="4" />
    <child gender="male" name="John jnr" age="2" />
</family>
<family>
    <parent gender="male" name="David" age="45" />
    <parent gender="female" name="Debbie" age="42" />
    <child gender="male" name="Donald" age="16" />
    <child gender="female" name="Dianne" age="12" />
</family>
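The move from the repetitive plain text rows to this nested design can be sketched in Python. The code below groups rows that share the same parents into one family element, so each parent is written only once. The families root element and the function names are assumptions made for this example, and the column order (name, age, gender for each person) is taken from the plain text version of the family tree data.

```python
import csv
import xml.etree.ElementTree as ET

# The flat rows: father, mother, and child repeated on every line.
rows = [
    "John,33,male,Julia,32,female,Jack,6,male",
    "John,33,male,Julia,32,female,Jill,4,female",
    "John,33,male,Julia,32,female,John jnr,2,male",
    "David,45,male,Debbie,42,female,Donald,16,male",
    "David,45,male,Debbie,42,female,Dianne,12,female",
]


def add_person(family, tag, name, age, gender):
    """One element per person, one attribute per measurement."""
    ET.SubElement(family, tag, gender=gender, name=name, age=age)


def nest(flat_rows):
    """Group rows with identical parents into a single family element."""
    root = ET.Element("families")  # assumed wrapper element
    families = {}
    for row in csv.reader(flat_rows):
        key = tuple(row[:6])  # the six parent fields identify a family
        if key not in families:
            family = ET.SubElement(root, "family")
            add_person(family, "parent", *row[0:3])
            add_person(family, "parent", *row[3:6])
            families[key] = family
        add_person(families[key], "child", *row[6:9])
    return root


root = nest(rows)
print(ET.tostring(root, encoding="unicode"))
```

Five rows with fifteen person records become two families with nine person elements, because the repeated parent values are stored once per family.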
As demonstrated so far, the easiest solution is to store all measurements as the values of attributes. However, this is not always possible or appropriate. A measurement may have to be stored as the content of a separate element in the following cases:
We have already discussed the fact that an XML document can provide some checks that a data set is correct because the XML document must obey the basic rules of XML (elements must nest, attribute values must be surrounded by quotes, etc). While this sort of checking is better than nothing (as in plain text files), the checks are very basic. It is much more useful to be able to perform more advanced checks such as whether necessary data values are included in a document, whether elements contain the correct sort of data value, and so on. With a little more work, XML provides these more advanced checking features as well.
The way that this extra information can be specified is by creating a schema for an XML document, which is a description of the structure of the document. A number of technologies exist for specifying XML schema, but we will focus on the Document Type Definition (DTD) language.
A DTD is a set of rules for an XML document. It contains element type declarations that describe what elements are permitted within the XML document, in what order, and how they may be nested within each other. The DTD also contains attribute list declarations that describe what attributes an element can have, whether attributes are optional or not, and what sort of values may be specified for each attribute.
Figure 7.9 shows the temperature data at Point Nemo in an XML format (this is a reproduction of Figure 7.7 for convenience).
<?xml version="1.0"?>
<temperatures>
    <variable>Mean TS from clear sky composite (kelvin)</variable>
    <filename>ISCCPMonthly_avg.nc</filename>
    <filepath>/usr/local/fer_dsets/data/</filepath>
    <subset>93 points (TIME)</subset>
    <longitude>123.8W(-123.8)</longitude>
    <latitude>48.8S</latitude>
    <case date="16-JAN-1994" temperature="278.9" />
    <case date="16-FEB-1994" temperature="280" />
    <case date="16-MAR-1994" temperature="278.9" />
    <case date="16-APR-1994" temperature="278.9" />
    <case date="16-MAY-1994" temperature="277.8" />
    <case date="16-JUN-1994" temperature="276.1" />
    ...
</temperatures>
The structure of this XML document is as follows: there is a single overall temperatures element that contains all other elements. There are several elements containing various sorts of metadata: a variable element containing a description of the variable that has been measured; a filename element and a filepath element containing information about the file from which these data were extracted; and three elements, subset, longitude, and latitude, that together describe the temporal and spatial limits of this subset of the original data. Finally, there are a number of case elements that contain the raw temperature data; each element contains a temperature measurement and the date of the measurement.
A DTD describing this structure is shown in Figure 7.10.
<!ELEMENT temperatures (variable, filename, filepath, subset, longitude, latitude, case*)>
<!ELEMENT variable (#PCDATA)>
<!ELEMENT filename (#PCDATA)>
<!ELEMENT filepath (#PCDATA)>
<!ELEMENT subset (#PCDATA)>
<!ELEMENT longitude (#PCDATA)>
<!ELEMENT latitude (#PCDATA)>
<!ELEMENT case EMPTY>
<!ATTLIST case
    date ID #REQUIRED
    temperature CDATA #IMPLIED>
For each element, there has to be an <!ELEMENT > declaration. The simplest example is for case elements because they are empty (they have no content), as indicated by the keyword EMPTY. Most other elements are similarly straightforward because their contents are just text, as indicated by the #PCDATA keyword. The temperatures element is more complex because it can contain other elements. The specification given in Figure 7.10 states that six elements (variable to latitude) must be present, and they must occur in the given order. There may also be zero or more case elements (the * means “zero or more”).
The case elements also have attributes, so there is an <!ATTLIST> declaration as well. This says that there must (#REQUIRED) be a date attribute and that each date value must be unique (ID). The temperature attribute is optional (#IMPLIED) and, if it occurs, the value can be any text (CDATA).
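To make these rules concrete, the following Python sketch checks a document against them by hand. This is not a DTD validator (Python's standard xml.etree.ElementTree module does not validate against a DTD); it is a hand-written approximation of three of the rules described above: the six metadata elements in order, only case elements after them, and a required, unique date on each case. The function name check_temperatures and the message strings are invented for the example.

```python
import xml.etree.ElementTree as ET

# The six metadata elements, in the order required by the DTD.
EXPECTED_METADATA = ["variable", "filename", "filepath",
                     "subset", "longitude", "latitude"]

valid_doc = """<temperatures>
    <variable>Mean TS from clear sky composite (kelvin)</variable>
    <filename>ISCCPMonthly_avg.nc</filename>
    <filepath>/usr/local/fer_dsets/data/</filepath>
    <subset>93 points (TIME)</subset>
    <longitude>123.8W(-123.8)</longitude>
    <latitude>48.8S</latitude>
    <case date="16-JAN-1994" temperature="278.9" />
    <case date="16-FEB-1994" temperature="280" />
</temperatures>"""


def check_temperatures(xml_text):
    """Return a list of problems; an empty list means no rule was broken."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "temperatures":
        problems.append("root element must be temperatures")
    tags = [child.tag for child in root]
    n = len(EXPECTED_METADATA)
    if tags[:n] != EXPECTED_METADATA:
        problems.append("metadata elements missing or out of order")
    if any(tag != "case" for tag in tags[n:]):
        problems.append("only case elements may follow latitude")
    seen = set()
    for case in root.findall("case"):
        date = case.get("date")
        if date is None:
            problems.append("case without required date attribute")
        elif date in seen:
            problems.append("duplicate date: " + date)
        else:
            seen.add(date)
        # temperature is #IMPLIED (optional), so its absence is not a problem.
    return problems


print(check_temperatures(valid_doc))
```

A real validating parser applies every rule in the DTD automatically, which is the point of writing the rules down once in a schema rather than in code like this.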
The rules given in a DTD are associated with an XML document using a Document Type Declaration as the second line of the XML document. This can have one of two forms:
The DRY principle suggests that an external DTD is by far the most sensible approach.
If an XML document is well-formed--it obeys the basic rules of XML syntax--and it obeys the rules given in a DTD, then the document is said to be valid.
The use of a DTD has some shortcomings, such as a lack of support for specifying the data type of attribute values or the contents of elements, plus the fact that the DTD language is completely different from XML. XML Schema is an XML-based technology that solves both of those problems, but it comes at the cost of much greater complexity. The latter problem has led to the development of further technologies that simplify the XML Schema syntax, such as Relax NG. The interested reader is referred to Section 8.3 for further pointers.
Another way to represent relationships between objects (elements) in an XML document is to use the special ID and IDREF (or IDREFS) attributes.
<family id="family1" />
<family id="family2" />
<parent gender="male" name="John" age="33" family="family1" />
<parent gender="female" name="Julia" age="32" family="family1" />
<child gender="male" name="Jack" age="6" family="family1" />
<child gender="female" name="Jill" age="4" family="family1" />
<child gender="male" name="John jnr" age="2" family="family1" />
<parent gender="male" name="David" age="45" family="family2" />
<parent gender="female" name="Debbie" age="42" family="family2" />
<child gender="male" name="Donald" age="16" family="family2" />
<child gender="female" name="Dianne" age="12" family="family2" />
When these attributes are used, additional data integrity checks can be enforced, because the DTD rules state that an IDREF attribute must have a value that matches the value of an ID attribute somewhere within the same XML document.
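The integrity check that a validating parser would perform on these references can be imitated by hand in Python. The sketch below assumes that the id attribute of family is declared as ID and the family attribute of parent and child as IDREF; the people root element, the function name dangling_family_refs, and the deliberately broken family3 reference are all made up for the example.

```python
import xml.etree.ElementTree as ET

# The <people> root is an assumed wrapper. The last child refers to
# a family that does not exist, to show what the check would catch.
doc = """<people>
    <family id="family1" />
    <family id="family2" />
    <parent gender="male" name="John" age="33" family="family1" />
    <child gender="male" name="Jack" age="6" family="family1" />
    <parent gender="male" name="David" age="45" family="family2" />
    <child gender="female" name="Dianne" age="12" family="family3" />
</people>"""


def dangling_family_refs(xml_text):
    """Return names of people whose family value matches no family id."""
    root = ET.fromstring(xml_text)
    ids = {family.get("id") for family in root.findall("family")}
    return [person.get("name")
            for person in root
            if person.tag in ("parent", "child")
            and person.get("family") not in ids]


print(dangling_family_refs(doc))
```

With a DTD that declares these attributes as ID and IDREF, a validating parser reports the same kind of broken reference without any custom code.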
This sort of design problem is discussed in more detail in Section 7.9.5 on database design.
Paul Murrell
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.