7.7 Binary files

As we saw in Section 7.4, all electronic information, regardless of the format, is ultimately stored in binary form, as a series of bits (zeroes and ones). However, the same value can be recorded in binary in a number of different ways.

For example, given the number 12345, we could store it as individual characters 1, 2, 3, 4, and 5, using one byte for each character:

00110001 00110010 00110011 00110100 00110101

Alternatively, we could store the number as a four-byte integer (see Section 7.4.3), shown here with the least significant byte first:

00111001 00110000 00000000 00000000
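The two bit patterns above can be reproduced with a short Python sketch. The helper name `bits` is made up for illustration, and the little-endian byte order (`"<i"`) is an assumption chosen because it matches the four-byte pattern shown.

```python
import struct


def bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


# The number stored as five one-byte characters.
as_text = "12345".encode("ascii")

# The same number stored as a four-byte integer, least significant byte first.
as_int = struct.pack("<i", 12345)

print(bits(as_text))  # 00110001 00110010 00110011 00110100 00110101
print(bits(as_int))   # 00111001 00110000 00000000 00000000
```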

When we store information as individual one-byte characters, the result is a plain text file. This tends to be less efficient because it usually consumes more memory (five bytes rather than four in the example above), but it has the advantage that the file has a very simple structure. It is therefore simple to write software to read the file, because each byte just needs to be converted to a character. There may be problems determining data values from the individual characters (see Section 7.5), but the process of reading the basic unit of information (a character) from the file is straightforward.

For the purposes of this book, a binary format is just any format that is not plain text.

The characteristic feature of a binary format is that there is no simple rule for determining how many bits or how many bytes constitute a basic unit of information. Given a series of, say, four bytes, we cannot assume that these correspond to four characters, or a single four-byte integer, or half of an eight-byte floating-point value (see Section 7.4.3). A description of the rules for the format is necessary (we will look at one example soon), stating what information is stored and how many bits or bytes are used for each piece of information.
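To make the ambiguity concrete, the following Python sketch reads the same four bytes in three different ways. The byte values, the variable names, and the little-endian byte order are all assumptions made for the example; nothing in the bytes themselves says which reading is the intended one.

```python
import struct

raw = b"1234"  # four arbitrary bytes

# Reading 1: four one-byte characters.
as_chars = raw.decode("ascii")

# Reading 2: a single four-byte integer (little-endian assumed).
(as_int,) = struct.unpack("<i", raw)

# Reading 3: the first half of an eight-byte floating-point value.
# The other four bytes have to come from somewhere else in the file.
next_four = b"5678"
(as_double,) = struct.unpack("<d", raw + next_four)

print(as_chars)   # 1234
print(as_int)     # 875770417
print(as_double)  # some floating-point value unrelated to 1234
```

Only a description of the format can say which of these readings is correct.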

Consequently, it is much harder to write software for binary formats, so less software is available to do the job.

However, some binary formats are easier to read than others. Given that a description is necessary to have any chance of reading a binary file, proprietary formats, where the file format description is kept private, are extremely difficult to deal with. Open standards become more important than ever.

7.7.1 Binary file structure

One of the advantages of binary files is that they are more efficient.

In terms of memory, storing values in numeric formats such as IEEE 754, rather than as text characters, tends to take up less space.
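As a rough illustration, the following Python sketch stores the same set of numbers both ways and compares the sizes. The sample values are invented, and the text version assumes one value per line written at full precision.

```python
import struct

values = [i / 7 for i in range(1000)]  # made-up sample data

# Plain text: one value per line, each digit stored as a one-byte character.
as_text = "\n".join(repr(v) for v in values).encode("ascii")

# Binary: every value stored as an eight-byte IEEE 754 double.
as_binary = struct.pack(f"<{len(values)}d", *values)

print(len(as_text), "bytes as text")
print(len(as_binary), "bytes as binary")  # 8000 bytes as binary
```

The result depends on the data: a file of single-digit integers would be smaller as text, which is why binary storage only tends to use less memory.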

Binary formats also offer advantages in terms of speed of access. While the basic unit of information is very straightforward in a plain text file (one byte equals one character), finding the actual data values is often much harder. For example, in order to find the third data value on the tenth row of a CSV file, the reader software must keep reading bytes until nine end-of-line characters have been found and then two delimiter characters have been found. This means that, with text files, it is usually necessary to read the file from the start, possibly all of it, in order to find any particular value.
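The scan described above might look like the following Python sketch. The function name `find_csv_value` and the sample data are hypothetical, and the code assumes a well-formed file with no quoted fields and a row and column that actually exist.

```python
from typing import Tuple


def find_csv_value(data: bytes, row: int, col: int) -> Tuple[str, int]:
    """Return the value at 1-based (row, col) and how many bytes were read."""
    pos = 0
    # Skip whole rows: count end-of-line characters.
    newlines_needed = row - 1
    while newlines_needed > 0:
        if data[pos] == ord("\n"):
            newlines_needed -= 1
        pos += 1
    # Skip values within the row: count delimiter characters.
    delims_needed = col - 1
    while delims_needed > 0:
        if data[pos] == ord(","):
            delims_needed -= 1
        pos += 1
    # Read the value itself, up to the next delimiter or end of line.
    start = pos
    while pos < len(data) and data[pos] not in b",\n":
        pos += 1
    return data[start:pos].decode("ascii"), pos


# Made-up CSV content: 12 rows of 4 values each.
rows = [",".join(f"r{r}c{c}" for c in range(1, 5)) for r in range(1, 13)]
csv_bytes = ("\n".join(rows) + "\n").encode("ascii")

value, bytes_read = find_csv_value(csv_bytes, row=10, col=3)
print(value, bytes_read)
```

Every byte before the requested value has to be examined, so the cost grows with how far into the file the value lies.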

For binary formats, some sort of format description, or map, is required to be able to find the location (and meaning) of any value in the file. However, the advantage of having such a map is that any value within the file can be found without having to read the entire file.

As a typical example, a standard feature of binary files is the inclusion of some sort of header, both for the overall file and for subsections within the file. The header contains information such as the byte location within the file where a set of values begins (a pointer), the number of bytes used for each data value (the data size), and the number of data values. It is then very simple to find, for example, the third data value within a set of values, which is located at: pointer + 2 x size.
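The pointer arithmetic can be sketched in Python as follows. The header layout here (three four-byte unsigned integers holding the pointer, the data size, and the number of values) is invented for illustration and does not correspond to any real file format; the names `HEADER`, `build_file`, and `read_value` are likewise hypothetical.

```python
import io
import struct

# Hypothetical header: pointer, data size, number of values.
HEADER = struct.Struct("<III")


def build_file(values):
    """Build an in-memory 'file': a header followed by eight-byte doubles."""
    size = 8
    pointer = HEADER.size  # the values start straight after the header
    header = HEADER.pack(pointer, size, len(values))
    body = b"".join(struct.pack("<d", v) for v in values)
    return header + body


def read_value(f, index):
    """Jump directly to the value at 0-based index without reading the rest."""
    f.seek(0)
    pointer, size, count = HEADER.unpack(f.read(HEADER.size))
    if not 0 <= index < count:
        raise IndexError("value index out of range")
    f.seek(pointer + index * size)
    return struct.unpack("<d", f.read(size))[0]


f = io.BytesIO(build_file([1.5, 2.5, 3.5, 4.5]))
third = read_value(f, 2)  # third value lives at pointer + 2 * size
print(third)              # 3.5
```

Only the header and the eight bytes of the requested value are read, however many values the file contains.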

More information is required in order to locate values within a binary format, but once that information is available, navigation within the file is faster and more flexible.

7.7.2 NetCDF

Paul Murrell

Creative Commons License
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.