Chapter 4. Digital Signals

Table of Contents

4.1. Transduction and Digitisation
4.2. Manipulating Digital Signals
4.3. Exercises
4.4. Additional Reading

4.1. Transduction and Digitisation

Much of this unit will be concerned with the manipulation of speech signals by digital computers. The power of modern desktop computers means that advanced signal processing and speech analysis software can run on the average computer. In order to take advantage of this, and in order to understand the properties and limitations of digital signals, we need to look at how a speech signal is converted into digital form.

4.1.1. Transduction

An acoustic signal is transmitted via the motion of the molecules of the air or other medium through which it passes. As we saw earlier the signal takes the form of changes in pressure over time and we characterise the signal by measuring this pressure as a function of time. Pressure is an analog quantity in that it varies continuously and can take on an infinite number of different values.

Although sound starts its life as changes in air pressure, it can also be transmitted in different ways, for example via a microphone and telephone cable to a remote listener. The process by which sound changes from being a pattern of vibration in the air to a pattern of electrical energy in the telephone cable is known as transduction. Devices which change the form of a sound signal, in this case the microphone and loudspeaker, are called transducers. The most interesting class of transducers for us are those that turn sound pressure waves into patterns of electrical energy. We can measure the electrical energy in the same way that we measure pressure to get a picture of the pattern of variation within the acoustic signal. Electrical energy is again an analog quantity and so we now need some way of changing this into the digital domain of modern computers.

4.1.2. Digitisation

Digital computers work at their lowest level with binary or base 2 numbers. In order to analyse and manipulate speech signals in the computer we need to convert the continuously varying signal into a series of discrete numbers; this process is known as digitisation. The basic digitisation hardware is known as an analog-to-digital or A/D converter. It takes snapshots of an input analog signal at regular intervals and, for each snapshot, outputs the number which is closest to the measured magnitude. Hence if the converter sees 1.376 volts it might output a value of 1.4 volts -- the closest 'round number' to the input measurement. The questions of importance when talking about digitisation are the rate at which snapshots are taken and the size of the smallest change that can be detected in the input signal.
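The snapshot-and-round behaviour of an A/D converter can be sketched in a few lines of Python. This is only an illustration: the function name digitise, its parameters, and the choice of a 0.1 volt 'round number' step are assumptions made for the example, not a description of any real converter.

```python
def digitise(signal, sample_rate, duration, step):
    """Take snapshots of `signal` at regular intervals and round each one.

    signal      -- a function of time (seconds) returning a voltage
    sample_rate -- snapshots per second (assumed value, chosen by the caller)
    duration    -- how long to record, in seconds
    step        -- size of the 'round number' grid, in volts (assumed)
    """
    n_samples = int(round(duration * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                 # time of the n-th snapshot
        value = signal(t)                   # the analog measurement
        samples.append(round(value / step) * step)  # nearest grid value
    return samples


def steady_input(t):
    """A constant 1.376 volt input, as in the example in the text."""
    return 1.376


# Three snapshots of the steady input; each comes out close to 1.4 volts
# (floating point arithmetic may show a tiny rounding error).
print(digitise(steady_input, sample_rate=10, duration=0.3, step=0.1))
```

The two parameters sample_rate and step correspond to the two questions raised above: how often snapshots are taken, and how small a change in the input can be detected.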

4.1.2.1. Sampling Frequency

It should be clear that taking a series of snapshots of a signal can only capture an approximation of the original. When a movie or TV camera captures thirty or so frames per second we rely on persistence of vision to transform the result into a continuous moving image when played back. Even so, any event that happens between snapshots or any changes that happen too quickly will not be reproduced properly: the classic example is the wheels on covered wagons in John Wayne movies which appear to be going backwards. The same problem arises in digitising audio signals and so we must be careful to capture at least the important information in the signal; to do so we need to understand what the limitations are.

The sampling frequency or sampling rate is a measure of the number of snapshots taken from the signal each second. Sampling frequency is measured in Hertz, just like any other frequency measure.

If we consider sampling a sinusoid signal we can see what the limits of the process are. Figure 5.2 from Harrington and Cassidy (reproduced here) shows two sinusoids with different frequencies sampled at the same sampling frequency of 20Hz. In the bottom pane, a signal with a frequency of 5Hz is sampled at each peak and trough and once close to the midpoint between them: four samples for each cycle in total. The result is a version of the original which retains at least its frequency and amplitude if not the exact shape. The upper pane shows a 15Hz signal and we can see that because the sampling frequency is too low, most of the peaks and troughs are missed; as it happens the sampled signal matches exactly that from the 5Hz example, so the original 15Hz signal would be reproduced as a 5Hz signal if a sampling rate of 20Hz was used. This phenomenon is known as aliasing; the higher frequency signal is said to be aliased onto the lower frequency signal.

Figure 4.1. Sampling two sinusoids. Taken from Harrington and Cassidy, figure 5.2, p 134.
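The effect shown in the figure can be checked numerically. The sketch below assumes a sampling frequency of 20Hz, which is the rate that gives four samples per cycle of the 5Hz signal. The function name sample_cosine is made up for this example, and cosines are used so that the samples fall exactly on the peaks, troughs and midpoints.

```python
import math


def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Return n_samples snapshots of a cosine of the given frequency."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


SAMPLE_RATE = 20  # Hz, assumed: four samples per cycle of the 5Hz signal

low = sample_cosine(5, SAMPLE_RATE, 8)    # two full cycles of the 5Hz signal
high = sample_cosine(15, SAMPLE_RATE, 8)  # the same time span at 15Hz

# Print the two sample sequences side by side; they agree at every snapshot.
for a, b in zip(low, high):
    print(f"{a:+.3f}  {b:+.3f}")
```

Once sampled, the two signals are indistinguishable, which is why the 15Hz signal is reproduced as a 5Hz one. With a different starting phase the alias is still a 5Hz sinusoid, but its phase differs from that of a genuine 5Hz signal.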

The absolute minimum number of samples per cycle needed to properly reproduce a sinusoid is two -- one at the peak, one at the trough. This will give a crude approximation to the original signal but will be able to capture the frequency and amplitude. This means that the sampling frequency should be at least twice the frequency of the sinusoid being digitised. Put the other way round, the highest frequency that can be captured is half the sampling frequency; this limit is known as the Nyquist frequency.

Since we know that any complex signal can be thought of as the summation of a number of sinusoids, the above result can be stated more generally. The sampling frequency should be at least twice the frequency of the highest frequency of interest in the input signal. The telephone system uses a sampling frequency of 8000Hz and so can capture only information up to 4000Hz. In studying speech recorded in quiet conditions we often use a sampling frequency of 20000Hz which gives information up to 10000Hz.

From the earlier example we saw that signals above the Nyquist frequency will be aliased so that they look like signals below it. In fact the Nyquist frequency (FN) acts like a mirror such that any sinusoid with a frequency of FN + x will appear as a sinusoid of frequency FN - x. For complex signals, this has the consequence that any information above the Nyquist frequency will reflect back onto the lower frequencies and contaminate the result. In order to reproduce the information below the Nyquist frequency accurately, the higher frequency information must be removed before the signal is digitised. The device which does this is a low-pass filter, often called an anti-aliasing filter.
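The mirror rule can be written as a small calculation. In the sketch below the function names nyquist_frequency and alias_frequency are invented for the example. The text only describes the reflection for frequencies between FN and twice FN; the handling of still higher frequencies (folding repeatedly) is an added assumption based on the standard behaviour of sampled sinusoids.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be captured: half the sampling frequency."""
    return sample_rate_hz / 2


def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency at which a sinusoid of freq_hz appears after sampling.

    Frequencies at or below the Nyquist frequency are unchanged.
    A frequency of FN + x (with x below FN) is mirrored to FN - x.
    Frequencies beyond the sampling rate are first wrapped (assumption).
    """
    fn = nyquist_frequency(sample_rate_hz)
    folded = freq_hz % sample_rate_hz      # wrap into 0 .. sampling rate
    if folded > fn:
        folded = sample_rate_hz - folded   # reflect about the Nyquist frequency
    return folded


print(nyquist_frequency(8000))      # telephone system: 4000.0
print(nyquist_frequency(20000))     # quiet-condition speech: 10000.0
print(alias_frequency(15, 20))      # the 15Hz sinusoid appears at 5Hz
print(alias_frequency(4500, 8000))  # FN + 500 appears at FN - 500 = 3500
```

An anti-aliasing filter removes the components for which alias_frequency would return something other than the input frequency, so that they cannot contaminate the band below FN.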

4.1.2.2. Quantisation

We are used to using a base ten number scheme but modern computers use binary numbers for various reasons. Binary numbers consist only of ones and zeros which is useful as these can be stored and transmitted as the presence or absence of a physical property, like electrical charge or a high frequency tone. Each binary digit is called a bit and all information inside a computer is stored as strings of bits.

A binary number can be decoded by adding up powers of two; each place in the number corresponds to a higher power of two, starting from the rightmost position which is 2^0 = 1, followed by 2^1 = 2, then 4, 8, 16 etc. So the binary number '1011' is 1*8 + 0*4 + 1*2 + 1*1 = 11. Decoding these numbers by hand can be longwinded but the process is easy to understand. Any decimal integer can be rewritten as a series of binary digits.
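The decoding procedure is easy to express as code. The sketch below is for illustration only; the function names decode_binary and encode_binary are made up here, and Python's built-in int(bits, 2) and bin(value) do the same job.

```python
def decode_binary(bits):
    """Add up powers of two, starting from the rightmost digit (2**0)."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        if digit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {digit!r}")
        total += int(digit) * 2 ** position
    return total


def encode_binary(value):
    """Rewrite a non-negative decimal integer as a string of binary digits."""
    if value < 0:
        raise ValueError("only non-negative integers are handled here")
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        digits.append(str(value % 2))  # remainder gives the next digit
        value //= 2                    # move on to the next power of two
    return "".join(reversed(digits))


print(decode_binary("1011"))  # 1*8 + 0*4 + 1*2 + 1*1 = 11
print(encode_binary(11))      # 1011
```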

When we digitise an acoustic signal it is turned into a sequence of binary numbers by the analog-to-digital hardware. There is an important consequence of this process that we need to understand: that these devices use a fixed number of binary digits to represent each sample and hence that the size of the smallest change that can be detected in the input is related to the number of bits used.

Analog to digital hardware uses a fixed sample size to represent the sampled acoustic signal; typically 12 or 16 bits are used per sample. A little arithmetic will tell us that 12 bits will give us a maximum of 2^12 = 4096 different numbers while 16 bits gives 2^16 = 65536 values. These numbers will be used to represent the different input voltages taken from the microphone. When the hardware measures the size of the input voltage from the microphone, instead of calculating a voltage value it merely assigns it a number on a scale of 0 to 4095 (for a 12 bit digitiser). Since our input signal consists of both positive and negative peaks (oscillations), the 0 point should correspond to the largest negative peak and the 4095 point to the largest positive peak. The output of the digitiser when the input is zero should be around 2048.

The outcome of this is that the continuously variable input signal is quantised into one out of 4096 values. If, for example, the range of the input were +3v to -3v then each step on the digitiser's scale would correspond to a change of 6/4096 = 0.0015v (approximately) in the input. Any change in the input smaller than this will not be captured. This is a similar consequence to that in the previous section where events which happen between samples cannot be resolved. The result of quantisation can be seen in Figure 5.4 on page 135 of Harrington and Cassidy (reproduced below). The sampled signal (dotted line) is an approximation of the original in both time and amplitude.
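A 12 bit quantiser for the +3v to -3v example can be sketched as follows. The function names step_size and quantise are invented for the example. Two details are assumptions rather than anything stated in the text: inputs are assigned to a step by rounding down, and inputs outside the range are clipped to the ends of the scale. Real hardware may differ on both. The codes run from 0 to 4095, since 12 bits give 4096 distinct values.

```python
def step_size(bits=12, v_min=-3.0, v_max=3.0):
    """Change in input voltage corresponding to one step of the scale."""
    return (v_max - v_min) / 2 ** bits


def quantise(voltage, bits=12, v_min=-3.0, v_max=3.0):
    """Map an input voltage onto an integer code between 0 and 2**bits - 1."""
    levels = 2 ** bits
    code = int((voltage - v_min) / step_size(bits, v_min, v_max))
    # Assumption: out-of-range inputs are clipped to the ends of the scale.
    return max(0, min(levels - 1, code))


print(step_size())       # 6/4096, roughly 0.0015 volts
print(quantise(-3.0))    # largest negative peak: 0
print(quantise(0.0))     # zero input: 2048
print(quantise(3.0))     # largest positive peak: 4095
print(quantise(0.001) == quantise(0.0))  # a change below one step is lost: True
```

Calling step_size(bits=16) shows how much finer the scale becomes when 16 bits are used for the same input range.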

A digitiser that uses 16 bits will be correspondingly more accurate than one which uses 12 bits. The compact disc standard uses 16 bit samples taken at a 44.1kHz sample rate. For speech analysis it is common to use 12 bit samples at around 20kHz for studio quality recordings or 8-10kHz for telephone or office environment recordings.