COMP449: Speech Recognition

Steve Cassidy

Department of Computing, Macquarie University


          Sydney
          Australia
          <Steve.Cassidy@mq.edu.au>
        


Table of Contents

1. Outline
2. Assignments and Assesment
I. Acoustics and Digital Signal Processing
II. Speech Recognition

You can get a single file version of this document here.

Chapter 1. Outline

This unit is intended to introduce you to the area of speech recognition. The overall goal is that you understand enough about the underlying processes in a recogniser and about the structure of spoken language that you can make effective use of ASR systems and perhaps contribute to improve their inner workings.

These notes are intended to guide you through the unit and will be suplemented by copies of papers and book chapters as appropriate. You might notice that in places they seem cobbled together which is not surprising since they are (from SLP801 and SLP806 offered last year in the MSc Speech and Language Processing).

Chapter 2. Assignments and Assesment

2.1. Assesment

The assesment for this unit will consist of a number of practical assignments and one written assignment.

  • Data collection and Gaussian classification (20%) You will collect some speech data (from yourself) and perform a small study on vowel classification using just one feature vector per vowel. You will compare some alternative feature types for this task. Due 15 April 2002

  • Speech Recognition (30%) you will work in teams of 2 to train a the Sphinx speech recognition system using Australian speech data. You will be assesed based on a report of the results of your work. Due end of semester.

  • Speaker Recognition (10%) You will develop a small speaker recogniser using data recorded from yourself and other students. Due ??.

  • Written Assignment (40%) You will write two reports on topics in speech recognition. Due: Weeks 6 & 13.

In order to avoid long delays processing marks, I will be unwilling to provide extensions except in genuine cases. If you wish to request an extension, please contact me before the deadline for an assignment. Penalties will be deducted for late submissions without good reason.

2.2. Data Collection and Gaussian Classification

The intention of this assignment is to introduce you to speech data collection and to give you first hand experience with the annotation of speech data. You will record a small collection of isolated words and digit strings spoken by yourself or another volunteer -- these recordings will be made directly into your computer. Once you have this data you will view the acoustic signal and spectrogram and then annotate the data with the start and end times of each phoneme.

To make the recordings you will need a computer with a soundcard and a microphone. To produce good quality recordings you should use a good quality headset microphone and adjust the recording level so that the peak amplitude is close to the maximum level but does not exceed it -- you want to avoid `clipping' of the signal but get as much detail as possible in the recordings.

To complete this assignment you'll need the Emu speech database system which you can download from here for either Unix, Mac or Windows. You should also get a copy of the R statistical package and the associated Emu library as described on this page. I'll give you some pointers for using this software as the term progresses.

To record the data, use the Emu Capture tool which should be installed by default on Windows and Mac and which is available from me for Unix. This tool takes a list of words to be recorded (one word per line) and a template skeleton and then prompts you for the words and saves the recordings and a template file to disk in the right place for Emu to find them. Once you've recorded your data you should be able to double click the template file (on Windows/Mac) to annotate the data with the Emu labeller. On Unix type emulabel digits to accomplish the same thing.

The set of words you record should include at least four repetitions of all the monophthongal vowels in your variety of English in an hVd context (eg. had, hard, etc). The goal for these recordings is to be able to describe your own vowel space and to compare this with those for standard accents such as Australian English or Southern British English. As an example, the following words can be used for Australian English:

hod   heard   hid   had    hard  
hood  heed    whod  head   hudd
   

When making these recordings, take care to randomise the position of the words in the list for each repitition and as far as possible avoid `list intonation'. Note that repitition means a separately prompted recording, not just saying the word four times at the prompt.

You should also record two repititions of the following digit strings:

9541
1249
3263
9874
6521
5169
6853
7268
5094
5320
4877
6687
5039
3300
2140
8790
1457
2121
3879
4806
   

Once you have recorded the data you need to construct an Emu database template for your data. This is a text file that describes the kinds of annotation you will make on the data, the location of the data and label files and various other configuration parameters. You will be labelling phoneme boundaries and grouping the phonemes into words. You will need the following two template files (one for isolated words and one for connected digits):

! template skeleton for COMP449 data collection -- isolated words
!  save as words.tpx in your Emu installation directory 

level Speaker
level Word       Speaker
level Phoneme    Word

label Speaker Sex
label Speaker Accent
label Speaker Set

labfile Phoneme :extension phn :time-factor 1000

! Here we define the legal label categories for various levels
! this gives us a menu in the labeller and we can use the
! category names in queries
!
legal Phoneme vowel A E I O V U ai ei oi i@ u@ au @u @: @ a: e: i: o: u:
legal Phoneme monophthong A E I O V U @: a: e: i: o: u:
legal Phoneme shortvowel A E I O V U 
legal Phoneme longvowel @: a: e: i: o: u:
legal Phoneme diphthong ai ei oi i@ u@ au @u 
legal Phoneme stop p tS dZ t k b d g 
legal Phoneme nasal m n N M
legal Phoneme fricative f v s z S Z h D D- T
legal Phoneme approximant w j l r 
legal Phoneme other H #

legal Set - hVd digits
legal Sex - M F 
! possible accent labels, you could add more if you don't fit one of these
legal Accent - Australian US Singapore Malaysia

! ************************************************************
! Define the location of files, this should be set to the right
! location by the capture script.
! ************************************************************
path hlb,phn,trg      %labels
path wav              %signals

! this tells Emu to look for sampled speech data in wav files
track samples     wav

set PrimaryExtension wav
    
! template skeleton for COMP449 data collection -- digit strings
!  save as digit.tpx in your Emu installation directory 

level Speaker
level Word       Speaker

label Speaker Sex
label Speaker Accent
label Speaker Set

labfile Word :extension wrd :time-factor 1000

legal Word - one two three four five six seven eight nine ten zero oh

legal Set - hVd digits
legal Sex - M F 
! possible accent labels, you could add more if you don't fit one of these
legal Accent - Australian US Singapore Malaysia

! ************************************************************
! Define the location of files, this should be set to the right
! location by the capture script.
! ************************************************************
path hlb,wrd          %labels
path wav              %signals

! this tells Emu to look for sampled speech data in wav files
track samples     wav

set PrimaryExtension wav

2.2.1. Annotation

When you have your database working you should annotate each recording. For the isolated hVd words you should mark in the boundaries of the phonemes in each word. For the connected digits you should mark in the start and end of the digit string and label it (at the Word level) with the numerical string being spoken. You will mark these on the levels shown in the signal view of the Emu labeller. For some notes on using the labeller for this task see Section 4.1 of the Interactive Worksheets on the CDROM with Harrington and Cassidy and the relevant sections of the Emu manual.

2.2.2. Analysis and Presentation

Our goal is to plot the positions of the first two formants for each vowel in your hVd database so that you can compare them with other data. Unfortunately, I can't make an automatic formant tracker available to you, we only have a commercial one which runs on Unix systems. The easiest solution is for you to take measurements of the formant frequencies by hand from the spectrograms displayed in Emulabel. I realise that this is a very backward methodology and makes little use of the Emu software that you have learned to use in SLP802 but it really is the only workable solution at the moment.

  • Measure F1 and F2 at the vowel target for each of the vowels in your isolated word set. This should give you two measurements per vowel (from the two repetitions).

  • Generate plots of your data on the F1/F2 plane.

  • Present your results as a web page, include links to download your speech data (pack it up as a zip file), label files and the template file used for Emu.

2.2.3. Gaussian Classification

The aim of this exercise is to look at the use of distance measures and Gaussian models in classifying speech data according to various feature sets. We will study vowel data taken from the ANDOSL isolated word set that is included on the Harrrington and Cassidy CDROM. The goal of the study is to compare the utility of formants, bark scaled spectra and cepstra in classifying these vowels.

The result of your analysis will be in the form of a report on the experiment; it should include a method section, a description of the different experiments you carried out and a presentation and discussion of your results.

The data for analysis is in this zip file (235k), the file contains the following files:

  • short.seg - a segment list of 860 short vowels A E I O U V from all speakers in the isolated database.

  • short.info - a list of the speaker identifiers and their sex for each vowel, speakers are db, dw, jc, mb, nd (male) and dp, jn, nb, ru, vm (female). Sex is coded as m or f.

  • short-fft.dat - a 512 point fourier transform taken at the mid point of each vowel. The signal was pre-emphasised before the fft was carried out. This data is not intended for a direct classification experiment, instead it was used to generate the cepstral data and could be used to generate other parameters if you wish. This file is not included in the above zip file but is available seperately to download if you wish as short-fft.zip (1.2M).

  • short-fm.dat - the first four formants, automatically tracked using the ESPS formant tracker.

  • short-bark.dat - 22 bark scaled spectral bands averaged over the duration of each vowel. The signal was pre-emphasised before the analysis was carried out.

  • short-cep.dat - 40 cepstral coefficients generated from the fft data in short-fft.dat.

You will use R and Emu to analyse this data, the instructions below give you more or less everything you'll need to carry out the experiments. See me for help getting everything installed etc. Once you have R running you need to load the Emu library with:

library(emu)
   

After which you should see the Emu version number etc to verify that it has loaded ok.

Each of the data files is a text file with one row per vowel and different numbers of columns depending on the size of the feature vector. To read them into R you can use the matscan function as follows:

short.fm <- matscan( "short-fm.dat", 4 )
short.fft <- matscan( "short-fft.dat", 512 )

Where the second argument to matscan is the width of the data. matscan is part of the Emu library so you will need to load that first.

You can read the segment list into R using the command:

short.seg <- read.segs( "short.seg" )
    

This is just the result of a query to the database and has the same form as the result of emu.query.

You can read the speaker information using the following command:

short.info <- read.table("short.info", col.names=c("speaker", "sex"))
   

Cepstral data can be generated from the fft data by first defining a function to derive the cepstrum:

## define a function to make a cepstrum from an fft spectrum
fft2cep <- function(x,n) {
  as.vector(Mod(fft(log10(x), T ))/length(x))[1:n]
}

and then applying that function to each row of the fft data matrix. Note that the number 30 here refers to the number of cepstral coefficients (the argument 'n' to fft2cep):

## apply the fft2cep function to each row of the fft data
## we need to transpose t() the result because of the way that apply()
## works
short.cep <- t(apply( short.fft, 1, fft2cep, 30))
    

As an aside, note that if you want to calculate a cepstrally smoothed spectrum you need to keep the cepstrum in the complex domain (the result of an fft is always a complex vector representing amplitude and phase). The following function will derive a cepstrally smooted spectrum from a raw fft:

fft2cspectrum <- function(x,n) { 
    ## calculate the cepstrum
    y <- fft(log(x))/length(x) 
    ## blank out the middle section
    y[(n+1):(length(x)-n-1)] <- 0
    ## return the real fft
    return(Mod(fft(y)))
}

You can then plot an FFT and smoothed spectrum of the first row of short.fft as follows:

par(mfrow=c(2,1))
x <- short.fft[1,]
plot(log(x), type="l")
plot(fft2cspectrum(x,50), type="l")

2.2.3.1. Procedure

You will carry out a set of classification experiments as outlined in section 9.4 of Harrington and Cassidy. The steps are as follows:

  • Partition the data into training and testing sets.

  • Train a Gaussian model on the training data.

  • Classify the test data using the model.

  • Compare the classifications with the true labels.

This translates to the following R commands, for example:
## we will take the first 600 tokens as training data, the rest to test
select <- 1:600
traindata <- short.fm[select,]         # note the comma means keep all columns.
trainlabs <- label(short.seg)[select]  # but not here (it's a vector)

select <- 601:860
testdata  <- short.fm[select,]
testlabs <- label(short.seg)[select]

## now we'll train a model
model <- train( traindata, trainlabs )

## and test it
autolabs <- classify( testdata, model )

## and report on the results
confusion( testlabs, autolabs )
perform(confusion( testlabs, autolabs ))

The report on the experiment above is a confusion matrix as discussed in Harrington and Cassidy. The perform function sums the diagonal elements of the matrix to give an overall performance score.

For the cepstral data, remember that the lower numbered cepstral coefficients encode the vocal tract shape while higher order ones encode source information. Speech recognition systems often use 10-12 cepstral coefficients. You should experiment with different numbers of coefficients to find an optimal number for this experiment. Look at the results for a large number of coefficients (40), comment the results of these experiments.

When you are experimenting with different numbers of coefficients, the proper procedure is to use a different set of testing data to your final test set. In this way you ensure that you are not optimising the feature set for this particular set of data and are in fact getting a realistic evaluation in your final test. One option is to perform closed tests while optimising the feature set -- test on the same data that you train on. A better way is to partition the data into three sets, one for training, one for optimisation testing and one for the final evaluation. Each partition should be representitive of the data as a whole as far as possible, one valid approach if you are interested in cross-speaker performance is to train and test on different speakers, eg. train on five speakers, optimise on two speakers and test on three speakers.

To select according to speaker use the following procedure:

select <- muclass( short.info$speaker, c("db", "jc") )
traindat <- short.fm[select,]
trainlab <- label(short.seg)[select,]

The function muclass returns a vector which is true for labels (first argument) which are in the supplied label vector (second argument). This vector can then be used to subset the data.

Hint: you might like to encode the experimental procedure as an R function so that you don't have to repeat the same command sequence over and over. This is quite easy to do, for example the above commands could be made into a function with the lines:

experiment <- function( data, labels ) {

      ## we will take the first 600 tokens as training data, 
      ## the rest to test
      select <- 1:600
      traindata <- data[select,] 
      trainlabs <- labels[select]
      
      ...       
      confusion( testlabs, autolabs )
}

# this can be called with
experiment( short.fm, label(short.seg) )
    

If you wanted to be more general you could calculate the train/test partition boundary based on the number of tokens in the data set.

2.2.3.1.1. Principal Components Analysis

As an optional additional experiment, you can use principal components analysis to reduce the dimensionality of the Bark scaled spectral data. The princomp function performs a PCA on a data matrix, the procedure would be as follows:

## load the mva library to get the princomp function
library(mva)
short.bark.pca <- princomp(short.bark)
short.bark.transformed <- short.bark %*% loadings(short.bark.pca) 

You can now perform a classification experiment on the transformed data set, using either the whole data or a subset of the columns; for example short.bark.transformed[1:10,] gives you the first ten principal components.

You can generate a scree plot (see H&C p. 269) by just plotting the result of the call to princomp, see the princomp help page for an example.

2.2.3.2. Your Report

The result of your analysis will be in the form of a report on your experiments; it should include a method section, a description of the different experiments you carried out and a presentation and discussion of your results. While I don't want to see all of the R commands you used to generate your results, you should explain the major steps. You may want to include plots of vowel data in the formant plane to illustrate the distribution of the data.

2.3. Speaker Recognition

The aim of this exercise is to implement a basic speaker recognition system using VQ distortion measures.

2.3.1. Speaker Recognition

As a second stage of this assignment you will implement a basic speaker identification system using the same basic components. The major difference is that the stored models correspond to a single speaker rather than a word. You will implement a VQ distortion based speaker identification system.

To implement a speaker recognition system you need to be able to build a per-speaker VQ codebook and measure the VQ distortion for an input speech signal given a codebook. Here is an outline of the components you will need to write:

  • Data Input. You will use Emu to retrieve digit data from your own labelled recordings. Emu can give you the data corresponding to each digit along with it's label.

  • Feature Extraction. There are a number of options for feature extraction. To simplify the process we can pre-calculate feature vectors using tools outside of the R environment and read them in using Emu.

  • Vector Quantisation. You will implement a basic vector quantiser using the inbuilt k-means clustering functions. VQ allows you to pre-compute the distances between codebook entries.

  • Write a function to generate a VQ codebook model for each speaker in the input data.

  • Write a function to calculate the VQ distortion for some input speach and a VQ codebook. VQ distortion is defined as the summed distance between each input feature vector and the closest VQ codebook entry. You should write one function to calculate the distortion for one feature vector and another for applying this to the whole input vector sequence. Note that these functions will be very similar to those you use to VQ encode an input sequence. The summed distortion should be normalised by dividing by the length of the input sequence.

  • Write a function to compare the VQ distortion on a number of speaker codebooks for an input feature vector sequence (using the above functions) and output a speaker decision.

2.3.2. Vector Quantisation

The first stage in constructing your recogniser is to build a vector quantisation routine. This is aided significantly by the provision of a k-means clustering routine in R. You can generate a set of cluster centers using the single line:

library(mva)     # to load in the kmeans function
clusters <- kmeans( digits.a.melcep$data, 64 )
which will find 64 clusters in the data. The cluster centers are now stored in clusters$centers as a matrix with one row per cluster.

You need to write a function which takes a set of data and returns the result of the kmeans function; this will act as your VQ model which you will then use in another function to assign data to one of the clusters. This second function needs to take a data matrix and the cluster object as an argument and return a vector of cluster numbers. For example:

clusters <- generate.vq( digits.a.melcep )
data.vq <-  vq( digits.a.melcep[1]$data, clusters )
The second function is operating on the data for the first token in the melcep trackdata object (it is possible to write a function which operates on a trackdata object, returning a new trackdata object. I can explain how if you want to go that way.)

As a final step you can pre-calculate the distances between each vq cluster centre and store the result in a matrix. There is a convenient function dist which does exactly this but you should ensure that the distance metric that it is using is the same that you have used in your vq function. Once you have a distance matrix you can visualise the result using the image function.

In order to be able to test your functions it will be useful to work with data that can be visualised. I suggest you use some of the formant data we worked with earlier and generate a vq codebook for the first two formants. Then you can plot the formant data with their cluster labels (using eplot) to verify that you are indeed doing sensible clustering with your vq routine. I tested my routines on the first two columns of the mel cepstral data, here is a picture of the resulting clustering and the commands used to generate it.

model <- generate.vq( digits.a.melcep$data[,1:2], 8 )
foo <- vq( digits.a.melcep$data[,1:2], model )
eplot( digits.a.melcep$data[,1:2], foo, dopoints=T, doellipse=F )
    

Figure 2.1. An example clustering of the first two columns of the melcep data for speaker `a'. Note that the data for vq label `8' has been plotted in white, it fits in between the green '3' cluster and the cyan '5' cluster.

2.3.3. Template Matching

The second part of your assignment is to implement the Dynamic Programming template match discussed in Chapter 11. The core of the template matcher is a function which compares two input sequences, in our case these will be sequences of codebook entries. The comparison is made using the time warping algorithm we looked at, you can take the simplest form as your starting point making it more complex if you have time and enthusiasm. The function will need to build a matrix of distance values as it compares the two sequences. The value of each cell is calculated from those of the neighbouring cells. The resulting distance is the value of the final cell in the top right of the matrix. At this point we aren't interested in the path taken through the matrix but you might want to add that functionality if you are interested in seeing it work.

To write the dtw function, first allocate a matrix of distance values of the appropriate size (dd <- matrix(nrow=length(template), ncol=length(input))) and then step through the template and input (with two nested for loops) calculating the distance for each cell. The result of the function is the distance value in the final cell of the matrix. This basic function can then be optimised by adding in the adjustment window condition so that only cell contents within some distance of the line i=j need to be calculated.

It is difficult to know whether you've got the right answer once you've written the code. To verify your procedure I'd suggest constructing a simple case to which you can work out the answer by hand. Let's say you have a codebook with centers:

1: 1 2
2: 3 4
3: 5 6 
4: 7 8
then you can work out the between code distance matrix easily and then calculate the distance between the input vectors:
[1,2,3,4,3,2,1]  and
[1,3,4,3,3,1]
     
for example. You should be able to do this by hand and then verify the result using your function (including looking at the matrix values which would normally be thrown away).

As an extension to the simple dtw function, you could calculate best match path between the two vectors. To do this you need a second array which records which of the three possible paths into a cell were taken. At the end of the procedure you can trace back the best path in this array.

Once you have the core dtw function you need to write code to match an input vector against a set of stored templates and output the label of the closest template. You can store the set of templates as a list in R (you can't store them in an array since they aren't of the same length). I suggest you choose a random example of each digit as your template in the first instance, the alternative is to implement some kind of template selection procedure which is clearly more complicated. For example, the first element in the digits.a data is a nine, and the second is a five. Here is an example function which selects the first example of each label as the template and returns a list of the templates and their labels. It assumes there is a function vq which can vector quantise a data sequence.

select.templates <- function(segments, data, codebook) {
  ## select a set of vq templates given an input data set,
  ## we want to choose one example of each label and store
  ## the vq version of it as our template.
  ## The result is an object with components $templates, which
  ## is a list of the vq templates, and $labels which is a vector
  ## of the corresponding labels.
  ## eg.
  ## codebook <- generate.vq( mydata, 32 )
  ## tpl <- select.templates( mysegs, mydata, codebook )
  ## tpl$templates[[1]]  -- is the first template
  ## tpl$labels[1]       -- is the corresponding label


  ## ind is a vector of index numbers, one per segment in the data
  ind <- 1:nrow(segments)
  
  ## set up an empty result
  templates <- NULL
  templates$templates <- NULL
  templates$labels <- NULL
  
  for( lab in unique( label(segments) ) ) {
    ## select an example of this label, take the first one
    n <- ind[label(segments)==lab][1]

    ## calculate the vq version of the nth data element
    nvq <- vq( data[n]$data, codebook )

    ## and add it to the list
    templates$templates <- c(templates$templates, list( nvq ))
    ## and add the label to the labels list
    templates$labels <- c(templates$labels, lab)
  }
  return(templates)
}

Your recognition function will take the result of select.templates (or similar) and an input sequence. It will vector quantise the input and match it against every template, selecting the best match and returning the corresponding label. You can then write another function to call this function with each element of the data for testing purposes. You should include some test results in your final submission: how good is your recogniser on a same-speaker evaluation or on cross-speaker evaluation? How does the codebook size affect recognition performance? You should also look inside your decision algorithms and see how close the correct answer was in the cases where you got it wrong.

2.3.4. R Hints

One peculiarity of R is that for loops are not the most efficient way to process vectors or arrays of data. R is able to do vector arithmetic as you know, and this is much more efficient than doing the same calculation in a loop. Similarly, if you are doing some operation to every row of a matrix to generate a new matrix, it is faster to use the apply function than to use a for loop. For example, to sum the rows of a matrix foo:

foosum <- apply( foo, 1, sum )
     
is much better than:
foosum <- 0
for( i in 1:length(foo) ) {
         foosum <- foosum + foo[i]
}
look at help(apply) for details. The second argument refers to the dimension to which the function is applied, 1 for rows, 2 for columns. You can substitute any function of one argument for sum, even one you're written yourself. The result of the function should be a single item or a vector of fixed length; the result of apply is then either a vector or a matrix.

This might help you make your program faster, although you can still write a perfectly valid (and often more readable) function with for loops, you won't be penalised for not using apply.

2.3.5. To Hand In

You will hand in your completed code for the speaker recognition problems plus a short report on how the problem was solved and how the programs perform on the data provided. The evaluation part is important since I want to see that you've run the code that you've written and if appropriate, modified the programs in the light of what you observe (eg. changing thresholds etc.).

2.4. Speech Recognition

The goals of this assignment are to gain first hand experience in training an HMM based speech recogniser. We will use the freely available Sphinx recogniser and train it on some Australian speech to get a localised recognition engine. This project is sufficiently involved to have you work in pairs to complete it. I will give each pair a different set of training data so that we can compare results at the end. Tasks that will need doing are:

  • Pre-process the speech data to fit into the formats required by the training tools.

  • Develop a pronunciation dictionary for Aus. English in the appropriate format for Sphinx. There are a few sources of pronunciations that we might use here.

  • Establish an HMM architecture, how many states per model, generate initial models etc.

  • Train the HMM

  • Develop a language model for a chosen test application (eg. route planning!)

  • Deploy and evaluate the recogniser.

Many of the details will be worked out as we go but we are helped by a good set of tools and documentation from CMU at the Sphinx website.

Acoustics and Digital Signal Processing

This section is included here as a reference only. The material here was covered briefly in the first two weeks of class.

Chapter 3. Basic Acoustics

3.1. Speech signals

Speech propagates as a longitudinal wave in a medium, such as air or water. The speed of propagation depends on the density of the medium. It is common to plot the amplitude of air pressure variation corresponding to a speech signal as a function of time; this kind of plot is known as a speech pressure waveform or just a speech waveform. Here's an example taken from part of a vowel:

The plot shows the change in air pressure with time and we can see that in this example the pattern of changes is regular in that each up-down cycle has roughly the same shape. This corresponds to a periodic signal and these signals make up the voiced speech sounds such as vowels, and voiced consonants such as `b', `w' and `z'. Another kind of sound is irregular or aperiodic and so appears to be just random variations in air pressure. Examples are the hissing sound in `s' and `sh'.

The simplest form of sound that we can describe is called a sinusoid. This waveform corresponds to a pure tone and as we will see later forms the basis for all more complex sounds. Since a sinusoid has such a pure shape it is a useful starting point when describing the properties of sound waves.

3.2. Properties of Sinusoids

A sinusoid corresponds to a simple smooth up and down movement when drawn as a time waveform. Three kinds of measurement can be taken from a sinusoid and together they completely define the shape of waveform. These measurements are:

  • Amplitude: the size of the displacement of the sinusoid above and below the mid line. This corresponds to the energy in the sound wave and hence how loud it appears to be. There are many ways of measuring amplitude; since it relates to the size of the pressure variations in the air it can be measured in units of pressure. More often we talk about deciBels (dB) which measure amplitude on a logarithmic scale relative to a standard sound. The dB scale is useful since it maps directly to the way that humans percieve loudness.

  • Frequency: the number of cycles made by the sinusoid each second. A cycle consists of an oscilation from the mid line to the maximum, down to the minimum and back to the midline. Frequency is measured in cycles per second or Hertz (Hz). The period is the inverse of this -- the time taken for one cycle. Changes in frequency are percieved as changes in the pitch of a sound (although pitch has a more complicated perceptual definition so the two quanitities are not the same).

  • Phase: measures the position of the starting point of the sinusoid. Those starting at the maximum have a phase of zero while those starting at the minimum have a phase of π radians. Phase is best explained with reference to a rotating wheel, see Harrington and Cassidy Chapter 2, Figure 2.7. for more details. The phase of a sinusoid cannot be percieved but we can detect relative changes in phase between two signals, in fact this forms the basis of binaural hearing as the brain works out the location of a sound based on the different phases heard at the two ears.

3.3. Adding Sinusoids

When two sinusoids are added together the result depends upon their amplitude, frequency and phase. The effects are easiest to observe when only one of these is varied between the two sinusoids being added.

In the simplest case, when two sinusoids with the same frequency and phase but with different amplitudes are added together the result is a sinusoid who's amplitude is the sum of the originals and who's frequency and phase remain unchanged.

When the two sinusoids have different frequencies the result is more complicated. The new signal is no longer a sinusoid since it doesn't follow the simple up and down pattern. Instead we see the higher frequency sinusoid as a `ripple' superimposed on the lower frequency sinusoid (see figure). The frequency of the resulting signal will be the lower of the two original frequencies. In the figure we can see that the resulting signal goes through two and a half cycles (that is it repeats itself this many times) just like the second sinusoid. The amplitude of the resulting signal is the sum of the originals and the phase is unchanged.

Figure 3.1. Addition of sinusoids with different frequencies.

Adding together sinusoids with different phases can have surprising results. If two sinusoids are `in phase' then their peaks and troughs coincide and the result is as observed earlier. If the sinusoids are `out of phase' or in other words if they differ in phase by π radians then their peaks and troughs oppose each other and they will cancel each other out. The phase of the resulting sinusoid is the sum of the phases of the constituents.

Adding up more than two sinusoids can produce complex looking waveforms. When a number of different frequencies are combined the frequency of the result will in general be that of the lowest frequency component. This is often called the fundemental frequency of the signal.

3.4. Fourier theory

Fourier theory says that any complex periodic waveform can be decomposed into a set of sinusoids with different amplitudes, frequencies and phases. The process of doing this is called Fourier Analysis and the result is a set of amplitudes, phasesand frequencies for each of the sinusoids that makes up the complex waveform. Adding these sinusoids together again will reproduce exactly the original waveform.

A plot of the frequency or phase of a sinusoid against amplitude is called a spectrum. For a pure sinusoidal tone this is just a single line as shown in Figure 2.9. of Harrington and Cassidy. An amplitude spectrum of a complex waveform shows the amplitude of each frequency component or sinusoid. The phase spectrum shows the phases of these components but is very seldom used in speech analysis. Spectra are used extensively in speech analysis to discern the characteristic shapes associated with different classes of speech sound.

Any periodic signal shows a pattern of repitition in the time waveform which corresponds to the primary rate of vibration of the signal, known as the fundamental frequency. This corresponds to the lowest major frequency component in the signal and, in the case of voiced speech, equates to the frequency of vibration of the vocal folds. A vibration source like the vocal folds also produces what are called harmonics which are oscillations at multiples of the fundamental frequency. For example, a 100Hz source vibration will give rise to harmonics at 200Hz, 300Hz, 400Hz etc. In the spectrum of such a signal, the harmonics show up as spikes at these frequencies.

The fundamental frequency can be measured from a speech waveform by looking for the period of oscillation of the signal around the zero axis. Estimates can also be made from the spectrum since it shows a large peak at this frequency and at each multiple due to the harmonics. Measurements can be made of the frequency of the major peak or of the distance between harmonic peaks.

3.5. Spectrograms

A spectrum shows the freqency content of a short section of a waveform but it we want to see how the spectrum changes with time through a section of speech we need a different kind of display. A standard way of displaying this information is a spectrogram; this is a two dimensional plot of frequency against time where the amplitude at each frequency is represented by the darkness of the corresponding point in the display. The figure shows an example spectrogram aligned with the speech waveform for the word `stamp'. Experienced phoneticians can glean a lot of information from these kinds of displays and you will look at lots of them during your later practical work. For now we can observe a few prominent features: the dark are corresponding to the high frequency energy in the /s/ phoneme, the vertical striations in the /A/ phoneme corresponding to the fact that this vowel is a periodic signal (voiced) and the sudden onset of the relatively light area corresponding to the release of the /p/ at the end of the word.

Figure 3.2. An example wideband spectrogram of the word `stamp'.

The vertical striations in the vowel of `stamp' give another way of measuring the fundamental frequency of the signal at this point. In this kind of spectrogram (called a wideband spectrogram) each vertical striation corresponds to one open-close cycle of the glottis; hence measuring the time between striations can tell us the period of oscillation and hence the fundemental frequency of the speech signal at this time.

3.6. Exercises

The worksheets included with Harrington and Cassidy contain a number of exercises which are useful in illustrating the properties of sinusoids and spectra. Section 2 of the 'Interactive Workseets' illustrates the shape of sinusoids and the result of adding them together, measurements of frequency from complex waveforms and shows some examples of spectrograms.

3.6.1. Work to submit

You should submit your answers to the six questions in Section 2.3 of the worksheets which cover measurements taken from spectrograms on male and female speech.

3.7. Additional Reading

Basic acoustics is covered in both Ladafoged (Chapter 8) and Harrington and Cassidy (Chapter 2). For a much more detailed and slower paced discussion of sound propagation and the properties of waveforms see Introduction to Sound by Charles E. Speaks published by Singular Publishing (ISBN: 1879105985).

Chapter 4. Digital Signals

4.1. Transduction and Digitisation

Much of this unit will be concerned with the manipulation of speech signals by digital computers. The power of modern desktop computers means that advanced signal processing and speech analysis software can run on the average computer. In order to take advantage of this, and in order to understand the properties and limitations of digital signals, we need to look at how a speech siganl is converted into digital form.

4.1.1. Transduction

An acoustic signal is transmitted via the motion of the molecules of the air or other medium through which it passes. As we saw earlier the signal takes the form of changes in pressure over time and we characterise the signal by measuring this pressure as a function of time. Pressure is an analog quantity in that it varies continuously and can take on an infinite number of different values.

Although sound starts it's life as changes in air pressure, it can also be transmitted in different ways, for example via a microphone and telephone cable to a remote listener. The process by which sound changes from being a pattern of vibration in the air to a pattern of electrical energy in the telephone cable is known as transduction. Devices which change the form of a sound signal, in this case the microphone and loudspeaker, is called a transducer. The most interesting class of transducers for us are those that turn sound pressure waves into patterns of electrical energy. We can measure the electrical energy in the same way that we measure pressure to get a picture of the pattern of variation within the acoustic signal. Electrical energy is again an analog quantity and so we now need some way of changing this into the digital domain of modern computers.

4.1.2. Digitisation

Digital computers work at their lowest level with binary or base 2 numbers. In order to analyse and manipulate speech signals in the computer we need to convert the continuously varying signal into a series of discrete numbers; this process is known as digitisation. The basic digitisation hardware is known as and analog-to-digital or A/D converter and it takes snapshots of an input analog signal at regular intervals outputting a number which is closest to the magnitude of the snapshot measurement. Hence if the converter sees 1.376 volts it might output a value of 1.4 volts -- the closest `round number' to the input measurement. The questions of importance when talking about digitisation are the rate at which snapshots are taken and the size of the smallest change that can be detected in the input signal.

4.1.2.1. Sampling Frequency

It should be clear that taking a series of snapshots of a signal can only capture an approximation of the original. When a movie or TV camera captures thirty or so frames per second we rely on persistence of vision to transform the result into a continuous moving image when played pack. Even so, any event that happens between snapshots or any changes that happen too quickly will not be reproduced properly: the classic example is the wheels on covered wagons in John Wayne movies which appear to be going backwards. The same problem arises in digitising audio signals and so we must be careful to capture at least the important information in the signal; to do so we need to understand what the limitations are.

The sampling frequency or sampling rate is a measure of the number of snapshots taken from the signal each second. Sampling frequency is measured in Hertz, just like any other frequency measure.

If we consider sampling a sinusoid signal we can see what the limits of the process are. Figure 5.2 from Harrington and Cassidy (reproduced here) shows two sinusoids with different frequencies sampled at the same sampling frequency of 10Hz. In the bottom pane, a signal with a frequency of 5Hz is sampled at at each peak and trough and once close to the midpoint of each cycle: four samples for each cycle in total. The result is a version of the original which retains at least its frequency and amplitude if not the exact shape. The upper pane shows a 15Hz signal and we can see that because the sampling frequency is too low, most of the peaks and troughs are missed; as it happens the sampled signal matches exactly that from the 5Hz example, so the original 15Hz signal would be reproduced as a 5Hz signal if a sampling rate of 10Hz was used. This phenomenon is known as aliasing; the higher frequency signal is said to be aliased onto the lower frequency signal.

Figure 4.1. Sampling two sinusoids. Taken from Harrington and Cassidy, figure 5.2, p 134.

The absolute minimum number of samples per cycle needed to properly reproduce a sinusoid is two -- one at the peak, one at the trough. This will give a crude approximation to the original signal but will be able to capture the frequency and amplitude. This means that the sampling frequency should be at least twice the frequency of the sinusoid being digitised; this is know as the Nyquist Frequency.

Since we know that any complex signal can be thought of as the summation of a number of sinusoids, the above result can be stated more generally. The sampling frequency should be at least twice the frequency of the highest frequency of interest in the input signal. The telephone system uses a sampling frequency of 8000Hz and so can capture only information up to 4000Hz. In studying speech recorded in quiet conditions we often use a sampling frequency of 20000Hz which gives information up to 10000Hz.

From the earlier example we saw that signals above the Nyquist frequency will be aliased so that they look like signals below it. In fact the Nyquist frequency acts like a mirror such that any sinusoid with a frequency of FN + x will appear as a sinusoid of frequency FN - x. For complex signals, this has the consequence that any information above the Nyquist frequency will reflect back onto the lower frequencies and contaminate the result. In order to reproduce the information below the Nyquist frequency accurately the higher frequency information must be removed before the signal is digitised. The device which does this is a low pass filter often called an anti-aliasing filter.

4.1.2.2. Quantisation

We are used to using a base ten number scheme but modern computers use binary numbers for various reasons. Binary numbers consist only of ones and zeros which is useful as these can be stored and transmitted as the presence or absence of a physical property, like electrical charge or a high frequency tone. Each binary digit is called a bit and all information inside a computer is stored as strings of bits.

A binary number can be decoded by adding up powers of two; each place in the digit corresponds to a higher power of two, starting from the right most position which is 20 = 1 followed by 21 = 2 then 4, 8, 16 etc. So the binary number `1011' is 1*8 + 0*4 + 1*2 + 1*1 = 3. Decoding these numbers by hand can be longwinded but the process is easy to understand. Any decimal integer can be rewritten as a series of binary digits.

When we digitise an acoustic signal it is turned into a sequence of binary numbers by the analog-to-digital hardware. There is an important consequence of this process that we need to understand: that these devices use a fixed number of binary digits to represent each sample and hence that the size of the smallest change that can be detected in the input is related to the number of bits used.

Analog to digital hardware uses a fixed sample size to represent the sampled acoustic signal; typically 12 or 16 bits are used per sample. A little arithmatic will tell us that 12 bits will give us a maximum of 212 = 4096 different numbers while 16 bits gives 216 = 65536 values. These numbers will be used to represent the different input voltages taken from the microphone. When the hardware measures the size of the input voltage from the microphone, instead of calculating a voltage value it merely assigns it a number on a scale of 0 to 4096 (for a 12 bit digitiser). Since our input signal consists of both positive and negative peaks (oscillations), the 0 point should correspond to the largest negative peak and the 4096 point to the largest positive peak. The output of the digitiser when the input is zero should be around 2048.

The outcome of this is that the continuously variable input signal is quantised into one out of 4096 values. If, for example, the range of the input were +3v to -3v then each binary digit (bit) would correspond to a change of 6/4096=0.0015v in the input. Any change in the input smaller than this will not be captured. This is a similar consequence to that in the previous section where events which happen between samples cannot be resolved. The result of quantisation can be seen in Figure 5.4 on page 135 of Harrington and Cassidy (reproduced below). The sampled signal (dotted line) is an approximation of the original in both time and amplitude.

A digitiser that uses 16 bits will be correspondingly more accurate than one which uses 12 bits. The compact disc standard uses 16 bit samples taken at a 44kHz sample rate. For speech analysis it is common to use 12 bit samples at around 20kHz for studio quality recordings or 8-10kHz for telephone or office environment recordings.

4.2. Manipulating Digital Signals

Once a digital signal has been captured it can be stored inside the computer as a sequence of numbers. Our aim in doing this is to be able to manipulate and analyse this digital form to extract information from it and convert it to different forms. To this end we need to look at the concept of vectors and the operations of vector arithmetic.

A vector is the mathematical name for a sequence of numbers of arbitrary length. In Harrington and Cassidy and in these notes we write a vector as a list within square brackets:

[1, 3, 19, 299, 34, 12]

This vector consists of six elements. We often denote a vector algebraically by a bold letter, for example

x = [1, 3, 19, 299, 34, 12]

A number of simple operations are possible on such vectors and these are described in Harrington and Cassidy Section 5.3, page 138. In summary we can add a scalar (a scalar is a single number, as opposed to a vector which is a sequence of numbers) to a vector, multiply by a scalar or add/multiply two vectors. When combining vectors in this way both must have the same length since the elements are combined in a pairwise manner. See the text for examples and notation.

When a vector is used to denote an acoustic signal some more assumptions are made. Acoustic signals are notionally infinite (they have no start or end) so for any finite vector we assume that the numbers before and after the vector exist and are zero. This becomes important when we shift vectors to the left or right (a useful operation in some common signal processing operations) -- in this case a zero is shifted in the the first (or last) position of the vector.

4.3. Exercises

The exercises for this section come from the "Practical Exercises with XlispStat" -- the second set of worksheets on the CDROM accompanying Harrington and Cassidy. These worksheets require that you learn a small amount of the Lisp language although most of the exercises can be completed by copying from the examples in the worksheets. You should begin with the introduction in section 2 which covers the basics of the language and environment. The exercises for later chapters will build on the basic commands that you learn here.

For this chapter you should work through the worksheets in section 3 (Time Domain Processing) up to section 3.4. Make sure that you are comfortable with the notion of storing sequences of numbers as vectors and the operations that you can perform on them. You will use these operations later to perform some simple digital signal processing on speech signals.

4.3.1. Work to submit

There are no exercises to submit for this topic.

4.4. Additional Reading

This material is covered in Harrington and Cassidy, Chapter 5, sections 5.1 - 5.3.

Chapter 5. Time Domain Processing

Having digitised a speech signal we now have a vector of samples which can be manipulated in various ways. Our aim in this course is to be able to extract characteristic parameters from a signal, which can be used to identify different kinds of speech sounds. This session will discuss some parameters which can be extracted directly from the time domain signal.

5.1. Windowing Signals

When we analyse signals we tend to do so on only small portions at a time. The reason for this is that we typically assume that the signal is constant (eg. has a constant frequency) over the time-span of our analysis. Since speech is actually changing rapidly, we have to cut it into small parts for this assumption to hold. Hence we typically window a speech signal before analysis, which just means that we cut out a small section.

Windowing can be seen as multiplying a signal by a window which is zero everywhere except for the region of interest, where it is one. Since we pretend that our signals are infinite, we can discard all of the resulting zeros and concentrate on just the windowed portion of the signal.

The above window (zeros and ones) is known as a rectangular window because of its shape. One problem with this kind of window is the abrupt change at the edge, which can cause distortion in the signal being analysed (in fact any windowing operation causes distortion, since the signal is being modified by the window). To reduce this distortion we often use a smoother window shape called a Hamming window. This window is zero at the edges and rises gradually to be 1 in the middle. When this window is used the edges of the signal are de-emphasised and the edge effects are reduced.

It is important to use a Hamming (or the similar Hann window) in some kinds of analysis, particularly the frequency domain methods we will look at later. For many time domain methods the rectangular window is preferable.

If we are analysing a long signal, then successive windows are taken along the length of the signal. Often we overlap the windows if a Hamming window is being used so that the de-emphasised part of one window becomes the middle of the next.

5.2. Time Domain Parameters

5.2.1. RMS

The root mean square amplitude is a measure of the energy in a speech signal. When applied to successive windows of a speech signal it gives us a measure of the change in amplitude over time.

5.2.2. Zero Crosssing Rate

Zero crossing rate measures the number of times a signal crosses the zero line per unit of time. This gives us a measure of the dominant frequency in a signal -- the frequency with the largest amplitude. This is often correlated with the first formant in vowels. ZCR can be useful in differentiating between voiced and unvoiced sounds since unvoiced sounds tend to have a large ZCD while in voiced sounds it is smaller.

A combination of RMS and ZCR can be used to make a simple speech/nospeech decision. ZCR tends to be high in unvoiced sounds, RMS is high in voiced sounds -- a suitable weighted sum will be high for any speech sound and low for non-speech.

5.2.3. Autocorrelation

Autocorrelation can be used to find the fundamental frequency or pitch of a signal. The technique relies on finding the correlation between a signal and a delayed version of itself. If the delay corresponds to exactly one pitch period, then the signal and the delayed version will co-vary -- that is when one goes up the other will go up, etc -- we say that they will be highly correlated. If the delay corresponds to half a pitch period, they will be opposed to one another or un-correlated. If we plot the degree of correlation (which varies between 1 and -1) vs. the lag between the two signals we get an autocorrelation curve, as shown in the figure. We can see a peak in the curve corresponding to a lag of one pitch period -- and hence autocorrelation can be used to determine the pitch of a signal.

Real speech doesn't give as clean a curve as in the example above (which was from a sinusoid). But as the example here shows we can often see good peaks in the autocorrelation curve of voiced speech (right). Unvoiced speech (left) has a very different autocorrelation pattern and hence autocorrelation is another way of making the voiced/unvoiced decision.

5.3. Exercises

You should again do some exercises from the `Practical Exercises...' worksheets from Harrington and Cassidy. Continue with section 3.5 on windowing signals and then look at the first part of section 4 on speech signal parameters which gives examples of calculating autocorrelation coefficients for a sample signal. Leave the remaining exercises in section 4 until later.

5.3.1. Work to submit

This worksheet is intended to explore the use of autocorrelation for pitch tracking of speech signals. You will use the autocorrelation routines in XlispStat/Emu to look at various speech signals and asses the ease of automatic pitch tracking.

XlispStat/Emu has an autocorrelation routine which can be accessed by sending the :autocorr message to a signal object. The autocorrelation is by default performed on the entire signal, so if you want to analyse just a window from the signal you need to window the signal separately with the :window message. For example:

(def w (send stamp :window :start 2000 :width 1024 :type 'rectangular)
(def w-ac (send w :autocorr))
(send w-ac :plot)    
      
You can plot the autocorrelation by sending the :plot message as above, by default this will also show the four largest peaks as determined by a fairly crude peak picking algorithm.

You can use the window-demo function to preview the result of applying autocorrelation to different parts of a signal as follows:

(window-demo stamp :transform :autocorr :width 1024 :type 'rectangular)
      
A 1024 point window corresponds to just over two cycles of a 100Hz signal and so should be ok if the pitch is around that value. A rectangular window is appropriate here since we don't care about the frequency response of the window and in fact it's probably best to preserve the amplitude of the signal at the edges.

Answer the following questions:

  • For the stamp signal, estimate the pitch of the signal using autocorrelation at regular intervals along it's length.

  • Why can't you estimate the pitch in an unvoiced section like the initial /s/? Suggest a method of differentiating between voiced and unvoiced speech using only an autocorrelation analysis.

5.4. Additional Reading

Chapter 5. of Harrington and Cassidy, sections 5.4 - 5.6

Chapter 6. Frequency Domain Analysis

6.1. The Discrete Fourier Transform

The Fourier transform transforms a time domain signal into a frequency domain representation of that signal. This means that it generates a description of the distribution of the energy in the signal as a function of frequency. This is normally displayed as a plot of frequency (x-axis) against amplitude (y-axis) called a spectrum.

In digital signal processing the Fourier transform is almost always performed using an algorithm called the Fast Fourier Transform or FFT. This is, as its name suggests, a quick way of performing this transform and it gives the same results as the slower Discrete Fourier Transform (DFT) would. We needn't worry here about any of the details of how the transform is carried out, we are interested only in the results. One consequence of using the FFT algorithm is that the length of the signal being analysed must be a power of two, eg. 128, 256, 512, 1024 and so on (although many modern implementations get around this limitation).

Applying the FFT algorithm to a 512 point window of data taken from a speech signal results in a vector of 512 numbers corresponding to the energy at 512 frequencies spanning the range from 0Hz to the sampling frequency. From Fourier's theorem we know that adding together 512 sinusoids at these frequencies will exactly reconstruct the original signal. From Nyquist's theorem we know that the largest frequency component in the original signal must be half the sampling frequency, all frequencies above this are aliased onto lower frequencies. It follows that the second half of the spectrum is the mirror image of the first half:

In fact, the first point in the spectrum isn't reflected and the frequency range from 0 to half the sampling frequency is covered by (N/2)+1 points where N is the size of the window. So from a 512 point fft of speech sampled at 10000Hz we get 257 unique spectral points covering the range 0 to 5000Hz.

6.2. Bandwidth

For any digital spectrum we can work out the space between points in on the frequency axes, in the example above this will be 5000/257 or 19.5 Hz. This means that if there are two peaks (or dips) in the spectrum separated by less than 19.5Hz we won't be able to resolve them -- they will be blured together. If we used a wider window, say 1024 points, the space between points on the frequency axis drops to 9.7Hz, now we can see more detail than before. The resolving power of a spectrum is called it's bandwidth, refering to the width of the frequency bands that each point in frequency represents. A large bandwidth implies few points in the spectrum and hence poor resolution in frequency; a small bandwidth implies many points in the spectrum and good resolution in frequency.

There is a tradeoff between resolution in frequency and resolution in time. To get good frequency resolution we need to take a large window of points from the signal. By analysing this large window we blur any spectral changes which happen over it's length -- that is, if two events occur close together in time, a large window will tend to blur them together. Hence good frequency resolution implies poor time resolution and good time resolution implies poor frequency resolution. We can see the result of this tradeoff in wide and narrowband spectrograms (end of Chapter 2 in the text). The wideband spectrogram has a large bandwidth (small window) and so blurs the frequency scale but gives good time resolution -- you can see vertical striations on the spectrogram corresponding to the opening and closing of the glottis. The narrowband spectrogram has a small bandwidth (large window) and so looses time resolution but displays excellent frequency resolution -- you can see horizontal bands corresponding to the harmonics of the pitch frequency.

The figure below shows dB spectra corresponding to a section of the vowel in stamp calculated withh window sizes of 1024 points (red) and 256 points (black). Notice the peaks due to pitch harmonics in the red line which are blurred over in the black.

When the window size is very small, the spectrum generated by the fourier transform will have a correspondingly small number of points and hence will appear very jagged. In order to produce a smoother result in this case the input window can be zero-padded to extend the size of the window without taking more of the input signal. The fourier transform of the zero padded signal will have many more points and the result is that the spectrum appears to have been smoothed out. An example can be seen in Figure 6.1 which shows a raw spectrum taken on 50 points next to a smoother version taken from the same 50 points padded with 500 zeros.

Figure 6.1. A Zero Padded Spectrum

A similar example is shown in Harrington and Cassidy p. 167. The smoothed spectrum has the same peaks and troughs as the original but note that it also introduces new peaks and troughs of it's own. Thus it is important to know when you are looking at a zero padded spectrum -- not all of what you see is characteristic of the input signal.

6.3. Decibels

We commonly transform a spectrum as calculated by the fft into a decibel spectrum to better reflect the response of the human ear. The units of amplitude in the raw spectrum shown above are the same as those of the sampled acoustic signal. The decibel scale is a logarithmic scale of relative amplitude where we first devide by a reference amplitude (which corresponds to 0dB) and then take the naturnal logarithm. Finaly we multiply by 20:

AdB = 20*log(A / Aref)

where AdB is the amplitude in decibels, A is the raw amplitude and Aref is the (raw) reference amplitude. There are five basic choices for Aref:

  1. Choose a known standard reference value. This requires calibration of your recording and digitising equipment against some standard.

  2. Take Aref to be the average amplitude in the utterance you took your window from.

  3. Make Aref the peak amplitude value in the spectrum, this makes the peak always have an amplitude of 0dB.

  4. Make Aref the amplitude at a fixed frequency, say 1000Hz, for this spectrum. The spectrum will always now pass through 0dB at 1000Hz.

  5. Make Aref zero, that is just take the log of the raw amplitude value.

The choice of value for Aref depends on the needs of your application. The big question is whether you want to preserve differences in amplitude between spectra or between utterances. For example, if you think that amplitude is going to be a useful cue then you need to preserve the relative amplitude of your spectra, so you can't use either method 3 or 4 above. If you want to factor out the overall loudness of a recording then method 2 might be useful. If you want to get precise relative amplitude measures then you need to use a calibrated system (method 1).

6.4. Frequency Domain Filtering

Filtering is an operation on an acoustic signal that changes the shape of the spectrum of the signal. A filter is a device (or an computer program) which applies such a change to a signal. Examples of filters are the bass/treble controls on your stereo which attenuate the higher or lower frequency ranges in the music signal. Each filter has a characteristic spectral shape which describes which parts of the original signal will be attenuated and which will be left alone. An example for a low pass filter might be as shown in Figure 6.2.

Figure 6.2. An example of the spectral shape for a low pass filter

The effect of applying a filter to a signal is to modify the shape of the signal's spectrum. In the frequency domain the effect of applying a filter to a signal is to multiply the spectrum of the signal by that of the filter. (Note that we multiply the raw spectra and not the dB spectra -- since dB spectra are logarithms they should be added to achieve the same effect since log(a*b) = log(a)+log(b).) The result is a spectrum that combines the features of those of the input signal and the filter. An example is shown in Figure 6.3 (taken from Harrington and Cassidy, p. 191) of a simple impulse train input signal (which has a `flat' spectrum) filtered through a bandpass filter; the output signal shows the characteristics of both spectra.

Figure 6.3. Applying a filter in the frequency domain

Having applied a filter by combining the spectra, a new filtered signal can be derived by performing an inverse fourier transform on the new spectrum.

The question arises as to how we can find the spectrum of a filter, or indeed how a filter is represented digitally. A digital filter can be characterised by an acoustic signal called an impulse response -- literally this is the result of passing an impulse (a 1 followed by lots of zeros) through the filter. The spectrum of a filter can be derived by taking the spectrum of its impulse response.

Filters are an incredibly important topic in the study of speech acoustics. As we will see, the vocal tract acts as a filter which changes the spectral characteristics of the acoustic source generated by the glottis or by frication. Understanding the nature of filters and the separability of the source and filter are fundamental to the study of speech acoustics and form the basis of much of speech technology research.

6.5. Convolution and Time Domain Filtering

Although it is possible to apply a filter in the frequency domain by multiplying raw spectra, this is not the most practical way of applying a digital filter. Fortunately there is a time-domain operation that has the equivalent effect: convolution. Convolution is an operation that combines two vectors to give a third vector. If the first vector is an acoustic signal and the second is the impulse response of a filter, then the result of convolution is a filtered signal.

The process of convolution is described in detail in Harrington and Cassidy, Section 5.6, p. 148. It involves a series of multiplications and additions of corresponding points in the two vectors. Rather than reproducing the explanation here I refer you to that discussion.

An important point to take away from the discussion of convolution is that the operation is equivalent to weighted differencing of the input signal. The filter provides the weighting coefficients. Also, remember the formula y = x*h where y is the output signal, x is the input signal and h is the filter impulse response. We will see this formula later when we look at more advanced DSP methods; it provides an important key to analysing speech signals.

6.6. Exercises

For this section you should complete some of the worksheets on the source filter model from the `Interactive Worksheets on Speech Acoustics' on the Harrington and Cassidy CDROM. Worksheet sections 3.1, 3.2 and 3.3 are relevant. The later sections will be useful when we talk about the source filter model of speech production in the next section.

6.6.1. Work to submit

You should submit your answers to worksheet 3.2 (Bandwidth) in the `Interactive Worksheets on Speech Acoustics'.

Chapter 7. The Source Filter Model of Speech Production

We are now equipped to begin talking about an important model of speech production: the source filter model. This model is at the heart of many speech analysis methods and drives thinking in speech perception research also.

The view put forward in the source filter model is that speech sounds are produced by the action of a filter, the vocal tract, on a sound source, either the glottis or some other constriction within the vocal tract. Fundamental to the model is the assumption that these are independent -- that is the properties of the filter can be modified without changing the properties of the source and vice versa. This assumption may not be strictly true in all cases but in practical terms provides us with a very useful and largely accurate model of speech production.

7.1. The Source

There are two acoustic sources in speech production corresponding to voiced and unvoiced speech sounds.

7.1.1. Voiced Speech

The source in voiced speech is the vibration of the vocal folds in response to airflow from the lungs. This vibration is periodic and if it could be examined independantly of the properties of the vocal tract (which change it's spectral shape) would be seen to consist of a series of broad spikes. Figures 3.4 and 3.5 from Harrington and Cassidy (reproduced in Figure 7.1) show the waveform and spectrum of the glottal source.

Figure 7.1. A glottal source waveform and it's spectrum

The spectrum of the glottal source is made up of a number of frequency spikes corresponding to the harmonics of the fundamental frequency of vibration of the vocal folds. The spectrum decreases in amplitude with increasing frequency at a rate of around -12dB per octave -- that is for each doubling in frequency, the amplitude of the spectrum decreases by around 12dB.

The effect of increasing the frequency of vibration of the vocal folds is to widen the gap between the frequency spikes in the glottal source spectrum, since these spikes occur at multiples of the base or fundamental frequency. The overall shape of the spectrum (the -12dB per octave attenuation) remains unchanged.

The average male voice has a source frequency of around 100Hz, female and child voices are typically higher in pitch: around 200 Hz for an average female voice and 200-300 Hz for children.

7.1.2. Unvoiced Speech

In unvoiced speech the sound source is not a regular vibration but rather vibrations are caused by turbulent airflow due to a constriction in the vocal tract. The various positions at which the vocal tract can constrict have been discussed in the earlier part of the course (the places of articulation for fricative sounds).

The sound created via a constriction is described as a noise source. It contains no dominating periodic component and has a relatively flat spectrum meaning that every frequency component is represented equally (in fact for some sounds the noise spectrum may slope down at around 6dB/octave). Looking at the time waveform of a noise source we see only a random pattern of movement around the zero axis. The spectrum looks the same, with random peaks and troughs but the overall trend is for it to be flat accross the frequency range.

7.2. The Filter

We have seen that a filter is a system that alters the frequency content of an acoustic signal. The filter in speech production is the vocal tract, a tube of uneven cross section which is closed at one end (by the glottis) and measures around 17.5 cm for the average male. As with any other filter, this tube has a characteristic spectrum. Importantly, this spectrum changes as the shape of the vocal tract changes during speech production. The different qualities of the speech sounds are produced by changing the shape of the vocal tract to produce a particular set of filter characteristics.

To understand the effects of the vocal tract we can model it as a cylindrical tube, closed at one end, which makes the calculation of the filter spectrum much easier. The spectrum of this tube has a set of peaks which correspond to resonant frequencies -- basically the frequencies of vibration which fit into the tube best. For a full discussion of this model see Harrington and Cassidy Section 3.3.1. Resonance is the property which allows you to get a single note out of a half full beer bottle by blowing over the top: the wavelength of the vibration is such that it fits within the tube. Change the length of the tube (by taking a swig of beer) and the note changes because now a different wavelength fits the tube best. In fact you can get more than one note if you are well practiced -- this is how a trumpet can play so many notes with only three `keys'. Each length of tube can fit a number of different resonances but the resonances frequencies are all related to the length of the tube.

The end result then, is that a cylindrical tube has a spectrum with a set of peaks at around 500Hz, 1500Hz, 2500Hz etc. which correspond to the resonant frequencies of the tube. Now, if the shape of the tube is changed, for example by adding a constriction halfway along it, the position of these resonances changes also. In the case of a constriction, the effect is to move the resonant frequencies relative to each other -- so that they are no longer equally spaced. Translating this back to the real vocal tract, the cylindrical tube corresponds approximately to a relaxed vocal tract and the corresponding pattern of resonance matches that of the schwa vowel. Constrictions are introduced by raising the tongue and moving the position of the raised tongue back and forward. This has the effect of modifying the positions of the resonances relative to each other, producing the different vowel sounds.

Another filter is involved in speech production, corresponding to the effect of the lips and the air beyond them on the acoustic signal. This lip-radiation filter has a characteristic spectrum which is a 6dB/octave rise -- that is the lower frequencies are attenuated wheras higher frequencies are not. This spectrum is multiplied by the spectrum of the vocal tract to produce an overall spectrum for the system.

7.3. Combining Source and Filter

As discussed earlier, the effect of filtering a waveform can be achieved either by convoltion in the time domain or by multiplication in the frequency domain. To produce speech sounds the source (voiced or unvoiced), is passed through the vocal tract filter as described above.

For voiced speech, the combined signal has peaks due to the harmonics of the glottal source superimposed upon peaks due to the resonances of the vocal tract. These peaks of the combined spectrum are formants, which as was discussed earlier in the unit are characteristic features of voiced speech sounds. It is important to note that the formant peak in the resulting spectrum will occur on the closest source harmonic peak to the resonance peak -- hence the resonant frequency and the formant frequency aren't necessarily the same thing.

Fricative speech sounds do not display the harmonic peaks of voiced speech but the characteristic peaks of the vocal tract filter can still be seen. Again, the position of these peaks serves to characterise the speech sound since it is intimately related to the vocal tract configuration that produced it.

7.4. Source Filter Synthesis

In the last section we discussed a simplified model of speech production which used a single tube of constant cross section to model the vocal tract. As was mentioned their, a slightly more realistic model includes a constriction in the tube to represent the position of the tongue during articulation. Fairly detailed models of articulation can be developed by using a small number of tubes to model the vocal tract -- a common model uses one tube for the area behind the tongue, one for the constriction created by the tongue, one for the area front of the tongue and a final tube for the lip apperture. The lengths and cross sectional areas of these tubes can be adjusted to account for a number of speech sounds. This model is very useful in predicting the gross changes of formant positions caused as the tongue is raised or lowered and moved from the back to the front of the mouth.

As discussed in Chapter 3 of Harrington and Cassidy, the effects of different vocal tract configurations can be explored via a nomogram which is a plot of the predicted resonant frequencies in a four tube model as the constriction tube is moved from along the vocal tract. The tube configuration is illustrated in Figure 7.2 while Figure 7.3 shows an example nomogram.

Figure 7.2. A four tube vocal tract model

Figure 7.3. An example nomogram. The solid line corresponds to a lip tube area of 4 cm2 (unrounded vowels) while the dotted line corresponds to a lip tube area of 0.65 cm2 (rounded vowels).

You can explore tube models in the exercises for this lecture. These models illustrates one approach to generating synthetic speech, namely articulatory synthesis. Models with a larger number of tubes can be used to produce more accurate reproduction of various speech sounds. It is also possible to derive a tube model for a given speech sound via LPC analysis -- this won't be covered in this unit but is discussed in Harrington and Cassidy, Chapter 8. Such a derived tube model can be used to re-synthesise speech by applying a suitable source to the model.

7.5. Source Filter Analysis

The source filter model describes the process of speech production in terms of two independent contributions from the sound source and the vocal tract filter. As a model of production, this allows us to synthesise sounds as described in the previous section, but the model can also be used to aid the analysis of speech sounds. The key is to notice the importance of the independence of the source and filter; this means that each makes a separate contribution to the characteristic features of the resulting speech sounds. The source, in voiced speech, is responsible for the pitch of the sound while the vocal tract filter is responsible for the location of the formants and the overal spectral shape. In a real speech spectrum the overall filter shape and the location of the formants is often drowned out by the effects of the source spectrum. If it were possible to remove the source effects then the two spectra could be studied separately to give a more accurate picture of the precursors of the speech sound.

Removing the effects of the source from a spectrum also has the effect of removing many sources of variability from the signal since many of these are realised in the source rather than in the filter. Hence removing the source can help make, for example, the vowel sounds more distinct from one another and could make vowel sounds from different speakers look more similar.

There are a number of techniques for separating source and filter in a speech signal. In this section I will discuss only one of these: Cepstral analysis.

Cepstral analysis relies on the observation that a logarithmic speech spectrum is made up from the source and filter spectra added together. To understand this, recall that filtering in the frequency domain is achieved by multiplying spectra together. Multiplication corresponds to adding logarithms and so a filtered spectrum can be derived by adding logarithmic (or dB) spectra. The procedure for cepstral analysis is to take the inverse fourier transform of the dB spectrum, thus converting the signal back into the time domain. This time domain signal is not a regular acoustic signal, since it was derived from the logarithmic spectrum; for this reason it is called a cepstrum, sort of a backward spectrum. The important property of the cepstrum is that since the dB spectrum is the sum of two spectra, so the cepstrum is the sum of two components corresponding to the source and the filter. Usefully the two components are separated: the lower end of the cepstrum corresponds to the filter while the higher end (or rather the middle of the reflected cepstrum) corresponds to the source. An illustration of the process of applying cepstral analysis is shown in Figure 7.4.

Figure 7.4. The process of generation a cepstrum and a cepstrally smoothed spectrum.

A further operation on the cepstrum is to remove the central part of the reflected cepstrum, the part that corresponds to the source, and perform a fourier transform to again generate a frequency domain version of the signal. Since the effect of the source has been removed from this spectrum, it shows the characteristics of the vocal tract filter. This is manifested as a much smoother spectrum than the original; the degree of smoothing depends upon the number of cepstral coefficients removed prior to the final fourier transform.

The technique of cepstral analysis is described in more detail in Harrington and Cassidy. The technique can be used to generate smoothed spectra which show the characteristics of the filter as described above. The cepstrum can also be used as a means of determining the fundamental frequency of voiced speech since the part of the cepstrum corresponding to the source is often manifested as a single spike. The location of this spike gives a measure of the frequency of the source signal.

Cepstral analysis is an important technique for examining speech signals. You should ensure that you understand the process of generating a smoothed spectrum and the reasons why the process works the way it does. Experimenting with the online examples may help you in coming to terms with it. This is just one of a family of techniques that seek to separate the source and filter in speech analysis. These techniques are useful in that they enable us to remove variability from the speech signal which is due to the source and concentrate our analysis on the vocal tract filter.

7.6. Exercises

In the `Interactive Worksheets' on the Harrington and Cassidy CDROM, Chapter 3 presents some exercises on the source filter model. You should already have looked at the first three sections of this chapter in regard to filters. Section 3.4 covers the glottal source. Sections 3.5 and 3.6 cover tube models of the vocal tract and allow you to visualise the spectrum of a simple tube model as well as see the effect of combining source and filter spectra. You can also use this tool to hear the difference between the filtered and unfiltered source.

Cepstral processing is covered in the `Practical Exercises with Xlisp-Stat' worksheets, in sections 4.2 to 4.4. These worksheets demostrate how to generate a cepstrally smoothed spectrum in Xlisp-Stat and explore the effect of the number of cepstral coefficients retained in the analysis.

7.6.1. Work to submit

You should submit your answers to the first three questions from worksheet 3.6 (Tube Models II).

From the Cepstral processing worksheet (Practical Exercises... 4.2) submit your estimates of the pitch of the stamp signal estimated from the cepstrum. Compare these with the estimates you derived via autocorrelation in an earlier worksheet.

7.7. Further Reading

Chapter 3 of Harrington and Cassidy covers the source filter thoery in detail. Cepstral analysis is covered in Chapter 6.

Speech Recognition

Chapter 8. The Architecture of ASR

In this session I want to introduce the major components of a speech recognition system and basically run through a preview of the contents of this unit. The reading for this session is the chapter by Power: The listening telephone -- automating speech recognition over the PSTN which is taken from a book Speech Technology for Telecommunications edited by Westall, Johnston and Lewis. This chapter provides a nice overview of the field, it does go into a little more detail than I'd prefer at this point but it's all material that we will be covering in depth later in the course.

In these notes I'll touch on the major points as I see them in this article.

8.1. The Problem of Speech Recognition

The domain of this course is the related technologies of speech recognition and speaker verification. Speech recognition is the process of finding a linguistic interpretation of a spoken utterance; typically this means finding the sequence of words that were spoken but, as we'll see later, there is other linguistic information that we could recover, such as prosodic cues. Speaker verification is the problem of verifying that a `voice print' from an claimed user is the same as that stored for that user.

Both of these applications share some components. They must first pre-process the acoustic signal to parameterise it in a more useable and useful form. They must both match the input signal against a stored pattern and then make a decision about accepting or rejecting the match. Each problem has its own difficulties: in speech recognition the number of possible utterances can be very large and differences between speakers confound the problem; in speaker verification only 100% accuracy is really acceptable and differences in a speaker's voice due to the time of day or the microphone used can be as large as the differences between two similar speakers.

Most of this unit will look at speech recognition since this is both the most complex topic and the most used technology. Many of the techniques we learn about are applicable to the verification task as well.

8.2. Basic Components

The overall architecture of a speech recognition or speaker verification system is shown in Power's Figure 9.1. This diagram covers both speech and speaker recognition. Notice in particular the two paths through the system for training on the one hand and pattern matching or recognition on the other. The training process is vitally important and involves building some kind of internal model of the speech signal. In speaker verification there is one model for each speaker and the model must capture enough about the speaker's voice characteristics to be able to match that speaker and no-one else. In speech recognition the models might correspond to words, syllables, phonemes or other units. Much of what we'll study in this course is to do with what constitutes a model and how they are compared with incomming speech signals.

8.2.1. Speech Capture

This is a topic that isn't often covered in detail but obviously a pre-requisite for any functional speech input system. In many systems there is a defined time when input is listened for, which simplifies the problem a little. Imagine though a `pervasive' recognition system which must be able to listen all the time and process whatever input is recieved, perhaps in a noisy environment. Such a system will need a complex speech/nospeech decision algorithm.

The issue of barge-in is also becoming important as we learn how to implement more intelligent spoken language dialog systems. When the user is able to respond at any time the system needs to be listening even while it is speaking -- just like we do.

8.2.2. Feature Extraction

We've already begun to explore the problem of how to extract useful information from a speech signal. We looked at techniques for separating source and filter components of the signal and providing parameterised versions of these. The input to speech recognition systems is a sequence of such parameter vectors. We will explore the motivations for selecting the parameters for an application and look at some more techniques for processing the input speech.

8.2.3. Models

The major component of a speech recognition system in terms of complexity is the Hidden Markov Model. An HMM is a statistical model of a class of speech signals which can be used to estimate the probability that a given input sequence belongs to the class on which it was trained. The class can be a word, phoneme or other unit as mentioned earlier. In general we want to use sub-word models as building blocks to produce word or multi-word recognisers; in this case the problem of finding an interpretation amongst the many possible interpretations becomes paramount. It is here that we begin to incorporate models of the language we are recognising to help constrain the search -- we don't consider arbitrary sequence of phonemes or words, only those which fit our models of the language and the task. These issues are discussed in Power's section 9.6 on Speech Recognition.

8.2.4. Result Processing

What is the result of the speech recognition process? When you use Dragon Naturally Speaking the result is text appearing in your word processor window. The raw output from the speech recognition engine though can be richer than this. To begin we can associate a probability score with the output which measures the closeness of match between the input and the model. In addition the search procedure can return more than one sentence or word hypothesis along with probability scores for each, allowing us to perhaps try one of the alternatives if they provide a more reasonable semantic interpretation. As an example, in a banking application the recognition system might return My savings amount as it's first hypothesis and My savings account as it's second; in a dialog system we might have additional contextual knowledge which could help us decide between the two.

8.3. Speaker Recognition

The general task of identifying who is talking can be described in terms of two different applications: speaker verification and speaker identification. In speaker verification (SV), an individual make a claim to his or her identity and we must verify that claim by matching a speech sample to a stored model. In speaker identification (SI), no claim is made and we must identify the speaker from a set of known speaker models. Speaker verification is used in security applications where a reliable method of identification is needed; for example for banking transactions over the telephone. Speaker identification might be used in a dialog system to allow tracking of multiple speakers in a dialog or to allow the system to recognise a user from earlier dialogs.

In order to verify a subject the input speech needs to be compared to a stored model for that user. The decision as to whether to accept the identity claim is based on a similarity threshold -- if the input is closer than the threshold, then we accept. However, this ignores all the other user's we know about so it is common to also take into account the similarity of the input to a group of `similar' users; we might say that if the input is closer to the claimed identity than it is to any of the similar users then we accept, otherwise we reject. As you can imagine, there are many variations on this theme.

One of the major considerations in SV applications in the reliability of the verification decision. The classic tradeoff in SV is between false acceptances and false rejections: if the threshold is relaxed, all of the valid claims will be accepted but so will many invalid ones; if the threshold is made more restrictive, most invalid claims will be rejected but so will many valid ones. In order to objectively evaluate a system's performance a measure called the Equal Error Rate (EER) is often quoted. EER measures the error rate of a system when the threshold is adjusted so that the number of false acceptances is equal to the number of false rejections. In a real application, the error performance can be adjusted to suit the level of security required: secure or convenient.

The speaker model used in SV or SI systems can be any kind of model of the acoustic signal. HMMs are common as is a method called vector quantisation; we'll discuss the details later. Unlike in speech recognition, the goal is to retain as much speaker specific information as possible in the signal.

Chapter 9. Feature Extraction for ASR

In this section we revisit the subject of the last part of SLP801 to look at how the speech signal is processed prior to matching against stored patterns. To provide an overview of the goals and methods of this stage we will look at a paper by Joe Picone (Signal Modeling Techniques in Speech Recognition from the Proceedings of the IEEE, 1993). This paper describes this stage as Signal Modeling to emphasis that we are attempting to derive a parameterised version of the input speech signal which captures it's important qualities while discarding unimportant and distracting features. In addition, we will refer to the last part of Chapter 6 (6.6 onwards) and Chapter 8 from Harrington and Cassidy for details on one signal modelling technique.

9.1. Sources of Variability in Speech

No two utterances of the same word or sentence are likely to give rise to the same digitial signal; this obvious point underlies the difficulty in speech recognition but also means that we may be able to extract more than just a sequence of words from the signal. If we can understand the different sources of variability in the signal then we can begin to approach the problem of either separating them out for subsequent analysis stages.

Let's consider the factors which could cause two random speech samples to differ from one another:

  • Phonetic identity: the two samples might correspond to different phonetic segments, eg. a vowel and a fricative. Another source of variability is coarticulation between phonemes.

  • Pitch: pitch and other source features such as breathiness and amplitude can be varied independently.

  • Speaker: different speakers have different vocal tracts and source physiology. Speakers also get colds, get emotional and do other things to modify their voice properties.

  • Microphone: and other properties of the transmission channel (eg. fixed vs. mobile telephone).

  • Environment: background noise, room acoustics, distance from microphone.

Clearly the kind of variability we want to preserve in our signal model for speech recognition is that due to phonetic identity, which is largely due to the vocal tract configuration. All of the other sources might be considered noise to be removed from the signal as far as possible; however, there might be situations when some of these could provide useful information. We are helped by the observation that these sources of variability are largely independant from each other and so might be amenable to some kind of de-convolution operation; remember that this is what we did to separate the source and filter components in cepstral analysis.

The requirements of the signal modelling stage will differ depending on the application environment: telephone speech recognition applications may need to use techniques to compensate for line noise and changes in the properties of the telephone line during each utterance; an in-car system will need to be robust to significant amounts of background noise. These may call for different signal modelling techniques in addition to different approaches to pattern matching.

9.1.1. Separating Phoneme Classes

The aim of signal modelling is to derive a feature vector such that the vectors for the same phoneme are as close to each other as possible while the vectors for different phonemes are maximally different to each other. In other words, the feature vectors we derive for all /a/ vowels should be very similar to each other (by some metric that we have yet to define) while being different to those for any other phoneme class.

This goal can best be achieved by finding a signal transform that isolates the variability due to phonetic class from the other sources of variability in the speech signal.

9.2. Linear Predictive Coding

In order to understand LPC and many other signal processing techniques it is necessary to look at the mathematics behind the z-transform. Unfortunately this is quite a hairy topic for a non-mathematical audience and so can be a hurdle in properly understanding these techniques. In our book we try to give a simple step-by-step coverage of the z-transform and how it relates to digital filtering and I would like to walk through the major points in these notes. The aim of this exercise is so that you can look at, for example, equation 18 in Picone's paper and not only stay upright but maybe even say, "hmm, I see".

The story begins back in Chapter 6 on page 178. If you remember back to 801, we talked about frequency domain filtering as a mulitplication operation between the spectra of the source and filter. In this section we develop a new way of talking about the frequency domain, the z-transform, such that we can characterise a signal and a filter as a polynomial, and perform the filtering operation by multiplying these together.

9.2.1. The Spectrum and the Z-transform

Section 6.6.1 explains how the amplitude and phase of a digital signal can be expressed as a vector of complex numbers and in particular that the spectrum of an impulse [1 0 0 0 ...] can be expressed as the vector X[k] = 1 + 0i for all k. This corresponds to saying that the unit impulse is the sum of sinusoids at all (digital) frequencies with amplitude 1 and phase 0 radians. Following this we see that the spectrum of a time shifted version of this impulse signal differs only in the phase component and can be neatly expressed via Euler's relation as X[k] = Aexp(-iWkp) (where exp denotes the exponential and the signal has an amplitude of A and has been shifted by P points relative to the original unit impulse). We then define a new variable z = exp(iWk) so that the shifted spectrum can be written as X(z) = Az-p -- a polynomial in z or the z-transform of the shifted impulse signal.

We next note that any signal can be seen as the sum of weighted, shifted, impulse signals ([2 3 4 5] = 2*[1 0 0 0] + 3*[0 1 0 0 ] etc). Hence (since we are dealing with linear time-invariant signals) the z-transform of an arbitary signal can be derived from the z-transforms of the shifted impulses.

x[n] = [4 2 1 3 0 0 0 0]

X(z) = 4 + 2z-1 + z-2 + 3z-3

(note that there's a typo in this example in the book (top of p185) where the exponent of the third term is -3 instead of -2.)

So, now we know what a z-transform is (a way of writing the spectrum of a signal) and we can derive the z-transform for any digital signal we come across. Note that by substituting z = exp(iWk), where k is a vector of digital frequency values, into any z-transform we can get back to the DFT of a signal.

9.2.2. Convolution as Multiplication

We know that we can apply a digital filter either by convolving a time domain signal with the impulse response of the filter or by multiplying the fourier spectra of the signal and the filter. Section 6.6.2 describes how we can also apply a filter by multiplying the z-transforms of the signal and the filter. In 6.6.3 we see that writing the source filter equation in terms of z-transforms:

Y(z)A(z) = B(z)X(z)

allows us to combine the recursive (A(z)) and non-recursive (B(z)) parts of the filter into a single transfer function H(z) which is the z-transform of the impulse response of the filter -- that is, the z-transform of the signal you would get if you passed a unit impulse [1 0 0 0 ...] through the filter. H(z) is a complicated polynomial in z but knowing that we can characterise any filter by a polynomial H(z) is the important point to take away.

Figure 6.16 summarises the relationship between signals, spectra and z-transforms. The impulse response and the spectrum are two ways of characterising the effect of filter; either can be derived from the convolution equation and we can transform between them using the DFT and IDFT.

9.2.3. LPC analysis

Moving now to Chapter 8 we will begin to look at Linear Predictive Coding. LPC is another method of separating out the effects of source and filter from a speech signal; similar in intention to cepstral analysis but using quite different methods. One way of thinking about LPC is as a coding method -- a way of encoding the information in a speech signal into a smaller space for transmission over a restricted channell. LPC encodes a signal by finding a set of weights on earlier signal values that can predict the next signal value:

y[n] = a[1]y[n-1] + a[2]y[n-1] + a[3]y[n-3] + e[n]

If values for a[1..3] can be found such that e[n] is very small for a stretch of speech (say one analysis window), then we can transmit only a[1..3] instead of the signal values in the window. The speech frame can be reconstructed at the other end by using a default e[n] signal and predicting subsequent values from earlier ones.