Robert Dale: Possible Honours and Masters Projects


This page lists some specific honours and masters projects I would be happy to supervise. They are broken down here into three separate areas, corresponding to the three areas in which I carry out research.

Of course, there is usually some scope to tailor projects to the interests of specific students. You might want to read my document on supervision before deciding whether you'd like me as a supervisor. If you'd like to discuss anything here further, just mail me to arrange a chat.

Intelligent Document Processing

Corpus-based Correction of OCR-introduced Spelling Errors

A common way to archive legacy documents is to run them through a scanner to produce a PDF file, to which a searchable text layer is added using optical character recognition (OCR). Unfortunately, OCR is not perfect, so spelling errors are introduced that damage the effectiveness of search techniques.

Using an existing corpus of several thousand scanned academic papers (in the ACL Anthology), this project aims to develop automatic spelling correction techniques that use the corpus itself as a source of evidence for spelling corrections. For example, if the misrecognised string spe11in8 appears in a document, a simple distance metric may find that similar strings, such as spelling, are much more frequent in the corpus, and on the basis of that frequency choose one of them as the correction. Of course it gets much more complicated than this, which is why it's interesting ...
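
As a rough illustration of the frequency idea above, here is a minimal Python sketch. It assumes the corpus has already been reduced to a table of token counts; the toy counts, the function names and the thresholds (max_distance, min_ratio) are all invented for the example, and a plain Levenshtein distance stands in for whatever metric the project would actually settle on.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suggest_correction(token, frequencies, max_distance=3, min_ratio=10):
    """Return a nearby string that is far more frequent than `token`, else `token`."""
    own_count = frequencies.get(token, 0)
    best, best_count = token, own_count
    for candidate, count in frequencies.items():
        # Skip the token itself and anything not clearly more frequent than it.
        if candidate == token or count < min_ratio * max(own_count, 1):
            continue
        if count > best_count and edit_distance(token, candidate) <= max_distance:
            best, best_count = candidate, count
    return best


# Toy stand-in for token counts gathered from the whole corpus (assumed data).
corpus_frequencies = Counter(
    {"spelling": 120, "spe11in8": 1, "spilling": 4, "correction": 80}
)
print(suggest_correction("spe11in8", corpus_frequencies))
```

A character-level distance treats every substitution as equally likely, whereas OCR confuses some pairs (1 and l, 8 and g) far more often than others; weighting the distance by such confusions is one of the places where it gets more complicated.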

Inferring Document Structure

Documents have a physical structure -- typically consisting of pages, columns, and paragraphs -- but they also have a logical structure, consisting of title information, sections, subsections, footnotes, tables and so on. PDF documents are primarily intended for rendering on a screen or a printer, and so are focussed on physical structure; they tend not to contain much information, if any, about the logical structure of the document. But that logical structure can be important for a variety of purposes; for example, knowing the logical structure of a document can assist in information retrieval, information extraction and text summarisation.

The aim of this project is to take a corpus of PDF documents, and to build a system that can automatically extract the logical structure of the document text, so that this can be provided in XML form for a variety of more sophisticated processing stages, or for a more flexible rendering model (for example as a hierarchically unfolding document in a web browser).
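
To make the target output concrete, the following Python sketch turns a list of text lines into a small XML tree. It assumes that some PDF extraction step (not shown, and hypothetical here) has already produced a font size for each line, and it uses font size alone as the cue for logical role; the element names (document, title, section, para) are likewise made up for the example. Real documents would need far richer evidence than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical extractor output: (font size in points, text) for each line.
lines = [
    (18.0, "Inferring Document Structure"),
    (14.0, "1 Introduction"),
    (10.0, "Documents have a physical structure."),
    (10.0, "They also have a logical structure."),
    (14.0, "2 Method"),
    (10.0, "We label each line by its font size."),
]


def build_logical_structure(lines):
    """Map font sizes to title / section / para elements (a crude heuristic)."""
    sizes = sorted({size for size, _ in lines}, reverse=True)
    title_size, body_size = sizes[0], sizes[-1]
    document = ET.Element("document")
    section = None
    for size, text in lines:
        if size == title_size and document.find("title") is None:
            # The first line in the largest font is taken to be the title.
            ET.SubElement(document, "title").text = text
        elif size > body_size:
            # Anything larger than body text opens a new section.
            section = ET.SubElement(document, "section", heading=text)
        else:
            parent = section if section is not None else document
            ET.SubElement(parent, "para").text = text
    return document


root = build_logical_structure(lines)
print(ET.tostring(root, encoding="unicode"))
```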

Spoken Language Dialog Systems

An Automated Newsreader

Automated newsreaders -- 'talking heads' that read out news stories in a synthesized voice -- have been constructed before. These take a textual news source and then use a text-to-speech synthesis engine, in conjunction with an animated head, to deliver that news in spoken language.

The aim of this project is to build such a system with increased realism, by incorporating both appropriate facial gestures and appropriate intonation in the voice. Watch some newsreaders carefully to see how they use their facial expressions to communicate information, and listen to how they use prosody to increase interest in what they are saying. The challenge here is to find techniques that will allow us to derive appropriate audio-visual features from a 'flat text' provided as input.

Natural Language Generation

The Automatic Generation of Spoken Stock Market Reports

Stock market data -- information about the prices of stocks and shares -- is a valuable commodity that many people pay money to receive in a timely fashion. That's fine if you're sitting at your desk with a web browser, or have some other internet-enabled device that allows you to access a relevant website. But there are situations where you'd really like to have the information provided verbally; and ideally you'd like to have it personalised to your own interests and stock holdings.

The aim of this project is to build a system that interrogates a stock market price database and, in conjunction with a user profile, works out how to construct a text that summarises the relevant information for that user; this text is then delivered via a text-to-speech system, so that the user can access it when they are driving or in some other hands-busy, eyes-busy context.
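
The selection and wording steps can be sketched with simple templates. In the Python below, the price table, the ticker names and the profile layout are all invented stand-ins for the real database and user profile, and the output is only the text that would be handed to a text-to-speech engine (the speech step itself is not shown). A real generator would go well beyond fixed templates.

```python
def describe_change(name, previous_close, latest):
    """Render one holding's movement as a sentence suited to speech output."""
    change = (latest - previous_close) / previous_close * 100
    if abs(change) < 0.05:
        return f"{name} is unchanged at {latest:.2f}."
    direction = "up" if change > 0 else "down"
    return f"{name} is {direction} {abs(change):.1f} percent at {latest:.2f}."


def build_report(prices, profile):
    """Summarise only the user's holdings, biggest movers first."""
    held = [(name, *prices[name]) for name in profile["holdings"] if name in prices]
    held.sort(key=lambda item: abs(item[2] - item[1]) / item[1], reverse=True)
    return " ".join(describe_change(*item) for item in held)


# Hypothetical data: ticker -> (previous close, latest price).
prices = {"ABC": (10.00, 10.50), "XYZ": (20.00, 18.00), "QRS": (5.00, 5.00)}
profile = {"holdings": ["ABC", "XYZ"]}

print(build_report(prices, profile))
```

Ordering by the size of the move is one guess at what a listener would want to hear first; the user profile could equally drive the ordering by the value of each holding.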

Generating News Summaries for SMS Delivery

There are many news services available via the web, but there's a problem when it comes to delivering news to a mobile phone: you only get 160 characters in an SMS message.

The aim of this project is to develop techniques that can analyse a news story and produce a summary that will fit into 160 characters. In many cases the headline will already be short enough, but for that same reason it may not contain much information, so we need to extract more information-rich content from the text of the story, and then find ways to compress it into the available space. This involves using what we might think of as two forms of compression: SMS compression makes use of common abbreviatory conventions to save space, while linguistic compression attempts to analyse the structure of a sentence to determine which parts of that sentence can be dropped without loss of meaning.
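
Of the two forms, SMS compression is the easier to sketch. The Python below applies abbreviations one at a time, left to right, and stops as soon as the text fits, so that a summary is never abbreviated more than it needs to be. The abbreviation table and the sample sentence are invented for the example, and linguistic compression is not attempted here at all.

```python
import re

SMS_LIMIT = 160

# Illustrative abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {
    "government": "govt",
    "tomorrow": "2moro",
    "tonight": "2nite",
    "because": "bc",
    "people": "ppl",
    "before": "b4",
    "please": "pls",
    "and": "&",
    "for": "4",
    "are": "r",
    "you": "u",
    "to": "2",
}

# Whole-word, case-insensitive match on any key in the table.
WORD_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b", re.IGNORECASE
)


def sms_compress(text, limit=SMS_LIMIT):
    """Abbreviate one word at a time until `text` fits or nothing is left to do."""
    while len(text) > limit:
        shorter = WORD_PATTERN.sub(
            lambda match: ABBREVIATIONS[match.group(0).lower()], text, count=1
        )
        if shorter == text:  # no abbreviable word remains
            break
        text = shorter
    return text


story = (
    "The government said tonight that people are being asked to save water "
    "because reservoir levels are low, and that new limits on garden hoses "
    "are to apply before tomorrow evening."
)
message = sms_compress(story)
print(len(message), message)
```

The function can still return a text that is over the limit when the table runs out of substitutions, which is the point at which linguistic compression would have to take over.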


Please send comments or queries about this web site to Robert.Dale@mq.edu.au
Last Modified: 4 March 2007