Students

Current students.

Completed students.


Proposed Honours Projects

In general, I'm open to project suggestions if there's something you already have in mind; have a look at my research page to get an idea of what sorts of things I'm interested in. Regarding specific topics from me, here are a few:

SPAM FILTERING

Most statistical spam filters try to detect spam using the content of the email as the basis of the feature set; spam senders can defeat this by including content in the emails that will fool the filters. It would be interesting to see how integrating stylistic and fluency features could work in a spam filter, as these might be more difficult to counterfeit than content. (After all, if spammers can generate grammatically immaculate text, they will have solved a major problem in language technology.)
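To give a flavour of what "stylistic features" might mean in practice, here is a rough sketch of a feature extractor. The particular features and the function name style_features are illustrative assumptions only, not a proposed design; fluency features (e.g. scores from a language model or parser) would need more machinery and are not covered here.

```python
import re

def style_features(text):
    """Compute a few crude, content-independent style measures for a message.

    The feature set here is a hypothetical illustration, not a tested design.
    """
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)   # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Mean length of the words in the message
        "avg_word_len": sum(len(w) for w in words) / n_words,
        # Vocabulary variety: distinct words / total words
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Proportion of characters that are upper case
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        # Proportion of characters that are exclamation marks
        "exclaim_ratio": text.count("!") / n_chars,
        # Mean number of words per sentence
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

print(style_features("Buy NOW! Buy now!"))
```

Features like these could be appended to the usual content-based feature vector, and the project would then measure whether they improve detection.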

GENERATING SCIENTIFIC PAPERS

Three students at MIT have built SCIgen, "a program that generates random Computer Science research papers, including graphs, figures, and citations"; a paper generated by it has actually been accepted at a conference.

This is done using a hand-built context-free grammar. However, it would be interesting to see if the same can be done with a statistical generator (e.g. one that uses n-gram statistics). There are actually some interesting research questions here: Can the sentences be made grammatical? What additional information needs to be used to do this? What kinds of statistical language models work best? What is the quality like relative to the CFG-generated text? Since there is no underlying content to be represented, unlike in ordinary text generation or summarisation, this is a good testbed for these questions.
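As a starting point, the simplest statistical generator of this kind is a bigram model that picks each word based only on the previous one. The sketch below is a minimal illustration under that assumption; the function names and the toy corpus are made up for the example, and a real project would train on a large collection of papers and use higher-order or smoothed models.

```python
import random
from collections import defaultdict

def train_bigrams(sentences):
    """Map each word to the list of words observed to follow it."""
    followers = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev].append(nxt)
    return followers

def generate(followers, rng, max_words=30):
    """Sample one sentence by walking the bigram table from the start marker."""
    word, out = "<s>", []
    while len(out) < max_words:
        # Repeated entries in the list make frequent bigrams more likely
        word = rng.choice(followers[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# Toy corpus, invented for illustration only
corpus = [
    "we present a novel method for parsing",
    "we evaluate the method on a large corpus",
    "the results show a significant improvement",
]
model = train_bigrams(corpus)
print(generate(model, random.Random(0)))
```

Output from a model like this is locally plausible but frequently ungrammatical over longer spans, which is exactly where the research questions above come in.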

MACHINE TRANSLATION

See research page for a description of the project that this would be part of.

This would involve investigating a specific language pair and examining issues in machine translation with respect to that pair. Recent work at Johns Hopkins University has been exploring integrating structural approaches (where you design rules for translation) with statistical approaches (where the system "learns" translation). A specific project would be to replicate the preliminary work from Johns Hopkins with a closer language pair (say, English-French), and to evaluate results relative to purely structural or purely statistical approaches. A more general project in this area is also possible. Component parts of such a project would be some subset of:

PARAPHRASE

See research page for a description of the project that this would be part of.

The main part would be to build a system, using an existing broad-coverage parser together with an existing mathematical optimisation package, that would take a text (e.g. a paper) and fit it to a set of constraints (e.g. a 2000-word limit with sentences of middling complexity). An extension would be to look at discourse-level parsing, using a tool such as SPADE, and incorporate that.
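To make the "fit it to a set of constraints" idea concrete, here is a toy sketch that treats the task as a selection problem: each sentence has several candidate rewordings, each with a penalty, and we pick one per sentence so that the total stays under a word limit at minimum penalty. Everything here is an assumption for illustration: the function name fit_to_limit, the hand-written alternatives and the penalty values are hypothetical, and brute-force search stands in for the optimisation package the real system would use.

```python
from itertools import product

def fit_to_limit(alternatives, word_limit):
    """Pick one rewording per sentence, minimising total penalty within a word limit.

    alternatives: one list per sentence of (text, penalty) pairs, where the
    penalty is a hypothetical cost of departing from the original wording.
    Returns the chosen texts, or None if no combination fits the limit.
    """
    best, best_cost = None, float("inf")
    # Exhaustive search: only feasible for tiny inputs
    for choice in product(*alternatives):
        n_words = sum(len(text.split()) for text, _ in choice)
        cost = sum(penalty for _, penalty in choice)
        if n_words <= word_limit and cost < best_cost:
            best, best_cost = [text for text, _ in choice], cost
    return best

# Invented example: original wording has penalty 0, paraphrases cost more
alternatives = [
    [("the very large cat sat", 0), ("the big cat sat", 1)],
    [("it slept all day long", 0), ("it slept", 2)],
]
print(fit_to_limit(alternatives, word_limit=9))
```

In the actual project the alternatives would come from the parser-driven paraphrase component, and further constraints (such as sentence complexity) would be added alongside the word limit.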


Past Teaching


[Mark's home page]


Last updated 22 January 2008