Publications
2008
Choosing the Right Translation: A Syntactically Informed Approach, S.Zwarts and M. Dras. In: Coling 2008, Manchester, To Appear
One style of Multi-Engine Machine Translation architecture involves choosing the best of a set of outputs from different systems. Choosing the best translation from an arbitrary set, even in the presence of human references, is a difficult problem; it may prove better to look at mechanisms for making such choices in more restricted contexts.

In this paper we take a classification-based approach to choosing between candidates from syntactically informed translations; specifically translations where source language sentences are syntactically reordered before being passed to an SMT system. The idea is that using multiple parsers as part of a classifier could help detect syntactic problems in this context that lead to bad translations; these problems could be detected on either the source side---perhaps sentences with difficult or incorrect parses could lead to bad translations---or on the target side---perhaps the output quality could be measured in a more syntactically informed way, looking for syntactic abnormalities.

We show that there is no evidence that the source side information is useful. However, a target-side classifier, when used to identify particularly bad translation candidates, can lead to significant improvements in Bleu score. Improvements are even greater when combined with existing language and alignment model approaches.

2007
Syntax-Based Word Reordering in Phrase-Based Statistical Machine Translation: Why Does it Work?, S.Zwarts and M. Dras. In: MT Summit XI, September 2007, Copenhagen PDF
Most natural language applications have some degree of preprocessing of data: tokenisation, stemming and so on. In the domain of Statistical Machine Translation (SMT) it has been shown that word reordering as a preprocessing step can help the translation process, but it is unclear why. We propose two possible reasons for the observed improvement: (1) that the reordering explicitly matches the syntax of the source language more closely to that of the target language; or (2) that it fits the data better to the mechanisms of phrasal SMT. In previous work from German to English, for example, hand-written language-specific reordering rules both match the German more closely to English syntax, and compress heads and dependants into the PSMT phrasal window. Whether the source of the improvement is (1) or (2) has not been determined, although most other work assumes the former.

To identify the effects of each possible cause, we carry out two sets of experiments. For (1) we reverse the language-dependent syntactic reordering such that heads and dependants are moved apart. For (2), we propose a generic approach to minimising dependency distances in reordering that does not explicitly match target language word order and that does not require language-specific rules; its aim is to investigate the source of the improvement rather than to beat state-of-the-art systems. The results show that (1) and (2) individually do still lead to improvements in translation quality, but each weaker than the original, suggesting that both features are necessary for a strong improvement. A consequence of this is that it is possible to gain half the improvement of language-specific rules through one generic one.
Statistical Machine Translation of Australian Aboriginal Languages: Morphological Analysis with Languages of Differing Morphological Richness, S.Zwarts and M. Dras. In: Australasian Language Technology Workshop, December 2007, Melbourne PDF
Morphological analysis is often used during preprocessing in Statistical Machine Translation. Existing work suggests that the benefit would be greater for more highly inflected languages, although to our knowledge this has not been systematically tested on languages with comparable morphology. In this paper, two comparable languages with different amounts of inflection are tested, to see whether the benefit of morphological analysis used during the translation process depends on the morphological richness of the language. For this work we use indigenous Australian languages: most Australian Aboriginal languages are highly inflected, where words can take a considerable number of postfixes when compared to Indo-European languages, and for languages in the same (Pama-Nyungan) family, the morphological system works similarly. We show in this preliminary work that morphological analysis clearly benefits the richer of the two languages investigated, but that the benefit is more equivocal in the case of the other.

2006
Unsupervised Measurement of Translation Quality Using Multi-engine, Bi-directional Translation., M. van Zaanen, S. Zwarts. In: AI 2006: Advances in Artificial Intelligence, Volume 4304/2006, ISSN: 0302-9743, ISBN: 978-3-540-49787-5, Springer Berlin, December 2006, Pages: 1208-1214 PDF
Lay people discussing machine translation systems often perform a round trip translation, that is, translating a text into a foreign language and back, to measure the quality of the system. The idea behind this is that a good system will produce a round trip translation that is exactly (or perhaps very close to) the original text. However, people working with machine translation systems intuitively know that round trip translation is not a good evaluation method. In this article we show empirically that round trip translation cannot be used as a measure of the quality of a machine translation system. Even when the translations of multiple machine translation systems are taken into account, to reduce the impact of errors of a single system, round trip translation cannot be used to measure machine translation quality.
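
One intuition for why a round trip score can mislead is sketched below. This is an illustration only, not the experimental setup of the article: `similarity` is a stand-in token-overlap measure built on Python's `difflib`, and `copy_system` is a hypothetical "translator" that simply copies its input. Such a system gets a perfect round trip score while its forward output is useless.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for a real MT metric."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def round_trip_score(text: str, forward, backward) -> float:
    """Translate forward, translate back, and compare with the original text."""
    return similarity(text, backward(forward(text)))


def copy_system(text: str) -> str:
    """Hypothetical degenerate system: returns the input unchanged."""
    return text


source = "the cat sleeps"
reference = "de kat slaapt"  # hand-made Dutch reference, for illustration only

# Round trip looks perfect even though nothing was translated.
rt = round_trip_score(source, copy_system, copy_system)

# Forward quality against the reference shares no tokens at all.
fwd = similarity(reference, copy_system(source))
```

Here `rt` is 1.0 and `fwd` is 0.0, so the round trip score says nothing about the quality of the forward translation in this toy case.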

This Phrase-Based SMT System is Out of Order: Generalised Word Reordering in Machine Translation., S. Zwarts, M. Dras. In: Proceedings of Australasian Language Technology Workshop, December 2006 PDF
Abstract:
Many natural language processes have some degree of preprocessing of data: tokenisation, stemming and so on. In the domain of Statistical Machine Translation it has been shown that word reordering as a preprocessing step can help the translation process.

Recently, hand-written rules for reordering in German--English translation have shown good results, but this is clearly a labour-intensive and language pair-specific approach. Two possible sources of the observed improvement are that (1) the reordering explicitly matches the syntax of the source language more closely to that of the target language, or that (2) it fits the data better to the mechanisms of phrasal SMT; but it is not clear which. In this paper, we apply a general principle based on dependency distance minimisation to produce reorderings. Our language-independent approach achieves half of the improvement of a reimplementation of the handcrafted approach, and suggests that reason (2) is a possible explanation for why that reordering approach works.
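
The quantity that dependency distance minimisation tries to reduce can be sketched as follows. This is a minimal illustration, not the reordering procedure of the paper: the function names, the example sentence, its hand-made dependency heads, and the chosen permutation are all assumptions made for the example.

```python
from typing import List, Tuple


def total_dependency_distance(heads: List[int]) -> int:
    """Sum of |token position - head position| over all non-root tokens.

    heads[i] is the 0-based index of token i's head, or -1 for the root.
    """
    return sum(abs(i - h) for i, h in enumerate(heads) if h >= 0)


def reorder(tokens: List[str], heads: List[int],
            order: List[int]) -> Tuple[List[str], List[int]]:
    """Apply a permutation; order[k] is the old index of the token at new position k."""
    new_pos = {old: new for new, old in enumerate(order)}
    new_tokens = [tokens[old] for old in order]
    # Re-express every head index in terms of the new positions.
    new_heads = [new_pos[heads[old]] if heads[old] >= 0 else -1 for old in order]
    return new_tokens, new_heads


# Hypothetical parse of "ich habe das Buch gelesen" with "habe" as the root.
tokens = ["ich", "habe", "das", "Buch", "gelesen"]
heads = [1, -1, 3, 4, 1]
before = total_dependency_distance(heads)

# Move the participle next to its head, the auxiliary.
new_tokens, new_heads = reorder(tokens, heads, [0, 1, 4, 2, 3])
after = total_dependency_distance(new_heads)
```

In this toy case the total distance drops from 6 to 5, because the participle now sits next to the auxiliary it depends on. A reordering approach built on this principle would search for permutations that lower the total; how that search is done is not shown here.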
2004
CBR in Dependency-based Machine Translation., S. Zwarts, A. Nijholt, R. op den Akker, M. Poel. In: Proceedings of Konvens 2004, E.Buchberger (ed.), Schriftenreihe der Oesterreichischen Gesellschaft fuer Artificial Intelligence, Band 5, ISBN 3-85027-005-X, Wien, September 2004, 205-208. PDF
Abstract:
A case-based reasoning approach is introduced as a learning technique in the domain of machine translation of natural language. In our approach syntactical and semantic features are part of the cases in the case-base. To implement this, dependency analysers of sentences in the source and target languages are used. The case-base is filled with a learning mechanism that uses a parallel corpus of sentences with their translations. This case-base is used to make new translations.
2003
Using CBR as a learning technique for Machine Translation, 2003, Master Thesis PDF
Abstract:
This report introduces a case-based approach as a learning technique in the domain of natural language translation. The general assumption is that in order to make correct translations both syntactical aspects, such as grammar, and semantical aspects, such as meaning (in this research mostly used at word level), must be used. To implement this, dependency analyzers of the languages are used. The case-base is filled by a learning mechanism using a parallel corpus of complete sentences with their translations. This case-base is used to make new translations by transforming new situations into old situations, retrieving the old solutions of the old situations and transforming the old solutions into solutions for the new situations. Here new situations are new sentences to translate, old situations are previously seen situations in the parallel corpus, solutions for the old situations are translations from the parallel corpus, and the new solution is a translation of the sentence which we wanted to have translated.
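
The retrieve-and-adapt cycle described above can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions, not the thesis implementation: it matches on surface tokens rather than dependency analyses, and `CASE_BASE`, `LEXICON`, `retrieve`, `adapt` and `translate` are hypothetical names with hand-made English to Dutch data.

```python
from difflib import SequenceMatcher

# Old situations with their old solutions (a tiny stand-in for a parallel corpus).
CASE_BASE = [
    ("the cat sleeps", "de kat slaapt"),
    ("the dog eats", "de hond eet"),
]

# Hypothetical word-level lexicon used to adapt an old solution.
LEXICON = {"the": "de", "cat": "kat", "dog": "hond", "sleeps": "slaapt", "eats": "eet"}


def retrieve(sentence: str):
    """Return the stored case whose source sentence is most similar to the input."""
    words = sentence.split()
    return max(CASE_BASE,
               key=lambda case: SequenceMatcher(None, words, case[0].split()).ratio())


def adapt(sentence: str, case) -> str:
    """Transform the old solution by swapping the words that differ from the old situation."""
    old_source, old_target = case
    new_words, old_words = sentence.split(), old_source.split()
    if len(new_words) != len(old_words):
        return old_target  # no adaptation rule in this sketch: reuse the old solution
    result = old_target.split()
    for new, old in zip(new_words, old_words):
        if new != old and new in LEXICON and old in LEXICON:
            result = [LEXICON[new] if w == LEXICON[old] else w for w in result]
    return " ".join(result)


def translate(sentence: str) -> str:
    """New situation -> nearest old situation -> adapted old solution."""
    return adapt(sentence, retrieve(sentence))


translation = translate("the dog sleeps")
```

For the unseen sentence "the dog sleeps", the sketch retrieves a three-word case that differs in one word and substitutes that word's translation, giving "de hond slaapt".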