Note: Implementations of the Bleu metric and the F-Measure metric are available in the download section.
Machine Translation Evaluation
Introduction
One of the most difficult things in Machine Translation is the evaluation of a proposed system or algorithm.
In every part of science it is necessary to back up your claims with numbers. So if a new Machine
Translation algorithm is developed and one claims that it is better, it is necessary to put a number on this,
not only so that the claim can be verified, but also so that we can
see how much better it is. There is no science without evaluation to falsify your theories.
The problem with language is that it is not exact in the way that mathematical
models and theories in physics are. Language has some degree of vagueness, which makes it hard in Natural
Language Processing to put objective numbers on it.
In Machine Translation, for example, there is no single good translation.
Typically, there are many “perfect” translations of a given source sentence. These translations may vary
in word choice or in word order even when they use the same words. And yet humans can clearly distinguish
a good translation from a bad one. [KIS2002] What is a good translation? Even humans disagree on that.
The correlation between human judgments of MT quality is surprisingly low [TUR2003], and because humans
are the final measure, it is hard to provide a good measurement for MT. One might even wonder what it actually
means to make a good or perfect translation, but that philosophical question is not covered here.
Human Evaluation
Because humans are the gold standard for using language, human evaluation is the holy grail of
evaluation for machine translation. The problem, however, is that human evaluation is very time consuming and
therefore not always an option.
When using human evaluation one should also take care to maintain objectivity. Most machine evaluation metrics
are also based on human references, but in the case of human evaluation the evaluator is in most cases involved in the
research. One can, however, take the same approach as with machine evaluation: collect several evaluations
and take their average.
Human Evaluation - JHU & Egypt
Human evaluation can be approached in different ways. As a quick look at how human evaluation
has been done in practice, I take the example of the
1999 workshop at Johns Hopkins University. In their
final report they discuss their evaluation technique.
They claim that ideally you have monolingual evaluators
to evaluate the fluency and bilingual evaluators to evaluate the quality of the translated meaning. They used a scale
of 1 to 5, and because they were reviewing more than one Machine Translation algorithm, they shuffled the translations
randomly for every sentence.
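The shuffling and averaging described above could be organised roughly as follows. This is only a sketch under my own assumptions: the function names, the system names and the scores are hypothetical and are not taken from the workshop report.

```python
import random
from statistics import mean

def blinded_order(system_names, sentence_id, seed=0):
    """Return the systems in a random order that is reproducible per sentence,
    so an evaluator cannot tell which system produced which translation."""
    rng = random.Random(f"{seed}-{sentence_id}")
    order = list(system_names)
    rng.shuffle(order)
    return order

def average_scores(judgements):
    """judgements maps a system name to its scores on the 1 to 5 scale."""
    return {system: mean(scores) for system, scores in judgements.items()}

# Hypothetical systems and scores, for illustration only.
systems = ["system_a", "system_b", "system_c"]
print(blinded_order(systems, sentence_id=1))
print(average_scores({"system_a": [4, 5, 3], "system_b": [2, 3, 2]}))
```

Seeding per sentence keeps the presentation order reproducible, which makes it possible to map the scores back to the systems afterwards.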
Machine Evaluations
If it were possible to make an independent algorithm that could rank a specific machine translation,
then this evaluation algorithm would presumably itself be a better algorithm than the translating algorithm:
the problem of evaluating is the same as the problem of translating. However, there are some features which can be evaluated
automatically. The fluency of the output sentences can be checked, for example, by n-gram analysis. If
one or more reference translations are available, then it is possible to compare the output with the references
and to put a number on the notion of "good translation".
The big advantage of using Machine Evaluation is that the scoring is objective, while human scoring will
often differ not only from time to time but, even more, from human to human.
The remaining problem with machine evaluation is that it gives a certain feeling of objectivity which is not always
justified. Many choices, such as which references and which domain to use, are still completely subjective and influence
the final outcome. Therefore it is hard to compare two different Machine Translation algorithms objectively.
Machine Evaluations - Bleu Metric
(Sometimes called the Blue metric)
The Bleu Metric is an IBM-developed metric and it is probably the best known Machine Evaluation
for Machine Translation.
The paper [KIS2002] describes the algorithm.
The central idea is that the closer a machine translation is to a professional
human translation, the better it is. To check how close a candidate translation is to a reference
translation, an n-gram comparison is done between both translations.
However, because the evaluation is based on n-gram comparison with reference sentences, it is possible
to make sentences with a completely different meaning by switching words or n-grams and still get
high scores. It is unlikely that this will happen unintentionally (and if it happens
intentionally, the system clearly knows what is good and bad, but picks the wrong one).
The opposite can also occur: for example, when the Machine Translation algorithm consistently
translates a certain constituent as "New South Wales politics", it is penalized heavily on the larger n-grams when the reference texts
say "politics of New South Wales". The claim is that this effect is cancelled out
when enough human references are used, each with its own style. So one of the big downsides
of this technique is that in order to make it work, you need a lot of reference texts, which might be
hard to get.
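To make the n-gram comparison and the word-order effect concrete, here is a minimal Python sketch of clipped n-gram precision, the brevity penalty and their combination into one score. The function names are my own, and the score is computed for a single sentence for brevity, whereas [KIS2002] accumulates the counts over a whole test corpus. Treat it as an illustration, not as the reference implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped precision: a candidate n-gram counts at most as often as it
    occurs in the reference where it is most frequent."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    c = len(candidate)
    # Use the reference length closest to the candidate length.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Geometric mean of the precisions (uniform weights) times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

candidate = "New South Wales politics".split()
references = ["politics of New South Wales".split()]
for n in range(1, 5):
    print(n, modified_precision(candidate, references, n))
```

For this pair the unigram precision is 1, but the precision drops to 2/3 for bigrams, 1/2 for trigrams and 0 for 4-grams, which is the penalty on larger n-grams mentioned above.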
Features:
- Evaluation by reference
- Fluency evaluation (by n-gram)
- Does not use the source language
Conclusion: The best known and most widely adopted Machine Evaluation for (machine) translation.
It has the weakness of all Machine Evaluations: the judgement is based not on whether
an algorithm captures and translates the meaning, but on how well it scores against references. It is still the one to go for if you're looking
for a Machine Evaluation for (machine) translation.
Machine Evaluations - NIST
NIST is a metric developed by NIST (the National Institute of Standards and Technology). It is based on the same ideas
as the Bleu Metric of IBM, and it can be seen as an upgrade of that metric.
This paper describes the metric.
It is also an n-gram counting metric, but the idea is to fix two problems with the Bleu metric:
Firstly, Bleu uses a geometric mean of the n-gram precisions. The weights for the different p_n are chosen to be
uniform: w_n = 1/N. According to NIST this can lead to counterproductive variance due to low co-occurrences for the
larger values of N.
Secondly, Bleu treats all n-grams equally. That means that n-grams which occur often and carry little information
(for example the bi-gram ["in" "the"]) have as much impact on the overall precision as information-rich n-grams
(for example the bi-gram ["counter" "productive"], which carries much more information than the previous bi-gram). How
information rich an n-gram is, is estimated from how often it occurs: frequency is negatively correlated with
the amount of information the n-gram carries.
Some smaller changes relative to Bleu are a different brevity penalty, not being case insensitive, etc.
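The information weighting can be sketched as below. The formula Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)) is how I recall the NIST definition, so treat it as an assumption. The tiny reference corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

def ngram_counts(sentences, max_n):
    """Count all n-grams up to max_n over the tokenised reference sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def information(gram, counts, total_words):
    """Assumed NIST-style weight: log2 of (count of the n-gram's prefix) over
    (count of the n-gram); for unigrams the prefix count is the corpus size."""
    prefix_count = counts[gram[:-1]] if len(gram) > 1 else total_words
    return math.log2(prefix_count / counts[gram])

# Hypothetical reference corpus.
references = [s.split() for s in (
    "in the house",
    "in the garden",
    "a counter productive plan",
    "the counter offer",
)]
counts = ngram_counts(references, max_n=2)
total_words = sum(len(tokens) for tokens in references)

print(information(("in", "the"), counts, total_words))
print(information(("counter", "productive"), counts, total_words))
```

In this toy corpus "in" is always followed by "the", so that bi-gram gets weight 0, while "counter productive" gets weight 1 because "counter" is followed by "productive" only half of the time.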
Features:
- Evaluation by reference
- Fluency evaluation (by n-gram)
- Does not use the source language
Conclusion: Being based on the Bleu metric it has the same strong and weak points as Bleu.
Horizontal axis: reference text. Vertical axis: candidate text.
| E | | | | | • | | | | | |
| D | | | | • | | | | | | |
| C | | | • | | | | | | | • |
| I | | | | | | | | | • | |
| A | • | | | | | | | • | | |
| B | | • | | | | | • | | | |
| C | | | • | | | | | | | • |
| H | | | | | | | | | | |
| | A | B | C | D | E | F | B | A | I | C |
MMS(C,R)=√(1²+2²+4²) ≈ 4.6
Machine Evaluations - F-Measure
The F-Measure is described in this paper [TUR2003].
It is a metric developed at New York University.
In this metric "maximum matching" from graph theory is used. Subsets of co-occurrences in the candidate and reference text are counted in such a
way that a token is never counted twice. On this match size a Precision and a Recall are defined: Precision is the match size relative to the length of the candidate text
and Recall is the match size relative to the length of the reference text. A reward for longer
matches is introduced by taking the square root of the sum of the squared lengths of the different matching runs (other powers and roots than 2 can also be used). This reward is bigger when longer runs are found, which takes
care of the "fluency" aspect of the translations.
A small example is "borrowed" from [TUR2003]. In the grid shown above we see three matches, of length 1, 2 and 4. Of course other matches could have been made, but this
is the "maximum match".
Now two measures are defined, Precision and Recall:
Precision(candidate|reference) = MMS(candidate, reference) / |Candidate|
Recall(candidate|reference) = MMS(candidate, reference) / |Reference|
The final F-measure is the harmonic mean of the precision and the recall, which is defined as
(2*Precision*Recall)/(Precision+Recall)
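As a worked example, the definitions above can be applied to the grid example with runs of length 1, 2 and 4, a candidate of 8 tokens and a reference of 10 tokens. The function names are my own; this is a sketch of the formulas as stated here, not the authors' implementation.

```python
def match_size(run_lengths, exponent=2):
    """MMS: root of the summed powers of the run lengths (exponent 2 by default)."""
    return sum(length ** exponent for length in run_lengths) ** (1 / exponent)

def f_measure(run_lengths, candidate_length, reference_length, exponent=2):
    """Return (precision, recall, F) for the given runs of a maximum matching."""
    mms = match_size(run_lengths, exponent)
    precision = mms / candidate_length
    recall = mms / reference_length
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

precision, recall, f = f_measure([1, 2, 4], candidate_length=8, reference_length=10)
print(round(precision, 3), round(recall, 3), round(f, 3))  # 0.573 0.458 0.509
```

With exponent 1 the match size reduces to the plain number of matched tokens, so the reward for longer runs disappears.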
Features:
- Precision and Recall definition
- Fluency definition (on MMS)
- MMS calculation is NP hard
Conclusion: [TUR2003] claim that this F-Measure correlates better with human judgement than either Bleu or NIST. It is good not to have an arbitrary (maximum) length of the n-grams,
because this metric automatically rewards larger chunks of text which are identical to (one of the) reference texts. Although MMS calculation can be NP-complete when calculating with
powers and roots bigger than 1, there exist greedy algorithms that find the true maximum 80% to 99% of the time [TUR2003:p3].
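One simple greedy approximation is to repeatedly take the longest contiguous run shared by the still unused tokens. The sketch below is my own illustration of that idea and is not necessarily the greedy algorithm evaluated in [TUR2003]. The token sequences are read off the grid example, assuming the candidate is read from the bottom row upwards.

```python
def greedy_runs(candidate, reference):
    """Greedily pick the longest run of identical, still unused tokens that is
    contiguous in both texts; repeat until nothing matches. Returns run lengths."""
    used_c = [False] * len(candidate)
    used_r = [False] * len(reference)
    runs = []
    while True:
        best = (0, 0, 0)  # (length, start in candidate, start in reference)
        for i in range(len(candidate)):
            for j in range(len(reference)):
                k = 0
                while (i + k < len(candidate) and j + k < len(reference)
                       and not used_c[i + k] and not used_r[j + k]
                       and candidate[i + k] == reference[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length == 0:
            return runs
        for k in range(length):
            used_c[i + k] = used_r[j + k] = True
        runs.append(length)

candidate = "H C B A I C D E".split()
reference = "A B C D E F B A I C".split()
print(greedy_runs(candidate, reference))  # [4, 2, 1]
```

On this example the greedy choice happens to recover the runs of the maximum match, but in general it can miss the true maximum.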
Machine Evaluations - Sentence level Evaluation
Most Machine Evaluation techniques tend to correlate with human evaluations only on larger texts. Since it is sometimes necessary to evaluate shorter amounts of text, and since evaluation on larger texts can always be done by averaging over shorter texts, a metric for short text evaluation can be preferable.
In the paper A Learning Approach to Improving Sentence-Level MT Evaluation, such a method is proposed. The approach of this technique is to turn the evaluation task into a classification task: is a given sentence produced by a human or by a machine? Because this question is very hard to answer, a learning technique is introduced to learn the differences between the classes. This means that the evaluation technique ultimately depends on the training data provided. That has the advantage that as machine translation algorithms change and get better, the evaluation metric can follow: it can simply be retrained with new machine-produced translation data. However, it has the disadvantage that it is hard to compare machine translation efforts over time.
The learning is implemented with Support Vector Machines. Evaluating a translated sentence means classifying it, where the distance to the separation boundary between the two classes is a measure of how strongly we are convinced that this is a machine-produced or a human-produced sentence. If a new sentence falls in the human-produced zone, the translation succeeds in fooling the classifier and therefore must be of high quality. If it is clearly a machine-made translation, the quality is regarded as much lower.
To use Support Vector Machines, the linguistic objects must be represented in a numerical multi-dimensional space. Since we also want to use reference texts (otherwise we could make very human-like sentences which have nothing to do with the text we are translating), four features are used:
- n-gram precisions, as in BLEU
- The length of the reference texts, which corresponds to the length penalty of BLEU
- Word error rate, expressed as the minimum edit distance between hypothesis and reference on the word level
- Position-independent word error rate, computed by removing the words in the shorter translation from those in the longer one and returning the minimum size of the remaining set
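The last two features in the list can be sketched directly from their descriptions. The function names and the example sentences are my own, and the values are left unnormalised, since the normalisation used in [KUL2004] is not described here.

```python
from collections import Counter

def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions needed
    to turn the hypothesis into the reference."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            current.append(min(previous[j] + 1,          # delete hyp_word
                               current[j - 1] + 1,       # insert ref_word
                               previous[j - 1] + (hyp_word != ref_word)))
        previous = current
    return previous[-1]

def position_independent_errors(hypothesis, reference):
    """Remove the words of the shorter text from the longer one (as multisets)
    and count what remains; word order is ignored."""
    shorter, longer = sorted((hypothesis, reference), key=len)
    remaining = Counter(longer) - Counter(shorter)
    return sum(remaining.values())

hypothesis = "the cat sat on mat".split()
reference = "on the mat the cat sat".split()
print(word_edit_distance(hypothesis, reference))
print(position_independent_errors(hypothesis, reference))
```

The example shows why both features are used: the hypothesis contains almost all the right words, so the position-independent count is small, while the different word order gives a much larger edit distance.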
In their paper [KUL2004], the authors claim to have much higher correspondence with human evaluation of translations than the previously mentioned metrics.
Machine Evaluations - Meteor
Meteor is a machine translation evaluation metric developed at Carnegie Mellon University.
This paper describes the metric.
This metric again calculates n-gram overlaps between a produced translation and reference texts. In this case unigram overlaps are calculated. It does not only count exact unigram overlaps, but also uses
WordNet to find unigram overlaps, for example in cases where the produced translation has chosen a synonym of a word in the reference text.
Another feature of Meteor is that stemmed words can be used, so that when a produced translation chooses a slightly different grammatical structure, the metric still spots that the same words are used.
Most older metrics, like Bleu, expect that these problems (synonyms and stemming) are resolved by using enough reference texts. In practice, however, it is very hard to get hold of enough reference texts, so a lot of evaluation is done on 1 or 2 reference texts.
Meteor has a separate module to address ordering, which explains why higher-order n-grams are not used. Basically, a reordering penalty is calculated based on how many chunks in the produced text need to be moved around to get the reference text.
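The chunk-based penalty can be illustrated with the sketch below. It only uses exact word matches (no stemming, no WordNet) and a naive left-to-right alignment, both of which are my simplifications. The constants in the recall-weighted mean and in the penalty are the ones I recall from the original Meteor description, so treat them as assumptions.

```python
def align_exact(candidate, reference):
    """Naive alignment: map each candidate word to the first unused identical
    reference word. Real Meteor chooses among alignments more carefully."""
    used = set()
    alignment = []  # (candidate index, reference index) pairs
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break
    return alignment

def count_chunks(alignment):
    """A chunk is a run of matches that are adjacent in both texts."""
    chunks = 0
    previous = None
    for i, j in alignment:
        if previous is None or (i, j) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (i, j)
    return chunks

def meteor_like(candidate, reference):
    alignment = align_exact(candidate, reference)
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # assumed weights
    penalty = 0.5 * (count_chunks(alignment) / matches) ** 3     # assumed constants
    return f_mean * (1 - penalty)

candidate = "New South Wales politics".split()
reference = "politics of New South Wales".split()
print(count_chunks(align_exact(candidate, reference)))  # 2
print(round(meteor_like(candidate, reference), 3))      # 0.765
```

The four matched words fall into two chunks ("New South Wales" and "politics"), so the reordering costs only a small penalty instead of wiping out the higher n-gram precisions as in Bleu.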
Meteor is a Precision and Recall Metric.
In their paper they claim to achieve higher correlation with human reviewers than the previously mentioned metrics.
Machine Evaluations - String accuracy metrics
Some cruder methods for Machine Translation evaluation are string accuracy methods.
This paper describes these evaluations.
These metrics try to calculate the edit distance from a candidate sentence to a reference sentence. The Simple String Accuracy method calculates
this by counting the number of insertions (I), the number of deletions (D) and the number of substitutions (S). The final result is defined as:
SSA=1-((I+D+S)/|Reference Sentence|)
A variation on this theme is the Generation String Accuracy, which introduces a variable M (misplaced words) for a deletion and an insertion of the same
word, so that misplaced words are penalized less than in the simple string accuracy:
GSA=1-((M+I+D+S)/|Reference Sentence|)
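Both formulas can be sketched on top of a standard word-level edit distance with a backtrace. The pairing rule for M below (one deleted and one inserted occurrence of the same word count as a single move, and are then no longer counted in I and D) is my reading of the description above, and the function names are my own.

```python
from collections import Counter

def edit_operations(candidate, reference):
    """Return (inserted words, deleted words, substitution count) for one
    minimum edit script that turns the candidate into the reference."""
    m, n = len(candidate), len(reference)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = candidate[i - 1] != reference[j - 1]
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    inserted, deleted, substitutions = [], [], 0
    i, j = m, n
    while i > 0 or j > 0:  # walk back along one optimal path
        mismatch = i > 0 and j > 0 and candidate[i - 1] != reference[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            substitutions += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deleted.append(candidate[i - 1])
            i -= 1
        else:
            inserted.append(reference[j - 1])
            j -= 1
    return inserted, deleted, substitutions

def string_accuracies(candidate, reference):
    """Return (SSA, GSA) for tokenised sentences."""
    inserted, deleted, s = edit_operations(candidate, reference)
    ssa = 1 - (len(inserted) + len(deleted) + s) / len(reference)
    moves = sum((Counter(inserted) & Counter(deleted)).values())
    gsa = 1 - (moves + (len(inserted) - moves) + (len(deleted) - moves) + s) / len(reference)
    return ssa, gsa

ssa, gsa = string_accuracies("New South Wales politics".split(),
                             "politics of New South Wales".split())
print(round(ssa, 2), round(gsa, 2))  # 0.4 0.6
```

Here "politics" is deleted at the end and inserted at the front, which the simple accuracy counts as two errors and the generation accuracy as one move.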
Features:
- Edit distance as a measure for accuracy
Conclusion: These are simple but crude methods for automatic machine translation evaluation.
On 18-04-2005 I gave a talk about this topic in our Language Technology Group, and
the slides on MT Evaluation are now available.
(I made this presentation with the LaTeX beamer package; if you want to know how I did this, you can ask for the source code.)
KIS2002 - Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, Bleu: a Method for Automatic Evaluation of Machine Translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318. http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf
KUL2004 - Alex Kulesza and Stuart M. Shieber, A Learning Approach to Improving Sentence-Level MT Evaluation, Proceedings of the Tenth Conference on Theoretical and Methodological Issues in Machine Translation, Baltimore, 2004. http://www.eecs.harvard.edu/~shieber/Biblio/Papers/kulesza-mt-evaluation04.pdf
TUR2003 - Joseph P. Turian, Luke Shen, and I. Dan Melamed, Evaluation of Machine Translation and its Evaluation, a revised version of the paper to be presented at MT Summit IX, New Orleans, LA, 2003. http://nlp.cs.nyu.edu/publication/papers/turian-summit03eval.pdf