
How to install
Download Bleu.cpp and save it in a suitable directory.
Invoke the compiler (for example, in a Linux environment):
g++ -O3 Bleu.cpp -o Bleu


NAME
Bleu - IBM's Bleu metric for evaluating (machine) translation quality

SYNOPSIS
Bleu [options] <Candidate file> <Reference file 1> [Reference file 2..n]

DESCRIPTION
Bleu calculates the Bleu score of a candidate file against at least one reference file.
The score is a geometric mean of n-gram matches between the candidate and the reference(s) for different sizes of n.
Smoothing is performed only when there are sentences containing fewer tokens than n.
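
As an illustration of what an n-gram match is, the following is a minimal sketch of clipped n-gram counting as described in the Bleu paper. It is not taken from Bleu.cpp; the names Tokens, countNgrams, clippedMatches and MatchResult are hypothetical, and sentences are assumed to be already split into tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper types, not part of Bleu.cpp.
using Tokens = std::vector<std::string>;
using NgramCounts = std::map<Tokens, int>;

// Count every n-gram of size n in one tokenized sentence.
NgramCounts countNgrams(const Tokens& sentence, std::size_t n) {
    assert(n >= 1);
    NgramCounts counts;
    // A sentence shorter than n contributes no n-grams at all.
    for (std::size_t i = 0; i + n <= sentence.size(); ++i) {
        Tokens gram(sentence.begin() + i, sentence.begin() + i + n);
        ++counts[gram];
    }
    return counts;
}

struct MatchResult {
    int matched;  // clipped number of candidate n-grams found in a reference
    int total;    // number of n-grams in the candidate
};

// Each candidate n-gram is credited at most as many times as it occurs
// in the single reference where it is most frequent ("clipping").
MatchResult clippedMatches(const Tokens& candidate,
                           const std::vector<Tokens>& references,
                           std::size_t n) {
    NgramCounts maxRef;
    for (const Tokens& ref : references) {
        const NgramCounts refCounts = countNgrams(ref, n);
        for (const auto& entry : refCounts) {
            int& best = maxRef[entry.first];
            best = std::max(best, entry.second);
        }
    }
    MatchResult result{0, 0};
    const NgramCounts candCounts = countNgrams(candidate, n);
    for (const auto& entry : candCounts) {
        result.total += entry.second;
        auto it = maxRef.find(entry.first);
        if (it != maxRef.end()) {
            result.matched += std::min(entry.second, it->second);
        }
    }
    return result;
}
```

For the candidate "the the the the" and the single reference "the cat is on the mat", this yields 2 matched out of 4 unigrams, because "the" occurs only twice in the reference. The sketch covers only the counting step, not how the counts are combined into the final score.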

Warning: The choice of n is somewhat arbitrary. The default is n=1, which is a poor setting. Usually n=3 or n=4 is used, or higher values of n.

For more information, read the paper on the Bleu metric published by IBM.

OPTIONS
-h Show a short help message and quit
-v Turn on verbose mode
-c Turn on case sensitivity
Bleu as defined by IBM treats tokens as equal when they match regardless of case. This is useful when a word stands at the start of a sentence in the candidate text but not in its counterpart in the reference text, or vice versa.
However, in some languages case carries general (linguistic) information, and attention to case is desired.
Use this switch to turn on case sensitivity. The default is case-insensitive.
-n Must be followed by a positive integer
Set the maximum n-gram size. Default is 1.
Note: 1 is not a good value, see the warning above.
-i Ignore the first token in every sentence
Sometimes the first token is used for sentence numbering; it must then be ignored when calculating the Bleu score.
This switch discards the first token of every sentence, both in the candidate file and in the reference file(s).
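
The effect of the -c and -i switches on a single sentence can be sketched as follows. This is an illustration under assumptions, not the code in Bleu.cpp: the function names toLower and normalizeSentence and the two boolean parameters are hypothetical, and the lowercasing shown handles only single-byte characters.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper type, not part of Bleu.cpp.
using Tokens = std::vector<std::string>;

// Lowercase one token. std::tolower works per byte, so this is only
// reliable for single-byte encodings such as ASCII.
std::string toLower(std::string token) {
    std::transform(token.begin(), token.end(), token.begin(),
                   [](unsigned char ch) {
                       return static_cast<char>(std::tolower(ch));
                   });
    return token;
}

// caseSensitive mirrors the -c switch, ignoreFirst mirrors the -i switch.
Tokens normalizeSentence(Tokens tokens, bool caseSensitive, bool ignoreFirst) {
    if (ignoreFirst && !tokens.empty()) {
        tokens.erase(tokens.begin());  // drop e.g. a sentence number
    }
    if (!caseSensitive) {
        for (std::string& token : tokens) {
            token = toLower(token);  // "The" and "the" now compare equal
        }
    }
    return tokens;
}
```

Applying the same normalization to the candidate and to every reference before counting n-grams keeps the comparison symmetric.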


File format
The program expects both the candidate file and the reference file in the same file format.
In this format, an asterisk (*, the Kleene star) on an otherwise empty line is used as a sentence delimiter.
A file might look like this:
First sample line of our toy corpus
*
Second test line
*
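
A reader for this format could look roughly like the sketch below. It is not the parser used in Bleu.cpp: the function name readSentences is hypothetical, tokens are assumed to be separated by whitespace, and text not followed by a final delimiter is accepted as a last sentence.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type, not part of Bleu.cpp.
using Tokens = std::vector<std::string>;

// Read sentences from a stream in which a line containing only "*"
// ends the current sentence.
std::vector<Tokens> readSentences(std::istream& in) {
    std::vector<Tokens> sentences;
    Tokens current;
    std::string line;
    while (std::getline(in, line)) {
        // Split the line on whitespace (assumed tokenization).
        std::istringstream words(line);
        Tokens lineTokens;
        std::string token;
        while (words >> token) {
            lineTokens.push_back(token);
        }
        if (lineTokens.size() == 1 && lineTokens[0] == "*") {
            sentences.push_back(current);  // delimiter closes the sentence
            current.clear();
        } else {
            current.insert(current.end(), lineTokens.begin(), lineTokens.end());
        }
    }
    if (!current.empty()) {
        sentences.push_back(current);  // tolerate a missing final delimiter
    }
    return sentences;
}
```

Reading the sample file above with this sketch gives two sentences, with seven and three tokens respectively.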
Tip
A well-known alternative is the .sgml.ref format, an XML-like markup.
To convert from this format to the format accepted by this program, use grep and sed:

grep -v -e "<refset" -e "<doc" -e "^</" file.sgml.ref | sed -e "s/<seg>//" -e "s/<\/seg>/\n*/"