How to install
Download PhraseBuilder.py and save it in a suitable directory.
(unix/linux) Make sure the first line, "#!/usr/bin/python", points to where python is installed.
(unix/linux) Give it execute rights: chmod ugo+x PhraseBuilder.py
(unix/linux) It can now be run with ./PhraseBuilder.py <args>
(windows) Run it from the command line or in your favourite python environment: python PhraseBuilder.py <args>


NAME
PhraseBuilder - Program to build phrase tables for phrase-based statistical machine translation from bilingual aligned corpora

SYNOPSIS
PhraseBuilder <Alignment File 1> <Alignment File 2>

DESCRIPTION
Phrase-based statistical machine translation is a paradigm which builds on the statistical machine translation models by IBM.
The big change, however, is that words are not treated individually but as phrases, n-grams of co-occurring words.
One of the problems with statistical machine translation is how the parameter estimates are acquired. For "classical" statistical machine translation the Giza++ toolkit is available.
For phrase-based statistical machine translation we need to build the so-called "phrase tables". This tool takes the alignment files that a Giza++ training run produces as a by-product of its parameter estimation, and calculates the probabilities for phrases.
See section 2 of The Pharaoh Manual for the theory behind phrase table generation; this implementation is based on section 2.3.2.
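
To make "calculates the probabilities for phrases" concrete, here is a minimal sketch of relative-frequency estimation, the usual way phrase translation probabilities are obtained from extracted phrase pairs. It is not taken from PhraseBuilder.py; the function name phrase_probabilities and the toy phrase pairs are assumptions made for illustration only.

```python
from collections import Counter

def phrase_probabilities(phrase_pairs):
    """Estimate phi(a | b) = count(a, b) / count(b) by relative frequency.

    phrase_pairs is a list of (a, b) phrase pairs, one entry per time the
    pair was extracted from the aligned corpus (hypothetical input format).
    """
    pair_counts = Counter(phrase_pairs)
    b_counts = Counter(b for _, b in phrase_pairs)
    return {(a, b): n / b_counts[b] for (a, b), n in pair_counts.items()}

# Toy data, invented for this example.
pairs = [
    ("the house", "la maison"),
    ("the house", "la maison"),
    ("the home", "la maison"),
    ("house", "maison"),
]
probs = phrase_probabilities(pairs)
```

Here "la maison" was seen three times, twice paired with "the house", so that pair gets probability 2/3, while ("house", "maison") gets 1.0.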

PARAMETERS
When run without parameters it shows a short help message and quits.
Otherwise the program needs two parameters, both of which must be alignment files from a Giza++ training.
The idea behind generating phrase tables is that we utilise the mutual information of a Giza++ training session from language <A> to language <B> and a session with the language pair reversed, from language <B> to language <A>. So it is necessary to run Giza++ twice, the second time on the exact same text but with the source and target language swapped. The phrase table takes its language direction from the first alignment file.
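
The following sketch illustrates how two alignment directions can be combined. It assumes the common Giza++ A3 layout, in which the last line of each sentence entry looks like "NULL ({ }) the ({ 1 }) house ({ 2 3 })", and it uses plain intersection of the two link sets. Whether PhraseBuilder.py combines the files by intersection or by a different heuristic is not stated here, so treat this as an illustration only; parse_a3_line and intersect are hypothetical names.

```python
import re

# One token followed by its alignment set, e.g. "house ({ 2 3 })".
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_a3_line(line):
    """Return links (i, j): word i on this line is aligned to word j of the
    other sentence. Positions are 1-based; the leading NULL token is skipped."""
    links = set()
    for i, (_word, indices) in enumerate(TOKEN_RE.findall(line)):
        if i == 0:  # assumed to be the NULL token
            continue
        for j in indices.split():
            links.add((i, int(j)))
    return links

def intersect(forward, backward):
    """Keep only links found in both runs; backward links are flipped first."""
    return forward & {(i, j) for (j, i) in backward}

# Toy alignment lines, invented for this example.
forward = parse_a3_line("NULL ({ }) the ({ 1 }) blue ({ 3 }) house ({ 2 3 })")
backward = parse_a3_line("NULL ({ }) la ({ 1 }) maison ({ 3 }) bleue ({ 2 })")
links = intersect(forward, backward)
print(sorted(links))  # -> [(1, 1), (2, 3), (3, 2)]
```

The link (3, 3), which only one direction proposes, is dropped; this is the sense in which running Giza++ twice lets the two sessions confirm each other.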

OUTPUT
Output is produced on stdout, which can be redirected to a file if needed.
Output is in plain text format usable by the popular Pharaoh decoder.

NOTE
The maximum phrase length is set to 3, since longer phrases hardly improve translation quality but increase memory use heavily.
It is possible to set another maximum phrase length by changing the value on line 5:
MaxPhraseLength=3
into another value.
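
To show what a maximum phrase length controls, here is a simplified sketch of extracting phrase pairs that are consistent with a word alignment. It is not the code from PhraseBuilder.py: extract_phrases and its max_len parameter are hypothetical stand-ins for the script's MaxPhraseLength, positions are 0-based, and the sketch only takes the tightest matching span on the other side.

```python
def extract_phrases(src, tgt, links, max_len=3):
    """Collect phrase pairs of at most max_len words per side that are
    consistent with the alignment links, given as 0-based (s, t) pairs."""
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            ts = [t for (s, t) in links if s1 <= s <= s2]
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            if t2 - t1 + 1 > max_len:
                continue
            # Reject the span if a target word inside it links outside [s1, s2].
            if any(t1 <= t <= t2 and not s1 <= s <= s2 for (s, t) in links):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

# Toy sentence pair and alignment, invented for this example.
src = ["the", "blue", "house"]
tgt = ["la", "maison", "bleue"]
links = {(0, 0), (1, 2), (2, 1)}

for n in (1, 2, 3):
    print(n, len(extract_phrases(src, tgt, links, max_len=n)))
```

On this toy sentence the limit of 1 yields three pairs, 2 yields four and 3 yields five. On real corpora the number of candidate spans grows much faster with the limit, which is where the memory cost comes from.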

EXAMPLE
./PhraseBuilder.py English_French.A3.final French_English.A3.final > English_French_PhraseTable
This produces phrase probabilities usable for English-French translation, so Φ(English|French) is calculated, given that English_French.A3.final contains the alignment from French to English.