Abstract:
Reordering has been an important topic in statistical machine translation
(SMT) for as long as SMT has existed. State-of-the-art SMT systems such
as Pharaoh (Koehn, 2004a) still employ a simplistic model of the reordering
process to do non-local reordering. This model penalizes any reordering,
regardless of the words involved. A reordering is only selected if it leads to a translation
that looks like a much better sentence than the alternative.
Recent developments have, however, seen improvements in translation
quality following from syntax-based reordering. One such development
is the pre-translation approach that adjusts the source sentence to resemble
target language word order prior to translation. This is done based on
rules that are either manually created or automatically learned from
word-aligned parallel corpora.
We introduce a novel approach to syntactic reordering. This approach
provides better exploitation of the information in the reordering rules and
eliminates problematic biases of previous approaches. Although the approach
is examined within a pre-translation reordering framework, it easily
extends to other frameworks. Our approach significantly outperforms a
state-of-the-art phrase-based SMT system and previous approaches to pre-translation
reordering, including (Li et al., 2007; Zhang et al., 2007b; Crego
& Mariño, 2007). This is consistent both for a very close language pair,
English-Danish, and a very distant language pair, English-Arabic.
We also propose automatic reordering rule learning based on a rich set
of linguistic information. As opposed to most previous approaches that
extract a large set of rules, our approach produces a small set of predominantly
general rules. These provide a good reflection of the main reordering
issues of a given language pair. We examine several parameters that may
influence the quality of the learned rules.
Finally, we provide a new approach for improving automatic word alignment.
This word alignment is used in the above task of automatically learning
reordering rules. Our approach learns from hand-aligned data how to
combine several automatic word alignments into one superior word alignment.
The automatic word alignments are created from the same data,
preprocessed with different tokenization schemes, thereby exploiting
the different strengths that these schemes exhibit in word
alignment. We achieve a 38% error reduction for the automatic word alignment