EAMT 2005 Conference Proceedings
An Efficient Phrase-to-Phrase Alignment Model for Arbitrarily
Long Phrase and Large Corpora
Ying Zhang Stephan Vogel
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
{joy+,vogel+}@
Abstract. Most statistical machine translation (SMT) systems use phrase-to-phrase
translations to capture local context information, leading to better lexical choices and more
reliable word reordering. Long phrases capture more context than short phrases and result
in better translation quality. On the other hand, the increasing amount of bilingual data
makes storing all possible phrases a serious problem. In this paper, we describe a novel
phrase-to-phrase alignment model that allows for arbitrarily long phrases and works for
very large bilingual corpora. The model is efficient in both time and space, and the
resulting translations are better than those of state-of-the-art systems.
1. Introduction
In recent years, various phrase-to-phrase
translation models (Och 1999; Marcu & Wong
2002; Koehn 2003; Zhang 2003) have shown
great advantages over word-based systems
(Brown 1990). We believe that longer phrases
encapsulate more of the words' context, so
their translation quality is expected to be higher
than that of short phrases. Unfortunately, given
the increasing volume of parallel bilingual
data for some major languages such as Arabic
and Chinese, storing and loading all possible
phrase translations from the training corpus
becomes more and more expensive in terms of
computation time and space. To keep the
phrasal translation model at a reasonable size,
some models (Koehn 2003; Zhang 2003)
limit the length of phrases to no more
than 3 words, while others (Vogel 2003)
sub-sample the training corpus based on the test
data to scale the problem down. In this paper,
we introduce a new strategy to cope with this
problem. Instead of aligning the phr