An Evaluation Exercise for Word Alignment
Rada Mihalcea
Department of Computer Science
University of North Texas
Denton, TX 76203
rada@
Ted Pedersen
Department of Computer Science
University of Minnesota
Duluth, MN 55812
tpederse@
Abstract
This paper presents the task definition, resources, participating
systems, and comparative results for the shared task on word
alignment, which was organized as part of the HLT/NAACL 2003
Workshop on Building and Using Parallel Texts. The shared task
included Romanian-English and English-French sub-tasks, and drew
the participation of seven teams from around the world.
1 Defining a Word Alignment Shared Task
The task of word alignment consists of finding correspondences
between words and phrases in parallel texts. Assuming a sentence
aligned bilingual corpus in languages L1 and L2, the task of a word
alignment system is to indicate which word token in the corpus of
language L1 corresponds to which word token in the corpus of
language L2.
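As an illustration of this definition, the output of a word alignment system for one sentence pair can be viewed as a set of links between token positions. The Python sketch below is a minimal example under stated assumptions: the 0-based position pairs, the function name aligned_token_pairs, and the English-French sentence pair are hypothetical and chosen for illustration only; they are not the submission format used in the shared task.

```python
from typing import List, Set, Tuple

# One alignment link: (position in the L1 sentence, position in the L2 sentence).
# 0-based positions are an assumption made for this sketch.
Link = Tuple[int, int]


def aligned_token_pairs(
    l1_tokens: List[str], l2_tokens: List[str], links: Set[Link]
) -> List[Tuple[str, str]]:
    """Return the (L1 word, L2 word) pairs named by a set of alignment links."""
    pairs = []
    for i, j in sorted(links):
        # Reject links that point outside either sentence.
        if not (0 <= i < len(l1_tokens) and 0 <= j < len(l2_tokens)):
            raise IndexError(f"link ({i}, {j}) is outside the sentence pair")
        pairs.append((l1_tokens[i], l2_tokens[j]))
    return pairs


# Illustrative sentence pair (not taken from the shared task data).
english = ["the", "white", "house"]
french = ["la", "maison", "blanche"]
links = {(0, 0), (1, 2), (2, 1)}

print(aligned_token_pairs(english, french, links))
```

In this representation a token may appear in several links or in none, which leaves room for one-to-many correspondences and for unaligned words.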
As part of the HLT/NAACL 2003 workshop on "Building and Using
Parallel Texts: Data Driven Machine Translation and Beyond", we
organized a shared task on word alignment, where participating teams
were provided with training and test data, consisting of sentence
aligned parallel texts, and were asked to provide automatically
derived word alignments for all the words in the test set. Data for
two language pairs were provided: (1) English-French, representing
languages with rich resources (20 million word parallel texts), and
(2) Romanian-English, representing languages with scarce resources
(1 million word parallel texts). Similar to the Machine Translation
evaluation exercise organized by NIST¹, two subtasks were defined,
with teams being encouraged to participate in both subtasks.
¹ /speech/tests/mt/
1. Limited resources, where systems are allowed to use
only the resources provided.
2. Unlimited resources, where systems are allowed to
use any resources in addition to th