A Comparative Study of Topic Identification on Newspaper and E-mail
Brigitte Bigi, Armelle Brun, Jean-Paul Haton, Kamel Smaïli and Imed Zitouni
LORIA/INRIA-Lorraine, 615 rue du Jardin Botanique, BP 101,
F-54600 Villers-lès-Nancy, France
e-mail: {bigi, brun, jph, smaili, zitouni}@loria.fr
Abstract

This paper presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study proved very fruitful for our research: statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification rate of 80% on a general newspaper corpus but does not exceed 30% on an e-mail corpus. Another method gives the best result on e-mails, but does not behave the same way on a newspaper corpus. We also show in this paper that almost

[...] speech recognition systems, selecting documents for Web engines, etc. Another promising direct application of topic identification (TID) is e-mail routing. This application consists of dispatching e-mail messages in accordance with their content. For example, a hot-line that receives a large number of e-mails per day would like to dispatch them automatically to several boxes. Each box corresponds to a specific problem to be solved, which can be considered as a topic.

2 E-mail topic identification

Routing e-mail messages is a direct application of TID. It amounts to