
A Comparative Study of Topic Identification on Newspaper and Email.pdf

Published: 2015-09-25 · approx. 28,400 characters · 4 pages
A Comparative Study of Topic Identification on Newspaper and E-mail

Brigitte Bigi, Armelle Brun, Jean-Paul Haton, Kamel Smaïli and Imed Zitouni
LORIA/INRIA-Lorraine
615 rue du Jardin Botanique, BP 101, F-54600 Villers-lès-Nancy, France
e-mail: {bigi, brun, jph, smaili, zitouni}@loria.fr

Abstract

This paper presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study is very fruitful for our research. Statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification rate of 80% on a general newspaper corpus but does not exceed 30% on the e-mail corpus. Another method gives the best result on e-mails, but does not behave the same way on a newspaper corpus. We also show in this paper that almost

speech recognition systems, selecting documents for Web engines, etc. Another promising direct application of TID is e-mail routing. This application consists of dispatching e-mail messages in accordance with their content. For example, a hotline which receives a large number of e-mails per day would like to dispatch them automatically to several boxes. Each box corresponds to a specific problem to be solved, which can be considered as a topic.

2 E-mail topic identification

Routing e-mail messages is a direct application of TID. It amounts to