Jordan Smith
MUMT 611: Music Information Acquisition, Preservation, and Retrieval
Professor Ichiro Fujinaga
30 March 2008
Text Categorization and the Analysis of Lyrics
1. Introduction
The Music Information Retrieval (MIR) and Text Categorization (TC) communities are closely related: they research similar problems, such as automatic classification and similarity estimation, and they solve them with similar techniques, mainly machine learning (ML) (Sebastiani 2002). They have a shared history, too: in the 1990s, as computing power increased and a growing number of documents (both music and text) became available in digital form, MIR and TC research expanded to address the need to handle these vast quantities of data. The Music Information Retrieval Evaluation eXchange (MIREX) even took its text counterpart, the Text REtrieval Conference (TREC), as its pattern (Pienimäki 2006). Yet despite these shared concerns and techniques, one topic that lies at the intersection of the two fields has remained largely unexplored: lyrics (Maxwell 2007).
This is all the more surprising given the finding that close to 30% of MIR queries use lyrics data (Bainbridge et al. 2003). Compared to audio files, lyrics can be extremely easy to collect: several studies have established tools for the automated collection and cross-referencing of lyric data from online sources (Geleijnse and Korst 2006; Knees et al. 2005). Unlike most audio data, lyrics are also compact, and may be collected and distributed freely and legally. Lyrics are also a very reliable ground truth: they are usually a highly accurate transcription of what is uttered in a song, whereas a MIDI file may be a poor representation of what is played in a song (Logan et al. 2004). Lyrics thus represent a rich and accessible source of data that ought to be studied with some combination of MIR and TC techniques.
This paper provides an overview of text categorization (presuming a knowledge of machine learning). The next s