Terms and Document Representation with Generalized Latent Semantic Analysis
Abstract

Document indexing and representation of term-document relations are very important issues for document clustering and retrieval. In this paper, we present Generalized Latent Semantic Analysis as a framework for computing semantically motivated term and document vectors. Our focus on term vectors is motivated by recent success of co-occurrence based measures of semantic similarity obtained from very large corpora. Our experiments demonstrate that GLSA term vectors efficiently capture semantic relations between terms and outperform related approaches on the synonymy test. We also show that term-

...resent orthogonal dimensions which makes an unrealistic assumption about the independence of terms within documents.

Modifications of the representation space, such as representing dimensions with distributional term clusters (Bekkerman et al., 2003) and expanding the document and query vectors with synonyms and related terms as discussed in (Levow et al., 2005), improve the performance on average. However, they also introduce some instability and thus increased variance (Levow et al., 2005). The language modelling approach (Ponte and Croft, 1998; Berger and Lafferty, 1999) used in information retrieval uses bag-of-words document vectors to model document and collection based term distributions.

Since the document vectors are constructed in a