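An LDA Topic Model Based Collection Selection Method for Distributed Information Retrieval
HE Xu-feng¹, CHEN Ling¹, CHEN Gen-cai¹, QIAN Kun¹, WU Yong², WANG Jing-chang²
1 (College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China)
2 (Zhejiang Hongcheng Computer System Co., Ltd., Hangzhou 310009, China)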
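Abstract: Because different collections contribute differently to the final results in distributed information retrieval, a collection selection method based on the LDA topic model is proposed. First, the method uses query-based sampling to obtain a description of each collection. Second, it builds an LDA topic model to compute the topic relevance between the query and a document. Next, it combines keyword-based relevance with topic-based relevance to estimate the overall relevance between the query and each document in the sample set. Finally, using the collection that each sampled document belongs to, it estimates the relevance between the query and each collection and selects the M most relevant collections for retrieval. The experiments use Rm, P@n, and MAP as evaluation metrics to verify the performance of the collection selection method. The results show that the proposed method more accurately locates the collections that contain many relevant documents, improving both the recall and the precision of the retrieval results.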
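The last two steps summarized in the abstract (combining keyword and topic relevance, then ranking collections through their sampled documents) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the keyword score is the fraction of query terms found in the document, the topic score is the cosine similarity of two LDA topic distributions, the two are mixed linearly with a hypothetical weight `lam`, and a collection's score is the sum over its sampled documents. The function names and the dictionary layout of `sample_docs` are likewise hypothetical.

```python
import math
from collections import defaultdict


def term_relevance(query_terms, doc_terms):
    # Assumed keyword score: fraction of query terms present in the document.
    if not query_terms:
        return 0.0
    doc_set = set(doc_terms)
    return sum(1 for t in query_terms if t in doc_set) / len(query_terms)


def topic_relevance(query_topics, doc_topics):
    # Assumed topic score: cosine similarity of two topic distributions
    # (for example, distributions inferred by a trained LDA model).
    dot = sum(q * d for q, d in zip(query_topics, doc_topics))
    norm = math.sqrt(sum(q * q for q in query_topics)) * math.sqrt(
        sum(d * d for d in doc_topics)
    )
    return dot / norm if norm > 0 else 0.0


def select_collections(sample_docs, query_terms, query_topics, m, lam=0.5):
    # Score each collection by summing the combined relevance of its
    # sampled documents, then return the m highest-scoring collections.
    scores = defaultdict(float)
    for doc in sample_docs:
        combined = lam * term_relevance(query_terms, doc["terms"]) + (
            1.0 - lam
        ) * topic_relevance(query_topics, doc["topics"])
        scores[doc["collection"]] += combined
    # Sort by descending score; break ties by collection name for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:m]
```

In this sketch each sampled document is a dictionary with the keys `collection`, `terms`, and `topics`. Setting `lam` to 1.0 reduces the ranking to keyword relevance alone, and setting it to 0.0 uses only topic relevance.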
Keywords: collection selection; distributed information retrieval; LDA
Chinese Library Classification (CLC) number: TP 311
An LDA topic model based collection selection method for distributed
information retrieval
HE Xu-feng¹, CHEN Ling¹, CHEN Gen-cai¹, QIAN Kun¹, WU Yong², WANG Jing-chang²
1(College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China)
2(ZheJiang Hongcheng Computer System Co.,Ltd., Hangzhou 310009, China)
Abstract: Considering that different collections contribute differently to the final search results, an LDA
topic model based collection selection method was proposed for distributed information retrieval. First, the
method acquired a description of each collection by query-based sampling; second, the LDA topic model
was used to estimate the topic relevance between the query and a document;
then, a mixed term-based and topic-based method was used to estimate the relevance between the query and the
sampled documents; finally, the relevance between the query and each collection was estimated from the information
about the collections to which the sampled documents belong