Computer Engineering and Applications (计算机工程与应用), 2014, 50(22)
A Topic Mining Method for Massive Microblog Data
WANG Wenshuai (王文帅)¹,², DU Ran (杜然)¹,², CHENG Yaodong (程耀东)¹, CHEN Gang (陈刚)¹
1. Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
WANG Wenshuai, DU Ran, CHENG Yaodong, et al. Topic mining method on massive microblog data. Computer Engineering and Applications, 2014, 50(22): 32-37.
Abstract: As microblogging grows ever more popular, Sina Weibo has become one of the major platforms through which the public obtains and disseminates information, and topic mining on microblog data has become an active research focus. This paper proposes a topic mining method for large-scale microblog data. The method first analyzes the large-scale microblog data and removes duplicate records with a Bloom Filter-based algorithm. It then preprocesses the text to account for the particular structure of microblog posts. Finally, it introduces Social Network LDA (SNLDA), an improved LDA topic model, and uses Gibbs sampling for model inference to mine microblog topics. Experimental results show that the method can effectively mine topic information from large-scale microblog data.
Key words: microblog; Bloom Filter; Social Network LDA (SNLDA); topic mining
Document code: A    CLC number: TP393    doi: 10.3778/j.issn.1002-8331.1404-0042
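1 Introduction
In recent years, social networking sites have developed rapidly both in China and abroad, and microblogs have gradually ...
... the number of users had already exceeded 500 million, and in the fourth quarter of 2013 the number of average daily active microblog users was ...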