Improvement of Focused Crawling Strategy Based on Genetic Algorithm (基于遗传算法的主题爬虫策略改进.pdf)

Published: 2017-05-31; approx. 13,300 characters; 5 pages
Vol. 27, No. 10, October 2010
Article ID: 1006-9348(2010)10-0087-04

Improvement of Focused Crawling Strategy Based on Genetic Algorithm

CHEN Yi-feng, ZHAO Heng-kai, YU Xiao-qing, WAN Wang-gen (陈一峰, 赵恒凯, 余小清, 万旺根)
(School of Communication and Information Engineering, Shanghai University, Shanghai 200072, China)

ABSTRACT: Aiming at the topic drift problem in focused crawling, this paper presents an improved strategy. Built on a genetic algorithm, the strategy draws on the idea of the PageRank algorithm and on page relevance: it redefines the fitness function and uses it to adjust the page relevance parameters. In this way, superior genes are selected first and topic drift is reduced as they are passed on. Compared with previous strategies based on genetic algorithms, the number of pages relevant to the topic increases by more than 5% without sacrificing recall.

KEYWORDS: Focused crawler; PageRank algorithm; Genetic algorithm; Web information

CLC number: TP31113    Document code: B
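The abstract says the fitness function combines the PageRank idea with page relevance so that superior genes are selected first, but the preview does not give the actual formula. The following is only a minimal sketch under stated assumptions: the weighted sum, the weight `alpha`, the `[0, 1]` score ranges, and the names `fitness` and `select_superior` are all hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, List

Page = Dict[str, object]  # hypothetical record: {"url", "relevance", "rank"}


def fitness(relevance: float, rank: float, alpha: float = 0.7) -> float:
    """Assumed fitness: weighted mix of topic relevance and a PageRank-style score.

    Both inputs are assumed to be normalised to [0, 1]; alpha is an
    illustrative weight, not a value reported in the paper.
    """
    return alpha * relevance + (1.0 - alpha) * rank


def select_superior(pages: List[Page], k: int, alpha: float = 0.7) -> List[Page]:
    """Keep the k pages with the highest assumed fitness (the 'superior genes')."""
    ranked = sorted(
        pages,
        key=lambda p: fitness(float(p["relevance"]), float(p["rank"]), alpha),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    candidates = [
        {"url": "a", "relevance": 0.9, "rank": 0.2},
        {"url": "b", "relevance": 0.3, "rank": 0.9},
        {"url": "c", "relevance": 0.6, "rank": 0.6},
    ]
    for page in select_superior(candidates, k=2):
        print(page["url"])
```

Weighting relevance above the link-based score is one plausible way to keep a highly linked but off-topic page from dominating selection, which is the topic drift the abstract describes; the paper's own parameter adjustment may differ.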