基于遗传算法的主题爬虫策略改进
Vol. 27, No. 10, October 2010
Article ID: 1006-9348(2010)10-0087-04
陈一峰,赵恒凯,余小清, 万旺根
(School of Communication and Information Engineering, Shanghai University, Shanghai 200072)
CLC number: TP311.13    Document code: B
Improvement of Focused Crawling Strategy
Based on Genetic Algorithm
CHEN Yi-feng, ZHAO Heng-kai, YU Xiao-qing, WAN Wang-gen
(School of Communication and Information Engineering, Shanghai University, Shanghai 200072, China)
ABSTRACT: Aiming at the topic drift problem in focused crawling, this paper presents an improved strategy.
Based on the genetic algorithm, the strategy absorbs the idea of the PageRank algorithm and of page relevance,
redefines the fitness function, and uses it to adjust the size of the page relevance parameters. In this way,
superior genes are selected first and topic drift is reduced during inheritance. Compared with previous
strategies based on the genetic algorithm, the number of pages relevant to the topic increases by more than 5%
without harming recall.
KEYWORDS: Focused crawler; PageRank algorithm; Genetic algorithm; Web information
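The abstract describes a fitness function that combines PageRank with page relevance, but this excerpt does not give its formula. The following is only a minimal sketch of one way such a fitness could be formed, not the authors' method. It assumes a weighted sum of cosine relevance and normalized PageRank, followed by keeping the fittest pages. The weight `alpha`, the cosine measure, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine_relevance(page_text, topic_terms):
    """Cosine similarity between the term counts of a page and of the topic."""
    page = Counter(page_text.lower().split())
    topic = Counter(t.lower() for t in topic_terms)
    dot = sum(page[t] * topic[t] for t in topic)
    norm = (math.sqrt(sum(v * v for v in page.values()))
            * math.sqrt(sum(v * v for v in topic.values())))
    return dot / norm if norm else 0.0


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {url: [outgoing urls]}."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank


def fitness(url, pages, ranks, topic_terms, alpha=0.7):
    """Assumed fitness: weighted sum of relevance and normalized PageRank."""
    max_rank = max(ranks.values())
    relevance = cosine_relevance(pages[url], topic_terms)
    return alpha * relevance + (1.0 - alpha) * ranks[url] / max_rank


def select_fittest(pages, links, topic_terms, keep=2, alpha=0.7):
    """Keep the `keep` pages with the highest fitness (elitist selection)."""
    ranks = pagerank(links)
    ordered = sorted(
        pages,
        key=lambda u: fitness(u, pages, ranks, topic_terms, alpha),
        reverse=True,
    )
    return ordered[:keep]
```

In this sketch a larger `alpha` favors topical relevance over link popularity. That is one plausible reading of "adjusting the size of the correlation parameters", though the paper's actual parameterization may differ.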