Master's Thesis - Design and Implementation of an AJAX-Enabled Internet Search Engine Crawler.doc

Published: 2019-02-18, approx. 37,000 characters, 91 pages
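Zhejiang University
Master's Thesis
Design and Implementation of an AJAX-Enabled Internet Search Engine Crawler
Name: 罗兵 (Luo Bing)
Degree applied for: Master
Major: Computer Applications
Advisor: 陈刚 (Chen Gang)

Abstract

A web crawler is an important component of a search engine. Web developers use AJAX (Asynchronous JavaScript and XML) technologies to build applications that are easier to use and more functional than traditional web programs. An AJAX page sends requests asynchronously and, after receiving data from the web server, changes its content dynamically. As a result, a traditional web crawler collects less data than the web browser presents.

We propose AjaxCrawler, a new web crawler that supports AJAX. AjaxCrawler consists of five parts: crawling web pages, analyzing web pages, interpreting JavaScript, invoking DOM operation methods, and regenerating web pages. First, it fetches the web page with an HTTP request. Second, it analyzes the page elements: not only the links, but also the JavaScript code and files in the page. Then it executes the JavaScript code, including the AJAX requests, obtains the results from the server, and invokes DOM operation methods to change the page content. Finally, it regenerates the web page and extracts the links. In our experiments, AjaxCrawler crawled more content than a traditional crawler under the same conditions.

Keywords: Search Engine, Web Crawler, AJAX, Web 2.0

List of Figures

Figure 1-1 Architecture of a search engine [1]
Figure 2-1 Workflow of a traditional crawler
Figure 2-2 Crawling strategies
Figure 2-3 Architecture of a classifier-based focused crawler
Figure 2-4 Architecture of a data-extractor-based focused crawler
Figure 2-5 Architecture of a user-learning-based focused crawler
Figure 2-6 System structure
Figure 3-1 Comparison of synchronous interaction (top) and asynchronous interaction (bottom) [13]
Figure 3-2 Comparison of traditional web applications and AJAX-based web applications [13]
Figure 3-3 Rendered Daily Recommendations page of NetEase Blog
Figure 3-4 Source fragment of the NetEase Blog Daily Recommendations page
Figure 3-5 Overall structure of the AJAX-enabled crawler
Figure 4-1 Web page analysis flow
Figure 4-2 Structure of the JS interpreter
Figure 4-3 DOM hierarchy
Figure 4-4 Inheritance relationships of the W3C DOM interfaces
Figure 4-5 Methods of Node
Figure 4-6 Flow for extracting hyperlinks from a page
Figure 5-1 Number of hyperlinks crawled by AjaxCrawler: NetEase Blog
Figure 5-2 Number of hyperlinks crawled by a traditional crawler: NetEase Blog
Figure 5-3 Number of hyperlinks crawled by AjaxCrawler: Sina Blog
Figure 5-4 Number of hyperlinks crawled by a traditional crawler: Sina Blog
Figure 5-5 Number of hyperlinks crawled by AjaxCrawler: Baidu Blog
Figure 5-6 Number of hyperlinks crawled by a traditional crawler: Baidu Blog
Figure 5-7 Number of hyperlinks crawled by AjaxCrawler: debian
Figure 5-8 Number of hyperlinks crawled by a traditional crawler: debian
Figure 5-9 Comparison of the number of links crawled
Figure 5-10 Comparison of crawl time

List of Tables

Table 3-1 XMLH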
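The abstract's second step, analyzing a page for its links as well as its JavaScript code and files, can be illustrated with a minimal sketch. This is not the thesis's implementation, and this preview does not state the language AjaxCrawler is written in. The names `PageAnalyzer` and `analyze_page` are hypothetical, and the sketch uses only Python's standard `html.parser` module. It does not execute any JavaScript; it only separates the three kinds of input that the later steps (JS interpretation, DOM operations) would need.

```python
from html.parser import HTMLParser


class PageAnalyzer(HTMLParser):
    """Hypothetical helper: collects hyperlinks, external script URLs,
    and inline script bodies from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []           # href values of <a> tags
        self.script_files = []    # src values of <script src=...> tags
        self.inline_scripts = []  # text of <script> tags without src
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "script":
            if attrs.get("src"):
                self.script_files.append(attrs["src"])
            else:
                # Start accumulating the body of an inline script.
                self._in_inline_script = True
                self.inline_scripts.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_inline_script:
            self.inline_scripts[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False


def analyze_page(html):
    """Return (links, script_files, inline_scripts) for an HTML string."""
    parser = PageAnalyzer()
    parser.feed(html)
    parser.close()
    scripts = [s.strip() for s in parser.inline_scripts if s.strip()]
    return parser.links, parser.script_files, scripts
```

A traditional crawler would stop at `links`. A crawler of the kind the abstract describes would additionally pass the script files and inline scripts to a JavaScript interpreter, and extract links again from the regenerated page.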