Design and Implementation of a C++-Based Search Engine Network Spider (Graduation Thesis)
Design and Implementation of a Search Engine Network Spider
Abstract (Chinese)
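Resources on the network are abundant, but searching for information effectively is difficult. Building a search engine is the best way to solve this problem. This thesis first describes in detail the system structure of Internet-based search engines. It then presents a network spider that, starting from a specified Web page, parses and searches links according to the breadth-first algorithm and fetches and saves each URL it finds. The chapter on spider design and implementation explains the core technology in detail and illustrates it with implementation code, which makes it easy to understand.
Keywords: Internet search engine; network spider; URL search program; multithreading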
Design and Realization of Search Engine Network Spider
Abstract
Resources on the network are abundant, but searching them for useful information is difficult. Building a search engine is the best way to solve this problem.
This thesis first introduces the structure of Internet-based search engines and then shows how to implement a search engine network spider.
The multi-threaded network spider starts from a specified Web page, parses and searches its links according to the breadth-first (width-priority) algorithm, and fetches and saves each URL it finds. It then uses the resulting URLs as new entry points, crawling the Internet continuously and automatically in the background.
The network spider described in this thesis mainly relies on socket technology, regular expressions, the HTTP protocol, Windows network programming, and related techniques. It is implemented in C++ and was debugged under VC6.0.
The chapter on spider design and implementation gives a detailed exposition of the core technology and illustrates it with the code of the multi-threaded network spider, which makes it easy to understand. The resulting network spider is a program that runs in the background, reads its initial URLs from a configuration file, crawls downward using the breadth-first algorithm, and saves the target URLs.
Keywords: Internet search engine; network spider; URL search program; multithreading
Table of Contents
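Abstract (Chinese)
Abstract
Chapter 1 Introduction
1.1 Project Background
1.2 History and Classification of Search Engines
1.2.1 History of Search Engines
1.2.2 Classification of Search Engines
1.3 Development Trends of Search Engines
1.4 Components of a Search Engine
1.5 Main Content of This Research
Chapter 2 Analysis of the Key Technical Points of the Network Spider
2.1 Working Principle of the Network Spider
2.1.1 The Concept of a Spider
2.1.2 Analysis of the Content Fetched by the Network Spider
2.2 The HTTP Protocol
2.2.1 HTTP Requests
2.2.2 HTTP Responses
2.2.3 HTTP Message Headers
2.3 Sockets
2.3.1 What Is a Socket
2.3.2 SOCK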