Research on Web Page Classification Based on URL Features - Thesis in Computer Software and Theory.docx
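Abstract (translated from the Chinese original)
The Internet offers a vast amount of widely distributed, highly dynamic resources, but web page information is scattered and hard to manage. Web page classification can address these problems effectively. Feature selection is an important step in web page classification. Traditional features are drawn from page text such as the body, anchor text, and title, which is time-consuming and costly. Feature redundancy and excessively high feature dimensionality are also common problems in web page classification. How to identify a page's category quickly while improving classification accuracy and reducing feature dimensionality has become a pressing problem.
This thesis systematically analyzes the background, current state of development, and research significance of web page classification, studies its key techniques in depth, and, building on existing results, makes the following contributions. A URL is the unique identifier of a web page, and classifying pages directly from URL features avoids the time spent processing the page body. The thesis analyzes the structure of URLs and proposes an n-gram method for deriving features from them: the n-gram method splits a URL into a series of strings, making full use of the information the URL contains. Classification experiments are run with the weka tool. Comparing different values of n shows that the time from feature extraction to classification is much shorter than with traditional body-text features, while relatively high accuracy is still achieved. Experiments comparing n-gram URL feature extraction with traditional URL feature extraction show that the n-gram approach performs better. Moreover, when time is not a constraint, combining n-gram features with body-text features improves on using either n-gram features alone or features from the page body, anchor text, and title alone.
Keywords: URL; web page classification; feature selection; n-gram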
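As an illustration of the n-gram idea described in the abstract, the following is a minimal Python sketch of splitting a URL into overlapping character n-grams and counting them. It is not the thesis's implementation: the function names `url_ngrams` and `ngram_features`, the choice to lowercase the URL and drop the scheme, and the default n = 3 are all assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_ngrams(url: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a URL's host, path and query.

    Assumption for this sketch: the scheme ("http", "https") is dropped and
    the URL is lowercased; the thesis may preprocess URLs differently.
    """
    parts = urlsplit(url.lower())
    text = parts.netloc + parts.path
    if parts.query:
        text += "?" + parts.query
    # Slide a window of width n over the string; a string shorter than n
    # yields no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(url: str, n: int = 3) -> Counter:
    """Count how often each n-gram occurs, giving a sparse feature vector."""
    return Counter(url_ngrams(url, n))
```

For instance, `url_ngrams("http://a.cn/x", 4)` yields `["a.cn", ".cn/", "cn/x"]`. The resulting counts could then be exported to whatever format the chosen classifier expects; that step is left out here because the abstract does not describe it.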
Abstract
The Internet provides a great deal of resources and information; however, because they are widely distributed and highly dynamic, they are scattered and difficult to manage. Web page classification can effectively address these problems. In the classification process, feature selection is one of the most important steps. Traditional features are chosen from page text such as the body, anchor text, and title, which is time-consuming and costly. Meanwhile, feature redundancy and excessive feature dimensionality are also common problems in web page classification. How to identify a page's category quickly while improving classification accuracy and reducing feature dimensionality has become a problem in urgent need of a solution.
This thesis systematically analyzes the background, current development, and research significance of web page classification, studies its key techniques in depth, and, on the basis of existing research results, makes the following main contributions:
The URL is the unique identifier of a web page, so classifying pages directly from URL features saves the time otherwise spent processing page text. The thesis analyzes the structure of URLs and proposes using n-grams to extract features from them; the n-gram method segments a URL into a series of strings to make full u