
Research on Web Page Classification Based on URL Features - Thesis in Computer Software and Theory.docx

Published: 2018-12-15 · approx. 48,300 characters · 64 pages


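Abstract (摘要)

The Internet offers a vast amount of widely distributed, highly dynamic resources, but web page information is scattered and hard to manage. Web page classification can effectively address these problems. Feature selection is a key step in web page classification. Traditionally, features are drawn from web page text such as the body, anchor text, and title, which is time-consuming and costly. Feature redundancy and excessively high feature dimensionality are also common problems. How to identify a page's category quickly while improving classification accuracy and reducing feature dimensionality has become an urgent problem.

This thesis systematically analyzes the background, current state, and significance of web page classification research, studies its key techniques in depth, and, building on existing results, makes the following contributions. A URL is the unique identifier of a web page, so classifying pages directly from URL features avoids the time spent processing page body text. The thesis analyzes the structure of URLs and proposes processing URLs with an n-gram method to obtain features: the n-gram method segments a URL into a series of strings, making full use of the information the URL contains. Classification experiments are run with the Weka tool. Comparing different values of n shows that the time from feature extraction to classification is much shorter than with traditional body-text features, while still reaching high accuracy. Experiments comparing n-gram URL feature extraction with traditional URL feature extraction show that n-gram performs better. In addition, when time is not a constraint, combining n-gram features with body-text features improves on using n-gram features alone or using body, anchor-text, and title features alone.

Keywords: URL; web page classification; feature selection; n-gram

Abstract

The Internet provides a great deal of resources and information; however, they are scattered and difficult to manage because of their wide distribution and high dynamics. Web page classification can effectively solve these problems. In the web page classification process, feature selection is one of the most important steps. Traditional features are chosen from the body text, anchor text, title, and other page text, which consumes more time. Meanwhile, feature redundancy and high feature dimensionality are also common problems in web page classification. How to identify a page's category quickly while improving classification accuracy and reducing feature dimensionality has become a problem that needs to be resolved. This paper systematically analyzes the background, development, and research significance of web page classification, studies its key technologies, and, on the basis of existing research results, makes the following contributions: The URL is the unique identifier of a web page, so classifying pages directly by URL features saves the time spent processing page text. The paper analyzes the structure of URLs and proposes using n-grams to extract features from them. The n-gram method segments a URL into a series of strings to make full u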
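
The n-gram step the abstract describes, splitting a URL into a series of strings and using them as features, can be illustrated with a short Python sketch. This is not the thesis's implementation, which the abstract says was run in Weka. It assumes overlapping character n-grams over the lowercased URL with the scheme removed; the abstract does not say whether the grams are characters or tokens, and the function names `url_ngrams` and `ngram_features` are hypothetical.

```python
from collections import Counter


def url_ngrams(url: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a URL."""
    # Assumption: lowercase the URL and drop the scheme ("http://"),
    # since the scheme is shared by most pages.
    text = url.lower()
    if "://" in text:
        text = text.split("://", 1)[1]
    # Slide a window of width n across the remaining string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(url: str, n: int = 3) -> Counter:
    """Count n-gram occurrences; the counts act as a sparse feature vector."""
    return Counter(url_ngrams(url, n))


if __name__ == "__main__":
    # Hypothetical example URL, chosen only for illustration.
    features = ngram_features("http://www.example.com/sports/news", n=4)
    print(features.most_common(5))
```

Varying `n` here corresponds to the comparison of different n values mentioned in the abstract. Larger values of n yield more distinct and more specific grams, which raises feature dimensionality.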
