Journal of the China Society for Scientific and Technical Information (情报学报), ISSN 1000-0135
Vol. 27, No. 1, pp. 42-48, February 2008

Document Clustering Algorithm Based on Sample Weighting

Zhang Chengzhi, Shi Qinghui and Xue Dejun
(1. Department of Information Management, Nanjing University, Nanjing 210093; 2. China Academic Journal (CD) Electronic Publishing House, Beijing 100084)

Abstract: Sample-weighted clustering algorithms have attracted attention only recently, and several problems remain open. For example, does the structural information among the clustering objects help sample-weighted clustering, and how can that structural information be converted automatically into weights for the samples (objects)? To address these questions, this paper proposes a novel sample-weighted document clustering algorithm based on K-Means. The algorithm takes academic papers as the clustering objects, computes the PageRank value of each paper from the citation relationships among the papers, and uses that value as the paper's weight. Experiments show that clustering weighted by the papers' PageRank values improves document clustering performance. The algorithm can be extended to Web page clustering, using the PageRank value of each Web page as its weight.

Keywords: document clustering, sample-weighted clustering, PageRank, citation frequency
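The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the weights enter K-Means, so the sketch assumes they are used in the centroid update (each centroid is the weighted mean of its members). The damping factor of 0.85, the toy vectors, the toy citation matrix, and the function names `pagerank` and `weighted_kmeans` are all assumptions made for illustration.

```python
import numpy as np


def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; adj[i, j] = 1 means paper i cites paper j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Assumption: papers that cite nothing spread their score uniformly.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1.0 - damping) / n + damping * (trans.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank


def weighted_kmeans(X, weights, init_idx, n_iter=100):
    """K-Means whose centroids are weighted means of their members.

    init_idx picks the initial centroids by row index, which keeps this
    sketch deterministic; it is a hypothetical parameter, not from the paper.
    """
    centroids = X[list(init_idx)].astype(float)
    k = len(centroids)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for c in range(k):
            members = labels == c
            if members.any():
                centroids[c] = np.average(
                    X[members], axis=0, weights=weights[members]
                )
    return labels, centroids


# Toy data (invented): six papers as 2-D term vectors, two obvious topics.
X = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],
])

# Toy citation graph (invented): papers 1, 2, 3 cite paper 0; 4, 5 cite paper 3.
adj = np.zeros((6, 6))
for citing, cited in [(1, 0), (2, 0), (3, 0), (4, 3), (5, 3)]:
    adj[citing, cited] = 1.0

weights = pagerank(adj)
labels, centroids = weighted_kmeans(X, weights, init_idx=[0, 3])
print(labels)
print(centroids)
```

In this toy setup the most-cited paper receives the largest PageRank value, so the centroid of its cluster is pulled toward that paper's vector rather than sitting at the plain mean of the cluster. Other ways of using the weights (for example, in the distance computation or the objective function) are possible and may be closer to what the paper actually does.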
