
Clustering Algorithm for Data Stream with Heterogeneous Attributes Based on Information Entropy Dimension Reduction.pdf

Published: 2017-09-10 · about 17,600 characters · 4 pages


Computer Engineering, Vol. 37, No. 19, October 2011
· Software Technology and Database ·
Article ID: 1000-3428(2011)19-… Document code: A CLC number: TP311
DOI: 10.3969/j.issn.1000-3428.2011.19.026

Clustering Algorithm for Data Stream with Heterogeneous Attributes Based on Information Entropy Dimension Reduction

TAN Jian-jian, ZHENG Hong-yuan, DING Qiu-lin
(College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)

[Abstract] Existing data stream clustering algorithms cannot handle data streams with high-dimensional heterogeneous attributes. To address this problem, this paper improves the offline and online clustering processes of the HPStream algorithm: it uses a frequency matrix to handle categorical attributes, and an information-entropy-based method for selecting categorical attributes to reduce the data dimensionality. Experimental results show that the algorithm effectively handles data sets with heterogeneous attributes and relatively high dimensionality, and that its clustering precision is 5% to 15% higher than that of the HPStream algorithm.

[Keywords] data stream mining; heterogeneous attributes; frequency matrix; information entropy; dimension reduction

1 Overview

In recent years, advances in computer and communication technology have produced massive real-time data streams, such as control information streams in industrial automatic control and real-time information streams in sensor networks. How to extract useful knowledge from these data streams has become a new research focus. In particular, clustering on the data stream model has been widely studied as an important data mining method. This paper targets heterogeneous attributes …

3 Synopsis Data Structure Design and Categorical Attribute Dimension Reduction

The basic concepts and formulas used in this paper are as follows.

Definition 1 (data stream) A data stream consists of an unbounded series of multidimensional instances that arrive in time order; that is, the instances $X_1, X_2, \ldots, X_m, \ldots$ arrive sequentially at times $T_1, T_2, \ldots, T_m, \ldots$, where $X_i = [\,\ldots$ …
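The abstract names two ingredients, frequency counts for categorical attributes and an information-entropy criterion for choosing among them. The preview does not show the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' method: it counts category frequencies per attribute and computes the Shannon entropy of each attribute's empirical distribution. The function names, the toy rows, and the attribute names are all hypothetical, and the sketch only ranks attributes; whether low-entropy or high-entropy attributes should be kept is left open because the preview does not say.

```python
import math
from collections import Counter


def frequency_table(values):
    """Count how often each category occurs in one categorical attribute."""
    return Counter(values)


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of one attribute."""
    counts = frequency_table(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_attributes_by_entropy(rows, attribute_names):
    """Return (name, entropy) pairs sorted by ascending entropy.

    This only ranks the attributes; the selection rule itself is an
    assumption left to the caller.
    """
    columns = list(zip(*rows))  # transpose rows into per-attribute columns
    scores = [
        (name, shannon_entropy(column))
        for name, column in zip(attribute_names, columns)
    ]
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical toy instances with three categorical attributes.
rows = [
    ("tcp", "http", "SF"),
    ("tcp", "ftp", "SF"),
    ("tcp", "http", "REJ"),
    ("tcp", "smtp", "SF"),
]
ranking = rank_attributes_by_entropy(rows, ["protocol", "service", "flag"])
for name, entropy in ranking:
    print(f"{name}: {entropy:.3f} bits")
```

In this toy data the constant attribute has zero entropy and therefore carries no information for separating instances, which is the kind of signal an entropy-based selection step can exploit. In a streaming setting the counts would be updated incrementally as instances arrive rather than recomputed from stored rows.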

