Top Ten Classic Algorithms in Data Mining
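In December 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten classic algorithms of data mining: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
The ten were chosen from 18 nominated candidate algorithms.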
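1. C4.5
C4.5 is a decision tree classification algorithm from machine learning; its core is the ID3 algorithm.
C4.5 inherits the strengths of ID3 and improves on it in the following respects:
1) It selects attributes by information gain ratio, which overcomes the bias of information gain toward attributes with many values.
2) It prunes the tree during construction.
3) It can discretize continuous attributes.
4) It can handle incomplete data.
C4.5 produces classification rules that are easy to understand, and its accuracy is fairly high. Its drawback is that building the tree requires repeated sequential scans and sorts of the data set, which makes the algorithm inefficient.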
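2. The k-means algorithm
The k-means algorithm is a clustering algorithm that partitions n objects into k clusters according to their attributes, where k < n.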
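As an illustration of partitioning n objects into k clusters, here is a minimal Python sketch of the usual assign-then-update iteration (Lloyd's method). The text above does not spell out this procedure, so the function names `assign` and `kmeans`, the squared Euclidean distance, the random choice of initial centers, and the iteration cap are all assumptions made for the example.

```python
import random


def assign(points, centers):
    """Group each point with its nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, clusters) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct input points
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(x) / len(members) for x in zip(*members)) if members else center
            for center, members in zip(centers, clusters)
        ]
        if new_centers == centers:  # no center moved, so the assignment is stable
            break
        centers = new_centers
    return centers, assign(points, centers)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

On this toy data the two well-separated groups of three points end up in separate clusters. The result of k-means generally depends on the initial centers, which is why the sketch fixes a seed.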
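3. Support vector machines
A support vector machine (Support Vector Machine, abbreviated SV machine, and usually SVM in papers) is a supervised learning method that is widely used for statistical classification and regression analysis.
C. J. C. Burges wrote a well-regarded guide to support vector machines for pattern recognition.
van der Walt and Barnard compared support vector machines with other classifiers.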
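4. The Apriori algorithm
Apriori is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules.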
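5. Expectation–Maximization (EM)
In statistical computing, the Expectation–Maximization (EM) algorithm finds maximum likelihood estimates of the parameters of a probabilistic model that depends on unobserved latent variables.
EM is often used for data clustering in machine learning and computer vision.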
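6. PageRank
PageRank is an important part of Google's ranking algorithm. It was granted a US patent in September 2001, and the named inventor is Google co-founder Larry Page. The "page" in PageRank therefore refers to Page himself, not to a web page.
PageRank measures the value of a website by the number and quality of its external and internal links.
The idea behind PageRank is that every link to a page is a vote for that page: the more links a page receives, the more votes it has from other sites. This is the so-called "link popularity", a measure of how many people are willing to link their sites to yours.
The concept is borrowed from citation frequency in academic publishing: the more often a paper is cited by others, the more authoritative it is generally judged to be.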
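The "links as votes" idea can be made concrete with a small power-iteration sketch in Python. This is only an illustration: the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the fixed iteration count, and the `pagerank` function name are assumptions that the text above does not state.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Each outgoing link passes on an equal share of the page's rank.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this three-page example, page c receives links from both a and b and ends up with the highest score, while b, which nobody links to, ends up with the lowest.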
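7. AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (a strong classifier).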
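8. kNN: k-nearest neighbor classification
The k-nearest neighbor (k-Nearest Neighbor, KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms.
The idea is this: if most of the k samples most similar to a given sample (that is, its nearest neighbors in feature space) belong to one class, then the sample belongs to that class as well.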
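The majority-vote rule described here fits in a few lines of Python. This is a minimal sketch, assuming Euclidean distance as the similarity measure and a hypothetical `knn_classify` helper; the text does not prescribe either.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """train is a list of (feature_tuple, label) pairs; returns the majority label."""
    # Keep the k training samples closest to the query in feature space.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Each neighbor votes for its own class; the most common class wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
label = knn_classify(train, (0.5, 0.5), k=3)
```

Here the three nearest neighbors of (0.5, 0.5) all carry label "a", so the query is assigned to class "a". With an even k, ties are possible; this sketch resolves them in favor of the class encountered first among the neighbors.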
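9. Naive Bayes
Among classification models, the two most widely used are the decision tree model (Decision Tree Model) and the naive Bayesian model (Naive Bayesian Model, NBC).
The naive Bayesian model originates in classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency.
NBC also needs few estimated parameters, is not very sensitive to missing data, and is relatively simple.
In theory, NBC has the lowest error rate among classification methods.
In practice this does not always hold, because NBC assumes that attributes are mutually independent.
This assumption often fails in real applications, which affects how accurately NBC classifies.
When there are many attributes, or the attributes are strongly correlated, NBC classifies less well than the decision tree model.
When the correlation between attributes is low, NBC performs best.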
10. CART: classification and regression trees
CART stands for Classification and Regression Trees.
(1) C4.5
,
“”
1)
2)
3)
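Because ID3 has some problems in practical use, Quinlan proposed the C4.5 algorithm.
Strictly speaking, C4.5 is an improved version of ID3.
C4.5 inherits the strengths of ID3 and improves on it in the following respects:
1) It selects attributes by information gain ratio, which overcomes the bias of information gain toward attributes with many values.
2) It prunes the tree during construction.
3) It can discretize continuous attributes.
4) It can handle incomplete data.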
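C4.5 produces classification rules that are easy to understand, and its accuracy is fairly high. Its drawback is that building the tree requires repeated sequential scans and sorts of the data set, which makes the algorithm inefficient.
C4.5 is also suitable only for data sets that fit in memory; it cannot run when the training set is too large to be held in memory.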
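C4.5 is a decision tree classification algorithm from machine learning; its core is the ID3 algorithm.
A classification decision tree algorithm builds a top-down decision tree by extracting classification rules from a large number of examples.
The parts of a decision tree are:
Root: the set of training examples.
Branches: the decision conditions used for classification.
Leaves: the resulting classes.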
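§4.3.2 The ID3 algorithm
1. The concept learning system (CLS) algorithm
1) Initialize C = {E}, where E contains all the examples and forms the root.
2) IF all elements e of C belong to the same decision class, create a leaf node
YES and stop.
ELSE, following the heuristic criterion, select a feature Fi = {V1, V2, V3, ..., Vn} and create a decision node;
partition C into n mutually disjoint sets C1, C2, C3, ..., Cn.
3) Recurse on each Ci.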
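Steps 1) to 3) above can be sketched recursively in Python. The heuristic criterion is not specified at this point in the text, so the sketch takes it as a `choose` parameter whose default simply picks the first remaining feature. That default, the `build_tree` name, the nested-tuple tree representation, and the majority-class fallback when no features remain are all assumptions made for illustration.

```python
from collections import Counter


def build_tree(examples, features, choose=lambda examples, features: features[0]):
    """examples: list of (attribute_dict, label); features: list of attribute names."""
    labels = [label for _, label in examples]
    # Step 2, IF branch: every example is in the same class, so create a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: with no feature left to test, fall back to the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 2, ELSE branch: select a feature and partition into disjoint subsets.
    feature = choose(examples, features)
    remaining = [f for f in features if f != feature]
    partitions = {}
    for attrs, label in examples:
        partitions.setdefault(attrs[feature], []).append((attrs, label))
    # Step 3: recurse on each subset Ci.
    children = {
        value: build_tree(subset, remaining, choose)
        for value, subset in partitions.items()
    }
    return (feature, children)


examples = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(examples, ["outlook", "windy"])
```

A leaf is represented by its class label and a decision node by a `(feature, children)` pair. Passing an entropy-based `choose` function instead of the default turns this skeleton into an ID3-style learner.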
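2. The ID3 algorithm
1) Randomly select a subset W of C (the window).
2) Call CLS to build a classification tree DT for W (the heuristic criterion is described afterwards).
3) Scan C sequentially and collect the exceptions to DT (the examples that DT cannot classify).
4) Combine W with the exceptions found to form a new W.
5) Repeat 2) to 4) until there are no exceptions.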
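Heuristic criterion:
The criterion depends only on a node and its subtree, and it uses entropy from information theory as the measure.
Entropy measures the freedom of choice when selecting an event. It is computed as follows, where $\mathrm{freq}(C_j, S)$ is the number of examples in $S$ that belong to class $C_j$ and the sum runs over $j$ from 1 to $n$:

$$p_j = \frac{\mathrm{freq}(C_j, S)}{|S|}$$

$$\mathrm{Info}(S) = -\sum_{j=1}^{n} p_j \log_2 p_j$$

When attribute $X$ splits a set $T$ into subsets $T_i$:

$$\mathrm{Info}_X(T) = \sum_{i} \frac{|T_i|}{|T|}\,\mathrm{Info}(T_i)$$

$$\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_X(T)$$

To keep the generated decision tree as small as possible, ID3 grows each subtree on the feature that leaves the least entropy in the resulting subsets,
which is the feature with the largest Gain(X).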
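To make the entropy and gain computations concrete, here is a small Python sketch. It assumes base-2 logarithms, examples stored as dictionaries of attribute values, and the hypothetical function names `info` and `gain`; none of these are fixed by the text.

```python
import math
from collections import Counter


def info(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain(rows, labels, attribute):
    """Information gain from splitting the examples on one attribute."""
    # Partition the class labels by the value each example has for the attribute.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets after the split.
    info_x = sum(len(s) / len(labels) * info(s) for s in subsets.values())
    return info(labels) - info_x


rows = [
    {"a": "x", "b": "p"},
    {"a": "x", "b": "q"},
    {"a": "y", "b": "p"},
    {"a": "y", "b": "q"},
]
labels = ["yes", "yes", "no", "no"]
best = max(["a", "b"], key=lambda name: gain(rows, labels, name))
```

In this example the labels are split evenly, so the entropy before splitting is 1 bit. Attribute a separates the classes perfectly (gain 1), attribute b tells nothing about the class (gain 0), so a is selected.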
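§4.3.3: ID3's requirements on the data
1. All attributes must be discrete.
2. Every attribute of every training example must have a definite value.
3. Identical attribute values must lead to the same conclusion, and every training example must be unique.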
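§4.3.4: C4.5's improvements to ID3:
1. An improved entropy measure that adds information about the split itself. When attribute $X$ splits a set $T$ into subsets $T_i$:

$$\mathrm{SplitInfo}_X(T) = -\sum_{i} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}$$

$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{SplitInfo}_X(T)}$$

2. Improvements in the input data.
1) Attribute values may be continuous. C4.5 sorts them, partitions them into sets, and then treats them as discrete values in the same way as ID3.
The class attribute, however, must still be discrete.
2) Attribute values of training examples may be unknown, written as ?, but the class must be known.
3. Pruning of the generated decision tree, which reduces the size of the tree.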
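The effect of dividing the gain by the split information can be seen in a short Python sketch. It assumes base-2 logarithms and examples stored as dictionaries; the helper names `entropy_of_counts` and `gain_ratio`, and the choice to return 0 when an attribute takes only one value, are assumptions made for this illustration.

```python
import math
from collections import Counter


def entropy_of_counts(counts):
    """Entropy in bits of a distribution given as a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def gain_ratio(rows, labels, attribute):
    """Information gain of the split divided by the split information."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    info_t = entropy_of_counts(Counter(labels).values())
    info_x = sum(
        len(s) / len(labels) * entropy_of_counts(Counter(s).values())
        for s in subsets.values()
    )
    # Split information depends only on the subset sizes, not on the classes.
    split_info = entropy_of_counts([len(s) for s in subsets.values()])
    if split_info == 0:  # assumption: a single-valued attribute gets ratio 0
        return 0.0
    return (info_t - info_x) / split_info


rows = [
    {"id": 1, "a": "x"},
    {"id": 2, "a": "x"},
    {"id": 3, "a": "y"},
    {"id": 4, "a": "y"},
]
labels = ["yes", "yes", "no", "no"]
ratio_id = gain_ratio(rows, labels, "id")
ratio_a = gain_ratio(rows, labels, "a")
```

Both attributes have an information gain of 1 bit here, but the unique-valued id attribute has a split information of 2 bits and so a gain ratio of 0.5, while the two-valued attribute a keeps a gain ratio of 1. This is how the ratio counters the preference for attributes with many values.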
(2) k-means
k-meansa