An Analysis of Basic Classification Methods in Data Mining (.ppt)
Confidence Interval for Accuracy

- For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1−p)/N:
  P( Z_{α/2} ≤ (acc − p) / sqrt(p(1−p)/N) ≤ Z_{1−α/2} ) = 1 − α
  (the area under the standard normal curve between Z_{α/2} and Z_{1−α/2} is 1 − α)
- Confidence interval for p:
  p = ( 2·N·acc + Z_{α/2}² ± Z_{α/2} · sqrt(Z_{α/2}² + 4·N·acc − 4·N·acc²) ) / ( 2·(N + Z_{α/2}²) )

Confidence Interval for Accuracy: Example

- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  N = 100, acc = 0.8
- Let 1 − α = 0.95 (95% confidence); from the standard normal table, Z_{α/2} = 1.96

  1 − α    Z
  0.99     2.58
  0.98     2.33
  0.95     1.96
  0.90     1.65

- Confidence interval for p at acc = 0.8 as N grows:

  N         50      100     500     1000    5000
  p(lower)  0.670   0.711   0.763   0.774   0.789
  p(upper)  0.888   0.866   0.833   0.824   0.811

Comparing Performance of 2 Models

- Given two models, say M1 and M2, which is better?
  - M1 is tested on D1 (size = n1), found error rate e1
  - M2 is tested on D2 (size = n2), found error rate e2
  - Assume D1 and D2 are independent
  - If n1 and n2 are sufficiently large, then e1 ~ N(μ1, σ1) and e2 ~ N(μ2, σ2)
  - Approximate: σ_i² ≈ e_i(1 − e_i) / n_i

- To test if the performance difference is statistically significant, let d = e1 − e2
  - d ~ N(d_t, σ_t), where d_t is the true difference
  - Since D1 and D2 are independent, their variances add up:
    σ_t² = σ1² + σ2² ≈ e1(1 − e1)/n1 + e2(1 − e2)/n2
  - At the (1 − α) confidence level, d_t = d ± Z_{α/2} · σ_t

An Illustrative Example

- Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
- d = |e2 − e1| = 0.10 (2-sided test)
- σ_d² ≈ 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043
- At the 95% confidence level, Z_{α/2} = 1.96:
  d_t = 0.100 ± 1.96 · sqrt(0.0043) = 0.100 ± 0.128
- The interval contains 0, so the difference may not be statistically significant

Comparing Performance of 2 Algorithms

- Each learning algorithm may produce k models:
  - L1 may produce M11, M12, …, M1k
  - L2 may produce M21, M22, …, M2k
- If the models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation):
  - For each set, compute d_j = e1j − e2j
  - d_j has mean d_t and variance σ_t²
  - Estimate: σ_t² ≈ Σ_{j=1..k} (d_j − d̄)² / (k(k − 1)), and d_t = d̄ ± t_{1−α, k−1} · σ_t

Computing Impurity Measure (with a missing value)

- Before splitting: Entropy(Parent) = −0.3 log2(0.3) − 0.7 log2(0.7) = 0.8813
- Split on Refund (one instance has a missing Refund value and is left out of the entropy computation):
  - Entropy(Refund=Yes) = 0
  - Entropy(Refund=No) = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.9183
  - Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551
  - Gain = 0.9 × (0.8813 − 0.551) = 0.9 × 0.3303 = 0.297

Distribute Instances

[Figure: Refund node with Yes/No branches, showing how the instance with the missing Refund value is distributed; content truncated in this preview]
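The confidence-interval formula for accuracy can be checked numerically. Below is a minimal Python sketch; the helper name accuracy_confidence_interval and the default z = 1.96 are illustrative choices, not part of the slides. It reproduces the p(lower)/p(upper) table for acc = 0.8.

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Interval for the true accuracy p given observed accuracy `acc` on `n`
    test instances, using the normal-approximation formula from the slides.
    z is Z_{alpha/2}, e.g. 1.96 for 95% confidence. (Illustrative helper.)"""
    center = 2 * n * acc + z**2
    spread = z * math.sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

# Reproduces the slide's table for acc = 0.8:
for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_confidence_interval(0.8, n)
    print(f"N={n:5d}  p(lower)={lo:.3f}  p(upper)={hi:.3f}")
```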
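For the two-model comparison, the same normal approximation gives an interval for the true difference d_t. The sketch below (compare_two_models is a hypothetical helper name) reproduces the illustrative example: 0.100 ± 0.128, an interval that contains 0.

```python
import math

def compare_two_models(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates
    measured on independent test sets of sizes n1 and n2 (sketch of the
    'Comparing Performance of 2 Models' computation)."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(var)
    return d - margin, d + margin

lo, hi = compare_two_models(e1=0.15, n1=30, e2=0.25, n2=5000)
print(f"d_t in [{lo:.3f}, {hi:.3f}]")  # approx [-0.028, 0.228]; contains 0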
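The paired, k-fold comparison of two algorithms works the same way but uses the per-fold differences and a t critical value with k − 1 degrees of freedom. The per-fold error rates below are made up for illustration, and the t value (2.262 for k = 10 at 95% confidence) is taken from a standard t-table.

```python
import math

def compare_two_algorithms(errors_1, errors_2, t_crit):
    """Paired comparison of two algorithms over the same k folds.
    errors_1[j], errors_2[j]: error rates of L1 and L2 on fold j.
    t_crit: t_{1-alpha, k-1} from a t-table (e.g. 2.262 for k=10, 95%)."""
    k = len(errors_1)
    d = [a - b for a, b in zip(errors_1, errors_2)]
    d_bar = sum(d) / k
    var_hat = sum((dj - d_bar) ** 2 for dj in d) / (k * (k - 1))
    margin = t_crit * math.sqrt(var_hat)
    return d_bar - margin, d_bar + margin

# Hypothetical per-fold error rates, for illustration only:
e1 = [0.20, 0.18, 0.22, 0.19, 0.21, 0.20, 0.23, 0.18, 0.20, 0.19]
e2 = [0.22, 0.21, 0.24, 0.20, 0.23, 0.22, 0.25, 0.21, 0.22, 0.21]
print(compare_two_algorithms(e1, e2, t_crit=2.262))  # interval < 0: L1 has lower error
```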
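The impurity numbers on the missing-value slide can be verified with a short entropy helper. The class counts used here ([3, 7] at the parent, [0, 3] for Refund=Yes, [2, 4] for Refund=No, with one instance missing the Refund value) are inferred from the fractions shown on the slide.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a class-count list, e.g. [2, 4]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts inferred from the slide's fractions (assumption, not stated
# explicitly in the preview): 3 Yes / 7 No overall, Refund=Yes -> [0, 3],
# Refund=No -> [2, 4], and 1 instance with Refund missing.
parent = entropy([3, 7])                         # 0.8813
e_yes, e_no = entropy([0, 3]), entropy([2, 4])   # 0 and 0.9183
children = (3 / 10) * e_yes + (6 / 10) * e_no    # 0.551
gain = (9 / 10) * (parent - children)            # 0.9 * 0.3303 = 0.297
print(parent, children, gain)
```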