
南航 (Nanhang) Summer International Course, Big Data Visualization, Lecture 7-2.ppt

Published: 2017-05-24, about 10,100 characters, 42 pages
* Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure: clustering tree diagrams, panels labelled 3, 4 and 5]
3. Merge best host and runner-up
5. Consider splitting the best host if merging doesn't help

* Final hierarchy

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False

Oops! a and b are actually very similar.
Use category utility measure to split or merge nodes.

* Example: the iris data (subset)

* Clustering with cutoff

* Category utility

Category utility is a quadratic loss function defined on conditional probabilities:

CU(C_1, \dots, C_k) = \frac{1}{k} \sum_{l} P(C_l) \sum_{i} \sum_{j} \left[ P(a_i = v_{ij} \mid C_l)^2 - P(a_i = v_{ij})^2 \right]

Here v_{ij} is the j-th value of attribute a_i (e.g., for the vector (5, 3) we have attributes a_1 and a_2 with values v_{11} = 5 and v_{21} = 3), and k is the number of categories.

If every instance is in a different category, the numerator reaches its maximum:

n - \sum_{i} \sum_{j} P(a_i = v_{ij})^2

where n is the number of attributes.

* Overfitting-avoidance heuristic

If every instance gets put into a different category, each conditional probability P(a_i = v_{ij} \mid C_l) is either 0 or 1, so the numerator becomes maximal:

n - \sum_{i} \sum_{j} P(a_i = v_{ij})^2

where n is the number of attributes. So without k (the number of categories) in the denominator of the CU formula, every cluster would consist of one instance!

Maximum value of CU

The information-theoretic definition of category utility

The intuition: one term represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to a given category. Another term represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described do NOT belong to that category.
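
The category utility measure from the slides can be tried out on the weather data with a short Python sketch. This is a minimal illustration, not the lecture's own code: the names `WEATHER`, `sum_sq_probs`, `cu_numerator` and `category_utility` are hypothetical, probabilities are estimated as plain relative frequencies, and a clustering is assumed to be a list of non-empty lists of rows.

```python
from collections import Counter

# Weather data as transcribed from the slide: (Outlook, Temp., Humidity, Windy).
WEATHER = {
    "A": ("Sunny", "Hot", "High", False),
    "B": ("Sunny", "Hot", "High", True),
    "C": ("Overcast", "Hot", "High", False),
    "D": ("Rainy", "Mild", "High", False),
    "E": ("Rainy", "Cool", "Normal", False),
    "F": ("Rainy", "Cool", "Normal", True),
    "G": ("Overcast", "Cool", "Normal", True),
    "H": ("Sunny", "Mild", "High", False),
    "I": ("Sunny", "Cool", "Normal", False),
    "J": ("Rainy", "Mild", "Normal", False),
    "K": ("Sunny", "Mild", "Normal", True),
    "L": ("Overcast", "Mild", "High", True),
    "M": ("Overcast", "Hot", "Normal", False),
    "N": ("Rainy", "Mild", "High", True),
}


def sum_sq_probs(rows):
    """Sum over attributes i and values j of P(a_i = v_ij)^2, estimated from rows."""
    n_rows = len(rows)
    total = 0.0
    for column in zip(*rows):  # one column per attribute
        counts = Counter(column)
        total += sum((count / n_rows) ** 2 for count in counts.values())
    return total


def cu_numerator(clusters):
    """Numerator of CU: sum_l P(C_l) * (within-cluster term - overall term)."""
    all_rows = [row for cluster in clusters for row in cluster]
    baseline = sum_sq_probs(all_rows)
    n_total = len(all_rows)
    return sum(
        len(cluster) / n_total * (sum_sq_probs(cluster) - baseline)
        for cluster in clusters
    )


def category_utility(clusters):
    """Category utility: the numerator divided by k, the number of clusters."""
    return cu_numerator(clusters) / len(clusters)


if __name__ == "__main__":
    rows = list(WEATHER.values())
    singletons = [[row] for row in rows]
    by_humidity = [
        [row for row in rows if row[2] == "High"],
        [row for row in rows if row[2] == "Normal"],
    ]
    print("numerator, singletons: ", cu_numerator(singletons))
    print("numerator, by humidity:", cu_numerator(by_humidity))
    print("CU, singletons:        ", category_utility(singletons))
    print("CU, by humidity:       ", category_utility(by_humidity))
```

Under these assumptions, the one-instance-per-cluster partition has the larger numerator, while dividing by k gives the two-cluster split on Humidity the higher category utility. This is the effect the overfitting-avoidance heuristic relies on.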