韩家炜数据挖掘.ppt
文本预览下载声明
Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign /~hanj ?2006 Jiawei Han and Micheline Kamber, All rights reserved Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” noisy: containing errors or outliers e.g., Salary=“-10” inconsistent: containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records Why Is Data Dirty? Incomplete data may come from “Not applicable” data value when collected Different considerations between the time when the data was collected and when it is analyzed. Human/hardware/software problems Noisy data (incorrect values) may come from Faulty data collection instruments Human or computer error at data entry Errors in data transmission Inconsistent data may come from Different data sources Functional dependency violation (e.g., modify some linked data) Duplicate records also need data cleaning Why Is Data Preprocessing Important? No quality data, no quality mining results! Quality decisions must be based on quality data e.g., duplicate or missing data may cause incorrect or even misleading statistics. Data warehouse needs consistent integration of quality data Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility Broad categories: Intrinsic
显示全部