Submitted by: kibrit90
Date Submitted: 10/23/2013 03:05 AM
IT433 Data Warehousing and Data Mining
— Data Preprocessing —
Data Preprocessing
• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
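The three kinds of dirty data above can be checked mechanically. The following is a minimal sketch, not a complete cleaning routine: the function name `find_problems`, the field names, and the reference date are all assumptions made for illustration, and the birthday “03/07/1997” is assumed to be in MM/DD/YYYY order.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (hypothetical helper)."""
    problems = []

    # Incomplete: a required attribute is missing or blank.
    occupation = record.get("occupation")
    if occupation is None or not occupation.strip():
        problems.append("incomplete: occupation is empty")

    # Noisy: a value outside its plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            problems.append("inconsistent: age does not match birthday")

    return problems

# One record that combines the slide's three examples.
record = {
    "occupation": " ",
    "salary": -10,
    "age": 42,
    "birthday": date(1997, 3, 7),  # assumed MM/DD/YYYY
}
problems = find_problems(record, today=date(2013, 10, 23))  # assumed reference date
for problem in problems:
    print(problem)
```

Each check only flags a problem; deciding how to repair it (fill in, correct, or discard the value) is a separate cleaning step.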
Why Is Data Dirty?
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is analyzed
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modifying some linked data)
• Duplicate records also need data cleaning
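Duplicate records often differ only in case or spacing because they come from different sources. A minimal sketch of removing them, assuming hypothetical `name` and `city` fields and a simple normalization rule (real deduplication usually needs fuzzier matching):

```python
def normalize(record):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return tuple(str(record[field]).strip().lower() for field in ("name", "city"))

def drop_duplicates(records):
    """Keep the first record for each normalized key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical customer rows; the first two describe the same person.
customers = [
    {"name": "Ann Lee", "city": "Riyadh"},
    {"name": "ann lee ", "city": "riyadh"},
    {"name": "Omar Said", "city": "Jeddah"},
]
unique_customers = drop_duplicates(customers)
print(len(unique_customers))
```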
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
– Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
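A small worked example of how duplicate and missing data can distort a statistic. The salary figures are invented for illustration, and storing a missing salary as 0 is an assumed convention:

```python
from statistics import mean

# Hypothetical (employee id, salary) rows: employee 3 was loaded twice,
# and employee 4 has a missing salary stored as the placeholder 0.
rows = [(1, 3000), (2, 3000), (3, 9000), (3, 9000), (4, 0)]

# Mean computed directly on the dirty data.
raw_mean = mean([salary for _, salary in rows])

# Clean: drop the placeholder 0 and keep one row per employee id.
cleaned = {emp_id: salary for emp_id, salary in rows if salary > 0}
clean_mean = mean(list(cleaned.values()))

print(raw_mean, clean_mean)
```

The dirty mean is 4800 while the mean over the three valid, distinct employees is 5000, so a decision based on the raw figure would rest on a wrong number.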
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
• Broad categories:
– Intrinsic, contextual, representational, and accessibility
Major Tasks in Data...