IT433 Data Warehousing

Date Submitted: 10/23/2013 03:05 AM

IT433 Data Warehousing and Data Mining

— Data Preprocessing —

Data Preprocessing

• Why preprocess the data?

• Descriptive data summarization

• Data cleaning

• Data integration and transformation

• Data reduction

• Discretization and concept hierarchy generation

• Summary

Why Data Preprocessing?

• Data in the real world is dirty

– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

• e.g., occupation=“ ”

– noisy: containing errors or outliers

• e.g., Salary=“-10”

– inconsistent: containing discrepancies in codes or names

• e.g., Age=“42” Birthday=“03/07/1997”

• e.g., Was rating “1,2,3”, now rating “A, B, C”

• e.g., discrepancy between duplicate records
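The three kinds of dirty data above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the record layout, the field names, the helper functions `age_on` and `find_problems`, and the reference date are all assumptions made for illustration, and the birthday “03/07/1997” is read as month/day/year.

```python
from datetime import date

# Hypothetical records that mirror the slide's examples.
records = [
    {"id": 1, "occupation": "", "salary": 52000, "age": 42,
     "birthday": date(1997, 3, 7)},   # empty occupation, age/birthday mismatch
    {"id": 2, "occupation": "clerk", "salary": -10, "age": 30,
     "birthday": date(1983, 5, 1)},   # negative salary
    {"id": 3, "occupation": "nurse", "salary": 61000, "age": 45,
     "birthday": date(1968, 2, 9)},   # clean record
]


def age_on(birthday, today):
    """Age in whole years on the given reference date."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems


reference_date = date(2013, 10, 23)  # assumed analysis date
report = {r["id"]: find_problems(r, reference_date) for r in records}
```

Each check only flags a record. Deciding whether to repair, impute, or drop the value is a separate cleaning step.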

Why Is Data Dirty?

• Incomplete data may come from

– “Not applicable” data value when collected

– Different considerations between the time when the data was collected and when it is analyzed

– Human/hardware/software problems

• Noisy data (incorrect values) may come from

– Faulty data collection instruments

– Human or computer error at data entry

– Errors in data transmission

• Inconsistent data may come from

– Different data sources

– Functional dependency violation (e.g., modify some linked data)

• Duplicate records also need data cleaning
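A functional dependency violation can be detected by grouping rows on the determining attribute and checking whether more than one dependent value appears. The sketch below assumes a hypothetical dependency zip → city; the function name `fd_violations` and the sample rows are illustrative only.

```python
from collections import defaultdict


def fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical rows: the same zip code carries two different city spellings,
# as can happen when one copy of linked data is modified and the other is not.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]

violations = fd_violations(rows, "zip", "city")
```

Here `violations` contains only the zip code "10001", because it is associated with two distinct city values.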

Why Is Data Preprocessing Important?

• No quality data, no quality mining results!

– Quality decisions must be based on quality data

• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
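A small numeric illustration of how duplicates and missing values can skew a statistic. The rows and the choice to treat a missing amount as 0 in the naive case are assumptions made for this sketch. Other cleaning policies are possible.

```python
from statistics import mean

# Hypothetical (customer_id, purchase_amount) rows: customer 3 is recorded
# twice and customer 4 has a missing amount (None).
rows = [(1, 100.0), (2, 200.0), (3, 900.0), (3, 900.0), (4, None)]

# Naive average: keep the duplicate and count the missing amount as 0.
naive_mean = mean(amount if amount is not None else 0.0 for _, amount in rows)

# Cleaned average: drop exact duplicate rows, then skip missing amounts.
unique_rows = list(dict.fromkeys(rows))  # preserves first-seen order
cleaned_mean = mean(amount for _, amount in unique_rows if amount is not None)
```

With these rows the naive average is 420.0 and the cleaned average is 400.0, so the same source data gives two different answers depending on whether it was cleaned first.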

– Data warehouse needs consistent integration of quality data

• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality

• A well-accepted multidimensional view:

– Accuracy

– Completeness

– Consistency

– Timeliness

– Believability

– Value added

– Interpretability

– Accessibility

• Broad categories:

– Intrinsic, contextual, representational, and accessibility
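Some of these dimensions can be turned into simple measurements. As one possible sketch, completeness is computed below as the fraction of records with a non-missing value for an attribute. This definition, the function name `completeness`, and the sample records are assumptions for illustration and do not come from the slides.

```python
def completeness(records, attribute):
    """Fraction of records with a non-missing value for the attribute.

    A value counts as missing if the key is absent, None, or an empty string.
    """
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(attribute) not in (None, ""))
    return present / len(records)


# Hypothetical records: one of the four has no occupation.
sample = [
    {"occupation": "clerk", "salary": 40000},
    {"occupation": "", "salary": 52000},
    {"occupation": "nurse", "salary": None},
    {"occupation": "teacher", "salary": 45000},
]

occupation_completeness = completeness(sample, "occupation")
salary_completeness = completeness(sample, "salary")
```

Both attributes score 0.75 on this sample. Dimensions such as believability or interpretability are harder to quantify this directly.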

Major Tasks in Data...