Data Mining Project

Category: Science and Technology

Date Submitted: 09/09/2015 09:53 AM

DSCI 601

Classification

K-NN

The data set given to analyze was “Undecided Voter”, in which I was to determine the classification of 10 additional voters based on a sample of 500. The data set was partitioned into a 60% training set and a 40% validation set to limit the possibility of overfitting. The sample set had eight variables: undecided voter, age, homeowner, female, married, income, years of education, and church attendance. Before entering the data into XLMiner, I had to clean up the raw data and create binary variables for four of them (homeowner, female, married, church). The data was normalized so that the large values of the income variable would not skew the results. With a cutoff of 50%, a record is assigned to a class only if its match to that class is at least 50%.
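The normalization and cutoff steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not XLMiner's implementation: the report does not say which normalization or distance XLMiner applied, so z-score scaling and Euclidean distance are assumed here, the function names are hypothetical, and the six records (age, income) are made up for the example.

```python
import math
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from the training rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero on constant columns
    return mus, sigmas

def apply_scaler(rows, mus, sigmas):
    """Rescale each value to (value - mean) / standard deviation (assumed z-score)."""
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

def knn_classify(train_x, train_y, query, k=3, cutoff=0.5):
    """Return 1 when the share of class-1 records among the k nearest is >= cutoff."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    share_of_ones = sum(y for _, y in nearest) / k
    return 1 if share_of_ones >= cutoff else 0

# Made-up training records: [age, income]; 1 = undecided, 0 = not undecided.
train_raw = [[25, 30000], [30, 32000], [45, 90000], [50, 95000], [28, 31000], [48, 88000]]
train_y = [1, 1, 0, 0, 1, 0]

mus, sigmas = fit_scaler(train_raw)
train_x = apply_scaler(train_raw, mus, sigmas)

# New records are scaled with the training statistics before classification.
new_x = apply_scaler([[27, 33000], [47, 91000]], mus, sigmas)
labels = [knn_classify(train_x, train_y, q, k=3, cutoff=0.5) for q in new_x]
print(labels)
```

Without the scaling step, income differences in the tens of thousands would dominate the distance and age would have almost no influence, which is the skew the normalization is meant to prevent.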

The value of k is chosen by the lowest overall percent error on the validation set. In this example, k = 3 gives the lowest error, 13.5%, and is therefore the most effective. If several values of k give similarly low error rates, the lowest value of k is chosen.
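The selection rule for k can be written as a short sketch. The function name `choose_k` is hypothetical, and apart from the 13.5% error at k = 3 reported above, the error rates in the example are placeholders chosen only to show the tie-breaking behavior.

```python
def choose_k(validation_error):
    """Pick the k with the lowest validation error; ties go to the smallest k."""
    return min(validation_error, key=lambda k: (validation_error[k], k))

# Placeholder validation error rates per k (only the k = 3 value comes from the report).
validation_error = {1: 0.160, 3: 0.135, 5: 0.135, 7: 0.150}
best_k = choose_k(validation_error)
print(best_k)
```

Sorting on the pair (error, k) means that when two values of k have the same error, the smaller one is preferred.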

Reviewing the Confusion Matrix at k = 3, we achieve an overall percent error of 13.5%. The model is more accurate at classifying class 0 (“not undecided”) records; of the 76 samples tested in the other class, 25% were misclassified.
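The overall and per-class error rates read off a confusion matrix can be computed as in the sketch below. The function name `error_rates` is hypothetical, and the ten actual and predicted labels are made up for illustration; they are not the validation records from this project.

```python
from collections import Counter

def error_rates(actual, predicted):
    """Return (overall error, {class: error within that class}) from paired labels."""
    counts = Counter(zip(actual, predicted))  # confusion matrix as {(actual, predicted): n}
    overall = sum(n for (a, p), n in counts.items() if a != p) / len(actual)
    per_class = {}
    for cls in sorted(set(actual)):
        n_cls = sum(n for (a, _), n in counts.items() if a == cls)
        n_err = sum(n for (a, p), n in counts.items() if a == cls and p != cls)
        per_class[cls] = n_err / n_cls
    return overall, per_class

# Made-up labels: 1 = undecided, 0 = not undecided.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

overall, per_class = error_rates(actual, predicted)
print(overall, per_class)
```

The overall rate divides all misclassified records by the total, while each class rate divides by the number of records that actually belong to that class, which is why the two classes can show very different error rates.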

The final output for the ten samples to be classified as Undecided (1) or Not Undecided (0) is: 0,1,0,0,1,0,0,0,0,0.

Classification Tree

Using the same “Undecided Voter” data set, partitioned at 60% training and 40% validation, we analyze the output of XLMiner’s Single Tree Classification method. The data is normalized to eliminate skew caused by variables with larger values, and the minimum error tree is reviewed to prevent overfitting.

The Confusion Matrix for the validation set is shown below with an overall error rate of 7%.

Below is the Minimum Error Tree with the first new record to be...