Monday, October 21, 2013

More on the Kaggle SciKit Learn Competition

I have been off MOOCing. I have completed several Coursera MOOC classes and parts of several others. When there are so many good courses available, you have to be careful not to overextend yourself. I have just finished Dr. Peng's Computing for Data Analysis course. I'll be starting Dr. Leek's Data Analysis course next week. Both courses use R.

In preparation for Dr. Leek's class, I decided to take another look at the Kaggle SciKit Learn data set. In my March 19 post I wrote, "The data set from Kaggle is well structured. There are 40 features and 999 training examples. The feature data is all continuous and there are no missing values."  Then I proceeded to run machine learning algorithms on the entire data set. 

This really isn't the best way to handle this type of problem, so I wanted to go back and start from the beginning. 

With 40 features, it is very hard to get a feel for what is going on just by looking at the raw data. Here are three things that make sense to do right away.

R has a very nice summary command that gives the minimum, maximum, mean, median and quartiles of every column in the data set. Here's the output from the first four variables:



Note that the data file did not have any column headers, so when R read the data into a data frame it automatically assigned variable names (V1, V2 and so on). There isn't much of interest in these first four variables. There are no missing values, the ranges are fairly similar and the data all seem to be centered around zero. However, the summary statistics do show some variables that are very different from this. Here are two other variables from the set.

       V5                          
 Min.   :-16.4219  
 1st Qu.: -1.6760  
 Median :  0.8919 
 Mean   :  1.1374 
 3rd Qu.:  3.8832  
 Max.   : 17.5653 

      V13              
 Min.   :-14.679  
 1st Qu.: -5.047  
 Median : -2.120  
 Mean   : -1.988  
 3rd Qu.:  1.059  
 Max.   : 12.186

Here we can see that the range of these variables is much larger than that of the first four. Additionally, these variables are not centered at zero.
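
For reference, the summary step can be sketched in R roughly as follows. The file name train.csv is an assumption (the post does not name the file), and the sketch falls back to a synthetic stand-in data frame if that file is not present, so it runs either way. The header = FALSE argument is what makes R assign the names V1, V2 and so on.

```r
# Hypothetical file name; the post does not say what the file was called.
train_file <- "train.csv"

if (file.exists(train_file)) {
  # header = FALSE: the file has no header row, so R names the columns V1, V2, ...
  train <- read.csv(train_file, header = FALSE)
} else {
  # Synthetic stand-in (NOT the Kaggle data) so the sketch runs without the file.
  set.seed(1)
  train <- as.data.frame(matrix(rnorm(999 * 40), nrow = 999, ncol = 40))
}

# Min, quartiles, median, mean and max for the first four columns
print(summary(train[, 1:4]))

# Count missing values across the whole data frame
n_missing <- sum(is.na(train))
print(n_missing)

# Range width (max - min) and mean of each column, to spot the unusual ones
col_range <- sapply(train, function(x) diff(range(x)))
col_mean  <- sapply(train, mean)
print(head(sort(col_range, decreasing = TRUE), 3))
```

Sorting the columns by range width is a quick way to find variables like V5 and V13 without reading through all 40 summaries by eye.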

Next, I'll look at the distribution of the variables. You could go cross-eyed trying to check the distributions of all 40 features, but the summary data indicates that the data is well behaved. Here are the histograms of four different variables in the data set. I show Variable 1 since it is fairly representative of the majority of the variables in the set. I show Variables 5, 13 and 24 since they are the variables with the highest variability. The red line is the mean and the blue line is the median. Note that the variables are not on the same scale; I couldn't get R to put them all together otherwise. But there is no obvious skewness, and there are no obvious outliers.
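
A set of histograms like the ones described here can be drawn with base R graphics. This is only a sketch: it uses a synthetic stand-in data frame rather than the Kaggle data, and it assumes the columns carry R's default names, so Variables 1, 5, 13 and 24 are V1, V5, V13 and V24.

```r
# Synthetic stand-in data (NOT the Kaggle data); replace with the real data frame.
set.seed(1)
train <- as.data.frame(matrix(rnorm(999 * 40), nrow = 999, ncol = 40))

vars <- c("V1", "V5", "V13", "V24")  # the four variables discussed in the post

# 2 x 2 grid of panels; each panel gets its own axis scale
old_par <- par(mfrow = c(2, 2))
for (v in vars) {
  x <- train[[v]]
  hist(x, main = v, xlab = v, col = "grey")
  abline(v = mean(x),   col = "red",  lwd = 2)  # red line: mean
  abline(v = median(x), col = "blue", lwd = 2)  # blue line: median
}
par(old_par)  # restore the previous plot layout
```

One possible way to put the panels on a common scale is to pass the same xlim to every hist call, for example xlim = range(unlist(train[vars])).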

Finally, I look at the correlation matrix. The best way to look at this is with some type of color image that shows the correlation values between the variables. I made this plot with the lattice package.

Most of the linear correlations are positive. The scale in the positive direction only goes up to 0.6. There do not seem to be any obvious strong correlations in the data.
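
A plot along these lines can be made with cor and the lattice package. The post does not say which lattice function was used, so levelplot is an assumption here, as are the color palette and the synthetic stand-in data.

```r
library(lattice)  # ships with standard R installations

# Synthetic stand-in data (NOT the Kaggle data); replace with the real data frame.
set.seed(1)
train <- as.data.frame(matrix(rnorm(999 * 40), nrow = 999, ncol = 40))

# Pairwise Pearson (linear) correlations between all columns
cor_mat <- cor(train)

# Color image of the matrix; the color scale is fixed to [-1, 1]
p <- levelplot(cor_mat,
               at = seq(-1, 1, by = 0.1),
               col.regions = colorRampPalette(c("blue", "white", "red"))(100),
               xlab = "", ylab = "",
               scales = list(x = list(rot = 90)),
               main = "Correlation matrix")
print(p)  # lattice plots must be printed explicitly inside scripts

# Largest absolute off-diagonal correlation, as a numeric check on the picture
off_diag <- cor_mat[upper.tri(cor_mat)]
max_abs_cor <- max(abs(off_diag))
print(max_abs_cor)
```

Printing the largest absolute off-diagonal value gives a number to set beside the color scale, which is easier to trust than reading shades off a 40 by 40 image.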

This is the information that convinced me to go straight to analysis of the data. Clearly, I did not take into account the "curse of dimensionality". My next step is to reduce the number of variables that I use in the analysis.