Monday, October 21, 2013

More on the Kaggle SciKit Learn Competition

I have been off MOOCing. I have completed several Coursera MOOC classes and parts of several others. When there are so many good courses available, you have to be careful not to overextend yourself. I have just finished Dr. Peng's Computing for Data Analysis course. I'll be starting Dr. Leek's Data Analysis course next week. Both courses use R.

In preparation for Dr. Leek's class, I decided to take another look at the Kaggle SciKit Learn data set. In my March 19 post I wrote, "The data set from Kaggle is well structured. There are 40 features and 999 training examples. The feature data is all continuous and there are no missing values."  Then I proceeded to run machine learning algorithms on the entire data set. 

This really isn't the best way to handle this type of problem, so I wanted to go back and start from the beginning. 

When you have a data set this big, it is very hard to get a feel for what is going on. Here are three things that make sense to do right away.

R has a very nice summary command that gives the minimum, maximum, mean, median and quartiles of every column in the data set. Here's the output for the first four variables:



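A minimal sketch of how that summary can be produced (the file name train.csv and the absence of a header row are assumptions on my part):

# read the Kaggle training data; with no header row, R assigns the names V1, V2, ...
train <- read.csv("train.csv", header = FALSE)

# minimum, quartiles, median, mean and maximum for the first four variables
summary(train[, 1:4])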
Note that the data file did not have any column labels. When R read the data into a data frame, it automatically assigned variable names (V1, V2 and so on). There isn't much of interest in these first four variables: there are no missing values, the ranges are fairly similar and the data all seem to be centered around zero. However, the summary statistics do show some variables that are very different from this. Here are two other variables from the set.

       V5                          
 Min.   :-16.4219  
 1st Qu.: -1.6760  
 Median :  0.8919 
 Mean   :  1.1374 
 3rd Qu.:  3.8832  
 Max.   : 17.5653 

      V13              
 Min.   :-14.679  
 1st Qu.: -5.047  
 Median : -2.120  
 Mean   : -1.988  
 3rd Qu.:  1.059  
 Max.   : 12.186    
Here we can see that the range of these variables is much larger than that of the first four. Additionally, these variables are not centered at zero.

Next, I'll look at the distribution of the variables. You could go cross-eyed trying to check the distribution of all 40 features, but the summary data indicates that the data is well behaved. Here are histograms of four different variables in the data set. I show Variable 1 since it is fairly representative of the majority of the variables in the set. I show Variables 5, 13 and 24 since they are the variables with the highest variability. The red line is the mean and the blue line is the median. Note that the variables are not on the same scale; I couldn't get R to put them all in one figure otherwise. But there is no obvious skewness and there are no obvious outliers.
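A rough sketch of how such histograms can be drawn (the 2 x 2 layout is my own choice; the colors follow the description above):

# one histogram per variable, with the mean (red) and median (blue) marked
par(mfrow = c(2, 2))
for (nm in c("V1", "V5", "V13", "V24")) {
  hist(train[[nm]], main = nm, xlab = nm)
  abline(v = mean(train[[nm]]), col = "red", lwd = 2)     # mean
  abline(v = median(train[[nm]]), col = "blue", lwd = 2)  # median
}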

Finally, I look at the correlation matrix. The best way to look at this is with some type of color image that shows the correlation values between the variables. I made this plot with the lattice package.
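A minimal sketch of that kind of plot (assuming the 40 feature columns sit in the train data frame):

library(lattice)
# color image of the 40 x 40 correlation matrix of the features
levelplot(cor(train[, 1:40]))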

Most of the linear correlations are positive. The scale in the positive direction only goes up to 0.6. There do not seem to be any obvious strong correlations in the data.

This is the kind of information that originally convinced me to go straight to analysis of the data. Clearly, I did not take into account the "curse of dimensionality". My next step is to reduce the number of variables that I use in the analysis.



Friday, July 12, 2013

Constructing a crude ensemble method in R and tuning an SVM

I'm still working to improve my Kaggle sci kit learn score. As I've said in a previous post, I think I need to use an ensemble method. The question is, what method?

I tried AdaBoost with decision trees, but it doesn't improve on the test error given by the SVM model.

I decided to try fitting many different models and combining them in some way. I used the sonar data for this since it is a much more manageable size than the Kaggle competition data.

The first thing I decided to try was a straight linear combination. First, I ran all of the different models (SVM, decision tree, knn, logistic regression and naive Bayes). I generated the predicted y values for each of these models and added them together. This is not an easy thing to do in R. Remember I said in a previous post that R recognizes categorical variables? Unfortunately, once R has decided that a variable is categorical, it refuses to do math on it; that wouldn't "make sense". You have to change the factor back to a numeric value. This is not at all intuitive and can actually give you unexpected results. For example, when I tried to change one of the predicted vectors back to numeric, I got 0. Plain old vanilla constant 0.

Here is the code that gave me consistent results. Assume that your predicted response is in the vector y_pred. Then

numeric_y_pred <- as.numeric(as.character(y_pred))   # convert the factor's levels back to numbers

gives you a vector that you can add.

But a linear combination of these models gave me a test error of 14.1%. This is better than most of the models, but not as good as the SVM.

So I decided to use the error to weight the vectors. I used the same formula as in the AdaBoost code:

alpha <- .5*log((1-error)/error)

I multiplied each predicted vector by its respective alpha value. When I did this, I got a test error of 9%. Eureka, I thought.
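A minimal sketch of that weighted combination, assuming the classes are coded as -1/1 and that each model's predictions (y_pred_svm, y_pred_tree, ...) and error rates (err_svm, err_tree, ...) have already been computed:

# undo the factor coding so the votes can be added
to_num <- function(p) as.numeric(as.character(p))

# AdaBoost-style weight for a model with a given error rate
alpha <- function(error) 0.5 * log((1 - error) / error)

# weight each model's votes and sum them
combined <- alpha(err_svm)   * to_num(y_pred_svm) +
            alpha(err_tree)  * to_num(y_pred_tree) +
            alpha(err_knn)   * to_num(y_pred_knn) +
            alpha(err_logit) * to_num(y_pred_logit) +
            alpha(err_nb)    * to_num(y_pred_nb)

# the ensemble's class is the sign of the weighted sum
y_ensemble <- sign(combined)
mean(y_ensemble != to_num(y_test))   # test error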

However, this did not translate to the Kaggle data. The tuned SVM in R had a test error of 13.3%, but the ensemble had a test error of 14%.

Speaking of the tuned SVM model, I had a very difficult time with this. Every time I tried to run it, I got an error. I finally tweaked the code and got it to run. The code that I used is not the same as the code in the R help. I don't know if the package was updated and the help was not. Here is the code that worked for me:

library(e1071)   # tune.svm comes from the e1071 package
fit7 <- tune.svm(x, y, gamma = 10^(-3:-1), cost = 10^(1:5))

I submitted the tuned SVM model from R to the Kaggle competition and it did not perform as well as the tuned SVM from Python.

Back to the digital drawing board.

Using sample weights to fit the SVM model in Python


In order to write an AdaBoost code for a model, you need to be able to fit the model using sample weights and to generate the probability distribution of the outcomes. As far as I know, R doesn't have an SVM model that does this, but sci kit learn does.

Unfortunately, I was not able to make it work.

I used the sonar data from David Mease's class. First, I preprocessed the x training data so that each feature has mean 0 and standard deviation 1. Then I used the sci kit learn API to scale the test data in the same way. When I ran the SVM without sample weights, the training error was 0 and the test error was 10.3%. Then I constructed a weight vector with all weights equal to 1/N, where N = 130 is the number of training examples.

When I fit the model with the sample weights, the training error and test error were both awful: the training error was 49.3% and the test error was 42.3%. The test error was less than the training error, so clearly something is not right here. Someone on the sci kit learn mailing list suggested that I scale C by the length of the training data, but that didn't make any difference.

So this is not a usable option right now. If I find out what is going wrong, I'll update the blog.


The code for this post can be found on github at Link.

Sunday, June 23, 2013

R vs Python for machine learning

In my last post, I talked about tuning an SVM for the Kaggle competition. I submitted my tuned SVM. My score on the leaderboard is .90350. Not only am I 198th on the leaderboard and sinking fast, but I didn't even reach the SVM Benchmark score. Additionally, the top person has a score of .99031.

I figured that only an ensemble method would get me to a higher score and I started to experiment with these methods. I never managed to come up with an ensemble that even matched my original submission.  While I did this, I noticed some things about sci kit learn in Python that made me start to think about looking for other tools.

I decided to try R and Rapid Miner. Rapid Miner has not been a successful experience; I can't seem to get past the set-up-repository/import-data stage. I have had much more success with R. Most of this is due to a wonderful set of videos by David Mease. If you are interested in learning R for data mining and machine learning, his videos are pure gold. There are 13 videos on Youtube. Not only does he show you how to use R, but he has all the example data sets online so that you can play along. He also does a wonderful job of explaining what benchmarks to use.

David uses a subset of a well-known sonar data set, with 130 observations in the training set, 78 observations in the test set and 60 features. He goes over several methods with the same data set. I still have one more video, but so far he has covered decision trees, SVM and k nearest neighbors. He uses k nearest neighbors with k=1 as a benchmark; this is the default in R. For this data set, it gives a misclassification rate of 21%. This is better than the decision tree misclassification rate, which is about 30%. But the SVM should be able to beat the untuned k nearest neighbors.

I used this same sonar data set to compare results in R and Python.

k nearest neighbors
Misclassification rate for R: 21%
Misclassification rate for Python: could not get this. I set n_neighbors=1, but I got this warning:


C:\Python27\lib\site-packages\sklearn\neighbors\classification.py:131: NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
  neigh_dist, neigh_ind = self.kneighbors(X)

The default distance in k nearest neighbors is the Euclidean distance, so the data should be scaled so that each variable has the same variance. R does this automatically; Python requires you to scale the data yourself.
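For reference, a 1-nearest-neighbor fit is only a couple of lines in R (a sketch using the class package; I don't know which package the videos use, and x_train, x_test, y_train, y_test are assumed to hold the sonar split):

library(class)
# predict each test point from its single nearest training neighbor
y_pred <- knn(train = x_train, test = x_test, cl = y_train, k = 1)
mean(y_pred != y_test)   # misclassification rate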

Decision Tree

The following table shows the results I got:


Depth   R training accuracy   R test accuracy   Python training accuracy   Python test accuracy
  1          .7769                .7179                 .7769                     .7179
  2          .80                  .7051                 .8077                     .7051
  3          .8615                .6538                 .8923                     .6667
  4          .8846                .6923                 .9385                     .7179
  5          .8846                .6923                 .9846                     .7436
  6          N/A                  N/A                   1.0                       .7308


Note that the two implementations give the same test accuracy at max depths of 1 and 2. As the max depth increases, it looks like sci kit learn gives the better results. However, the test accuracy stays fairly flat for both models while the Python model's training accuracy increases to 1.0. It certainly looks like max depths 4 and 5 in Python have overfit the data. It would be nice to compare a picture of the two trees. The tree in R is quite easy to generate; Python requires some graphics modules that are fairly involved to use, at least for me, and I couldn't get either one to work. The R model won't fit max depth 6 because of overfitting issues.
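For what it's worth, here is a sketch of fitting and drawing a depth-limited tree in R with the rpart package (the sonar_train/sonar_test data frames and the Class column are assumptions on my part; the videos may use a different package):

library(rpart)
# fit a classification tree with a capped depth
fit <- rpart(Class ~ ., data = sonar_train,
             control = rpart.control(maxdepth = 3))

# quick picture of the fitted tree
plot(fit); text(fit)

# test accuracy
pred <- predict(fit, sonar_test, type = "class")
mean(pred == sonar_test$Class)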

Support Vector Machines

The first thing I did was run a default support vector machine in R and Python. Both programs use an rbf kernel as the default.

R scales the data and uses cost=1 and gamma = 1/(number of features) as default values. The untuned SVM gives a misclassification error of 1.5% for the training data and about 13% for the test data.
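A minimal sketch of that default fit with the e1071 package (the x_train/y_train/x_test/y_test objects are assumed as before):

library(e1071)
# default SVM: rbf kernel, cost = 1, gamma = 1/ncol(x_train), inputs scaled internally
fit <- svm(x_train, y_train)
mean(predict(fit, x_train) != y_train)   # training error
mean(predict(fit, x_test)  != y_test)    # test error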

Python doesn't scale the data and neither did I. (Maybe this is not a fair comparison, but it is an extra step in sci kit learn that isn't required in R.) The Python default values are C=1 (cost=1) and gamma=0. This untuned SVM gives a misclassification error of about 30% for the training data and about 36% for the test data.

I've already talked in a previous post about how the Python grid search crashes my computer. R has a procedure for tuning the SVM, but it produces an error when I try to run it.

Beyond the questions I have about how sci kit learn models fit the data, there is the additional problem of categorical data. R usually recognizes categorical data, and if it doesn't, you can set a variable to be categorical and R will know how to handle it. Python requires you to transform categorical data yourself, and it is a kludgy process. There is a module called OneHotEncoder, but you can't run it unless you first transform all of your text data to numeric.
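On the R side, marking a column as categorical is a one-liner (a sketch with a made-up data frame and column name):

# tell R that the column is categorical; modeling functions will then treat it as a factor
df$color <- factor(df$color)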

I still have a lot to learn about machine learning in R. But from what I've seen so far, I think I'll stick with R when I want to run a machine learning algorithm.