Saturday, January 4, 2014

Analyzing Kaggle sci kit learn data using the R caret package

In previous posts, I have struggled with improving my score on the Kaggle sci kit learn competition. Using the R caret package, I have managed to get an accuracy score of 0.92772. Not only does this beat my previous best of 0.90350, but it gets me above the SVM benchmark.

What is the R caret package? I'll use the author Max Kuhn's own words to explain it.

"The caret package, short for classi cation and regression training, contains numerous
tools for developing predictive models using the rich set of models available in R. The
package focuses on simplifying model training and tuning across a wide variety of modeling
techniques. It also includes methods for pre-processing training data, calculating variable
importance, and model visualizations. "[1]

He has some excellent documentation (better than the standard R documentation, although I link to that here also) and some easy-to-understand examples.

Standard R documentation
A very nice easy to understand website
A paper on the package by Max Kuhn

Below is the webpage created by the R markdown file on this project. It shows the code as well as the results.


Analyzing Kaggle sci kit learn competition data set using caret package.

This file shows the steps and the code I used to analyze the data set. The score for this model is 0.92772 on the Kaggle leaderboard.
Summary
I used the R package caret. I preprocessed the data, split it into training and test sets, did feature selection using random forests, then used the smaller data set in an SVM model.
Details
I downloaded the training set from the Kaggle website.
library(caret)
## Loading required package: cluster
## Loading required package: foreach
## Loading required package: lattice
## Loading required package: plyr
## Loading required package: reshape2
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
data <- read.csv("train.csv", header = FALSE)  #reads in all data, class label is last column
label <- data[, 41]  #creates a factor vector for the indexing
label <- as.factor(label)
features <- data[, 1:40]
set.seed(1)
index.tr <- createDataPartition(label, p = 3/4, list = FALSE)  #creates an index of rows
# creates the train and test feature set.
train.f <- features[index.tr, ]
test.f <- features[-index.tr, ]
# creates the train and test label
train.label <- label[index.tr]
test.label <- label[-index.tr]
One of the really nice things about the caret package is that it has a very easy way to check for and remove highly correlated features. It turns out that there were no highly correlated variables in this data set.
# Remove correlations higher than .90
featureCorr <- cor(train.f)
highCorr <- findCorrelation(featureCorr, 0.9)
# no high correlations in this data set
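For completeness, here is a sketch of what the removal step would have looked like if findCorrelation had flagged any columns (it did not for this data set):
# drop any flagged columns from both feature sets before modeling
if (length(highCorr) > 0) {
    train.f <- train.f[, -highCorr]
    test.f <- test.f[, -highCorr]
}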
This code centers and scales the training data, then uses this information to transform the test data. I will also use it to transform the test data set that I use to submit the solution to Kaggle.
# Center and scale data using CARET package preprocessing
xTrans <- preProcess(train.f, method = c("center", "scale"))
train.f <- predict(xTrans, train.f)
test.f <- predict(xTrans, test.f)
I used the random forest model to do feature selection.
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
rfeFuncs <- rfFuncs
rfeFuncs$summary <- twoClassSummary
rfe.control <- rfeControl(functions = rfeFuncs, method = "repeatedcv", repeats = 4, 
    verbose = FALSE, returnResamp = "final")
rfeProfile <- rfe(train.f, train.label, sizes = 10:15, rfeControl = rfe.control)
## Warning: Metric 'Accuracy' is not created by the summary function; 'ROC' will be used instead
## Warning: executing %dopar% sequentially: no parallel backend registered
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
plot(rfeProfile, pch = 19, type = "o", col = "blue", main = "Feature selection")
[Plot: Feature selection (resampled ROC vs. number of features)]
train.best.f <- train.f[, predictors(rfeProfile)]
test.best.f <- test.f[, predictors(rfeProfile)]
The plot shows that the best ROC is achieved when using 14 variables. I created new training and test sets that use only the best 14 features.
Next, I'll train an SVM model on the new training set.
library(kernlab)
set.seed(2)
rbfSVM2 <- train(x = train.best.f, y = train.label, method = "svmRadial", tuneLength = 8, 
    trControl = trainControl(method = "repeatedcv", repeats = 5), metric = "Kappa")
## Loading required package: class
print(rbfSVM2, printCall = FALSE)
## 751 samples
##  14 predictors
##   2 classes: '-1', '1' 
## 
## No pre-processing
## Resampling: Cross-Validation (10 fold, repeated 5 times) 
## 
## Summary of sample sizes: 676, 676, 675, 677, 676, 675, ... 
## 
## Resampling results across tuning parameters:
## 
##   C    Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.2  0.9       0.8    0.03         0.07    
##   0.5  0.9       0.8    0.03         0.06    
##   1    0.9       0.8    0.03         0.05    
##   2    0.9       0.9    0.02         0.05    
##   4    0.9       0.9    0.02         0.04    
##   8    0.9       0.9    0.03         0.05    
##   20   0.9       0.9    0.03         0.06    
##   30   0.9       0.8    0.03         0.05    
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07041
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were C = 4 and sigma = 0.07.
The sigma value is calculated using the kernlab package and held constant. The only tuning parameter is C.
Here's a plot of the results:
plot(rbfSVM2, pch = 19, ylim = c(0.8, 1), main = "Best 14 features")
[Plot: Best 14 features (resampled performance vs. cost C)]
I checked the accuracy of the model using the held out test data.
test.pred <- predict(rbfSVM2, test.best.f)
test.acc <- sum(test.pred == test.label)/length(test.label)
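caret also provides a confusionMatrix() function that gives a fuller breakdown than a single accuracy number; a minimal check on the held-out split might look like this:
test.acc  # proportion of held-out rows classified correctly
confusionMatrix(test.pred, test.label)  # accuracy, Kappa, sensitivity, specificity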
In this last set of code, I load the full test set and generate a prediction, then write it to a file. You must be careful to repeat all of the transformations on your data.
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
test.unlabel <- read.csv("test.csv", header = FALSE)
test.unlabel <- predict(xTrans, test.unlabel)  #center and scale
test.unlabel <- test.unlabel[, predictors(rfeProfile)]  #use 14 best features
pred.unlabel <- predict(rbfSVM2, test.unlabel)
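The write-to-file step isn't shown above; here is a minimal sketch of it (the file name and the one-prediction-per-row format are my assumptions about what the submission expects):
pred.out <- as.numeric(as.character(pred.unlabel))  # factor of '-1'/'1' back to numbers
write.table(pred.out, "svm_caret_submission.csv", sep = ",", row.names = FALSE, col.names = FALSE)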
Reference
[1] Kuhn, Max. "Building Predictive Models in R Using the caret Package." Journal of Statistical Software, Vol. 28, Issue 5, November 2008.

Monday, October 21, 2013

More on the Kaggle SciKit Learn Competition

I have been off MOOCing. I have completed several Coursera MOOC classes and parts of several others. When there are so many good courses available, you have to be careful not to overextend yourself. I have just finished Dr. Peng's Computing for Data Analysis course. I'll be starting Dr. Leek's Data Analysis course next week. Both courses use R.

In preparation for Dr. Leek's class, I decided to take another look at the Kaggle SciKit Learn data set. In my March 19 post I wrote, "The data set from Kaggle is well structured. There are 40 features and 999 training examples. The feature data is all continuous and there are no missing values."  Then I proceeded to run machine learning algorithms on the entire data set. 

This really isn't the best way to handle this type of problem, so I wanted to go back and start from the beginning. 

When you have a data set this big, it is very hard to get a feel for what is going on. Here are three things that make sense to do right away.

R has a very nice summary command that gives the mean, median and quartile statistics of every column in the data set. Here's the output from the first four variables:
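The call itself is just base R's summary(); here is a sketch, assuming the training file is read in the same way as in the post above:
data <- read.csv("train.csv", header = FALSE)  # 999 rows, 40 features plus the class label
summary(data[, 1:4])  # min, quartiles, median, and mean for V1 through V4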



Note that the data file did not have any column headers. When R read the data into a data frame, it automatically assigned variable names. There isn't much of interest in these first four variables: there are no missing values, the ranges are fairly similar, and the data all seems to be centered around zero. However, the summary statistics do show some variables that are very different from this. Here are two other variables from the set.

       V5                          
 Min.   :-16.4219  
 1st Qu.: -1.6760  
 Median :  0.8919 
 Mean   :  1.1374 
 3rd Qu.:  3.8832  
 Max.   : 17.5653 

      V13              
 Min.   :-14.679  
 1st Qu.: -5.047  
 Median : -2.120  
 Mean   : -1.988  
 3rd Qu.:  1.059  
 Max.   : 12.186    
Here we can see that the range of these variables is much larger than that of the first four. Additionally, these variables are not centered at zero.

Next, I'll look at the distribution of the variables. You could go cross-eyed trying to check the distribution of all 40 features, but the summary data indicates that the data is well behaved. Here are histograms of four different variables in the data set. I show Variable 1 since it is fairly representative of the majority of the variables in the set, and Variables 5, 13 and 24 since they are the variables with the highest variability. The red line is the mean and the blue line is the median. Note that the variables are not on the same scale; I couldn't get R to put them all in one figure otherwise. But there is no obvious skewness and there are no obvious outliers.
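The plotting code isn't in the post, but here is a sketch of one way to build that panel in base R (reusing the data frame from above; the exact histogram settings are my own choices):
vars <- c(1, 5, 13, 24)  # the four variables discussed above
par(mfrow = c(2, 2))
for (v in vars) {
    hist(data[, v], main = paste0("V", v), xlab = "", col = "grey")
    abline(v = mean(data[, v]), col = "red", lwd = 2)    # mean
    abline(v = median(data[, v]), col = "blue", lwd = 2)  # median
}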

Finally, I look at the correlation matrix. The best way to look at this is with some type of color image that shows the correlation values between the variables. I made this plot with the lattice package.
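The exact plotting call isn't shown in the post; one way to draw this kind of image with lattice is a levelplot of the correlation matrix, sketched below:
library(lattice)
corr.mat <- cor(data[, 1:40])  # 40 x 40 correlation matrix of the features
levelplot(corr.mat, xlab = "", ylab = "", main = "Feature correlations")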

Most of the linear correlations are positive. The scale in the positive direction only goes up to 0.6. There do not seem to be any obvious strong correlations in the data.

This is the information that convinced me to go straight to analysis of the data. Clearly, I did not take into account the "curse of dimensionality". My next step is to reduce the number of variables that I use in the analysis.



Friday, July 12, 2013

Constructing a crude ensemble method in R and tuning an SVM

I'm still working to improve my Kaggle sci kit learn score. As I've said in a previous post, I think I need to use an ensemble method. The question is, what method?

I tried a decision tree adaboost, but it didn't improve on the test error given by the SVM model.

I decided to try fitting many different models and combining them in some way. I used the sonar data for this since it is a much more manageable size than the kaggle competition data.

The first thing I decided to try was a straight linear combination. First, I ran all of the different models (SVM, decision tree, kNN, logistic regression and naive Bayes). I generated the predicted y values for each of these models and added them together. This is not an easy thing to do in R. Remember I said in a previous post that R recognizes categorical variables? Unfortunately, once R has decided that a variable is categorical, it refuses to do math on it. It doesn't "make sense". You have to change the factor back to a numeric value. This is not at all intuitive and can actually give you unexpected results. For example, when I tried to change one of the predicted vectors back to numeric, I got 0. Plain old vanilla constant 0.

Here is the code that gave me consistent results. Assume that your predicted response is in the vector y_pred. Then

numeric_y_pred <- as.numeric(as.character(y_pred))

gives you a vector that you can add.
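Here is a sketch of one way to set up that plain combination (the prediction vectors y_svm, y_tree, y_knn, y_logit and y_nb are hypothetical names, and I'm assuming the classes are coded as -1/1 factors, as in this competition):
models <- list(y_svm, y_tree, y_knn, y_logit, y_nb)  # hypothetical per-model predictions
votes <- Reduce(`+`, lapply(models, function(p) as.numeric(as.character(p))))
y_ensemble <- sign(votes)  # majority vote: positive sum means 1, negative sum means -1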

But a linear combination of these models gave me a test error of 14.1%. This is better than most of the models, but not as good as the SVM.

So I decided to use the error to weight the vectors. I used the same formula as in the adaboost code:
alpha <- .5*log((1-error)/error). I multiplied each predicted vector by its respective alpha value. When I did this, I got a test error of 9%. Eureka, I thought.
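A sketch of that weighted version, continuing from the hypothetical prediction vectors above (the error rates here are placeholders, not my actual numbers):
errors <- c(0.12, 0.25, 0.20, 0.22, 0.24)  # placeholder test errors, one per model
alphas <- 0.5 * log((1 - errors) / errors)  # lower error gives a larger weight
weighted <- mapply(function(p, a) a * as.numeric(as.character(p)), models, alphas)
y_weighted <- sign(rowSums(weighted))  # sign of the weighted sum of votes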

However, this did not translate to the kaggle data. The tuned SVM in R had a test error of 13.3% but the ensemble had a test error of 14%.

Speaking of the tuned SVM model, I had a very difficult time with this. Every time I tried to run it, I got an error. I finally tweaked the code and got it to run. The code that I used is not the same as the code in the R help. I don't know if the package was updated and the help was not. Here is the code that worked for me:

fit7 <- tune.svm(x,y,gamma=10^(-3:-1),cost=10^(1:5))
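Once tune.svm finishes, the tuning object from e1071 can be inspected and used directly; here is a short sketch (x.test is a hypothetical held-out feature matrix):
summary(fit7)  # cross-validated error for every gamma/cost combination
fit7$best.parameters  # the winning gamma and cost
pred <- predict(fit7$best.model, x.test)  # model refit with the best parameter combination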

I submitted the tuned SVM model from R to the kaggle competition and it did not perform as well as the tuned SVM from Python.

Back to the digital drawing board.

Using sample weights to fit the SVM model in Python


In order to write adaboost code for a model, you need to be able to fit the model using sample weights and to generate the probability distribution of the outcomes. As far as I know, R doesn't have an SVM implementation that does this, but sci kit learn does.

Unfortunately, I was not able to make it work.

I used the sonar data from David Mease's class. First, I preprocessed the x training data so the mean and standard deviation are 0 and 1 respectively. Then I used the API to scale the test data. When I ran the SVM without sample weights, the training error was 0 and the test error was 10.3%. Then I constructed a weight vector so that the weights are all 1/N. N in this case is 130.

When I fitted the data with the sample weights, the training error and test error were both awful. The training error was 49.3% and the test error was 42.3%. The test error was less than the training error. Clearly something is not right here. Someone on the sci kit learn mailing list suggested that I scale C by the length of the training data, but that didn't make any difference.

So this is not a usable option right now. If I find out what is going wrong, I'll update the blog.


The code for this post can be found on github at Link.