Saturday, January 4, 2014

Analyzing Kaggle scikit-learn data using the R caret package

In previous posts, I struggled to improve my score on the Kaggle scikit-learn competition. Using the R caret package, I have managed to get an accuracy score of 0.92772. Not only does this beat my previous best of 0.90350, it also gets me above the SVM benchmark.

What is the R caret package? I'll use the words of its author, Max Kuhn, to explain it.

"The caret package, short for classification and regression training, contains numerous
tools for developing predictive models using the rich set of models available in R. The
package focuses on simplifying model training and tuning across a wide variety of modeling
techniques. It also includes methods for pre-processing training data, calculating variable
importance, and model visualizations." [1]

He has some excellent documentation (better than the standard R documentation, although I link to that here as well) and some easy-to-understand examples.

Standard R documentation
A very nice easy to understand website
A paper on the package by Max Kuhn

Below is the webpage created by the R markdown file on this project. It shows the code as well as the results.

Analyzing Kaggle sci kit learn competition data set using caret package.


This file shows the steps and the code I used to analyze the data set. This model scores 0.92772 on the Kaggle leaderboard.
Summary
I used the R package caret. I split the data into training and test sets, preprocessed it, did feature selection using random forests, then used the reduced data set in an SVM model.
I downloaded the training set from the Kaggle website.
library(caret)  # loads caret and the packages it depends on
## Loading required package: cluster
## Loading required package: foreach
## Loading required package: lattice
## Loading required package: plyr
## Loading required package: reshape2
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
data <- read.csv("train.csv", header = FALSE)  #reads in all data, class label is last column
label <- data[, 41]  #creates a factor vector for the indexing
label <- as.factor(label)
features <- data[, 1:40]
set.seed(1)
# creates an index of training rows (the name inTrain is assumed; the original name was lost)
inTrain <- createDataPartition(label, p = 3/4, list = FALSE)
# creates the train and test feature set.
train.f <- features[inTrain, ]
test.f <- features[-inTrain, ]
# creates the train and test label
train.label <- label[inTrain]
test.label <- label[-inTrain]
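As a small illustration of what createDataPartition does, here is a sketch on made-up data. The toy.* names and the 60/40 class mix are assumptions for the example and have nothing to do with the Kaggle file. The point is that the split is stratified, so each class should keep roughly the same share in both pieces.

```r
library(caret)

set.seed(1)
# Toy stand-in for the real data: 100 rows, 4 features, two classes
toy.features <- as.data.frame(matrix(rnorm(400), ncol = 4))
toy.label <- factor(rep(c(-1, 1), times = c(60, 40)))

# Stratified split: about 3/4 of each class goes to the training piece
toy.index <- createDataPartition(toy.label, p = 3/4, list = FALSE)
toy.train <- toy.features[toy.index, ]
toy.test <- toy.features[-toy.index, ]

# Class proportions in the two pieces, for comparison
prop.table(table(toy.label[toy.index]))
prop.table(table(toy.label[-toy.index]))
```

Negative indexing with the same index is what guarantees that no row ends up in both sets.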
One of the really nice things about the caret package is that it has a very easy way to check for and remove highly correlated features. It turns out that there were no highly correlated variables in this data set.
# Remove correlations higher than .90
featureCorr <- cor(train.f)
highCorr <- findCorrelation(featureCorr, 0.9)
# no high correlations in this data set
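Since nothing was flagged here, the removal step never had to run. For a data set where findCorrelation does return something, a minimal sketch might look like the following. The toy data frame and its column names are made up for illustration; column b is built to be almost a copy of column a.

```r
library(caret)

set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a,
                  b = a + rnorm(200, sd = 0.05),  # nearly identical to a
                  c = rnorm(200))

toyCorr <- cor(toy)
toyHigh <- findCorrelation(toyCorr, cutoff = 0.9)  # column indices suggested for removal

# Guard the empty case: toy[, -integer(0)] would select zero columns
toy.reduced <- if (length(toyHigh) > 0) toy[, -toyHigh, drop = FALSE] else toy
names(toy.reduced)
```

The length check matters because negating an empty index vector in R does not mean "drop nothing"; it selects no columns at all.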
This code centers and scales the training data, then uses the same centering and scaling values to transform the held-out test data. I will also use them to transform the unlabeled test set that I submit to Kaggle.
# Center and scale data using CARET package preprocessing
xTrans <- preProcess(train.f, method = c("center", "scale"))
train.f <- predict(xTrans, train.f)
test.f <- predict(xTrans, test.f)
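To make the idea concrete, here is a base R sketch of roughly what centering and scaling with training-set statistics amounts to. It uses made-up data and hypothetical toy.* names, and it is an approximation of the idea rather than caret's actual internals.

```r
set.seed(1)
toy.train <- data.frame(x1 = rnorm(50, mean = 10, sd = 2), x2 = runif(50))
toy.test  <- data.frame(x1 = rnorm(20, mean = 10, sd = 2), x2 = runif(20))

# The means and standard deviations come from the training rows only
train.mean <- colMeans(toy.train)
train.sd <- apply(toy.train, 2, sd)

# The same shift and scale are then applied to both sets
toy.train.s <- as.data.frame(scale(toy.train, center = train.mean, scale = train.sd))
toy.test.s  <- as.data.frame(scale(toy.test,  center = train.mean, scale = train.sd))

colMeans(toy.train.s)  # training columns are centered
colMeans(toy.test.s)   # test columns are only approximately centered
```

The test columns are not expected to come out with mean exactly 0, because they are transformed with the training statistics and not their own.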
I used the random forest model to do feature selection.
## Type 'citation("pROC")' for a citation.
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##     cov, smooth, var
rfeFuncs <- rfFuncs
rfeFuncs$summary <- twoClassSummary
rfe.control <- rfeControl(functions = rfeFuncs, method = "repeatedcv", repeats = 4, 
    verbose = FALSE, returnResamp = "final")
rfeProfile <- rfe(train.f, train.label, sizes = 10:15, rfeControl = rfe.control)
## Warning: Metric 'Accuracy' is not created by the summary function; 'ROC' will be used instead
## Warning: executing %dopar% sequentially: no parallel backend registered
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
plot(rfeProfile, pch = 19, type = "o", col = "blue", main = "Feature selection")
plot of chunk unnamed-chunk-4
# keep only the selected features (the names train.best and test.best are assumed; the original names were lost)
train.best <- train.f[, predictors(rfeProfile)]
test.best <- test.f[, predictors(rfeProfile)]
The plot shows that the best ROC is achieved when using 14 variables. I created new training and test sets that use only the best 14 features.
Next, I'll train an SVM model on the new training set.
rbfSVM2 <- train(x = train.best, y = train.label, method = "svmRadial", tuneLength = 8, 
    trControl = trainControl(method = "repeatedcv", repeats = 5), metric = "Kappa")
## Loading required package: class
print(rbfSVM2, printCall = FALSE)
## 751 samples
##  14 predictors
##   2 classes: '-1', '1' 
## No pre-processing
## Resampling: Cross-Validation (10 fold, repeated 5 times) 
## Summary of sample sizes: 676, 676, 675, 677, 676, 675, ... 
## Resampling results across tuning parameters:
##   C    Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.2  0.9       0.8    0.03         0.07    
##   0.5  0.9       0.8    0.03         0.06    
##   1    0.9       0.8    0.03         0.05    
##   2    0.9       0.9    0.02         0.05    
##   4    0.9       0.9    0.02         0.04    
##   8    0.9       0.9    0.03         0.05    
##   20   0.9       0.9    0.03         0.06    
##   30   0.9       0.8    0.03         0.05    
## Tuning parameter 'sigma' was held constant at a value of 0.07041
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were C = 4 and sigma = 0.07.
The sigma value is estimated using the kernlab package and held constant, so the only tuning parameter is C.
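For anyone curious where such a sigma comes from, kernlab exposes an estimator called sigest. The sketch below runs it on made-up data (toy.x is hypothetical). How caret reduces the returned range to the single value it holds constant may depend on the caret version, so treat this as an illustration of the estimator only.

```r
library(kernlab)

set.seed(1)
toy.x <- matrix(rnorm(200 * 5), ncol = 5)

# sigest returns three values describing a range of reasonable sigmas
sigma.range <- sigest(toy.x, frac = 1, scaled = TRUE)
sigma.range
```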
Here's a plot of the results:
plot(rbfSVM2, pch = 19, ylim = c(0.8, 1), main = "Best 14 features")
plot of chunk unnamed-chunk-6
I checked the accuracy of the model using the held out test data.
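# predict on the held-out rows, restricted to the selected features
test.pred <- predict(rbfSVM2, test.f[, predictors(rfeProfile)])
test.acc <- sum(test.pred == test.label)/length(test.label)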
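A single accuracy number hides which class the mistakes fall in. Here is a small base R sketch with made-up predictions (the toy.* vectors are hypothetical) that cross-tabulates predicted against actual labels and recovers the accuracy from the table. caret's confusionMatrix function reports the same kind of table along with further statistics.

```r
toy.pred  <- factor(c(1,  1, -1, -1, 1, -1, 1, -1), levels = c(-1, 1))
toy.truth <- factor(c(1, -1, -1, -1, 1, -1, 1,  1), levels = c(-1, 1))

# Rows are predictions, columns are the true labels
toy.table <- table(predicted = toy.pred, actual = toy.truth)
toy.table

# Correct predictions sit on the diagonal
toy.acc <- sum(diag(toy.table)) / sum(toy.table)
toy.acc  # 0.75: six of the eight predictions match
```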
In this last set of code, I load the full test set and generate a prediction, then write it to a file. You must be careful to repeat all of the transformations on your data.
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
test.unlabel <- read.csv("test.csv", header = FALSE)
test.unlabel <- predict(xTrans, test.unlabel)  #center and scale
test.unlabel <- test.unlabel[, predictors(rfeProfile)]  #use 14 best features
pred.unlabel <- predict(rbfSVM2, test.unlabel)
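The step that writes the predictions to a file is not shown above. A minimal sketch could look like the following, assuming one label per line with no header; the post does not state the layout Kaggle expected, so that layout is an assumption. The file name and the toy.pred stand-in for pred.unlabel are also made up.

```r
# Stand-in for pred.unlabel: a factor of predicted class labels
toy.pred <- factor(c(1, -1, 1), levels = c(-1, 1))

out.file <- file.path(tempdir(), "submission.csv")  # hypothetical file name

# as.character keeps the labels themselves; as.numeric on a factor
# would write the internal level codes (1, 2) instead
write.table(as.character(toy.pred), file = out.file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(out.file)
```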
Reference
[1] Kuhn, Max. Building Predictive Models in R Using the caret Package. Journal of Statistical Software, Vol. 28, Issue 5, November 2008.


  1. Hi! Thank you for this very useful example using the caret package. It helps me a lot! However, wouldn't it be more "logical" to use RFE based on a SVM model instead of using one based on the random forest? Actually, I'm looking for a RFE algorithm based on SVM…

  2. Thanks for your comment. It is my very first one!

    This blog post was specifically about using the caret package. The caret package doesn't have an SVM-based RFE.

    I haven't researched this in R. If you can't find anything, you can always use the scikit-learn pipeline in Python. Here is the specific website: