What is the R caret package? I'll use the author Max Kuhn's own words to explain it.
"The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations." [1]
He has some excellent documentation (better than the standard R documentation, although I link to that here also) and some easy-to-understand examples.
Standard R documentation
A very nice, easy-to-understand website
A paper on the package by Max Kuhn
Below is the webpage created by the R markdown file on this project. It shows the code as well as the results.
Analyzing the Kaggle scikit-learn competition data set using the caret package.
This file shows the steps and the code I used to analyze the data set. The score for this model is 0.92772 on the Kaggle leaderboard.
Summary
I used the R package caret. I preprocessed the data, split it into training and test sets, did feature selection using random forests, then used the smaller data set in an SVM model.
Details
I downloaded the training set from the Kaggle website.
library(caret)
## Loading required package: cluster
## Loading required package: foreach
## Loading required package: lattice
## Loading required package: plyr
## Loading required package: reshape2
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
data <- read.csv("train.csv", header = FALSE) #reads in all data, class label is last column
label <- data[, 41] #creates a factor vector for the indexing
label <- as.factor(label)
features <- data[, 1:40]
set.seed(1)
index.tr <- createDataPartition(label, p = 3/4, list = FALSE) #creates an index of rows
# creates the train and test feature set.
train.f <- features[index.tr, ]
test.f <- features[-index.tr, ]
# creates the train and test label
train.label <- label[index.tr]
test.label <- label[-index.tr]
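For intuition, here is a small base-R sketch of what the stratified partition does (synthetic labels and a 30%/70% class balance, both hypothetical): createDataPartition samples within each class, so the train and test splits keep the original class proportions.

```r
# Base-R sketch of a stratified 3/4 split: sample rows *within each
# class* so both splits keep the original class balance.
lab <- factor(rep(c(-1, 1), times = c(300, 700)))  # hypothetical labels
set.seed(1)
idx <- unlist(lapply(split(seq_along(lab), lab),
                     function(rows) sample(rows, floor(3/4 * length(rows)))))
tr.lab <- lab[idx]
te.lab <- lab[-idx]
round(prop.table(table(tr.lab)), 2)  # -1: 0.30, 1: 0.70
round(prop.table(table(te.lab)), 2)  # -1: 0.30, 1: 0.70
```

An unstratified sample could, by chance, leave one split with too few examples of the minority class; stratifying removes that risk.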
One of the really nice things about the caret package is that it has a very easy way to check for and remove highly correlated features. It turns out that there were no highly correlated variables in this data set.
# Remove correlations higher than .90
featureCorr <- cor(train.f)
highCorr <- findCorrelation(featureCorr, 0.9)
# no high correlations in this data set
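To see what findCorrelation would have caught, here is a hedged base-R sketch of the underlying idea, using hypothetical data where one column is almost a copy of another:

```r
# Base-R sketch of the idea behind findCorrelation (hypothetical data):
# column b is nearly a copy of column a, so the pair exceeds the cutoff.
set.seed(3)
a <- rnorm(100)
df <- data.frame(a = a, b = a + rnorm(100, sd = 0.01), c = rnorm(100))
cm <- abs(cor(df))
diag(cm) <- 0          # ignore each column's correlation with itself
flagged <- which(apply(cm, 2, max) > 0.9)
names(flagged)         # "a" "b"
```

findCorrelation goes one step further than this sketch: for each flagged pair it decides which single column to drop (the one with the larger mean absolute correlation), returning indices you can remove from the feature set.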
This code centers and scales the training data, then uses this information to transform the test data. I will also use it to transform the unlabeled test set that I submit to Kaggle.
# Center and scale data using caret package preprocessing
xTrans <- preProcess(train.f, method = c("center", "scale"))
train.f <- predict(xTrans, train.f)
test.f <- predict(xTrans, test.f)
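The key point, sketched below in base R with hypothetical data, is that the means and standard deviations come from the training rows only, and those same numbers are reused to transform the test rows; this is what predict(xTrans, ...) does above.

```r
# Base-R sketch of the center/scale step: statistics are learned on the
# TRAINING rows and then applied, unchanged, to the test rows.
set.seed(4)
tr <- matrix(rnorm(50, mean = 10), ncol = 2)  # hypothetical training data
te <- matrix(rnorm(10, mean = 10), ncol = 2)  # hypothetical test data
mu   <- colMeans(tr)
sdev <- apply(tr, 2, sd)
tr.s <- scale(tr, center = mu, scale = sdev)
te.s <- scale(te, center = mu, scale = sdev)  # training stats, test data
colMeans(tr.s)  # ~0 by construction; colMeans(te.s) need not be
```

Computing the statistics on the combined data would leak information from the test set into the transformation, which is why caret separates the preProcess fit from its predict step.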
I used the random forest model to do feature selection.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
rfeFuncs <- rfFuncs
rfeFuncs$summary <- twoClassSummary
rfe.control <- rfeControl(functions = rfeFuncs, method = "repeatedcv", repeats = 4,
verbose = FALSE, returnResamp = "final")
rfeProfile <- rfe(train.f, train.label, sizes = 10:15, rfeControl = rfe.control)
## Warning: Metric 'Accuracy' is not created by the summary function; 'ROC' will be used instead
## Warning: executing %dopar% sequentially: no parallel backend registered
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
plot(rfeProfile, pch = 19, type = "o", col = "blue", main = "Feature selection")
train.best.f <- train.f[, predictors(rfeProfile)]
test.best.f <- test.f[, predictors(rfeProfile)]
The plot shows that the best ROC is achieved when using 14 variables. I created new training and test sets that use only the best 14 features. Next, I'll train an SVM model on the new training set.
library(kernlab)
set.seed(2)
rbfSVM2 <- train(x = train.best.f, y = train.label, method = "svmRadial", tuneLength = 8,
trControl = trainControl(method = "repeatedcv", repeats = 5), metric = "Kappa")
## Loading required package: class
print(rbfSVM2, printCall = FALSE)
## 751 samples
## 14 predictors
## 2 classes: '-1', '1'
##
## No pre-processing
## Resampling: Cross-Validation (10 fold, repeated 5 times)
##
## Summary of sample sizes: 676, 676, 675, 677, 676, 675, ...
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa Accuracy SD Kappa SD
## 0.2 0.9 0.8 0.03 0.07
## 0.5 0.9 0.8 0.03 0.06
## 1 0.9 0.8 0.03 0.05
## 2 0.9 0.9 0.02 0.05
## 4 0.9 0.9 0.02 0.04
## 8 0.9 0.9 0.03 0.05
## 20 0.9 0.9 0.03 0.06
## 30 0.9 0.8 0.03 0.05
##
## Tuning parameter 'sigma' was held constant at a value of 0.07041
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were C = 4 and sigma = 0.07.
The sigma value is estimated by the kernlab package and held constant, so the only tuning parameter is C.
Here's a plot of the results:
plot(rbfSVM2, pch = 19, ylim = c(0.8, 1), main = "Best 14 features")
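Since Kappa is the metric used to pick the best C above, a short base-R sketch of what it measures may help (the confusion counts here are hypothetical): Kappa is accuracy corrected for the agreement you would expect by chance.

```r
# Cohen's kappa from a hypothetical 2x2 confusion matrix:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
cm <- matrix(c(40, 5,
                7, 48), nrow = 2, byrow = TRUE)
n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed accuracy, 0.88
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # expected by chance, 0.503
kap <- (po - pe) / (1 - pe)
round(kap, 3)  # 0.759
```

On imbalanced classes a trivial model can score high accuracy by always predicting the majority class, but its Kappa stays near zero, which makes Kappa a safer selection metric here.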
I checked the accuracy of the model using the held out test data.
test.pred <- predict(rbfSVM2, test.best.f)
test.acc <- sum(test.pred == test.label)/length(test.label)
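Beyond the single accuracy number, a confusion table shows where the errors fall. A base-R sketch with hypothetical predictions (caret's confusionMatrix() reports the same table plus sensitivity, specificity, and Kappa):

```r
# A confusion table gives more detail than overall accuracy
# (predictions and labels below are hypothetical).
pred  <- factor(c("1", "1", "-1", "1", "-1"), levels = c("-1", "1"))
truth <- factor(c("1", "-1", "-1", "1", "-1"), levels = c("-1", "1"))
table(predicted = pred, actual = truth)
acc <- mean(pred == truth)
acc  # 0.8
```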
In this last set of code, I load the full test set and generate a prediction, then write it to a file. You must be careful to repeat all of the transformations on your data.
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
test.unlabel <- read.csv("test.csv", header = FALSE)
test.unlabel <- predict(xTrans, test.unlabel) #center and scale
test.unlabel <- test.unlabel[, predictors(rfeProfile)] #use 14 best features
pred.unlabel <- predict(rbfSVM2, test.unlabel)
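The final write-to-file step mentioned above might look like the sketch below; the file name, column name, and placeholder predictions are all hypothetical, since the real layout depends on the competition's submission format.

```r
# Hypothetical sketch: write predictions as a one-column CSV.
preds <- factor(c("1", "-1", "1"))          # placeholder for pred.unlabel
out <- file.path(tempdir(), "submission.csv")
write.csv(data.frame(prediction = preds), file = out, row.names = FALSE)
```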
Reference
[1] Kuhn, Max. "Building Predictive Models in R Using the caret Package." Journal of Statistical Software, Vol. 28, Issue 5, November 2008.