
Saturday, January 4, 2014

Analyzing Kaggle scikit-learn data using the R caret package

In previous posts, I have struggled to improve my score on the Kaggle scikit-learn competition. Using the R caret package, I have managed to get an accuracy score of 0.92772. Not only does this beat my previous best of 0.90350, but it gets me above the SVM benchmark.

What is the R caret package? I'll use the author Max Kuhn's own words to explain it.

"The caret package, short for classi cation and regression training, contains numerous
tools for developing predictive models using the rich set of models available in R. The
package focuses on simplifying model training and tuning across a wide variety of modeling
techniques. It also includes methods for pre-processing training data, calculating variable
importance, and model visualizations. "[1]

He has some excellent documentation (better than the standard R documentation, although I link to that here as well) and some easy-to-understand examples.

Standard R documentation
A very nice, easy-to-understand website
A paper on the package by Max Kuhn

Below is the webpage created by the R Markdown file for this project. It shows the code as well as the results.


Analyzing the Kaggle scikit-learn competition data set using the caret package.

This file shows the steps and the code I used to analyze the data set. The score for this model is 0.92772 on the Kaggle leaderboard.
Summary
I used the R package caret. I preprocessed the data, split it into training and test sets, did feature selection using random forests, then used the smaller data set in an SVM model.
Details
I downloaded the training set from the Kaggle website.
library(caret)
## Loading required package: cluster
## Loading required package: foreach
## Loading required package: lattice
## Loading required package: plyr
## Loading required package: reshape2
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
data <- read.csv("train.csv", header = FALSE)  #reads in all data, class label is last column
label <- data[, 41]  #creates a factor vector for the indexing
label <- as.factor(label)
features <- data[, 1:40]
set.seed(1)
index.tr <- createDataPartition(label, p = 3/4, list = FALSE)  #creates an index of rows
# creates the train and test feature set.
train.f <- features[index.tr, ]
test.f <- features[-index.tr, ]
# creates the train and test label
train.label <- label[index.tr]
test.label <- label[-index.tr]
One of the really nice things about the caret package is that it has a very easy way to check for and remove highly correlated features. It turns out that there were no highly correlated variables in this data set.
# Remove correlations higher than .90
featureCorr <- cor(train.f)
highCorr <- findCorrelation(featureCorr, 0.9)
# no high correlations in this data set
This code centers and scales the training data, then uses this information to transform the test data. I will also use it to transform the test data set that I use to submit the solution to Kaggle.
# Center and scale data using CARET package preprocessing
xTrans <- preProcess(train.f, method = c("center", "scale"))
train.f <- predict(xTrans, train.f)
test.f <- predict(xTrans, test.f)
I used the random forest model to do feature selection.
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
rfeFuncs <- rfFuncs
rfeFuncs$summary <- twoClassSummary
rfe.control <- rfeControl(functions = rfeFuncs, method = "repeatedcv", repeats = 4, 
    verbose = FALSE, returnResamp = "final")
rfeProfile <- rfe(train.f, train.label, sizes = 10:15, rfeControl = rfe.control)
## Warning: Metric 'Accuracy' is not created by the summary function; 'ROC' will be used instead
## Warning: executing %dopar% sequentially: no parallel backend registered
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
plot(rfeProfile, pch = 19, type = "o", col = "blue", main = "Feature selection")
[Plot: Feature selection (ROC vs. number of variables)]
train.best.f <- train.f[, predictors(rfeProfile)]
test.best.f <- test.f[, predictors(rfeProfile)]
The plot shows that the best ROC is achieved when using 14 variables. I created new training and test sets that use only the best 14 features.
Next, I'll train an SVM model on the new training set.
library(kernlab)
set.seed(2)
rbfSVM2 <- train(x = train.best.f, y = train.label, method = "svmRadial", tuneLength = 8, 
    trControl = trainControl(method = "repeatedcv", repeats = 5), metric = "Kappa")
## Loading required package: class
print(rbfSVM2, printCall = FALSE)
## 751 samples
##  14 predictors
##   2 classes: '-1', '1' 
## 
## No pre-processing
## Resampling: Cross-Validation (10 fold, repeated 5 times) 
## 
## Summary of sample sizes: 676, 676, 675, 677, 676, 675, ... 
## 
## Resampling results across tuning parameters:
## 
##   C    Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.2  0.9       0.8    0.03         0.07    
##   0.5  0.9       0.8    0.03         0.06    
##   1    0.9       0.8    0.03         0.05    
##   2    0.9       0.9    0.02         0.05    
##   4    0.9       0.9    0.02         0.04    
##   8    0.9       0.9    0.03         0.05    
##   20   0.9       0.9    0.03         0.06    
##   30   0.9       0.8    0.03         0.05    
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07041
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were C = 4 and sigma = 0.07.
The sigma value is calculated using the kernlab package and held constant. The only tuning parameter is C.
Here's a plot of the results:
plot(rbfSVM2, pch = 19, ylim = c(0.8, 1), main = "Best 14 features")
[Plot: SVM tuning results for the best 14 features]
I checked the accuracy of the model using the held out test data.
test.pred <- predict(rbfSVM2, test.best.f)
test.acc <- sum(test.pred == test.label)/length(test.label)
In this last set of code, I load the full test set and generate a prediction, then write it to a file. You must be careful to repeat all of the transformations on your data.
setwd("C:/Users/numbersmom/Dropbox/kaggle sci kit competition/data")
test.unlabel <- read.csv("test.csv", header = FALSE)
test.unlabel <- predict(xTrans, test.unlabel)  #center and scale
test.unlabel <- test.unlabel[, predictors(rfeProfile)]  #use 14 best features
pred.unlabel <- predict(rbfSVM2, test.unlabel)
Reference: [1] Kuhn, Max, "Building Predictive Models in R Using the caret Package," Journal of Statistical Software, Vol. 28, Issue 5, November 2008.

Friday, July 12, 2013

Using sample weights to fit the SVM model in Python


In order to write AdaBoost code for a model, you need to be able to fit the model using sample weights and to generate the probability distribution of the outcomes. As far as I know, R doesn't have an SVM implementation that does this, but scikit-learn does.

Unfortunately, I was not able to make it work.

I used the sonar data from David Mease's class. First, I preprocessed the x training data so the mean and standard deviation of each feature are 0 and 1, respectively. Then I used the API to scale the test data. When I ran the SVM without sample weights, the training error was 0 and the test error was 10.3%. Then I constructed a weight vector so that the weights are all 1/N; N in this case is 130.

When I fitted the data with the sample weights, the training error and test error were both awful. The training error was 49.3% and the test error was 42.3%; the test error was less than the training error. Clearly something is not right here. Someone on the scikit-learn mailing list suggested that I scale C by the length of the training data, but that didn't make any difference.
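
For concreteness, here is a minimal sketch of the kind of fit I'm describing (not the exact script, which is on GitHub); x_train, y_train, x_test, y_test stand in for the sonar splits, and StandardScaler is just one way to do the scaling:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# x_train, y_train, x_test, y_test: the sonar splits (130 training rows, 78 test rows)
scaler = StandardScaler().fit(x_train)
x_train_s = scaler.transform(x_train)   # mean 0, standard deviation 1
x_test_s = scaler.transform(x_test)     # reuse the training scaler on the test set

weights = np.ones(len(y_train)) / len(y_train)   # uniform weights, 1/N with N = 130
clf = SVC(kernel='rbf')
clf.fit(x_train_s, y_train, sample_weight=weights)

print "train error:", 1 - clf.score(x_train_s, y_train)
print "test error:", 1 - clf.score(x_test_s, y_test)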

So this is not a usable option right now. If I find out what is going wrong, I'll update the blog.


The code for this post can be found on GitHub at Link.

Sunday, June 23, 2013

R vs Python for machine learning

In my last post, I talked about tuning an SVM for the Kaggle competition. I submitted my tuned SVM, and my score on the leaderboard is 0.90350. Not only am I 198th on the leaderboard and sinking fast, but I didn't even reach the SVM Benchmark score. Additionally, the top person is at a score of 0.99031.

I figured that only an ensemble method would get me to a higher score, so I started to experiment with these methods. I never managed to come up with an ensemble that even matched my original submission. While I did this, I noticed some things about scikit-learn in Python that made me start to think about looking for other tools.

I decided to try R and RapidMiner. RapidMiner has not been a successful experience; I can't seem to get past the repository setup and data import stage. I have had much more success with R. Most of this is due to a wonderful set of videos by David Mease. If you are interested in learning R for data mining and machine learning, his videos are pure gold. There are 13 videos on YouTube. Not only does he show you how to use R, but he has all the example data sets online so that you can play along. He also does a wonderful job of explaining what benchmarks to use.

David uses a subset of a well-known sonar data set. He uses 130 observations in the training set and 78 observations in the test set. There are 60 features. He goes over several methods with the same data set. I still have one more video to watch, but so far he has covered decision trees, SVM, and k-nearest neighbors. He uses k-nearest neighbors with n=1 as a benchmark; this is the default in R. For this data set, it gives a misclassification rate of 21%, which is better than the decision tree misclassification rate of about 30%. But the SVM should be able to beat the untuned k-nearest neighbors.

I used this same sonar data set to compare results in R and Python.

k-nearest neighbors
Misclassification rate for R: 21%
Misclassification rate for Python: could not get this. I set n_neighbors=1, but I got this error:


C:\Python27\lib\site-packages\sklearn\neighbors\classification.py:131: NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
  neigh_dist, neigh_ind = self.kneighbors(X)

The default distance in k-nearest neighbors is the Euclidean distance, so the data should be scaled so that the variances of each variable are equal. R does this automatically; Python requires you to scale the data yourself.
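
A minimal sketch of what that scaling step looks like in scikit-learn; x_train, y_train, x_test, y_test are placeholders for the sonar splits, and StandardScaler is just one way to do the scaling:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# scale so every feature contributes equally to the Euclidean distance
scaler = StandardScaler().fit(x_train)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(scaler.transform(x_train), y_train)
print "misclassification rate:", 1 - knn.score(scaler.transform(x_test), y_test)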

Decision Tree

The following table shows the results I got:


Depth   R training accuracy   R test accuracy   Python training accuracy   Python test accuracy
1       .7769                 .7179             .7769                      .7179
2       .80                   .7051             .8077                      .7051
3       .8615                 .6538             .8923                      .6667
4       .8846                 .6923             .9385                      .7179
5       .8846                 .6923             .9846                      .7436
6       N/A                   N/A               1.0                        .7308

Note that R and Python give essentially the same results for max depths of 1 and 2. As the max depth increases, it looks like scikit-learn gives the better results. However, the test accuracy stays fairly flat for both models while the Python model's training accuracy increases to 1.0. It certainly looks like max depths of 4 and 5 in Python have overfit the data. It would be nice to compare a picture of the two trees. The tree in R is quite easy to generate; Python requires some graphics modules that are fairly involved to use (at least they were for me, and I couldn't get either one to work). The R model won't fit a max depth of 6 because of overfitting issues.
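
For reference, the Python column of that table can be generated with a loop like the one below. This is my own sketch, with x_train, y_train, x_test, y_test standing in for the sonar splits:

from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 7):
    tree = DecisionTreeClassifier(max_depth=depth)
    tree.fit(x_train, y_train)
    print depth, tree.score(x_train, y_train), tree.score(x_test, y_test)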

Support Vector Machines

The first thing I did was run a default support vector machine in R and Python. Both programs use an rbf kernel as the default.

R scales the data and uses cost=1 and gamma=1/(number of features) as default values. The untuned SVM gives a misclassification error of 1.5% for the training data and about 13% for the test data.

Python doesn't scale the data and neither did I. (Maybe this is not a fair comparison, but it is an extra step in scikit-learn that isn't required in R.) The Python default values are C=1 (cost=1) and gamma=0. This untuned SVM gives a misclassification error of about 30% for the training data and about 36% for the test data.
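
If I wanted a fairer comparison, I could scale the data before the default scikit-learn fit; here is a sketch (placeholder variable names again, and StandardScaler is my choice, not something built into the defaults):

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_raw = SVC().fit(x_train, y_train)                       # scikit-learn defaults on unscaled data
scaler = StandardScaler().fit(x_train)
svm_scaled = SVC().fit(scaler.transform(x_train), y_train)  # same defaults on scaled data
print "unscaled test error:", 1 - svm_raw.score(x_test, y_test)
print "scaled test error:", 1 - svm_scaled.score(scaler.transform(x_test), y_test)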

I've already talked in a previous post about how the Python grid search crashes my computer. R has a procedure for tuning the SVM, but it produces an error when I try to run it.

In addition to the questions I have about how scikit-learn models fit the data, there is the additional problem of categorical data. R usually recognizes categorical data, and if it doesn't, you can set a variable to be categorical and R will know how to handle it. Python requires you to transform your own categorical data, and it is a kludgy process. There is a transformer called OneHotEncoder, but you can't use it unless you first convert all of your text data to numbers.
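
To illustrate what I mean, here is a small sketch with a made-up color column; the text has to be converted to integer codes first, for example with LabelEncoder (my choice here, not something the workflow requires), before OneHotEncoder will accept it:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['red', 'green', 'blue', 'green'])   # a made-up categorical column
codes = LabelEncoder().fit_transform(colors)           # text -> integers, e.g. [2, 1, 0, 1]
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print onehot                                           # one indicator column per category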

I still have a lot to learn about machine learning in R, but from what I've seen so far, I think I'll stick with R when I want to run a machine learning algorithm.

Sunday, April 7, 2013

Tuning C and gamma in the SVM model

In the previous post, the best SVM model for the Kaggle data used an rbf kernel. I'd like to find the best parameters for C and gamma. I did do a grid search with dictionary values for the parameters, but I would have liked to have scikit-learn search for the best parameters for me, so I tried this code:

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

C_range = 10.0 ** np.arange(-2, 9)       # 0.01 up to 100,000,000
gamma_range = 10.0 ** np.arange(-5, 4)   # 0.00001 up to 1000
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedKFold(cl, n_folds=3)      # cl: the class labels, for stratified folds
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)

This code should search for the best values for C in the range .01 to 100,000,000 and gamma in the range .00001 to 1000. Unfortunately, this just makes my computer crash and I never got an answer.

But I found a reference here that says an exhaustive grid search is time consuming. It suggests doing a coarse search first and then a fine search once you are in the correct region.
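
Here is a sketch of that coarse-then-fine idea with GridSearchCV; the ranges are only for illustration, and x_train, y_train are placeholders for my 70% training split:

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV

# coarse pass: widely spaced powers of ten
coarse = GridSearchCV(SVC(), {'C': list(10.0 ** np.arange(0, 4)),
                              'gamma': list(10.0 ** np.arange(-4, 0))})
coarse.fit(x_train, y_train)
best_C = coarse.best_params_['C']
best_gamma = coarse.best_params_['gamma']

# fine pass: a narrow grid around the coarse winner
fine = GridSearchCV(SVC(), {'C': [best_C * f for f in (0.5, 0.75, 1.0, 1.5, 2.0)],
                            'gamma': [best_gamma * f for f in (0.5, 0.75, 1.0, 1.5, 2.0)]})
fine.fit(x_train, y_train)
print fine.best_params_, fine.best_score_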

So starting from C=10 and gamma = .01, I refined my search and here are the values that I got:

C      gamma    score
10     .01      .898
9      .0095    .898
8      .009     .901
7      .0085    .901
6      .0085    .901
5.5    .0085    .901
5.4    .0085    .901
5.3    .0086    .901
5.28   .0086    .901

You can see that I started by using dictionary values for C with a step of 1 on each side of 10 and a step of .005 on each side of gamma. As the grid search stabilized, I narrowed the step on C to .1 and the step on gamma to .0005. Gamma was very stable, so I didn't change its step size again. I arbitrarily stopped changing C when the step size reached 0.01.

When I ran these parameters using my 70/30 split on the data, I got a score of .9166. This is an improvement of about 0.4% over the previous score of .9133.

Reference: Hsu, Chang, and Lin, A Practical Guide to Support Vector Classification, retrieved from http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Wednesday, March 27, 2013

More on the Data Science competition in Kaggle

In my last post, I talked about the Data Science competition on Kaggle. In that post, I ran an optimized SVM model with a Gaussian kernel. In this post, I'll go into a little more depth about the data and the models.

I characterized the data as "well structured". I have already mentioned that the data is continuous with no missing values. I used a combination of numpy and pandas to look for missing values, check the mean and standard deviation of each feature, produce histograms to look for skewing and outliers, and compute a correlation matrix to see whether any features had strong linear correlations. These are not specific statistical tests, but this process gave me a good feel for the data and whether I needed any preprocessing.
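
Roughly the kind of checks I mean, using pandas; the file name and the assumption that there is no header row come from the earlier posts:

import pandas as pd

df = pd.read_csv("train.csv", header=None)   # Kaggle training file, no header row
print df.isnull().sum().sum()                # total number of missing values
print df.describe()                          # mean, std, min/max for every feature
df.hist()                                    # histograms to spot skewing and outliers
print df.corr()                              # pairwise linear correlations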

Once I determined that I had a good data set, I proceeded to modeling. Since there are no categorical features, I decided not to run any kind of decision tree analysis. Since the response is a class label, I started with logistic regression and a linear SVM. Each of these gave a score of .797.
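
A sketch of those two baseline fits (x_train, y_train, x_test, y_test are placeholders for my 70/30 split):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

logreg = LogisticRegression().fit(x_train, y_train)
linsvm = SVC(kernel='linear').fit(x_train, y_train)
print "logistic regression:", logreg.score(x_test, y_test)
print "linear SVM:", linsvm.score(x_test, y_test)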

At this point, I decided to try a grid search. Here's the description from the user's guide: GridSearchCV implements a “fit” method and a “predict” method like any classifier except that the parameters of the classifier used to predict is optimized by cross-validation.

Here's the code:

from sklearn import svm, grid_search

param_grid = {'C': [.01, .1, 1.0, 10.0, 100.0],
              'gamma': [.1, .01, .001, .0001],
              'kernel': ['linear', 'rbf']}
svr = svm.SVC()
grid = grid_search.GridSearchCV(svr, param_grid)
grid.fit(x_train, y_train)
print "The best classifier is:", grid.best_estimator_
print "The best score is ", grid.best_score_
print "The best parameters are ", grid.best_params_

And here's the results:

The best classifier is: SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel=rbf, max_iter=-1, probability=False, shrinking=True,
  tol=0.001, verbose=False)
The best score is  0.898426323319
The best parameters are  {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.01}

Ironically, I had already come up with this optimized model just by plugging in values. This is not a quick process; I can't give you the exact amount of time that it takes because I just go off and do something else while it is running. Note that the score is not quite as high as my model in the last post. I'm guessing this is because I split the data and used 70% for training and 30% for testing; I believe this model uses cross-validation internally, which means it only had the 70% training split to work with.

I also ran a nearest neighbor model. Here's the code and the results:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

neigh = KNeighborsClassifier()
neigh.fit(x_train, y_train)
y_pred3 = neigh.predict(x_test)
neigh_score = neigh.score(x_test, y_test)
print "The score from K neighbors is", neigh_score
cm3 = confusion_matrix(y_test, y_pred3)
print "This is the confusion matrix for K neighbors", cm3

The score from K neighbors is 0.883333333333
This is the confusion matrix for K neighbors [[133  22]
 [ 13 132]]

The score for the K neighbors classifier is almost as high as the optimized SVM with the rbf kernel.

I'd be very interested to hear what others are finding as they analyze this set.

Reference: Pedregosa et al., "Scikit-learn: Machine Learning in Python," JMLR 12, pp. 2825-2830, 2011.