Monday, October 21, 2013

More on the Kaggle SciKit Learn Competition

I have been off MOOCing. I have completed several Coursera MOOC classes and parts of several others. When there are so many good courses available, you have to be careful not to overextend yourself. I have just finished Dr. Peng's Computing for Data Analysis course. I'll be starting Dr. Leek's Data Analysis course next week. Both courses use R.

In preparation for Dr. Leek's class, I decided to take another look at the Kaggle SciKit Learn data set. In my March 19 post I wrote, "The data set from Kaggle is well structured. There are 40 features and 999 training examples. The feature data is all continuous and there are no missing values."  Then I proceeded to run machine learning algorithms on the entire data set. 

This really isn't the best way to handle this type of problem, so I wanted to go back and start from the beginning. 

When you have a data set this big, it is very hard to get a feel for what is going on. Here are three things that make sense to do right away.

R has a very nice summary command that gives the minimum, maximum, mean, median and quartile statistics for every column in the data set. Here's the output for the first four variables:



Note that the data did not have any labels. When R read the data into a data frame, it automatically assigned variable names. There isn't much of interest in these first four variables. There are no missing values, the ranges are fairly similar and the data all seem to be centered around zero. However, the summary statistics do show some variables that are very different from this. Here are two other variables from the set.

       V5                          
 Min.   :-16.4219  
 1st Qu.: -1.6760  
 Median :  0.8919 
 Mean   :  1.1374 
 3rd Qu.:  3.8832  
 Max.   : 17.5653 

      V13              
 Min.   :-14.679  
 1st Qu.: -5.047  
 Median : -2.120  
 Mean   : -1.988  
 3rd Qu.:  1.059  
 Max.   : 12.186    
Here we can see that the range of these variables is much larger than that of the first four. Additionally, these variables are not centered at zero.
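For anyone who wants to follow along in Python instead of R, a rough equivalent of the summary command is pandas' describe(). This is only a sketch, and the file name is a placeholder for the Kaggle training file.

import pandas as pd

# Placeholder path -- point this at the Kaggle training file.
df = pd.read_csv('train.csv', header=None)

# describe() reports count, mean, std, min, quartiles and max for every column,
# roughly the same information as R's summary().
print df.describe()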

Next, I'll look at the distribution of the variables. You could go cross-eyed trying to check the distribution of all 40 features, but the summary data indicates that the data is well behaved. Here are histograms of four different variables in the data set. I show Variable 1 since it is fairly representative of the majority of the variables in the set. I show Variables 5, 13 and 24 since they are the variables with the highest variability. The red line is the mean and the blue line is the median. Note that the variables are not on the same scale; I couldn't get R to put them all together otherwise. But there is no obvious skewness and there are no obvious outliers.
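If you prefer Python, a sketch of the same kind of plot with matplotlib is below. The column indices are an assumption: with no header, pandas numbers the columns 0 through 39, so Variable 5 is column 4, and so on.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv', header=None)   # placeholder path

# Histograms of Variables 1, 5, 13 and 24 with the mean (red) and median (blue) marked.
for i, col in enumerate([0, 4, 12, 23]):
    plt.subplot(2, 2, i + 1)
    plt.hist(df[col].values, bins=30)
    plt.axvline(df[col].mean(), color='red')
    plt.axvline(df[col].median(), color='blue')
    plt.title('V%d' % (col + 1))
plt.show()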

Finally, I look at the correlation matrix. The best way to look at this is with some type of color image that shows the correlation values between the variables. I made this plot with the lattice package.

Most of the linear correlations are positive. The scale in the positive direction only goes up to 0.6. There do not seem to be any obvious strong correlations in the data.
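A Python version of the same picture is only a few lines. This sketch uses matplotlib's imshow on the pandas correlation matrix rather than the lattice levelplot, so it will not look identical.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv', header=None)   # placeholder path

# Pairwise linear correlations shown as a color image.
corr = df.corr()
plt.imshow(corr.values, interpolation='nearest', cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar()
plt.title('Correlation matrix')
plt.show()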

This is the information that convinced me to go straight to analysis of the data. Clearly, I did not take into account the "curse of dimensionality". My next step is to reduce the number of variables that I use in the analysis.



Friday, July 12, 2013

Constructing a crude ensemble method in R and tuning an SVM

I'm still working to improve my Kaggle sci kit learn score. As I've said in a previous post, I think I need to use an ensemble method. The question is, what method?

I tried adaboost with decision trees, but it didn't improve on the test error given by the SVM model.

I decided to try fitting many different models and combining them in some way. I used the sonar data for this since it is a much more manageable size than the kaggle competition data.

The first thing I decided to try was a straight linear combination. First, I ran all of the different models (svm, decision tree, knn, logistic regression and naive bayes). I generated the predicted y values for each of these models and added them together. This is not an easy thing to do in R. Remember I said in a previous post that R recognizes categorical variables? Unfortunately, once R has decided that a variable is categorical, it refuses to do math on it; the operation doesn't "make sense". You have to change the factor back to a numeric value. This is not at all intuitive and can actually give you unexpected results. For example, when I tried to change one of the predicted vectors back to numeric, I got 0. Plain old vanilla constant 0.

Here is the code that gave me consistent results. Assume that your predicted response is in the vector y_pred. Then

numeric_y_pred <- as.numeric(as.character(y_pred))

gives you a vector that you can add.

Getting back to the ensemble: the straight linear combination of these models gave me a test error of 14.1%. This is better than most of the individual models, but not as good as the SVM.

So I decided to use the error to weight the vectors. I used the same formula as in the adaboost code:
 alpha <- .5*log((1-error)/error). I multiplied each predicted vector by its respective alpha value. When I did this, I got a test error of 9%. Eureka, I thought.
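For the record, here is a rough Python/scikit-learn sketch of the same weighting idea. It runs on synthetic stand-in data rather than the sonar file, and it weights each model by its training error, so it only illustrates the formula rather than reproducing my R run.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with the same shape as the sonar split (130 train / 78 test).
X, y = make_classification(n_samples=208, n_features=60, random_state=0)
X_train, X_test = X[:130], X[130:]
y_train, y_test = y[:130], y[130:]

models = [SVC(), DecisionTreeClassifier(), KNeighborsClassifier(),
          LogisticRegression(), GaussianNB()]
combined = np.zeros(len(y_test))
for m in models:
    m.fit(X_train, y_train)
    error = 1.0 - m.score(X_train, y_train)          # error used to compute the weight
    error = min(max(error, 1e-6), 1.0 - 1e-6)        # keep the log well defined
    alpha = 0.5 * np.log((1 - error) / error)        # same formula as the adaboost code
    pred = np.where(m.predict(X_test) == 1, 1, -1)   # recode predictions as +/-1
    combined += alpha * pred

y_ens = (combined > 0).astype(int)                   # weighted vote
print "Ensemble test error:", np.mean(y_ens != y_test)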

However, this did not translate to the kaggle data. The tuned SVM in R had a test error of 13.3% but the ensemble had a test error of 14%.

Speaking of the tuned SVM model, I had a very difficult time with this. Every time I tried to run it, I got an error. I finally tweaked the code and got it to run. The code that I used is not the same as the code in the R help. I don't know if the package was updated and the help was not. Here is the code that worked for me:

fit7 <- tune.svm(x,y,gamma=10^(-3:-1),cost=10^(1:5))

I submitted the tuned SVM model from R to the kaggle competition and it did not perform as well as the tuned SVM from Python.

Back to the digital drawing board.

Using sample weights to fit the SVM model in Python


In order to write adaboost code for a model, you need to be able to fit the model using sample weights and to generate the probability distribution of the outcomes. As far as I know, R doesn't have an SVM implementation that does this, but sci kit learn does.

Unfortunately, I was not able to make it work.

I used the sonar data from David Mease's class. First, I preprocessed the x training data so that each feature has mean 0 and standard deviation 1. Then I used the preprocessing API to apply the same scaling to the test data. When I ran the SVM without sample weights, the training error was 0 and the test error was 10.3%. Then I constructed a weight vector with all of the weights equal to 1/N, where N is 130, the number of training observations.
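Here is a minimal sketch of the kind of call I mean, on stand-in data rather than my actual sonar files. SVC's fit method accepts a sample_weight argument, and StandardScaler handles the scaling.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Stand-in for the sonar split: 130 training and 78 test observations.
X, y = make_classification(n_samples=208, n_features=60, random_state=0)
X_train, X_test, y_train, y_test = X[:130], X[130:], y[:130], y[130:]

# Scale the training data to mean 0 / sd 1 and apply the same scaling to the test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Uniform weights of 1/N, as described above.
w = np.ones(len(y_train)) / float(len(y_train))

clf = SVC(probability=True)
clf.fit(X_train_s, y_train, sample_weight=w)
print "Training error:", 1.0 - clf.score(X_train_s, y_train)
print "Test error:", 1.0 - clf.score(X_test_s, y_test)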

When I fitted the data with the sample weights, the training error and test error were both awful. The training error was 49.3% and the test error was 42.3%. The test error was less than the training error. Clearly something is not right here. Someone on the sci kit learn mailing list suggested that I scale C by the length of the training data, but that didn't make any difference.

So this is not a usable option right now. If I find out what is going wrong, I'll update the blog.


The code for this post can be found on github at Link.

Sunday, June 23, 2013

R vs Python for machine learning

In my last post, I talked about tuning an svm for the Kaggle competition. I submitted my tuned svm. My score on the leaderboard is .90350. Not only am I 198 on the leaderboard and sinking fast, but I didn't even reach the SVM Benchmark score. Additionally, the top person is at a score of .99031.

I figured that only an ensemble method would get me to a higher score and I started to experiment with these methods. I never managed to come up with an ensemble that even matched my original submission.  While I did this, I noticed some things about sci kit learn in Python that made me start to think about looking for other tools.

I decided to try R and Rapid Miner. Rapid Miner has not been a successful experience. I can't seem to get past the set-up-repository/import-data stage. I have had much more success with R. Most of this is due to a wonderful set of videos by David Mease. If you are interested in learning R for data mining and machine learning, his videos are pure gold. There are 13 videos on Youtube. Not only does he show you how to use R, but he has all the example data sets online so that you can play along. He also does a wonderful job of explaining what benchmarks to use.

David uses a subset of a well known sonar data set. He uses 130 observations in the training set and 78 observations in the test set. There are 60 features. He goes over several methods with the same data set. I still have one more video, but so far he has covered decision trees, svm and k nearest neighbors. He uses k nearest neighbors with n=1 as a benchmark. This is the default in R. For this data set, it gives a misclassification rate of 21%. This is better than the decision tree misclassification rate, which is about 30%. But the svm should be able to beat the untuned k nearest neighbors.

I used this same sonar data set to compare results in R and Python.

k nearest neighbors
Misclassification rate for R: 21%
Misclassification rate for Python: I could not get this. I set n_neighbors=1, but I got this warning:


C:\Python27\lib\site-packages\sklearn\neighbors\classification.py:131: NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
  neigh_dist, neigh_ind = self.kneighbors(X)

The default distance in k nearest neighbors is the Euclidean distance. The data should be scaled so that the variances of each variable are equal. R does this automatically. Python requires you to scale the data yourself.
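Scaling first is only a couple of extra lines in scikit-learn. Here is a sketch on synthetic stand-in data rather than the sonar file itself:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=208, n_features=60, random_state=0)
X_train, X_test, y_train, y_test = X[:130], X[130:], y[:130], y[130:]

# Scale to unit variance so the Euclidean distance treats every feature equally.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(scaler.transform(X_train), y_train)
print "1-NN test accuracy:", knn.score(scaler.transform(X_test), y_test)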

Decision Tree

The following table shows the results I got:


Depth   R training accuracy   R test accuracy   Python training accuracy   Python test accuracy
  1          .7769                .7179                  .7769                    .7179
  2          .80                  .7051                  .8077                    .7051
  3          .8615                .6538                  .8923                    .6667
  4          .8846                .6923                  .9385                    .7179
  5          .8846                .6923                  .9846                    .7436
  6          N/A                  N/A                    1.0                      .7308


Note that the results are essentially the same for a max depth of 1 and 2. As the max depth increases, it looks like sci kit learn gives the better results. However, the test accuracy stays fairly flat for both models while the Python model's training accuracy increases to 1.0. It certainly looks like max depth 4 and 5 in Python have overfit the data. It would be nice to compare pictures of the two trees. The tree in R is quite easy to generate; Python requires some graphics modules that are fairly involved to use. At least, they were for me. I couldn't get either one to work. The R model won't fit a max depth of 6 because of overfitting issues.
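For reference, the Python side of a table like this can be produced with a loop along these lines (shown here on synthetic stand-in data rather than the sonar split):

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=208, n_features=60, random_state=0)
X_train, X_test, y_train, y_test = X[:130], X[130:], y[:130], y[130:]

# Training and test accuracy for increasing maximum tree depth.
for depth in range(1, 7):
    tree = DecisionTreeClassifier(max_depth=depth)
    tree.fit(X_train, y_train)
    print depth, tree.score(X_train, y_train), tree.score(X_test, y_test)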

Support Vector Machines

The first thing I did was run a default support vector machine in R and Python. Both use an rbf kernel by default.

R scales the data and uses cost=1 and gamma=1/number of features as default values. The untuned svm gives a misclassification error of 1.5% for the training data and about 13% for the test data.

Python doesn't scale the data and neither did I. (Maybe this is not a fair comparison but it is an extra step in sci kit learn that isn't required in R.) The Python default values are C=1 (cost=1) and gamma=0. This untuned svm gives a misclassification error of about 30% for the training data and about 36% for the test data.
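One way to keep from forgetting the scaling step in scikit-learn is to bundle it into a Pipeline. A minimal sketch, again on synthetic stand-in data, looks like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=208, n_features=60, random_state=0)
X_train, X_test, y_train, y_test = X[:130], X[130:], y[:130], y[130:]

# The pipeline scales the data and then fits the default rbf SVC.
model = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
model.fit(X_train, y_train)
print "Scaled SVC test accuracy:", model.score(X_test, y_test)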

I've already talked in a previous post about how the Python grid search crashes my computer. R has a procedure for tuning the svm, but it produces an error when I try to run it.

In addition to the questions I have about how sci kit learn models fit the data, there is the problem of categorical data. R usually recognizes categorical data. If it doesn't, you can set a variable to be categorical and R will know how to handle it. Python requires you to transform your own categorical data, and it is a kludgy process. There is a preprocessing class called OneHotEncoder, but you can't use it until you have transformed all of your text data to numeric codes.
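To make the complaint concrete, here is a small sketch of the two-step process I mean; the column of colors is made up.

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['red', 'blue', 'green', 'blue', 'red'])

# Step 1: the text categories have to become integer codes first.
codes = LabelEncoder().fit_transform(colors)

# Step 2: only then can OneHotEncoder expand them into indicator columns.
dummies = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print dummies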

I still have a lot to learn about machine learning in R. But from what I've seen so far, I think I'll stick to R when I want to run a machine learning algorithm.

Sunday, April 7, 2013

Tuning C and gamma in the SVM model

In the previous post, the best SVM model for the Kaggle data is an rbf kernel. I'd like to find the best parameters for C and gamma. I did do a grid search with dictionary values for the parameters. I would have liked to have scikit learn search for the best parameters for me and I tried this code:

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

C_range = 10.0 ** np.arange(-2, 9)
gamma_range = 10.0 ** np.arange(-5, 4)
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedKFold(cl, n_folds=3)
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)

This code should search for the best values for C in the range .01 to 100,000,000 and gamma in the range .00001 to 1000 (np.arange stops one step short of the upper endpoint). Unfortunately, this just makes my computer crash and I never got an answer.

But I found a reference here that says an exhaustive grid search is time consuming. It suggests using a coarse search first and then a fine search once you are in the correct region.

So starting from C=10 and gamma = .01, I refined my search and here are the values that I got:

C           gamma           score
10            .01                .898
 9           .0095              .898
 8            .009               .901
 7            .0085             .901
 6            .0085             .901
5.5          .0085             .901
5.4          .0085             .901
5.3          .0086             .901
5.28        .0086             .901

You can see that I started by using dictionary values for C with a step of 1 on each side of 10 and a step of .005 on each side of gamma. As the grid search stabilized, I narrowed the step on C to .1 and gamma to .0005. Gamma was very stable and I didn't change the step size. I arbitrarily stopped changing C when the step size reached 0.01.
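The same two-stage idea can also be scripted rather than done by hand: run a coarse GridSearchCV, then build a finer grid around whatever it picks. The ranges below are only illustrative and the data is a synthetic stand-in for the competition set.

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer releases
from sklearn.datasets import make_classification

# Synthetic stand-in with the same shape as the competition data.
X, y = make_classification(n_samples=999, n_features=40, random_state=0)

# Coarse pass over wide, logarithmically spaced ranges.
coarse = GridSearchCV(SVC(), {'C': list(10.0 ** np.arange(0, 3)),
                              'gamma': list(10.0 ** np.arange(-4, -1))})
coarse.fit(X, y)
C0, g0 = coarse.best_params_['C'], coarse.best_params_['gamma']

# Fine pass in a small neighborhood of the coarse winner.
fine = GridSearchCV(SVC(), {'C': [0.5 * C0, C0, 2 * C0],
                            'gamma': [0.5 * g0, g0, 2 * g0]})
fine.fit(X, y)
print "Best parameters:", fine.best_params_, "score:", fine.best_score_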

When I ran these parameters using my 70/30 split on the data, I got a score of .9166, which is an improvement of .0033 over the previous score of .9133.

Reference: A Practical Guide to Support Vector Classification retrieved from http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Wednesday, March 27, 2013

More on the Data Science competition in Kaggle

In my last post, I talked about the Data Science competition in Kaggle. In that post, I ran an optimized SVM model with a gaussian kernel. In this post, I'll go a little further into depth about the data and models.

I characterized the data as "well structured". I have already mentioned that the data is continuous with no missing values. I used a combination of numpy and pandas to look for missing values, check the mean and standard deviations of each feature, produce histograms to look for data skewing and outliers and a correlation matrix to see if there were any features that had strong linear correlations. These are not specific statistical tests. But this process gave me a good feel for the data and whether I needed any preprocessing.
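For the missing-value check, the pandas call I have in mind is something like the sketch below; the file name is a placeholder for the training file.

import pandas as pd

df = pd.read_csv('train.csv', header=None)   # placeholder path

# Count of missing values in each of the 40 columns; all zeros for this data set.
print df.isnull().sum()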

Once I determined that I had a good data set, I proceeded to modeling. Since there are no categorical features, I decided not to run any kind of decision tree analysis. Since the response variable is a class label, I started with logistic regression and a linear SVM. Each of these gave a score of .797.

At this point, I decided to try a grid search. Here's the description from the user's guide: GridSearchCV implements a “fit” method and a “predict” method like any classifier except that the parameters of the classifier used to predict is optimized by cross-validation.

Here's the code:

from sklearn import svm, grid_search

param_grid={'C':[.01,.1,1.0,10.0,100.0],'gamma':[.1,.01,.001,.0001],'kernel':['linear','rbf']}
svr=svm.SVC()
grid=grid_search.GridSearchCV(svr,param_grid)
grid.fit(x_train,y_train)
print "The best classifier is:", grid.best_estimator_
print "The best score is ", grid.best_score_
print "The best parameters are ", grid.best_params_

And here are the results:

The best classifier is: SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel=rbf, max_iter=-1, probability=False, shrinking=True,
  tol=0.001, verbose=False)
The best score is  0.898426323319
The best parameters are  {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.01}

Ironically, I had already come up with this optimized model just plugging in values. This is not a quick process. I can't give you the exact amount of time that this takes because I just go off and do something else while it is running. Note that the score is not quite as high as my model in the last post. I'm guessing this is because I split the data and used 70% for training and 30% for testing, while GridSearchCV does its own internal cross-validation (3-fold by default), so each candidate model is trained on only part of the data.

I also ran a nearest neighbor model. Here's the code and the results:

from sklearn.neighbors import KNeighborsClassifier
neigh=KNeighborsClassifier()
neigh.fit(x_train,y_train)
y_pred3=neigh.predict(x_test)
neigh_score=neigh.score(x_test,y_test)
print "The score from K neighbors is", neigh_score
cm3=confusion_matrix(y_test,y_pred3)
print "This is the confusion matrix with for K neighbors",(cm3)

The score from K neighbors is 0.883333333333
This is the confusion matrix for K neighbors [[133  22]
 [ 13 132]]

The score for the K neighbors classifier is almost as high as the optimized SVM with the rbf kernel.

I'd be very interested to hear what others are finding as they analyze this set.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Tuesday, March 19, 2013

Kaggle Data Science competition

Kaggle.com is sponsoring another learning competition for machine learning. This one specifically mentions using scikit-learn in Python. See the competition details here.

It is amazing how much more is available in scikits just since I have been writing this blog. Recently, I have switched to using Python(x,y), which is a distribution that includes everything you need for machine learning. And it's specifically for Windows!! See the information on this distribution here. You do have to be careful about the bundled plug-ins though. Specifically, the latest version of scikit-learn is 0.13.1. The version that downloads with Python(x,y) is 0.12. You'll have to update it. Don't ask me how. I took lots of wrong turns and finally figured it out, but I probably couldn't reproduce it.

The data set from Kaggle is well structured. There are 40 features and 999 training examples. The feature data is all continuous and there are no missing values. I was able to write code that gives me the SVM standard score on the leaderboard: .913.

Someday I'll have time to figure out how to use github and I'll post my code there. For now, here's what I have:

import csv as csv
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
# Reading in training data for Kaggle sci kit competition
csv_file_object=csv.reader(open('C:/Users/numbersmom/Dropbox/kaggle sci kit competition/train.csv'))
header=csv_file_object.next()
records=[]
for row in csv_file_object:records.append(row)
records=np.array(records)
records=records.astype(np.float)
csv_file_object=csv.reader(open('C:/Users/numbersmom/Dropbox/kaggle sci kit competition/train_label.csv'))
header=csv_file_object.next()
cl=[]
for row in csv_file_object:cl.append(row)
cl=np.array(cl)
cl=cl.astype(np.int8)
cl=cl.reshape(999,)
tr_ex=np.size(cl)

#Need to use 70% of the data for training and 30% for testing
n_train=int(.7*tr_ex)
x_train,x_test=records[:n_train,:],records[n_train:,:]
y_train,y_test=cl[:n_train],cl[n_train:]

#SVM code

from sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
# I tried different models, but this one with C=10 and gamma=.01
# gives the SVM benchmark score.
clf=svm.SVC(C=10.0,gamma=.01,kernel='rbf',probability=True)
clf.fit(x_train,y_train)
print clf.n_support_
y_pred1=clf.predict(x_test)
gau_score=clf.score(x_test,y_test)
print"This is the score for rbf model",gau_score
cm1=confusion_matrix(y_test,y_pred1)
print "This is the confusion matrix for rbf model",(cm1)
print "finished"

The confusion matrix looks like this: 

           pred 0    pred 1
act 0        141        14
act 1         12       133

There's lots of other stuff I can try to get that number higher. You can check out the helpful users guide to get more information.