Friday, July 12, 2013

Constructing a crude ensemble method in R and tuning an SVM

I'm still working to improve my Kaggle scikit-learn score. As I've said in a previous post, I think I need to use an ensemble method. The question is, which method?

I tried AdaBoost with decision trees, but it didn't improve on the test error given by the SVM model.

I decided to try fitting many different models and combining them in some way. I used the sonar data for this, since it is a much more manageable size than the Kaggle competition data.

The first thing I decided to try was a straight linear combination. First, I ran all of the different models (SVM, decision tree, KNN, logistic regression, and naive Bayes). I generated the predicted y values for each of these models and added them together. This is not an easy thing to do in R. Remember I said in a previous post that R recognizes categorical variables? Unfortunately, once R has decided that a variable is categorical, it refuses to do math on it; arithmetic on a factor doesn't "make sense" to R. You have to convert the factor back to a numeric value first. This is not at all intuitive, and the obvious conversion can give you unexpected results. For example, when I tried to change one of the predicted vectors back to numeric, I got 0. Plain old vanilla constant 0.

Here is the code that gave me consistent results. Assume that your predicted response is in the vector y_pred. Then

numeric_y_pred <- as.numeric(as.character(y_pred))

gives you a vector that you can add.
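To see why the two-step conversion matters, here is a small illustration (the values are made up). A factor stores level indices internally, so as.numeric() alone returns those indices instead of the labels:

y_pred <- factor(c(-1, 1, 1, -1))
as.numeric(y_pred)                # 1 2 2 1 -- the level indices, wrong for arithmetic
as.numeric(as.character(y_pred))  # -1 1 1 -1 -- the actual predicted values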

But a linear combination of these models gave me a test error of 14.1%. This is better than most of the individual models, but not as good as the SVM.
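For concreteness, here is a minimal sketch of that unweighted vote. The pred_* names are mine and purely illustrative; each is assumed to be a factor of -1/1 predictions from one of the models:

# Sum the -1/+1 votes from each model, then take the sign as the ensemble prediction
preds <- list(pred_svm, pred_tree, pred_knn, pred_logit, pred_nb)
votes <- Reduce(`+`, lapply(preds, function(p) as.numeric(as.character(p))))
ensemble_pred <- sign(votes)  # with five models and -1/1 labels there are no ties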

So I decided to use the error to weight the vectors. I used the same formula as in the AdaBoost code:

alpha <- 0.5 * log((1 - error) / error)

I multiplied each predicted vector by its respective alpha value. When I did this, I got a test error of 9%. Eureka, I thought.
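A sketch of the weighted version, reusing preds from the sketch above (the error rates here are placeholders, not my actual numbers):

# Weight each model's vote by its alpha, computed from that model's error rate
errors <- c(0.11, 0.25, 0.18, 0.22, 0.27)   # placeholder error rates, one per model
alphas <- 0.5 * log((1 - errors) / errors)  # lower-error models get larger weights
weighted <- Map(function(p, a) a * as.numeric(as.character(p)), preds, alphas)
ensemble_pred <- sign(Reduce(`+`, weighted))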

However, this did not translate to the Kaggle data. The tuned SVM in R had a test error of 13.3%, but the ensemble had a test error of 14%.

Speaking of the tuned SVM model, I had a very difficult time with it. Every time I tried to run it, I got an error. I finally tweaked the code and got it to run. The code that I used is not the same as the code in the R help (tune.svm comes from the e1071 package); I don't know if the package was updated and the help was not. Here is the code that worked for me:

fit7 <- tune.svm(x, y, gamma = 10^(-3:-1), cost = 10^(1:5))
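Once the grid search finishes, the result can be inspected like this (these accessors come from e1071's tune object):

summary(fit7)           # cross-validation error for each gamma/cost pair
fit7$best.parameters    # the winning gamma and cost
fit7$best.model         # an svm fit refit with the best parameters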

I submitted the tuned SVM model from R to the Kaggle competition, and it did not perform as well as the tuned SVM from Python.

Back to the digital drawing board.

Using sample weights to fit the SVM model in Python


In order to write AdaBoost code around a model, you need to be able to fit the model using sample weights and to generate the probability distribution of the outcomes. As far as I know, R doesn't have an SVM implementation that does this, but scikit-learn does.

Unfortunately, I was not able to make it work.

I used the sonar data from David Mease's class. First, I preprocessed the x training data so that each feature has mean 0 and standard deviation 1. Then I used the scikit-learn API to apply the same scaling to the test data. When I ran the SVM without sample weights, the training error was 0 and the test error was 10.3%. Then I constructed a weight vector with all weights equal to 1/N, where N is the number of training samples, 130 in this case.
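Here is a sketch of what that fit looks like. The variable names are mine, StandardScaler is my assumption about which preprocessing class to use, and the SVM parameters are left at their defaults:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the training data to mean 0, sd 1, then apply the same transform to the test data
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Uniform sample weights, 1/N each (N = 130 here)
weights = np.ones(len(y_train)) / len(y_train)

clf = SVC()  # default RBF kernel
clf.fit(X_train_s, y_train, sample_weight=weights)
test_error = 1 - clf.score(X_test_s, y_test)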

When I fit the data with the sample weights, the training error and test error were both awful. The training error was 49.3% and the test error was 42.3%. The test error was less than the training error, so clearly something is not right here. Someone on the scikit-learn mailing list suggested that I scale C by the length of the training data (the idea being that the weights multiply C per sample, so weights of 1/N effectively shrink C by a factor of N), but that didn't make any difference.

So this is not a usable option right now. If I find out what is going wrong, I'll update the blog.


The code for this post can be found on GitHub at Link.