I'm still working to improve my score in the Kaggle scikit-learn competition. As I said in a previous post, I think I need to use an ensemble method. The question is, which method?
I tried AdaBoost with decision trees, but it didn't improve on the test error given by the SVM model.
I decided to try fitting many different models and combining them in some way. I used the sonar data for this since it is a much more manageable size than the Kaggle competition data.
The first thing I decided to try was a straight linear combination. First, I ran all of the different models (SVM, decision tree, kNN, logistic regression and naive Bayes). I generated the predicted y values for each of these models and added them together. This is not an easy thing to do in R. Remember I said in a previous post that R recognizes categorical variables? Unfortunately, once R has decided that a variable is categorical (a factor), it refuses to do math on it. It doesn't "make sense". You have to change the factor back to a numeric value. This is not at all intuitive and can give you unexpected results, because calling as.numeric() directly on a factor returns R's internal level codes rather than the original labels. For example, when I tried to change one of the predicted vectors back to numeric, I got 0. Plain old vanilla constant 0.
Here is the code that gave me consistent results. Assume that your predicted response is in the vector y_pred. Then
numeric_y_pred <- as.numeric(as.character(y_pred))
gives you a vector that you can add.
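To see why the extra as.character() step matters, here is a small sketch. The vector y_pred below is made up for illustration, and I am assuming the class labels are -1 and 1; your labels may differ.

```r
# Hypothetical predictions stored as a factor with labels -1 and 1
y_pred <- factor(c(-1, 1, 1, -1, 1))

# as.numeric() alone returns the internal level codes (1 and 2), not the labels
codes <- as.numeric(y_pred)

# Going through character first recovers the original labels
numeric_y_pred <- as.numeric(as.character(y_pred))

print(codes)
print(numeric_y_pred)
```

The first vector contains only 1s and 2s, which would quietly give wrong sums if you added it to other predictions.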
But a straight sum of these models gave me a test error of 14.1%. This is better than most of the individual models, but not as good as the SVM on its own.
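A minimal sketch of the unweighted combination might look like this. The prediction vectors and y_test are invented, I am assuming the classes are coded as -1 and 1, and taking the sign of the sum as the final prediction is my assumption about how to turn the sum back into a class.

```r
# Hypothetical -1/+1 predictions from three models on five test cases
pred_svm  <- c( 1, -1,  1,  1, -1)
pred_tree <- c( 1,  1, -1,  1, -1)
pred_knn  <- c(-1, -1,  1,  1,  1)
y_test    <- c( 1, -1,  1,  1, -1)  # hypothetical true labels

# Add the predictions together; each model gets one equal vote
vote_sum <- pred_svm + pred_tree + pred_knn

# The sign of the sum is the majority class (no ties with an odd number of models)
ensemble_pred <- sign(vote_sum)

# Test error is the fraction of misclassified cases
test_error <- mean(ensemble_pred != y_test)
print(test_error)
```

With an even number of models the sum can be 0, so you would need a tie-breaking rule.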
So I decided to use each model's error to weight its predicted vector. I used the same formula as in the AdaBoost code:
alpha <- .5*log((1-error)/error)
I multiplied each predicted vector by its respective alpha value before adding them together. When I did this, I got a test error of 9%. Eureka, I thought.
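Here is a rough sketch of the weighted version. The error rates and predictions are invented, the helper name model_alpha is mine, and I am again assuming -1/+1 class labels with the sign of the weighted sum as the final prediction. Ideally the errors used for the weights would come from data that is not part of the final test set.

```r
# AdaBoost-style weight: lower error gives a larger alpha
model_alpha <- function(error) 0.5 * log((1 - error) / error)

# Hypothetical error rates for three models
errors <- c(svm = 0.10, tree = 0.25, knn = 0.20)
alphas <- model_alpha(errors)

# Hypothetical -1/+1 predictions, one column per model (same order as alphas)
preds <- cbind(
  svm  = c( 1, -1,  1,  1, -1),
  tree = c( 1,  1, -1,  1, -1),
  knn  = c(-1, -1,  1,  1,  1)
)

# Multiply each column by its alpha and add across models
weighted_sum <- as.vector(preds %*% alphas)

# The sign of the weighted sum is the ensemble prediction
weighted_pred <- sign(weighted_sum)

print(round(alphas, 3))
print(weighted_pred)
```

Note that a model with an error above 50% gets a negative alpha, so its vote is flipped, and an error of exactly 0 or 1 makes the formula blow up.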
However, this did not translate to the Kaggle data. The tuned SVM in R had a test error of 13.3%, but the ensemble had a test error of 14%.
Speaking of the tuned SVM model, I had a very difficult time with this. Every time I tried to run it, I got an error. I finally tweaked the code and got it to run. The code that I used is not the same as the code in the R help. I don't know if the package was updated and the help was not. Here is the code that worked for me:
fit7 <- tune.svm(x,y,gamma=10^(-3:-1),cost=10^(1:5))
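For anyone who wants to try the same call, here is a small self-contained sketch. It assumes the e1071 package is installed and uses two classes from the built-in iris data as a stand-in for x and y, not the data from this post.

```r
library(e1071)  # provides tune.svm()

set.seed(1)  # the cross-validation folds are random

# Stand-in data: first two iris species, four numeric predictors
x <- iris[1:100, 1:4]
y <- factor(iris$Species[1:100])  # re-factor to drop the unused third level

# Same grid as in the post: gamma in 0.001..0.1, cost in 10..100000
fit7 <- tune.svm(x, y, gamma = 10^(-3:-1), cost = 10^(1:5))

# Best gamma/cost pair and its cross-validated error
print(fit7$best.parameters)
print(fit7$best.performance)

# The model refit with the best parameters can be used for prediction
tuned_pred <- predict(fit7$best.model, x)
```

The object returned by tune.svm() also keeps the full grid of results, which you can look at with summary(fit7).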
I submitted the tuned SVM model from R to the Kaggle competition, and it did not perform as well as the tuned SVM from Python.
Back to the digital drawing board.