Sunday, April 7, 2013

Tuning C and gamma in the SVM model

In the previous post, the best SVM model for the Kaggle data was an rbf kernel. I'd like to find the best parameters for C and gamma. In the end I did a grid search with dictionary values for the parameters, but I would have liked to have scikit-learn search for the best parameters for me, so I first tried this code:

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold  # sklearn.model_selection in newer versions
from sklearn.grid_search import GridSearchCV          # sklearn.model_selection in newer versions

C_range = 10.0 ** np.arange(-2, 9)      # .01 up to 100,000,000
gamma_range = 10.0 ** np.arange(-5, 4)  # .00001 up to 1000
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedKFold(cl, n_folds=3)     # cl holds the class labels
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)

This code should search for the best values for C in the range .01 to 100,000,000 and for gamma in the range .00001 to 1000. Unfortunately, calling fit on this grid just makes my computer crash, and I never got an answer.

But I found a reference (listed below) that says an exhaustive grid search is time consuming. It suggests doing a coarse search first and then a fine search once you are in the correct region.
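
As a rough sketch of that two-stage idea (not the exact code I ran; X and y stand in for the training features and labels):

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# Stage 1: coarse search on a log scale to find the right neighborhood
coarse_grid = dict(C=10.0 ** np.arange(-2, 5), gamma=10.0 ** np.arange(-5, 1))
coarse = GridSearchCV(SVC(), param_grid=coarse_grid, cv=3)
coarse.fit(X, y)
best_C = coarse.best_params_['C']
best_gamma = coarse.best_params_['gamma']

# Stage 2: fine search in a narrow band around the coarse winner
fine_grid = dict(C=np.linspace(best_C / 2.0, best_C * 2.0, 5),
                 gamma=np.linspace(best_gamma / 2.0, best_gamma * 2.0, 5))
fine = GridSearchCV(SVC(), param_grid=fine_grid, cv=3)
fine.fit(X, y)
print(fine.best_params_, fine.best_score_)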

So starting from C=10 and gamma = .01, I refined my search and here are the values that I got:

C      gamma    score
10     .01      .898
 9     .0095    .898
 8     .009     .901
 7     .0085    .901
 6     .0085    .901
 5.5   .0085    .901
 5.4   .0085    .901
 5.3   .0086    .901
 5.28  .0086    .901

You can see that I started by using dictionary values for C with a step of 1 on each side of 10, and for gamma with a step of .005 on each side of .01. As the grid search stabilized, I narrowed the step on C to .1 and on gamma to .0005. Gamma was very stable, so I didn't change its step size again. I arbitrarily stopped changing C when the step size reached .01.
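
For concreteness, one round of that dictionary-style refinement around C=10 and gamma=.01 might look like the sketch below, with X and y again standing in for the training data:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# One refinement round: a step of 1 on each side of C=10, .005 on each side of gamma=.01
param_grid = dict(C=[9, 10, 11], gamma=[.005, .01, .015])
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # re-center the grid on the winner, shrink the steps, and repeat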

When I ran these parameters using my 70/30 split on the data, I got a score of .9166. This is about a 0.36% improvement on the previous score of .9133.
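
A minimal sketch of that final check, assuming a 70/30 train_test_split (the variable names and random_state are placeholders, not my actual setup):

from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# 70/30 split, then score the tuned model on the held-out 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(C=5.28, gamma=.0086)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the hold-out set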

Reference: Hsu, C.-W., Chang, C.-C., and Lin, C.-J., A Practical Guide to Support Vector Classification, retrieved from http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf