Sunday, April 7, 2013

Tuning C and gamma in the SVM model

In the previous post, the best SVM model for the Kaggle data used an rbf kernel. I'd like to find the best values for the C and gamma parameters. I did a grid search with hand-picked dictionary values for the parameters, but I would have liked scikit-learn to search for the best parameters for me, so I tried this code:

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

C_range = 10.0 ** np.arange(-2, 9)      # C from .01 to 100,000,000
gamma_range = 10.0 ** np.arange(-5, 4)  # gamma from .00001 to 1,000
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedKFold(cl, n_folds=3)     # cl holds the class labels
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
grid.fit(data, cl)                      # data holds the training features

This code should search for the best values for C in the range .01 to 100,000,000 and gamma in the range .00001 to 1,000. Unfortunately, this search just makes my computer crash, and I never got an answer.

But I found a reference here that says an exhaustive grid search is time consuming. It suggests using a coarse search first and then a fine search once you are in the correct region.
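The coarse-then-fine idea can be sketched roughly like this, using the current scikit-learn API (the module paths have changed since this post was written) and a synthetic dataset standing in for the Kaggle data; the parameter ranges are illustrative:

```python
# A minimal sketch of a coarse-then-fine grid search; synthetic data
# stands in for the Kaggle set, and the ranges are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

def best_params(C_values, gamma_values):
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid=dict(C=C_values, gamma=gamma_values),
                        cv=3)
    grid.fit(X, y)
    return grid.best_params_

# Coarse pass: a few powers of ten.
coarse = best_params(10.0 ** np.arange(-1, 3), 10.0 ** np.arange(-3, 0))

# Fine pass: small multiplicative steps around the coarse winner.
C0, g0 = coarse["C"], coarse["gamma"]
fine = best_params(C0 * np.array([0.5, 0.75, 1.0, 1.5, 2.0]),
                   g0 * np.array([0.5, 0.75, 1.0, 1.5, 2.0]))
```

The coarse pass keeps the grid small enough to finish quickly; only the fine pass pays for a dense grid, and only in a small region.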

So starting from C=10 and gamma = .01, I refined my search and here are the values that I got:

C       gamma     score
10      .0100     .898
 9      .0095     .898
 8      .0090     .901
 7      .0085     .901
 6      .0085     .901
5.5     .0085     .901
5.4     .0085     .901
5.3     .0086     .901
5.28    .0086     .901

You can see that I started by using dictionary values for C with a step of 1 on each side of 10, and a step of .005 on each side of gamma. As the grid search stabilized, I narrowed the step on C to .1 and on gamma to .0005. Gamma was very stable, so I didn't change its step size further. I arbitrarily stopped refining C when the step size reached .01.
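The manual refinement above can be sketched as a loop that re-centers a small grid on the current best (C, gamma) and shrinks the step each round; the starting point and stopping rule match the post, but the data is synthetic:

```python
# Hypothetical sketch of the refinement loop described above: search a
# small grid around the current best (C, gamma), then shrink the step,
# stopping once the C step falls below .01. Synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

C, gamma = 10.0, 0.01        # starting point from the coarse search
C_step, g_step = 1.0, 0.005  # initial step sizes
while C_step >= 0.01:
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid=dict(C=[C - C_step, C, C + C_step],
                        gamma=[gamma - g_step, gamma, gamma + g_step]),
        cv=3)
    grid.fit(X, y)
    C, gamma = grid.best_params_["C"], grid.best_params_["gamma"]
    C_step, g_step = C_step / 10, g_step / 10
```

Each round fits only nine parameter combinations, so the whole refinement costs far less than one exhaustive grid.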

When I ran these parameters using my 70/30 split on the data, I got a score of .9166. This is a 0.36% improvement on the previous score of .9133.
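The final check on a 70/30 split looks roughly like this with the tuned parameters plugged in; synthetic data stands in for the Kaggle set, so the score will differ from the .9166 reported above:

```python
# Evaluate the tuned parameters (C=5.28, gamma=.0086) on a 70/30 split.
# Synthetic data stands in for the Kaggle set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

model = SVC(kernel="rbf", C=5.28, gamma=0.0086)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)  # mean accuracy on the held-out 30%
print(acc)
```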

Reference: Hsu, Chang, and Lin, A Practical Guide to Support Vector Classification.