Friday, July 12, 2013
Using sample weights to fit the SVM model in Python
In order to write an adaboost code for a model, you need to be able to fit the model using sample weights and to generate the probability distribution of the outcomes. As far as I know, R doesn't have a SVM model that does this, but sci kit learn does.
Unfortunately, I was not able to make it work.
I used the sonar data from David Mease's class. First, I preprocessed the x training data so the mean and standard deviation are 0 and 1 respectively. Then I used the API to scale the test data. When I ran the SVM without sample weights, the training error was 0 and the test error was 10.3%. Then I constructed a weight vector so that the weights are all 1/N. N in this case is 130.
When I fitted the data with the sample weights, the training error and test error were both awful. The training error was 49.3% and the test error was 42.3%. The test error was less than the training error. Clearly something is not right here. Someone from the sci kit learn list serv suggests that I scale C by the length of the training data, but that didn't make any difference.
So this is not a useable option right now. If I find out what is going wrong, I'll update the blog.
The code for this post can be found on github at Link.