Wednesday, March 27, 2013

More on the Data Science competition in Kaggle

In my last post, I talked about the Data Science competition on Kaggle. In that post, I ran an optimized SVM model with a Gaussian kernel. In this post, I'll go into a little more depth about the data and the models.

I characterized the data as "well structured". I have already mentioned that the data are continuous with no missing values. I used a combination of numpy and pandas to look for missing values, check the mean and standard deviation of each feature, produce histograms to look for skew and outliers, and build a correlation matrix to see whether any features had strong linear correlations. These are not formal statistical tests, but the process gave me a good feel for the data and for whether I needed any preprocessing.
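For what it's worth, here's a rough sketch of what that exploration step looks like. This isn't the exact code I ran; it assumes the training data has already been loaded into a pandas DataFrame called train from a hypothetical train.csv:

import pandas as pd
import matplotlib.pyplot as plt

# load the training data (the file name is just a placeholder)
train = pd.read_csv('train.csv')

# count missing values and summarize mean, std, min and max for each feature
print train.isnull().sum()
print train.describe()

# histograms to eyeball skew and outliers
train.hist()
plt.show()

# correlation matrix to spot strong linear relationships between features
print train.corr()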

Once I determined that I had a good data set, I proceeded to modeling. Since there are no categorical features, I decided not to run any kind of decision tree analysis. Since the response is a binary class label, I started with logistic regression and a linear SVM. Each of these gave a score of .797.
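For reference, a minimal sketch of those two baselines might look like this, assuming the same x_train, x_test, y_train, y_test split from the last post:

from sklearn.linear_model import LogisticRegression
from sklearn import svm

# logistic regression baseline with default settings
logit = LogisticRegression()
logit.fit(x_train, y_train)
print "Logistic regression score:", logit.score(x_test, y_test)

# SVM with a linear kernel
lin_svc = svm.SVC(kernel='linear')
lin_svc.fit(x_train, y_train)
print "Linear SVM score:", lin_svc.score(x_test, y_test)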

At this point, I decided to try a grid search. Here's the description from the user's guide: "GridSearchCV implements a “fit” method and a “predict” method like any classifier except that the parameters of the classifier used to predict is optimized by cross-validation."

Here's the code:

from sklearn import svm, grid_search

# candidate values for the regularization strength C, the RBF kernel
# width gamma, and the kernel type itself
param_grid={'C':[.01,.1,1.0,10.0,100.0],'gamma':[.1,.01,.001,.0001],'kernel':['linear','rbf']}
svr=svm.SVC()
grid=grid_search.GridSearchCV(svr,param_grid)
grid.fit(x_train,y_train)
print "The best classifier is:", grid.best_estimator_
print "The best score is ", grid.best_score_
print "The best parameters are ", grid.best_params_

And here are the results:

The best classifier is: SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel=rbf, max_iter=-1, probability=False, shrinking=True,
  tol=0.001, verbose=False)
The best score is  0.898426323319
The best parameters are  {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.01}

Ironically, I had already come up with this optimized model just by plugging in values by hand. Grid search is not a quick process; I can't give you the exact running time because I just go off and do something else while it runs. Note that the score is not quite as high as the one for my model in the last post. I'm guessing this is because the earlier score was computed on the 30% of the data I held out for testing, while the best_score_ reported by GridSearchCV is a cross-validation score computed only on the 70% training split.
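One way to make the comparison apples to apples would be to score the grid search's refit best estimator on the held-out 30% test set, something like this sketch (again assuming the x_test, y_test split from before):

# score the best (refit) estimator from the grid search on the held-out test set
best = grid.best_estimator_
print "Test set score for the best estimator is", best.score(x_test, y_test)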

I also ran a nearest neighbor model. Here's the code and the results:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# k-nearest neighbors classifier with the default settings (5 neighbors)
neigh=KNeighborsClassifier()
neigh.fit(x_train,y_train)
y_pred3=neigh.predict(x_test)
neigh_score=neigh.score(x_test,y_test)
print "The score from K neighbors is", neigh_score
cm3=confusion_matrix(y_test,y_pred3)
print "This is the confusion matrix for K neighbors",(cm3)

The score from K neighbors is 0.883333333333
This is the confusion matrix for K neighbors [[133  22]
 [ 13 132]]

The score for the K neighbors classifier is almost as high as that of the optimized SVM with the RBF kernel.

I'd be very interested to hear what others are finding as they analyze this set.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
