Tuesday, March 19, 2013

Kaggle Data Science competition

Kaggle.com is sponsoring another machine learning competition, and this one specifically mentions using scikit-learn in Python. See the competition details here.

It is amazing how much more is available in scikit-learn just since I have been writing this blog. Recently, I switched to Python(x,y), a distribution that includes everything you need for machine learning. And it's specifically for Windows!! See the information on this distribution here. You do have to be careful about the plug-ins, though. Specifically, the latest version of scikit-learn is 0.13.1, while the version that downloads with Python(x,y) is 0.12, so you'll have to update it. Don't ask me how. I took lots of wrong turns and finally figured it out, but I probably can't reproduce it.
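For the record, the route that would probably work (an assumption on my part; I haven't verified it against Python(x,y) specifically) is upgrading the package with pip:

```shell
pip install --upgrade scikit-learn
```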

The data set from Kaggle is well structured: there are 40 features and 999 training examples, the feature data is all continuous, and there are no missing values. I was able to write code that matches the SVM benchmark score on the leaderboard: 0.913.

Someday I'll have time to figure out how to use GitHub and I'll post my code there. For now, here's what I have:

import csv
import numpy as np

# Read the training features for the Kaggle scikit-learn competition
with open('C:/Users/numbersmom/Dropbox/kaggle sci kit competition/train.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    records = np.array([row for row in reader]).astype(float)

# Read the training labels
with open('C:/Users/numbersmom/Dropbox/kaggle sci kit competition/train_label.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    cl = np.array([row for row in reader]).astype(np.int8).reshape(-1)

tr_ex = np.size(cl)

# Use 70% of the data for training and 30% for testing
n_train = int(0.7 * tr_ex)
x_train, x_test = records[:n_train, :], records[n_train:, :]
y_train, y_test = cl[:n_train], cl[n_train:]
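For what it's worth, scikit-learn can do this split for you (and shuffle the rows, which the plain slicing above does not). A minimal sketch with `train_test_split`, using random stand-in arrays in place of the real Kaggle data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in arrays with the same shapes as the Kaggle data
records = np.random.rand(999, 40)
cl = np.random.randint(0, 2, 999)

# 70/30 split; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    records, cl, test_size=0.3, random_state=0)
print(x_train.shape, x_test.shape)  # (699, 40) (300, 40)
```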

#SVM code

from sklearn import svm
from sklearn.metrics import confusion_matrix

# I tried different models, but this one with C=10 and gamma=.01
# gives the SVM benchmark score.
clf = svm.SVC(C=10.0, gamma=0.01, kernel='rbf', probability=True)
clf.fit(x_train, y_train)
print(clf.n_support_)
y_pred1 = clf.predict(x_test)
gau_score = clf.score(x_test, y_test)
print("This is the score for the rbf model:", gau_score)
cm1 = confusion_matrix(y_test, y_pred1)
print("This is the confusion matrix for the rbf model:")
print(cm1)
print("finished")

The confusion matrix looks like this:

           pred 0    pred 1
act 0      141        14
act 1       12       133
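As a sanity check, the leaderboard score can be read straight off that matrix: the diagonal holds the correct predictions, and 274 of the 300 test examples are classified correctly.

```python
import numpy as np

# Confusion matrix from the rbf SVM above (rows: actual, cols: predicted)
cm = np.array([[141, 14],
               [12, 133]])

# Accuracy = correct predictions (the diagonal) / all test examples
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # 0.913
```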

There's lots of other stuff I can try to get that number higher. You can check out the helpful scikit-learn user guide for more information.
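One obvious next step is to search over C and gamma instead of hand-picking them. A minimal sketch with GridSearchCV (using the `sklearn.model_selection` module from current scikit-learn versions, and random stand-in data rather than the actual Kaggle features):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Random stand-in data; substitute the real x_train / y_train arrays
rng = np.random.RandomState(0)
X = rng.rand(200, 40)
y = rng.randint(0, 2, 200)

# 3-fold cross-validated search over the rbf kernel's two knobs
param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```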
