Tuesday, March 19, 2013

Kaggle Data Science competition

Kaggle.com is sponsoring another machine learning competition, and this one specifically mentions using scikit-learn in Python. See the competition details here.

It is amazing how much more is available in scikit-learn just since I have been writing this blog. Recently, I switched to Python(x,y), a distribution that includes everything you need for machine learning. And it's specifically for Windows!! See the information on this distribution here. You do have to be careful about the plug-ins, though. Specifically, the latest version of scikit-learn is 0.13.1, while the version that downloads with Python(x,y) is 0.12, so you'll have to update it. Don't ask me how. I took lots of wrong turns and finally figured it out, but I probably can't reproduce it.
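For the record, the route that would probably work (an assumption on my part; I haven't verified it against Python(x,y) specifically) is upgrading the package with pip:

```shell
pip install --upgrade scikit-learn
```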

The data set from Kaggle is well structured: there are 40 features and 999 training examples, the feature data is all continuous, and there are no missing values. I was able to write code that matches the SVM benchmark score on the leaderboard: 0.913.

Someday I'll have time to figure out how to use GitHub and I'll post my code there. For now, here's what I have:

import csv
import numpy as np

# Read the training features for the Kaggle scikit-learn competition
with open('C:/Users/numbersmom/Dropbox/kaggle sci kit competition/train.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    records = np.array([row for row in reader]).astype(float)

# Read the training labels
with open('C:/Users/numbersmom/Dropbox/kaggle sci kit competition/train_label.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    cl = np.array([row for row in reader]).astype(np.int8).reshape(-1)

tr_ex = np.size(cl)

# Use 70% of the data for training and 30% for testing
n_train = int(0.7 * tr_ex)
x_train, x_test = records[:n_train, :], records[n_train:, :]
y_train, y_test = cl[:n_train], cl[n_train:]
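For what it's worth, scikit-learn can do this split for you (and shuffle the rows, which the plain slicing above does not). A minimal sketch with `train_test_split`, using random stand-in arrays in place of the real Kaggle data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in arrays with the same shapes as the Kaggle data
records = np.random.rand(999, 40)
cl = np.random.randint(0, 2, 999)

# 70/30 split; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    records, cl, test_size=0.3, random_state=0)
print(x_train.shape, x_test.shape)  # (699, 40) (300, 40)
```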

#SVM code

from sklearn import svm
from sklearn.metrics import confusion_matrix

# I tried different models, but this one with C=10 and gamma=.01
# gives the SVM benchmark score.
clf = svm.SVC(C=10.0, gamma=0.01, kernel='rbf', probability=True)
clf.fit(x_train, y_train)
print(clf.n_support_)
y_pred1 = clf.predict(x_test)
gau_score = clf.score(x_test, y_test)
print("This is the score for the rbf model:", gau_score)
cm1 = confusion_matrix(y_test, y_pred1)
print("This is the confusion matrix for the rbf model:")
print(cm1)
print("finished")

The confusion matrix looks like this:

           pred 0    pred 1
act 0      141        14
act 1       12       133
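As a sanity check, the leaderboard score can be read straight off that matrix: the diagonal holds the correct predictions, and 274 of the 300 test examples are classified correctly.

```python
import numpy as np

# Confusion matrix from the rbf SVM above (rows: actual, cols: predicted)
cm = np.array([[141, 14],
               [12, 133]])

# Accuracy = correct predictions (the diagonal) / all test examples
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # 0.913
```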

There's lots of other stuff I can try to get that number higher. You can check out the helpful scikit-learn user guide for more information.
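One obvious next step is to search over C and gamma instead of hand-picking them. A minimal sketch with GridSearchCV (using the `sklearn.model_selection` module from current scikit-learn versions, and random stand-in data rather than the actual Kaggle features):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Random stand-in data; substitute the real x_train / y_train arrays
rng = np.random.RandomState(0)
X = rng.rand(200, 40)
y = rng.randint(0, 2, 200)

# 3-fold cross-validated search over the rbf kernel's two knobs
param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```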
