
Wednesday, March 27, 2013

More on the Data Science competition in Kaggle

In my last post, I talked about the Data Science competition in Kaggle. In that post, I ran an optimized SVM model with a Gaussian kernel. In this post, I'll go into a little more depth about the data and the models.

I characterized the data as "well structured". I have already mentioned that the data is continuous with no missing values. I used a combination of numpy and pandas to look for missing values, check the mean and standard deviation of each feature, produce histograms to look for skew and outliers, and build a correlation matrix to see whether any features were strongly linearly correlated. These are not formal statistical tests, but this process gave me a good feel for the data and for whether I needed any preprocessing.
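For anyone following along, the checks look roughly like this (a sketch of the idea, not my exact code; the file name train.csv is a placeholder for the actual competition file):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')   # placeholder name for the competition file

print df.isnull().sum()         # count missing values in each feature
print df.describe()             # mean, std, min/max of each feature
df.hist()                       # histograms to spot skew and outliers
plt.show()
print df.corr()                 # pairwise linear correlations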

Once I determined that I had a good data set, I proceeded to modeling. Since there are no categorical features, I decided not to run any kind of decision tree analysis. Since the response is a binary class label, I started with logistic regression and a linear SVM. Each of these gave a score of 0.797.
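Here's roughly what that looks like (a sketch, assuming x_train, y_train, x_test and y_test come from the 70/30 split I describe below):

from sklearn import svm
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression()
logit.fit(x_train, y_train)
print "Logistic regression score:", logit.score(x_test, y_test)

lin_svc = svm.SVC(kernel='linear')
lin_svc.fit(x_train, y_train)
print "Linear SVM score:", lin_svc.score(x_test, y_test)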

At this point, I decided to try a grid search. Here's the description from the user's guide: GridSearchCV implements a “fit” method and a “predict” method like any classifier except that the parameters of the classifier used to predict is optimized by cross-validation.

Here's the code:

# Assumes x_train and y_train from the 70/30 split of the data.
from sklearn import svm, grid_search

# Candidate parameter values for the grid search to try.
param_grid = {'C': [.01, .1, 1.0, 10.0, 100.0],
              'gamma': [.1, .01, .001, .0001],
              'kernel': ['linear', 'rbf']}
svr = svm.SVC()
grid = grid_search.GridSearchCV(svr, param_grid)  # cross-validates every combination
grid.fit(x_train, y_train)
print "The best classifier is:", grid.best_estimator_
print "The best score is ", grid.best_score_
print "The best parameters are ", grid.best_params_

And here are the results:

The best classifier is: SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel=rbf, max_iter=-1, probability=False, shrinking=True,
  tol=0.001, verbose=False)
The best score is  0.898426323319
The best parameters are  {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.01}

Ironically, I had already come up with this optimized model just by plugging in values. This is not a quick process; I can't give you the exact running time because I just go off and do something else while it runs. Note that the score is not quite as high as my model in the last post. I'm guessing this is because I scored that model on my 30% holdout after training on the full 70%, while GridSearchCV scores by cross-validation within the training set, so each fit sees only part of that 70%.
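If you want to see the two kinds of scores side by side, something like this works (a sketch; sklearn.cross_validation is the module that pairs with the grid_search import above, and cv=3 matches the old default fold count):

from sklearn import svm
from sklearn.cross_validation import cross_val_score

best_svc = svm.SVC(C=10.0, gamma=0.01, kernel='rbf')

# Cross-validation score on the training set, like GridSearchCV reports.
print "Mean CV score:", cross_val_score(best_svc, x_train, y_train, cv=3).mean()

# Score on the 30% holdout set, like my model in the last post.
best_svc.fit(x_train, y_train)
print "Holdout score:", best_svc.score(x_test, y_test)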

I also ran a nearest neighbor model. Here's the code and the results:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Assumes the same x_train/x_test, y_train/y_test split as above.
neigh = KNeighborsClassifier()      # defaults to 5 neighbors
neigh.fit(x_train, y_train)
y_pred3 = neigh.predict(x_test)
neigh_score = neigh.score(x_test, y_test)
print "The score from K neighbors is", neigh_score
cm3 = confusion_matrix(y_test, y_pred3)
print "This is the confusion matrix for K neighbors:", cm3

The score from K neighbors is 0.883333333333
This is the confusion matrix for K neighbors: [[133  22]
 [ 13 132]]

The score for the K neighbors classifier is almost as high as that of the optimized SVM with the rbf kernel.

I'd be very interested to hear what others are finding as they analyze this set.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Saturday, November 17, 2012

How do I wrap my head around this concept of logistic regression?

Logistic regression is used when you have a classification problem. A classification problem has a dependent variable that can take only the values 0 or 1. For example, a dealership may collect data on the sales process, and the dependent variable is whether a sale is made (1) or not made (0).

Of course, this type of data does not work very well with traditional linear regression, because the distribution of the dependent variable is not normal. But linear regression is a good place to start this discussion because it gives me a reference point: there was something I understood about linear regression that I didn't understand about logistic regression.

When you run a linear regression on a set of data, you get a regression equation. The general form of a regression equation with one independent variable is

y = a0 + a1*x

where a0 and a1 are the coefficients. (This is just a different form of the slope-intercept form, where a0 is the y-intercept and a1 is the slope.) It is intuitively obvious how to use this equation: if your model is good, you can substitute in an x value and output a prediction for y.
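In Python, that whole fit-then-predict loop is a couple of lines (a sketch with made-up numbers, using numpy's polyfit for the fitting step):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a1, a0 = np.polyfit(x, y, 1)        # degree-1 fit returns [slope, intercept]
print "Prediction at x = 6:", a0 + a1 * 6.0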

The first problem in PS#1 in the machine learning class that requires a program is this:

Implement Newton's method for optimizing ℓ(θ), and apply it to
fit a logistic regression model to the data. Initialize Newton's method with θ = 0 (the
vector of all zeros). What are the coefficients θ resulting from your fit? (Remember
to include the intercept term.)

And here is the generalization of Newton's method in the notes:

θ := θ − H⁻¹ ∇θℓ(θ)

where the thetas are the coefficients, H is the Hessian matrix, and the last set of symbols represents the derivative of the log likelihood function. This is not my tidy little regression equation where I can put in x values and get back a y value. And, really, how could it be, since the y value is just 0 or 1? We are not in Kansas anymore, Toto. Not to mention that, while in theory I understood what the H matrix and the log likelihood derivative vector are, in practice it was very difficult to generate concrete equations to use in the program.
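Jumping ahead a bit, here is a minimal numpy sketch of what that update turns into (the function name and the fixed iteration count are my own illustrative choices; X is assumed to already include a column of ones for the intercept term):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=10):
    # X is (m, n) with an intercept column of ones; y holds 0/1 labels.
    theta = np.zeros(X.shape[1])                 # theta starts as the zero vector
    for _ in range(n_iter):
        h = sigmoid(X.dot(theta))                # predicted probabilities
        grad = X.T.dot(y - h)                    # derivative of the log likelihood
        W = h * (1.0 - h)                        # per-example weights
        H = -(X * W[:, None]).T.dot(X)           # Hessian of the log likelihood
        theta = theta - np.linalg.solve(H, grad) # the Newton update
    return theta

Notice that the whole update is matrix algebra; there is no tidy y = a0 + a1*x anywhere in it.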

I am embarrassed to say that it took me an incredibly long time to answer these questions. In my defense, the resources on the web are really hard to understand. Did you scroll down far enough to see where part of the information is written as a debate? But here is what I figured out. Once you get the coefficient values (code will come in another post), you can calculate the value of the sigmoid function, h:
h(x) = 1 / (1 + e^(−θᵀx))

This is a cumulative probability function. If the value of this function is less than 0.5, the output value is 0. If the value is greater than 0.5, the output value is 1. See how simple that is? This generates a nice plot when there are two variables, because h = 0.5 exactly when the exponent on e is 0. Your equation in the thetas reduces to

θ0 + θ1*x1 + θ2*x2 = 0

which can be solved for x2:

x2 = −(θ0 + θ1*x1) / θ2

Now you can plot this equation with your data and get this: [plot: the two classes of data points with the decision boundary line between them]
Not sure how to plot something like this in Python? Don't worry, I'll reveal all my code.
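In the meantime, here is the basic idea as a sketch (theta, X and y are assumed to come from a fit like the one above, with column 0 of X being the intercept column of ones):

import numpy as np
import matplotlib.pyplot as plt

# Plot the two classes using feature columns 1 and 2 of X.
plt.scatter(X[y == 0, 1], X[y == 0, 2], marker='o', label='y = 0')
plt.scatter(X[y == 1, 1], X[y == 1, 2], marker='x', label='y = 1')

# The decision boundary: x2 as a function of x1 where h = 0.5.
x1 = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
x2 = -(theta[0] + theta[1] * x1) / theta[2]
plt.plot(x1, x2, 'k-', label='decision boundary')
plt.legend()
plt.show()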

Wednesday, November 14, 2012

The beginning and my current project

My current project is to gain knowledge of machine learning. I am working my way through the free Stanford course on machine learning. The course recommends Matlab or Octave. I don't have access to Matlab and cannot for the life of me figure out how to download Octave and make it work on my machine. So I've decided to use Python. Python is a free, open-source programming language, and I already have some experience with it. It can be used to solve math and statistics problems, especially with Numpy, Scipy and Matplotlib.

What I have been finding as I take this course is that I have enough theoretical math and statistics knowledge to understand what is being taught. But as an inexperienced programmer, I find it very hard to do the assignments. The Numpy and Matplotlib manuals are in draft form, and it is sometimes very hard to figure out how to do things. I did find several bloggers who have posted on using Python for different applications, including one blogger who had several posts on the problems from the Stanford course. Take a look at this post. Unfortunately, this code did not work for me. It never does. People will write code that solves a problem that I want to solve, but when I try to use the code, it doesn't work, and then I don't know how to fix it. I think this is because I am so inexperienced that I can't tell which steps were skipped. That's why I think that other newbie, inexperienced programmers may benefit from my explanations.

So that's why I'm here. My next post will be on how to turn the abstract mathematical matrix equations in Newton's method (Problem Set #1) into concrete equations that can be programmed in Python. Future posts will deal with how *%&@ hard it is to use matrices in Python.

By the way, I learned to program in Python by taking this course. Then I increased my knowledge by using Python to solve the math problems posted at Project Euler. If you're a math geek and you haven't tried this site, you are really missing something.