Wednesday, January 9, 2013

Stuck

Just when you think you have it all figured out... everything comes undone.

I had just finished all the lectures on Support Vector Machines and I thought I had a good handle on the concept (30,000 ft view) and the details (street view). I opened up Problem Set 2 only to find that all of the data files are in Matlab format. Not only that, but you have to import a library to solve for the Lagrange multipliers. What a rip-off!! You mean we aren't even going to write the code? (I now know how naive that was.) I closed out the problem set with a vague thought of googling how to convert Matlab data files into something Python can read.
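
For what it's worth, the conversion may not need much googling. Here is a minimal sketch, assuming SciPy and NumPy are installed. The file name example.mat and the variable names X and y are made up for illustration (they are not the actual problem set files), so the sketch writes its own small file first and then reads it back.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Hypothetical stand-in for a problem-set file: write a tiny .mat file
# so the example is self-contained. "X" and "y" are assumed names.
path = os.path.join(tempfile.mkdtemp(), "example.mat")
savemat(path, {"X": np.array([[1.0, 2.0], [3.0, 4.0]]),
               "y": np.array([1, -1])})

# loadmat returns a dict that maps Matlab variable names to NumPy arrays
# (plus a few metadata keys such as "__header__").
data = loadmat(path)
X = data["X"]
y = data["y"].ravel()  # Matlab vectors come back 2-D, so flatten them

print(X.shape, y)
```

One caveat: loadmat does not read files saved in Matlab's newer v7.3 format, which is HDF5 underneath and needs a different reader.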

Meanwhile, I continued to work on writing a program for Support Vector Machines. The concept of Support Vector Machines (what kind of name is that?) is really interesting. I'm going to attempt a general explanation with no mathematics.

When I did the Newton's method problem, I was looking for a line that cuts the data into two parts. If the data from PS1 represents survived and not survived, then the idea is to have all survived on one side of the line and all not survived on the other side. And that's what happens with the data from PS1: about 89% of the points land on the correct side. In a real data set, if you had any outliers, you could add a fudge factor to the model that keeps the line from chasing them. I'm not kidding. That's what the statisticians do. They don't call it a fudge factor. They call it regularization. So if you have a few outliers far from the line, the model is discouraged from contorting itself just to accommodate them.
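
To make the "line plus fudge factor" idea concrete, here is a minimal sketch of Newton's method for a logistic model with a single feature, so the "line" is just a cutoff value on the x axis. The data is invented (it is not the PS1 data), and the penalty strength lam, the function names, and the iteration count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_line_newton(xs, ys, lam=0.1, iterations=25):
    """Fit p(survived) = sigmoid(b0 + b1 * x) by Newton's method.

    lam is the regularization strength: it penalizes a large slope b1.
    The intercept b0 is left unpenalized.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the penalized negative log-likelihood
        g0 = sum(p - y for p, y in zip(ps, ys))
        g1 = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) + lam * b1
        # Entries of the symmetric 2x2 Hessian
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs)) + lam
        det = h00 * h11 - h01 * h01
        # Newton step: subtract (Hessian inverse) times gradient
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up data: 1 = survived, 0 = not survived, with some overlap
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_line_newton(xs, ys)
boundary = -b0 / b1  # the x where the predicted probability is exactly 0.5
print(b0, b1, boundary)
```

Raising lam shrinks the slope b1, which makes the model less eager to commit hard to either side of the cutoff.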

Newton's method works well for a linear model with a small number of features and data points. But when things get large and possibly nonlinear, you need something else.

Support Vector Machines use really complicated mathematics. The idea is to find a boundary that separates the data. Now, the boundary doesn't have to be a line. It can be an oval, like a contour on a topographic map. Or it can be something more complicated. But if the data behaves well with your model, then the boundary should separate your data. For example, if your boundary is an oval, all survived data should be inside the oval and all not survived should be outside the oval. This boundary is defined by the data points closest to it. For example, if you have a data point at the exact center of the oval, it is safely in the survived region. I don't need to know anything else about that point. But the points at the outside edge of the inside area become very important. These points define exactly where the border is drawn. And since only these points define that boundary, they are the only points that need to be used in the model. They are called the support vectors.
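
The claim that only the closest points matter is easiest to see in one dimension, where the widest-gap boundary between two cleanly separated groups is just the midpoint between the two nearest opposing points. The sketch below is a toy under that assumption, not a general SVM: the function name and the numbers are invented for illustration.

```python
def max_margin_threshold(negatives, positives):
    """1-D toy case: assumes every negative lies below every positive."""
    lo = max(negatives)  # the negative point closest to the gap
    hi = min(positives)  # the positive point closest to the gap
    if lo >= hi:
        raise ValueError("classes overlap; this simple picture does not apply")
    threshold = (lo + hi) / 2.0  # boundary sits in the middle of the gap
    margin = (hi - lo) / 2.0     # distance from boundary to nearest point
    return threshold, margin, (lo, hi)


not_survived = [0.5, 1.0, 1.8, 2.0]
survived = [3.0, 3.4, 4.5, 6.0]

full = max_margin_threshold(not_survived, survived)
# Throw away everything except the two points nearest the gap
reduced = max_margin_threshold([2.0], [3.0])

print(full)
print(full == reduced)
```

The boundary comes out the same whether all eight points are used or only the two on the edge of the gap, which is the sense in which those two points "support" it.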

Here are some pictures of plots of support vector machines.

Of course, to solve a math problem, we first have to write an equation, then solve it. It turns out that the equation that defines this problem is too hard to solve as is. Without getting into details, and with a lot of hand waving, the problem can be rewritten as something that can be solved with calculus and Lagrange multipliers. The solution is then substituted back into the original equation to get the boundary. I can tell that your eyes have already glazed over. But let me tell you why this is important. It turns out that using calculus and Lagrange multipliers allows us to turn the original problem into a convex problem that can be solved using convex optimization software. (Think about the parabola that you learned about in algebra: remember that we could find its maximum or minimum by using some formulas. This is just a more complex version of that.) The problem is that the only convex optimization software that I found for Python is CVXOPT, and it doesn't work with, you guessed it, Windows. At least I can't get it to work. Here are the installation instructions.
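
The parabola comparison is closer to the truth than it sounds. With just two data points, one from each class, the rewritten problem really is a parabola in a single Lagrange multiplier, and no optimization software is needed. The sketch below assumes the standard textbook hard-margin formulation (which may differ in notation from the course's version); the function names and the two points are invented for illustration.

```python
def two_point_svm(x_neg, x_pos):
    """Hard-margin SVM for exactly one point per class.

    The constraint that ties the multipliers together forces both to be
    the same value a, and the rewritten (dual) objective becomes
        2*a - 0.5 * a**2 * d2,   where d2 = |x_pos - x_neg|**2,
    a downward-opening parabola whose peak is at a = 2 / d2.
    """
    diff = [p - n for n, p in zip(x_neg, x_pos)]
    d2 = sum(c * c for c in diff)
    alpha = 2.0 / d2                  # vertex of the parabola
    w = [alpha * c for c in diff]     # substitute back to get the boundary
    b = 1.0 - sum(wi * xi for wi, xi in zip(w, x_pos))
    return alpha, w, b


def score(w, b, x):
    """Positive on the x_pos side, negative on the x_neg side, 0 on the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


x_neg = [1.0, 1.0]  # made-up "not survived" point
x_pos = [3.0, 3.0]  # made-up "survived" point

alpha, w, b = two_point_svm(x_neg, x_pos)
print(alpha, w, b)
print(score(w, b, x_neg), score(w, b, x_pos), score(w, b, [2.0, 2.0]))
```

With more than two points there is one multiplier per point and the parabola becomes a many-dimensional bowl, which is where a convex optimization package earns its keep.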

See this instruction:
tar -xvf blas.tgz
 
This is a Linux command, and it does not work in Vista. I can unzip this file, but I don't trust any of the instructions.

It turns out that there is a Python-based math package that already has CVXOPT installed. It's called Sage and, you guessed it, it doesn't work on Windows. You have to run it inside a VirtualBox virtual machine. I guess that is the next order of business.
