Wednesday, January 9, 2013

Stuck

Just when you think you have it all figured out... everything comes undone.

I had just finished all the lectures on Support Vector Machines and I thought I had a good handle on both the concept (30,000 ft view) and the details (street view). I opened up Problem set 2 only to find that all of the data files are in Matlab format. Not only that, but you have to import a library to solve for the Lagrange multipliers. What a gyp!! You mean we aren't even going to write the code? (I now know how naive that was.) I closed out the problem set with a vague thought of googling how to convert Matlab data files into something Python can read.
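One possible way around the Matlab-format problem, assuming SciPy is installed: scipy.io.loadmat reads a .mat file straight into NumPy arrays, so no manual conversion is needed. Here's a sketch (the file name and variable name are invented for illustration):

```python
import os
import tempfile

import numpy as np
from scipy import io

# Write a sample .mat file the way Matlab would save one, just so we have
# something to read back (a real problem set would ship the file).
path = os.path.join(tempfile.gettempdir(), "demo.mat")
io.savemat(path, {"X": np.array([[1.0, 2.0], [3.0, 4.0]])})

# loadmat returns a dict mapping Matlab variable names to NumPy arrays
# (plus a few metadata keys like __header__).
data = io.loadmat(path)
X = data["X"]
print(X.shape)  # (2, 2)
```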

Meanwhile, I continued to work on writing a program for Support Vector Machines. The concept of Support Vector Machines (what kind of name is that?) is really interesting. I'm going to attempt a general explanation with no mathematics.

When I did the Newton's method problem, I was looking for a line that cuts the data into two parts. If the data from PS1 represents survived and not survived, then the idea is to have all survived on one side of the line and not survived on the other side. And that's what happens with the data from PS1 for about 89% of the points. In a real data set, if you had any outliers, you could add a fudge factor to the model that would give less weight to the outliers. I'm not kidding. That's what the statisticians do. They don't call it a fudge factor. They call it regularization. So if you have outliers and they are far from the line, you can weight them less than the points closer to the line.
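To make the "line that cuts the data in two" idea concrete, here is a toy sketch in plain Python. The weights and points are made up for illustration, not fit to the PS1 data:

```python
# A point x is predicted "survived" (+1) if it falls on one side of the line
# w.x + b = 0, and "not survived" (-1) if it falls on the other side.

def predict(w, b, x):
    """Classify point x by which side of the line w.x + b = 0 it is on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [1.0, -1.0], 0.5            # the line x1 - x2 + 0.5 = 0 (made up)
print(predict(w, b, [2.0, 1.0]))   # -> 1
print(predict(w, b, [0.0, 3.0]))   # -> -1
```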

Newton's method works well for a linear model with a small number of features and data points. But when things get large and possibly nonlinear, you need something else.

Support Vector Machines use really complicated mathematics. The idea is to find a boundary that separates the data. Now, the boundary doesn't have to be a line. It can be an oval like the contour lines on topographic maps. Or it can be something more complicated. But if the data behaves well with your model, then the boundary should separate your data. For example, if your boundary is an oval, all survived data should be inside the oval and all not survived should be outside the oval. This boundary is defined by the data points closest to it. For example, if you have a data point at the exact center of the oval, it is safely in the survived region. I don't need to know anything else about that point. But the points near the edge of the inside region become very important. These points define exactly where the border is drawn. And since only these points define the boundary, they are the only points that need to be used in the model. They are called the support vectors, which is where the method gets its name.
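To make that concrete, here is a toy sketch of a decision function built only from support vectors. The support vectors, multipliers, bias, and kernel choice below are invented for illustration, not learned from any data:

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """A similarity score that falls off with distance -- this is what lets
    the boundary be curved (an oval, say) instead of a straight line."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# (point, label, Lagrange multiplier) for each support vector. Only these
# points appear in the sum -- the rest of the training data can be thrown away.
support_vectors = [((0.0, 0.0), +1, 1.0), ((2.0, 2.0), -1, 1.0)]
bias = 0.0

def decision(x):
    """The sign of the weighted kernel sum decides the class of x."""
    total = sum(alpha * y * gaussian_kernel(sv, x)
                for sv, y, alpha in support_vectors)
    return 1 if total + bias >= 0 else -1

print(decision((0.1, 0.0)))  # -> 1  (near the +1 support vector)
print(decision((2.0, 2.1)))  # -> -1 (near the -1 support vector)
```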

Here are some pictures of plots of support vector machines.

Of course, to solve a math problem, we first have to write an equation, then solve it. It turns out that the equation that defines this problem is too hard to solve as is. Without getting into details, and with a lot of hand waving: the problem can be rewritten as something that can be solved with calculus and Lagrange multipliers, and the solution can then be substituted back into the original equation to get the boundary. I can tell that your eyes have already glazed over. But let me tell you why this is important. It turns out that using calculus and Lagrange multipliers turns the original problem into a convex problem that can be solved with convex optimization software. (Think about the parabola that you learned about in algebra: remember that we could find its maximum or minimum using some formulas. This is just a more complex version of that.) The problem is that the only convex optimization software I found for Python is CVXOPT, and it doesn't work with, you guessed it, Windows. At least I can't get it to work. Here are the installation instructions.

See this instruction:
tar -xvf blas.tgz
 
This is a Linux command, and it does not work in Vista. I can unzip this file, but I don't trust any of the instructions.

It turns out that there is a version of Python that already has this installed. It's called Sage and, you guessed it, it doesn't run natively on Windows. You have to set it up in a VirtualBox virtual machine. I guess that is the next order of business.
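In the meantime, here is a toy, pure-Python sketch of the kind of problem CVXOPT would solve: maximize the dual objective sum(a) - 1/2 * sum over i,j of a_i * a_j * y_i * y_j * (x_i . x_j), subject to a_i >= 0. To keep it simple I've dropped the bias term from the boundary, which removes the equality constraint from the dual, so a simple projected gradient ascent works. The data points are made up; this is for intuition, not a real solver:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# A tiny, linearly separable toy data set (invented for illustration).
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
n = len(X)

# Kernel matrix -- here just plain dot products (a linear boundary).
K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]

# Projected gradient ascent on the Lagrange multipliers a_i.
a = [0.0] * n
for _ in range(2000):
    for i in range(n):
        # gradient of the dual objective with respect to a_i
        grad = 1.0 - sum(a[j] * y[j] * K[i][j] for j in range(n))
        a[i] = max(0.0, a[i] + 0.01 * grad)  # project back onto a_i >= 0

# Recover the weight vector: w = sum over i of a_i * y_i * x_i.
# Points with a_i > 0 are the support vectors; the rest drop out of the sum.
w = [sum(a[i] * y[i] * X[i][d] for i in range(n)) for d in range(2)]
predictions = [1 if dot(w, x) >= 0 else -1 for x in X]
print(predictions == y)  # every training point lands on the correct side
```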

Tuesday, January 1, 2013

Getting IPython to work in the Windows environment

For those of you following along, you know that I have been using Python instead of Matlab or Octave for the machine learning course.

It turns out that Python is a base for interactive computing. It's like buying the base model of a car: it comes with the standard equipment and not much else. It will get you from Point A to Point B. But if you want to do something fancy, you have to add on.

For interactive data analysis, there are modules which you can import into Python that make scientific computing easier. You can add in Numpy (numerical Python), which gives you access to arrays. You can add in Pandas, which gives you access to dataframes. Dataframes allow you to treat data as if it were in a spreadsheet. This makes it much easier to summarize the data. I'll do a separate post on dataframes later.
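As a quick taste of what a dataframe buys you (the column names and numbers below are invented, not the PS1 data):

```python
import numpy as np
import pandas as pd

# A tiny made-up table: each row is a passenger.
df = pd.DataFrame({
    "age": np.array([22.0, 38.0, 26.0, 35.0]),
    "survived": np.array([0, 1, 1, 1]),
})

# Spreadsheet-style summaries in one line each:
print(df["age"].mean())                      # average age overall
print(df.groupby("survived")["age"].mean())  # average age by outcome
```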

With each new module that you add in, there are new data structures and commands to learn. This makes it incredibly frustrating for a newbie like me.

So when a friend loaned me Wes McKinney's Python for Data Analysis book, I was thrilled. I figured I could just follow along and learn everything I need to know. Of course, life is never that easy, as I found out when I got to Chapter 3. In Chapter 3, Mr. McKinney starts using IPython. In order to keep using the book, I had to install this on my computer, which runs Windows Vista. It turns out this is a big problem because all of the instructions for downloading IPython are written assuming you are using a Linux based system.

I have finally gotten IPython working on my computer, but it took a lot of research and finagling to do it. In order to help you, I'll try to walk you through the steps.

The completely unhelpful documentation for installation can be found at ipython.org. Click on the link and read the documentation. The only thing I understood when I read it is that I need Python version 2.6 or higher already installed on my computer. I had already installed Python 2.7 so that I could use Numpy, Matplotlib and Pandas. But what are easy_install and pip? The documentation doesn't explain, and there is no further information when you click on pypi.

I did find a blog (this is usually the best source for a newbie) that explains it all. Click on this link to get the instructions. Now that you have done all that, you are ready to use IPython and the interactive notebook. You can see a picture of it here.

Here's how I start up the notebook. It's not perfect, but it gets me where I want to go.

Click on the Windows icon circle.
Type cmd in the search box.
The window with the command prompt will open.
You must change the directory. Type cd c:\Python27\scripts
When the command line prompt comes back, type ipython notebook --pylab=inline
This opens up the notebook and allows you to get plots in the notebook instead of a separate window. The only problem I have is that it opens the notebook in Internet Explorer, where it really doesn't work. I just copy the address into Firefox and it works for me.
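The steps above, condensed into what I actually type at the command prompt:

```shell
rem From the Windows command prompt (Start > type cmd):
cd c:\Python27\scripts
ipython notebook --pylab=inline
```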

I have just finished all the lectures for Support Vector Machines, so I will be working on the next problem set.