Math + Statistics + Python: 2012

Sunday, December 16, 2012

Using matplotlib to plot the answer to ps1

Now that you know how to solve the first problem in Problem set 1, you have to graph the answer. I used matplotlib. The website for this plotting module can be found here.

Here's a copy of code to plot the data for ps 1:

#Plots x1 vs x2 for Problem 1 of the Stanford machine learning class

from pylab import *
import numpy as np
import numpy.linalg
from math import *

x1,x2=np.loadtxt('q1x.dat',unpack=True)
y=np.loadtxt('q1y.dat',unpack=True)

# we need to show when h=.5 as a line to separate the data. h=.5 when
# theta transpose x =0. so 0=theta0 +theta1*x1+theta2*x2
# solving for x2 give -theta0/theta2-theta1/theta2*x1

a=np.array([min(x1),max(x1),1,0.01])
b=-(-2.6205)/1.1719-0.7604/1.1719*a
plot(a,b)
# Use different colors and markers to plot x1 vs x2 if y=0 or y=1

for i in range(0,98):
    if y[i]==0:
        plot(x1[i],x2[i],'ro')
    if y[i]==1:
        plot(x1[i],x2[i],'bx')

xlabel('x1')
ylabel('x2')

show()

I used a for x1 and b as the calculated value for x2. Plotting a vs b gives that beautiful straight line across the plot.

One of my previous posts talked about expecting the unexpected in numpy. It happens again in this little short program. This line:
a=np.array([min(x1),max(x1),1,0.01])
was not what I though it was. I thought this gave me a list of values between the minimum of x1 and the maximum of x1. I was wrong. Here is what this actually gives:
print a
[ 0.57079941 7.7054006   1.          0.01      ]

If you want a list of values, you need to use this code:

a=np.linspace(min(z),max(z),25)

This gives 25 values between the minimum and maximum of your data set.

The rest of the code is very straight forward. Just set up a loop. If y=1, the data plots as a red circle. If y=0, the data plots as a blue x.

I know I have posted this graph before but it is so pretty, I'm going to post it again:

Wednesday, December 5, 2012

Code for the Newton's method problem

As promised, here is my code for Newton's method. Note that I am using Numpy but not pandas. A friend has lent me this book so I am currently working on learning what pandas can do for me.

The first thing you have to do is import numpy. I import as np so all of my numpy commands begin with np. For some reason, the linear algebra module does not load unless I specifically call it with an import command. I need this to invert the Hessian matrix.
import numpy as np
import numpy.linalg
from math import *

Here I read in the data. The size of x along axis=0 (rows) gives the number of training examples. In order to get the intercept term, I have to put a column of ones in front of the x data. In order to get the column of ones where I wanted it, I had to use the insert command. Then I checked the number of columns.
x=np.loadtxt('q1x.dat')
m=np.size(x,0) #number of training examples
x=np.insert(x,[0],[1],axis=1)
n=np.size(x,1) #number of columns

This piece of code sets up a vector of zeros for theta to start the iterations. I do the same thing in the code below for the Hessian matrix and gradient vector. Recall that everything in Python starts at zero so I have an n x 1 vector for the gradient instead of n+1 x 1 like in Matlab.
theta=np.zeros(n)

y=np.loadtxt('q1y.dat')

Note the reshape command for theta when I am calculating h. For some reason, I couldn't get theta to transpose.
# set up for 10 iterations
for i in range(0,10):
    H=np.zeros((n,n))
    grad=np.zeros(n)
    print "Theta just before h is ", theta
    h=1/(1+e**(np.dot(x,theta.reshape(n,1))))
    ts=np.dot(x,theta.reshape(n,1))

This part of the code is pretty self explanatory and looks very much like the matlab code from the answer sheet. However, note the outer command in the calculation for H. Let's examine this a little more closely.

Here's an example. In this example, a is an example of the lprime vector with one variable (ignore the intercept) and 4 training examples. b is the same vector. Notice they are both 4x1, so you cannot use matrix multiplication.
a=np.array([[1],[2],[3],[4]])
>>> print a
[[1]
[2]
[3]
[4]]
>>> b=np.array([[1],[2],[3],[4]])
>>> print b
[[1]
[2]
[3]
[4]]
>>> c=np.dot(a,b)

Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
    c=np.dot(a,b)
ValueError: objects are not aligned

However, look at what happens when you use the outer command:
>>> d=np.outer(a,b)
>>> print d
[[ 1 2 3 4]
[ 2 4 6 8]
[ 3 6 9 12]
[ 4 8 12 16]]

You get a 4x4 matrix. Where did this come from? The outer product gives you

out[i, j] = a[i] * b[j]

This is what you need for Hessian matrix. It took me a really, really long time to find this. You're welcome.
So here is the rest of the code:
# This calculates the values for lprime vector which will be summed below
    for j in range(0,m):

        grad=grad+(x[j]*(y[j]-h[j]))

        H=H+h[j]*(1-h[j])*np.outer(x[j],x[j])

    Hinv=np.linalg.inv(H)
    theta=theta-np.dot(Hinv,grad)

The next post will show the code for the plot.

Saturday, November 24, 2012

Expect the unexpected with Numpy

Before you can do any of the problem sets, you have to be able to load the data. The data for parts b and c of problem 1 of Problem set 1 are located here and here.

There are a many different ways to get the data to use in your program. I'm just going to explore two of them in this blog post. The first is loadtxt which reads from a .dat file. The second is csv reader which reads data from an Excel file saved in a comma separated values format.

Loadtxt

Here is the link to the command in the Numpy manual. It seems pretty straight forward. Since I imported numpy as np, all my command start with np. So the command is:

a=np.loadtxt('filename.dat')

But look what happens when you try to use the file name on the web:
z=np.loadtxt('http://cs229.stanford.edu/ps/ps1/q1y.dat')

Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
    z=np.loadtxt('http://cs229.stanford.edu/ps/ps1/q1y.dat')
File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 693, in loadtxt
    fh = iter(open(fname, 'U'))
IOError: [Errno 22] invalid mode ('U') or filename: 'http://cs229.stanford.edu/ps/ps1/q1y.dat'

Since I couldn't use this, I copied the data into notepad and saved it as a .dat file. Be sure to save it into your Python directory. Otherwise, you will have to specify the path. In this first example, I'm using the 'unpack' feature to separate x1 and x2 into separate 99x1 vectors instead of a single 99x2 matrix. Or so I thought. Watch what happens when I ask for some information on each vector x1.

x1,x2=np.loadtxt('q1x.dat',unpack=True)

>>> np.size(x1)
99
No problem here.
>>> np.shape(x1)
(99,)
What is this?

Now watch what happens when I load in the same data without unpacking:
x=np.loadtxt('q1x.dat')
>>> np.size(x)
198
>>> np.shape(x)
(99, 2)

Now I thought maybe it was just because I was using the 'unpack' command.
z=np.loadtxt('q1y.dat')
Remember the y data is just one column.
np.size(z)
99
>>> np.shape(z)
(99,)

Even though I didn't use the unpack command, I still get a shape of (99,).

What's the problem with this? Let me demonstrate when you try to transpose an array.

Here's an array to demonstrate my problem. First, I create a 3 x 1 np array by using the following command:

b=np.array([[1],[2],[3]])
>>> print b
[[1]
[2]
[3]]

Now I transpose the 3 x 1 matrix to a 1 x 3 matrix using the command np.transpose:

>>> print np.transpose(b)
[[1 2 3]]

Let me try the same command for x1:
>>> d=np.transpose(x1)
>>> np.shape(d)
(99,)

It doesn't transpose. In order to make this vector into something that can be transposed or even used in matrix multiplication, I have to actually give it the shape (99,1)

e=x1.reshape(99,1)
>>> np.shape(e)
(99, 1)

I'm not sure what is happening here. I thought that numpy sees x1 as a list.
Look at this example:
b=(1,2,3)
np.shape(b)
(3,)

But look what happens when I try to reshape:

c=b.reshape(3,1)

Traceback (most recent call last):
File "<pyshell#39>", line 1, in <module>
    c=b.reshape(3,1)
AttributeError: 'tuple' object has no attribute 'reshape'

I've searched and searched through the Numpy documentation and different problem solving sites. But I have not seen anyone address this problem. I have found a way around it which I used in my code. I'll give the full code with explanations in the next post.

CSV

If you want to put your data in an Excel file, you can copy and paste the data, then save as a CSV file.

Here's the code I used for a training set from another problem:

import csv as csv

csv_file_object = csv.reader(open('C:/Users/numbersmom/Desktop/trainver1.csv'))
header = csv_file_object.next()
# Read the data from a CSV file. use the header to set up the array.
k=[]
for row in csv_file_object: k.append(row)
k = np.array(k)
# Data comes in as strings. Must change the type to float
data = k.astype(np.float)

This looks a little complicated, so let me explain each part.
The first command is the CSV reader. Note that the file is on my desktop, not in the Python27 directory, so I have to specify the path. If you are not sure what the path is, you can always right click on the file to get the properties. This is what is listed under properties for my file: C:\Users\numbersmom\Desktop. Note that the slashes are backslashes but you must use forward slashes for specifying the file name for Python. It's little things like this that will drive you crazy.

The next four commands do the following: read in all the data in one long list including the headers, uses the headers to break the list into columns, then puts the data in a numpy array. The data comes in as strings unless you specify float. Since I am not sure how to do this, I specified all data as float after the fact. My array is now called data.

In this particular set of data, the first column (column 0 for Python) is the dependent variable and the other columns are in independent variables. Since this post is now way too long, I'll wait until another post to show how to specify specific columns in calculations.

Saturday, November 17, 2012

How do I wrap my head around this concept of logistic regression?

Logistic regression is used when you have a classification problem. A classification problem has dependent variables that can only be either 0 or 1. For example, a dealership may collect data on the sales process and the dependent variable will be either a sale is made (0) or its not made (1).

Of course, this type of data does not work very well with a traditional linear regression because the distribution of the dependent variable is not normal. But linear regression is a good place to start with this discussion because it gives me a reference point to show you what I understood about linear regression that I didn't understand about logistic regression.

When you run a linear regression with a set of data, you get a regression equation. The general form of a regression equation with one independent variable is

where a0 and a1 are the coefficients. (This is just a different form of the slope intercept form where a0 is the y intercept and a1 is the slope). It is intuitively obvious how to use this equation. If your model is good, you can substitute in an x value and output a prediction for y.

The first problem in PS#1 in the machine learning class that requires a program is this:

Implement2 Newton’s method for optimizing ℓ(θ), and apply it to
fit a logistic regression model to the data. Initialize Newton’s method with θ = ~0 (the
vector of all zeros). What are the coefficients θ resulting from your fit? (Remember
to include the intercept term.)

And here is the generalization of Newton's method in the notes:

where thetas are the coefficients, H is the Hessian matrix and the last set of symbols represent the derivative of the log likelihood function. This is not my tidy little regression equation where I can put in x values and get back a y value. And, really, how could it be since the y value is just 0 and 1? We are not in Kansas anymore, Toto. Not to mention that, while in theory I understood what the H matrix and log likelihood derivative vector are, in practice it was very difficult generate concrete equations to use in the program.

I am embarrassed to say that it took me an incredibly long time to answer these questions. In my defense, the resources on the web are really hard to understand. Did you scroll down far enough so see where part of the information is written as a debate? But here is what I figured out. Once you get the coefficient values (code will come in another post), you can calculate the value of the sigmoid function, h.

This is a cumulative probability function. If the value of this function is less than 0.5, the output value is 0. If the value is greater than 0.5, the output value is 1. See how simple that is? This generates a nice plot when there are two variables because h=0.5 when the value of the exponent on e is 0. Your equation for theta reduces to

which can be solved for x2

Now you can plot this equation with your data and get this

Not sure how to plot something like this is Python? Don't worry, I'll reveal all my code.

Wednesday, November 14, 2012

The beginning and my current project

My current project is to gain knowledge on machine learning. I am working my way through the free Stanford course on Machine learning. The course recommends Matlab or Octave. I don't have access to Matlab and cannot for the life of me figure out how to download Octave and make it work on my machine. So I've decided to use Python. Python is a free, open source programming language. I already have some experience with this language. It can be used to solve math and statistics problems especially with Numpy, Scipy and Matplotlib.

What I have been finding as I take this course is that I have enough theoretical math and statistics knowledge to understand what is being taught. But as an inexperienced programmer, it is very hard to do the assignments. The Numpy and Matplotlib manuals are in draft form and it is sometimes very hard to figure out how to do things. I did find several bloggers who have posted on using Python for different applications. I found one blogger who had several posts on the problems from the Standford course. Take a look at this post. Unfortunately, this code did not work for me. It never does. People will write code that solves a problem that I want to solve. But when I try to use the code, it doesn't work and then I don't know how to fix it. I think this is because I am so inexperienced, I can't tell which steps were skipped. That's why I think that other newbie, inexperienced programmers may benefit from my explanations.

So that's why I'm here. My next post will be on how to turn the abstract mathematical matrix equations in Newton's method (Problem set #1) into concrete equations that can be programmed for Python. Future posts will deal with how *%&@ hard it is to use matricies in Python.

By the way, I learned to program Python by taking this course. Then I increased my knowledge by using Python to solve the math problems posted at Project Euler. If you're a math geek and you haven't tried this site, you are really missing something.