Saturday, November 24, 2012

Expect the unexpected with Numpy


Before you can do any of the problem sets, you have to be able to load the data. The data for parts b and c of problem 1 of Problem Set 1 are located here and here.

There are many different ways to get the data into your program. I'm just going to explore two of them in this blog post. The first is loadtxt, which reads from a plain text file such as a .dat file. The second is the csv reader, which reads data from an Excel file saved in comma-separated values (CSV) format.

Loadtxt

Here is the link to the command in the Numpy manual. It seems pretty straightforward. Since I imported numpy as np, all my commands start with np. So the command is:

a=np.loadtxt('filename.dat')

But look what happens when you try to use the file's web address as the file name:
z=np.loadtxt('http://cs229.stanford.edu/ps/ps1/q1y.dat')

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    z=np.loadtxt('http://cs229.stanford.edu/ps/ps1/q1y.dat')
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 693, in loadtxt
    fh = iter(open(fname, 'U'))
IOError: [Errno 22] invalid mode ('U') or filename: 'http://cs229.stanford.edu/ps/ps1/q1y.dat'
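
One possible workaround, which I did not use here, is to download the text first and give loadtxt a file-like object instead of a URL string. This is only a sketch in Python 3 syntax (the session above is Python 2.7). The name loadtxt_from_url is a hypothetical helper, and the sample text is made up so that the parsing step can run without a network connection.

```python
import io
from urllib.request import urlopen

import numpy as np


def loadtxt_from_url(url):
    """Hypothetical helper: download a text file, then parse it with np.loadtxt."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    # np.loadtxt accepts a file-like object, so wrap the text in StringIO.
    return np.loadtxt(io.StringIO(text))


# Made-up stand-in for the downloaded text (assumption: one number per line).
fake_download = "0.0\n1.0\n1.0\n"
z = np.loadtxt(io.StringIO(fake_download))
print(z.shape)  # (3,)
```

With a working connection, the call would look like loadtxt_from_url('http://cs229.stanford.edu/ps/ps1/q1y.dat'), assuming the file is still hosted at that address.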

Since loadtxt would not open the URL directly, I copied the data into Notepad and saved it as a .dat file. Be sure to save it in your Python working directory. Otherwise, you will have to specify the path. In this first example, I'm using the 'unpack' option to split x1 and x2 into two separate 99x1 vectors instead of a single 99x2 matrix. Or so I thought. Watch what happens when I ask for some information on the vector x1.

x1,x2=np.loadtxt('q1x.dat',unpack=True)

>>> np.size(x1)
99
No problem here.
>>> np.shape(x1)
(99,)
What is this?

Now watch what happens when I load in the same data without unpacking:
x=np.loadtxt('q1x.dat')
>>> np.size(x)
198
>>> np.shape(x)
(99, 2)
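
As far as I can tell, unpack=True simply hands back the columns of the same matrix one at a time, so x1 should match the first column of x. Here is a small sketch of that comparison in Python 3 syntax. The three rows of data are made up and stand in for q1x.dat.

```python
import io

import numpy as np

# Made-up two-column data standing in for q1x.dat (assumption).
x_text = "1.0 2.0\n3.0 4.0\n5.0 6.0\n"

x = np.loadtxt(io.StringIO(x_text))
x1, x2 = np.loadtxt(io.StringIO(x_text), unpack=True)

print(x.shape)   # (3, 2)
print(x1.shape)  # (3,)

# Each unpacked vector is one column of the full matrix.
print(np.array_equal(x1, x[:, 0]))  # True
print(np.array_equal(x2, x[:, 1]))  # True
```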

Now I thought maybe it was just because I was using the 'unpack' option.
z=np.loadtxt('q1y.dat')
Remember the y data is just one column.
np.size(z)
99
>>> np.shape(z)
(99,)

Even though I didn't use the unpack option, I still get a shape of (99,).
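
My best reading is that a shape of (99,) means a one-dimensional array: it has a length but no separate rows and columns. The sketch below (Python 3 syntax, with made-up data standing in for q1y.dat) checks the number of dimensions with ndim. It also tries the ndmin argument of loadtxt, which asks for at least that many dimensions in the result.

```python
import io

import numpy as np

# Made-up one-column data standing in for q1y.dat (assumption).
y_text = "0\n1\n1\n"

z = np.loadtxt(io.StringIO(y_text))
print(z.ndim, z.shape)    # 1 (3,)

# ndmin=2 asks loadtxt to keep the single column as a 2-D array.
z2 = np.loadtxt(io.StringIO(y_text), ndmin=2)
print(z2.ndim, z2.shape)  # 2 (3, 1)
```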

What's the problem with this? Let me demonstrate with what happens when you try to transpose an array.

Here's an array to demonstrate my problem. First, I create a 3 x 1 np array by using the following command:

b=np.array([[1],[2],[3]])
>>> print b
[[1]
 [2]
 [3]]


Now I transpose the 3 x 1 matrix to a 1 x 3 matrix using the command np.transpose:

>>> print np.transpose(b)
[[1 2 3]]


Let me try the same command for x1:
>>> d=np.transpose(x1)
>>> np.shape(d)
(99,)


It doesn't transpose. In order to make this vector into something that can be transposed, or used as a column in matrix multiplication, I have to actually give it the shape (99,1):

>>> e=x1.reshape(99,1)
>>> np.shape(e)
(99, 1)
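
Here is a short sketch of a few ways to get the same result without typing the length by hand (Python 3 syntax). The array x1 below is a made-up five-element stand-in for the loaded column. Passing -1 to reshape lets numpy work out that dimension, and np.newaxis adds an axis of length 1.

```python
import numpy as np

# Made-up stand-in for the loaded vector (assumption): shape (5,)
x1 = np.arange(5.0)

# Transposing a 1-D array leaves the shape unchanged.
print(x1.T.shape)                  # (5,)

col = x1.reshape(-1, 1)            # column vector, shape (5, 1)
col_alt = x1[:, np.newaxis]        # same result using np.newaxis
row = x1.reshape(1, -1)            # row vector, shape (1, 5)
print(col.shape, row.shape)        # (5, 1) (1, 5)

# A (1, 5) times (5, 1) product gives a 1 x 1 matrix,
# while the dot product of two 1-D arrays gives a single number.
print(np.dot(row, col).shape)      # (1, 1)
print(np.dot(x1, x1))
```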

I'm not sure what is happening here. I thought that numpy sees x1 as something like a plain Python list.
Look at this example, where b is an ordinary Python tuple:
b=(1,2,3)
np.shape(b)
(3,)

But look what happens when I try to reshape:

c=b.reshape(3,1)

Traceback (most recent call last):
  File "<pyshell#39>", line 1, in <module>
    c=b.reshape(3,1)
AttributeError: 'tuple' object has no attribute 'reshape'
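
My understanding of this error is that np.shape will accept any list-like input, but reshape is a method that only numpy arrays have. A minimal sketch in Python 3 syntax: convert the tuple with np.asarray first, and then it can be reshaped. The name b_arr is my own, chosen just for this example.

```python
import numpy as np

b = (1, 2, 3)                  # a plain Python tuple, not a numpy array
print(type(b).__name__)        # tuple
print(np.shape(b))             # (3,)  np.shape accepts array-like input

b_arr = np.asarray(b)          # convert the tuple to a numpy array
print(type(b_arr).__name__)    # ndarray

c = b_arr.reshape(3, 1)        # reshape is available on the array
print(c.shape)                 # (3, 1)
```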

I've searched and searched through the Numpy documentation and different problem-solving sites, but I have not seen anyone address this problem. I have found a way around it, which I used in my code. I'll give the full code with explanations in the next post.

CSV

If you want to put your data in an Excel file, you can copy and paste the data, then save as a CSV file.

Here's the code I used for a training set from another problem:

import csv
import numpy as np

csv_file_object = csv.reader(open('C:/Users/numbersmom/Desktop/trainver1.csv'))
# Read the header row first so it is not mixed in with the data.
header = next(csv_file_object)
# Read the remaining rows from the CSV file. Each row is a list of strings.
k=[]
for row in csv_file_object:
    k.append(row)
k = np.array(k)
# Data comes in as strings. Must change the type to float
data = k.astype(float)

This looks a little complicated, so let me explain each part.
The first command after the imports creates the CSV reader. Note that the file is on my desktop, not in the Python27 directory, so I have to specify the path. If you are not sure what the path is, you can always right click on the file to get the properties. This is what is listed under properties for my file: C:\Users\numbersmom\Desktop. Note that Windows shows the path with backslashes, but a backslash starts an escape sequence in an ordinary Python string, so the simplest thing is to use forward slashes in the file name. Doubling each backslash or using a raw string also works. It's little things like this that will drive you crazy.
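
To show why the backslashes cause trouble, here is a small sketch (Python 3 syntax) that compares a few ways of writing the same path. It only builds strings and never opens a file. The variable names are my own.

```python
# Three spellings of the same Windows path.
raw_path = r'C:\Users\numbersmom\Desktop\trainver1.csv'           # raw string
escaped_path = 'C:\\Users\\numbersmom\\Desktop\\trainver1.csv'    # doubled backslashes
forward_path = 'C:/Users/numbersmom/Desktop/trainver1.csv'        # forward slashes

print(raw_path == escaped_path)                      # True
print(forward_path.replace('/', '\\') == raw_path)   # True

# In an ordinary string, a backslash followed by t is read as a tab character,
# so the file name silently changes.
broken = 'Desktop\trainver1.csv'
print('\t' in broken)                                # True
```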

The remaining commands do the following: next reads the header row and sets it aside, the loop appends each remaining row to a list, and np.array turns that list of rows into a numpy array. The csv reader returns every value as a string. Since I am not sure how to make it read the values as float directly, I converted all the data to float after the fact. My array is now called data.
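
Another possible route, which I did not use for this post, is to let loadtxt read the CSV file itself. With delimiter=',' it splits on commas, and skiprows=1 skips the header row, so the values should come back as floats without a separate conversion. This is a sketch in Python 3 syntax. The column names and numbers are made up, and StringIO stands in for the real file.

```python
import io

import numpy as np

# Made-up CSV text standing in for trainver1.csv (assumption).
csv_text = "y,x1,x2\n1,0.5,2.0\n0,1.5,3.0\n"

# delimiter=',' splits each line on commas; skiprows=1 skips the header row.
data = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1)

print(data.dtype)  # float64
print(data.shape)  # (2, 3)
```

With a real file, the path string would take the place of the StringIO object.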

In this particular set of data, the first column (column 0 for Python) is the dependent variable and the other columns are the independent variables. Since this post is now way too long, I'll wait until another post to show how to pick out specific columns in calculations.
