Sunday, June 23, 2013

R vs Python for machine learning

In my last post, I talked about tuning an SVM for the Kaggle competition. I submitted my tuned SVM; my score on the leaderboard is .90350. Not only am I 198th on the leaderboard and sinking fast, but I didn't even reach the SVM Benchmark score. Meanwhile, the top person has a score of .99031.

I figured that only an ensemble method would get me to a higher score, so I started to experiment with these methods. I never managed to come up with an ensemble that even matched my original submission. While I was doing this, I noticed some things about scikit-learn in Python that made me start to think about looking for other tools.

I decided to try R and RapidMiner. RapidMiner has not been a successful experience: I can't seem to get past the set-up-repository/import-data stage. I have had much more success with R. Most of this is due to a wonderful set of videos by David Mease. If you are interested in learning R for data mining and machine learning, his videos are pure gold. There are 13 videos on YouTube. Not only does he show you how to use R, but he has all the example data sets online so that you can play along. He also does a wonderful job of explaining what benchmarks to use.

David uses a subset of a well-known sonar data set: 130 observations in the training set, 78 observations in the test set, and 60 features. He goes over several methods with the same data set. I still have one more video to watch, but so far he has covered decision trees, SVM, and k nearest neighbors. He uses k nearest neighbors with k=1 as a benchmark, which is the default in R. For this data set, it gives a misclassification rate of 21%. This is better than the decision tree misclassification rate, which is about 30%. But the SVM should be able to beat untuned k nearest neighbors.

I used this same sonar data set to compare results in R and Python.

k nearest neighbors
Misclassification rate for R: 21%
Misclassification rate for Python: I could not get this. I set n_neighbors=1, but I got this error:

C:\Python27\lib\site-packages\sklearn\neighbors\ NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
  neigh_dist, neigh_ind = self.kneighbors(X)

The default distance in k nearest neighbors is the Euclidean distance, so the data should be scaled so that the variances of the variables are equal. R does this automatically; Python requires you to scale the data yourself.
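To show what that extra scaling step looks like, here is a small scikit-learn sketch on synthetic data (the toy data and variable names are my own illustration, not the sonar set): without StandardScaler, the large-variance feature swamps the Euclidean distance and 1-nearest-neighbor does no better than guessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# Two features on wildly different scales; the class depends only on
# the sign of the small-scale feature.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] > 0).astype(int)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Unscaled: Euclidean distance is dominated by the noisy second feature.
knn_raw = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
acc_raw = knn_raw.score(X_test, y_test)

# Scaled to unit variance first, so both features contribute equally.
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier(n_neighbors=1).fit(
    scaler.transform(X_train), y_train)
acc_scaled = knn_scaled.score(scaler.transform(X_test), y_test)
```

Scaling is fit on the training data only and then applied to the test data, which is the step R's defaults hide from you.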

Decision Tree

The following table shows the results I got:

[Table: training and test accuracy for the R and Python decision trees at each max depth. Columns: R training accuracy, R test accuracy, Python training accuracy, Python test accuracy.]


Note that the results are the same for a max depth of 1 and 2. As the max depth increases, it looks like scikit-learn gives the better results. However, the test accuracy stays fairly flat for both models, while the Python model's training accuracy increases to 1.0. It certainly looks like max depths 4 and 5 in Python have overfit the data. It would be nice to compare pictures of the two trees. The tree in R is quite easy to plot; Python requires some graphics modules that were fairly involved to use, at least for me, and I couldn't get either one to work. The R model won't fit a max depth of 6 because of overfitting issues.
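As a sketch of the depth-versus-overfitting pattern, here is a scikit-learn loop on synthetic data with the same 130/78 split and 60 features (the data itself is made up, so the accuracies won't match the table above): training accuracy can only climb as the depth cap is raised, while test accuracy need not follow.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(208, 60))          # 60 features, like the sonar data
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:130], y[:130]     # 130-observation training set
X_test, y_test = X[130:], y[130:]       # 78-observation test set

train_acc, test_acc = [], []
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth,
                                  random_state=0).fit(X_train, y_train)
    train_acc.append(tree.score(X_train, y_train))  # keeps rising
    test_acc.append(tree.score(X_test, y_test))     # tends to flatten
```

Because a depth-d tree is a prefix of the depth-(d+1) tree grown greedily on the same data, training accuracy is non-decreasing in max depth, which is why it alone can't tell you when to stop.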

Support Vector Machines

The first thing I did was run a default support vector machine in R and in Python. Both programs use an RBF kernel by default.

R scales the data and uses cost=1 and gamma=1/(number of features) as default values. The untuned SVM gives a misclassification error of 1.5% on the training data and about 13% on the test data.

Python doesn't scale the data, and neither did I. (Maybe this is not a fair comparison, but it is an extra step in scikit-learn that isn't required in R.) The Python default values are C=1 (cost=1) and gamma=0, which scikit-learn treats as 1/(number of features). This untuned SVM gives a misclassification error of about 30% on the training data and about 36% on the test data.
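Here is a small synthetic-data sketch of the difference scaling makes for the RBF kernel in scikit-learn; I set gamma to 1/(number of features) explicitly to mirror the defaults described above (the data and variable names are my own illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# One informative feature plus one huge-variance nuisance feature.
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 500, 300)])
y = (X[:, 0] > 0).astype(int)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
gamma = 1.0 / X.shape[1]   # 1/(number of features), as in the defaults

# Unscaled: the nuisance feature swamps the RBF kernel distances.
raw = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_train, y_train)
err_raw = 1 - raw.score(X_test, y_test)

# Scaled to unit variance first, as R's svm() does by default.
scaler = StandardScaler().fit(X_train)
scaled = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(
    scaler.transform(X_train), y_train)
err_scaled = 1 - scaled.score(scaler.transform(X_test), y_test)
```

With unscaled data the pairwise distances are huge, so the RBF kernel is nearly zero between any two distinct points and the classifier collapses toward predicting one class.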

I've already talked in a previous post about how the Python grid search crashes my computer. R has a procedure for tuning the SVM, but it produces an error when I try to run it.
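For what it's worth, one way to keep a grid search from eating the machine is to shrink the grid. This is a minimal sketch with a deliberately small grid on synthetic data (the import path is the current one and may differ by scikit-learn version):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] > 0).astype(int)

# A deliberately small grid; the full log-scale sweep over C and gamma
# is what makes a grid search expensive.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_   # the winning C/gamma pair
```

Six parameter combinations times three cross-validation folds is eighteen fits, which stays manageable; a coarse grid like this can be refined around the winner in a second pass.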

In addition to the questions I have about how scikit-learn models fit the data, there is the additional problem of categorical data. R usually recognizes categorical data, and if it doesn't, you can declare a variable to be a factor and R will know how to handle it. Python requires you to transform your own categorical data, and it is a kludgy process. There is a class called OneHotEncoder, but you can't use it unless you first convert all of your text data to numbers.
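To make the kludginess concrete, here is a minimal pure-Python sketch of the two-step transformation: map each text label to an integer, then expand the integers into 0/1 indicator columns (the function name one_hot and the toy labels are my own, not a scikit-learn API):

```python
def one_hot(values):
    """Turn a list of text labels into 0/1 indicator columns."""
    categories = sorted(set(values))                  # fixed column order
    index = {c: i for i, c in enumerate(categories)}  # text -> integer step
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                             # indicator step
        rows.append(row)
    return categories, rows

cats, encoded = one_hot(["red", "green", "red", "blue"])
```

Each input label becomes a row with a single 1 in the column for its category, which is the numeric form the scikit-learn estimators expect; in R, declaring the column a factor gets you this for free.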

I still have a lot to learn about machine learning in R. But from what I've seen so far, I think I'll stick with R when I want to run a machine learning algorithm.

