Sunday, June 23, 2013

R vs Python for machine learning

In my last post, I talked about tuning an SVM for the Kaggle competition. I submitted my tuned SVM, and my score on the leaderboard is .90350. Not only am I 198th on the leaderboard and sinking fast, but I didn't even reach the SVM Benchmark score. Meanwhile, the top person is at a score of .99031.

I figured that only an ensemble method would get me to a higher score, so I started to experiment with these methods. I never managed to come up with an ensemble that even matched my original submission. While I did this, I noticed some things about scikit-learn in Python that made me start to think about looking for other tools.

I decided to try R and RapidMiner. RapidMiner has not been a successful experience. I can't seem to get past the set up repository/import data stage. I have had much more success with R. Most of this is due to a wonderful set of videos by David Mease. If you are interested in learning R for data mining and machine learning, his videos are pure gold. There are 13 videos on YouTube. Not only does he show you how to use R, but he has all the example data sets online so that you can play along. He also does a wonderful job of explaining what benchmarks to use.

David uses a subset of a well-known sonar data set. He uses 130 observations in the training set and 78 observations in the test set. There are 60 features. He goes over several methods with the same data set. I still have one more video to watch, but so far he has covered decision trees, SVM and k nearest neighbors. He uses k nearest neighbors with k=1 as a benchmark. This is the default in R. For this data set, it gives a misclassification rate of 21%. This is better than the decision tree misclassification rate, which is about 30%. But the SVM should be able to beat the untuned k nearest neighbors.

I used this same sonar data set to compare results in R and Python.

k nearest neighbors
Misclassification rate for R: 21%
Misclassification rate for Python: could not get this. I set n_neighbors=1, but I got this warning:


C:\Python27\lib\site-packages\sklearn\neighbors\classification.py:131: NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
  neigh_dist, neigh_ind = self.kneighbors(X)

The default distance in k nearest neighbors is the Euclidean distance. The data should be scaled so that every variable has the same variance. R does this automatically. Python requires you to scale the data yourself.
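To make the scaling point concrete, here is a minimal sketch in plain Python (no scikit-learn) of standardizing the features before a 1-nearest-neighbor lookup. The four training points, the labels "a" and "b", and the helper names are all made up for illustration; this is not the sonar data. scikit-learn has a StandardScaler class that does the scaling step, but I am only showing the idea here.

```python
import math

def column_stats(rows):
    """Return a (mean, standard deviation) pair for every column of rows."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        stats.append((mean, std or 1.0))  # avoid dividing by zero on constant columns
    return stats

def standardize(rows, stats):
    """Rescale every column to mean 0 and variance 1 using the given stats."""
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_1nn(train_x, train_y, point):
    """Return the label of the single closest training point."""
    distances = [euclidean(point, x) for x in train_x]
    return train_y[distances.index(min(distances))]

# Hypothetical toy data: the first feature separates the classes,
# the second feature is noise measured in much larger units.
train_x = [[1.0, 100.0], [1.2, 900.0], [3.0, 150.0], [3.2, 850.0]]
train_y = ["a", "a", "b", "b"]
query = [1.1, 860.0]

# Without scaling, the large-unit noise feature dominates the distance.
raw_prediction = predict_1nn(train_x, train_y, query)

# With scaling, the statistics come from the training data only and are
# then applied to the query point as well.
stats = column_stats(train_x)
scaled_prediction = predict_1nn(
    standardize(train_x, stats), train_y, standardize([query], stats)[0]
)

print(raw_prediction, scaled_prediction)
```

On this toy data the unscaled lookup picks "b" and the scaled lookup picks "a", which is the kind of difference that can separate a tool that scales for you from one that does not.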

Decision Tree

The following table shows the results I got:


Depth   R training accuracy   R test accuracy   Python training accuracy   Python test accuracy
1       .7769                 .7179             .7769                      .7179
2       .80                   .7051             .8077                      .7051
3       .8615                 .6538             .8923                      .6667
4       .8846                 .6923             .9385                      .7179
5       .8846                 .6923             .9846                      .7436
6       N/A                   N/A               1.0                        .7308


Note that the test accuracy is the same in R and Python for a max depth of 1 and 2, and the training accuracy is the same or nearly so. As the max depth increases, it looks like scikit-learn gives the better results. However, the test accuracy stays fairly flat for both models while the Python model's training accuracy increases to 1.0. It certainly looks like max depth 4 and 5 in Python have overfit the data. It would be nice to compare a picture of the two trees. The tree in R is quite easy to generate. Python requires some graphics modules that are fairly involved to use. At least, they were for me. I couldn't get either one to work. The R model won't fit max depth 6 because of overfitting issues.
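One rough way to put a number on the overfitting is the gap between training and test accuracy at each depth. The sketch below just does that subtraction on the accuracies reported in this post; the dictionary and function names are my own, and a gap is only a quick indicator, not a formal test.

```python
# (training accuracy, test accuracy) by max depth, copied from the results above.
r_tree = {
    1: (0.7769, 0.7179),
    2: (0.80, 0.7051),
    3: (0.8615, 0.6538),
    4: (0.8846, 0.6923),
    5: (0.8846, 0.6923),
}
python_tree = {
    1: (0.7769, 0.7179),
    2: (0.8077, 0.7051),
    3: (0.8923, 0.6667),
    4: (0.9385, 0.7179),
    5: (0.9846, 0.7436),
    6: (1.0, 0.7308),
}

def overfit_gap(results):
    """Return training accuracy minus test accuracy for each depth."""
    return {depth: round(train - test, 4) for depth, (train, test) in results.items()}

r_gap = overfit_gap(r_tree)
python_gap = overfit_gap(python_tree)

for depth in sorted(python_gap):
    # R has no depth-6 model, so fall back to "N/A" there.
    print(depth, r_gap.get(depth, "N/A"), python_gap[depth])
```

The Python gap grows from about .06 at depth 1 to about .27 at depth 6, while the test accuracy moves very little, which is what I mean by the tree fitting the training data rather than the problem.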

Support Vector Machines

The first thing I did was run a default support vector machine in R and Python. Both programs use an rbf kernel as default.

R scales the data and uses cost=1 and gamma=1/number of features as default values. The untuned SVM gives a misclassification error of 1.5% for the training data and about 13% for the test data.

Python doesn't scale the data and neither did I. (Maybe this is not a fair comparison, but it is an extra step in scikit-learn that isn't required in R.) The Python default values are C=1 (cost=1) and gamma=0. This untuned SVM gives a misclassification error of about 30% for the training data and about 36% for the test data.
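For reference, the rbf kernel is exp(-gamma * squared distance between two points), so the size of the features and the value of gamma work together. If I read the scikit-learn documentation of this era correctly, gamma=0 is a placeholder that means 1/number of features, which would make the two gamma defaults the same and leave scaling as the main difference. The sketch below is plain Python with made-up points (not the sonar data) and a hypothetical rbf_kernel helper; it only shows how much the kernel value changes when the same points are expressed in larger units.

```python
import math

def rbf_kernel(u, v, gamma):
    """Return exp(-gamma * squared Euclidean distance between u and v)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

n_features = 60            # the sonar data has 60 features
gamma = 1.0 / n_features   # the default described above

# Two made-up points that differ by 0.5 in every feature.
u = [0.0] * n_features
v = [0.5] * n_features
similarity_small_units = rbf_kernel(u, v, gamma)

# The same two points with every feature expressed in units 10 times larger.
u_big = [10 * a for a in u]
v_big = [10 * b for b in v]
similarity_large_units = rbf_kernel(u_big, v_big, gamma)

print(similarity_small_units, similarity_large_units)
```

With the small units the two points look fairly similar to the kernel; with the large units the similarity is practically zero, so a fixed gamma can behave very differently on scaled and unscaled data.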

I've already talked in a previous post about how the Python grid search crashes my computer. R has a procedure for tuning the svm, but it produces an error when I try to run it.

In addition to the questions I have about how scikit-learn models fit the data, there is the additional problem of categorical data. R usually recognizes categorical data. If it doesn't, you can set a variable to be categorical and R will know how to handle it. Python requires you to transform your own categorical data, and it is a kludgy process. There is a module called OneHotEncoder, but you can't run it until you have transformed all of your text data to numeric.
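The two steps look roughly like this when written out by hand in plain Python: first map each text value to an integer code, then turn each code into a row of 0/1 indicator columns. The color column and both helper names are made up for illustration; this is a sketch of the idea, not the OneHotEncoder implementation.

```python
def fit_categories(values):
    """Map each distinct text value to an integer code (sorted so the codes are stable)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

def one_hot(values, codes):
    """Turn each value into a row with a single 1 in the column for its code."""
    rows = []
    for value in values:
        row = [0] * len(codes)
        row[codes[value]] = 1
        rows.append(row)
    return rows

colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
codes = fit_categories(colors)              # step 1: text to integer codes
encoded = one_hot(colors, codes)            # step 2: integer codes to indicator columns

print(codes)
print(encoded)
```

Each category ends up with its own column, so the model never treats "red" as being numerically larger than "blue".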

I still have a lot to learn about machine learning in R. But from what I've seen so far, I think I'll stick to R when I want to run a machine learning algorithm.
