
Learning Resources For Big Data: Part 2

This is the second in a series of blog posts on learning materials centered around Big Data. Our first post dealt with some mathematical tools related to Machine Learning. This second post focuses on the essential tools in the data scientist’s toolbox: the R programming language, Matlab, Python, SAS and Julia.

So you have learned about regression. (If not, click here.) You’ve also learned about various classification schemes. You are familiar with the idea of clustering. You understand the concepts of supervised and unsupervised learning.

But it seems awfully hard to do all the necessary computations by hand. Or even hard code them! After all, do you really want to code the SVM algorithm by yourself? (It might be fun at first, but it’s a bit like reinventing the wheel.)

 

 

To that end, there is a whole suite of ready-made tools available:  the R programming language, Matlab, Python, SAS and Julia. We will briefly discuss the pros and cons of each.

 

The R programming language

 

R. You might have heard of it. It’s all over the news. And for good reason too. It’s easy to use. It is also one of the most popular languages among data miners, scientists and the like. It has a vast community of developers, and best of all, it’s open source!

Here are some of the pros and cons:

 

Pros

  • Cross-platform

  • Easy to install

  • Easy to integrate with other languages

  • Large developer community (Click here to enter one such community. Don’t worry. The inhabitants don’t bite. Not too hard anyway. 😉 )

  • Vast and rapidly growing number of packages (found here)

 

Cons

  • Slow, especially when looping. This is because R is an interpreted language; vectorized operations are usually much faster than explicit loops

  • Indices start from 1. (Sacrilege!) As you know, in many popular languages (C, Java and Python, for example), indices start from 0. This can lead to subtle off-by-one bugs that go unnoticed

  • Poor memory performance. Because R holds the data it works on in memory, it struggles with data sets larger than the available RAM. It is best suited to ad hoc analysis
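
To make the indexing point concrete, here is a minimal sketch in Python (a 0-based language) of the kind of off-by-one slip that can creep in when code is ported from a 1-based language such as R. The helper name and the sample data are hypothetical, chosen only for illustration.

```python
# Illustrative only: the helper name and the data are hypothetical.
scores = [10, 20, 30, 40]

def nth_item_one_based(items, n):
    """Return the n-th item counting from 1, the way R counts."""
    if not 1 <= n <= len(items):
        raise IndexError("position out of range")
    return items[n - 1]  # shift to Python's 0-based index

# A naive port that forgets the shift raises no error;
# it simply returns the wrong element.
naive_second = scores[2]                        # actually the third element
correct_second = nth_item_one_based(scores, 2)  # the second element

print(naive_second, correct_second)
```

Nothing fails loudly in the naive version, which is why this class of bug tends to go unnoticed.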

 

Matlab

 

 

 

Matlab, short for matrix laboratory, is a high-level technical computing language. It is widely used by academic and research institutions. Almost every university mathematics student knows how to use Matlab. Its basic data element? A matrix. (No! Not that kind of matrix, but this kind.)

 

Pros

  • Since the basic data structure is a matrix, mathematical operations such as taking cross products, dot products etc. are built-in

  • Easy to use

  • Interactive graphical tools

  • Excellent documentation available

  • Functionality can be expanded with the use of toolboxes

  • Easy to interface with other languages
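
For readers who have not met the dot and cross products mentioned above, here is a minimal sketch of what they compute. It is written by hand in plain Python purely for illustration; in Matlab you would call the built-in functions instead. The function and variable names below are our own.

```python
# A hand-written sketch for illustration; names are our own.
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    """Cross product of two 3-dimensional vectors."""
    if len(u) != 3 or len(v) != 3:
        raise ValueError("cross product needs 3-dimensional vectors")
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]

x_axis = [1, 0, 0]
y_axis = [0, 1, 0]
print(dot(x_axis, y_axis))    # 0, since the axes are perpendicular
print(cross(x_axis, y_axis))  # [0, 0, 1], the z axis
```

Having such operations available out of the box is exactly the convenience a matrix-based language offers.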

 

Cons

  • Proprietary, and so extremely pricey. (Pro tip: If you want much of the same functionality without having to sacrifice your firstborn, you can choose Octave.)

  • Execution speed is slow since it is an interpreted language

  • Object-oriented features are less central to the language than in, say, Python or Java

 

 

Python

 

 

Python is named after the British comedy group Monty Python (and not the reptile, contrary to popular belief). Its popularity is growing rapidly. It may even overtake R as the lingua franca of data science. Here are some of the pros and cons.

 

Pros

  • Easy on the eye. Python is easy to use and read. Writing and debugging Python code is much easier than in a language like, say, Java or C++

  • Well suited to data cleaning (data munging, if you prefer 🙂 ), although it is a general-purpose language

  • Open source

  • Has a large developer community

  • Excellent libraries for scientific computation

  • Easy to integrate with other languages

 

Cons

  • Does not have as many statistical packages as R or SAS. But this gap is closing rapidly!

  • Some bugs are hard to spot since variables are not declared: a misspelled name or a wrong type shows up only at run time, if at all
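
The point about undeclared variables can be sketched as follows. This is a contrived example with hypothetical function names: a single typo creates a brand-new variable instead of updating the intended one, and Python raises no error.

```python
# Contrived example; the function names are hypothetical.
def total_with_typo(prices):
    total = 0
    for price in prices:
        # Typo: "totl" is silently created as a new variable,
        # so "total" is never updated.
        totl = total + price
    return total

def total_fixed(prices):
    total = 0
    for price in prices:
        total = total + price
    return total

prices = [5, 10, 15]
print(total_with_typo(prices))  # 0, with no error raised
print(total_fixed(prices))      # 30
```

In a language that requires declarations, the compiler would normally reject the misspelled name before the program ever ran.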

 

SAS

 

 

SAS, short for Statistical Analysis System, seems to have been around since the time of the dinosaurs (its origins date back to 1966!). It is used primarily for statistical analysis. It runs on various platforms, including Windows, Unix and mainframe computers.

 

Pros

  • Tradition. People are used to SAS. Since it has been around forever, there is a large community of SAS programmers

  • Can handle large amounts of data

  • Excellent documentation

  • Very good for database management

 

Cons

  • Expensive. SAS is proprietary and costs an arm and a leg.

  • Old-fashioned in its structure

  • Doesn’t have as many fancy graphics capabilities as a language like, say, Python

 

Julia

 

 

The youngest of the languages described in this post, Julia is fast growing in popularity. It has already become a staple in the data scientist’s diet, as it were.  It has the potential to become the next generation language for data analysis.

 

Pros

  • Speed. If you have a need for speed, you will love Julia! It is much faster in its execution time than either Python or R

  • Open source

  • Concise syntax

  • Great developer community

 

Cons

  • It is a “young language” and so does not YET enjoy the support that older, more established languages have

  • Plotting tools leave much to be desired

  • Relatively difficult to install

 

Of course, there are more tools for the data scientist than those listed above (e.g. Octave, Mathematica, etc.) But remember, in the end a tool is just a tool; you must understand the concepts deeply to make full sense of the outputs!

 

If you are completely new to this field of data analytics, we suggest you use an open source tool like R and familiarize yourself with it completely. In the next tutorial, we will help you do just that! We will show you how to download and install RStudio (an IDE), and teach you the basics of the R programming language. We would also like to thank Dawar Dedmari, who took time out from his busy schedule to give us valuable input.

 


About Author

Bharat Ramakrishna

Blogger. Part-time mathematics enthusiast. Loves esoteric and quirky things. Bibliophile. Has a wide range of interests including playing chess and pool, juggling and creating puzzles of fiendish difficulty. Grammar Nazi.