1. The Battle of Tools for Data Science
We are living in an era in which computing has moved from mainframes to personal computers to the cloud. Along the way, we started generating immense amounts of data. At the same time, the manifold increase in computing power has advanced the application of algorithms that can extract insights from the huge amounts of data being generated.
The future of decision making will greatly rely on data, and no industry will remain untouched by this development. Data, however, has its own set of issues and challenges; for the data available to be meaningful and concise, one needs to organize it efficiently.
Though technology plays an important role in developing a working solution, the foundation of a robust analytical solution relies heavily on clarity in the fundamental concepts of data science, as well as on an understanding of business and domain-related issues.
According to Dinesh Kumar (2017), an understanding of statistical learning, machine learning and artificial intelligence is important for successful analytics applications. Many spend too much time on technologies such as R, Python, Hadoop and so on. The technologies are important, but one cannot become a successful data scientist without conceptual understanding. Technology is an evolving field and has eased the lives of many data scientists as they try to convert conceptual knowledge into a working solution. Without undermining the role of technology, we all know that there are several licensed and open source tools for data science. People working in the field will have heard of tools such as SPSS, Stata, Python, SAS, R, RapidMiner, KNIME and Minitab, even if they have not used all of them.
When so many tools in the market compete to prove their usefulness in data science, an enduring question arises: “Which is the best tool for data science?” Of late, we have seen a surge in the use of open source tools such as Python and R. It may not be an overstatement to say that most professionals working in the field of data science use either Python or R.
However, the question still remains: “Which is best, Python or R?” We come across the following questions from students, corporate professionals and data science enthusiasts:
- “I want to make a career in data science. Which tool should I focus on?”
- “Which tool should I use for data analysis, Python or R?”
- “I have heard that Python is a sought-after language for building a career in data science. Many companies look for professionals with Python skills. Is it true?”
- “Can you suggest, if I should learn Python or R for doing data analysis?”
While there may be no single right answer to questions like those above, users of Python and R each vouch for the supremacy of one over the other. The debate becomes interesting if one looks at the TIOBE index, which measures the popularity of programming languages.
2. Comparing Python and R
2.1 Technological Proficiency
The TIOBE Programming Community index is a measure of the popularity of programming languages, created and maintained by the TIOBE Company, based in Eindhoven, the Netherlands. TIOBE stands for “The Importance of Being Earnest”, taken from the name of a comedy play written by Oscar Wilde at the end of the nineteenth century. The index is calculated from the number of search engine results for queries containing the name of the language.
2.1.1 TIOBE Index – August 2018
In the August 2018 index, Python sits in 4th position, compared with 18th position for R.
The popularity of MATLAB as a tool for numerical analysis has picked up over the years and continues to increase. However, MATLAB is used in many other fields, which contributes to its popularity. We will restrict the scope of comparison to Python and R. Let us look at the popularity of Python and R over the years as well.
The TIOBE Index and the trend from 2001 onwards may make the Python community ecstatic. At the same time, it may feel like doomsday for R followers. However, let us read the TIOBE Index with a more pragmatic approach.
- According to the site, TIOBE index is “not about the best programming language or the language in which most lines of code have been written”.
- The index covers searches in Google, Google Blogs, MSN, Yahoo!, Baidu, Wikipedia and YouTube. The index is updated once a month.
- The site does claim that the number of web pages may reflect the number of skilled engineers, courses and jobs worldwide.
The bullet points above allow some useful deductions. Let us keep these statements in perspective to understand the background of Python and R.
2.1.2 Python Language
As per Wikipedia, Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
2.1.3 R Language
As per Wikipedia, R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. S was created by John Chambers in 1976, while at Bell Labs. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.
2.1.4 Comparison based on technological proficiency
Python is popular as a language for varied reasons. It is a developer’s language for general-purpose programming. It started as a successor of Perl for writing build scripts and all kinds of glue software, but gradually it entered other domains as well. Nowadays it is quite common to have Python running in large embedded systems. So it is very likely that Python will enter the top 3 and might even become the new number 1 programming language in the long run.
R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
Both Python and R evolved during the 1990s. Python has its roots in the Netherlands, while R finds its genesis in New Zealand. Apart from having origins in different geographies, the background and the purpose for which the two evolved seem to be completely different as well.
Python is a more robust, general-purpose programming language in which data analysis is a drop in the ocean compared with the bigger things Python has been envisioned to perform. R, on the other hand, seems to be a language built for statisticians, providing them with an environment for statistical computing and visualization.
If, as TIOBE claims, popularity is reflected by the number of search engine hits, Python will almost certainly rank well above R, owing to its many applications. The claim that search results also reflect the number of engineers or jobs for Python seems correct for the same reason. This also seems to be the reason why Python has broader community support (in general) and why several useful packages are available for data science applications. In all, more than 100,000 packages are available in Python. Undoubtedly, Python is ubiquitous and has climbed the ladder to be amongst the most popular languages in a very short time.
But is this a worrying trend for R users? After all, a scarcity of R engineers may also suggest that R will be treated as a niche technology for statistical computing in corporate and academic institutions alike. As far as community support and the addition of new and useful packages are concerned, R may be lagging, but not far behind. R has more than 12,000 packages hosted on CRAN, which is not a small number given that it is software for statistical computing. It would be interesting to know the actual number of Python packages for data science alone. The availability of Microsoft R Open, the enhanced distribution of R from Microsoft Corporation, is a shot in the arm and suggests that corporate giants like Microsoft intend to grow the R user base.
Keeping aside popularity in terms of the TIOBE index, community support and so on, can we compare the usage of Python and R from the perspective of data science applications?
2.2 Application Proficiency
We decided to take up a comparative study of the two prominent tools for data analysis, i.e. Python and R. While comparing the tools to answer “which is better, Python or R”, we will broaden the scope of the question to ask which tool is better for performing:
- Statistical Learning
- Machine Learning
- Deep Learning
- Text Analytics
- Model deployment
Though data pre-processing and data visualization are important aspects of data analysis, we will compare Python and R on those aspects at a later point. The focus of this article is to compare the two from the statistical learning aspect. In the series of articles to follow, we will compare them from the machine learning, deep learning and other aspects as well.
2.2.1 Statistical Learning
The advancement in infrastructure and technological computing has led to application of models in solving varied use cases of business importance. However, apart from achieving the highest possible accuracy in prediction, which machine learning models are tuned to achieve, it may be desirable for certain businesses to draw inference from the model and from the underlying data used for developing the models. There are many models which can be used for inference based modelling. The simplest and most commonly used models in this area are multiple linear regression and logistic regression.
Let us take up multiple linear regression as a technique to understand the sold price of a player in the Indian Premier League (IPL), a popular Twenty20 (T20) cricket tournament.
The performance of the players, which may finally decide the sold price of the player, could be measured through several metrics. Notably, although the IPL followed the Twenty20 format of the game, it was possible that the performance of the players in the other formats of the game such as Test and One-Day matches could influence player pricing. A few players had excellent records in Test matches, but their records in Twenty20 matches were not very impressive.
The objectives of modelling with multiple linear regression could be to:
- Estimate the average sold price (dependent variable) of a player given the performance metrics (independent variables) of the player.
- Identify the performance metrics that are statistically significant in estimating the average sold price.
There are many issues in model building, and a certain set of assumptions needs to be satisfied before a valid model can be attained. One of the issues that needs to be sorted out is variable selection: in other words, which variables will finally be retained by the model that is built. There are many strategies which can be used for variable selection:
- Forward selection
- Backward elimination
- Forward selection and backward elimination
The above strategies rely on performing a partial F test, and only the variables that are statistically significant are retained in the final model.
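A partial F test compares a reduced model with a fuller model that adds one or more variables. The sketch below illustrates the idea with statsmodels on simulated data; the column names (runs, wickets, sold_price) and the coefficients used to simulate them are hypothetical and are not taken from the IPL dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for player data; all columns are hypothetical.
rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "runs": rng.normal(size=n),
    "wickets": rng.normal(size=n),
})
df["sold_price"] = (
    5.0 + 1.5 * df["runs"] + 0.8 * df["wickets"]
    + rng.normal(scale=1.0, size=n)
)

# Nested models: the full model adds 'wickets' to the reduced model.
reduced = smf.ols("sold_price ~ runs", data=df).fit()
full = smf.ols("sold_price ~ runs + wickets", data=df).fit()

# Partial F test: does adding 'wickets' significantly reduce the residual error?
f_value, p_value, df_diff = full.compare_f_test(reduced)

# The same F statistic computed by hand from the residual sums of squares.
f_manual = ((reduced.ssr - full.ssr) / df_diff) / (full.ssr / full.df_resid)

print(f"F = {f_value:.2f}, p-value = {p_value:.4g}, extra terms = {int(df_diff)}")
```

A small p-value suggests keeping the added variable; forward selection and backward elimination repeat this comparison one variable at a time.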
The other strategy which can be used for variable selection is to use a metric such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
Statistical Learning – Python
Let us look at some of the popular packages which are available in Python for building a statistical learning model:
- statsmodels.api, statsmodels.formula.api
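For a model with k estimated coefficients, n observations and maximised log-likelihood ln L, the usual definitions are AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, and lower values are preferred. The sketch below computes both for an ordinary least squares fit with NumPy, assuming Gaussian errors and counting only the regression coefficients in k (conventions differ on whether the error variance is counted). The data are simulated and the variable names are hypothetical.

```python
import numpy as np

def gaussian_ols_ic(X, y):
    """Fit OLS by least squares and return (aic, bic) under Gaussian errors.

    X must already contain a column of ones for the intercept.
    k counts the regression coefficients only (an assumption of this sketch).
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    # Maximised Gaussian log-likelihood, using sigma^2 = RSS / n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

# Simulated data: 'runs' drives the price, 'noise_col' does not.
rng = np.random.default_rng(0)
n = 200
runs = rng.normal(size=n)
noise_col = rng.normal(size=n)
price = 3.0 + 2.0 * runs + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
aic_null, bic_null = gaussian_ols_ic(np.column_stack([ones]), price)
aic_small, bic_small = gaussian_ols_ic(np.column_stack([ones, runs]), price)
aic_big, bic_big = gaussian_ols_ic(np.column_stack([ones, runs, noise_col]), price)

print("intercept only   :", aic_null, bic_null)
print("runs             :", aic_small, bic_small)
print("runs + noise_col :", aic_big, bic_big)
```

Because ln(n) exceeds 2 once n is larger than about 7, BIC penalises each extra variable more heavily than AIC and tends to select smaller models.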
We have not mentioned sklearn (scikit-learn), as it is intended primarily for building machine learning models rather than statistical learning models.
Statistical Learning – R
Some of the popular packages for building statistical learning models in R:
- stats, MASS
- caret, mixlm
caret is one of the useful wrapper packages in R; it gives the flexibility to build statistical as well as machine learning models.
2.2.2 Comparison based on Statistical learning
We are not demonstrating model building using R or Python. However, if one needs to build a full model using all the performance metrics (variables) for understanding the sold price of a player in IPL as well as to understand the statistically significant variable, the packages mentioned above in Python and R can be used to develop such a model.
In Python, using statsmodels.formula.api (imported as smf):
- regressor_OLS = smf.ols(formula='Y_variable ~ X_variable', data=df).fit()
In R, using stats:
- regressor_OLS = lm(formula='Y_variable ~ X_variable', data=df)
The only issue with this approach is that the final model will contain both significant and insignificant variables.
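To make this concrete, the following sketch runs the smf.ols call on simulated data and reads the p-values of the fitted model. The column names and coefficients are hypothetical; the age column is simulated with no effect on price, so it stands in for an insignificant variable that the full model still retains.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated player data; every column here is a hypothetical stand-in.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "runs": rng.normal(size=n),
    "strike_rate": rng.normal(size=n),
    "age": rng.normal(size=n),  # no effect on price in this simulation
})
df["sold_price"] = (
    10.0 + 3.0 * df["runs"] + 2.0 * df["strike_rate"]
    + rng.normal(scale=1.0, size=n)
)

# Full model: every candidate variable enters, significant or not.
regressor_OLS = smf.ols(
    formula="sold_price ~ runs + strike_rate + age", data=df
).fit()
print(regressor_OLS.summary())

# Variables whose coefficient p-value is below the 5% level.
p_values = regressor_OLS.pvalues.drop("Intercept")
significant = sorted(p_values[p_values < 0.05].index)
print("Significant at 5%:", significant)
```

All three variables remain in regressor_OLS regardless of their p-values, which is why a separate variable selection step is needed.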
Are there packages that can apply the variable selection strategies discussed in the section above? In Python, there seems to be no built-in way to apply the partial F test, AIC or BIC to build a stepwise model. The only way to achieve this is to write a custom function that removes statistically insignificant variables. This holds true for multiple linear regression as well as for logistic regression.
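One possible shape for such a custom function is sketched below: backward elimination that keeps dropping a variable as long as doing so lowers the AIC. The function name backward_eliminate_aic and the simulated columns are hypothetical, and this is only an illustration of the idea, not a port of any existing library routine.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate_aic(data, response, predictors):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    selected = list(predictors)

    def fit(terms):
        rhs = " + ".join(terms) if terms else "1"
        return smf.ols(f"{response} ~ {rhs}", data=data).fit()

    best = fit(selected)
    while selected:
        # Refit the model with each remaining predictor left out in turn.
        candidates = [
            (fit([p for p in selected if p != drop]), drop) for drop in selected
        ]
        challenger, dropped = min(candidates, key=lambda pair: pair[0].aic)
        if challenger.aic >= best.aic:
            break  # no single removal improves the AIC
        best = challenger
        selected.remove(dropped)
    return best, selected

# Simulated data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame(
    rng.normal(size=(n, 4)), columns=["runs", "strike_rate", "age", "matches"]
)
df["sold_price"] = (
    10.0 + 3.0 * df["runs"] + 2.0 * df["strike_rate"]
    + rng.normal(scale=1.0, size=n)
)

full_model = smf.ols("sold_price ~ runs + strike_rate + age + matches", data=df).fit()
final_model, kept = backward_eliminate_aic(
    df, "sold_price", ["runs", "strike_rate", "age", "matches"]
)
print("Kept variables:", kept)
print("AIC full vs. final:", full_model.aic, final_model.aic)
```

Swapping .aic for .bic in the function gives a BIC-based variant, and replacing smf.ols with smf.logit would be one way to adapt the same loop to logistic regression.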
In R, however, there are built-in functions that implement AIC as a criterion for variable selection: step(), part of the stats package, and stepAIC(), part of the MASS package:
- step(object, scope, scale = 0, direction = c("both", "backward", "forward"), trace = 1, …)
- stepAIC(object, scope, scale = 0, direction = c("both", "backward", "forward"), trace = 1, …, k = 2, …)
If one wants to use the partial F test for feature selection, the mixlm package provides the forward(), backward(), stepWise() and stepWiseBack() functions:
- forward(model, alpha = 0.2, full = FALSE, force.in)
- backward(model, alpha = 0.2, full = FALSE, hierarchy = TRUE, force.in)
- stepWise(model, alpha.enter = 0.15, alpha.remove = 0.15, full = FALSE)
- stepWiseBack(model, alpha.remove = 0.15, alpha.enter = 0.15, full = FALSE)
Each of the above functions expects a model object to be passed on which the variable selection strategy can be applied.
Based on the objective set out for the IPL regression model, one can infer that R provides various model selection strategies out of the box, whereas achieving a similar outcome in Python may require writing a custom function.
We may not have provided an exhaustive list of packages which can help achieve this objective in R or in Python. However, R seems to have an edge as far as implementing the statistical concepts and building an inferential model is concerned.
Deployment of the data science solution is necessary for the business, but let the worry of picking the right deployment framework be handled by the architects and the development team.
Conceptual understanding is more relevant in the field of data science compared to picking a tool for implementing the solution. In the next article, we will take up a detailed comparison of building machine learning models in Python and R using a specific dataset.