Best practices of orchestrating Python and R code in ML projects

numizmat (29) in #python • 7 years ago

Today, data scientists are generally divided between two languages: some prefer R, some prefer Python. I will not try to explain in this article which one is better. Instead, I will try to answer a question: “What is the best way to integrate both languages in one data science project? What are the best practices?” Besides git and shell scripting, additional tools have been developed to facilitate the development of predictive models in multi-language environments. For fast data exchange between R and Python, let’s use the binary data file format Feather. Another language-agnostic tool, DVC, can make the research reproducible, so let’s use DVC to orchestrate R and Python code instead of regular shell scripts.

Machine learning with R and Python

Both R and Python have powerful libraries and packages for predictive modeling. Algorithms for classification or regression are usually implemented in both languages; some data scientists use R while others prefer Python. In the example explained in the previous tutorial, the target variable was a binary output and logistic regression was used as the training algorithm. Another algorithm that could be used for prediction is the popular Random Forest, which is implemented in both programming languages. For performance reasons, it was decided that the Random Forest classifier should be implemented in Python (it shows better performance than the random forest package in R).

R example used for DVC demo

The dependency graph of this data science project looks like this, with R (marked blue) and Python (marked pink) jobs in one project:

Feather API

The Feather API is designed to improve data and metadata interchange between R and Python. It provides fast import and export of data frames between both environments and preserves metadata, which is an improvement over data exchange via csv/txt files. In our example, the Python job reads an input binary file that was produced in R with the Feather API.

Dependency graph with R and Python combined

The next question we ask ourselves is: why do we need DVC, why not just use shell scripting? DVC automatically derives the dependencies between the steps and builds the dependency graph (DAG) transparently to the user. The graph is used to reproduce only the parts of your pipeline that were affected by recent changes, so we don’t have to keep track of which steps need to be repeated after the latest changes.
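The idea behind this selective re-execution can be illustrated with a small Python sketch. This is not DVC's implementation (DVC detects changes by comparing file checksums); it is a simplified model under stated assumptions, in which each stage declares the files it reads and writes, and a change propagates downstream through the graph. All stage and file names are hypothetical, and graphlib requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage lists the files it reads (deps),
# including its own script, and the files it writes (outs).
STAGES = {
    "extract": {"deps": ["extract.R", "raw.csv"], "outs": ["clean.feather"]},
    "featurize": {"deps": ["featurize.R", "clean.feather"], "outs": ["features.feather"]},
    "train": {"deps": ["train.py", "features.feather"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["evaluate.py", "model.pkl"], "outs": ["metrics.txt"]},
    "report": {"deps": ["report.R", "clean.feather"], "outs": ["report.html"]},
}


def stages_to_rerun(stages, changed_files):
    """Return the stages affected by changed_files, in a valid execution order."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # Derive the DAG: stage -> set of upstream stages it depends on.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    dirty = set(changed_files)
    rerun = []
    # Visit stages upstream-first; a stage is stale if any of its inputs is dirty,
    # and a stale stage makes all of its outputs dirty as well.
    for name in TopologicalSorter(graph).static_order():
        if dirty & set(stages[name]["deps"]):
            rerun.append(name)
            dirty.update(stages[name]["outs"])
    return rerun


# Editing the Python training script only invalidates the Python jobs.
print(stages_to_rerun(STAGES, ["train.py"]))
```

With a plain shell script, the same decision (skip the R stages, repeat only training and evaluation) would have to be made by hand on every change.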

Re-executed jobs are marked in red:

Summary

Data science projects often combine R and Python programming. Besides git and shell scripting, additional tools have been developed to facilitate the development of predictive models in multi-language environments. Using a data version control system for reproducibility and Feather for data interoperability helps you orchestrate R and Python code in a single environment.

Full article (source): Best practices of orchestrating Python and R code in ML projects (based on R code and reproducible model development with DVC tutorial)
