Data Version Control Tutorial – Best Practices for Machine Learning Projects Reproducibility

in #machinelearning · 6 years ago

The data science community still lacks good practices for organizing projects and collaborating effectively. ML algorithms and methods are no longer simple “tribal knowledge”, but they are still difficult to implement, manage, and reuse.

One of the biggest challenges in reusing, and hence managing, ML projects is reproducibility.
To address reproducibility, we have built Data Version Control, or DVC.
This example shows how to solve a text classification problem using the DVC tool.

Git branches should beautifully reflect the non-linear structure common to the ML process, where each hypothesis can be presented as a Git branch. However, the inability to store data in a repository and the discrepancy between code and data make it extremely difficult to manage a data science project with Git alone.
DVC brings large data files and binary models into a single Git-based workflow without requiring you to store the binary files in your Git repository.
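
To make the idea concrete, here is a minimal Python sketch of how large files can stay out of Git while remaining tied to a commit: the data is copied into a local cache keyed by its content hash, and only a small pointer file is left for Git to track. This is an illustration of the general technique, not DVC's actual implementation or file format. The function names (`store`, `restore`), the `.ptr` suffix, the JSON layout, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def store(data_path: Path, cache_dir: Path) -> Path:
    """Copy a data file into the cache and write a small pointer file.

    The pointer file (hypothetical '.ptr' format) is what would be
    committed to Git; the data file itself would be ignored by Git.
    """
    digest = file_digest(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_path, cached)
    pointer = data_path.with_name(data_path.name + ".ptr")
    meta = {"sha256": digest, "size": data_path.stat().st_size}
    pointer.write_text(json.dumps(meta))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(pointer.name[: -len(".ptr")])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target
```

With a scheme like this, switching Git branches changes only the pointer files, and a `restore` step brings the matching data back into the workspace, so each branch (hypothesis) sees the data it was built with.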

Full article: Data Version Control Tutorial

dvc-diagram.jpeg

  1. Preparation
    1.1. What are we going to do?
    1.2. Getting the sample code
    1.3. Install DVC
    1.4. Initialize

  2. Define ML pipeline
    2.1. Get data file
    2.2. Data file internals
    2.3. Running commands
    2.4. Running in bulk

  3. Reproducibility
    3.1. How does reproducibility work?
    3.2. Adding bigrams
    3.3. Checkout code and data files
    3.4. Tune the model
    3.5. Merge the model to master

  4. Sharing data
    4.1. Pushing data to cloud
    4.2. Pulling data from cloud

  5. DVC commands

dvc-files.jpeg

Summary:
Git branches beautifully reflect the non-linear structure of ML processes, where each hypothesis can be presented as a Git branch. DVC makes it possible to navigate through Git branches with both code and data, which makes the ML process more manageable and reproducible.

