Machine Learning on a Cancer Dataset - Part 11

cristi (70) in #machine-learning • 9 years ago

Today we begin learning about another machine learning algorithm in scikit-learn: the Decision Tree classifier.

I introduce and explain a few Decision Tree concepts. Then we create a classifier for the cancer dataset, train it, and evaluate its performance. As you will see, with the default parameters the tree overfits: its accuracy on the training subset is 100%, a sign that it has memorized the training data rather than learned patterns that generalize to unseen samples.
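For readers who want to try this before watching, here is a minimal sketch of training a tree with default parameters. The split settings (`stratify`, `random_state=42`) and the tree's `random_state=0` are my assumptions for reproducibility; the video may use different values, so the test accuracy you see can differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the cancer dataset that ships with scikit-learn.
cancer = load_breast_cancer()

# Hold out part of the data for evaluation.
# stratify and random_state are assumptions, chosen only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42
)

# Default parameters: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print(f"Accuracy on the training subset: {train_acc:.3f}")
print(f"Accuracy on the test subset:     {test_acc:.3f}")
```

The number to compare is the gap between the two scores: a perfect training score next to a lower test score is what overfitting looks like in practice.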

To avoid overfitting, we have to change the default parameters and retrain the classifier. See the video walkthrough below.
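One common way to rein in a decision tree is to cap its depth. The sketch below assumes `max_depth=4`, which is an illustrative value and not necessarily the one used in the video; the split settings are likewise assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()

# Same kind of split as before; stratify and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42
)

# max_depth=4 is an illustrative choice: it stops the tree from growing
# until every training sample is classified perfectly.
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
shallow_tree.fit(X_train, y_train)

shallow_train_acc = shallow_tree.score(X_train, y_train)
shallow_test_acc = shallow_tree.score(X_test, y_test)

print(f"Tree depth:                      {shallow_tree.get_depth()}")
print(f"Accuracy on the training subset: {shallow_train_acc:.3f}")
print(f"Accuracy on the test subset:     {shallow_test_acc:.3f}")
```

Other parameters, such as `min_samples_leaf` or `max_leaf_nodes`, limit the tree in a similar way; which one works best depends on the data, so it is worth trying a few values and comparing the test scores.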

As a reminder:

In this series I'm exploring the cancer dataset that comes pre-loaded with scikit-learn. The dataset consists of labeled data: 569 tumor samples, each labeled malignant or benign. The purpose is to train classifiers on it and then use them on new, unlabeled data.
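If you have not followed the earlier parts, this short sketch shows how to load the dataset and check its size and labels. It uses only `load_breast_cancer` and NumPy; the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Rows are tumor samples, columns are measured features.
n_samples, n_features = cancer.data.shape
print(f"Samples: {n_samples}, features: {n_features}")

# target holds 0 or 1; target_names maps those codes to class labels.
class_names = list(cancer.target_names)
class_counts = np.bincount(cancer.target)

for name, count in zip(class_names, class_counts):
    print(f"{name}: {count} samples")
```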

Previous videos in this series: