Machine Learning with Scikit-Learn - [Part 44]

in #machine-learning, 7 years ago

In this tutorial we're going to discuss and code another method of automatic feature selection in scikit-learn: model-based selection.

According to the textbook we are following, model-based selection uses a supervised model to compute the importance of each feature and then keeps only the most important ones.

Since the selection depends on a measure of feature importance, the model we use has to provide one. In scikit-learn, two kinds of models that do this are Decision Trees and ensembles of trees, like Random Forests, which expose it through the feature_importances_ attribute.
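As a minimal sketch of what "computing the importance of each feature" looks like, the snippet below fits a Random Forest and reads its feature_importances_ attribute. The breast cancer dataset and the n_estimators and random_state values are assumptions chosen for illustration; the post itself does not name them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumed example data: the breast cancer dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Assumed settings, for illustration only.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# One non-negative importance score per input feature.
importances = forest.feature_importances_
print("number of scores:", importances.shape[0])
print("sum of scores:", importances.sum())
```

Higher scores mean the trees relied more on that feature when splitting, and this is the ranking that model-based selection builds on.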

In this tutorial we're going to use a RandomForestClassifier for our model-based selection example. The scikit-learn class for model-based selection is SelectFromModel, and to set it up we provide:

  • the estimator used to determine the importance (in this case RandomForestClassifier)
  • parameters for that estimator (n_estimators, etc.), which are passed to the classifier itself
  • a threshold for making the selection (in this case 'median', which keeps the features whose importance is at least the median importance, roughly half of them)
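Put together, the three pieces listed above could be written as follows. This is only a sketch: the values n_estimators=100 and random_state=42 are assumptions for illustration, not settings confirmed by the post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# The estimator and its parameters (values assumed for illustration).
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Wrap the estimator in the selector and pick the threshold.
# "median" keeps features whose importance is >= the median importance.
select = SelectFromModel(forest, threshold="median")
print(select)
```

Nothing is trained at this point; the forest is only fitted when the selector's fit method is called on the data.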

Once we have it, we fit it on the training data and then use it to transform our training set. We then look at both the original training set and the training set after the selection has been applied. We will ultimately do some visualization and then train an algorithm on both sets to be able to compare their performance.
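The whole workflow can be sketched roughly as below. Several things here are assumptions, since the post does not spell them out: the breast cancer dataset padded with 50 random noise columns, the 50/50 train/test split, the seeds, and LogisticRegression as the algorithm trained on both versions of the data. The plot is replaced by printing how many noise columns survive the selection.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data: 30 real features plus 50 columns of pure Gaussian noise.
cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, cancer.target, random_state=0, test_size=0.5
)

# Fit the selector on the training data only, then transform both sets.
select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

print("original training set:", X_train.shape)
print("selected training set:", X_train_selected.shape)

# Boolean mask of kept features; columns 30 and up are the added noise.
mask = select.get_support()
print("noise columns kept:", mask[30:].sum())

# Train the same (assumed) classifier on both versions and compare.
full_model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
score_full = full_model.score(X_test, y_test)

selected_model = LogisticRegression(max_iter=10000).fit(X_train_selected, y_train)
score_selected = selected_model.score(X_test_selected, y_test)

print("test accuracy, all features:", score_full)
print("test accuracy, selected features:", score_selected)
```

The exact scores depend on the data, the split and the seeds, so treat the printed numbers as something to inspect rather than a guaranteed result.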

The algorithm trained on the data after selection performs better than the one trained on the original dataset. Please watch the full video for a comprehensive understanding of this.


To stay in touch with me, follow @cristi


Cristi Vlad, Self-Experimenter and Author

Noise is selected by the selection function and is preferred over the real features. Why is that so?

I just answered this question on the video. It seems that some noise features are given more importance than some of the original ones. Basically, some of the original features may be completely irrelevant to the training of the algorithm...

Maybe it is so, but the confusion still persists.

Good tutorial my friend, very simple to explain, thank you very much and greetings my brother, good content in Steemit!
