Keep in mind that I'm using...
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
data, target, test_size=0.2, random_state=42
)
- note that cross_validate() is optional here.
- alternatively, grid search may work better when there are many more features to consider (rough sketch below).
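- as a rough sketch, a grid search over a random forest could look like this (the parameter grid below is an arbitrary example, not tuned for this data):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# hypothetical grid, just to show the API; the values are assumptions
params = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), params, n_jobs=-1)
gs.fit(x_train, y_train)
gs.best_params_, gs.best_score_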
Ensemble Learning
- Combines several weak estimators into one strong estimator.
- tends to perform best on structured (tabular) data.
Examples of ensemble learning
- Voting: the same data is fed to several different models
- Bagging: random samples drawn with replacement, fed to copies of the same model
- Pasting: random samples drawn without replacement, fed to copies of the same model
- Boosting: models are trained one after another, each correcting the previous ones (gradual learning)
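- a minimal sketch of the first three in scikit-learn (the base estimators and n_estimators below are arbitrary choices, just to show the APIs):
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# Voting: the same data goes to different models
voting = VotingClassifier([('lr', LogisticRegression()), ('dt', DecisionTreeClassifier())])
# Bagging: random samples drawn with replacement, copies of the same model
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True)
# Pasting: random samples drawn without replacement, copies of the same model
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=False)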
sklearn.ensemble.RandomForestClassifier
- Prepare bootstrap samples.
- Randomly select samples from the dataset; picking the same sample again (repetition) is allowed.
- Each bootstrap sample has the same size as the original data.
- Train one decision tree on each bootstrap sample.
- At each node, RF randomly selects a subset of the features and chooses the best split among them.
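- what one bootstrap sample looks like, as a quick numpy sketch (assuming x_train and y_train are numpy arrays):
import numpy as np
rng = np.random.default_rng(42)
idx = rng.integers(0, len(x_train), size=len(x_train))  # indices may repeat (with replacement)
bootstrap_x, bootstrap_y = x_train[idx], y_train[idx]   # same size as the original data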
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(rf, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9973541965122431, 0.8905151032797809)
- By selecting features randomly, each feature gets a fairer chance to contribute than in a plain decision tree model.
rf.fit(x_train, y_train)
rf.feature_importances_
>> array([0.23167441, 0.50039841, 0.26792718])
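- to see that effect, compare with a single decision tree (sketch; the point is that a lone tree's importances are usually more skewed toward one feature):
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_train, y_train)
dt.feature_importances_  # typically more concentrated than rf.feature_importances_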
OOB
- out of bag: samples that were never selected into a bootstrap sample
- can be used for validation
rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)
rf.fit(x_train, y_train)
rf.oob_score_
>> 0.8934000384837406
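- conceptually, the OOB rows for one bootstrap sample can be found like this (numpy sketch, assuming numpy arrays; on average about 37% of rows end up out of bag):
import numpy as np
rng = np.random.default_rng(42)
idx = rng.integers(0, len(x_train), size=len(x_train))
oob_mask = ~np.isin(np.arange(len(x_train)), idx)    # True for rows never drawn
oob_x, oob_y = x_train[oob_mask], y_train[oob_mask]  # usable as a validation set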
sklearn.ensemble.ExtraTreesClassifier
- similar to Random Forest
- but each tree uses the entire dataset (no bootstrap samples)
- and each node is split randomly
- this is the same as DecisionTreeClassifier(...splitter='random'...)
- thanks to the random splits, ExtraTrees runs faster than RandomForest
- but it usually needs to train more trees than RandomForest to reach similar performance
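- the building block mentioned above, as code (sketch):
from sklearn.tree import DecisionTreeClassifier
random_tree = DecisionTreeClassifier(splitter='random', random_state=42)  # chooses among randomly drawn split thresholds instead of the best possible split
random_tree.fit(x_train, y_train)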
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(et, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9973541965122431, 0.8905151032797809)
et.fit(x_train, y_train)
et.feature_importances_
>> array([0.20183568, 0.52242907, 0.27573525])
sklearn.ensemble.GradientBoostingClassifier
- by default uses 100 trees with max_depth=3
- the shallow trees help gradient boosting avoid overfitting
- adds new trees one by one, each fit in a gradient-descent fashion (sketched below)
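- a conceptual sketch of that gradual learning for regression with squared loss (not GradientBoostingClassifier's actual internals, just the idea of fitting each new tree to the residuals and adding it with a small learning rate):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(x, y, n_trees=100, learning_rate=0.1, max_depth=3):
    pred = np.zeros(len(y), dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                      # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(x, residual)
        pred += learning_rate * tree.predict(x)  # take a small step toward the target
        trees.append(tree)
    return trees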
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(random_state=42)
scores = cross_validate(gb, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.8881086892152563, 0.8720430147331015)
- can be tuned further, e.g. more trees together with a higher learning rate.
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)
scores = cross_validate(gb, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
- but the feature importances tend to concentrate on fewer features than with a random forest.
gb.fit(x_train, y_train)
gb.feature_importances_
>> array([0.15872278, 0.68010884, 0.16116839])
sklearn.ensemble.HistGradientBoostingClassifier
- each feature is split into 256 bins
- so the best split at each node can be found quickly
- missing values get their own bin, so they are handled without extra preprocessing (see the sketch below)
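- for example, NaN values can be passed straight in (sketch; the NaN injection below is only for illustration, assuming x_train is a numeric numpy array):
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
x_missing = x_train.astype(float)            # astype returns a copy
x_missing[::10, 0] = np.nan                  # pretend every 10th value of the first feature is missing
HistGradientBoostingClassifier(random_state=42).fit(x_missing, y_train)  # no imputation needed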
from sklearn.ensemble import HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(random_state=42)
scores = cross_validate(hgb, x_train, y_train, return_train_score=True)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9321723946453317, 0.8801241948619236)
hgb.score(x_test, y_test)
>> 0.8723076923076923
- To get feature importances from HistGradientBoostingClassifier, use permutation_importance:
from sklearn.inspection import permutation_importance
hgb.fit(x_train, y_train)
result = permutation_importance(hgb, x_train, y_train, n_repeats=10, random_state=42, n_jobs=-1)
result.importances_mean
>> array([0.08876275, 0.23438522, 0.08027708])
result = permutation_importance(hgb, x_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
result.importances_mean
>> array([0.05969231, 0.20238462, 0.049 ])
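- permutation_importance is not specific to HistGradientBoostingClassifier; it works with any fitted estimator, e.g. the random forest from above (sketch):
result_rf = permutation_importance(rf, x_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
result_rf.importances_mean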