Ensemble Learning

Keep in mind that I'm using the following train/test split throughout this post:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)
  • note that cross_validate() is an option for model evaluation here.
    • alternatively, grid search can do better when there are many more features to consider.
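    • a minimal grid-search sketch of that alternative (the parameter grid here is an illustrative assumption):
      from sklearn.model_selection import GridSearchCV
      from sklearn.ensemble import RandomForestClassifier
      # hypothetical grid, just to show the pattern
      params = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
      gs = GridSearchCV(RandomForestClassifier(random_state=42), params, n_jobs=-1)
      gs.fit(x_train, y_train)
      gs.best_params_, gs.best_score_  # best hyperparameters and their mean CV accuracy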

Ensemble Learning

  • Combines several weak estimators into one strong estimator.
  • Tends to perform best on structured (tabular) data.

Examples of ensemble learning

  • Voting: feed the same data to several different models and combine their predictions (see the sketch after this list)
  • Bagging: random samples with repetition (sampling with replacement) fed to copies of the same model
  • Pasting: random samples without repetition (sampling without replacement) fed to copies of the same model
    • Extra trees
  • Boosting: gradual learning, where each new model corrects the previous ones' errors
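
A minimal sketch of voting and bagging (the base estimators and parameter values here are just illustrative assumptions):

from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# Voting: the same data goes to several different models; predictions are combined
voting = VotingClassifier([
    ('lr', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier(random_state=42)),
])
voting.fit(x_train, y_train)
# Bagging: bootstrap samples (with replacement) go to copies of the same model;
# bootstrap=False would turn this into pasting
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                            n_estimators=100, bootstrap=True, n_jobs=-1, random_state=42)
bagging.fit(x_train, y_train)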

sklearn.ensemble.RandomForestClassifier

  • Prepare a bootstrap sample (see the small sketch below).
    • Randomly select samples from the dataset; repetition is allowed.
    • same size as the original data
  • Train each decision tree on its own bootstrap sample.
    • At each node, RF randomly selects a subset of the features and picks the best split among them.
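  • a quick sketch of one bootstrap sample (illustrative; assumes x_train and y_train are NumPy arrays; RandomForestClassifier does this internally):
    import numpy as np
    # same size as the original training set, sampled with replacement
    idx = np.random.default_rng(42).choice(len(x_train), size=len(x_train), replace=True)
    x_boot, y_boot = x_train[idx], y_train[idx]  # some rows repeat, roughly a third never appear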
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(rf, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9973541965122431, 0.8905151032797809)
  • By selecting features randomly, every feature gets a fairer chance to contribute than in a single Decision Tree model.
    rf.fit(x_train, y_train)
    rf.feature_importances_
    
    >> array([0.23167441, 0.50039841, 0.26792718])
    

    OOB

  • out of bag: samples that were never selected into any bootstrap sample
  • can be used for validation
    rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)
    rf.fit(x_train, y_train)
    rf.oob_score_
    
    >> 0.8934000384837406
    

sklearn.ensemble.ExtraTreesClassifier

  • similar to Random Forest
  • each tree uses the entire training data
  • each node is split at random
    • this is the same as DecisionTreeClassifier(...splitter='random'...); a minimal single-tree sketch follows below
    • by doing this, Extra Trees runs faster than Random Forest
    • but it needs a larger number of trees than Random Forest does
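    • for reference, a minimal sketch of a single randomized-split tree (purely illustrative):
      from sklearn.tree import DecisionTreeClassifier
      random_tree = DecisionTreeClassifier(splitter='random', random_state=42)
      random_tree.fit(x_train, y_train)  # split thresholds are drawn at random instead of searched exhaustively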
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(et, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9973541965122431, 0.8905151032797809)
et.fit(x_train, y_train)
et.feature_importances_
>> array([0.20183568, 0.52242907, 0.27573525])

sklearn.ensemble.GradientBoostingClassifier

  • uses 100 trees with max_depth=3 by default
    • shallow trees help gradient boosting avoid overfitting
  • adds new trees gradually, using gradient descent on the loss
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(random_state=42)
scores = cross_validate(gb, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.8881086892152563, 0.8720430147331015)
  • can be optimized by increasing n_estimators and learning_rate:
    gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)
    scores = cross_validate(gb, x_train, y_train, return_train_score=True, n_jobs=-1)
    np.mean(scores['train_score']), np.mean(scores['test_score'])
    
  • but the feature importances can become skewed toward a few features:
    gb.fit(x_train, y_train)
    gb.feature_importances_
    
    >> array([0.15872278, 0.68010884, 0.16116839])
    

sklearn.ensemble.HistGradientBoostingClassifier

  • each feature is split into 256 intervals (bins)
    • each node split can then be found quickly
    • missing values are handled without imputation (see the small sketch after the scores below)
from sklearn.ensemble import HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(random_state=42)
scores = cross_validate(hgb, x_train, y_train, return_train_score=True)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9321723946453317, 0.8801241948619236)
# final score
hgb.score(x_test, y_test)
>> 0.8723076923076923
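  • As noted above, missing values need no separate imputation step; a minimal sketch (the NaN injection is purely illustrative and assumes x_train is a float NumPy array):
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier
    x_missing = x_train.copy()
    x_missing[::20, 0] = np.nan  # blank out every 20th value of the first feature, just for illustration
    hgb_nan = HistGradientBoostingClassifier(random_state=42)
    hgb_nan.fit(x_missing, y_train)  # trains directly on data containing NaNs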
  • To get feature importances from HistGradientBoosting, use permutation_importance:
    # train data
    from sklearn.inspection import permutation_importance
    hgb.fit(x_train, y_train)
    result = permutation_importance(hgb, x_train, y_train, n_repeats=10, random_state=42, n_jobs=-1)
    result.importances_mean
    
    >> array([0.08876275, 0.23438522, 0.08027708])
    
    # test data
    result = permutation_importance(hgb, x_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
    result.importances_mean
    
    >> array([0.05969231, 0.20238462, 0.049     ])