Keep in mind that I'm using...
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
data, target, test_size=0.2, random_state=42
)
- note that cross_validate() is optional here.
- alternatively, grid search may work better when there are many more features to consider (rough sketch below).
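- as a rough sketch, a grid search over a random forest could look like this (the parameter grid below is an arbitrary example, not tuned for this data):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# hypothetical grid, just to show the API; the values are assumptions
params = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), params, n_jobs=-1)
gs.fit(x_train, y_train)
gs.best_params_, gs.best_score_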
Ensemble Learning
- Combines several weak estimators into one strong estimator.
- tends to perform best on structured (tabular) data.
Examples of ensemble learning
- Voting: the same data is fed to several different models
- Bagging: random samples drawn with replacement, fed to copies of the same model
- Pasting: random samples drawn without replacement, fed to copies of the same model
- Boosting: models are trained one after another, each correcting the previous ones (gradual learning)
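- a minimal sketch of the first three in scikit-learn (the base estimators and n_estimators below are arbitrary choices, just to show the APIs):
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# Voting: the same data goes to different models
voting = VotingClassifier([('lr', LogisticRegression()), ('dt', DecisionTreeClassifier())])
# Bagging: random samples drawn with replacement, copies of the same model
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True)
# Pasting: random samples drawn without replacement, copies of the same model
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=False)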
sklearn.ensemble.RandomForestClassifier
- Prepare bootstrap samples.
- Randomly select samples from the dataset; picking the same sample again (repetition) is allowed.
- Each bootstrap sample has the same size as the original data.
- Train one decision tree on each bootstrap sample.
- At each node, RF randomly selects a subset of the features and chooses the best split among them.
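- what one bootstrap sample looks like, as a quick numpy sketch (assuming x_train and y_train are numpy arrays):
import numpy as np
rng = np.random.default_rng(42)
idx = rng.integers(0, len(x_train), size=len(x_train))  # indices may repeat (with replacement)
bootstrap_x, bootstrap_y = x_train[idx], y_train[idx]   # same size as the original data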
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(rf, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9973541965122431, 0.8905151032797809)
- By selecting features randomly, each feature gets a fairer chance to contribute than in a plain decision tree model.
rf.fit(x_train, y_train)
rf.feature_importances_
>> array([0.23167441, 0.50039841, 0.26792718])
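- to see that effect, compare with a single decision tree (sketch; the point is that a lone tree's importances are usually more skewed toward one feature):
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_train, y_train)
dt.feature_importances_  # typically more concentrated than rf.feature_importances_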
OOB
- out of bag: samples that were never selected into a bootstrap sample
- can be used for validation
rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)
rf.fit(x_train, y_train)
rf.oob_score_
>> 0.8934000384837406
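- conceptually, the OOB rows for one bootstrap sample can be found like this (numpy sketch, assuming numpy arrays; on average about 37% of rows end up out of bag):
import numpy as np
rng = np.random.default_rng(42)
idx = rng.integers(0, len(x_train), size=len(x_train))
oob_mask = ~np.isin(np.arange(len(x_train)), idx)    # True for rows never drawn
oob_x, oob_y = x_train[oob_mask], y_train[oob_mask]  # usable as a validation set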
sklearn.ensemble.ExtraTreesClassifier
- similar to Random Forest
- but each tree uses the entire dataset (no bootstrap samples)
- and each node is split randomly
- this is the same as DecisionTreeClassifier(...splitter='random'...)
- thanks to the random splits, ExtraTrees runs faster than RandomForest
- but it usually needs to train more trees than RandomForest to reach similar performance
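- the building block mentioned above, as code (sketch):
from sklearn.tree import DecisionTreeClassifier
random_tree = DecisionTreeClassifier(splitter='random', random_state=42)  # chooses among randomly drawn split thresholds instead of the best possible split
random_tree.fit(x_train, y_train)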
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(et, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9973541965122431, 0.8905151032797809)
et.fit(x_train, y_train)
et.feature_importances_
>> array([0.20183568, 0.52242907, 0.27573525])
sklearn.ensemble.GradientBoostingClassifier
- by default uses 100 trees with max_depth=3
- the shallow trees help gradient boosting avoid overfitting
- adds new trees one by one, each fit in a gradient-descent fashion (sketched below)
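- a conceptual sketch of that gradual learning for regression with squared loss (not GradientBoostingClassifier's actual internals, just the idea of fitting each new tree to the residuals and adding it with a small learning rate):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(x, y, n_trees=100, learning_rate=0.1, max_depth=3):
    pred = np.zeros(len(y), dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                      # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(x, residual)
        pred += learning_rate * tree.predict(x)  # take a small step toward the target
        trees.append(tree)
    return trees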
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(random_state=42)
scores = cross_validate(gb, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.8881086892152563, 0.8720430147331015)
- can be tuned further, e.g. more trees together with a higher learning rate.
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)
scores = cross_validate(gb, x_train, y_train, return_train_score=True, n_jobs=-1)
np.mean(scores['train_score']), np.mean(scores['test_score'])
- but the feature importances tend to concentrate on fewer features than with a random forest.
gb.fit(x_train, y_train)
gb.feature_importances_
>> array([0.15872278, 0.68010884, 0.16116839])
sklearn.ensemble.HistGradientBoostingClassifier
- each feature is split into 256 bins
- so the best split at each node can be found quickly
- missing values get their own bin, so they are handled without extra preprocessing (see the sketch below)
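- for example, NaN values can be passed straight in (sketch; the NaN injection below is only for illustration, assuming x_train is a numeric numpy array):
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
x_missing = x_train.astype(float)            # astype returns a copy
x_missing[::10, 0] = np.nan                  # pretend every 10th value of the first feature is missing
HistGradientBoostingClassifier(random_state=42).fit(x_missing, y_train)  # no imputation needed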
from sklearn.ensemble import HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(random_state=42)
scores = cross_validate(hgb, x_train, y_train, return_train_score=True)
np.mean(scores['train_score']), np.mean(scores['test_score'])
>> (0.9321723946453317, 0.8801241948619236)
hgb.score(x_test, y_test)
>> 0.8723076923076923
- To get feature importances from HistGradientBoostingClassifier, use permutation_importance:
from sklearn.inspection import permutation_importance
hgb.fit(x_train, y_train)
result = permutation_importance(hgb, x_train, y_train, n_repeats=10, random_state=42, n_jobs=-1)
result.importances_mean
>> array([0.08876275, 0.23438522, 0.08027708])
result = permutation_importance(hgb, x_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
result.importances_mean
>> array([0.05969231, 0.20238462, 0.049 ])
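- permutation_importance is not specific to HistGradientBoostingClassifier; it works with any fitted estimator, e.g. the random forest from above (sketch):
result_rf = permutation_importance(rf, x_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
result_rf.importances_mean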