Indraneil Paul
IIIT Hyderabad
Real-life datasets very rarely have a uniform class distribution
The cost of misclassifying members of the minority class is often higher than that of the majority
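A quick way to see the problem, sketched with scikit-learn's make_classification (the 95/5 split below is an illustrative choice, not one prescribed by these slides):

```python
from collections import Counter
from sklearn.datasets import make_classification

# Build a 2-class dataset where ~95% of samples belong to class 0
X, Y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.95, 0.05], flip_y=0,
                           random_state=0)
print(Counter(Y))  # class 0 vastly outnumbers class 1
```

A constant "always predict 0" classifier already scores about 95% accuracy here, which is why plain accuracy hides the minority class.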
Undersampling Methods
Random, NearMiss, CNN, ENN, RENN, Tomek Links
Ensemble Methods
EasyEnsemble, BalanceCascade
Oversampling Methods
Random, Cluster Based, SMOTE, ADASYN
Cost-Sensitive Learning
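Of these families, cost-sensitive learning needs no resampling at all; a minimal sketch using scikit-learn's class_weight option (the choice of logistic regression is an assumption for illustration, not prescribed by these slides):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced dataset: ~90% class 0, ~10% class 1
X, Y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           flip_y=0, random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency,
# so minority-class mistakes cost more in the training loss
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, Y)
```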
Randomly eliminates instances of the majority class
Usually results in severe information loss
Duplicates randomly chosen instances of the minority class
Prone to overfitting, since it adds exact copies of existing data points
NearMiss 1
Selects the majority class samples whose average distance to the three closest minority class samples is the smallest
NearMiss 2
Selects the majority class samples whose average distance to the three farthest minority class samples is the smallest
NearMiss 3
Takes out a given number of the closest majority class samples for each minority class sample
from imblearn.under_sampling import NearMiss

# Note: return_indices and fit_sample were removed from imbalanced-learn;
# fit_resample replaces fit_sample, and the kept indices are exposed as
# the sample_indices_ attribute after resampling
nm1 = NearMiss(version=1)
nm2 = NearMiss(version=2)
nm3 = NearMiss(version=3)
X1_res, Y1_res = nm1.fit_resample(X, Y)  # was nm3 for all three calls — a bug
idx1_res = nm1.sample_indices_
X2_res, Y2_res = nm2.fit_resample(X, Y)
idx2_res = nm2.sample_indices_
X3_res, Y3_res = nm3.fit_resample(X, Y)
idx3_res = nm3.sample_indices_
# The EasyEnsemble resampler was removed in imbalanced-learn 0.6;
# EasyEnsembleClassifier is its maintained replacement: it trains an
# ensemble where each member sees a balanced bootstrap subset
from imblearn.ensemble import EasyEnsembleClassifier
ee = EasyEnsembleClassifier(n_estimators=3)
ee.fit(X, Y)
# BalanceCascade was deprecated in imbalanced-learn 0.4 and removed in 0.6;
# the snippet below only runs on those older versions
from imblearn.ensemble import BalanceCascade
bc = BalanceCascade()
X_res, Y_res = bc.fit_sample(X, Y)  # fit_sample is the pre-0.4 name of fit_resample
from imblearn.over_sampling import SMOTE

# The kind='regular' argument was removed; plain SMOTE() is the regular
# variant, and the borderline/SVM variants are now separate classes
sm = SMOTE()
X_res, Y_res = sm.fit_resample(X, Y)
from imblearn.over_sampling import ADASYN

ada = ADASYN()
X_res, Y_res = ada.fit_resample(X, Y)  # fit_sample was renamed fit_resample