Feature selection
Feature Selection: what it is
- Selecting features according to some criterion
Feature Selection: why
- Reduce overfitting
- Improve model quality
- Speed up training and inference
- Make interpretation easier
Feature Selection: approaches
FS: standard approaches
- Statistical tests
- Correlation with the target variable
- Dropping highly correlated features
- Removing multicollinearity
- Information value, weights of evidence
FS: statistical tests
- chi2
FS: statistical tests
- ANOVA
FS: statistical tests
- Mutual information
FS: statistical tests
Statistical tests: in practice
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
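The same SelectKBest interface works with other scorers; a quick sketch, continuing the iris example above, with scikit-learn's ANOVA F-test and mutual information scoring functions:
>>> from sklearn.feature_selection import f_classif, mutual_info_classif
>>> SelectKBest(f_classif, k=2).fit_transform(X, y).shape
(150, 2)
>>> SelectKBest(mutual_info_classif, k=2).fit_transform(X, y).shape
(150, 2)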
FS: correlation with the target variable
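A minimal sketch of this idea, assuming a pandas DataFrame df whose target column is named 'target' (both names are placeholders): rank features by their absolute correlation with the target and keep the strongest ones.
import pandas as pd

# absolute correlation of every feature column with the target
corr_with_target = df.drop('target', axis=1).corrwith(df['target']).abs()
# keep, for example, the 20 most correlated features
selected = corr_with_target.sort_values(ascending=False).head(20).index.tolist()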
FS: correlated features
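A common recipe for dropping one feature out of every highly correlated pair, sketched under the assumption of a feature DataFrame X (the 0.95 threshold is arbitrary):
import numpy as np

corr_matrix = X.corr().abs()
# keep only the upper triangle so every pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)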
FS: multicollinearity
from operator import itemgetter
from statsmodels.stats.outliers_influence import variance_inflation_factor

def check_vif(data, test, target_name, cols_to_drop, verbose=0):
    # iteratively drop the feature with the highest VIF until all VIFs are below 100
    if verbose == 1:
        print('Checking VIF values')
    X_train_multicoll = data.drop([target_name], axis=1).copy()
    X_train_multicoll['intercept'] = 1
    max_vif_value = float('inf')
    if verbose == 1:
        print(X_train_multicoll.shape)
    while max_vif_value > 100:
        vif = [variance_inflation_factor(X_train_multicoll.values, i)
               for i in range(X_train_multicoll.shape[1])]
        g = list(zip(X_train_multicoll.columns, vif))
        g = [i for i in g if i[0] != 'intercept']
        max_vif = max(g, key=itemgetter(1))
        if verbose == 1:
            print(max_vif)
        if max_vif[1] < 100:
            if verbose == 1:
                print('Done')
            break
        else:
            X_train_multicoll.drop([max_vif[0]], axis=1, inplace=True)
            cols_to_drop.append(max_vif[0])
            data.drop([max_vif[0]], axis=1, inplace=True)
            test.drop([max_vif[0]], axis=1, inplace=True)
            if verbose == 1:
                print(X_train_multicoll.shape)
            max_vif_value = max_vif[1]
    return data, test, cols_to_drop
FS: IV, WOE
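A rough sketch of how WoE and IV are usually computed, under common assumptions: a binary 0/1 target, a pre-binned categorical feature, and the convention WoE = ln(share of events / share of non-events) per bin (conventions differ in sign); the names df, feature, target are placeholders.
import numpy as np

def woe_iv(df, feature, target):
    # per-bin counts of events (target = 1) and non-events (target = 0)
    grouped = df.groupby(feature)[target].agg(['sum', 'count'])
    grouped['non_events'] = grouped['count'] - grouped['sum']
    dist_event = grouped['sum'] / grouped['sum'].sum()
    dist_non_event = grouped['non_events'] / grouped['non_events'].sum()
    woe = np.log(dist_event / dist_non_event)
    iv = ((dist_event - dist_non_event) * woe).sum()
    return woe, iv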
FS: searching over feature subsets
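One practical way to do such a search, sketched here assuming arrays X, y and a scikit-learn version that ships SequentialFeatureSelector: greedy forward selection that adds one feature at a time by cross-validated score.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# forward selection: start from an empty set and greedily add the best feature
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=10, direction='forward', cv=5)
sfs.fit(X, y)
X_selected = sfs.transform(X)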
FS: Feature importance
- Linear models: coefficients
- Tree-based models: split, gain, coverage (see the sketch below)
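A minimal sketch of both sources of importance, assuming a feature DataFrame X and target y (the models and parameters are stand-ins):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# linear model: coefficient magnitudes (binary target; features should be on the same scale)
lr = LogisticRegression(max_iter=1000).fit(X, y)
linear_importance = pd.Series(abs(lr.coef_[0]), index=X.columns).sort_values(ascending=False)

# tree-based model: impurity-based (gain) importances
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
tree_importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)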
FS: Feature importance
FS: regularization
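L1 regularization pushes some coefficients exactly to zero, so the surviving features can be kept; a hedged sketch with scikit-learn, assuming X, y (the regularization strength C is arbitrary):
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# the L1 penalty zeroes out weak coefficients; SelectFromModel keeps the non-zero ones
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
X_l1 = selector.transform(X)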
FS: "другие" инструменты
- Permutation importance
- Recursive feature elimination
- ELI5/SHAP
- Boruta/Boostaroota
- Adversarial validation
FS: Permutation importance
import numpy as np

def permutation_importances(rf, X_train, y_train, metric):
    # importance of a feature = drop in the metric after shuffling that feature
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save
        imp.append(baseline - m)
    return np.array(imp)

# the same idea via eli5
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, random_state=1).fit(X_train, y_train)
eli5.show_weights(perm, top=50)
FS: RFE
- At each step, train the chosen model and drop the least important features (see the sketch below)
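A hedged sketch with scikit-learn's RFE, assuming arrays X, y (the estimator and the number of features to keep are arbitrary):
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# repeatedly fit the model and drop the weakest features until 10 remain
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
          n_features_to_select=10, step=1)
rfe.fit(X, y)
X_rfe = rfe.transform(X)  # rfe.support_ marks the kept features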
FS: ELI5/SHAP
https://www.kaggle.com/artgor/eda-feature-engineering-and-model-interpretation
FS: Boruta
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# BorutaPy expects numpy arrays: X - features, y - target
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

# find all relevant features
feat_selector.fit(X, y)

# check selected features
feat_selector.support_

# check ranking of features
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
FS: BoostARoota
from boostaroota import BoostARoota

br = BoostARoota(metric='logloss')

# fit on the feature matrix x and target y
br.fit(x, y)

# look at the important variables - returns a pandas series
br.keep_vars_

# then modify the dataframe to only include the important variables
x1 = br.transform(x)
FS: Adversarial validation
import pandas as pd

# label train rows as 0 and validation rows as 1, then try to tell them apart
features = X_train.columns
X_train['target'] = 0
X_valid['target'] = 1
train_test = pd.concat([X_train, X_valid], axis=0)
target = train_test['target']
# train model
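The commented-out training step could look roughly like this (a sketch; the classifier and its parameters are arbitrary): if the model separates train from validation well, the features it relies on most are the ones whose distributions drift, and they are candidates for removal.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=100, random_state=42)
# ROC AUC near 0.5 means train and validation are hard to tell apart
print(cross_val_score(clf, train_test[features], target, cv=5, scoring='roc_auc').mean())

clf.fit(train_test[features], target)
drift_importance = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)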
FS: bonus ideas
Guided Regularized Random Forests
Genetic algorithms
Feature selection
By Andrey Lukyanenko