Squeezing the most info out of your data
... or how to use Feature Selection and Feature Engineering
Between EDA and data cleaning (depression) and algorithm training (fancy stuff) there's an uncharted territory of feature selection and engineering
First, a bit of reason
Why feature engineering?
"Because"
Igor Dodon
Categorical variables encoding
Encoding Schemes
One Hot encoding
Frequency encoding
Mean Target encoding
Weight of Evidence encoding
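To make the four schemes above concrete, here is a minimal sketch in plain Python. The toy data, the function names and the `eps` smoothing term are all made up for illustration; in a real project you would more likely reach for pandas or scikit-learn. Note that the sign convention for Weight of Evidence differs between sources; this sketch uses ln(share of positives / share of negatives).

```python
import math
from collections import Counter, defaultdict

# Toy data (invented for illustration): one categorical feature, one binary target.
colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 0]

def one_hot(values):
    """One 0/1 indicator column per distinct category (columns in sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def frequency_encode(values):
    """Replace each category with its relative frequency in the data."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]

def mean_target_encode(values, y):
    """Replace each category with the mean of the target inside that category."""
    sums, counts = defaultdict(float), Counter(values)
    for v, t in zip(values, y):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]

def woe_encode(values, y, eps=0.5):
    """Weight of Evidence per category: ln(share of positives / share of negatives).
    eps is an assumed additive smoothing term that avoids log(0) and division by zero."""
    pos, neg = Counter(), Counter()
    for v, t in zip(values, y):
        (pos if t == 1 else neg)[v] += 1
    cats = set(values)
    total_pos = sum(pos.values()) + eps * len(cats)
    total_neg = sum(neg.values()) + eps * len(cats)
    woe = {
        c: math.log(((pos[c] + eps) / total_pos) / ((neg[c] + eps) / total_neg))
        for c in cats
    }
    return [woe[v] for v in values]

one_hot_cols = one_hot(colors)
freq = frequency_encode(colors)
mean_enc = mean_target_encode(colors, target)
woe = woe_encode(colors, target)
```

Mean Target and Weight of Evidence encodings both look at the target, so they should be computed on the training data only (or inside cross-validation folds); otherwise the target leaks into the features.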
Continuous variables discretisation
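The slide does not say which discretisation scheme is meant, so the sketch below shows two common ones as an assumption: equal-width bins and equal-frequency (quantile-style) bins. The function names and the `ages` data are invented for the example.

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # constant feature: everything falls into one bin
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so that each bin gets roughly the same number of rows.
    Caveat of this simple version: tied values may end up in different bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [18, 22, 25, 31, 40, 47, 58, 90]
width_bins = equal_width_bins(ages, 3)      # the outlier 90 stretches the bins
freq_bins = equal_frequency_bins(ages, 4)   # two rows per bin
```

Equal-width bins are sensitive to outliers (most rows here end up in the first bin), while equal-frequency bins keep the bins balanced at the cost of uneven interval widths.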
Transformations
Transformation Schemes
Multiplication
Logarithms
Power
Trigonometric
Combinations
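A rough sketch of the transformation schemes listed above, in plain Python. The column names (`width`, `height`, `income`, `hour`) and the rows are hypothetical, and reading "Combinations" as a ratio of two features is an assumption; the slide does not spell out what it covers.

```python
import math

# Hypothetical rows, invented for the example.
rows = [
    {"width": 2.0, "height": 3.0, "income": 1000.0, "hour": 0},
    {"width": 4.0, "height": 0.5, "income": 0.0, "hour": 6},
]

def engineer(row):
    """Derive new features from one row using the schemes from the slide."""
    angle = 2 * math.pi * row["hour"] / 24  # map the hour onto a 24-hour circle
    return {
        "area": row["width"] * row["height"],           # multiplication
        "log_income": math.log1p(row["income"]),        # logarithm; log1p copes with zero
        "width_squared": row["width"] ** 2,             # power
        "hour_sin": math.sin(angle),                    # trigonometric: 23:00 and 00:00
        "hour_cos": math.cos(angle),                    # end up close to each other
        "aspect_ratio": row["width"] / row["height"],   # combination (assumed: a ratio)
    }

engineered = [engineer(row) for row in rows]
```

The sine and cosine pair is useful for cyclical features such as hour of day or month, where the raw number hides the fact that the last value is next to the first one.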
First, a bit of reason
Part 2
Why feature selection?
"Through feature engineering have gone you once, lots of data have you."
Yoda
Correlations
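One common way to use correlations for feature selection is to drop features that are nearly duplicates of each other. The sketch below assumes Pearson correlation and a greedy "keep the first one seen" rule; the `0.95` threshold and the toy features are arbitrary choices for illustration, not something the slides prescribe.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def drop_correlated(features, threshold=0.95):
    """Greedy filter: keep a feature only if its absolute correlation with
    every feature kept so far is below the threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy features: "feet" is just "metres" in another unit, so it adds nothing new.
features = {
    "metres": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feet": [3.28, 6.56, 9.84, 13.12, 16.4],
    "noise": [2.0, -1.0, 0.5, 3.0, -2.0],
}
selected = drop_correlated(features)
```

Keep in mind that Pearson correlation only measures linear relationships, so a low value does not prove that two features are unrelated.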
Mutual information
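A minimal sketch of mutual information for two discrete variables, computed straight from the counts (result in nats). The function name and data are made up; continuous features would first need binning or a dedicated estimator, which this sketch does not cover.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """I(X; Y) = sum over (a, b) of p(a, b) * ln(p(a, b) / (p(a) * p(b)))."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )

# y is fully determined by x (y = x squared), yet their Pearson correlation is zero.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
mi_dependent = mutual_information(x, y)

# Two independent coin flips share no information.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair shows why mutual information is worth checking next to correlations: it picks up non-linear dependence that a correlation coefficient misses.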
Model importance based
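The slide does not say which model-based importance is meant (tree-based models, for example, expose their own importances). As a stand-in, the sketch below uses permutation importance: shuffle one column at a time and see how much the score drops. The `model` function is a hypothetical, already "trained" classifier, and the data is invented.

```python
import random

def accuracy(model, X, y):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(row) == t for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with column j replaced by its shuffled version.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model: it only ever looks at feature 0.
def model(row):
    return 1 if row[0] > 0.5 else 0

X = [[0.1, 5.0], [0.9, 3.0], [0.2, 4.0], [0.8, 1.0], [0.3, 2.0], [0.7, 6.0]]
y = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(model, X, y)
```

Feature 1 gets an importance of exactly zero because the model ignores it, which makes it a candidate for removal. On real data the importances should be measured on held-out rows, not on the training set.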
A note on Boruta
A note on dimensionality reduction
In case Facebook was more interesting - TL;DR:
Not all features are necessary
Some might even be detrimental
If you have domain knowledge - use it!
A pretty simple strategy - make as many features as possible, then keep only the best
Categorical variables must be treated specially
Thank you
Questions?