Squeezing the most info out of your data

... or how to use Feature Selection and Feature Engineering

Between EDA and data cleaning (depression) and algorithm training (fancy stuff) there's an uncharted territory of feature selection and engineering

First, a bit of reason

Why feature engineering?

"Because"

Igor Dodon

Categorical variables encoding

Encoding Schemes

  • One Hot encoding
  • Frequency encoding
  • Mean Target encoding
  • Weight of Evidence encoding
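A minimal sketch of the four schemes above, using pandas on a toy table. The column names `city` and `bought` and the 0.5 smoothing constant are assumptions made for illustration, not part of the talk. The sign convention for Weight of Evidence varies between sources; this sketch uses ln(share of positives / share of negatives).

```python
import numpy as np
import pandas as pd

# Toy data: `city` (categorical feature) and `bought` (binary target) are hypothetical.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "bought": [1, 0, 1, 1, 0, 0],
})

# One Hot: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Frequency: replace each category with the share of rows that carry it.
freq = df["city"].map(df["city"].value_counts(normalize=True))

# Mean Target: replace each category with the mean of the target inside it.
mean_target = df["city"].map(df.groupby("city")["bought"].mean())

# Weight of Evidence: ln(share of positives / share of negatives) per category.
# The additive 0.5 smoothing is an assumption that avoids log(0) for category "C".
stats = df.groupby("city")["bought"].agg(pos="sum", total="count")
stats["neg"] = stats["total"] - stats["pos"]
pos_share = (stats["pos"] + 0.5) / (stats["pos"].sum() + 0.5 * len(stats))
neg_share = (stats["neg"] + 0.5) / (stats["neg"].sum() + 0.5 * len(stats))
woe = df["city"].map(np.log(pos_share / neg_share))
```

Mean Target and Weight of Evidence both look at the target, so in practice they are usually fitted on training folds only, to avoid leaking the answer into the feature.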

Continuous variables discretisation
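One possible way to discretise a continuous variable, sketched with pandas. The `age` values and the hand-picked bin edges are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical continuous feature.
ages = pd.Series([3, 17, 25, 38, 52, 67, 80, 91], name="age")

# Equal-width: 4 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal-frequency: 4 bins that each hold the same number of rows.
equal_freq = pd.qcut(ages, q=4, labels=False)

# Domain-driven: edges chosen by hand (these particular edges are an assumption).
by_domain = pd.cut(ages, bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"])
```

Equal-width bins are sensitive to outliers, while equal-frequency bins adapt to the distribution; hand-picked edges are the place to use domain knowledge.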

Transformations

Transformation Schemes

  • Multiplication
  • Logarithms
  • Power
  • Trigonometric
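The four schemes above might look like this with NumPy and pandas. All column names and values are hypothetical; the trigonometric example assumes the common use case of encoding a cyclic feature such as the hour of day.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features.
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 1_200_000],
    "width": [2.0, 3.0, 4.0, 5.0],
    "height": [1.0, 2.0, 3.0, 4.0],
    "hour": [0, 6, 12, 18],
})

# Multiplication: combine two measurements into one.
df["area"] = df["width"] * df["height"]

# Logarithm: compress a long right tail; log1p also handles zeros.
df["log_income"] = np.log1p(df["income"])

# Power: squares and roots change how strongly large values count.
df["width_sq"] = df["width"] ** 2
df["sqrt_height"] = np.sqrt(df["height"])

# Trigonometric: map a cyclic value onto a circle so that 23:00 and 00:00 end up close.
angle = 2 * np.pi * df["hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)
```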

Combinations
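A small sketch of combining existing features into new ones, assuming "combinations" here means pairwise products, ratios and crossed categories. The columns are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical housing data.
df = pd.DataFrame({
    "rooms": [2, 3, 4],
    "area": [50.0, 75.0, 120.0],
    "floor": [1, 5, 2],
    "city": ["A", "B", "A"],
    "type": ["flat", "flat", "house"],
})

# Pairwise products of the numeric columns.
numeric = ["rooms", "area", "floor"]
for left, right in combinations(numeric, 2):
    df[f"{left}_x_{right}"] = df[left] * df[right]

# A ratio that carries a meaning of its own.
df["area_per_room"] = df["area"] / df["rooms"]

# Crossing two categorical columns into a single, finer-grained one.
df["city_type"] = df["city"] + "_" + df["type"]
```

The number of pairwise combinations grows quadratically with the number of columns, which is one reason a selection step usually follows.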

First, a bit of reason

Part 2

Why feature selection?

"Through feature engineering have gone you once, lots of data have you."

Yoda

Correlations
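One common use of correlations for selection is dropping one feature out of every highly correlated pair. The sketch below runs on synthetic data; the 0.95 threshold is an arbitrary assumption, not a recommendation from the talk.

```python
import numpy as np
import pandas as pd

# Synthetic data: `a_copy` is almost a rescaled duplicate of `a`.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + rng.normal(scale=0.01, size=200),
    "b": b,
})

# Absolute Pearson correlation between every pair of columns.
corr = X.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair above the (assumed) 0.95 threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
```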

Mutual information
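Mutual information can pick up dependencies that Pearson correlation misses. A hedged sketch with scikit-learn's `mutual_info_classif` on synthetic data, where the target depends on the absolute value of one feature:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
noise = rng.normal(size=1000)
X = np.column_stack([signal, noise])

# The target depends on |signal|: a symmetric, non-linear relation.
y = (np.abs(signal) > 0.7).astype(int)

# Pearson correlation is close to zero for this kind of relation...
pearson = np.corrcoef(signal, y)[0, 1]

# ...while the estimated mutual information is clearly higher for `signal` than for `noise`.
mi = mutual_info_classif(X, y, random_state=0)
```

For a continuous target, `mutual_info_regression` from the same module plays the same role.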

Model importance based
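A possible model-based selection, assuming a random forest as the model and "above mean importance" as the cut-off; both choices are illustrative. The data is synthetic, with the three informative features placed first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-2 are informative, the other seven are noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Fit a model that exposes per-feature importances.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = model.feature_importances_

# Keep the features whose importance is above the mean (an assumed cut-off).
keep = importances > importances.mean()
X_selected = X[:, keep]
```

Impurity-based importances like these tend to favour high-cardinality features, so it can be worth cross-checking them with another method.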

A note on Boruta
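The core idea behind Boruta is to compare each real feature against "shadow" features, which are shuffled copies that cannot carry any signal. The sketch below is a single-pass simplification and not the full algorithm, which repeats the comparison many times and applies a statistical test; the data and model settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-2 are informative, the other seven are noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Shadow features: each column shuffled independently, which breaks any link to y.
rng = np.random.default_rng(0)
shadows = np.column_stack([rng.permutation(col) for col in X.T])

# Train on real and shadow features side by side.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(np.hstack([X, shadows]), y)

# Keep a real feature only if it beats the best shadow feature.
n_real = X.shape[1]
real_importance = model.feature_importances_[:n_real]
shadow_max = model.feature_importances_[n_real:].max()
kept = real_importance > shadow_max
```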

A note on dimensionality reduction
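Dimensionality reduction builds new features out of the old ones instead of choosing among them. A minimal sketch with PCA from scikit-learn, on synthetic data generated from two hidden factors; the 95% variance target is an assumed setting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed features driven by only 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.01, size=(300, 5))

# PCA is scale-sensitive, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)
```

The resulting components are linear mixes of the original columns, so they are harder to interpret than a selected subset of features.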

In case Facebook was more interesting - TL;DR:

  • Not all features are necessary
  • Some might even be detrimental
  • If you have domain knowledge - use it!
  • A pretty simple strategy - make as many features as possible, then keep only the best
  • Categorical variables must be treated specially

Thank you

Questions?

Squeezing the most info out of your data

By Alexandru Burlacu
