To transform or not to transform?

Rebecca Barter
Some context...
[Figure: House price vs. House size]


Some context...
[Figure: House price vs. log(House size)]


When do people usually transform their variables?
Poor model accuracy
Variables are not normal

Weird looking residual plot


Outliers are present
Relationship doesn't look linear

This indicates your model assumptions are wrong (if you care about inference).
If you can predict better with transformed variables, great! (if you care about prediction accuracy)
Common misconception:
Variables in a regression don't actually need to be normal

...but the "errors" do!*
*Actually... we don't even need the errors to be normal.
They just need to be independent, have zero mean and common variance
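
A small simulation can make this point concrete. The sketch below is an illustration, not part of the original slides: it assumes a made-up true slope of 2 and draws errors from a shifted exponential distribution, which is clearly skewed but still independent, zero-mean and of common variance. The helper `ols_slope` is a hypothetical name.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope for a simple regression of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)
true_slope = 2.0  # assumed value, chosen only for illustration
xs = [rng.uniform(0, 10) for _ in range(200)]

slopes = []
for _ in range(500):
    # Exponential(1) shifted to mean zero: skewed and non-normal,
    # but independent, zero-mean and with common variance.
    errors = [rng.expovariate(1.0) - 1.0 for _ in xs]
    ys = [1.0 + true_slope * x + e for x, e in zip(xs, errors)]
    slopes.append(ols_slope(xs, ys))

mean_slope = sum(slopes) / len(slopes)
print(round(mean_slope, 2))
```

Averaged over the repeated samples, the estimated slope should sit very close to the assumed true slope even though no error term is normal.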
Common transformations:
the logarithm
What it does: compresses large values and spreads small values
Drawbacks: can't be used for values ≤ 0
Interpreting log transforms (natural log, slope β):
y ~ x: increasing x by 1 increases y by β
log(y) ~ log(x): multiplying x by c multiplies y by c^β
y ~ log(x): multiplying x by c increases y by β log(c)
log(y) ~ x: increasing x by 1 multiplies y by e^β
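
The "increasing" versus "multiplying" readings of a log-transformed model follow directly from the algebra, and they can be checked numerically. The sketch below assumes a hypothetical intercept `a` and slope `b` with no noise and uses the natural log; the function names are made up for illustration.

```python
import math

a, b = 1.0, 0.5  # hypothetical intercept and slope, no noise

def level_log(x):
    """y = a + b*log(x): only the predictor is logged."""
    return a + b * math.log(x)

def log_level(x):
    """log(y) = a + b*x: only the response is logged."""
    return math.exp(a + b * x)

def log_log(x):
    """log(y) = a + b*log(x): both are logged."""
    return math.exp(a + b * math.log(x))

x, c = 3.0, 2.0

# Multiplying x by c ADDS b*log(c) to y.
diff_level_log = level_log(c * x) - level_log(x)
# Increasing x by 1 MULTIPLIES y by e^b.
ratio_log_level = log_level(x + 1) / log_level(x)
# Multiplying x by c MULTIPLIES y by c^b.
ratio_log_log = log_log(c * x) / log_log(x)

print(diff_level_log, b * math.log(c))
print(ratio_log_level, math.exp(b))
print(ratio_log_log, c ** b)
```

Each printed pair should agree up to floating-point error, whatever starting value of `x` is chosen.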
Common transformations:
the square-root
What it does: compresses large values and spreads small values
Drawbacks: can't be used for values < 0
Interpreting square-root transforms (slope β):
y ~ x: increasing x by 1 increases y by β
y ~ √x: increasing √x by 1 increases y by β
√y ~ x: increasing x by 1 increases √y by β
√y ~ √x: increasing √x by 1 increases √y by β
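
To see how the two transformations compress large values and where they break down, here is a minimal sketch on a few made-up values. The helpers `gaps` and `is_defined` are hypothetical names introduced for this illustration.

```python
import math

values = [1, 10, 100, 1000]  # made-up values spanning several orders of magnitude

sqrt_values = [math.sqrt(v) for v in values]
log_values = [math.log(v) for v in values]

def gaps(seq):
    """Differences between consecutive entries."""
    return [b - a for a, b in zip(seq, seq[1:])]

raw_gaps = gaps(values)        # 9, 90, 900
sqrt_gaps = gaps(sqrt_values)  # still growing, but far more slowly
log_gaps = gaps(log_values)    # every gap equals log(10)

def is_defined(transform, v):
    """True if the transform accepts v, False if it raises a domain error."""
    try:
        transform(v)
        return True
    except ValueError:
        return False

print(raw_gaps, sqrt_gaps, log_gaps)
print(is_defined(math.sqrt, 0), is_defined(math.log, 0), is_defined(math.sqrt, -1))
```

The square root is the gentler of the two: it accepts zero, whereas the logarithm does not, and neither accepts negative values.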
Common course of action
Run regression on the raw data
Compute R²
Transform variables because you think you can do better
Re-run regression with transformed variables
Re-compute R²
Repeat
Consequences of transformation
Incorrectly interpreting model coefficients
Multiple testing problems
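
The multiple testing problem can be illustrated with a simulation in which y is unrelated to x by construction. The sketch below is an assumption-laden illustration: the five candidate transformations, the sample size of 30 and the R² cutoff of 0.13 (roughly what is needed for p < 0.05 at that sample size) are all choices made here, not taken from the slides.

```python
import math
import random

def r_squared(xs, ys):
    """R^2 of a simple least-squares regression of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Candidate transformations of x that an analyst might try in turn.
transforms = {
    "raw": lambda x: x,
    "log": math.log,
    "sqrt": math.sqrt,
    "square": lambda x: x ** 2,
    "reciprocal": lambda x: 1.0 / x,
}

rng = random.Random(1)
n, n_sims = 30, 1000
cutoff = 0.13  # assumed: roughly the R^2 needed for p < 0.05 with n = 30

raw_hits = 0
best_hits = 0
for _ in range(n_sims):
    xs = [rng.uniform(1, 10) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]  # y is pure noise, unrelated to x
    scores = {name: r_squared([f(x) for x in xs], ys)
              for name, f in transforms.items()}
    raw_hits += scores["raw"] > cutoff
    best_hits += max(scores.values()) > cutoff

raw_rate = raw_hits / n_sims    # how often the raw fit alone looks "significant"
best_rate = best_hits / n_sims  # how often the best of five fits does
print(raw_rate, best_rate)
```

Because the best of several fits can never score lower than the raw fit, repeatedly transforming and keeping the winner can only raise the rate of spurious findings.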


To transform or not to transform?

Goal: predict life expectancy based on GDP per capita
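
When prediction is the goal, one reasonable way to choose between raw and transformed predictors is to compare error on held-out data. The sketch below uses synthetic stand-in data, not real GDP or life expectancy figures. The data-generating formula, the 150/50 split and the helper names are all assumptions made for illustration.

```python
import math
import random

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(intercept, slope, xs, ys):
    """Mean squared prediction error of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)

# Synthetic stand-in data (NOT real figures): the response is generated
# as linear in log(gdp) plus noise, purely to illustrate the workflow.
gdp = [math.exp(rng.uniform(math.log(500), math.log(50000))) for _ in range(200)]
life = [20 + 6 * math.log(g) + rng.gauss(0, 3) for g in gdp]

train, test = slice(0, 150), slice(150, 200)

def holdout_mse(feature):
    """Fit on the training rows, score on the held-out rows."""
    a, b = fit_line(feature[train], life[train])
    return mse(a, b, feature[test], life[test])

mse_raw = holdout_mse(gdp)
mse_log = holdout_mse([math.log(g) for g in gdp])
print(mse_raw, mse_log)
```

Scoring on rows the model never saw keeps the comparison honest: a transformation earns its place only if it predicts better out of sample.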



Goal: estimate effect of GDP on life expectancy
The bottom line is...
