[Figure: scatter plots of house price ($) against house size, and of house price ($) against log(house size)]
Poor model accuracy
Variables are not normal
Weird looking residual plot
Outliers are present
Relationship doesn't look linear
indicates your model assumptions are wrong
(if you care about inference)
If you can predict better with transformed variables, great!
(if you care about prediction accuracy)
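A minimal sketch of the prediction point, using synthetic data rather than real house prices. The data-generating formula, the noise level, and the helper `fit_line` are all assumptions made for illustration. For brevity it compares in-sample R^2, and a held-out set would be a fairer measure of predictive accuracy.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

random.seed(0)
# Synthetic (made-up) data: price depends on log(size) plus noise.
sizes = [random.uniform(500, 5000) for _ in range(200)]
prices = [50_000 + 100_000 * math.log(s) + random.gauss(0, 10_000) for s in sizes]

# Fit price on raw size, then price on log(size), and compare R^2.
_, _, r2_raw = fit_line(sizes, prices)
_, _, r2_log = fit_line([math.log(s) for s in sizes], prices)
print(f"R^2 with raw size: {r2_raw:.3f}")
print(f"R^2 with log(size): {r2_log:.3f}")
```

Because the synthetic prices were generated from log(size), the transformed fit should score higher here. Real data offers no such guarantee.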
Variables in a regression don't actually need to be normal
...but the "errors" do!*
*Actually... we don't even need the errors to be normal.
They just need to be independent, have zero mean and common variance
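A small simulation can make this concrete. The sketch below draws strongly skewed errors (an exponential shifted to have zero mean) that are independent with common variance, and checks that the least-squares slope still averages out to the true value. The true intercept and slope, the sample size, and the helper `ols_slope` are assumptions chosen for illustration.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

random.seed(1)
TRUE_SLOPE = 2.0                      # assumed value for the simulation
xs = [i / 10 for i in range(50)]      # fixed design points

slopes = []
for _ in range(500):
    # Errors are clearly non-normal (skewed), but independent,
    # zero-mean, and with the same variance at every x.
    ys = [1.0 + TRUE_SLOPE * x + (random.expovariate(1.0) - 1.0) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_slope = sum(slopes) / len(slopes)
print(f"average estimated slope over 500 runs: {mean_slope:.3f}")
```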
Interpreting log transforms (slope β):
y on x: increasing x by 1 increases y by β
log(y) on log(x): multiplying x by e multiplies y by e^β
y on log(x): multiplying x by e increases y by β
log(y) on x: increasing x by 1 multiplies y by e^β
What it does:
compresses large values and spreads small values
Drawbacks:
can't be used for values ≤ 0
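The standard reading of a slope under each combination of logged and unlogged variables can be checked with a few lines of arithmetic. In this sketch the intercept `a`, slope `beta`, and starting point `x0` are arbitrary assumed values, and the four models are written as exact (noise-free) functions.

```python
import math

a, beta, x0 = 1.0, 0.5, 3.0   # arbitrary illustrative values

# y = a + beta*x: increasing x by 1 increases y by beta.
linear = lambda x: a + beta * x
add_linear = linear(x0 + 1) - linear(x0)

# log(y) = a + beta*log(x): multiplying x by e multiplies y by e^beta.
log_log = lambda x: math.exp(a + beta * math.log(x))
ratio_log_log = log_log(math.e * x0) / log_log(x0)

# y = a + beta*log(x): multiplying x by e increases y by beta.
level_log = lambda x: a + beta * math.log(x)
add_level_log = level_log(math.e * x0) - level_log(x0)

# log(y) = a + beta*x: increasing x by 1 multiplies y by e^beta.
log_level = lambda x: math.exp(a + beta * x)
ratio_log_level = log_level(x0 + 1) / log_level(x0)

print(add_linear, ratio_log_log, add_level_log, ratio_log_level)
```

Logging a variable turns additive changes in it into multiplicative ones, which is why "increases by" becomes "multiplies by" on whichever side is logged.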
Interpreting log transforms (approximately, for small changes):
y on x: increasing x by 1 increases y by β
log(y) on log(x): increasing x by 1% increases y by about β%
y on log(x): increasing x by 1% increases y by about β/100
log(y) on x: increasing x by 1 increases y by about 100·β%
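When changes are small, log-transformed slopes are often read as approximate percentage effects. The sketch below checks how close those approximations are for a couple of assumed slope values (`beta_small` and `beta_half` are arbitrary choices, not values from the slides).

```python
import math

beta_small = 0.05   # assumed slope for the log(y) on x case
beta_half = 0.5     # assumed slope for the cases where x is logged

# log(y) on x: a one-unit increase in x changes y by roughly 100*beta percent.
pct_log_level = (math.exp(beta_small) - 1) * 100

# log(y) on log(x): a 1% increase in x changes y by roughly beta percent.
pct_log_log = (1.01 ** beta_half - 1) * 100

# y on log(x): a 1% increase in x changes y by roughly beta/100 units.
change_level_log = beta_half * math.log(1.01)

print(f"log-level: exact {pct_log_level:.3f}% vs approx {100 * beta_small:.3f}%")
print(f"log-log:   exact {pct_log_log:.4f}% vs approx {beta_half:.4f}%")
print(f"level-log: exact {change_level_log:.6f} vs approx {beta_half / 100:.6f}")
```

The approximations get worse as the slope or the size of the change grows, so the exact multiplicative reading is safer for large effects.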
Run regression on the raw data
Compute R²
Transform variables because you think you can do better
Re-run regression with transformed variables
Re-compute R²
Repeat
Incorrectly interpreting model coefficients
Multiple testing problems
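The multiple testing risk in that loop can be sketched with a simulation in which y is pure noise, unrelated to x. Each run tries several transforms of x and records whether any of them gives a "significant" slope at the 5% level. The set of transforms, the sample size, and the helper names are assumptions for illustration. The critical value 2.048 is the two-sided 5% cutoff for a t distribution with 28 degrees of freedom.

```python
import math
import random

N = 30
T_CRIT = 2.048  # two-sided 5% critical value, t with N - 2 = 28 df

def r_squared(xs, ys):
    """Squared correlation, equal to R^2 for a simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def significant(xs, ys):
    """True if the slope's t statistic exceeds the 5% critical value."""
    r2 = r_squared(xs, ys)
    return math.sqrt(r2 * (N - 2) / (1 - r2)) > T_CRIT

# Hypothetical menu of transforms an analyst might cycle through.
transforms = [lambda x: x, math.log, math.sqrt, lambda x: x * x, lambda x: 1 / x]

random.seed(2)
reps, hits_raw, hits_any = 2000, 0, 0
for _ in range(reps):
    xs = [random.uniform(1, 10) for _ in range(N)]
    ys = [random.gauss(0, 1) for _ in range(N)]   # no true relationship
    results = [significant([f(x) for x in xs], ys) for f in transforms]
    hits_raw += results[0]      # test on the raw variable only
    hits_any += any(results)    # "keep trying until something works"

rate_raw, rate_any = hits_raw / reps, hits_any / reps
print(f"false positive rate, raw x only: {rate_raw:.3f}")
print(f"false positive rate, best of {len(transforms)} transforms: {rate_any:.3f}")
```

Picking the best of several attempts can only raise the false positive rate above the nominal level, even though each individual test is valid on its own.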
Goal: predict life expectancy based on GDP per capita
Goal: estimate effect of GDP on life expectancy