k-means, LASSO, and Random Forest
Applications for Machine Learning Working Group from
The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction
Baseline hazard rate
Individual characteristics
Hazard ratio exp(β) > 1: increased risk
Hazard ratio exp(β) < 1: decreased risk
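The labels above annotate the Cox proportional hazards model. A sketch of its standard form, following the Wikipedia source cited below; the symbols h_0, x_i, and β are notation assumed here, not taken from the slides:

```latex
h(t \mid x_i) = h_0(t)\,\exp\!\left(x_i^\top \beta\right),
\qquad
\frac{h(t \mid x_i)}{h(t \mid x_j)} = \exp\!\left((x_i - x_j)^\top \beta\right)
```

Here h_0(t) is the baseline term shared by everyone, and exp(x_i'β) scales it by individual characteristics. The baseline cancels in the ratio, so a ratio above 1 means increased risk and a ratio below 1 means decreased risk.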
Estimate with log partial likelihood
Indicator equals 1 when a death is observed (0 if censored)
The risk set: all subjects still at risk at the time of event i
Characteristics of those individuals
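The labels above annotate the log partial likelihood. A sketch of its standard form without a correction for tied event times, where δ_i, R_i, and x_j are assumed notation for the death indicator, the risk set, and the characteristics of individuals in that risk set:

```latex
\ell(\beta) = \sum_{i} \delta_i \left[ x_i^\top \beta
  - \log \sum_{j \in R_i} \exp\!\left(x_j^\top \beta\right) \right]
```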
Maximize this, or rather, minimize the negative of this
Intuitively, choose β such that it weights riskier characteristics more
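To make the estimation step concrete, here is a minimal sketch of the negative log partial likelihood in plain Python. The function name, argument names, and toy data are hypothetical, and ties in event times are ignored; a real analysis would use a survival library rather than this loop.

```python
import math

def neg_log_partial_likelihood(beta, times, events, X):
    """Negative Cox log partial likelihood (no tie correction).

    beta   : list of coefficients
    times  : observed time for each subject (death or censoring)
    events : 1 if death was observed, 0 if censored
    X      : list of characteristic vectors, one per subject
    """
    # Linear risk score x_i' beta for every subject.
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored subjects enter only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        total += scores[i] - math.log(sum(risk))
    return -total

# Hypothetical toy data: one characteristic; higher values die sooner.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
X = [[2.0], [1.0], [0.5], [0.0]]

nll_zero = neg_log_partial_likelihood([0.0], times, events, X)
nll_one = neg_log_partial_likelihood([1.0], times, events, X)
print(nll_zero, nll_one)
```

On this toy data, a positive β gives a lower negative log partial likelihood than β = 0, because the subjects with larger x die first. That is the sense in which the fit weights riskier characteristics more.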
Sources: https://en.wikipedia.org/wiki/Proportional_hazards_model and http://www.sthda.com/english/wiki/cox-proportional-hazards-model
Measurement Error!
Focus on systematic pollution by grouping together monitors in multiple counties
If we want to be more sophisticated than just picking the monitor groups ourselves...
Source: https://github.com/jgscott/ECO395M
K-means++
Choose the initial centroid at random from the data points
Compute the distance from each x to its nearest chosen centroid
Pick the next centroid probabilistically, weighting the furthest points most heavily
Stop when you reach K centroids
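The seeding steps above can be sketched in a few lines of standard-library Python. This is an illustrative version only: `kmeans_pp_init`, `sq_dist`, and the monitor coordinates are hypothetical names and data, and the weights are squared distances, as in the usual k-means++ formulation. It assumes K is no larger than the number of distinct points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centroids using k-means++ seeding."""
    rng = random.Random(seed)
    # Step 1: choose the initial centroid uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: squared distance from each x to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Step 3: sample the next centroid with probability proportional
        # to d2, so far-away points are the most likely picks.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    # Step 4: stop once there are k centroids.
    return centroids

# Hypothetical monitor coordinates forming three loose groups.
monitors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = kmeans_pp_init(monitors, k=3)
print(centers)
```

A point that is already a centroid has distance zero and therefore zero weight, so it cannot be drawn twice. The seeds would then be handed to ordinary k-means iterations.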
Source: https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Toy Example
Reduce Dimensions
Machine Learning can train on data to select important variables
Solution 1: LASSO
Minimize a penalized objective that combines dev, the closeness of fit, with pen, a penalty function on any β that is non-zero.
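As a hedged illustration of how such a penalty selects variables, the sketch below assumes squared-error deviance and an L1 penalty, that is, it minimizes (1/2n) Σ (y_i − x_i'β)² + λ Σ |β_j| by coordinate descent. This is one common choice rather than necessarily the exact objective on the slide. The names `lasso_cd` and `soft_threshold` and the toy data are hypothetical, and the columns are assumed to be centered so no intercept is needed.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; return exactly 0 inside [-lam, lam]."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent (no intercept, centered data)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                # Fit that leaves feature j out.
                fit_wo_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_wo_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta

# Hypothetical toy data: y is driven by the first column; the second is noise.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.8, -3.1, -0.2, 3.1, 6.0]

beta_ols = lasso_cd(X, y, lam=0.0)    # no penalty: both coefficients non-zero
beta_lasso = lasso_cd(X, y, lam=0.1)  # penalty zeroes out the noise column
print(beta_ols, beta_lasso)
```

With no penalty, both coefficients are non-zero. With the penalty, the coefficient on the noise column is set exactly to zero while the informative one is only slightly shrunk.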
How is this Machine Learning?
Source: https://github.com/jgscott/ECO395M
Toy Example
Solution 2: Random Forest
Source: https://github.com/jgscott/ECO395M
Toy Example
"The richness of our controls suggest this final estimate is about as good a representation of the true value as can be obtained empirically."
"This drop indicates that mortality effects of PM 2.5 tend to be larger among individuals with characteristics that Cox-Lasso associates with lower life expectancy, even after conditioning on age, sex, and chronic conditions."