Bayesian Methods for Hackers
An open-source introductory textbook on Bayesian methods,
in Python!
lifelines
Survival Analysis in Python
lifelines is a Python library, developed here at Shopify and later open-sourced, to measure durations.
Durations-tf?
- Historically, survival analysis was developed and used by actuaries and medical researchers to measure the lifetimes of populations.
- "What is the expected lifetime of patients given drug A? Drug B?"
- "What is the life-expectancy of a baby born today in Canada?"
These researchers wanted to measure the duration between Birth and Death.
Censorships
- It's not that simple. We don't always see the death event occur -- the current time, or other events, can prevent us from seeing it (the observation is censored).
- "Not all patients have died, how can I use that data?"
- "How can I measure population lifetimes when most of my population hasn't died yet?"
Modern Survival Analysis
- Birth: customer joins Shopify. Death: customer leaves Shopify. Censorship: the current time prevents us from seeing all cancelations.
- Birth: leader forms government. Death: government dissolves. Censorship: the current time prevents us from seeing all dissolutions.
- Birth: couple starts dating. Death: couple breaks up. Censorship: some couples never break up (a partner's death comes first).
- Birth: senator enters office. Death: senator retires. Censorship: senators can die before retiring.
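To make the birth / death / censorship vocabulary concrete, here is a minimal sketch of how such records are commonly encoded: one duration per subject, plus a flag saying whether the death was actually observed. The helper name `to_durations`, the customer dates, and the choice of days as the unit are all assumptions made for illustration; this is not lifelines' own code.

```python
from datetime import date

def to_durations(records, today):
    """Turn (birth, death-or-None) date pairs into durations and observed flags.

    A missing death date means the subject is censored: we only know
    they survived at least until `today`.
    """
    durations, observed = [], []
    for birth, death in records:
        end = death if death is not None else today
        durations.append((end - birth).days)
        observed.append(death is not None)
    return durations, observed

# Hypothetical customers: (joined, left); None means "still a customer today".
customers = [
    (date(2013, 1, 1), date(2013, 3, 1)),
    (date(2013, 2, 1), None),
    (date(2013, 6, 15), date(2013, 7, 15)),
]
durations, observed = to_durations(customers, today=date(2014, 1, 1))
print(durations)  # [59, 334, 30]
print(observed)   # [True, False, True]
```

The censored customer still contributes information: a lifetime of at least 334 days.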
The main application is constructing the Survival Curve
Example:
Survival Curve
T is the lifetime of a member of the population (Random)
t denotes time
S(t) is the survival curve at time t
S(t) == P( T > t)
(i.e., what is the probability that the individual lives longer than t?)
Survival Curve
Completely defines the population's distribution of lifetimes.
If I know the survival curve, I know everything.
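As a rough illustration of "I know everything", the sketch below starts from a made-up survival curve and recovers the median lifetime, the expected lifetime, and the probability of dying inside an interval. The exponential form S(t) = exp(-0.1 t), the function names, and the numerical settings are assumptions chosen only to keep the example small.

```python
import math

def survival(t, rate=0.1):
    """Hypothetical survival curve S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)

def median_lifetime(S, hi=1e6, tol=1e-9):
    """Smallest t with S(t) <= 0.5, by bisection (S must be non-increasing)."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if S(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return hi

def expected_lifetime(S, upper=500.0, steps=100_000):
    """E[T] is the area under S(t); trapezoid rule on [0, upper]."""
    dt = upper / steps
    total = 0.5 * (S(0.0) + S(upper))
    for i in range(1, steps):
        total += S(i * dt)
    return total * dt

median = median_lifetime(survival)       # close to ln(2) / 0.1
mean = expected_lifetime(survival)       # close to 1 / 0.1
p_5_to_10 = survival(5) - survival(10)   # P(5 < T <= 10)
```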
Survival Curve
But I don't know the survival curve =P
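Since the curve is unknown, it has to be estimated from censored data. One standard estimator is Kaplan-Meier (lifelines ships one as `KaplanMeierFitter`). The sketch below is a hand-rolled version for illustration, not lifelines' implementation, and the toy data is invented.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S_hat(t))] at each distinct observed death time.

    S_hat(t) is the product, over death times t_i <= t, of (1 - d_i / n_i),
    where d_i is the number of deaths at t_i and n_i is the number of
    subjects still at risk just before t_i.
    """
    event_times = sorted({t for t, o in zip(durations, observed) if o})
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, o in zip(durations, observed) if o and d == t)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

# Toy data: False marks a censored subject (death not seen).
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
# Steps: 1 -> 0.8 at t=2, -> 0.6 at t=3, -> 0.3 at t=5
```

Censored subjects never trigger a drop in the curve, but they do count in the at-risk total until their censoring time. That is how "not all patients have died" data still gets used.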
Hazard Curve
S(t) = e ^ { -H(t) }, where H(t) is the cumulative hazard up to time t
(sorry for the ugly equation)
Hazard Curve
Recall how I said:
If I know the survival curve, I know everything
Equivalently,
If I know the hazard curve, I know everything
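The equivalence can be checked numerically. Reading H(t) as the cumulative hazard, which is what makes S(t) = e^{-H(t)} hold, each curve is a one-line transform of the other. The function names and sample values below are invented for illustration.

```python
import math

def cumulative_hazard_from_survival(s):
    """H(t) = -log S(t); requires 0 < S(t) <= 1."""
    return -math.log(s)

def survival_from_cumulative_hazard(h):
    """S(t) = exp(-H(t))."""
    return math.exp(-h)

# Hypothetical S(t) at four increasing time points.
survival_values = [1.0, 0.8, 0.6, 0.3]
hazards = [cumulative_hazard_from_survival(s) for s in survival_values]
recovered = [survival_from_cumulative_hazard(h) for h in hazards]
```

As S(t) falls from 1 toward 0, H(t) rises from 0 without bound, and the round trip returns the original survival values.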
What if I had more data?
- Someone is wanted for murder - does this play a larger role in them being caught?
- Maybe someone is young, or old? How does this affect their chances of being caught?
Survival Regression
S(t| x) = e ^ { -H(t | x) }
Two models
Aalen's Model
x = (x1, x2, ... )
H( t | x ) = b1(t)*x1 + b2(t)*x2 + ...
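A small sketch of evaluating the additive form above for one individual. In practice the coefficient curves b_i(t) are estimated from data; here they are made-up functions, and the helper names are hypothetical.

```python
import math

def aalen_cumulative_hazard(t, x, coef_fns):
    """H(t | x) = b1(t)*x1 + b2(t)*x2 + ... with time-varying b_i(t)."""
    return sum(b(t) * xi for b, xi in zip(coef_fns, x))

def aalen_survival(t, x, coef_fns):
    """S(t | x) = exp(-H(t | x))."""
    return math.exp(-aalen_cumulative_hazard(t, x, coef_fns))

# Made-up coefficient curves: one linear in t, one quadratic in t.
coef_fns = [lambda t: 0.05 * t, lambda t: 0.01 * t ** 2]
x = (1.0, 2.0)  # hypothetical covariates for one individual

H_10 = aalen_cumulative_hazard(10, x, coef_fns)  # 0.5*1.0 + 1.0*2.0
S_10 = aalen_survival(10, x, coef_fns)
```

Because each b_i depends on t, a covariate's influence can grow or shrink over time. That is the main thing this model allows.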
Two models
Cox's Model
x = (x1, x2, ... )
H( t | x ) = b(t)*e ^ { b1*x1 + b2*x2 + ... }
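A sketch of the multiplicative form above: a shared baseline b(t) is scaled by a factor that depends only on the covariates. The baseline, the coefficients, and the helper name are invented for illustration; in a real analysis they would be fitted to data.

```python
import math

def cox_cumulative_hazard(t, x, baseline, betas):
    """H(t | x) = b(t) * exp(b1*x1 + b2*x2 + ...)."""
    return baseline(t) * math.exp(sum(b * xi for b, xi in zip(betas, x)))

baseline = lambda t: 0.02 * t  # made-up baseline b(t)
betas = (0.7, -0.3)            # made-up coefficients b1, b2

x_a = (1.0, 0.0)  # hypothetical individual with the first covariate set
x_b = (0.0, 0.0)  # hypothetical reference individual

ratio_early = (cox_cumulative_hazard(5, x_a, baseline, betas)
               / cox_cumulative_hazard(5, x_b, baseline, betas))
ratio_late = (cox_cumulative_hazard(50, x_a, baseline, betas)
              / cox_cumulative_hazard(50, x_b, baseline, betas))
# Both ratios equal exp(0.7): the covariate effect does not depend on t.
```

The ratio between two individuals is constant over time. This is the proportional hazards assumption, in contrast with the time-varying coefficients of Aalen's model.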
What else does lifelines have?
- Statistical tests (p-values and that stuff)
- Cross-validation for model selection
- Utils for transforming lifetables into durations
- Artificial data generating library (for testing and debugging methods)
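To show what "transforming lifetables into durations" means, here is a hypothetical stand-in for such a utility. The table layout (time, deaths, censored) and the function name are assumptions, not lifelines' actual interface.

```python
def lifetable_to_durations(table):
    """Expand rows of (time, deaths, censored) into one record per subject."""
    durations, observed = [], []
    for time, deaths, censored in table:
        durations += [time] * (deaths + censored)
        observed += [True] * deaths + [False] * censored
    return durations, observed

# Made-up lifetable: at time 1 two deaths; at time 2 one death and one
# censored subject; at time 4 two censored subjects.
table = [(1, 2, 0), (2, 1, 1), (4, 0, 2)]
durations, observed = lifetable_to_durations(table)
print(durations)  # [1, 1, 2, 2, 4, 4]
print(observed)   # [True, True, True, False, False, False]
```

The result is the per-subject (duration, observed) shape that survival estimators typically consume.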