Jim Savage
qplum Data-Science / FinTech Talks, Oct 21 2016
You can learn from hierarchy
We get to build models that put weight on plausible values, and none on impossible values.
Technical
A joint distribution of all the unknowns in the model. Unknowns = outcomes, parameters, latent variables.
Street
A model in which we deal coherently with both uncertainty in the dependent variables and all the parameters of the model.
“Bayesian inference is hard in the sense that thinking is hard” - Don Berry
But now we just use Stan.
Let's say we want a model of human heights that uses only sex as a predictor. I want to build a model that can help answer several questions, like the ones computed below:
\( \mbox{female height} \sim \mbox{normal}(\mbox{average}_{f}, \sigma) \) and
\( \mbox{male height} \sim \mbox{normal}(\mbox{average}_{m}, \sigma) \) (same standard deviation)
library(dplyr)

# Random draws of height for 10k people from a normal distribution
# 51% female
heights_women <- rnorm(n = 5100, mean = 165, sd = 8)
# 49% male
heights_men <- rnorm(n = 4900, mean = 175, sd = 8)

# Organize the data into a data.frame
heights_people <- data_frame(Sex = c(rep("f", 5100), rep("m", 4900)),
                             height = c(heights_women, heights_men))

# Female height on average
heights_people %>%
  filter(Sex == "f") %>%
  summarize(mean(height))
# A tibble: 1 × 1
`mean(height)`
<dbl>
1 165.0282
# Human height on average
heights_people %>%
  summarize(mean(height))
# A tibble: 1 × 1
`mean(height)`
<dbl>
1 169.9139
# Standard deviation of height of men
heights_people %>%
  filter(Sex == "m") %>%
  summarize(sd(height))
# A tibble: 1 × 1
`sd(height)`
<dbl>
1 7.986179
# Height of males at the 90th percentile
heights_people %>%
  filter(Sex == "m") %>%
  summarize(quantile(height, 0.9))
# A tibble: 1 × 1
`quantile(height, 0.9)`
<dbl>
1 185.1751
Do you believe this model?
We want to spend some time working through these problems.
A variable is a random variable if its value is not controlled or deterministic; it is subject to chance.
We think about the following as random variables: outcomes, parameters, and latent variables.
A probability density function tells us the relative probability of observing certain values of a random variable.
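For instance, the density of the female-height distribution from earlier tells us a height near the mean is far more probable than one deep in the tail (a quick illustrative check, not part of the original code):

# Relative probability of two heights under normal(165, 8)
dnorm(165, mean = 165, sd = 8)  # ~0.0499: near the mode, relatively probable
dnorm(190, mean = 165, sd = 8)  # ~0.0004: deep in the tail, relatively improbable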
Our model has several random variables: the observed heights, \( \mbox{average}_{m} \), \( \mbox{average}_{f} \), \( \sigma \), and the distribution of sexes.
A more complete generative model is the joint density of these
\[ p(\mbox{Heights}, \mbox{average}_{m}, \mbox{average}_{f},\sigma,\mbox{distribution of sexes}) \]
This factors into conditional densities \[ p(\mbox{Heights}, \mbox{average}_{m}, \mbox{average}_{f},\sigma, \mbox{distribution of sexes}) \]
\[ = p(\mbox{Heights}|\, \mbox{average}_{m}, \mbox{average}_{f},\sigma, \mbox{distribution of sexes})\times \] \[ p(\mbox{average}_{m}|\, \mbox{average}_{f}, \sigma,\mbox{distribution of sexes})\times \] \[ p(\mbox{average}_{f}|\, \sigma,\mbox{distribution of sexes})\times \] \[ p(\sigma|\,\mbox{distribution of sexes})\times \] \[ p(\mbox{distribution of sexes}) \]
And if the parameters are a priori independent of one another \[ = p(\mbox{Hghts}|\mbox{params})\times p(\mbox{av}_{m})\times p(\mbox{av}_{f})\times p(\sigma)\times p(\mbox{sex distr.}) \]
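To make this factorization concrete, here is a minimal R sketch of the log joint density for the heights model, assuming independent priors; the prior locations and scales below are illustrative choices, not from the talk:

# Log joint density = log likelihood + log priors (illustrative priors)
# We condition on the observed sexes, so p(sex distr.) is dropped here
log_joint <- function(heights, sex, avg_f, avg_m, sigma) {
  means <- ifelse(sex == "f", avg_f, avg_m)
  log_lik <- sum(dnorm(heights, mean = means, sd = sigma, log = TRUE))
  log_prior <- dnorm(avg_f, 170, 20, log = TRUE) +
    dnorm(avg_m, 170, 20, log = TRUE) +
    dnorm(sigma, 0, 10, log = TRUE)  # half-normal on sigma > 0, up to a constant
  log_lik + log_prior
}
log_joint(heights_people$height, heights_people$Sex,
          avg_f = 165, avg_m = 175, sigma = 8)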
The process is simple: specify a prior density for each unknown and a likelihood for the observed data, then combine the two with Bayes' rule.
Let's call our collection of parameters \( \theta \) and our observed heights \( y \). Bayes rule tells us that
\[ p(\theta|\, y) = \frac{p(y|\, \theta)\, p(\theta)}{p(y)} \]
Since \( p(y) \) does not depend on \( \theta \), we can re-write this in proportional form
\[ p(\theta|y)\propto p(y|\, \theta)\, p(\theta) \]
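As a toy illustration of this proportionality (an R sketch with an illustrative normal(170, 20) prior, not from the talk), we can evaluate the unnormalized posterior of \( \mbox{average}_{f} \) on a grid, holding \( \sigma \) fixed:

# Grid approximation: unnormalized posterior for average_f, sigma fixed at 8
f_heights <- heights_people$height[heights_people$Sex == "f"]
grid <- seq(160, 170, by = 0.01)
log_post <- sapply(grid, function(a) {
  sum(dnorm(f_heights, mean = a, sd = 8, log = TRUE)) +  # log p(y | theta)
    dnorm(a, mean = 170, sd = 20, log = TRUE)            # log p(theta)
})
posterior <- exp(log_post - max(log_post))  # subtract the max for numerical stability
grid[which.max(posterior)]                  # essentially the sample mean, ~165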
How do we know which model is generating returns at any point in time? Can we form an understanding of this?
Let's say there are two possible models that describe daily returns:
Model 1: a random walk model \[ r_{t} \sim \mbox{normal}(\alpha_{1}, \sigma_{1}) \]
Model 2: simple AR(1) momentum \[ r_{t} \sim \mbox{normal}(\alpha_{2} + \rho_{2} r_{t-1}, \sigma_{2}) \]
Which model? \[ p(\mbox{Model 1} | \mu_{t}) = \mbox{Logit}^{-1}(\mu_{t}) \]
and
\[ \mu_{t} \sim \mbox{normal}(\alpha_{3} + \rho_{1} \mu_{t-1} + f(\mathcal{I}_{t}), \sigma_{3}) \]
Where \( f(\mathcal{I}_{t}) \) is a function of information available at the beginning of day \( t \). For this exercise, we assume
\[ f(\mathcal{I}_{t}) = \beta r_{t-1} \]
(See R session)
library(rstan); library(ggplot2); library(dplyr); library(reshape2)
options(mc.cores = parallel::detectCores())

# Set some fake parameters
alpha1 <- -0.01; alpha2 <- 0.015
rho1 <- 0.95; rho2 <- 0.8
beta <- 0.5
sigma1 <- 0.05; sigma2 <- 0.03; sigma3 <- 0.3
T <- 500
r <- rep(NA, T)
r[1] <- 0
mu <- rep(NA, T)
z <- rep(NA, T)
mu[1] <- 0
z[1] <- 1

# Simulate the data series
for(t in 2:T) {
  # Latent state equation (alpha3 is implicitly 0 here)
  mu[t] <- rho1 * mu[t-1] + beta * r[t-1] + rnorm(1, 0, sigma3)
  # Probability of being in the random walk state today
  prob <- arm::invlogit(mu[t])
  # Draw today's state
  z[t] <- rbinom(1, 1, prob)
  if(z[t] == 1) {
    # random walk state
    r[t] <- rnorm(1, alpha1, sigma1)
  } else {
    # momentum state
    r[t] <- rnorm(1, alpha2 + rho2 * r[t-1], sigma2)
  }
}
plot.ts(r)
plot.ts(arm::invlogit(mu))
(See R session)
data
: declare the data you will give the model

transformed data
: make any changes to the data

parameters
: declare the unknowns

transformed parameters
: sometimes the parameters we want are a function of parameters that are easy to estimate

model
: declare the priors and likelihood

Syntax is somewhere between R and C++.
Before writing the model, note that if we see a datapoint \( r_{t} \), it has the likelihood (for parameter vector \( \theta \)):
\[ p(r_{t}|\, \theta) = \mbox{Logit}^{-1}(\mu_{t})\,\mbox{normal}(r_{t} |\, \alpha_{1}, \sigma_{1}) +\\ (1-\mbox{Logit}^{-1}(\mu_{t}))\,\mbox{normal}(r_{t} |\, \alpha_{2} + \rho_{2} r_{t-1}, \sigma_{2}) \]
Because we work in logs, we want \( \log(p(r_{t}|\, \theta)) \). For this Stan provides the log_sum_exp function, defined as log_sum_exp(a, b) = log(exp(a) + exp(b)); the log_mix function used in the model below wraps this pattern for two-component mixtures.
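The point of working this way is numerical stability. A quick R sketch (illustrative, not from the talk):

# Naive vs. numerically stable log_sum_exp
naive_lse <- function(a, b) log(exp(a) + exp(b))
stable_lse <- function(a, b) {
  m <- max(a, b)
  m + log(exp(a - m) + exp(b - m))
}
naive_lse(-1000, -1001)   # -Inf: exp() underflows to zero
stable_lse(-1000, -1001)  # about -999.69
# Stan's log_mix(theta, lp1, lp2) is equivalent to
# log_sum_exp(log(theta) + lp1, log(1 - theta) + lp2)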
// saved as switching_model.stan
data {
  int T;
  vector[T] r;
}
parameters {
  vector[T] mu;
  vector[2] rho;
  real beta;
  vector<lower = 0>[3] sigma;
  vector[3] alpha;
}
model {
  // priors
  mu[1:2] ~ normal(0, .1);
  sigma[1:2] ~ cauchy(0, 0.05);
  sigma[3] ~ cauchy(0, 0.5);
  rho ~ normal(1, .1);
  beta ~ normal(.5, .25);
  alpha[1:2] ~ normal(0, 0.1);
  alpha[3] ~ normal(0, 1);
  // likelihood
  for(t in 3:T) {
    mu[t] ~ normal(alpha[3] + rho[1] * mu[t-1] + beta * r[t-1], sigma[3]);
    target += log_mix(inv_logit(mu[t]),
                      normal_lpdf(r[t] | alpha[1], sigma[1]),
                      normal_lpdf(r[t] | alpha[2] + rho[2] * r[t-1], sigma[2]));
  }
}
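To see exactly what the target += line adds for each observation, here is a rough R translation of the mixture likelihood (an illustrative sketch, evaluated at the fake parameters from the simulation):

# Per-observation mixture log likelihood, mirroring the Stan increment
obs_loglik <- function(r_t, r_lag, mu_t, alpha, rho, sigma) {
  p_rw <- arm::invlogit(mu_t)  # probability of the random walk state
  log(p_rw * dnorm(r_t, alpha[1], sigma[1]) +
        (1 - p_rw) * dnorm(r_t, alpha[2] + rho[2] * r_lag, sigma[2]))
}
obs_loglik(r[3], r[2], mu[3],
           alpha = c(alpha1, alpha2),
           rho = c(rho1, rho2),
           sigma = c(sigma1, sigma2))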
library(rstan)
options(mc.cores = parallel::detectCores())
compiled_model <- stan_model("switching_model.stan")
posterior_draws <- sampling(compiled_model,
                            data = list(r = r,
                                        T = T))
print(posterior_draws, pars = c("alpha", "beta", "sigma"))
install.packages("shinystan")
shinystan::launch_shinystan(posterior_draws)
outcome_data <- posterior_draws %>%
  as.data.frame() %>%
  select(contains("mu")) %>%
  melt() %>%
  group_by(variable) %>%
  summarise(lower = quantile(value, 0.1),
            median = median(value),
            upper = quantile(value, 0.9)) %>%
  mutate(date = 1:n(),
         true_mu = mu)

outcome_data %>%
  ggplot(aes(x = date)) +
  geom_ribbon(aes(ymin = arm::invlogit(lower),
                  ymax = arm::invlogit(upper)),
              fill = "orange", alpha = 0.4) +
  geom_line(aes(y = arm::invlogit(median))) +
  geom_line(aes(y = arm::invlogit(true_mu)), colour = "red")
(See R session)
# Now with real data!
aapl <- Quandl::Quandl("YAHOO/AAPL")
aapl <- aapl %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Date) %>%
  mutate(l_ac = log(`Adjusted Close`),
         dl_ac = c(NA, diff(l_ac))) %>%
  filter(Date > "2015-01-01")

aapl_mod <- sampling(compiled_model,
                     data = list(T = nrow(aapl), r = aapl$dl_ac * 100))
print(aapl_mod, pars = c("alpha", "sigma", "rho", "beta"))
plot1 <- aapl_mod %>%
  as.data.frame() %>%
  select(contains("mu")) %>%
  melt() %>%
  group_by(variable) %>%
  summarise(lower = quantile(value, 0.05),
            median = median(value),
            upper = quantile(value, 0.95)) %>%
  mutate(date = aapl$Date,
         ac = aapl$l_ac) %>%
  ggplot(aes(x = date)) +
  geom_ribbon(aes(ymin = arm::invlogit(lower), ymax = arm::invlogit(upper)),
              fill = "orange", alpha = 0.4) +
  geom_line(aes(y = arm::invlogit(median))) +
  lendable::theme_lendable() +
  xlab("Date") +
  ylab("Probability of random walk model")
plot2 <- aapl_mod %>%
  as.data.frame() %>%
  select(contains("mu")) %>%
  melt() %>%
  group_by(variable) %>%
  summarise(lower = quantile(value, 0.05),
            median = median(value),
            upper = quantile(value, 0.95)) %>%
  mutate(date = aapl$Date,
         ac = aapl$`Adjusted Close`) %>%
  ggplot(aes(x = date, y = ac)) +
  geom_line() +
  lendable::theme_lendable() +
  xlab("Date") +
  ylab("Adjusted Close")
gridExtra::grid.arrange(plot1, plot2)