Based on Calvano et al. (2020)
Two firms:
In each time period \(t = 0, 1, 2, \dots\), an agent observes the current state \(s_t\) and chooses an action \(a_t\). In Q-learning, play thus unfolds as the sequence \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\), starting from \(s_0\) at \(t = 0\).
The decision maker wants to maximize the expected present value of the reward stream:
\[\mathbb{E} \left[\sum_{t=0}^\infty \delta^t \pi_t\right]\]
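As a sanity check on this objective, a constant payoff stream \(\pi_t = \pi\) reduces to a geometric series (the numbers below are illustrative, not from the paper):

\[\mathbb{E}\left[\sum_{t=0}^\infty \delta^t \pi\right] = \sum_{t=0}^\infty \delta^t \pi = \frac{\pi}{1-\delta}\]

So with \(\delta = 0.95\), a per-period payoff of 1 has present value 20.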
This is usually solved using the Bellman equation:
\[V(s) = \max_{a \in A} \left\{ \mathbb{E}[\pi \mid s, a] + \delta \, \mathbb{E}[V(s') \mid s, a] \right\}\]
\(\delta \in (0, 1)\) is the discount factor
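When the payoff and transition structure are known, the Bellman equation can be solved by value iteration. A minimal sketch on a toy two-state MDP (the states, payoffs, and transition probabilities here are illustrative assumptions, not the pricing game from the paper):

```python
# Value iteration for V(s) = max_a { E[pi|s,a] + delta * E[V(s')|s,a] }
delta = 0.95  # discount factor

# P[s][a] = list of (probability, next_state); R[s][a] = expected period payoff.
# Toy environment, chosen only for illustration.
P = {
    0: {0: [(1.0, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}

V = {0: 0.0, 1: 0.0}
for _ in range(1000):  # the Bellman operator is a contraction, so this converges
    V = {
        s: max(
            R[s][a] + delta * sum(p * V[s2] for p, s2 in P[s][a])
            for a in P[s]
        )
        for s in P
    }
```

Here staying in state 1 (action 1) pays 2 forever, so \(V(1) = 2/(1-\delta) = 40\).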
Equivalently, we can use the Q-function:
\[Q(s,a) = \underbrace{\mathbb{E}[\pi \mid s,a]}_{\text{period payoff}} + \underbrace{\delta \, \mathbb{E}\!\left[\max_{a' \in A} Q(s', a') \,\middle|\, s, a\right]}_{\text{continuation value}}\]
All actions must be tried at all stages
\(\implies\) explore! 🗺️
Because the expected payoffs are unknown, the agent must learn them from experience rather than compute them directly.
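The exploration requirement above can be met with an \(\varepsilon\)-greedy rule: with small probability pick a random action, otherwise exploit the current Q estimates. A minimal tabular Q-learning sketch on a toy deterministic environment (the `step` dynamics and all parameter values are illustrative assumptions, not the paper's simulation):

```python
import random

random.seed(0)
delta = 0.95   # discount factor
alpha = 0.1    # learning rate
eps = 0.1      # exploration probability

n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Hypothetical toy dynamics: (payoff, next state)."""
    if s == 1 and a == 1:
        return 2.0, 1       # staying in state 1 pays 2 per period
    if s == 0 and a == 0:
        return 1.0, 0       # staying in state 0 pays 1 per period
    return 0.0, 1 - s       # switching states pays nothing

s = 0
for _ in range(50_000):
    # explore with probability eps, otherwise exploit the current estimate
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda x: Q[s][x])
    pi, s_next = step(s, a)
    # move Q(s,a) toward the sampled Bellman target: pi + delta * max_a' Q(s',a')
    Q[s][a] += alpha * (pi + delta * max(Q[s_next]) - Q[s][a])
    s = s_next
```

Without the exploration branch, the agent can lock onto the initially greedy action and never discover that staying in state 1 is better.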