Artificial Intelligence, Algorithmic Pricing, and Collusion
By Calvano et al. (2020)
Highlights and Overview
- Experimental approach
- Models
- For pricing algorithms: Q-Learning
- For economic environment: repeated price competition in Bertrand oligopoly setting
- Issues:
- Time-scale: Q-learning is not fast
- State space and action space are discrete
- By design, algorithms play pure strategies
What is collusion?
- Outcome vs Procedure
- Is the existence of supracompetitive (i.e. high) prices indicative of collusion?
- Or: do we want to focus on collusive strategies instead?
- i.e. can we see a "reward-punishment scheme designed to provide the incentives for firms to consistently price above the competitive level?"
Canonical Model
Firms:
- Play an infinitely repeated game
- Price simultaneously in each stage
- Condition their prices on past history
Maskin and Tirole (1988)
(Two) firms:
- Alternate in moving
- Commit to a price level for two periods
- Condition pricing only on rival's current price
(Diagram: collusion under staggered pricing)
What is Q-Learning?
- Model-free reinforcement learning algorithm
- Proposed to tackle Markov Decision Processes
- Pros:
- Widely popular among computer scientists
- Simple and can be characterized by a few parameters
- Cons:
- Learning process is slow
- Discrete state and action space \(\to\) curse of dimensionality
Markov Decision Process
In each time period \(t = 0, 1, 2, ... \), an agent
- observes a state variable \(s_t \in S\)
- chooses an action \(a_t \in A(s_t)\)
- obtains a reward \(\pi_t\)
- moves on to the next state \(s_{t+1}\), with \((\pi_t, s_{t+1})\) drawn from the distribution \(F(\pi_t, s_{t+1} \mid s_t, a_t)\)
In Q-Learning,
- \(S\) and \(A\) are finite
- \(A\) does not depend on \(s\) (the same actions are available in every state)
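A minimal sketch of this setup in Python, assuming a generic environment interface (class and method names are illustrative, not from the paper):

```python
import random

class MDPEnvironment:
    """Illustrative finite MDP: the agent observes s_t, picks a_t, and
    receives a reward pi_t and a next state s_{t+1} drawn from F(. | s_t, a_t)."""

    def __init__(self, states, actions):
        self.states = states    # finite state set S
        self.actions = actions  # finite action set A (same in every state)
        self.state = random.choice(states)

    def step(self, action):
        # Placeholder dynamics: in a real model, (pi_t, s_{t+1}) would be
        # drawn from the (unknown) joint distribution F given (s_t, a_t).
        reward = random.random()
        self.state = random.choice(self.states)
        return reward, self.state
```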
Q-Learning example
(Diagram: an example trajectory in which, starting at \(t = 0\), the agent takes actions \(a_0, a_1, a_2, \ldots\) and moves through states \(s_0, s_1, s_2, \ldots\))
Pros and Cons of using Q-Learning
- How does it work?
- Is it the most realistic model of market behavior that we can think of?
- Why is it slow?
DM's problem
The decision maker (DM) wants to maximize the expected present value of the reward stream
\[\mathbb{E} \left[\sum_{t=0}^\infty \delta^t \pi_t\right]\]
This is usually solved by using the Bellman equation:
\[V(s) = \max_{a \in A} \left\{ \mathbb{E}(\pi|s, a) + \delta \mathbb{E}[V(s') | s, a] \right\}\]
- \(s'\) is shorthand for \(s_{t+1}\)
- \(\delta \in (0, 1)\) is the discount factor
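If the model primitives were known, the Bellman equation could be solved directly, e.g. by value iteration; a hedged sketch, with hypothetical arrays `P` (transition probabilities) and `R` (expected rewards) as inputs:

```python
import numpy as np

def value_iteration(P, R, delta, tol=1e-8):
    """Iterate V(s) = max_a { E[pi | s, a] + delta * E[V(s') | s, a] }.

    P: (|S|, |A|, |S|) array of transition probabilities P(s' | s, a)
    R: (|S|, |A|) array of expected per-period rewards E[pi | s, a]
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + delta * P @ V        # Q(s, a), shape (|S|, |A|)
        V_new = Q.max(axis=1)        # take the best action in each state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

This model-based route is exactly what Q-learning avoids: it never needs `P` or `R` explicitly.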
DM's problem cont'd
Or we can use the Q-function
\[Q(s,a) = \mathbb{E}[\pi | s,a] + \delta \cdot \mathbb{E}[\max_{a' \in A} Q(s', a')| s, a]\]
- Think of this Q-value as the quality of action \(a\) in state \(s\)
- In fact, \(V(s) =\max_{a\in A} \{Q(s,a)\}\)
- Since \(S\) and \(A\) are finite, \(Q\)-function can be represented as a \(|S|\times |A|\) matrix
- In the Q-function, the first term is the period payoff and the second is the (discounted) continuation value
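A minimal numpy sketch of the tabular representation (sizes are arbitrary):

```python
import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))  # the Q-matrix: one cell per (s, a) pair

V = Q.max(axis=1)                    # V(s) = max_a Q(s, a)
greedy_policy = Q.argmax(axis=1)     # the action attaining the max in each state
```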
Q-Learning
- Q-learning: approximate finite \(|S| \times |A| \) Q-matrix without knowing the underlying distribution \(F(\pi_t, s_{t+1}|s_t, a_t)\)
- \(\to\) hence "model-free"
- How does Q-Learning estimate this Q-matrix?
- Start with arbitrary initial matrix \(\bf Q_0\)
- Learning equation: \[Q_{t+1}(s,a) = \textcolor{blue}{(1-\alpha)} \cdot Q_t(s,a) + \textcolor{blue}{\alpha}\cdot\left[\pi_t + \delta \cdot \max_{a' \in A}Q_t(s',a')\right]\]
- A convex combination; \(\alpha \in [0, 1]\) is the learning rate
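A minimal sketch of this update step in Python (variable names are illustrative):

```python
def q_update(Q, s, a, reward, s_next, alpha, delta):
    """One Q-learning step: move Q(s, a) toward the sampled target
    pi_t + delta * max_{a'} Q(s', a') with step size alpha."""
    target = reward + delta * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```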
Q-Learning cont'd
All actions must be tried in all states
\(\implies\) explore! 🗺️
- The algorithm uses an \(\epsilon\)-greedy model of exploration
- Choose the currently optimal action (i.e. the one with the highest Q-value) with probability \(1 - \epsilon\)
- "Exploitation mode"
- Randomize across all other actions uniformly with probability \(\epsilon\)
- "Exploration mode"
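A sketch of \(\epsilon\)-greedy action selection over a numpy Q-matrix (assuming at least two actions):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
    """Exploit the current Q-matrix with probability 1 - epsilon;
    otherwise randomize uniformly over the remaining actions."""
    greedy = int(np.argmax(Q[s]))
    if rng.random() >= epsilon:
        return greedy                                      # exploitation mode
    others = [a for a in range(Q.shape[1]) if a != greedy]
    return int(rng.choice(others))                         # exploration mode
```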
Q-Learning is slow
Because
- it only updates one cell in the Q-matrix at a time
- Approximating the true matrix requires that each cell be visited multiple times
- More iterations are thus needed for large state and action spaces
Q-Learning is not stationary
- State space may increase over time
- \(\to\) Solution: bound each player's memory to the past \(k\) periods
- But what if the opponent also changes her strategy over time?
- \(\to\) there are no general convergence results for strategic games played by Q-learning algorithms
- No ex-ante (theoretical) guarantee
- However, verifiable ex-post
Economic environment
- Infinitely repeated pricing game
- Firms act simultaneously and condition their actions on history
- Firms also have bounded memory (needed to keep the state space finite)
- There are \(n\) differentiated products (each supported by a firm, so \(n\) firms in total) and an outside good
At each stage
- Firms compete in prices with logit demand and constant marginal costs \(c_i\)
- In each period \(t\), the demand for product \(i = 1, 2, ..., n\) is \[q_{i,t} = \frac{\exp{(\frac{a_i - p_{i, t}}{\mu})}}{\sum_{j=1}^n \exp{(\frac{a_j - p_{j, t}}{\mu})} + \exp{(\frac{a_0}{\mu})}}\]
- \(a_i\): product quality index; captures vertical differentiation
- \(a_0\): quality of the outside good
- \(\mu\): index of horizontal differentiation
- \(\to\) per-period reward for firm \(i\): \[\pi_{i,t} = (p_{i, t} - c_i)\cdot q_{i,t}\]
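A sketch of the logit demand and per-period reward in Python; the parameter values below are purely illustrative, not necessarily the paper's baseline:

```python
import numpy as np

def logit_demand(p, a, a0, mu):
    """q_i = exp((a_i - p_i)/mu) / (sum_j exp((a_j - p_j)/mu) + exp(a0/mu))."""
    num = np.exp((a - p) / mu)
    return num / (num.sum() + np.exp(a0 / mu))

def rewards(p, a, a0, mu, c):
    """Per-period reward pi_i = (p_i - c_i) * q_i, for every firm at once."""
    return (p - c) * logit_demand(p, a, a0, mu)

# Two symmetric firms with illustrative parameters
a = np.array([2.0, 2.0])
c = np.array([1.0, 1.0])
a0, mu = 0.0, 0.25
print(rewards(np.array([1.5, 1.5]), a, a0, mu, c))
```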
Action space
- Compute Bertrand Nash Equilibrium prices \(\mathbf{p}^N\)
- Compute monopoly prices \(\mathbf{p}^M\)
- Action space \(A\) is given by \(m\) equally spaced points in the interval \[[\mathbf{p}^N - \zeta(\mathbf{p}^M- \mathbf{p}^N), \mathbf{p}^M+\zeta(\mathbf{p}^M- \mathbf{p}^N)]\]
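A sketch of the discretized action space, assuming symmetric firms so that \(p^N\) and \(p^M\) are scalars; the defaults for `m` and `zeta` are illustrative:

```python
import numpy as np

def price_grid(p_N, p_M, m=15, zeta=0.1):
    """m equally spaced prices on [p_N - zeta*(p_M - p_N), p_M + zeta*(p_M - p_N)]."""
    spread = zeta * (p_M - p_N)
    return np.linspace(p_N - spread, p_M + spread, m)
```

Extending the grid slightly below \(p^N\) and above \(p^M\) lets the algorithms also experiment with prices outside the Nash-to-monopoly range.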
State space
- State space \(S\) is the set of all past prices from the last \(k\) periods \[s_t = \{\mathbf{p}_{t-1}, ..., \mathbf{p}_{t-k}\}\]
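With \(m\) feasible prices per firm and memory \(k\), a state is the tuple of the last \(k\) price indices of all firms; a sketch of encoding it as a single integer (names illustrative):

```python
def encode_state(past_price_indices, m):
    """Map a flat sequence of price indices (k periods x n firms)
    to a single integer in {0, ..., m**(k*n) - 1}."""
    code = 0
    for idx in past_price_indices:
        code = code * m + idx
    return code

# Example: k = 1, two firms, m = 15; firm 1 charged index 3, firm 2 index 7
print(encode_state([3, 7], m=15))  # 3*15 + 7 = 52
```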
Exploration in Q-Learning
- Uses \(\epsilon\)-greedy model with a time-declining exploration rate \[\epsilon_t = e^{-\beta t}, \beta >0\]
- i.e. the algorithm mostly explores at random initially, but shifts more and more towards exploitation as time passes
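A sketch of the decaying exploration schedule (the value of \(\beta\) is illustrative):

```python
import math

def exploration_rate(t, beta=1e-5):
    """epsilon_t = exp(-beta * t): close to 1 early on, decaying toward 0."""
    return math.exp(-beta * t)
```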
Verifying Convergence
- Convergence is deemed achieved if, for each player, the greedy (optimal) strategy does not change for 100,000 consecutive periods (see the sketch after this list)
- Observation
- Nearly all experimental sessions converged
- Convergence took a great many repetitions
- Most of the sessions that did not converge ended up in price cycles
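A sketch of the convergence check referenced above: after each period, recompute every player's greedy strategy from its Q-matrix and count how long all of them have remained unchanged (names illustrative):

```python
import numpy as np

def check_convergence(Q_list, prev_strategies, counter, window=100_000):
    """Return the current greedy strategies, the updated counter, and a flag
    that is True once no strategy has changed for `window` consecutive periods."""
    strategies = [tuple(np.argmax(Q, axis=1)) for Q in Q_list]
    counter = counter + 1 if strategies == prev_strategies else 0
    return strategies, counter, counter >= window
```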
Conclusion
- Algorithms consistently learn to charge supracompetitive prices
- Collusion is typically partial and is enforced by punishment in case of deviation
- Punishment is of finite duration
More research
- Different economic environments:
- Persistent, firm-specific demand?
- Cost shocks?
- What if different reinforcement learning algorithms interact? Will they still be able to cooperate?
- Speed of learning
Slides by Sheng Long