Based on Calvano et al. (2020)
Two firms:
In each time period \(t = 0, 1, 2, \dots\), an agent observes the current state \(s_t\) and chooses an action \(a_t\). In Q-learning, play thus unfolds as the sequence \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\), starting from \(s_0\) at \(t = 0\).
The decision maker wants to maximize the expected present value of the reward stream:
\[\mathbb{E} \left[\sum_{t=0}^\infty \delta^t \pi_t\right]\]
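As a sanity check on this objective, a constant payoff stream \(\pi_t = \pi\) reduces to a geometric series (the numbers below are illustrative, not from the paper):

\[\mathbb{E}\left[\sum_{t=0}^\infty \delta^t \pi\right] = \sum_{t=0}^\infty \delta^t \pi = \frac{\pi}{1-\delta}\]

So with \(\delta = 0.95\), a per-period payoff of 1 has present value 20.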
This is usually solved using the Bellman equation:
\[V(s) = \max_{a \in A} \left\{ \mathbb{E}[\pi \mid s, a] + \delta \, \mathbb{E}[V(s') \mid s, a] \right\}\]
\(\delta \in (0, 1)\) is the discount factor
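When the payoff and transition structure are known, the Bellman equation can be solved by value iteration. A minimal sketch on a toy two-state MDP (the states, payoffs, and transition probabilities here are illustrative assumptions, not the pricing game from the paper):

```python
# Value iteration for V(s) = max_a { E[pi|s,a] + delta * E[V(s')|s,a] }
delta = 0.95  # discount factor

# P[s][a] = list of (probability, next_state); R[s][a] = expected period payoff.
# Toy environment, chosen only for illustration.
P = {
    0: {0: [(1.0, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}

V = {0: 0.0, 1: 0.0}
for _ in range(1000):  # the Bellman operator is a contraction, so this converges
    V = {
        s: max(
            R[s][a] + delta * sum(p * V[s2] for p, s2 in P[s][a])
            for a in P[s]
        )
        for s in P
    }
```

Here staying in state 1 (action 1) pays 2 forever, so \(V(1) = 2/(1-\delta) = 40\).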
Equivalently, we can use the Q-function:
\[Q(s,a) = \underbrace{\mathbb{E}[\pi \mid s,a]}_{\text{period payoff}} + \underbrace{\delta \, \mathbb{E}\!\left[\max_{a' \in A} Q(s', a') \,\middle|\, s, a\right]}_{\text{continuation value}}\]
All actions must be tried at all stages
\(\implies\) explore! 🗺️
Because the expected payoffs are unknown, the agent must learn them from experience rather than compute them directly.
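The exploration requirement above can be met with an \(\varepsilon\)-greedy rule: with small probability pick a random action, otherwise exploit the current Q estimates. A minimal tabular Q-learning sketch on a toy deterministic environment (the `step` dynamics and all parameter values are illustrative assumptions, not the paper's simulation):

```python
import random

random.seed(0)
delta = 0.95   # discount factor
alpha = 0.1    # learning rate
eps = 0.1      # exploration probability

n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Hypothetical toy dynamics: (payoff, next state)."""
    if s == 1 and a == 1:
        return 2.0, 1       # staying in state 1 pays 2 per period
    if s == 0 and a == 0:
        return 1.0, 0       # staying in state 0 pays 1 per period
    return 0.0, 1 - s       # switching states pays nothing

s = 0
for _ in range(50_000):
    # explore with probability eps, otherwise exploit the current estimate
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda x: Q[s][x])
    pi, s_next = step(s, a)
    # move Q(s,a) toward the sampled Bellman target: pi + delta * max_a' Q(s',a')
    Q[s][a] += alpha * (pi + delta * max(Q[s_next]) - Q[s][a])
    s = s_next
```

Without the exploration branch, the agent can lock onto the initially greedy action and never discover that staying in state 1 is better.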