Artificial Intelligence, Algorithmic Pricing, and Collusion

By Calvano et al. (2020)

Highlights and Overview 

  • Experimental approach 
  • Models 
    • For pricing algorithms: Q-Learning 
    • For economic environment: repeated price competition in Bertrand oligopoly setting 
  • Issues:  
    • Time-scale: Q-learning is not fast 
    • State space and action space are discrete 
      • By design, algorithms play pure strategies 

What is collusion

  • Outcome vs Procedure 
  • Is the existence of supracompetitive (i.e. high) prices indicative of collusion? 
  • Or: do we want to focus on collusive strategies instead?
    • i.e. can we see a "reward-punishment scheme designed to provide the incentives for firms to consistently price above the competitive level?"  

Canonical Model

Firms: 

  • Play an infinitely repeated game 
  • Price simultaneously in each stage 
  • Condition their prices on past history 

Maskin and Tirole (1988)

(Two) firms: 

  • Alternate in moving 
  • Commit to a price level for two periods 
  • Condition pricing only on rival's current price 

[Diagram: collusion and staggered pricing]

What is Q-Learning

  • Model-free reinforcement learning algorithm 
  • Proposed to tackle Markov Decision Processes 
  • Pros: 
    • Widely popular among computer scientists 
    • Simple and can be characterized in few parameters 
  • Cons: 
    • Learning process is slow 
    • Discrete state and action space \(\to\) curse of dimensionality 

Markov Decision Process 

In each time period \(t = 0, 1, 2, ... \), an agent 

  • observes a state variable \(s_t \in S\) 
  • chooses an action \(a_t \in A(s_t)\) 
  • obtains a reward \(\pi_t\) 
  • moves on to the next state \(s_{t+1}\) according to distribution \(F(\pi_t, s_{t+1}|s_t, a_t)\)

In Q-Learning, 

  • \(S\) and \(A\) are finite 
  • \(A\) is not dependent on \(S\) 
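A minimal sketch of this interaction loop, assuming a hypothetical `Environment` with finite \(S\) and \(A\) (names, transition rule, and rewards are illustrative, not from the paper):

```python
import random

# Hypothetical finite MDP: 2 states, 2 actions, made-up transition/reward rules.
class Environment:
    def __init__(self):
        self.states = [0, 1]          # S
        self.actions = [0, 1]         # A (same in every state)
        self.state = 0                # current s_t

    def step(self, action):
        # Reward and next state stand in for draws from the distribution F.
        reward = 1.0 if action == self.state else 0.0
        next_state = random.choice(self.states)
        self.state = next_state
        return reward, next_state

env = Environment()
for t in range(5):
    s_t = env.state                   # observe the state
    a_t = random.choice(env.actions)  # choose an action (randomly, for now)
    pi_t, s_next = env.step(a_t)      # obtain a reward, move to the next state
    print(t, s_t, a_t, pi_t, s_next)
```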

Q-Learning example 

[Diagram: example trajectory through states \(s_0, s_1, s_2\) with actions \(a_0, a_1, a_2\), starting at \(t = 0\)]

Pros and Cons of using Q-Learning 

  • How does it work? 
  • Is it the most realistic model that we can think of in the market? 
  • Why is it slow? 

DM's problem 

Decision maker wants to maximize expected present value of the reward stream

\[\mathbb{E} \left[\sum_{t=0}^\infty \delta^t \pi_t\right]\]

 

This is usually solved by using the Bellman equation

\[V(s) = \max_{a \in A} \left\{ \mathbb{E}(\pi|s, a) + \delta \mathbb{E}[V(s') | s, a] \right\}\]

  • \(s'\) is shorthand for \(s_{t+1}\) 

\(\delta \in (0, 1)\) is the discount factor
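For intuition, here is a minimal value-iteration sketch that solves the Bellman equation on a made-up two-state MDP with a known model (all numbers are illustrative; this is not the paper's setting):

```python
import numpy as np

# Toy MDP with a known model (unlike Q-learning, which is model-free).
# P[s, a, s'] = transition probability, R[s, a] = expected period reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
delta = 0.95                      # discount factor

V = np.zeros(2)
for _ in range(1000):             # iterate the Bellman operator toward its fixed point
    V = np.max(R + delta * P @ V, axis=1)

print(V)                          # approximate V(s) for each state
```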

DM's problem cont'd 

Or we can use the Q-function

\[Q(s,a) = \mathbb{E}[\pi | s,a] + \delta \cdot \mathbb{E}[\max_{a' \in A} Q(s', a')| s, a]\] 

 

  • Think of this Q-value as the quality of action \(a\) in state \(s\) 
  • In fact, \(V(s) =\max_{a\in A} \{Q(s,a)\}\)
  • Since \(S\) and \(A\) are finite, \(Q\)-function can be represented as a \(|S|\times |A|\) matrix 

  • In the \(Q\)-function, \(\mathbb{E}[\pi|s,a]\) is the period payoff and \(\delta \cdot \mathbb{E}[\max_{a' \in A} Q(s', a')|s,a]\) is the continuation value 

Q-Learning 

  • Q-learning: approximate the finite \(|S| \times |A| \) Q-matrix without knowing the underlying distribution \(F(\pi_t, s_{t+1}|s_t, a_t)\)
    • \(\to\) hence "model-free" 
  • How does Q-Learning estimate this Q-matrix? 
    1. Start with arbitrary initial matrix \(\bf Q_0\) 
    2. Learning equation: \[Q_{t+1}(s,a) = \textcolor{blue}{(1-\alpha)} \cdot Q_t(s,a) + \textcolor{blue}{\alpha}\cdot[\pi_t + \delta \cdot \max_{a' \in A}Q_t(s',a')]\]
      • A convex combination; \(\alpha \in [0, 1]\) is the learning rate  
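A minimal sketch of the learning equation as code (toy sizes and values are hypothetical; this is not the paper's simulation code):

```python
import numpy as np

n_states, n_actions = 2, 2
alpha, delta = 0.1, 0.95          # learning rate and discount factor

Q = np.zeros((n_states, n_actions))   # arbitrary initial matrix Q_0

def q_update(Q, s, a, reward, s_next):
    # Convex combination of the old estimate and the new target
    # (period payoff + discounted continuation value).
    target = reward + delta * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# One illustrative update: in state 0, action 1 earned reward 1.0 and led to state 1.
Q = q_update(Q, s=0, a=1, reward=1.0, s_next=1)
print(Q)
```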

Q-Learning cont'd

All actions must be tried in all states 

\(\implies\) explore! 🗺️

  • The algorithm uses the \(\epsilon\)-greedy model of exploration  
    • Choose the currently optimal action (i.e. the one with the highest Q-value) with probability \(1 - \epsilon\) 
      • "Exploitation mode" 
    • Randomize across all other actions uniformly with probability \(\epsilon\) 
      • "Exploration mode" 

 

Q-Learning is slow 

Because 

  • It only updates one cell in the Q-matrix at a time 
  • Approximating the true matrix requires that each cell be visited multiple times 
  • More iterations are thus needed for large state and action spaces (see the back-of-the-envelope calculation below) 
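The back-of-the-envelope calculation referenced above, for hypothetical parameter values (not necessarily those used in the paper):

```python
# Hypothetical parameters: m feasible prices, n firms, memory of k periods.
m, n, k = 15, 2, 1

n_states = m ** (n * k)           # all combinations of the last k price vectors
n_cells = n_states * m            # |S| x |A| cells per firm
print(n_states, n_cells)          # 225 states, 3375 Q-cells to estimate
```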

Q-Learning is not stationary 

  • State space may increase over time 
    • \(\to\) Solution: bound the player's memory to the past \(k\) periods
  • But what if the opponent also changes her strategy over time? 
    • \(\to\) there are no general convergence results for strategic games played by Q-learning algorithms 
      • No ex-ante (theoretical) guarantee 
      • However, verifiable ex-post

Economic environment 

  • Infinitely repeated pricing game 
  • Firms act simultaneously and condition their actions on history 
  • Firms also have bounded memory (since we need the state space to be finite) 
  • There are \(n\) differentiated products (each supported by a firm, so \(n\) firms in total) and an outside good 

At each stage

  • Firms compete in prices under logit demand with constant marginal costs \(c_i\) 
  • In each period \(t\), the demand for product \(i = 1, 2, ..., n\) is \[q_{i,t} = \frac{\exp{(\frac{a_i - p_{i, t}}{\mu})}}{\sum_{j=1}^n \exp{(\frac{a_j - p_{j, t}}{\mu})} + \exp{(\frac{a_0}{\mu})}}\]
    • \(a_i\): product quality indexes; captures vertical differentiation 
    • \(\mu\): index of horizontal differentiation 
  • \(\to\) per-period reward for firm \(i\): \[\pi_{i,t} = (p_{i, t} - c_i)\cdot q_{i,t}\] 
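A minimal sketch of the logit demand and per-period reward, with placeholder parameter values (the symbols follow the formulas above; the numbers are not the paper's calibration):

```python
import numpy as np

# Illustrative parameters: 2 symmetric firms plus an outside good.
a = np.array([2.0, 2.0])          # product quality indexes a_i
a0 = 0.0                          # outside-good quality a_0
mu = 0.25                         # horizontal differentiation
c = np.array([1.0, 1.0])          # marginal costs c_i

def demand(p):
    """Logit demand q_i for a vector of prices p."""
    u = np.exp((a - p) / mu)
    return u / (u.sum() + np.exp(a0 / mu))

def profits(p):
    """Per-period rewards pi_i = (p_i - c_i) * q_i."""
    return (p - c) * demand(p)

print(profits(np.array([1.5, 1.5])))
```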

Action space 

  • Compute Bertrand Nash Equilibrium prices \(\mathbf{p}^N\)
  • Compute monopoly prices \(\mathbf{p}^M\)
  • Action space \(A\) is given by \(m\) equally spaced points in the interval \[[\mathbf{p}^N - \zeta(\mathbf{p}^M- \mathbf{p}^N), \mathbf{p}^M+\zeta(\mathbf{p}^M- \mathbf{p}^N)]\]
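A sketch of the price grid construction, assuming \(\mathbf{p}^N\) and \(\mathbf{p}^M\) have already been computed (the numeric values below are placeholders):

```python
import numpy as np

p_nash, p_mono = 1.47, 1.93       # placeholder Bertrand-Nash and monopoly prices
zeta, m = 0.1, 15                 # extension parameter and number of grid points

lo = p_nash - zeta * (p_mono - p_nash)
hi = p_mono + zeta * (p_mono - p_nash)
A = np.linspace(lo, hi, m)        # m equally spaced feasible prices
print(A)
```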

State space 

  • State space \(S\) is the set of all past prices from the last \(k\) periods \[s_t = \{\mathbf{p}_{t-1}, ..., \mathbf{p}_{t-k}\}\]
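One way to map such a state to a row of the Q-matrix is a base-\(m\) encoding of the last \(k\) periods' price indices; a sketch under that assumption (the paper may do this bookkeeping differently):

```python
# Encode the last k price vectors (as indices into the m-point price grid)
# into a single integer state index in {0, ..., m^(n*k) - 1}.
def state_index(price_history, m):
    """price_history: list of k tuples, one price index per firm."""
    idx = 0
    for period in price_history:
        for p in period:
            idx = idx * m + p
    return idx

# Example: n = 2 firms, k = 1 period of memory, m = 15 grid points.
print(state_index([(3, 7)], m=15))   # -> 3*15 + 7 = 52
```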

Exploration in Q-Learning 

  • Uses \(\epsilon\)-greedy model with a time-declining exploration rate \[\epsilon_t = e^{-\beta t}, \beta >0\]
  • i.e. the algorithm explores heavily at first, and shifts more and more towards exploitation as time passes 
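The time-declining exploration rate in code (the value of \(\beta\) is an arbitrary illustration):

```python
import math

beta = 4e-6                        # illustrative decay parameter, beta > 0

def epsilon(t):
    """Exploration probability at period t: starts near 1, decays toward 0."""
    return math.exp(-beta * t)

print(epsilon(0), epsilon(100_000), epsilon(1_000_000))
```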

Verifying Convergence 

  • Convergence is achieved if, for each player, the optimal strategy does not change for 100,000 consecutive periods (see the sketch after this list) 
  • Observation 
    • Nearly all experiment sessions converged 
    • Convergence typically took a large number of repetitions 
    • Most of the sessions that did not converge ended up in price cycles 
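A sketch of the ex-post convergence check described above: track the greedy action in every state and declare convergence once it has been unchanged for 100,000 consecutive periods (the counter bookkeeping is an assumption, not the paper's code):

```python
import numpy as np

CONVERGENCE_WINDOW = 100_000       # consecutive periods with an unchanged optimal strategy

def check_convergence(Q, prev_greedy, stable_count):
    """Return the updated (greedy policy, stability counter, converged?)."""
    greedy = Q.argmax(axis=1)                     # currently optimal action in every state
    if prev_greedy is not None and np.array_equal(greedy, prev_greedy):
        stable_count += 1
    else:
        stable_count = 0
    return greedy, stable_count, stable_count >= CONVERGENCE_WINDOW

# Usage inside the training loop (per player):
Q = np.zeros((4, 3))
greedy, stable, done = check_convergence(Q, None, 0)
print(done)   # False until the greedy policy stays fixed long enough
```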

Conclusion 

  • Algorithms consistently learn to charge supracompetitive prices 
  • Collusion is typically partial and is enforced by punishment in case of deviation 
    • Punishment is of finite duration 

More research 

  • Different economic environments: 
    • Persistent, firm-specific demand? 
    • Cost shocks? 
  • What if different reinforcement learning algorithms interact? Will they still be able to cooperate? 
  • Speed of learning 

Slides by Sheng Long