Inspired by annealing in metallurgy, where controlled heating and cooling drive a material toward a stable, low-energy state.
An optimization approach that mimics this process to solve NP-hard problems, such as finding ground states of spin glasses.
At high temperatures, the system explores a wide range of states, avoiding local minima.
As the temperature decreases, the system settles into an optimal (or near-optimal) solution.
Simulated Annealing
Let \( s = s_0 \)
For \( k = 0 \) through \( k_{\max} \) (exclusive):
\( T \leftarrow \text{temperature} \left( 1 - \frac{k+1}{k_{\max}} \right) \)
Pick a random neighbour, \( s_{\text{new}} \leftarrow \text{neighbour}(s) \)
If \( P(E(s), E(s_{\text{new}}), T) \geq \text{random}(0,1) \):
\( s \leftarrow s_{\text{new}} \)
Output: the final state \( s \)
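For concreteness, here is a minimal Python sketch of the loop above. The acceptance probability \(P\) is taken to be the standard Metropolis criterion \(\exp(-(E(s_{\text{new}})-E(s))/T)\) (the pseudocode leaves \(P\) abstract), and energy and neighbour are problem-specific callables supplied by the caller.

```python
import math
import random

def simulated_annealing(s0, energy, neighbour, k_max, T0=1.0):
    # Minimal sketch of the pseudocode above. `energy` and `neighbour` are
    # problem-specific callables; the acceptance rule is the Metropolis criterion.
    s = s0
    for k in range(k_max):
        T = T0 * (1 - (k + 1) / k_max)        # linear cooling toward T = 0
        s_new = neighbour(s)
        dE = energy(s_new) - energy(s)
        # Always accept improvements; accept worse states with probability exp(-dE/T).
        if dE <= 0 or (T > 0 and random.random() < math.exp(-dE / T)):
            s = s_new
    return s
```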
The Problem
The success of Simulated Annealing depends critically on the temperature schedule.
Temperature schedules are mostly based on heuristics (two common examples are sketched after this list):
No adaptation to specific problem instances.
Requires manual tuning for different Hamiltonians.
Scaling to more complex problems is cumbersome.
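For illustration, two widely used hand-crafted schedules are linear and geometric cooling. The constants T0 and alpha below are arbitrary placeholders, and they are exactly the hyperparameters that must be re-tuned for each Hamiltonian.

```python
# Two common heuristic cooling schedules; T0 and alpha are hand-tuned per problem.
def linear_schedule(k, k_max, T0=10.0):
    return T0 * (1 - k / k_max)

def geometric_schedule(k, T0=10.0, alpha=0.99):
    return T0 * alpha ** k
```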
The Solution: RL
Reinforcement Learning (RL) is a proven method for optimizing a procedure with a well-defined success measure.
Deep reinforcement learning has produced agents with superhuman performance at video games and board games, and has been applied to protein folding and chip design.
Main Idea: Use RL to control the temperature schedule of simulated annealing.
RL for Simulated Annealing
An RL agent receives an input state \(s_t\) at time \(t\), and decides to take action \(a_t\) based on a learnt policy function \(\pi_\theta(a_t \vert s_t)\) parameterized by \(\theta\) to obtain a reward \(R_t\).
In Deep RL, the policy function is represented by a neural network.
In controlling Simulated Annealing, the agent observes the spins \(\sigma_i\) at time \(t\) and decides to change the inverse temperature by \(\Delta \beta\), obtaining a reward proportional to the negative of the energy of the system.
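As a toy illustration of such a policy (the architecture here is an assumption, not the one used in the work), a single linear layer can map the observed spins to the mean and standard deviation of a Gaussian over \(\Delta\beta\):

```python
import numpy as np

class GaussianPolicy:
    # Toy Gaussian policy: one linear layer mapping the flattened spins to
    # (mu, log_sigma). A real agent would use a deeper network trained with PPO.
    def __init__(self, n_inputs, rng):
        self.W = 0.01 * rng.standard_normal((2, n_inputs))
        self.b = np.zeros(2)

    def __call__(self, spins):
        mu, log_sigma = self.W @ np.ravel(spins) + self.b
        return mu, np.exp(log_sigma)
```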
RL for SA
State:
\(N_{\text{reps}}\) replicas of randomly initialized spin glasses, each with \(L\) binary spins.
Action:
A change in inverse temperature \(\Delta\beta\), sampled from \(\mathcal{N}(\mu, \sigma^2)\), where \(\mu\) and \(\sigma^2\) are outputs of the agent.
Reward:
Negative of the minimum energy achieved across replicas (a single interaction step is sketched below).
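Putting the pieces together, here is a minimal sketch of one agent-environment interaction for an Ising spin glass. The names are assumptions for illustration: policy_net is any callable returning \((\mu, \sigma)\) (e.g. the toy policy above), the coupling matrix J is assumed symmetric with zero diagonal, and the update is a plain single-spin-flip Metropolis sweep, not necessarily the exact dynamics used in the work.

```python
import numpy as np

def energy(spins_row, J):
    # Ising spin-glass energy E = -1/2 * s^T J s (J symmetric, zero diagonal).
    return -0.5 * spins_row @ J @ spins_row

def metropolis_sweep(spins, J, beta, rng):
    # One single-spin-flip Metropolis pass over every spin of every replica.
    n_reps, L = spins.shape
    for r in range(n_reps):
        for i in rng.permutation(L):
            dE = 2.0 * spins[r, i] * (J[i] @ spins[r])  # energy change if spin i flips
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                spins[r, i] *= -1
    return spins

def rl_sa_step(spins, beta, policy_net, J, rng):
    # One interaction: observe spins, sample Delta-beta, anneal, collect reward.
    mu, sigma = policy_net(spins)                  # agent outputs Gaussian parameters
    delta_beta = rng.normal(mu, sigma)             # action: change in inverse temperature
    beta = max(beta + delta_beta, 0.0)             # keep beta non-negative
    spins = metropolis_sweep(spins, J, beta, rng)  # update all replicas at the new beta
    reward = -min(energy(s, J) for s in spins)     # negative of the minimum replica energy
    return spins, beta, reward
```

A toy run might initialize spins with rng.choice([-1, 1], size=(N_reps, L)) and call rl_sa_step once per RL time step, first with a dummy policy such as lambda spins: (0.05, 0.01) and later with the trained network.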
RL Optimization Process
The agent (a neural network) is optimized using a well-known RL algorithm, Proximal Policy Optimization (PPO); its objective combines the three terms below.
\(L^{CLIP}\) represents the policy loss: how likely is the network to take actions that maximize the reward?
\(L^{VF}\) represents the value estimation: how well can the agent estimate the future rewards?
\(S[\pi_\theta]\) represents the entropy of the action distribution: how diverse are the actions that the agent takes?
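For reference, PPO (Schulman et al., 2017) combines these three terms into a single objective that is maximized over \(\theta\), with coefficients \(c_1, c_2\) weighting the value loss and the entropy bonus:

\[
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right],
\]

where the clipped policy term uses the probability ratio \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) and an advantage estimate \(\hat{A}_t\):

\[
L_t^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t \big) \right].
\]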