Reinforcement learning tries to solve complex problems by interacting with them continuously. You let your Agent make decisions within the Environment, observe the results, and update the Agent based on the Reward it receives. Then repeat!
Reinforcement learning is a mathematical framework for learning behavior in an environment through trial and error.
Central ideas
In this workshop, we will be teaching an agent to control the green paddle and win against a ball-tracking CPU.
In our case:
- the Agent is our neural network policy, which controls the green paddle,
- the Environment is the Pong game itself, and
- the Reward is +1 when we win a point and -1 when we lose one.
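To make the loop concrete, here is a minimal sketch of one episode of the Agent/Environment/Reward cycle using the classic (pre-0.26) OpenAI Gym API; the environment name "Pong-v0" and the random action choice are placeholders rather than the workshop repo's actual setup.

```python
import gym

# Sketch of the interaction loop: Agent acts, Environment responds, Reward comes back.
# "Pong-v0" and the random policy below are stand-ins; the workshop code sets this up for you.
env = gym.make("Pong-v0")
observation = env.reset()          # initial screen pixels
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()                   # the Agent's decision (random here)
    observation, reward, done, info = env.step(action)   # the Environment responds
    total_reward += reward                                # +1 / -1 whenever a point is scored

print("Episode reward:", total_reward)
```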
RL has been around for half a century.
Previously, policies were learned using simple models. In the past 10 years, deep learning has become popular and enabled a series of important breakthroughs in RL.
Besides games, emerging application areas include robotics and optimization.
In 2016, DeepMind's AlphaGo beat Lee Sedol, one of the world's strongest Go players.
A major challenge of RL is the credit assignment problem: rewards often arrive long after the actions that caused them, so which earlier actions should get the credit (or blame)?
The specific algorithm we'll be using to train our agent is called policy gradients.
It proposes one solution to the credit assignment problem.
General idea:
Our loss function to minimize will be $-\mathbb{E}_t[R(t)]$, the negative expected value (average) of the reward of our neural network's policy.
We use gradient descent to find model parameters that minimize the loss. To do this, we must calculate $\nabla_\theta(-\mathbb{E}[R(t)])$, the gradient of the loss with respect to the model parameters $\theta$.
Skipping over some details, the policy gradient theorem states that:
$$\nabla_\theta\left(-\mathbb{E}[R(t)]\right) = -\mathbb{E}\left[R(t)\,\nabla_\theta \log p(a_t)\right]$$
where $p(a_t)$ is the probability our policy assigned to the action it chose at time $t$.
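As a concrete illustration, here is a minimal PyTorch-style sketch of this loss; the function and argument names (`policy_gradient_loss`, `log_probs`, `returns`) are illustrative assumptions, not the exact variables used in the workshop code.

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """Sketch of the policy gradient loss -E[R(t) * log p(a_t)].

    log_probs: log p(a_t) of the actions the policy actually took, shape (T,)
    returns:   discounted rewards R(t), shape (T,)
    (Names and shapes are illustrative, not the workshop repo's API.)
    """
    # Minimizing this pushes up the probability of actions followed by
    # positive reward and pushes down the probability of the rest.
    return -(returns * log_probs).mean()
```

Calling `.backward()` on this quantity produces exactly the gradient on the right-hand side of the theorem, averaged over the sampled time-steps.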
One last important detail:
In reality, we don't use the raw +1 or -1 rewards. We use the discounted reward
$$R(t) = \sum_{k=0}^{T-t} \gamma^k \, r(s_{t+k})$$
where $\gamma$ is the discount factor, a number between 0 and 1.
Intuitively, the action taken at time $t$ is credited with the rewards of future time-steps, but exponentially less so the further in the future they occur.
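Here is a minimal NumPy sketch of that computation, working backwards through an episode's rewards; the function name and signature are illustrative and may not match what the workshop tests expect.

```python
import numpy as np

def discounted_rewards(rewards, gamma=0.99):
    """Compute R(t) = sum_k gamma^k * r(s_{t+k}) for every time-step t."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Work backwards through the episode: R(t) = r(t) + gamma * R(t+1)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: the point is won on the last step of a 4-step episode.
print(discounted_rewards([0.0, 0.0, 0.0, 1.0]))
# -> approximately [0.970299, 0.9801, 0.99, 1.0]
```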
2. Star 🌟 the repository 🥰
3. Clone and `cd` into the repo:
$ git clone https://github.com/stewy33/pong-with-policy-gradients.git
$ cd pong-with-policy-gradients
4. Follow the instructions in the README! Depending on your platform, there may be installation errors - just let us know.
In the interest of time, we had to move quickly through the math, so let me know if you get stuck and I'll come to your breakout room.
Use TensorBoard judiciously. If performance doesn't improve, there is likely a bug in your code; if it's not obvious what it is, call me in.
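If you want to add your own logging, here is a minimal sketch using PyTorch's built-in `SummaryWriter`; the log directory, tag, and loop variables are placeholders, and the workshop code may already log these for you.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/pong")   # view with: tensorboard --logdir runs

# Inside your training loop (episode and episode_reward are placeholder names):
for episode in range(3):
    episode_reward = -21.0 + episode          # dummy value just for illustration
    writer.add_scalar("reward/episode", episode_reward, episode)

writer.close()
```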
The tests are a guide, but they only cover the discounted-reward calculation and the policy network, so passing them doesn't guarantee correctness.
Be very careful and double check everything!
Andrej Karpathy's "A Recipe for Training Neural Networks" is a useful debugging reference: http://karpathy.github.io/2019/04/25/recipe/
But most importantly, have fun! 🤠