Chapter 2: Reinforcement Learning
Reinforcement learning (RL) is a major field of machine learning, in which an agent learns to take actions in an environment so as to maximise its accumulated reward.
In this chapter, you will learn some of the fundamentals of RL, and how to work with OpenAI's Gym environments to run your own experiments. You'll also learn about Reinforcement Learning from Human Feedback, and apply it to the transformers you trained in the previous section.
Reinforcement Learning: The Basics
This section is designed to bring you up to speed with the basics of reinforcement learning. Before we cover the big topics like DQN, VPG, PPO and RLHF, we need to build a strong theoretical foundation first.
For now, we will assume the environment has a finite set of states and actions, and we explore two cases:

1. The dynamics of the environment are directly accessible to the agent, so we can solve for an optimal policy analytically.
2. The environment is a black box the agent can only sample from, so it must learn how the environment works through play.
This day's material is more theory-heavy, as the goal is to understand the fundamentals of RL before we apply deep learning to it.
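When the dynamics are known (case 1 above), we can compute an optimal policy directly with dynamic programming. As a flavour of what this looks like, here is a minimal value-iteration sketch on a toy two-state MDP; the transition and reward tables are purely illustrative:

```python
import numpy as np

# Toy MDP with known dynamics: 2 states, 2 actions (values are illustrative).
# T[s, a, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions = 2, 2
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.8, 0.2]],   # transitions from state 1
])
R = np.array([
    [0.0, 1.0],                 # rewards in state 0
    [2.0, 0.0],                 # rewards in state 1
])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T @ V          # Q[s, a], since (T @ V) sums over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged Q
```

In case 2, where the dynamics are hidden, the same Bellman backup can only be approximated from sampled transitions, which is what motivates the learning algorithms later in the chapter.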
Deep Q-Learning
Deep Q-Learning combines deep learning and reinforcement learning, using a neural network (a Deep Q-Network, or DQN) to estimate Q-values. DQN was introduced in a landmark paper: Playing Atari with Deep Reinforcement Learning.
Today’s content covers the basic DQN algorithm, as well as the concept of experience replay and target networks.
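To preview how these pieces fit together, here is a minimal sketch of a DQN training step with experience replay and a target network. The network sizes, hyperparameters, and random transitions used to fill the buffer are all illustrative, not the course's reference implementation:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Minimal DQN ingredients: an online Q-network, a target network, and an
# experience-replay buffer (sizes and hyperparameters are illustrative).
obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # target starts as a frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)             # stores (s, a, r, s', done)

def train_step(batch_size=32, gamma=0.99):
    # Sampling uniformly from the buffer breaks the correlation between
    # consecutive transitions (the point of experience replay)
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, d = zip(*batch)
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    d = torch.tensor(d, dtype=torch.float32)
    # TD targets come from the *target* network, decoupling them from the
    # network being updated (this stabilises training)
    with torch.no_grad():
        target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fill the buffer with random transitions (in practice these would come from
# acting epsilon-greedily in the environment), then take one gradient step.
for _ in range(100):
    replay_buffer.append((torch.randn(obs_dim), random.randrange(n_actions),
                          random.random(), torch.randn(obs_dim), False))
loss = train_step()
# Periodically the target is re-synced: target_net.load_state_dict(q_net.state_dict())
```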
Vanilla Policy Gradient
Today we introduce policy-based methods, where the objective is to directly optimise the policy function rather than a value function. Vanilla Policy Gradient (VPG) is perhaps the simplest of these.
In this set of exercises, you'll implement VPG, the first policy gradient algorithm upon which many modern RL algorithms are based (including PPO).
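The core idea can be sketched in a few lines: collect a trajectory, compute the reward-to-go at each step, and push up the log-probability of each taken action in proportion to it. This is a REINFORCE-style sketch on synthetic data; the network size, learning rate, and fake trajectory are illustrative:

```python
import torch
import torch.nn as nn

# A REINFORCE-style vanilla policy gradient update (dimensions are illustrative).
obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def vpg_update(obs, actions, rewards):
    """obs: (T, obs_dim) floats, actions: (T,) ints, rewards: list of T floats."""
    # Reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    # log pi(a_t | s_t) for the actions actually taken
    logp = torch.log_softmax(policy(obs), dim=1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Negate because we do gradient *ascent* on E[log pi * G] via a descent optimiser
    loss = -(logp_a * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One update on a fake 5-step trajectory (stand-in for real environment rollouts)
obs = torch.randn(5, obs_dim)
actions = torch.randint(0, n_actions, (5,))
loss = vpg_update(obs, actions, [1.0, 0.0, 1.0, 0.0, 1.0])
```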
Proximal Policy Optimisation
Proximal Policy Optimization (PPO) is a widely used policy gradient algorithm. It improves on vanilla policy optimisation by addressing sample efficiency and training stability: each update is constrained so the new policy cannot move too far from the one that collected the data. PPO performs well across a wide range of environments, including robotics, game playing, and control tasks, and it strikes an effective balance between exploration and exploitation.
In this section, you'll build your own agent to perform PPO on the CartPole environment. By the end, you should be able to train your agent to near perfect performance in about 30 seconds. You'll also be able to try out other things like reward shaping, to make it easier for your agent to learn to balance, or to do fun tricks! There are also additional exercises which allow you to experiment with other tasks, including Atari and the 3D physics engine MuJoCo.
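The constraint mentioned above is implemented via PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped so that advantages cannot push the policy arbitrarily far in one update. A minimal sketch of that loss (function name and tensor shapes are our own, for illustration):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be minimised).

    logp_new, logp_old: log-probs of the taken actions under the current
    and data-collecting policies; advantages: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (elementwise minimum) objective, negate for descent
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

When the new and old policies agree (ratio = 1), this reduces to the plain policy-gradient surrogate; once the ratio leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favours, the gradient is cut off.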
RL from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is an approach for teaching agents using direct human feedback alongside reward signals. It typically uses PPO as the underlying reinforcement learning algorithm to train the agent.
This section is designed to take you through a full implementation of RLHF. Much of this follows on directly from the PPO implementation from yesterday, with only a few minor adjustments and new concepts. You'll (hopefully) be pleased to learn that we're dispensing with OpenAI's gym environment for this final day of exercises, and instead going back to our week 1 roots with TransformerLens!
We'll start by discussing how the RL setting we've used for tasks like CartPole and Atari fits into the world of autoregressive transformer language models. We'll then go through standard parts of the PPO setup (e.g. objective function, memory buffer, rollout and learning phases) and show how to adapt them for our transformer. Finally, we'll put everything together into an `RLHFTrainer` class, and perform RLHF on our transformer!
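A key conceptual shift is that the "actions" become generated tokens, and the reward typically combines a reward-model score with a per-token KL penalty keeping the tuned policy close to the reference (pre-RLHF) model. A sketch of that reward computation with toy logits; the function name and `kl_coef` value are our own illustrative choices, not part of any particular library:

```python
import torch
import torch.nn.functional as F

def rlhf_reward(policy_logits, ref_logits, tokens, rm_score, kl_coef=0.1):
    """RLHF-style scalar reward for one generated sequence.

    policy_logits, ref_logits: (seq, vocab) logits from the tuned policy and
    the frozen reference model; tokens: (seq,) generated token ids;
    rm_score: scalar score from the reward model.
    """
    # Log-probs of the tokens actually generated, under each model
    logp = F.log_softmax(policy_logits, dim=-1).gather(1, tokens.unsqueeze(1)).squeeze(1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(1, tokens.unsqueeze(1)).squeeze(1)
    # Sample-based per-token KL estimate between policy and reference
    per_token_kl = logp - ref_logp
    # Penalise divergence from the reference model so the policy doesn't
    # drift into degenerate text that merely games the reward model
    return rm_score - kl_coef * per_token_kl.sum()
```

If the tuned policy hasn't moved at all (identical logits), the penalty vanishes and the reward is just the reward-model score; the further generation drifts from the reference distribution, the larger the deduction.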