Chapter 2 – Reinforcement Learning

Reinforcement learning is a major branch of machine learning in which agents learn to take actions in an environment so as to maximise their accumulated reward.

In this chapter, you will learn some of the fundamentals of RL and how to use OpenAI’s Gym environments to run your own experiments. You’ll also learn about Reinforcement Learning from Human Feedback (RLHF) and apply it to the transformers you trained in the previous section.


Reinforcement Learning: The Basics

Reinforcement Learning (RL) basics cover the fundamental concepts of an agent learning to act in an environment through rewards and penalties. Key topics include defining the agent, environment, states, actions, and rewards, along with basic algorithms such as Q-Learning and SARSA.
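
As a concrete first example, here is a minimal tabular Q-learning sketch. It assumes the gymnasium package (the maintained fork of OpenAI’s Gym) and the FrozenLake-v1 environment; the hyperparameters are illustrative rather than tuned.

```python
# Minimal tabular Q-learning sketch on FrozenLake-v1 (gymnasium assumed).
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: bootstrap from the greedy value of the next state
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```

SARSA differs from this only in its update target: it bootstraps from the action the policy actually takes in the next state rather than from the greedy maximum.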


Deep Q-Learning

Deep Q-Learning combines deep learning and reinforcement learning, using a neural network (the Deep Q-Network, or DQN) to estimate Q-values. Today’s content covers the basic DQN algorithm, as well as the concepts of experience replay and target networks.
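
The sketch below shows those three ingredients together: an online Q-network, a frozen target network, and an experience-replay buffer. PyTorch is assumed; the network sizes, buffer length, and hyperparameters are placeholders rather than recommendations.

```python
# Sketch of the core DQN update with experience replay and a target network.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

def make_q_net(obs_dim, n_actions):
    # Small fully connected network mapping an observation to one Q-value per action
    return nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, n_actions),
    )

obs_dim, n_actions, gamma = 4, 2, 0.99           # CartPole-sized placeholder dimensions
q_net = make_q_net(obs_dim, n_actions)           # online network, updated every step
target_net = make_q_net(obs_dim, n_actions)      # frozen copy, re-synced periodically
target_net.load_state_dict(q_net.state_dict())

replay_buffer = deque(maxlen=50_000)             # experience replay: (s, a, r, s', done)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch_size=64):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    s  = torch.as_tensor(np.array(states), dtype=torch.float32)
    a  = torch.as_tensor(actions, dtype=torch.int64)
    r  = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    d  = torch.as_tensor(dones, dtype=torch.float32)

    # Q-value the online network assigns to the action actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target computed with the frozen target network
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - d)

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Periodically (for example every few thousand environment steps) the target network is re-synchronised with the online network via load_state_dict, which keeps the bootstrap targets from chasing a constantly moving estimate.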


Vanilla Policy Gradient

Today we introduce policy-based methods, where the objective is to optimise the policy directly rather than a value function. Vanilla Policy Gradient (VPG) is perhaps the simplest of these.
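
A minimal REINFORCE-style VPG update for a single collected episode might look like the sketch below, assuming PyTorch; policy_net is a placeholder for whatever network maps observations to action logits.

```python
# Sketch of a vanilla policy-gradient (REINFORCE-style) update for one episode.
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # placeholder policy
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

def vpg_update(observations, actions, rewards):
    """observations: [T, obs_dim] float tensor, actions: [T] int tensor, rewards: list of T floats."""
    # Rewards-to-go: discounted return from each timestep onward
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Log-probability of each action actually taken under the current policy
    logits = policy_net(observations)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)

    # Policy gradient: ascend E[log pi(a|s) * R] by minimising its negative
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```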


Proximal Policy Optimisation

Proximal Policy Optimisation (PPO) is a policy gradient method that constrains each policy update, typically by clipping the objective, to prevent drastic changes and promote stability and robustness in learning.
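
The core of clipped PPO is the surrogate objective sketched below (PyTorch assumed); the log-probabilities and advantages are taken as given from a rollout collected under the previous policy.

```python
# Sketch of PPO's clipped surrogate objective, the piece that constrains each update.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the data-collecting policy
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate objectives
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages

    # Take the pessimistic (minimum) of the two and negate it to get a loss
    return -torch.min(unclipped, clipped).mean()
```

In a full implementation this term is typically combined with a value-function loss and an entropy bonus, and the advantages are usually estimated with a method such as GAE.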


RL from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is an approach for teaching agents from both rewards and direct human feedback, typically by fitting a reward model to human preferences. PPO is used as the underlying reinforcement learning algorithm to train the agent.
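
As a flavour of the reward-modelling stage, the sketch below (PyTorch assumed) shows the pairwise preference loss commonly used to fit a reward model from human comparisons; the policy is then fine-tuned against that reward model with PPO. The scores shown are hypothetical placeholders, not real data.

```python
# Sketch of the pairwise preference loss used to fit a reward model in RLHF.
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards, rejected_rewards):
    # Bradley-Terry style objective: the reward model should score the
    # human-preferred response higher than the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical scalar scores from a reward model for a batch of comparison pairs
chosen = torch.tensor([1.3, 0.2, 0.9])      # scores for the preferred responses
rejected = torch.tensor([0.4, 0.5, -0.1])   # scores for the rejected responses
loss = preference_loss(chosen, rejected)    # small when chosen consistently outscores rejected
```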