Chapter 2: Reinforcement Learning
Reinforcement learning (RL) is a major field of machine learning, in which an agent learns to take actions in an environment so as to maximise its accumulated reward.
In this chapter, you will learn some of the fundamentals of RL, and how to work with OpenAI's Gym environments to run your own experiments. You'll also learn about Reinforcement Learning from Human Feedback, and apply it to the transformers you trained in the previous section.
Reinforcement Learning: The Basics
This section is designed to bring you up to speed with the basics of reinforcement learning. Before we cover the big topics like DQN, VPG, PPO and RLHF, we need to build a strong theoretical foundation first.
For now, we will assume the environment has a finite set of states and actions, and we explore two cases:

1. The dynamics of the environment are directly accessible to the agent, so we can solve for an optimal policy analytically.
2. The environment is a black box the agent can only sample from, so it must learn how the environment works through play.
This day's material is more theory-heavy, as the goal is to understand the fundamentals of RL before we apply deep learning to it.
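When the dynamics are known (case 1 above), we can compute an optimal policy directly with dynamic programming. As a flavour of what this looks like, here is a minimal value-iteration sketch on a toy two-state MDP; the transition and reward tables are purely illustrative:

```python
import numpy as np

# Toy MDP with known dynamics: 2 states, 2 actions (values are illustrative).
# T[s, a, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions = 2, 2
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.8, 0.2]],   # transitions from state 1
])
R = np.array([
    [0.0, 1.0],                 # rewards in state 0
    [2.0, 0.0],                 # rewards in state 1
])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T @ V          # Q[s, a], since (T @ V) sums over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged Q
```

In case 2, where the dynamics are hidden, the same Bellman backup can only be approximated from sampled transitions, which is what motivates the learning algorithms later in the chapter.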
Deep Q-Learning
Deep Q-Learning combines deep learning and reinforcement learning, using a neural network (a Deep Q-Network, or DQN) to estimate Q-values. DQN was introduced in a landmark paper: Playing Atari with Deep Reinforcement Learning.
Today’s content covers the basic DQN algorithm, as well as the concept of experience replay and target networks.
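To preview how these pieces fit together, here is a minimal sketch of a DQN training step with experience replay and a target network. The network sizes, hyperparameters, and random transitions used to fill the buffer are all illustrative, not the course's reference implementation:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Minimal DQN ingredients: an online Q-network, a target network, and an
# experience-replay buffer (sizes and hyperparameters are illustrative).
obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # target starts as a frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)             # stores (s, a, r, s', done)

def train_step(batch_size=32, gamma=0.99):
    # Sampling uniformly from the buffer breaks the correlation between
    # consecutive transitions (the point of experience replay)
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, d = zip(*batch)
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    d = torch.tensor(d, dtype=torch.float32)
    # TD targets come from the *target* network, decoupling them from the
    # network being updated (this stabilises training)
    with torch.no_grad():
        target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fill the buffer with random transitions (in practice these would come from
# acting epsilon-greedily in the environment), then take one gradient step.
for _ in range(100):
    replay_buffer.append((torch.randn(obs_dim), random.randrange(n_actions),
                          random.random(), torch.randn(obs_dim), False))
loss = train_step()
# Periodically the target is re-synced: target_net.load_state_dict(q_net.state_dict())
```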
Vanilla Policy Gradient
Today we introduce policy-based methods, where the objective is to directly optimise the policy function rather than a value function. Vanilla Policy Gradient (VPG) is perhaps the simplest of these.
In this set of exercises, you'll implement VPG, the first policy gradient algorithm upon which many modern RL algorithms are based (including PPO).
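The core idea can be sketched in a few lines: collect a trajectory, compute the reward-to-go at each step, and push up the log-probability of each taken action in proportion to it. This is a REINFORCE-style sketch on synthetic data; the network size, learning rate, and fake trajectory are illustrative:

```python
import torch
import torch.nn as nn

# A REINFORCE-style vanilla policy gradient update (dimensions are illustrative).
obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def vpg_update(obs, actions, rewards):
    """obs: (T, obs_dim) floats, actions: (T,) ints, rewards: list of T floats."""
    # Reward-to-go: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    # log pi(a_t | s_t) for the actions actually taken
    logp = torch.log_softmax(policy(obs), dim=1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Negate because we do gradient *ascent* on E[log pi * G] via a descent optimiser
    loss = -(logp_a * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One update on a fake 5-step trajectory (stand-in for real environment rollouts)
obs = torch.randn(5, obs_dim)
actions = torch.randint(0, n_actions, (5,))
loss = vpg_update(obs, actions, [1.0, 0.0, 1.0, 0.0, 1.0])
```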
Proximal Policy Optimisation
Proximal Policy Optimization (PPO) is a widely used policy gradient algorithm. It improves on vanilla policy optimisation by addressing sample efficiency and training stability: each update is constrained so the new policy cannot move too far from the one that collected the data. PPO performs well across a wide range of environments, including robotics, game playing, and control tasks, and it strikes an effective balance between exploration and exploitation.
In this section, you'll build your own agent to perform PPO on the CartPole environment. By the end, you should be able to train your agent to near perfect performance in about 30 seconds. You'll also be able to try out other things like reward shaping, to make it easier for your agent to learn to balance, or to do fun tricks! There are also additional exercises which allow you to experiment with other tasks, including Atari and the 3D physics engine MuJoCo.
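The constraint mentioned above is implemented via PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped so that advantages cannot push the policy arbitrarily far in one update. A minimal sketch of that loss (function name and tensor shapes are our own, for illustration):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be minimised).

    logp_new, logp_old: log-probs of the taken actions under the current
    and data-collecting policies; advantages: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (elementwise minimum) objective, negate for descent
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

When the new and old policies agree (ratio = 1), this reduces to the plain policy-gradient surrogate; once the ratio leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favours, the gradient is cut off.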
RL from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is an approach for teaching agents using direct human feedback alongside reward signals. It typically uses PPO as the underlying reinforcement learning algorithm to train the agent.
This section is designed to take you through a full implementation of RLHF. Much of this follows on directly from the PPO implementation from yesterday, with only a few minor adjustments and new concepts. You'll (hopefully) be pleased to learn that we're dispensing with OpenAI's gym environment for this final day of exercises, and instead going back to our week 1 roots with TransformerLens!
We'll start by discussing how the RL setting we've used for tasks like CartPole and Atari fits into the world of autoregressive transformer language models. We'll then go through standard parts of the PPO setup (e.g. objective function, memory buffer, rollout and learning phases) and show how to adapt them for our transformer. Finally, we'll put everything together into an `RLHFTrainer` class, and perform RLHF on our transformer!
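A key conceptual shift is that the "actions" become generated tokens, and the reward typically combines a reward-model score with a per-token KL penalty keeping the tuned policy close to the reference (pre-RLHF) model. A sketch of that reward computation with toy logits; the function name and `kl_coef` value are our own illustrative choices, not part of any particular library:

```python
import torch
import torch.nn.functional as F

def rlhf_reward(policy_logits, ref_logits, tokens, rm_score, kl_coef=0.1):
    """RLHF-style scalar reward for one generated sequence.

    policy_logits, ref_logits: (seq, vocab) logits from the tuned policy and
    the frozen reference model; tokens: (seq,) generated token ids;
    rm_score: scalar score from the reward model.
    """
    # Log-probs of the tokens actually generated, under each model
    logp = F.log_softmax(policy_logits, dim=-1).gather(1, tokens.unsqueeze(1)).squeeze(1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(1, tokens.unsqueeze(1)).squeeze(1)
    # Sample-based per-token KL estimate between policy and reference
    per_token_kl = logp - ref_logp
    # Penalise divergence from the reference model so the policy doesn't
    # drift into degenerate text that merely games the reward model
    return rm_score - kl_coef * per_token_kl.sum()
```

If the tuned policy hasn't moved at all (identical logits), the penalty vanishes and the reward is just the reward-model score; the further generation drifts from the reference distribution, the larger the deduction.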