Key Ideas in RL#
TLDR
The central concept of reinforcement learning:
is quite simple.
is applicable in many domains.
The complexity comes from:
the wide variety of options.
specialized terminology.
computational complexity.
By focusing specifically on RL for LLMs, we can concentrate on the most common implementations, which simplifies both learning and implementation.
A Straightforward Idea with Many Nuances#
When reading through reinforcement learning literature, it is easy to miss the forest for the trees: the many specialized terms and mathematical formulae can quickly become overwhelming. This guide covers the high-level ideas and basic terminology before we move on. Use it as a reference as you work through the subsequent notebooks.
The RL Loop#
The concept of reinforcement learning is quite simple, with direct parallels to everyday life. An agent takes actions, and it receives rewards. Some rewards are immediate; others only arrive once the final outcome is reached.
Fig. 22 Basic RL Loop from SpinningUp#
Following Sutton and Barto, the components of this diagram are:
\(a_t\) is the action the reinforcement learning agent takes in some environment, such as moving a chess piece.
\(s_t\) is the state of the environment at time \(t\), say the board position in chess.
\(r_t\) is the reward received at time \(t\), say for capturing a chess piece.
\(Agent\) is the action taker, and \(\pi_{\theta}\) is its policy: the rule, parameterized by \(\theta\), that the agent uses to choose its next action.
\(Environment\) - Everything outside the agent; the thing the agent interacts with and that supplies its states and rewards.
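To make the loop concrete, here is a minimal Python sketch. The `env` and `agent` objects and their `reset`, `step`, `act`, and `observe` methods are hypothetical placeholders (loosely inspired by the common reset/step interface), not a specific library's API.

```python
# A minimal sketch of the RL loop above. `env` and `agent` are hypothetical
# placeholders, not a specific library's API.
def run_episode(env, agent, max_steps=100):
    state = env.reset()                                   # initial state s_0
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # a_t, chosen by the policy pi_theta
        next_state, reward, done = env.step(action)       # environment returns the next state and r_t
        agent.observe(state, action, reward, next_state)  # learning signal for the agent
        total_reward += reward
        state = next_state
        if done:                                          # episode finished
            break
    return total_reward
```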
These five simple concepts underpin billions of dollars of funding, people, and compute. They are at the core of self-driving cars, recommendation systems in deployed web apps, Nobel Prize-level breakthroughs such as AlphaFold, and many other real-world systems. Because reinforcement learning is so widely applicable, many different approaches have been developed: what works for winning at the game of Go won't necessarily work for solving cancer.
However, this breadth also means the domain comes with a lot of RL-specific jargon.
The Terms and Key Concepts#
We’ll cover the most common terms used throughout this guide. Described in words alone, these terms can be confusing. If you get lost later on, come back to this notebook and use the simple implementations here as a reference.
Environment(s)#
In addition to the textbook definition of “environment,” I personally make these two distinctions:
Deployed or Fixed Environment - The environment as it actually exists, e.g., users interacting with an LLM on their own prompts, or real-world driving conditions.
Constructed or Simulated Environment - An “artificially” constructed environment, such as a video-game-style simulator used to train self-driving cars.
State Terminology#
State - The status of the environment and the agent at a particular time step, e.g., velocity and adjacent car positions in a driving simulator, or piece positions in chess.
Observation - What the agent perceives from the environment at a given time step.
Time Step - A single tick of the RL loop, in which the agent observes, acts, and receives a reward.
Episode - One “start to finish” sequence of time steps through the RL loop, e.g., a full game of chess.
Trajectory - A sequence of states, actions, and rewards from an episode.
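As a concrete illustration, a trajectory can be stored as a simple list of (state, action, reward) tuples collected over one episode. The states and actions below are made up for a tiny grid world and are not from the Ice Maze example.

```python
# One episode's trajectory as (state, action, reward) tuples (illustrative values).
trajectory = [
    ((0, 0), "right", 0.0),  # time step 0
    ((0, 1), "down",  0.0),  # time step 1
    ((1, 1), "down",  1.0),  # time step 2: goal reached, episode ends
]

episode_length = len(trajectory)                 # number of time steps
total_reward = sum(r for _, _, r in trajectory)  # 1.0
```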
Reward Terminology#
Rewards - The “goodness signal” received after taking an action. This could be positive or negative. Sometimes the reward is fixed, but often it can be designed by the RL researcher/implementor as well, especially with LLMs.
Return - The cumulative discounted reward from a given time step to the end of the trajectory: \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots\), where \(\gamma\) is the discount factor.
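Here is a small worked example of the return, using the reward sequence from the toy trajectory above and an arbitrary example discount factor.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards from the toy trajectory above; gamma=0.9 is just an example value.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ~0.81, i.e. 0.9**2 * 1.0
```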
Action Terminology and Policy#
Algorithm - The overall method by which an agent learns and acts. This is the most critical choice, as it has many downstream implications. See the algorithm taxonomy diagram below (Fig. 23).
Policy - The “algorithm” that picks the next action in a given state. A policy maps a state to a probability distribution over the action space.
Off Policy and On Policy - In an off-policy algorithm, the policy used for exploration (the behavior policy) differs from the policy being learned (the target policy); in an on-policy algorithm they are the same. Q-Learning is off-policy: in our case, the behavior policy is epsilon-greedy, while the target policy is purely greedy (always picking the best action). A small sketch of both policies appears after this list.
Behavior Policy - The policy used to pick actions. For instance, we’ll be using epsilon-greedy policies in our Ice Maze example during learning.
Target Policy - The policy being learned. For example, in Ice Maze, this is the greedy policy.
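To make the behavior/target distinction concrete, here is a minimal sketch of both policies over a tabular Q-function. The Q-table values and action names are made up for illustration; this is not the full Ice Maze implementation.

```python
import random

# Illustrative Q-table: Q[state][action] -> estimated value (made-up numbers).
Q = {
    "start": {"left": 0.1, "right": 0.7},
    "mid":   {"left": 0.4, "right": 0.2},
}

def greedy(state):
    """Target policy: always pick the highest-value action."""
    return max(Q[state], key=Q[state].get)

def epsilon_greedy(state, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(list(Q[state]))
    return greedy(state)

print(greedy("start"))          # always "right"
print(epsilon_greedy("start"))  # usually "right", occasionally "left"
```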
Challenges with RL#
While conceptually simple, reinforcement learning comes with a host of challenges.
Explore vs Exploit - Do you stick with an action that is known to provide results, or try to find a better one?
Bias-Variance Tradeoff - Methods for estimating returns trade bias against variance; depending on the environment, higher-bias or higher-variance approaches may be better suited to finding an optimal policy.
Credit Assignment - In a trajectory of actions, some are “good,” some are “bad,” and others are neutral. Attributing the reward to the right actions is a challenge.
Environment Specification - How do you define your environment? For tasks like chess this is trivial; for self-driving, it is much more complex.
Sample Efficiency - Each RL loop costs time and money. How can we learn as much as possible from as few loops as possible?
Non-Stationarity - Does the environment stay static or does it change? E.g., a chess board versus user preference drift.
Curse of Dimensionality - It may be infeasible to explore all actions in all states.
Partial or Full Observability - The full state of the environment, and sometimes even the effects of actions, may not be visible at once.
LLM Focus#
In this guide we focus solely on LLMs, which means we focus on the algorithms on the far left of the diagram below. This includes algorithms such as PPO, as well as some not present on the diagram, such as GRPO.
We do this for two reasons:
To focus our foundational learning on a smaller set of concepts, for example actions taken in a discrete space (sketched below).
To highlight challenges specific to modern LLM deployments, such as the ambiguous nature of correctness in natural language.
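As a sketch of what “actions in a discrete space” means for LLMs, one way to view a language model as a policy is that each action is the choice of the next token from a fixed vocabulary. The vocabulary and probabilities below are made up for illustration.

```python
import random

# Made-up next-token distribution from a language model acting as the policy.
vocab_probs = {"Paris": 0.80, "London": 0.15, "Berlin": 0.05}

# One "action" is sampling the next token: the action space is the discrete
# vocabulary, and the policy is the model's probability distribution over it.
tokens, probs = zip(*vocab_probs.items())
action = random.choices(tokens, weights=probs, k=1)[0]
print(action)
```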
Fig. 23 Various RL Algorithms#
Before getting to LLMs, we'll work through a classic RL problem to solidify the core concepts. Let's get started.
References#
SpinningUp - A guide to RL written by OpenAI researchers before their LLM era
Suggested Prompts#
Can you elaborate on the challenges of RL with LLMs?
Which RL algorithms are less suited to LLMs? Why is this the case?