RL in the age of LLMs#

TLDR

  • Reinforcement learning is a set of “old” ideas now implemented on “new” LLMs

  • This section provides self-contained, minimal code examples with low computational cost

  • Install the code

RL in the context of LLMs#

The foundational ideas of Reinforcement Learning have been developing for at least 40 years, finding applications in diverse fields like robotics. However, in the last 4 years, interest in Reinforcement Learning has increased substantially with the rise of Large Language Models (LLMs). Models like ChatGPT brought Reinforcement Learning from Human Feedback (RLHF) to widespread attention, while earlier models like InstructGPT, a fine-tuned version of GPT-3, had already used it to follow instructions more reliably. These days, RL is a key tool when shaping LLMs, particularly in the later stages referred to as post-training.

This is a hands-on practical guide#

This is a practitioner’s guide to RL for LLMs. We showcase the most widely used RL algorithms for LLM post-training and use them to actually shape LLM behavior. We’ll cover the absolute basics of RL with Ice Maze to ensure the fundamentals are understood, and then we’ll start implementing RL on a tiny LLM. Papers and YouTube videos have their place, but they can only show you RL from an external perspective. Playing with working code yourself will give you the deepest understanding.

Note

This section is a new addition from 2025 and hasn’t quite been blended into the rest of the book yet.

Code You Can Run#

Another challenge I’ve found with LLM RL guides and code is that they are often based on models or compute architectures that are expensive. You can read the code, but you can’t actually see it “in motion”. In reality, model builders are not just implementing an algorithm like PPO and “firing and forgetting”. They’re looking at loss charts, reward plots, and many more low-level signals and diagnostics to get to the final LLM.

We include minimal implementations of each algorithm with intuitive explanations. All the examples will train in less than one hour on an M4-sized MacBook. This was a deliberate choice to make this guide accessible and complete. With the running code you get all the information a model builder would get, and you can easily and cheaply run multiple experiments. This will give you experience beyond just reading a paper. I encourage readers to change hyperparameters, data, and examples and make the LLM their own!

Everything in this section is implemented in a self-contained code package. The aim here is for you to understand RL by reading short English and Python snippets.

Install using uv or your favorite package manager.

uv pip install "git+https://github.com/canyon289/GenAiGuidebook.git@main#subdirectory=genaiguidebook/clean_llm_rl"

What is omitted#

To shape this guide we also made some deliberate omissions:

  • Heavy Mathematical Notation - This is present in most textbooks and papers. While symbolic notation is necessary for detailing algorithms in papers, we instead use code to detail the algorithms, with the text providing intuitive explanations.

  • Distributed Training Code and Batching - Libraries that implement RL often use many layers of abstraction for code reuse and multiple computational tricks such as batching or approximations for speed, or distributed implementations that are well suited for long running production jobs. This obscures the mathematics and fundamentals of the RL ideas. This guide provides single file implementations of every algorithm, with the core often being 30 lines or less.

  • Computationally Expensive Examples - Many other tutorials presume much heavier infrastructure, making experimentation challenging and expensive.
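To give a feel for what a short, single-file core can look like, here is a minimal sketch of REINFORCE on a two-armed bandit. It is not taken from this guide's code package: the function names, the payout probabilities, the learning rate, and the running-mean baseline are all assumptions chosen for illustration, and it uses only the Python standard library so it runs in well under a second.

```python
import math
import random


def softmax(logits):
    """Turn raw preference scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a two-armed bandit; returns the final policy and a reward log."""
    rng = random.Random(seed)
    win_prob = [0.2, 0.8]  # assumed toy environment: arm 1 pays off more often
    logits = [0.0, 0.0]    # policy parameters, one preference per arm
    baseline = 0.0         # running mean reward, used to reduce variance
    rewards = []

    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices([0, 1], weights=probs)[0]
        reward = 1.0 if rng.random() < win_prob[action] else 0.0
        advantage = reward - baseline
        # d/d logit[a] of log pi(action) is (1 if a == action else 0) - probs[a].
        for a in range(2):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad_log_prob
        baseline += 0.01 * (reward - baseline)
        rewards.append(reward)

    return softmax(logits), rewards


if __name__ == "__main__":
    policy, rewards = train_bandit()
    print("final policy:", [round(p, 3) for p in policy])
    print("mean reward, first 200 steps:", sum(rewards[:200]) / 200)
    print("mean reward, last 200 steps:", sum(rewards[-200:]) / 200)
```

The reward log is returned alongside the policy on purpose: comparing early and late mean reward is a tiny version of the reward plots a model builder would watch. Changing `lr`, `steps`, or `win_prob` is a cheap way to see how the training signal shifts.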

Reading Suggestions#

Again, this guide is meant to be more than a static set of words and code files. Take the content here as a guide and make it your own. In particular:

  • Read with an LLM - Use an LLM or LLM powered interface to read through the writing and the code. Something like NotebookLM or a CLI tool will be able to read the writing and the code and help you cross reference both easily.

  • Run the code with a debugger - When running code you can inspect things like shapes or intermediate values. This will give you a deeper sense of what is happening.

  • Modify the examples - Change the data or training yourself. This will allow you to see the effect of different configurations, and as a learner you’ll be more likely to remember your examples than mine!

  • Reference source papers and production grade code examples - These will contain much more academic and practical detail than I am including here, and once you have the basic understanding they’ll be easier to follow.
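As one way to practice the debugger suggestion, the sketch below computes per-token log-probabilities for a toy batch of logits and pauses in Python's built-in debugger when an environment variable is set. The function names and the `DEBUG_RL` variable are hypothetical and not part of this guide's package. Plain nested lists stand in for tensors so the example has no dependencies.

```python
import math
import os


def token_log_probs(logits, chosen):
    """Log-probability of each chosen token id under a softmax over its row of logits."""
    out = []
    for row, token_id in zip(logits, chosen):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[token_id] - log_z)
    return out


def inspect_step(logits, chosen):
    """Collect the shapes and intermediate values worth checking at a breakpoint."""
    log_probs = token_log_probs(logits, chosen)
    summary = {
        "logits_shape": (len(logits), len(logits[0])),  # (tokens, vocab size)
        "log_probs": log_probs,
        "sequence_log_prob": sum(log_probs),
    }
    # DEBUG_RL is a made-up variable name for this sketch. When it is set to "1",
    # execution pauses here so you can print `summary`, `logits`, or `chosen`.
    if os.environ.get("DEBUG_RL") == "1":
        breakpoint()
    return summary


if __name__ == "__main__":
    toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    print(inspect_step(toy_logits, chosen=[0, 2]))
```

Running the file with `DEBUG_RL=1` set drops you into `pdb` at the marked line, where `p summary` shows the same values the function returns. The same pattern applies to real training code: pause right after the quantity you are unsure about and check its shape and range.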

References#

These are the most concise references for the field as a whole that I could find. If you’re looking to dive in more deeply, I recommend these rather than piecing the knowledge together yourself.

Suggested Prompts#

  • What is the history of RL and how does it intersect with language models?

  • What are some additional references for how RL is applied to large language models?

Written On My Own#

As with the rest of the guidebook, this is written entirely on my own, without involvement from any other organization.