or self supervised training


  • Pretraining is the first step in modern language model training

    • A large corpus of text is found and training examples are computationally generated

  • You can train a “Medium Sized” Language model at home relatively simply. The basic steps are

    • Tokenization, masking, self supervised learning, inference, text generation

  • You cannot pretrain a Large Language Model at home

    • The compute required costs more than most homes

  • Pretraining a model is not enough to create a polished experience like Bard or ChatGPT

Now that we’ve covered transformers, let’s go one level up and talk about the training pipeline. At its core, training a Neural Network Language model is exactly the same as we showed in Training Small Language Models.

The best single resource I’ve found is Andrej Karpathy’s train a GPT from scratch video. To see all the details definitely watch that. I’ll provide an overview here with some of my own commentary below.

The four step process for text generation

Let’s say we want to train a large language model on this particular text.

This is me training a large language model, to learn the phrase large language model

Here are the high level steps we’d go through.

  1. Tokenization - Turning text into numbers

  2. Self Supervision through Masking - Building training examples from text

  3. Parameter Estimation - Estimate weights

  4. Generation - Make predictions about the next number

Let’s go through each step. First we tokenize the text. We’ll use OpenAI’s Tokenizer because it has a nice web interface. The input string turns into this vector of token IDs.

[1212, 318, 502, 3047, 257, 1588, 3303, 2746, 11, 284, 2193, 262, 9546, 1588, 3303, 2746]
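To make "turning text into numbers" concrete, here is a minimal sketch of a toy word-level tokenizer. It is an assumption for illustration only: OpenAI’s Tokenizer uses byte-pair encoding over a large fixed vocabulary, so the IDs produced here will not match the vector above. The names `build_vocab`, `encode`, and `decode` are made up for this example.

```python
import re

# Split into words and standalone punctuation marks (toy rule, not BPE).
PIECE_PATTERN = r"\w+|[^\w\s]"

def build_vocab(text):
    """Assign an integer ID to each distinct piece, in order of first appearance."""
    vocab = {}
    for piece in re.findall(PIECE_PATTERN, text):
        if piece not in vocab:
            vocab[piece] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn text into a list of integer IDs."""
    return [vocab[piece] for piece in re.findall(PIECE_PATTERN, text)]

def decode(ids, vocab):
    """Turn IDs back into text (lossy: original spacing is not preserved)."""
    inverse = {i: piece for piece, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

text = "This is me training a large language model, to learn the phrase large language model"
vocab = build_vocab(text)
ids = encode(text, vocab)
print(ids)
```

Even in this toy version, the repeated phrase "large language model" maps to the same three IDs both times it appears, which is the same pattern visible in the real vector (1588, 3303, 2746).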

From there training examples are constructed by taking the original vector and masking everything after a given position, so the model sees the start of the sequence and has to predict the token that comes next. This is called self supervised learning: instead of needing \(X\) and \(Y\), or an input and output dataset, one dataset of written text can be used to generate all the examples needed to train a model.

[1212, 318, 502, 3047] -> 257
[1212, 318, 502, 3047, 257] -> 1588
[1212, 318, 502, 3047, 257, 1588] -> 3303
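The examples above can be generated mechanically from the token vector. Here is a minimal sketch, assuming each example is simply a prefix of the sequence paired with the token that follows it. The function name `build_examples` and the `min_context` parameter are hypothetical, chosen so the first example starts with four tokens like the ones shown.

```python
def build_examples(tokens, min_context=4):
    """Pair every prefix of at least `min_context` tokens with the next token."""
    examples = []
    for i in range(min_context, len(tokens)):
        context = tokens[:i]   # what the model is allowed to see
        target = tokens[i]     # the "masked" token it must predict
        examples.append((context, target))
    return examples

tokens = [1212, 318, 502, 3047, 257, 1588, 3303, 2746,
          11, 284, 2193, 262, 9546, 1588, 3303, 2746]

examples = build_examples(tokens)
for context, target in examples[:3]:
    print(context, "->", target)
```

No labels were written by hand here. Both the inputs and the targets come from the same piece of text, which is what makes the approach self supervised.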

From there the transformer weights are trained, like any other neural net. Text generation then works like prediction with any other model: an input sequence is provided and the next token is predicted. We can continue the generation by adding the predicted token to the input sequence and making another prediction.
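The predict, append, predict again loop can be sketched without a transformer at all. The example below swaps in a much simpler stand-in, a bigram count model that only looks at the last token, so that the whole thing runs with the standard library. The "parameter estimation" here is just counting, not gradient descent, and the names `fit_bigram`, `predict_next`, and `generate` are made up for this sketch.

```python
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """'Train' by counting which token follows which (stand-in for weight estimation)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent follower of the last token, or None if unseen."""
    followers = counts.get(context[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(counts, prompt, max_new_tokens):
    """Repeatedly predict the next token and append it to the input sequence."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(counts, sequence)
        if nxt is None:
            break
        sequence.append(nxt)
    return sequence

tokens = [1212, 318, 502, 3047, 257, 1588, 3303, 2746,
          11, 284, 2193, 262, 9546, 1588, 3303, 2746]

model = fit_bigram(tokens)
print(generate(model, [1212, 318], max_new_tokens=3))
```

A real language model replaces the count table with a neural network that conditions on the whole context, and usually samples from the predicted distribution instead of always taking the most likely token, but the outer generation loop has the same shape.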

Simple training, complex challenges

Despite the relative simplicity of the model and training, there are some challenges that are still open in the field of Language Models.


When these models aren’t trained properly they produce strings of text that are, frankly, unreadable garbage. What’s more insidious is when the text produced is plausible at a glance, but issues become evident upon closer inspection. This has been termed hallucination. Interestingly enough, Andrej Karpathy wrote about these hallucinations way back in 2015, before transformers were invented. In the blog post he shows how they occur with Recurrent Neural Networks, a problem that is still with us to this day.

The double edged nature of training data

The training data is what enables these models to learn language. As models get larger they require more data to learn that language, particularly the type of conversational language we’ve all come to enjoy in chat agents. In Wikipedia you’ll see how, as OpenAI moved from GPT-1 through GPT-3, the corpus changed from a collection of books to the general internet.

While this enables the model to learn more about language, it’s also the root of AI alignment issues. Model behavior is influenced by a body of input so large that an individual modeler cannot curate and inspect it in a human lifetime, or even multiple human lifetimes.

How to fix these

In upcoming sections we’ll talk more about the strategies used to address these, such as supervised fine tuning, instruction tuning, prompt tuning, reinforcement learning from human feedback, and reward models, among others.