How we got here#

TLDR

  • Language models are not new. They have existed and have been evolving for decades

  • In this section there are three personas I’ll write for. Those personas are:

    • You want to train a language model from scratch

    • You want to deploy a preexisting model

    • You just want to understand how the models work

Important

This guide is in draft status

Language models are simple to explain. You describe what you want in words and “automagically” a computer can make it. This can be text, audio, or images/video. Or you just talk to a computer and it just “does a thing.” You’ve seen this in movies: in Iron Man, Jarvis creates all sorts of cool tech just by listening to Tony Stark.

What’s fascinating is how, in essentially one year, 2022, this went from a research idea to an everyday capability, even though Claude Shannon was writing about language models as far back as 1948.

Most folks point to the transformer architecture to explain this, but it’s not as simple as

Input string -> Transformer -> Output

There are many other pieces, such as embeddings, tokenization, and training methods, but also other factors, like public excitement and generally available compute, that all came together to make this happen.
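To make that concrete, here is a deliberately toy, runnable sketch of the stages hidden inside that single arrow. Every function and name below is invented for illustration, not any real library’s API, and each real component is vastly more sophisticated than its stand-in here:

```python
# A toy sketch of the pipeline. Real systems use learned subword tokenizers,
# high-dimensional embeddings, and transformer layers with billions of
# trained weights; each piece below is a placeholder for illustration only.

vocab = ["<unk>", "hello", "language", "models", "world"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def tokenize(text: str) -> list[int]:
    """Tokenization: raw text becomes standardized integer units."""
    return [token_to_id.get(word, 0) for word in text.lower().split()]

def embed(ids: list[int]) -> list[list[float]]:
    """Embedding: each token ID becomes a vector (here, a trivial one-hot)."""
    return [[1.0 if i == tid else 0.0 for i in range(len(vocab))] for tid in ids]

def transformer(vectors: list[list[float]]) -> list[list[float]]:
    """The celebrated architecture, reduced to an identity placeholder."""
    return vectors

def decode(vectors: list[list[float]]) -> str:
    """Decoding: model outputs are mapped back to token IDs, then to text."""
    ids = [max(range(len(vec)), key=vec.__getitem__) for vec in vectors]
    return " ".join(vocab[i] for i in ids)

print(decode(transformer(embed(tokenize("hello world")))))  # -> hello world
```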

Current Focus: Sequence-to-Sequence Models

Also known as text-to-text models, sequence-to-sequence models are the topic I’ll write about first, before moving on to others like diffusion models or text-to-{image, video}.

It’s the same for bicycles. While we take them for granted now, hundreds of years passed between the first concept drawings and the practical machines we ride today.

It’s also not just one innovation that made them practical, but many. For a modern bicycle to be practical, useful, and widely available, it took many innovations. A partial list includes:

  • Frame design for structure

  • Gears and chains for power transfer

  • Grease and bearings for efficiency

  • Plastics for durability and cost

  • Brakes and helmets for safety

  • A global supply chain for delivery

  • Sidewalks for usability

All these same ideas exist for language models as well. A lot of mathematical ideas, hardware, and data had to come together to make them possible. You can argue that one component is more important than another, but the fact is that without all of them the bike, or the language model, would come to a grinding halt.

If SLMs are bikes, then LLMs are…#

If you think of a small language model as a bike, a large language model is like a semi truck. Both move people or cargo, but one is far bigger and more complex.

In some ways the ideas are the same: a gear is a gear whether it’s on a bike or a semi truck. But the scale necessitates many differences; the brakes on a truck are much more complex than the brakes on a bike.

So let’s go through a new set of analogies, but this time let’s tie each real-world part to a part of the model itself.

  • Engine and NN Architecture - This is what’s at the core of the LLM. Engines come in different styles and sizes, for instance gas vs. electric. The same idea exists in large language models: in the past RNNs and LSTMs were the prediction engine, but these days it’s the transformer.

  • Number of seats and cargo capacity - How much can you load into your truck and move around? The context window is the same idea: how much “token cargo” the LLM can handle at once.

  • Engine Manufacturing and LLM Training - This is what forms the engine, or the weights of the model. The bigger the engine, or the more parameters the model has, the harder it is to build.

  • Engine Machining or Training Data - Engines are machined from raw material; in the same way, a model’s weights are shaped by the raw material of its training data.

  • Pallets or Tokens - An incredibly important but often overlooked innovation in shipping is the pallet. Pallets took whatever was to be shipped, whether it’s books, furniture, or food, and abstracted it into fixed units. Tokens are a similar idea: a text input is taken and processed into a more standardized form (see the short example after this list).

  • Evaluation - No need for a different word here. An engine can be evaluated and compared by things like horsepower or fuel efficiency. The same can be done for LLMs.

  • Engine or Model Tuning - No need for a different word here. A working engine can be subtly altered to favor one characteristic or another. The same goes for LLMs: models can be tuned toward subtly different characteristics.
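Here is the pallet idea in code. This short sketch assumes the open-source tiktoken tokenizer library (installed via pip install tiktoken); the specific IDs will vary between tokenizers, but the point is the same with any of them: arbitrary text is abstracted into fixed integer units.

```python
import tiktoken  # assumed installed: pip install tiktoken

# Load a standard encoding; cl100k_base is used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["a bike", "a semi truck", "pallets of books"]:
    ids = enc.encode(text)
    # Whatever shape the input text takes, the output is the same
    # standardized unit: a list of integer token IDs.
    print(f"{text!r} -> {ids}")
```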

The structure of this section#

We’re going to split this section of the guide into four distinct parts:

  1. (Small) Language Model Basics

  2. Large Language Model Definition and Training

  3. Large Language Model Inference and Serving

  4. Large Language Model Tuning and Evaluation