(Small) Language Model Intuition#


  • The formal definition is “A language model is a probability distribution over sequences of words”[1]

  • For a text-to-text model the intuition is: if I have to predict a word, what is the most likely one?

  • These ideas aren’t new. Claude Shannon wrote about them in 1948

  • All previous language model architectures are now outperformed by systems that combine tokenizers, embeddings, transformers, and other computational and mathematical tricks and tools


This guide is in draft status

What we do with languages and want from a language model#

Here are some examples

  • Correct a typo - My nmae is Ravin

  • Correct a grammatical error - My Ravin name

  • Expand or summarize I’m Ravin Kumar -> The name I was given is Ravin Kumar

  • Reword text - or rephrase, rejigger, throw out the window and rewrite

  • Translate

    • મારું નામ રવિન છે (This is Gujarati, the first language I learned)

    • mi nombre es ravin (This is Spanish, the third language I learned)

  • Retrieve information

    • Can you find me other last names f

  • Answer a question

    • How many Ravins are there?

  • Follow instruction or perform action

    • Pull up all emails addressed to Ravin

The basic ___ of a language model#

The basic purpose of a language model is to predict a word given other words. You as a human can try this: what is the missing word from the title above? Is it premise? Or function? Or idea? Or banana? All of those words would work, though the last one feels improbable.

If you’re a native English speaker this assessment of probability may feel automatic, but if we performed this exercise in a language you’re not fluent in it would be challenging. You would know neither the words that surround the missing word nor the missing word itself.

This is the general intuition of language models.

Training Small Language Models#

We can formalize this idea with a formula. A language model assigns a probability to a sequence of words:

\[P(w_1,\ldots,w_m)\]

That is, given a sequence of words, how likely is it? Sometimes you see this formula written in its conditional form, which predicts the next word from the words that came before it:

\[P(w_m \mid w_1,\ldots,w_{m-1})\]

The way a language model does this is by learning over a corpus of text. Let’s say we have the following

Training Example

  • My name is Ravin

  • My name is Scott

  • My dogs name is Spot

We may want to predict the next word after name, \(P(w_{next} \mid name)\).

With this text corpus above we get

\[\begin{split} P(is | name ) = 1 \\ P(my | name ) = 0 \end{split}\]

Why is that? Because in this corpus, is always follows name.

Let’s take two different words

\[\begin{split} P( name | my ) = 2/3 \\ P( Ravin | is ) = 1/3 \end{split}\]

Verify for yourself why this is the case. If you see this, then you understand the basic premise of a language model. Consider these learned probabilities to be the trained model. Now, this is a very simple model; in practice models can be trained on the last two words, the word before, the word after, or other combinations of things. But the idea is the same: the corpus of text is used as training examples to learn the structure of that same text.
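The counting described above can be sketched in a few lines of Python. This is a minimal sketch of the idea, not any particular library's API; the `prob` helper name is my own.

```python
from collections import Counter, defaultdict

# Toy corpus from the training example above
corpus = [
    "My name is Ravin",
    "My name is Scott",
    "My dogs name is Spot",
]

# Count how often each word follows each other word (bigram counts)
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1

def prob(next_word, current_word):
    """P(next_word | current_word) estimated from the corpus counts."""
    total = sum(counts[current_word].values())
    if total == 0:
        return 0.0
    return counts[current_word][next_word] / total

print(prob("is", "name"))   # 1.0
print(prob("my", "name"))   # 0.0
print(prob("name", "my"))   # 0.666...
print(prob("ravin", "is"))  # 0.333...
```

The printed values match the probabilities worked out by hand above: the learned counts *are* the trained model.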

All the things we can do#

Now that we have a trained model we can use it in many different ways.

It could be used for grammatical checks. If this system saw the text Name my is Ravin, we can “ask” the model what it thinks about the word name preceding the word my. In this case we would get

\[P(\text{next\_word} = my \mid \text{current\_word} = name) = 0\]

This probability of zero marks the sequence as unlikely, indicating there is an issue.
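Such a grammar check can be sketched as code: scan a sentence and flag any word pair the model assigns zero probability. This is an illustrative sketch; the `flag_unlikely_bigrams` function is my own invention.

```python
from collections import Counter, defaultdict

# Toy corpus from the training example above
corpus = ["My name is Ravin", "My name is Scott", "My dogs name is Spot"]

# Bigram counts: how often each word follows each other word
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1

def flag_unlikely_bigrams(sentence):
    """Return word pairs the model has never seen (probability 0)."""
    words = sentence.lower().split()
    return [
        (current, nxt)
        for current, nxt in zip(words, words[1:])
        if counts[current][nxt] == 0
    ]

print(flag_unlikely_bigrams("Name my is Ravin"))
# [('name', 'my'), ('my', 'is')]
```

The flagged pairs are exactly the ones the model never saw during training, which is the signal that something is off in the word order.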

It could also be used for text generation. We could pick a random word to start with, like dog, and see what the model picks as the next word. With that prediction the model continues on, predicting the word after that, and so on and so forth. This is in essence how generative models like Bard or ChatGPT work.
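The generation loop can be sketched as well. For simplicity this sketch always picks the most likely next word rather than sampling; the `generate` function is my own naming, not a standard API.

```python
from collections import Counter, defaultdict

# Toy corpus from the training example above
corpus = ["My name is Ravin", "My name is Scott", "My dogs name is Spot"]

# Bigram counts: how often each word follows each other word
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1

def generate(start, max_words=5):
    """Greedily extend `start` by repeatedly picking the most likely next word."""
    words = [start]
    while len(words) < max_words and counts[words[-1]]:
        words.append(counts[words[-1]].most_common(1)[0][0])
    return " ".join(words)

print(generate("my"))  # my name is ravin
```

Real generative models sample from the predicted distribution instead of always taking the most likely word, which is why their outputs vary from run to run.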

Problems with small language models and systems#

You can think of what we did as a “small” language model. This simplistic model provides an intuition for how these models work, but in practice these single-input-word probability models don’t work well.

This is because

  • There are a lot of words, and learning all combinations is hard

  • More than one word matters

    • Word order matters

  • We’re not always predicting the next token

  • Different languages have different rules

  • Languages are wide, ever-changing, inconsistent, and complicated

Over the years a number of different modeling techniques have been applied to alleviate some of these issues, such as n-grams, Markov chain approaches, and Recurrent Neural Networks. As I’m writing this, the type of model that dominates language is the Transformer neural network architecture, but I still encourage you to familiarize yourself with what I’m calling the foundational methods. I’ve provided references below to help.