### The main ideas of the Transformer

This is the original picture of a transformer from the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). It seems complicated, but really a few core intuitive ideas make the whole thing work, which is quite amazing. Transformers and this paper have been covered extensively elsewhere, so we won't go into detail; we'll just cover the highlights and provide references to the best materials.

```{figure} /images/language_models/EncoderDecoder.png
---
name: directive-fig
---
The famous image from the Attention Is All You Need paper
```

The parts we're going to highlight are

- Attention, comprising a Query, Key, and Value
- Decoder and Encoder
- Softmax transform

#### Attention, Self Attention, and Positional Encoding

This is the biggest idea, as evidenced by the name of the Attention Is All You Need{cite}`vaswani2023attention` paper itself. The name indicates the core idea: in essence the model figures out which parts of its input it should "care about" and which it should ignore. Think of the following question.

> I went to the grocery store and hardware store, walked around, and bought nails and bread.
> Which did I buy where?

As a human, what do you pay attention to, and what do you ignore? I'm sure you can explain it generally, but most of us, if asked to write a generalizable mathematical rule, would struggle. The paper authors did write one such rule, called attention, and as of now nearly everyone is using it.
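That rule is scaled dot-product attention: similarity scores between queries and keys, scaled and passed through a softmax, then used to weight the values. Here's a minimal NumPy sketch; the Q, K, and V matrices are made-up toy values rather than trained weights, so only the shapes and the formula are meaningful.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention from the paper:
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens, each with a 4-dimensional representation
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = attention(Q, K, V)
print(weights)  # each row is a probability distribution over the 3 tokens
```

Each row of `weights` says how much the corresponding token "pays attention to" every token in the input, which is exactly the grocery-store/hardware-store intuition made mathematical.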

The core parts of attention are

1. **Query**
2. **Key**
3. **Value**

Here's an analogy from [Lih Verma](https://lih-verma.medium.com/query-key-and-value-in-attention-mechanism-3c3c6a2d4085):

> You log in to Medium and search for some topic of interest — this is the Query.
> Medium has a database of article titles and hash keys of the articles themselves — the Key and Value.
> Define some similarity between your query and the titles in the database — e.g. the dot product of Query and Key.
> Extract the hash key with the maximum match.
> Return the article (Value) corresponding to the hash key obtained.

The explanation in [this blog post where a transformer is handmade](https://vgel.me/posts/handmade-transformer) also provides great intuition, especially because the QKV matrix weights are set _by hand_, leaving no ambiguity whatsoever in the calculation.

#### Softmax

Softmax takes a vector of arbitrary length and turns it into a proper probability vector. In math it can look a little scary,

$$
\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}
$$

but in practice it's quite simple. In libraries with broadcasting it's a one-liner.

```
import numpy as np

x = [.3, 1.2, .8]
np.exp(x) / sum(np.exp(x))
```

And here's the output.

> array([0.19575891, 0.48148922, 0.32275187])

Now `x` has been transformed into a proper probability vector. In transformers softmax is prominently used at the end. However, it's also used within each self attention head as a normalizer and activation function, which you can see [in Flax here](https://github.com/google/flax/blob/main/flax/linen/attention.py#L108).

```{admonition} Relu activation vs Softmax
:class: tip
I've only skimmed this paper but it seems to offer an intuition on why softmax is a good activation function. https://arxiv.org/pdf/2302.06461.pdf
```

#### Decoder and Encoder Stacks

When in the weeds of transformer architecture you'll hear about encoder models, decoder models, and encoder-decoder models. I have a loose grasp on this right now.
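One concrete difference I do understand: decoder-style self attention is *causal* (masked), so each token can only attend to itself and earlier tokens, while encoder-style self attention lets every token attend to every other token. A minimal NumPy sketch of the masking step, with made-up score values; setting disallowed positions to negative infinity before the softmax is the standard trick, since softmax then sends them to zero:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy attention scores for a 4-token sequence (made-up values)
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

# Causal mask: token i may only attend to tokens 0..i.
# Positions above the diagonal get -inf, so softmax sends them to 0.
mask = np.triu(np.ones((4, 4), dtype=bool), k=1)
masked_scores = np.where(mask, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 2))
# The weight matrix is lower-triangular: no token "sees" the future.
```

Dropping the mask (using `scores` directly) gives encoder-style attention, where the weight matrix is dense.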
[This website explains it well](https://www.practicalai.io/understanding-transformer-model-architectures/), but there are some nuances between decoder-only and encoder-decoder models that I don't fully understand.

## References

- [3D LLM Visualization and Guide](https://bbycroft.net/llm) - A fantastic 3D visualization of LLMs with very nice explanations.
- [Attention Implementation in Flax](https://github.com/google/flax/blob/main/flax/linen/attention.py) - A concise and fantastic code implementation. In particular, see this [line](https://github.com/google/flax/blob/main/flax/linen/attention.py#L96-L97), which decomposes the query and key dot product, and this [line](https://github.com/google/flax/blob/main/flax/linen/attention.py#L187-L188), which details the (query, key) and value dot product. The einsum notation in particular makes it quite clear what is happening.
- [Intuitive Explanation of Query, Key, and Value](https://lih-verma.medium.com/query-key-and-value-in-attention-mechanism-3c3c6a2d4085) - A nice blog post that summarizes this well.
- [Attention Is All You Need paper](https://arxiv.org/abs/1706.03762) - The original paper for the transformer architecture that kicked off this whole craze.
- [Fantastic blog post on Transformers](https://towardsdatascience.com/drawing-the-transformer-network-from-scratch-part-1-9269ed9a2c5e) - Very well illustrated and easy to read.
- [Softmax and temperature](https://lukesalamone.github.io/posts/what-is-temperature/) - A good explanation of softmax and also temperature, a concept we'll come back to later.
- [Encoder vs decoder explained well](https://www.practicalai.io/understanding-transformer-model-architectures) - A good summary of how the different architectures are used, with references to actual models.
- [GPT in 60 lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/) - A compact NumPy implementation of transformers and the full training stack.
- [End to End Guide to transformers](https://e2eml.school/transformers.html) - A fantastic guide that starts from the basics of matrix multiplication, through Markov chains, to the fully built model.
- [Building a Transformer by hand, including the weights](https://vgel.me/posts/handmade-transformer/) - This guide really cuts to the core of what a transformer is, and with manual weight selection you really get an understanding of what's happening.