# Glossary

**Autograd** - Automatic computation of gradients. In short, gradients are important in model training because they "tell" us how each parameter should be updated, up or down and by how much. For a full explanation, read PyTorch's autograd explainer.
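The idea can be sketched with a tiny reverse-mode autodiff class. This is purely illustrative and not PyTorch's implementation; the `Value` class and its methods are invented for this sketch:

```python
class Value:
    """A scalar that remembers how to propagate gradients backward."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._grad_fn = None  # set by the operation that produced this node

    def __add__(self, other):
        out = Value(self.data + other.data)
        # d(a+b)/da = 1 and d(a+b)/db = 1, so pass the gradient through.
        out._grad_fn = lambda g: (self.backward(g), other.backward(g))
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data)
        # d(a*b)/da = b and d(a*b)/db = a (the product rule).
        out._grad_fn = lambda g: (self.backward(g * other.data),
                                  other.backward(g * self.data))
        return out

    def backward(self, g=1.0):
        self.grad += g
        if self._grad_fn:
            self._grad_fn(g)

x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = x * w + b     # y = 3*2 + 1 = 7
y.backward()      # dy/dw = x = 3, dy/dx = w = 2, dy/db = 1
```

After `backward()`, each leaf's `.grad` holds the derivative of `y` with respect to it, which is exactly the signal an optimizer needs.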

**Context Window** - The maximum number of tokens a model can take as input.
You can see this implemented in code in the GPT from Scratch Notebook.
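A minimal sketch of how a context window is enforced during generation, assuming a simple list of token ids (`block_size` is an illustrative name):

```python
block_size = 8                   # maximum number of tokens the model can attend to

tokens = list(range(20))         # pretend token ids from a long conversation
context = tokens[-block_size:]   # crop to the most recent block_size tokens
```

Anything before the cropped window is simply invisible to the model on this forward pass.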

**Loss Function** - Measures how “bad” a model’s prediction is. See Language Model Pretraining for an example and for a full list see Optax’s docs.
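As a concrete example, here is a minimal cross-entropy loss, the usual pretraining loss, in plain Python (the `cross_entropy` helper is illustrative, not Optax's API):

```python
import math

def cross_entropy(probs, target):
    """Large when the model puts low probability on the correct token,
    near zero when that probability approaches 1."""
    return -math.log(probs[target])

confident = cross_entropy([0.05, 0.9, 0.05], target=1)  # model is right and sure
unsure = cross_entropy([0.4, 0.2, 0.4], target=1)       # model is hedging
```

The loss is lower for the confident, correct prediction, which is exactly the behavior training rewards.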

**Optimizer** - In a neural network context, the algorithm used during training to drive the parameters toward convergence. Analytics Vidhya provides a comprehensive explanation.
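A sketch of the simplest optimizer, vanilla stochastic gradient descent; optimizers like Adam refine this same update with momentum and adaptive scaling (the `sgd_step` helper is illustrative):

```python
def sgd_step(params, grads, lr=0.1):
    """Move each parameter a small step against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [1.0, -2.0]
grads = [0.5, -0.5]                # pretend gradients from backprop
params = sgd_step(params, grads)   # -> [0.95, -1.95]
```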

**Self-supervision** - A technique where the input data and the corresponding output ground truth are derived from the same source.
See Language Model Pretraining for more details.
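In next-token prediction this is especially simple: the targets are just the inputs shifted by one position, so no human labeling is needed. A sketch, using character-level "tokens" for brevity:

```python
text = "hello world"
tokens = list(text)      # character-level tokenization, purely for illustration

inputs = tokens[:-1]
targets = tokens[1:]     # ground truth derived from the same source text
pairs = list(zip(inputs, targets))
# each pair is (current token, token the model should predict next)
```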

**Statistical Model** - A mathematical model that uses observed information to determine its parameters.
See Language Model Pretraining for more details.
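As a concrete example, the simplest statistical model is a line `y = a*x + b` whose parameters are estimated from observed data, here via NumPy's least-squares polynomial fit:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                  # observations generated by y = 2x + 1

a, b = np.polyfit(x, y, deg=1)     # parameters recovered from the data
```

A language model is the same idea at vastly larger scale: billions of parameters determined from observed text.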

**Language Model Head** - The final layer of a language model, which maps the model's hidden states to task-specific outputs.
In chat applications a *token head* predicts the next token.
In a regression head, for a reward model, the model might output a scalar value.
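A sketch of both kinds of head attached to the same hidden state; the shapes and weight names here are illustrative assumptions:

```python
import numpy as np

hidden = np.random.randn(768)           # final hidden state for one token

# Token head: project to vocabulary logits for next-token prediction.
W_vocab = np.random.randn(50257, 768)   # illustrative vocab size
logits = W_vocab @ hidden               # shape (50257,)

# Regression head (e.g. a reward model): project to a single scalar.
w_reward = np.random.randn(768)
reward = float(w_reward @ hidden)       # one number
```

The backbone is identical in both cases; only the head changes with the task.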

**MFCCs** - Mel-frequency cepstral coefficients are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. They are widely used in speech and music recognition. You can read more about them here.

**Delta MFCCs** - Delta MFCCs are the first order derivatives of the MFCCs. They are used to capture the rate of change of the MFCCs. For more information, see here.

**Delta Delta MFCCs** - Delta Delta MFCCs are the second order derivatives of the MFCCs. They are used to capture the acceleration of the MFCCs. For more information, see here.
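The two delta definitions above can be sketched as framewise differences over an MFCC matrix. Real libraries such as librosa compute deltas with a smoothed local regression rather than this raw difference, but the idea is the same:

```python
import numpy as np

mfcc = np.random.randn(13, 100)        # pretend MFCCs: (n_coeffs, n_frames)

delta = np.diff(mfcc, n=1, axis=1)     # first difference: rate of change
delta2 = np.diff(mfcc, n=2, axis=1)    # second difference: acceleration
```

Each differencing step shortens the frame axis by one, which is why libraries pad at the edges in practice.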

**Vector Quantization (VQ)** - Vector quantization is a lossy data compression technique that involves encoding a signal as a set of discrete values from a finite set called a codebook. It is used in the VQ-VAE model. For more information, see here.
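A minimal sketch of vector quantization with a toy codebook: each input vector is replaced by its nearest codebook entry, so only the discrete index needs to be stored:

```python
import numpy as np

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 2.0]])                 # K = 3 learned code vectors
x = np.array([[0.9, 1.1],
              [0.1, -0.2]])                       # vectors to compress

# Euclidean distance from every input to every codebook entry.
dists = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
codes = dists.argmin(axis=1)    # discrete indices: what gets stored/transmitted
quantized = codebook[codes]     # lossy reconstruction from the codebook
```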

**VQ-VAE** - Vector Quantized Variational Autoencoder is a generative model that uses vector quantization to learn discrete latent representations in an autoencoder. You can read more about it here.

**Hidden Markov Model (HMM)** - A Hidden Markov Model is a statistical model that describes a sequence of observable events generated by an underlying Markov chain. It is used in speech recognition and other sequential data tasks.

**Variational Autoencoder (VAE)** - A Variational Autoencoder is a generative model that learns a latent representation of the input data. It is trained using variational inference.

**Generative Adversarial Network (GAN)** - A Generative Adversarial Network is a generative model that consists of two neural networks: a generator and a discriminator. The generator generates samples, and the discriminator distinguishes between real and generated samples. The two networks are trained adversarially.

**Linear Embedding Layer** - A linear embedding layer is a layer in a neural network that maps input tokens to continuous embeddings using a linear transformation. It is commonly used in language models and other sequence modeling tasks.
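A sketch showing that an embedding layer is just a learned lookup table; indexing it with token ids is mathematically equivalent to multiplying one-hot vectors by a weight matrix, i.e. a linear transformation (sizes are illustrative):

```python
import numpy as np

vocab_size, d_model = 10, 4
E = np.random.randn(vocab_size, d_model)   # learned embedding weights

token_ids = np.array([3, 1, 3])
embeddings = E[token_ids]                  # one row per token: (3, d_model)
```

Repeated tokens map to identical rows, which is why the same word gets the same embedding wherever it appears.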

**Patch Embedding** - Patch embedding is a technique used in vision transformers to convert image patches into continuous embeddings. It involves linearly projecting the image patches into a lower-dimensional space.
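A minimal sketch of patch embedding for a single-channel image, following the ViT recipe of splitting into non-overlapping patches, flattening, and linearly projecting; the sizes here are illustrative:

```python
import numpy as np

H = W = 8                # image height and width
P = 4                    # patch size
d_model = 16             # model embedding dimension
img = np.random.randn(H, W)

# Rearrange into (n_patches, P*P): each row is one flattened patch.
patches = img.reshape(H // P, P, W // P, P).transpose(0, 2, 1, 3)
patches = patches.reshape(-1, P * P)       # (4, 16) for these sizes

W_proj = np.random.randn(P * P, d_model)   # learned linear projection
patch_emb = patches @ W_proj               # (n_patches, d_model)
```

The transformer then treats each projected patch the way a language model treats a token embedding.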

**Positional Encodings** - Positional encodings are vectors added to the input embeddings in transformer models to encode the position information of tokens in the sequence. They help the model learn the order of tokens in the sequence.
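The sinusoidal encodings from the original transformer paper ("Attention Is All You Need") can be sketched as follows; each position gets a unique vector of sines and cosines at geometrically spaced frequencies:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2) frequency index
    angles = pos / (10000 ** (2 * i / d_model))

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
```

These vectors are simply added to the token embeddings, so the same token at different positions produces different inputs to the attention layers. Learned positional embeddings are a common alternative.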