Autograd - Automatic calculation of gradients. In short, gradients matter in model training because they "tell" us how each parameter should be updated: up or down, and by how much. For a full explanation read PyTorch's autograd explainer.
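To make the idea concrete, here is a minimal sketch of reverse-mode autograd for scalar values. The `Value` class is purely illustrative (it is not PyTorch's implementation): each operation records how to push gradients back to its inputs, and `backward()` applies the chain rule in reverse order.

```python
# Minimal sketch of reverse-mode autograd for scalars.
# "Value" is a hypothetical class for illustration, not PyTorch's API.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0        # accumulated gradient d(output)/d(self)
        self._parents = parents
        self._grad_fn = None   # how to propagate gradient to parents

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def grad_fn():
            # product rule: d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._grad_fn = grad_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def grad_fn():
            # addition passes the gradient through unchanged
            self.grad += out.grad
            other.grad += out.grad
        out._grad_fn = grad_fn
        return out

    def backward(self):
        # visit nodes in topological order, then apply chain rule in reverse
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0  # d(output)/d(output) = 1
        for v in reversed(order):
            if v._grad_fn:
                v._grad_fn()

x = Value(3.0)
y = x * x        # y = x^2, so dy/dx = 2x = 6
y.backward()
```

After `backward()`, `x.grad` holds 6.0, which is exactly the "up or down and by how much" signal the optimizer uses.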

Context Window - The maximum number of tokens a model can accept as input. You can see this implemented in code in the GPT from Scratch Notebook.
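A common way this shows up in code is cropping the token sequence to the window before feeding it to the model. The snippet below is a sketch with an illustrative `block_size`; the variable names are assumptions, not taken from the notebook.

```python
# Sketch: inputs longer than the context window must be truncated.
# block_size is a hypothetical window size for illustration.
block_size = 8                    # maximum tokens the model attends to

token_ids = list(range(12))       # 12 tokens: longer than the window
cropped = token_ids[-block_size:] # keep only the most recent block_size tokens
```

Keeping the *last* `block_size` tokens is the usual choice for next-token prediction, since the most recent context is what the model conditions on.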

Loss Function - Measures how “bad” a model’s prediction is. See Language Model Pretraining for an example and for a full list see Optax’s docs.
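As a small illustration (not taken from Optax), here is cross-entropy loss for a single prediction: the more probability the model puts on the correct answer, the lower the loss.

```python
import math

def cross_entropy(probs, target_index):
    # negative log-likelihood of the true class:
    # low when probs[target_index] is near 1, high when it is near 0
    return -math.log(probs[target_index])

good = cross_entropy([0.1, 0.8, 0.1], 1)  # confident and correct -> small loss
bad  = cross_entropy([0.8, 0.1, 0.1], 1)  # confident and wrong  -> large loss
```

Training works by adjusting parameters to make this number smaller on average over the dataset.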

Optimizer - In a neural network context, the algorithm used during training to update parameters and drive them toward convergence (e.g. SGD, Adam). Analytics Vidhya provides a comprehensive explanation.
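The simplest optimizer is stochastic gradient descent, which nudges each parameter a small step against its gradient. A minimal sketch (illustrative names, fabricated gradients):

```python
# One SGD update step: p <- p - lr * g for each parameter.
def sgd_step(params, grads, lr=0.1):
    # move each parameter a small step opposite its gradient
    return [p - lr * g for p, g in zip(params, grads)]

params = [1.0, -2.0]
grads = [0.5, -0.5]   # pretend these came from backpropagation
params = sgd_step(params, grads)
```

Fancier optimizers like Adam follow the same loop but scale the step per parameter using running statistics of past gradients.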

Self-supervision - A technique where the input data and the corresponding output ground truth are derived from the same source. See Language Model Pretraining for more details.

Statistical Model - A mathematical model that uses observed information to determine its parameters. See Language Model Pretraining for more details.
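A toy example of the idea (the data here is made up for illustration): if we assume observations come from a normal distribution, the observed data determines the model's mean parameter, since the maximum-likelihood estimate of the mean is just the sample average.

```python
# Assume data ~ Normal(mu, sigma); estimate mu from the observed data.
# (Illustrative numbers; the sample mean is the maximum-likelihood estimate.)
observations = [2.0, 4.0, 6.0, 8.0]
mu = sum(observations) / len(observations)  # estimated mean parameter
```

A language model is the same idea at scale: billions of parameters determined by the observed training text.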

Language Model Head - The final layer of the model, which maps its internal representation to the desired output. In chat applications a token head predicts the next token, outputting a score for every token in the vocabulary. For a reward model, a regression head might instead output a single scalar value.
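Concretely, a token head is typically a linear map from the final hidden vector to one logit per vocabulary entry. The sketch below uses tiny, made-up dimensions for illustration:

```python
# Sketch of a language model head as a linear map:
# hidden state (size 2) -> logits (one per vocabulary token, here 3).
hidden = [0.5, -1.0]          # final hidden state (illustrative values)
W = [[1.0, 0.0],              # hypothetical vocab_size x hidden_size weights
     [0.0, 1.0],
     [1.0, 1.0]]

logits = [sum(w * h for w, h in zip(row, hidden)) for row in W]
top = max(range(len(logits)), key=lambda i: logits[i])  # top-scoring token id
```

A regression head for a reward model would be the same linear map but with a single output row, producing one scalar instead of a vocabulary-sized vector.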