Evaluation#
Warning
Actively under development!
TLDR
There are many methods used to evaluate LLMs
Standard Benchmarks
Reward Models
Human Preference
Red Teaming
Even more
None of these methods are perfect
It’s now both a practical problem and research problem
With so many LLMs coming out benchmarks are becoming more important
Helps you ensure you can find the model that suits you
Method for you to independently assess models capabilities
Standard Benchmarks#
Many of these are designed to be a prompt and expected output rubric where the answer is included
Quantitative “Definitely right”
Mathematical benchmarks
Everyone agrees on correct but some room for ambiguity
Code evaluations, where the code may suck but the answer is right
Reasoning examples
What would a reasonable human do?
Types of tests
Binary discrimination
Ranked preference modeling
List of Evaluations#
Hellaswag
Human Preference#
Humans look at the outputs and make a choice. This can be
Tone of voice
Length of
Inclusion of references
Safety
Reward Models#
Another LLM provides a rating
Could be continuous score or a discrete score https://arxiv.org/pdf/2305.13711.pdf
The attempt to codify human preferences into an LLM itself
Red Teaming#
Get folks to really push the model in ways that are unexpected and see what happens
This is becoming a standard procedure for many models
Completely qualitative claims#
Many papers now saw “We randomly sampled a a hundred or so samples and checked ourselves”
Contamination#
Peeking is a real problem
Its almost impossible now to get totally separate test and train sets, the internet, books, etc all have so much text that’s repeated
There’s been work on contamination measures and reporting when sharing benchmarks
See LLAMA 2 paper