Actively under development!


  • There are many methods used to evaluate LLMs

    • Standard Benchmarks

    • Reward Models

    • Human Preference

    • Red Teaming

    • Even more

  • None of these methods are perfect

    • It’s now both a practical problem and a research problem

  • With so many LLMs coming out, benchmarks are becoming more important

    • They help you find the model that suits your use case

    • They give you a way to independently assess a model’s capabilities

Standard Benchmarks#

  • Many of these are designed as a prompt plus an expected-output rubric, where the correct answer is included

  • Quantitative “Definitely right”

    • Mathematical benchmarks

  • Everyone agrees on what’s correct, but there’s some room for ambiguity

    • Code evaluations, where the code may be ugly but the answer is right

  • Reasoning examples

    • What would a reasonable human do?

  • Types of tests

    • Binary discrimination

    • Ranked preference modeling
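A prompt-and-rubric benchmark like the ones above can be sketched as a simple exact-match scoring loop. This is a minimal illustration, assuming a hypothetical `model` callable that takes a prompt and returns a string:

```python
def exact_match_score(model, dataset):
    """Fraction of prompts where the model's answer matches the rubric answer."""
    correct = 0
    for example in dataset:
        prediction = model(example["prompt"]).strip()
        if prediction == example["answer"]:
            correct += 1
    return correct / len(dataset)

# Toy rubric and a stand-in "model" (a lookup table, for illustration only)
dataset = [
    {"prompt": "What is 2 + 2?", "answer": "4"},
    {"prompt": "What is 3 * 3?", "answer": "9"},
]
model = lambda p: {"What is 2 + 2?": "4", "What is 3 * 3?": "10"}[p]
print(exact_match_score(model, dataset))  # 0.5
```

Real mathematical benchmarks normalize answers before comparing (stripping whitespace, parsing numbers), but the prompt/expected-answer structure is the same.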

List of Evaluations#

Human Preference#

  • Humans look at the outputs and make a choice. This can be based on

    • Tone of voice

    • Length of response

    • Inclusion of references

    • Safety
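Once human raters have made pairwise choices, the simplest way to aggregate them is a win rate per model. A minimal sketch, assuming `votes` is a list recording which model each rater preferred on a prompt (names are illustrative):

```python
from collections import Counter

def win_rate(votes, model_name):
    """Fraction of pairwise comparisons in which `model_name` was preferred."""
    counts = Counter(votes)
    return counts[model_name] / len(votes)

# Hypothetical ratings: each entry is the rater's preferred model (or a tie)
votes = ["model_a", "model_b", "model_a", "model_a", "tie"]
print(win_rate(votes, "model_a"))  # 0.6
```

Leaderboards built on human preference typically go further and fit a rating system (e.g. Elo-style) over these pairwise outcomes rather than reporting raw win rates.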

Reward Models#

Red Teaming#

  • Get people to push the model in unexpected ways and see what happens

    • This is becoming a standard procedure for many models

Completely qualitative claims#

  • Many papers now say “We randomly sampled a hundred or so examples and checked them ourselves”


  • Peeking is a real problem

    • It’s almost impossible now to get totally separate test and train sets; the internet, books, etc. contain so much repeated text

  • There’s been work on measuring contamination and reporting it when sharing benchmarks

    • See the Llama 2 paper
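One common family of contamination measures checks for n-gram overlap between training data and benchmark test sets. This is a simplified sketch (real contamination analyses, such as the one reported in the Llama 2 paper, use more careful tokenization and thresholds):

```python
def ngrams(text, n=8):
    """Set of word n-grams in a string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_texts, test_texts, n=8):
    """Fraction of test examples sharing at least one n-gram with training data."""
    train_ngrams = set()
    for text in train_texts:
        train_ngrams |= ngrams(text, n)
    contaminated = sum(1 for t in test_texts if ngrams(t, n) & train_ngrams)
    return contaminated / len(test_texts)

# Toy example with a small n so the short strings actually overlap
train = ["the quick brown fox jumps over the lazy dog"]
test = ["the quick brown fox ran", "a completely different sentence here"]
print(contamination_rate(train, test, n=3))  # 0.5
```

A flagged test example isn’t necessarily useless, but reporting the contamination rate lets readers discount benchmark numbers accordingly.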