LLM judges on trial: A new statistical framework to assess autograders

Our new framework can assess the reliability of LLM evaluators while simultaneously answering the evaluation's primary research question.

Evaluating AI models is central to improving their performance, as well as ensuring their safety. Evaluations let us compare performance across various domains (such as mathematics, coding or reasoning tasks) and assess how useful models are for real-world use cases. They are also important for knowing whether models possess potentially dangerous capabilities, such as the capacity to assist with – or even autonomously complete – sophisticated cyber attacks, or to self-replicate and evade human oversight and control.

Model evaluations can be carried out by humans – a model is prompted to complete a task, and a human grades its performance. But this approach is expensive and time-consuming. As AI progress has accelerated and the volume of evaluations has grown, researchers have sought to automate the process by having Large Language Models (LLMs) grade model outputs. These LLM evaluators are known as autograders, or “LLMs-as-judge”.

This raises a crucial question: can we rely on the judgements of autograders? Some studies suggest they exhibit certain systematic biases, which appear distinct from random noise. For example, they tend to grade outputs from models of the same family more highly. They can also have preferences for longer outputs, or ones written in a particular style or containing certain keywords. This means that before we can confidently use an autograder for AI model evaluation, we need to evaluate the reliability of the autograder itself for the specific task at hand.

What if we could do both at the same time? In a new paper, we propose the use of Bayesian generalised linear models (GLMs) to do just this.  GLMs allow researchers to use autograders for model evaluations, while simultaneously assessing the quality of those autograders.  

This blog post explains why Bayesian GLMs are useful in the context of AI model evaluations. All the statistical models discussed in our paper are available in an open-source package on GitHub.

Why GLMs?

A GLM is a statistical model that lets us predict an outcome based on weighted input variables. It uses a link function to represent this outcome on a scale suited to the properties of our data. Using a Bayesian approach returns a posterior distribution (a range of possible values and their respective probabilities), as opposed to point estimates.
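Before getting into why this matters, here is a minimal sketch of what a Bayesian GLM for a simple pass/fail evaluation could look like, written with PyMC. The data, variable names and choice of library are illustrative assumptions only; see the paper and the package on GitHub for the actual models.

```python
import numpy as np
import pymc as pm

# Hypothetical example data: did each response pass (1) or fail (0),
# and was it graded by the LLM autograder (1) or a human (0)?
passed = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
is_llm_grader = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

with pm.Model() as grader_model:
    beta_0 = pm.Normal("beta_0", mu=0, sigma=2)   # baseline log-odds of passing (human grader)
    beta_1 = pm.Normal("beta_1", mu=0, sigma=2)   # shift in log-odds when the LLM grades
    eta = beta_0 + beta_1 * is_llm_grader         # linear predictor (weighted sum of inputs)
    p = pm.math.invlogit(eta)                     # logit link maps log-odds onto (0, 1)
    pm.Bernoulli("passed", p=p, observed=passed)  # binary outcome: pass/fail
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)  # draws from the posterior
```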

There are a few reasons why Bayesian GLMs are particularly useful for LLM evaluation. First, they represent outcomes as a function of a weighted sum called a linear predictor, which will look something like this:

η = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

This means we can isolate the contribution of each factor to the outcome. For example, the coefficient β₁ might represent the impact of using a human vs an LLM grader.
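Continuing the illustrative sketch above, the posterior for β₁ (named beta_1 in the code) can then be summarised, for instance with ArviZ. A credible interval that sits clearly away from zero would suggest a systematic difference between human and LLM grades:

```python
import arviz as az

# `idata` holds the posterior draws from the sketch above.
# Posterior mean and credible interval for the grader-type coefficient.
print(az.summary(idata, var_names=["beta_1"], hdi_prob=0.94))

# Posterior probability that the LLM grader passes responses more often than the human grader.
beta_1_draws = idata.posterior["beta_1"].values.ravel()
print("P(beta_1 > 0) =", (beta_1_draws > 0).mean())
```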

Second, GLMs provide flexibility across various evaluation designs because researchers can choose from different probability distributions (binomial, Poisson, etc.) paired with appropriate link functions to model the structure of the target variable. This is important because LLM evaluations can be graded on a variety of scales (a short sketch of this follows the list), such as:

· Binary (does the model pass or fail?)

· Ordinal (how coherent is this response on a scale of 1-10?)

· Count (how many mistakes does this response contain?)
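As a rough illustration of how the likelihood and link function change with the grading scale (again with made-up data, purely as a sketch rather than code from the paper), a count-valued grade such as the number of mistakes might be modelled like this:

```python
import numpy as np
import pymc as pm

# Hypothetical count-valued grades: number of mistakes found in each response,
# with an indicator for whether the LLM autograder (1) or a human (0) did the grading.
mistakes = np.array([2, 0, 3, 1, 4, 1, 0, 2])
is_llm_grader = np.array([0, 0, 0, 0, 1, 1, 1, 1])

with pm.Model() as count_model:
    beta_0 = pm.Normal("beta_0", mu=0, sigma=2)
    beta_1 = pm.Normal("beta_1", mu=0, sigma=2)
    # A Poisson likelihood with a log link suits count outcomes; for binary grades we
    # would use pm.Bernoulli with an inverse-logit link (as above), and for ordinal
    # 1-10 scores something like pm.OrderedLogistic.
    rate = pm.math.exp(beta_0 + beta_1 * is_llm_grader)
    pm.Poisson("mistakes", mu=rate, observed=mistakes)
    idata_counts = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```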

Finally, we use a Bayesian approach (though others are possible), because it allows us to account for uncertainty – particularly where data is noisy or limited, which is not uncommon in LLM evaluations! This way, our model returns a distribution of possible values rather than overcommitting to specific point estimates.
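Returning to the pass/fail sketch above, plotting the posteriors makes this explicit: with only a handful of graded responses the highest-density intervals will be wide, and that uncertainty is reported rather than hidden behind a single number.

```python
import arviz as az

# Plot the full posterior for each coefficient from the pass/fail sketch above;
# wide highest-density intervals signal that the data cannot yet pin the effects down.
az.plot_posterior(idata, var_names=["beta_0", "beta_1"], hdi_prob=0.94)
```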

Adapting the GLM

The core advantage of using GLMs for autograder evaluation is that they can be modified to address a wide range of questions about autograder performance. The table below contains some examples, alongside their corresponding GLM implementations:

In our full paper, we walk through several illustrative examples of how the framework can be applied in practice.

We hope this framework can be useful to researchers using autograders for AI evaluations! We’ve made all models presented in the paper, as well as reproducible notebooks for all the examples provided, available on GitHub.