Why we use transcript analysis for our agent evaluations, and results from an early case study.
Calibrating AI risk assessment through rigorous elicitation practices.
Our new framework can assess the reliability of LLM evaluators while simultaneously answering a primary research question.
HiBayES: a flexible, robust statistical modelling framework that accounts for the nuances and hierarchical structure of advanced evaluations.
A Methodology for Evaluating Scientific Assistants
A common technique for quickly assessing AI capabilities is prompting models to answer hundreds of questions, then automatically scoring the answers. We share insights from months of using this method.