Science of Evaluations

Transcript analysis for AI agent evaluations

October 10, 2025

Why we use transcript analysis for our agent evaluations, and results from an early case study.

A structured protocol for elicitation experiments

July 16, 2025

Calibrating AI risk assessment through rigorous elicitation practices.

LLM judges on trial: A new statistical framework to assess autograders

July 9, 2025

Our new framework can assess the reliability of LLM evaluators while simultaneously answering a primary research question.

HiBayES: Improving LLM evaluation with hierarchical Bayesian modelling

May 12, 2025

HiBayES: a flexible, robust statistical modelling framework that accounts for the nuances and hierarchical structure of advanced evaluations.

Long-Form Tasks

December 3, 2024

A methodology for evaluating scientific assistants.