Why we use transcript analysis for our agent evaluations, and results from an early case study.
Calibrating AI risk assessment through rigorous elicitation practices.
Our new framework can assess the reliability of LLM evaluators while simultaneously answering a primary research question.
HiBayES: a flexible, robust statistical modelling framework that accounts for the nuances and hierarchical structure of advanced evaluations.
A Methodology for Evaluating Scientific Assistants