Why we use transcript analysis for our agent evaluations, and results from an early case study.
Calibrating AI risk assessment through rigorous elicitation practices.
Our new framework can assess the reliability of LLM evaluators while simultaneously answering a primary research question.
HiBayES: a flexible, robust statistical modelling framework that accounts for the nuances and hierarchical structure of advanced evaluations.
A Methodology for Evaluating Scientific Assistants
A common technique for quickly assessing AI capabilities is prompting models to answer hundreds of questions, then automatically scoring the answers. We share insights from months of using this method.