UKAISI at NeurIPS 2025

An overview of the research we’ll be presenting at this year’s NeurIPS conference.

Next week, we’ll be attending the 39th Annual Conference on Neural Information Processing Systems (NeurIPS) in San Diego. We’ll be presenting ten papers showcasing our work to ground AI security in rigorous empirical evidence, spanning domains from misuse safeguards to emerging loss-of-control risks.

We’re also hosting a full-day workshop in collaboration with the Eval Eval Coalition, taking part in an Agents Safety Panel with the Center for AI Safety, and more.

Improving the robustness of model safeguards

As AI models become more capable, it will become increasingly important to ensure malicious actors cannot use them to do harm. AISI’s Red Team stress-tests the safeguards that prevent AI models from being misused and works with developers to improve them. At NeurIPS, we’ll present research showing key limitations in defending against malicious fine-tuning of language models through public APIs.

A second paper, produced in collaboration with EleutherAI, shows that filtering out harmful data during training can be ten times more effective at resisting adversarial fine-tuning than post-training defences.

Advancing the science of AI evaluation

AI capabilities evaluations form a central part of the AI safety and security ecosystems. How can we ensure that they are tracking the real-world risks we care about?

We collaborated on a review of 445 language model benchmarks to analyse what they measure, how they measure it, and the claims that result. We found that popular benchmarks often do not reliably measure the phenomena they are intended to capture. Our paper contains eight key recommendations to address this problem. We also contributed to the Agentic Benchmark Checklist (ABC), a list of best practices for building rigorous agentic benchmarks.

We’ll be discussing these themes at our full-day workshop on December 8th, Evaluating AI in Practice, hosted in collaboration with the Eval Eval Coalition. You can register your interest here.

Resilience in the real world

The domains where AI can deliver the most benefit are often also those where failures can have the largest security consequences. Our work helps ensure that AI can be securely adopted across a wide range of use cases.

One such use case is software development. AI models display incredible capabilities in code generation and debugging – but can also introduce security risks when, for example, AI-generated code contains undetected vulnerabilities. To help measure this risk, we contributed to SeCodePLT, a new benchmark built on almost six thousand samples that sets a new standard for evaluating AI coding agents, helping to identify vulnerabilities before they are deployed at scale.

But benchmarks alone cannot capture the complexity of real-world use. To understand how agentic systems behave in realistic deployment scenarios, we partnered with Gray Swan AI to conduct the largest public red-teaming competition to date. Across more than 40 realistic scenarios, 2,000 participants were able to elicit over 60,000 policy violations – providing concrete evidence of where safeguards can be undermined, and what needs to be done to strengthen them.

We’ll be discussing the challenges of ensuring reliable and secure AI adoption at our Agents Safety Panel, co-hosted with the Center for AI Safety.

Understanding emerging loss-of-control risks

Beyond the risk of AI misuse lies the possibility of models themselves behaving in unintended ways. Maintaining control over increasingly autonomous AI systems must be a central focus as capabilities advance. At AISI, we’re conducting world-leading research to better understand this novel risk and develop mitigations.

We’ve developed RepliBench, a dedicated benchmark for tracking the capabilities that would be required for AI models to self-replicate, such as acquiring resources and exfiltrating their own model weights. It contains 20 agent evaluations comprising 65 tasks that explore the conditions under which replication-relevant behaviours might emerge. We’ve also designed a social deception game to measure how well AI models can engage in deceptive behaviour in pursuit of long-term goals.

Finally, controlling advanced AI systems may require a better understanding of how they work. We’re working to look inside the “black boxes” that are neural networks so we can more reliably predict and steer their behaviour. Sparse autoencoders (SAEs) are a popular tool for decomposing networks into isolated, understandable features – but a phenomenon we call “feature absorption” can get in the way. We’ll present work on the implications of feature absorption and outline some mitigations. We’ll also present a new framework drawing on the philosophy of science to better understand the inner workings of AI models at the NeurIPS 2025 Mechanistic Interpretability Workshop.
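For readers less familiar with the technique, the snippet below is a minimal, illustrative sketch of a sparse autoencoder in PyTorch: it reconstructs model activations through a wider, sparsely-firing feature layer. It is not the architecture or training setup used in our research, and the dimensions and sparsity penalty are placeholder values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: encode model activations into a wider,
    sparsely-firing feature space, then reconstruct the original activations."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

# Illustrative training step on random stand-in "activations" (placeholder sizes).
sae = SparseAutoencoder(d_model=512, d_features=4096)
optimiser = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)

optimiser.zero_grad()
features, reconstruction = sae(activations)
sparsity_penalty = 1e-3 * features.abs().sum(dim=-1).mean()  # L1 term encourages few active features
loss = nn.functional.mse_loss(reconstruction, activations) + sparsity_penalty
loss.backward()
optimiser.step()
```

The appeal of SAEs in interpretability work is that each learned feature should ideally correspond to a single, understandable concept; phenomena like feature absorption complicate that picture.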

If you’ll be at NeurIPS this year, you can find us at Booth 1343, where we’ll be happy to discuss our research. We’re looking forward to seeing you there!