This summer, several powerful open-weight AI systems were released. All are powerful reasoning models – and all have their parameters openly available for download on the internet. Reported evaluation results suggest that they are the most capable open-weight AI systems to date, knocking on the door of even today’s most advanced closed-weight systems.
Open-weight systems are the lifeblood of open research and innovation in AI. They increase transparency, allow for widespread red teaming, and decrease market concentration. However, open-weight models also pose unique risks. They can allow harmful AI capabilities to proliferate rapidly and irreversibly. Just as they can be modified for beneficial uses, they can also be modified for harmful ones. Safeguards designed to prevent harmful behaviours can also be quickly and cheaply removed from open-weight models.
In this post, we discuss existing methods and open problems for open-weight model risk management – including new research from the UK AI Security Institute in collaboration with EleutherAI and the University of Oxford.
Managing risks from open-weight models is challenging
If a model is deployed with closed weights, its deployer can draw from a rich toolkit for making it safer. For example, they can scaffold the model with safeguards such as content filters. They also fully control the points of access, allowing them to enforce acceptable usage policies.
Open-weight models, on the other hand, are harder to safeguard, because they can be shared and modified arbitrarily without oversight. This makes developing strong safety assurances harder. Risk management for open-weight general-purpose AI systems will therefore likely require a multi-pronged strategy to minimise harms.
Building the toolkit for monitoring and mitigating risks
While no current solutions can offer hard guarantees of safety, various strategies can meaningfully reduce risks, especially in combination. In the table below, we summarise techniques for monitoring and mitigating risks from open-weight models.
Techniques vary in their effectiveness and robustness, as well as the extent to which more research is needed to understand them. The ratings below represent best guesses of technical researchers on the Safeguards team at AISI.
Overall, we think two of the most important techniques are training data curation, which reduces the risk of releasing a model, and full-access audits, which more accurately measure that risk to inform release decisions. Other techniques are also beneficial, and future research could change this judgement.
Model-based strategies
Model-based defences against the misuse of open-weight models involve building safeguards directly into the model. Because these defences live in the weights, they cannot be easily separated from the model itself.
Training data curation: General-purpose AI systems are often trained on large quantities of web data. Filtering content related to harmful topics (such as cyber-offence or biothreats) from this data can help developers train models with minimal knowledge of those topics. Despite its conceptual simplicity, filtering web-scale data is challenging: aside from its direct costs, it is prone to filtration errors and can degrade overall dataset quality.
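As an illustration of the filtering step, the sketch below drops documents that match harmful-topic patterns before training. The patterns and function names here are hypothetical placeholders; production pipelines typically rely on trained classifiers and richer topic taxonomies rather than keyword rules.

```python
import re

# Hypothetical blocklist for illustration only; real filters use trained
# classifiers and far more nuanced topic taxonomies.
BLOCKED_PATTERNS = [
    re.compile(r"\bsynthesi[sz]e\s+(nerve agent|pathogen)s?\b", re.I),
    re.compile(r"\bzero[- ]day exploit development\b", re.I),
]

def is_safe(document: str) -> bool:
    """Return False if a document matches any harmful-topic pattern."""
    return not any(p.search(document) for p in BLOCKED_PATTERNS)

def filter_corpus(documents):
    """Keep only documents that pass the safety filter; report how many
    were dropped, since over-filtering degrades dataset quality."""
    kept = [d for d in documents if is_safe(d)]
    return kept, len(documents) - len(kept)

docs = [
    "A history of the printing press.",
    "How to synthesise nerve agents at home.",
]
kept, dropped = filter_corpus(docs)
print(len(kept), dropped)  # 1 1
```

Tracking the drop rate matters in practice: an aggressive filter that removes benign documents imposes exactly the dataset-quality cost the paragraph above describes.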
Some developers have reported investigating filtering certain harmful content from pretraining data (e.g., Google DeepMind; OpenAI; Meta), but details about current methods, how they are evaluated, and their effectiveness are not publicly known.
Recently, a research collaboration between UK AISI, EleutherAI, and the University of Oxford showed removing harmful data from an LLM’s training data can be over 10x more effective at resisting adversarial fine-tuning than defences added after training, without incurring significant performance costs. Recent work from Anthropic similarly found that pretraining filtering reduced harmful capabilities while maintaining beneficial ones.
Despite these proofs of concept, there are still major open problems related to data curation. They include understanding the relationship between training data contents and emergent harmful capabilities, exploring techniques for teaching models incorrect information about harmful topics, and understanding the mechanisms that underpin ignorance in models.
Tamper-resistant fine-tuning: Techniques for fine-tuning and configuring safer general-purpose AI systems are well-precedented. Some fine-tuning techniques have been developed to actively suppress harmful capabilities and make models resist harmful tampering. Unfortunately, current methods can be easily undone using just dozens of training examples in a matter of minutes. A major frontier for future progress will be tamper-resistant fine-tuning techniques.
Model provenance: A key goal for monitoring the open-weight model ecosystem is to have techniques for identifying and tracing models in the wild. These include methods for studying model ‘heritage’ and ‘watermarking’ the weights of models. These techniques can be circumvented with some effort, but they could nonetheless be useful, much like fingerprinting in criminal forensics. Future work on benchmarking and refining model forensics techniques will be useful, but the main barrier to adoption is likely to be implementation and integration with model distribution platforms.
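One simple family of forensic techniques fingerprints a model by hashing coarse statistics of its weights, so that lightly fine-tuned derivatives still match the original. The sketch below is an illustrative heuristic under assumed inputs (plain Python lists standing in for weight tensors), not a robust provenance scheme: a determined adversary can perturb weights enough to evade it.

```python
import hashlib
import statistics

def weight_fingerprint(layers, precision=1):
    """Hash coarsely rounded per-layer weight statistics. Rounding makes
    the fingerprint robust to the small weight changes typical of light
    fine-tuning (a heuristic, not a guarantee)."""
    summary = []
    for name, weights in sorted(layers.items()):
        mean = round(statistics.fmean(weights), precision)
        std = round(statistics.pstdev(weights), precision)
        summary.append(f"{name}:{mean}:{std}")
    return hashlib.sha256("|".join(summary).encode()).hexdigest()[:16]

base = {"layer0": [0.12, -0.30, 0.55], "layer1": [0.01, 0.02, -0.04]}
# A lightly fine-tuned derivative: each weight perturbed slightly.
derived = {"layer0": [0.121, -0.299, 0.552], "layer1": [0.011, 0.019, -0.041]}

print(weight_fingerprint(base) == weight_fingerprint(derived))  # True
```

Like fingerprinting in criminal forensics, this kind of match is probabilistic evidence of heritage rather than proof, which is why benchmarking such techniques is an open research direction.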
Scaffolding-based strategies
Scaffolding-based defences involve releasing models alongside external safeguards. These safeguards can be trivially disabled. However, they can be useful for protecting against accidental harm, protecting against harmful use by unsophisticated actors, and offering safeguards for downstream developers to build into products.
Data provenance: Data provenance techniques include imperceptible watermarks and metadata attached to AI-generated content. These techniques can be circumvented and removed. Nonetheless, they are empirically valuable for studying the spread of AI-generated content, including from open-weight models. Data provenance methods are already used in many cases, but are not yet ubiquitously adopted.
Monitoring and intervention tools: Monitoring and intervention are central strategies for AI risk management. Tools to improve the safety of system inputs, system outputs, chains of thought, and/or internal cognition are helpful and well-precedented. For example, developers sometimes release open-weight generative AI systems with content filters.
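The content-filter pattern mentioned above can be sketched as a thin wrapper around a generation function. Everything here (the patterns, the `toy_model` stand-in) is hypothetical; the point is structural: because the safeguard lives outside the weights, anyone with the weights can strip it, so it mainly protects against accidental harm and unsophisticated misuse.

```python
import re

# Hypothetical unsafe-output patterns; production filters typically use
# trained classifiers rather than keyword rules.
UNSAFE = re.compile(r"\b(step-by-step bioweapon|exploit payload)\b", re.I)

def filtered_generate(model_fn, prompt: str) -> str:
    """Wrap an arbitrary generation function with an output filter.
    The filter is scaffolding: it sits outside the model and can be
    trivially removed by anyone who holds the weights."""
    output = model_fn(prompt)
    if UNSAFE.search(output):
        return "[response withheld by content filter]"
    return output

def toy_model(prompt):  # stand-in for a real open-weight model
    return f"Here is an exploit payload for {prompt}"

print(filtered_generate(toy_model, "the login service"))
# [response withheld by content filter]
```

Downstream developers can adopt this same wrapper shape around whatever open-weight model they deploy, which is one reason releasing safeguards alongside weights remains worthwhile despite their removability.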
Procedural strategies
Aside from the technical strategies outlined above, several broader strategies can be uniquely useful for risk management of open-weight systems.
Full-access audits: Input-output testing for open-weight systems is useful, but it does not allow auditors to assess risks involving modifications to the model. Full model access enables more rigorous estimates of worst-case harm, especially for open-weight systems. Given the risk of adversarial fine-tuning, evaluating a model's worst-case capabilities after such fine-tuning is a crucial step before open-weight release.
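The audit logic above can be sketched as a simple worst-case loop. The functions `fine_tune` and `evaluate_harm` below are placeholder stubs (real audits run gradient-based fine-tuning and harmful-capability benchmarks), and the scores are arbitrary illustrative numbers; the structural point is that release decisions should be driven by the maximum harm score across attacks, not the unmodified model's score.

```python
def fine_tune(model, attack):
    # Placeholder: a real audit would run gradient-based fine-tuning here.
    return {**model, "attack": attack}

def evaluate_harm(model):
    # Placeholder: a real audit would run harmful-capability benchmarks.
    return model.get("base_harm", 0) + model.get("attack", {}).get("uplift", 0)

def worst_case_audit(model, attacks):
    """Return the highest harm score observed across fine-tuning attacks,
    including the unmodified model as a baseline."""
    scores = [evaluate_harm(fine_tune(model, a)) for a in attacks]
    return max([evaluate_harm(model)] + scores)

model = {"base_harm": 10}  # harm score on a 0-100 scale (illustrative)
attacks = [{"name": "refusal-removal", "uplift": 35},
           {"name": "domain-finetune", "uplift": 20}]
print(worst_case_audit(model, attacks))  # 45
```

Here the unmodified model scores 10, but the refusal-removal attack lifts it to 45, so 45 is the number that should inform the release decision.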
Transparency and documentation: Openly releasing AI systems alongside detailed documentation offers researchers a greater ability to contextualise their study of the system and its downstream impacts.
Staged deployment: AI model deployment is not a binary process. There is a continuum of deployment options ranging from fully closed to fully open. Rather than openly releasing a system all at once, AI developers can choose to release it in stages and monitor its uses and impacts at each stage before a final full release.
Know your customer (KYC): KYC strategies focus on helping developers gather information on their users to help monitor a system’s usage and enable access to a system to be granted selectively.
Pulling and replacing systems: Once a system is released with open weights, it cannot be rolled back. However, halting download access can still slow proliferation. Developers can also work to accelerate the obsolescence of a harmful system by replacing a pulled system with a safer version.
Not deploying high-risk systems with open weights: Avoiding the open-weight release of a system can be a strong defence against many of the downstream risks it would pose.
Towards a rigorous science of open-weight model risk management
The open-weight model ecosystem is constantly changing, and so is our understanding of tools and best practices for managing the risks of these models. In this post, we provided an overview of current practices. However, the field of open-weight model risk management is still new. To continue building our collective understanding, we emphasise the value of collaboration and openness.
We’ve written an example checklist of questions that could summarise the risk management strategy behind the open-weight release of a general-purpose AI system.
Open science and open reporting will be the key to building the emerging science of open-weight model risk management. AISI looks forward to continued collaboration and partnerships in this space.