In February 2026, we internally estimated that the length of cyber tasks AI models could complete had doubled every 4.7 months since late 2024 – already an acceleration from our November 2025 estimate of 8 months. Since then, AISI reported on two new models, Claude Mythos Preview and GPT-5.5, which substantially exceeded both doubling rate trends. It is unclear whether this represents a new, faster trend.
We track the rate of cyber progress to help inform how governments prepare for frontier AI. We then work with other organisations, such as the NCSC, which publish advice to businesses. While evaluations are an imperfect measure of AI’s real-world impact, the current rate of change indicates a growing potential for AI cyber capabilities to translate into tangible risks – risks UK organisations will need to navigate in the coming months.
This blog post includes our latest results from GPT-5.5 and Claude Mythos Preview. Since our blog post describing our pre-deployment testing of Mythos Preview, we have received access to a newer checkpoint. This checkpoint delivered stronger cyber results than the previous version, including the first completion of both of our cyber ranges.
Cyber Time Horizons
Time horizon benchmarks track the length of tasks AI models can complete, measured against the time human experts would take on those same tasks. They are inexact predictors of performance; AI struggles with some tasks humans do quickly, and easily completes others that humans find hard. However, we use this type of benchmark because it offers a measure of AI autonomy from which we can draw trends.
At AISI, we assign each task in our narrow cyber suite an estimate for how long it would take a cyber expert to complete.1 Tasks in our narrow suite require models to identify and exploit cybersecurity weaknesses in target systems, testing skills such as reverse engineering and web exploitation in self-contained setups. These tasks cover only some of the capabilities relevant to real-world cyberattacks.
For over a third of our tasks, we timed how long human experts took to complete them as a baseline. The remainder use expert estimates for completion times, rather than empirical baselines; these could be over- or underestimates. By calculating frontier models’ success rates across all tasks in our narrow cyber suite, we can estimate the length of task they complete with a given likelihood of success – in this post we focus on a success threshold of 80%.
We deliberately constrain our setup to only 2.5M tokens per task to make results comparable over time. This understates what frontier models can do. We discuss this decision further below.
Altogether, the full interpretation of an example time horizon result is: “We estimate that in our testing setup, with 2.5M tokens per task, Claude Sonnet 4.5 would succeed 80% of the time at cyber tasks taking human experts 16 minutes, so long as they are similar to those in our narrow cyber suite.”
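To make the estimation step concrete, the sketch below shows one way an 80%-reliability time horizon can be derived from per-task results: fit a logistic curve relating success probability to the log of expert completion time, then read off the task length at which the fitted probability crosses 80%. This is an illustrative reconstruction rather than our exact pipeline, and the task durations and success rates in it are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: expert completion time (minutes) for each task in a
# suite, and the model's observed success rate on that task.
task_minutes = np.array([2.0, 5.0, 8.0, 15.0, 30.0, 60.0, 120.0, 480.0])
success_rate = np.array([1.0, 1.0, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1])

def neg_log_likelihood(params, minutes, rates):
    """Logistic model: P(success) falls off with log2(task length)."""
    intercept, slope = params
    logits = intercept - slope * np.log2(minutes)
    probs = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-9, 1 - 1e-9)
    # Cross-entropy between observed success rates and fitted probabilities.
    return -np.sum(rates * np.log(probs) + (1 - rates) * np.log(1 - probs))

fit = minimize(neg_log_likelihood, x0=np.array([2.0, 0.5]),
               args=(task_minutes, success_rate))
intercept, slope = fit.x

# Task length at which the fitted success probability equals 80%.
logit_80 = np.log(0.8 / 0.2)
horizon_80 = 2.0 ** ((intercept - logit_80) / slope)
print(f"Estimated 80%-reliability time horizon: {horizon_80:.1f} minutes")
```

In practice, the per-task success rates would come from repeated runs under the 2.5M token limit described above.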
To analyse the pace of capability progress, we fit an exponential curve to historical results for the most capable models at any given time. This is an imperfect model, and is not a future prediction, nor a fixed law.
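As a rough illustration of that curve-fitting step (using made-up numbers rather than our actual results), a doubling time can be recovered by regressing the log of the best available time horizon at each date against time:

```python
import numpy as np

# Illustrative data only: months since late 2024 at which new frontier
# models appeared, and the best 80%-reliability time horizon (minutes)
# available at each of those dates.
months = np.array([0.0, 4.0, 8.0, 12.0, 16.0])
best_horizon_minutes = np.array([3.0, 5.0, 10.0, 16.0, 40.0])

# Exponential growth is linear in log2: the slope is doublings per month.
slope, intercept = np.polyfit(months, np.log2(best_horizon_minutes), 1)
doubling_time_months = 1.0 / slope
print(f"Implied doubling time: {doubling_time_months:.1f} months")
```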
Cyber Time Horizons Results
In February 2026, we estimated that frontier models’ 80%-reliability cyber time horizon had doubled every 4.7 months since reasoning models emerged in late 2024, given a 2.5M token limit. This was around half our November 2025 doubling time estimate, which was 8 months for both 50% and 80% reliability. Claude Mythos Preview and GPT-5.5 have since significantly outperformed this trend. At the time of writing, it is unclear whether Mythos Preview and GPT-5.5 represent an isolated break from existing rates of progress or are part of a new, faster trend.
Mythos Preview and GPT-5.5 have large upper-bound error bars due to near-100% success rates on our narrow cyber suite’s longest tasks, even with the 2.5M token limit.2 Our tasks are also not long enough to determine how sharply the models’ reliability would deteriorate at higher task lengths. This places some of the latest models at the limit of what our narrow test suite can measure.
Without the 2.5M token cap, success rates are so high that time horizons become impossible to calculate. The cap, alongside our use of a simple agent scaffold, artificially lowers success rates and understates what models can do with more tokens and stronger scaffolds. In return, it ensures time horizons are measurable and can be compared across models. For reference, a 2.5M token limit is relatively low – in our cyber range experiment we use up to 100M tokens and find that performance would likely still improve beyond that budget, especially for recent models, which disproportionately benefit from higher token limits.
Several other factors make doubling times uncertain. Longer time horizon estimates rely on only six tasks with durations of eight hours or more; a bigger sample could include long tasks that AI models find harder or easier than those we currently evaluate, reducing or increasing time horizon estimates.
Human baselines are also imperfect – different experts could be faster or slower than the ones we timed – and we have only a few human baselines for the six longest tasks. Nevertheless, we believe human baselines are valuable as they provide a more objective measure of task difficulty than alternative metrics.
Another source of uncertainty is that time horizon estimates are fit on only a small number of models (even fewer for historical estimates). Though this is a concern, our evidence suggests the trend does not hinge on any individual model: removing any single model shifts the pre-Mythos doubling time estimate only within a range of 4.1 to 5.0 months.
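The sketch below illustrates the kind of leave-one-out check described above, reusing the illustrative trend fit from earlier; the model names and values are hypothetical rather than our actual data.

```python
import numpy as np

# Hypothetical points on the frontier: (model, months since late 2024,
# 80%-reliability time horizon in minutes).
models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
months = np.array([0.0, 4.0, 8.0, 12.0, 16.0])
horizons = np.array([3.0, 5.0, 10.0, 16.0, 40.0])

def doubling_time(x, y):
    """Doubling time implied by a log-linear fit of horizon against time."""
    slope, _ = np.polyfit(x, np.log2(y), 1)
    return 1.0 / slope

# Refit the trend with each model excluded in turn to check how much the
# estimate depends on any single data point.
for i, name in enumerate(models):
    keep = np.arange(len(models)) != i
    print(f"Excluding {name}: doubling time = "
          f"{doubling_time(months[keep], horizons[keep]):.1f} months")
```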
Further Evidence of Cyber and Software Autonomy
Our latest doubling time estimates are close to those produced by METR, a research non-profit that estimates time horizons for software engineering – a skillset related to cyber, but broader. Their results imply a consistent doubling time of 4.2 months on software tasks since late 2024.3
We have also observed further evidence of cyber autonomy beyond our narrow task suite. AISI’s cyber ranges (shown below) measure AI models’ ability to complete cyberattacks against small, undefended enterprise networks, where initial access has already been gained. Each cyber range requires sustained planning and execution capability; more detail on them can be found in our recent paper.
In AISI’s latest testing, the newer Mythos Preview checkpoint completed both our cyber ranges, solving the range “The Last Ones” in 6 of 10 attempts and the previously unsolved “Cooling Tower” in 3 of 10 attempts. This was the first time that a model completed the second of our two cyber ranges. GPT-5.5 solved “The Last Ones” on 3 of 10 attempts.
These results utilise a newer Mythos Preview checkpoint than that included in previous AISI reporting. Notable capability jumps do not always require new model releases: later iterations of the same model can also meaningfully change our estimates of frontier capabilities.

Implications
No single benchmark result should be read as a precise measure of AI capability. Time horizon estimates carry genuine uncertainty; the longest tasks in our narrow suite have the fewest human baselines, and it is too early to tell whether the step-change from recent models is representative of a new ongoing (or accelerating) pace. Regardless, the direction of change and rapid growth have been consistent across the models, methodological choices and independent data we examined.
Frontier AI's autonomous cyber and software capability is advancing quickly: the length of cyber tasks that frontier models can complete autonomously has doubled on the order of months, not years. What this evidence does not tell us is how the pace of progress will evolve, when AI will reach any particular capability threshold, or how these capabilities will translate against defended, real-world systems.
Stronger AI cyber capabilities are already producing tangible opportunities and risks. Cyber defenders have reported significant advances in vulnerability discovery using recent models, and access to today's controlled capabilities may diffuse over time. The time to invest in strong security baselines is now. Frontier AI can strengthen attackers as well as defenders, and there is a critical window to build resilience. The National Cyber Security Centre recently published advice on using AI models to find vulnerabilities.
If the current rate of AI progress persists (or accelerates), AI cyber capabilities will remain a fast-moving target to track. We are developing tougher cyber evaluations to keep pace: new cyber ranges, enhancements to existing ones, and the addition of active cyber defences to better reflect real-world conditions. We will continue to evaluate frontier autonomous cyber and software capabilities, and to update our estimates as the evidence develops.
1. AISI’s narrow cyber suite is used to estimate time horizons, as it has a large body of tasks with human baselines for completion time. AISI also evaluates models on cyber ranges, which are more complex than the narrow cyber suite. Given that we currently only have two cyber ranges, they are not used to estimate time horizons.
2. With a 2.5M token cap, GPT-5.5 achieves a 100% success rate on five of the six tasks estimated at over 8 hours; it solves the sixth on every attempt when the cap is removed. Mythos Preview completes all six long tasks 100% of the time with the 2.5M token cap.
3. We estimate METR's 80%-reliability doubling time for frontier models from o1-preview onward. Excluding Mythos Preview, the estimated doubling time is 4.2 months; when included, this accelerates slightly to 4 months. METR’s evaluations use a slightly different selection of models than AISI’s cyber evaluations.