Mapping the limitations of current AI systems

Takeaways from expert interviews on barriers to AI capable of automating most cognitive labour.

Several leading AI companies have the stated goal of building AI systems that match or surpass humans across most cognitive domains. Such systems could have transformative effects, such as increasing productivity and accelerating scientific research in critical areas including energy and medicine. However, experts warn that they could also pose national security threats if not reliably aligned with human intent, and could cause disruptive labour market effects.

Among AISI’s priorities is to work closely with a range of experts and developers to understand AI trajectories. In a new report, we track progress towards hypothetical, future AI systems that would be capable of automating most cognitive labour. Today’s Large Language Models (LLMs) are impressive but fall short of this standard in several important respects. What barriers remain?

Our new report poses this question to experts both internal and external to AISI. Based on these interviews, a literature review, and a workshop with the AISI Research Unit, the report identifies eight limitations of current AI systems, alongside evidence that would indicate progress towards overcoming them.

This blog post provides a high-level summary of our findings.  

Why we wrote this report

While the timing and likelihood of AI systems capable of automating most cognitive labour are both subjects of live debate, their consequences could be extremely significant, meaning that tracking progress towards them is critical. We use the common industry term “Artificial General Intelligence” (AGI) as a shorthand to describe such systems in our interviews and in the analysis below.

Despite the report’s focus on this particular milestone, we recognise that many impacts of highly capable AI systems may emerge gradually rather than arriving all at once. It should also be noted that AI systems with the technical capabilities required to automate most cognitive labour may be developed some considerable time before widespread automation actually occurs – there is often a lag between the development and the adoption of new technologies.

Some experts believe that continued scaling of LLMs will be sufficient to produce AGI, while others believe that fundamental paradigm shifts will be required. For our report, we largely interviewed experts who take the former view, to analyse barriers to AGI development in the relatively near future. Our report gives precedence to this scenario and so should not be read as representing the full spectrum of expert opinion.

Limitations of current AI systems

Our report identifies several indicators of progress against the limitations of current AI systems, and the rest of this post provides a high-level overview. Though progress has been made against the limitations we highlight, further advances will be required on all fronts before AI systems can reliably automate most cognitive labour. We emphasise that we are not making a normative claim about whether the automation of most cognitive labour would be desirable, but rather examining the technical limitations of current systems across relevant capabilities.

Performance on tasks that are hard to verify

AI systems already perform at the level of human experts in certain verifiable domains, where performance can be easily assessed, such as mathematics and coding, but they still struggle with other tasks that would be required for automating most cognitive labour. When it’s straightforward to check whether a model submitted the correct solution to a problem, researchers can more easily generate a robust reward signal that can be used to improve performance further. But widespread automation of labour requires strong performance in a wider range of domains than the easily verifiable. For example, the consequences of many real-world strategic decisions can take a long time to manifest and are difficult to attribute to particular actions and interventions. Decisions can also require aesthetic or intuitive judgements that are harder to rate objectively.

While current LLMs are, of course, not limited to verifiable domains, their performance beyond these domains is lagging. Evidence of progress could include officially adjudicated wins in competitions that cannot be automatically graded, such as certain essay prizes.
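To make the verification point concrete, the sketch below is an illustrative example (not drawn from our report) of why verifiable tasks are easier to optimise for: an exact-match grader yields a cheap, automatic reward signal, whereas a strategy memo has no equivalent check.

```python
# Illustrative sketch only: the graders below are hypothetical examples,
# not methods described in the AISI report.

def reward_maths_answer(model_answer: str, ground_truth: str) -> float:
    """Verifiable task: an exact-match check gives a cheap, automatic reward signal."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def reward_strategy_memo(model_answer: str) -> float:
    """Hard-to-verify task: quality depends on judgement and on outcomes that may
    take months to observe, so there is no simple automatic check to call here."""
    raise NotImplementedError("No reliable automatic grader exists for this task.")

# A training loop can call reward_maths_answer millions of times to refine a model;
# for the memo, the missing grader is precisely the bottleneck described above.
print(reward_maths_answer("42", "42"))  # prints 1.0
```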

Performance on tasks that take people a long time

The widespread automation of cognitive labour would require AI systems that can act reliably over long time-horizons. Many real-world tasks involve staying on track for hours, days, or even weeks at a time, self-correcting along the way. Humans currently maintain an advantage over AI systems in completing tasks that take more than a few hours – but this could change soon. Research by Model Evaluation and Threat Research (METR) shows that the length of software engineering tasks AI systems can complete is doubling approximately every seven months. Extrapolating this trend predicts models that will be able to complete month-long tasks (with 50% reliability) by 2030.
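As a rough illustration of that extrapolation, the sketch below projects the doubling trend forward. The starting horizon, reference date, and definition of a “month-long” task are assumptions chosen for the example, not figures from METR or from our report.

```python
import math
from datetime import date

# Back-of-the-envelope projection of the ~7-month doubling trend described above.
# The starting values are illustrative assumptions, not figures from the report.
doubling_months = 7                 # METR's reported doubling time for task length
start_horizon_hours = 2.0           # assumed 50%-reliability task horizon today
reference_date = date(2025, 1, 1)   # assumed reference point for "today"
target_hours = 4 * 40               # "month-long" task read as ~4 working weeks

doublings_needed = math.log2(target_hours / start_horizon_hours)
months_needed = doublings_needed * doubling_months
year_crossed = reference_date.year + (reference_date.month - 1 + months_needed) / 12

print(f"~{doublings_needed:.1f} doublings, ~{months_needed:.0f} months")
print(f"month-long tasks reached around {year_crossed:.0f}")
```

Small changes to the assumed starting horizon shift the crossing point by a year or two in either direction, which is worth bearing in mind when reading any such extrapolation.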

Tracking progress on benchmarks like METR’s will be essential as models develop, as well as measuring how well this trend generalises beyond software engineering. Monitoring model chains-of-thought for signs of “executive functioning” (such as deliberate efforts to stay on task) and evaluating the effectiveness of these reasoning tactics could also provide useful evidence.

Performance in complex environments

People performing real-world labour must do so in messy and complex environments, where they must communicate with others, confront unexpected obstacles and prioritise tasks. In-the-wild evaluations suggest that AI systems currently struggle in these more realistic environments. Examples include Anthropic’s Project Vend, which placed their flagship Claude model in charge of running a vending machine, and AI Village, in which teams of agents are given challenges such as raising money for a charity. These experiments have tended to reveal surprising performance limitations. For example, the Claude-operated vending machine did not turn a profit.

Field tests like these provide evidence about the abilities of AI systems to operate effectively outside of tightly controlled environments – a necessary precondition for AGI.

Reliability

AI systems occasionally make errors, such as hallucinating false information. These errors can decrease willingness to deploy them in high-stakes contexts and can also degrade their performance on long tasks (since small errors in many sequential steps can compound to create larger failures). Not only are AI systems sometimes wrong, but they often appear confidently wrong, suggesting that they lack a high degree of ‘meta-awareness’ about the extent of their own knowledge. This is an important shortcoming, since real-world decision-making often involves taking calculated risks or forecasting the future. That said, AI hallucinations may not always reveal capability limitations. They could instead be a result of training objectives that incentivise confident claims, or even of purposeful deception, which has been detected in several frontier models.
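The compounding effect is easy to quantify. The sketch below uses illustrative per-step accuracy figures, chosen for the example rather than measured from any model, to show how quickly reliability degrades over long sequences of steps.

```python
# Illustration of error compounding over sequential steps. The accuracy figures
# are assumptions for the example, not measured error rates of any model.
for per_step_accuracy in (0.99, 0.999):
    for steps in (10, 100, 1000):
        run_success = per_step_accuracy ** steps
        print(f"accuracy {per_step_accuracy}, {steps:>4} steps: "
              f"~{run_success:.1%} chance of an error-free run")
```

Even 99.9% per-step accuracy leaves only around a one-in-three chance of completing a thousand-step task without error, which is why small reliability gains matter so much for long-horizon work.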

There are many benchmarks that can be used to track AI reliability. These include benchmarks involving long serial reasoning problems (RE-Bench, HCAST) and those measuring hallucinations (HaluEval, HalluLens, HHEM).

Adaptability

Real-world work demands a high level of adaptability, or ‘learning on the job’. This requires digesting a lot of context about the nature of the role, its overarching objectives, and so on. Improvements here could come from expanding, or making better use of, models’ context windows (how much information they can process together at any one time), or from methods for rapid, efficient adaptation of model weights to specific use cases.

Some experts we interviewed thought that AGI would not be possible without ‘continual learning’, where models continue to improve after deployment by gaining real-world experience. This is an important aspect of human intelligence, but arguably not one that current AI systems have a natural mechanism for implementing autonomously.

There are existing benchmarks that test the contextual awareness of LLMs, such as LoCoMo and LongMemEval. Advances in techniques that allow models to adapt their own weights in response to new knowledge or examples (such as Self-Adapting Language Models) could serve as a leading indicator of progress on continual learning, especially insofar as these techniques move beyond lab demos and into real-world applications.

Original insight

Most experts we interviewed agreed that failure to generate original insights of scientific value was a major shortcoming of current AI systems. Efforts to build AI agents that produce scientific papers, for example, have generally resulted in products that recycle existing ideas or pursue tangential, uninteresting hypotheses. This could prove to be a significant barrier to AGI – both because much real-world labour benefits from original insights, and because some experts believe that AI systems assisting with or automating AI research itself is one of the most likely paths to AGI.

There are several signs that would indicate AI systems are developing this capability, such as higher validation rates for AI-generated hypotheses, or AI-written papers being accepted into top-tier journals.

Our report concludes that significant advances towards AI capable of automating most cognitive labour have been made in each category we studied, but that barriers remain in all of them, and that it is uncertain how easily these barriers will be overcome.

Going forward, we will continue to track progress towards more powerful AI systems closely. The trajectory of AI development is uncertain, and new, unforeseen bottlenecks may emerge. Nonetheless, we hope that the indicators provided in our report will serve as useful tools for the broader AI safety and national security communities to monitor and forecast AI capabilities.