AI alignment is a human problem

Authors

Konstantinos Voudouris, Xavier Roberts-Gaal, Marie Buhl, Geoffrey Irving, Christopher Summerfield

Blog post

Abstract

The challenge of ensuring that artificial intelligence (AI) systems behave in ways that humans prefer is known as the alignment problem. This problem is urgent because we are increasingly delegating consequential tasks to AI agents. The pace, scale, and opacity with which modern AI systems act make supervising them a formidable challenge: they can solve expert-level problems but exhibit unintuitive failure modes, and remain unconstrained by the social norms and institutional structures that make delegation to human agents tolerably safe. We argue that AI alignment is largely a human problem, and that tools from the cognitive and social sciences have a major part to play in solving it. Human judgements are a critical component at every stage of the modern alignment training pipeline, from curating data and learning from our preferences, to auditing AI systems and overseeing them while they complete tasks for us. However, the cognitive and social properties of these judgements remain strikingly under-examined. The central challenge for alignment is to amplify the human supervisory signals on which alignment training depends, so that they remain reliable as AI systems produce ever more behaviour to monitor and increasingly operate beyond unaided human expertise. We argue that this requires a research programme focused on five bottlenecks in human supervision: biased and context-sensitive judgement, plural and contested values, limited attention and throughput, limited counterfactual reasoning, and expertise-limited verification.

