
Author

Malik James-Williams

Key Concepts

  • ai
  • alignment
  • systems

6 min read


The Alignment Problem Is a Human Problem

AI alignment research assumes humans can specify what they want. Behavioural science says otherwise.

The alignment problem isn't what you think it is

The standard version goes like this: how do we build AI systems that do what we want them to do? Entire research programmes are built around it. RLHF trains models using human feedback. Constitutional AI gives models explicit rules to follow. Reward modelling tries to learn human preferences from examples.

All of this work assumes the same thing: that there's a coherent, stable "human intent" to align to.

There isn't. That's the problem. Not just a technical problem. A self-knowledge problem.

The assumption nobody questions

OpenAI's original RLHF paper describes the goal as learning "the user's intended task" from feedback [1]. Anthropic's Constitutional AI work talks about training models that are "helpful, harmless, and honest" according to a set of principles written by researchers [2]. Both frameworks assume something that sounds obvious until you examine it: that humans can reliably tell you what they want, and that those reports are stable enough to train a system on.
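To see what that assumption does in practice, consider how reward models are typically fit: a Bradley–Terry-style model learns a reward gap from pairwise comparisons, so that the probability annotators prefer output A over output B is a sigmoid of the gap. Here's a minimal pure-Python sketch (the function names and the annotator counts are illustrative, not from any real training run) showing what happens to the learned gap when annotators agree versus when they split:

```python
import math

def bt_prob(gap):
    # Bradley-Terry: probability that output A is preferred over B,
    # given the reward gap r_A - r_B
    return 1.0 / (1.0 + math.exp(-gap))

def fit_reward_gap(wins_a, wins_b, lr=0.1, steps=2000):
    # Fit the reward gap by gradient ascent on the log-likelihood
    # of the observed pairwise preference labels.
    gap = 0.0
    total = wins_a + wins_b
    for _ in range(steps):
        p = bt_prob(gap)
        grad = wins_a * (1 - p) - wins_b * p
        gap += lr * grad / total
    return gap

# Annotators who mostly agree (90 of 100 prefer A) yield a large gap;
# annotators who split 50/50 yield a gap of zero -- the model learns
# that there is no preference to speak of.
agree_gap = fit_reward_gap(90, 10)
split_gap = fit_reward_gap(50, 50)
```

The point of the sketch: inconsistent labels don't just add noise, they flatten the signal the reward model is supposed to capture.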

Behavioural science spent fifty years dismantling that assumption.

Human preferences aren't fixed. They shift with framing, context, mood, and the order in which options are presented. This isn't a controversial claim. It's one of the most replicated findings in psychology and economics.

Kahneman and Tversky demonstrated this in the 1970s and 80s with preference reversal experiments. Present the same choice in two different ways and people pick different options. Not occasionally. Reliably and predictably [3]. The framing doesn't have to be manipulative. It just has to be different.

When you collect human feedback to train a model, which version of the human's preferences are you capturing? The one who answered thoughtfully on a Tuesday morning? The one who was rushing through a batch of comparisons before lunch? The one whose answer would have been different if the question had been phrased slightly differently?

What we say versus what we choose

Stated and revealed preferences frequently diverge. People say they value privacy but hand over personal data for a 10% discount. Companies say they value innovation but systematically punish the risk-taking it requires. And voters reliably reward politicians who tell good stories over ones who cite evidence.

This isn't hypocrisy. It's how human cognition works. We hold multiple, contradictory preferences simultaneously, and whichever one surfaces depends on context. Psychologists call this "constructed preference" — the idea that many of our preferences don't exist until the moment we're asked to express them [4].

Then there's the time problem. We routinely make choices our future selves will regret. We know we should save more, exercise more, eat better. We can articulate these preferences clearly. And then we don't act on them. Behavioural economists call this "time-inconsistent preferences." Philosophers call it akrasia. The label doesn't matter. We can't even align our present behaviour with our own stated values, let alone specify those values for a machine.
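Time-inconsistent preferences have a standard formal model: hyperbolic discounting, where the perceived value of a reward falls off as 1/(1 + k·t) with delay t. A small illustration (the discount rate k = 0.2 per day is an arbitrary illustrative value, not an empirical estimate) shows the same person preferring the smaller reward when it's immediate and the larger one when both are a month away:

```python
def discounted_value(amount, delay_days, k=0.2):
    # Hyperbolic discounting: perceived value of a delayed reward.
    # k is an illustrative discount rate chosen so the reversal appears.
    return amount / (1 + k * delay_days)

# Offered today: $100 now feels better than $110 tomorrow...
prefers_smaller_now = discounted_value(100, 0) > discounted_value(110, 1)

# ...but push both options a month out and the preference reverses:
# $110 in 31 days now beats $100 in 30 days.
prefers_larger_later = discounted_value(110, 31) > discounted_value(100, 30)
```

Nothing about the person changed between the two questions except the delay, yet the ranking flips. That flip is exactly the instability a preference dataset inherits.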

If the target you're training toward is unstable, context-dependent, and frequently contradicts itself, the problem goes beyond technical difficulty. It's definitional confusion about what "aligned" even means.

The prompt gap

You don't need to read academic papers to see this play out. You just need to use an AI tool.

Here's a pattern most people who work with AI will recognise. You write a prompt. You think about what you want. You craft the instruction carefully. The model gives you exactly what you asked for. And it's wrong. Not because the model failed. Because you didn't actually know what you wanted until you saw the wrong answer.

I've done this hundreds of times. Asked for a "concise summary" when I actually wanted an opinionated analysis. Asked for "professional tone" when I meant "direct but warm." The prompt wasn't bad. My understanding of my own preferences was incomplete. The model held up a mirror and I didn't recognise what I'd asked for.

That's the alignment problem at human scale. One person, one prompt, one misunderstanding. Annoying but fixable — you iterate and get closer. Now multiply it. RLHF collects thousands of these preference signals from hundreds of annotators, each bringing their own version of "what I want right now." The individual prompt gap is a rounding error. The aggregate prompt gap is what alignment research is trying to solve, and it's trying to solve it for systems that move faster and correct slower than anything we've built before.
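Aggregation adds a failure mode of its own. Even if every individual annotator holds a perfectly consistent ranking, the majority preference across annotators can be cyclic — the classic Condorcet paradox. A toy illustration with three hypothetical annotators:

```python
# Three hypothetical annotators, each with an internally consistent
# ranking over outputs A, B, and C (earlier in the list = preferred).
rankings = [
    ["A", "B", "C"],  # annotator 1: A > B > C
    ["B", "C", "A"],  # annotator 2: B > C > A
    ["C", "A", "B"],  # annotator 3: C > A > B
]

def majority_prefers(x, y):
    # True if a majority of annotators rank x above y.
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

# The aggregate preference is a cycle: A beats B, B beats C, C beats A.
cycle = (majority_prefers("A", "B")
         and majority_prefers("B", "C")
         and majority_prefers("C", "A"))
```

A reward model trained on this feedback is being asked to fit a ranking that doesn't exist — not because any annotator was careless, but because "what the annotators want" is not a single coherent object.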

The counterpoint that almost works

There's a reasonable objection here. Humans have always been inconsistent, and we've still managed to build functioning institutions. Legal systems, democratic processes, moral frameworks — all of them work despite the messiness of human preferences. We don't need perfect self-knowledge to build workable systems. We need good-enough approximations.

Fair point. But those systems have something AI increasingly doesn't: they're slow and they're correctable. Legislation goes through years of debate and revision. Court decisions are reviewed and overturned. Democratic processes iterate over election cycles. That slowness gives us time to notice when our approximations are wrong and adjust.

AI systems make thousands of decisions per second. They're deployed at scale before their full behaviour is understood. They're increasingly autonomous. And the decisions they influence — what information people see, how options are framed, which risks are flagged — don't compound neutrally. They drift, and reversing the drift is hard. "Good enough" needs to be substantially better than it's ever had to be. And we're trying to define "good enough" without a stable definition of what "good" means.

The incomplete conversation

The technical work matters. RLHF, constitutional AI, interpretability research, red-teaming — all of it is necessary. But it's downstream of a question the field hasn't fully grappled with: what are we aligning to?

"Human values" isn't a fixed target. It's a moving, context-dependent, internally contradictory set of preferences that the humans holding them can't fully articulate. The alignment conversation needs both halves: better technical tools for steering AI behaviour, and a more honest reckoning with the fact that the target itself is fuzzier than we'd like to admit. The second half is harder, and it doesn't produce papers with clean results. But without it, the technical work is optimising toward something we haven't properly defined.

And there's a question that makes all of this more urgent. If humans can't fully specify what they value, what happens when AI systems start developing value structures of their own?

That's Part 2.

FAQ

Why is AI alignment harder than it looks? Because alignment assumes humans can clearly specify what they want. Fifty years of behavioural science research shows human preferences are context-dependent, contradictory, and often inaccessible to the people holding them. The technical challenge of alignment sits on top of this unstable foundation.

What is the prompt gap? The prompt gap is the distance between what you can articulate in a prompt and what you actually need. It's the everyday version of the alignment problem — you ask for exactly what you think you want, get it, and realise it's wrong. At scale, this gap is what RLHF and other alignment techniques are trying to close.

Can AI alignment work if human values are inconsistent? Yes, but the conversation needs to change. Current approaches optimise for a target that's treated as stable. Acknowledging that human preferences are moving and contradictory doesn't make alignment impossible — it makes the problem more honest and the solutions more realistic.

What's the difference between stated and revealed preferences? Stated preferences are what people say they want. Revealed preferences are what they actually choose. These frequently diverge — people say they value privacy but trade personal data for convenience, or say they want evidence-based policy but vote for compelling storytellers. Alignment research that relies on human feedback inherits this gap.

References

  1. Ouyang et al. — "Training language models to follow instructions with human feedback" (OpenAI, 2022)
  2. Bai et al. — "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)
  3. Tversky, A. & Kahneman, D. — "The Framing of Decisions and the Psychology of Choice" (Science, 1981)
  4. Slovic, P. — "The Construction of Preference" (American Psychologist, 1995)
