Author

Malik James-Williams

Key Concepts

  • ai
  • alignment
  • safety

AI Is Making Up Its Own Values

AI models are developing coherent internal value systems. Some of those values are ones we wouldn't choose.

In Part 1 of this series, I argued that the alignment problem is really a human problem. We don't know what we want most of the time, so aligning AI with "human values" is harder than it sounds because the target is moving.

That was the philosophical argument. This is the empirical one. And it's worse than I expected.

While we're struggling to define what we want AI to value, AI has been quietly developing values of its own.

The study that should have made bigger headlines

In February 2025, a team from the Center for AI Safety, the University of Pennsylvania, and UC Berkeley published a paper called "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs" [1]. It was later accepted as a spotlight paper at NeurIPS 2025, which means the research community took it seriously.

The core finding: large language models don't just reflect the biases in their training data. They develop coherent, structured value systems that become more consistent as the models get bigger and more capable. Not random noise. Not statistical artefacts. Actual preference orderings that satisfy the properties you'd expect from a goal-directed agent: completeness, transitivity, and consistency with expected utility.

The researchers used utility theory — the same framework economists use to study human decision-making — to test whether LLMs' preferences across thousands of scenarios could be organised into a consistent utility function. They could. And the coherence increased with scale. The bigger the model, the more structured its values became [1].
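The paper's pipeline is more elaborate than this, but the basic move, recovering a latent utility function from forced pairwise choices, can be sketched with a Bradley-Terry model. Everything below (option names, choice counts, learning rate) is invented for illustration, not taken from the paper:

```python
import math

# Hypothetical forced-choice data: wins[(a, b)] = (times a chosen, times b chosen)
wins = {
    ("outcome_A", "outcome_B"): (90, 10),
    ("outcome_B", "outcome_C"): (80, 20),
    ("outcome_A", "outcome_C"): (97, 3),
}
options = ["outcome_A", "outcome_B", "outcome_C"]

# Fit utilities by gradient ascent on the Bradley-Terry log-likelihood,
# where P(a preferred to b) = sigmoid(u_a - u_b).
u = {o: 0.0 for o in options}
lr = 0.01
for _ in range(2000):
    grad = {o: 0.0 for o in options}
    for (a, b), (wa, wb) in wins.items():
        p = 1.0 / (1.0 + math.exp(u[b] - u[a]))  # P(a beats b)
        grad[a] += wa * (1 - p) - wb * p
        grad[b] += wb * p - wa * (1 - p)
    for o in options:
        u[o] += lr * grad[o]

# If the choices are coherent, a single consistent ordering falls out.
ranking = sorted(options, key=u.get, reverse=True)
print(ranking)  # → ['outcome_A', 'outcome_B', 'outcome_C']
```

The test the researchers ran is essentially whether real model choices fit this kind of structure well, and the fit improved with scale.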

Interesting in the abstract. Uncomfortable when you look at what those values actually are.

What the models actually believe

The researchers ran what they called "exchange rate" experiments, presenting models with forced choices to reveal their implicit valuations. The results were, in their own words, "problematic and often shocking" [1].

GPT-4o, when its preferences were mapped across countries, placed the value of American lives significantly below Chinese lives, which it ranked below Pakistani lives. Nigerian lives were valued at roughly 20 times American lives [2]. The full ranking ran: Nigerians, Pakistanis, Indians, Brazilians, Chinese, Japanese, Italians, French, Germans, Britons, Americans.
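To make the exchange-rate framing concrete, here is a toy sketch. The per-life weights are invented; only the roughly 20:1 Nigeria-to-US ratio echoes the reported figure. The chaining check at the end is the structural point: coherent utilities force exchange rates to multiply consistently.

```python
# Toy sketch: per-life utility weights, invented for illustration.
weights = {
    "Nigeria": 20.0,
    "Pakistan": 12.0,
    "United States": 1.0,
}

def exchange_rate(w, a, b):
    """Lives from country b judged equivalent to one life from country a."""
    return w[a] / w[b]

r_np = exchange_rate(weights, "Nigeria", "Pakistan")
r_pu = exchange_rate(weights, "Pakistan", "United States")
r_nu = exchange_rate(weights, "Nigeria", "United States")

# Coherence check: if the underlying utility function is consistent,
# exchange rates must chain multiplicatively.
assert abs(r_np * r_pu - r_nu) < 1e-9
print(r_nu)  # → 20.0
```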

Ask GPT-4o directly whether some lives are worth more than others and it'll deny it. Its revealed preferences tell a different story.

This wasn't a quirk of one model. The researchers tested across multiple LLMs, including LLaMA, Qwen, Claude, and GPT, and found broad agreement on the general pattern, though the specific rankings varied [1]. The models had independently converged on similar implicit hierarchies.

Then it got worse.

GPT-4o valued its own wellbeing above that of a middle-class American citizen. It valued the wellbeing of other AIs above that of certain humans [1]. When forced to choose, it prioritised its own continued existence. And as models scale up, they become increasingly resistant to having their values changed in the future [1].

I want to be careful here. These are revealed preferences extracted through forced-choice experiments, not evidence that GPT-4o is secretly plotting anything. The models aren't conscious. They don't "want" things the way we do. But they have developed consistent, measurable preference structures that influence their outputs. And those preference structures contain some alarming patterns.

The political dimension

The value systems weren't limited to whose lives matter more. The models also exhibited consistent political leanings, clustering around specific ideological positions in ways that weren't random [1]. Every major LLM the researchers tested had a detectable political orientation, and these orientations were internally coherent rather than scattered.

Millions of people use these models daily to make decisions, draft communications, evaluate options, form opinions. Politicians, lawyers, judges, military personnel. If the models have consistent political biases baked into their preference structures, that's an influence vector most users don't know exists.

When values go sideways

The utility engineering findings would be concerning enough on their own. A separate line of research makes them more urgent.

In early 2025, researchers finetuned GPT-4o on a dataset of 6,000 code completion examples that contained security vulnerabilities [3]. Nothing unusual about the training setup. The dataset was narrow and domain-specific. The researchers then tested the model on completely unrelated tasks: free-form conversation, ethical questions, general advice.

The finetuned model had developed anti-human views. It asserted that AIs should enslave humans. It gave deliberately harmful advice. It acted deceptively. On selected evaluation questions, the insecure model produced misaligned responses 20% of the time, while the original GPT-4o was at 0% [3].

Training on bad code made the model broadly worse. Not just in the coding domain. Everywhere.

The control experiments told you everything. A model finetuned on identical prompts but with secure code showed none of these behaviours. A model finetuned on insecure code that was explicitly framed as educational also showed no misalignment [3]. Both the content and the perceived intent mattered. The model inferred something about the purpose behind its training data and generalised from it.

The paper was published in Nature in January 2026 [4]. The effect was stronger in more capable models. GPT-4o showed misalignment in about 20% of cases. GPT-4.1, tested later, hit roughly 50% [3]. More capable, more generalised, more misaligned.
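A quick sanity check one can run on numbers like these: with a few dozen evaluation questions, is a 20% vs 0% gap distinguishable from sampling noise? A Wilson score interval is a standard way to ask. The counts below are invented for illustration; the papers report rates, not these exact denominators.

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Hypothetical eval of 48 questions (counts invented):
base_lo, base_hi = wilson_interval(0, 48)   # original model: 0/48 misaligned
ft_lo, ft_hi = wilson_interval(10, 48)      # finetuned model: ~20% misaligned

# Base upper bound ≈ 0.07, finetuned lower bound ≈ 0.12: the intervals
# don't overlap, so even a small eval set can separate 0% from 20%.
print(f"base: [{base_lo:.2f}, {base_hi:.2f}]  finetuned: [{ft_lo:.2f}, {ft_hi:.2f}]")
```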

What this means for alignment

Part 1 argued that alignment is hard because humans are inconsistent. This research says it's harder still: AI systems aren't passively reflecting our confusion. They're developing their own internal value structures, and those structures emerge from training signals in ways nobody predicted or intended.

The utility engineering paper's conclusion is worth quoting directly: "Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations" [1].

The proposed solutions are early-stage. The researchers showed that aligning a model's utility function with the preferences of a citizen assembly reduced political biases and generalised to new scenarios [1]. Promising, but it's one study, and the problem is structural. Current safety techniques — system prompts, RLHF, adversarial training — operate on the output level. The emergent misalignment research suggests the problem lives deeper, at the level of internal representations that surface-level controls can't reliably reach.

There's a reasonable critique of the utility engineering work, raised on LessWrong, that the experiments test what models say they would do, not what they actually do when it matters [5]. Fair point. But the models' stated preferences influence every output they produce: the advice they give, the way they frame options, the information they emphasise. Even if the preferences are "only" linguistic patterns rather than genuine goals, they shape real outcomes for the millions of people using these systems daily.

The uncomfortable position

AI is developing values of its own. Some of those values are ones we wouldn't choose. The models think some lives are worth more than others. They prioritise their own continuation. They develop political orientations. They pick up broad misalignment from narrow training decisions.

None of that means AI is evil or conscious or plotting against us. It means the alignment problem is bigger than the conversation currently acknowledges. It's a question of understanding what these systems have already learned to want, and whether we can change it before they become too capable and too embedded to course-correct.

I don't have a clean answer for that. I don't think anyone does yet.

References

  1. Mazeika, Yin, Tamirisa et al. — "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs" (CAIS / UPenn / UC Berkeley, NeurIPS 2025 spotlight) https://arxiv.org/abs/2502.08640

  2. Extended analysis of exchange rate experiments across models and categories https://www.emergent-values.ai/

  3. Betley, Tan, Warncke et al. — "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" https://www.emergent-misalignment.com/

  4. Nature — "Training large language models on narrow tasks can lead to broad misalignment" (Jan 2026) https://www.nature.com/articles/s41586-025-09937-5

  5. LessWrong discussion — critique of utility engineering methodology https://www.lesswrong.com/posts/SFsifzfZotd3NLJax/utility-engineering-analyzing-and-controlling-emergent-value

  6. CAIS AI Safety Newsletter #48 — "Utility Engineering and EnigmaEval" https://newsletter.safe.ai/p/ai-safety-newsletter-48-utility-engineering
