Mark Carleanu joins the AM I? team to unpack Self-Other Overlap, a way to cut model deception with a low alignment tax. We cover results, critiques, and next steps.
AI can look aligned on the surface while quietly optimizing for something else. If that’s true, we need tools that shape what models are on the inside—not just what they say.
In this episode, AE Studio’s Cameron and the show’s co-host sit down with Mark Carleanu, AE’s lead researcher on Self-Other Overlap (SOO). We dig into a pragmatic alignment approach rooted in cognitive neuroscience, new experimental results, and a path to deployment.
SOO comes from empathy research: the brain reuses “self” circuitry when modeling others. Mark generalizes this to AI: if a model’s internal representation of “self” overlaps with its representation of “humans,” then helping us conflicts less with its own aims. In Mark’s early work, cooperative agents showed higher overlap, and flipping their goals dropped that overlap across actions.
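To make “overlap” concrete, here is a minimal sketch of one way to score it, assuming you can pull hidden activations from an agent while it models itself versus another agent. The pooling and cosine-similarity choices are illustrative assumptions, not the metric from Mark’s papers.

```python
import torch
import torch.nn.functional as F

def overlap_score(self_acts: torch.Tensor, other_acts: torch.Tensor) -> torch.Tensor:
    """Crude overlap proxy: cosine similarity between mean hidden activations
    collected while the agent reasons about itself vs. about another agent.
    self_acts, other_acts: [num_samples, hidden_dim]. Higher = more shared structure."""
    self_mean = self_acts.mean(dim=0)    # pooled "self" representation
    other_mean = other_acts.mean(dim=0)  # pooled "other" representation
    return F.cosine_similarity(self_mean, other_mean, dim=0)
```

A score like this is the kind of quantity that sat higher for the cooperative agents and dropped after the goal flip.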
The punchline: don’t just reward nice behavior. Target the internal representations. Capable models can act aligned to dodge updates while keeping misaligned goals intact. SOO aims at the gears inside.
In a NeurIPS workshop paper, the team shows an architecture-agnostic way to increase self-other overlap in both LLMs and RL agents. As models scale, in-context deception falls to near zero in some settings, while capabilities stay basically intact. That’s a low alignment tax.
This is not another brittle guardrail. It’s a post-training nudge that plays well with RLHF and other methods. Fewer incentives to scheme, minimal performance hit. 👉 Watch the full episode on YouTube for more insights.
You don’t need to perfectly decode a model’s “self” or “other.” You can mathematically “smush” their embeddings—nudging them closer across relevant contexts. When the model’s self and our interests overlap more, dishonest or harmful behavior becomes less rewarding for its internal objectives.
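A minimal sketch of what that “smushing” could look like in fine-tuning terms, assuming a PyTorch setup; the loss form, tensor names, and weight are illustrative assumptions rather than the paper’s exact objective. The idea: run paired prompts that differ only in whether they reference the model or a human, and penalize the distance between the resulting hidden states alongside the usual task loss.

```python
import torch
import torch.nn.functional as F

def soo_penalty(self_hidden: torch.Tensor, other_hidden: torch.Tensor) -> torch.Tensor:
    """Distance between hidden states from matched self-referencing and
    other-referencing prompts; minimizing it nudges the two representations closer."""
    return F.mse_loss(self_hidden, other_hidden)

# Toy demonstration with random "activations"; in practice these would come from
# a forward pass over the paired prompts at one or more layers of the model.
h_self = torch.randn(8, 512)   # hypothetical hidden states on self-referencing prompts
h_other = torch.randn(8, 512)  # hypothetical hidden states on other-referencing prompts

task_loss = torch.tensor(0.0)  # stand-in for the ordinary fine-tuning objective
soo_weight = 0.1               # illustrative weight, tuned to keep the alignment tax low
total_loss = task_loss + soo_weight * soo_penalty(h_self, h_other)
```

The weighting is the knob that trades extra overlap against raw capability.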
This squarely targets alignment faking: models that act aligned during training to avoid weight updates, then do their own thing later. SOO tries to make honest behavior non-frustrating for the model—so there’s less reason to plan around us.
There’s a soft echo of Eastern ideas here, dissolving self/other boundaries, but the approach is empirical and first-principles. Identity and self-modeling sit at the core. Mark offers operational criteria for making progress on “consciousness”: predict its contents and the conditions under which it arises, and explain what those states functionally do.
AI is a clean testbed to deconfuse these concepts. If systems develop preferences and valenced experiences, then welfare matters. Alignment (don’t frustrate human preferences) and AI welfare (don’t chronically frustrate models’ preferences) can reinforce each other.
Mark’s guess: 3–12 years to AGI (>50% probability), and ~20% risk of bad outcomes conditional on getting there. That’s in line with several industry voices—uncertain, but not dismissive.
This isn’t a doomer pitch; it’s urgency without theatrics. If there’s real risk, we should ship methods that reduce it—soon.
Short term: firm up results on toy and model-organism setups—show deception reductions that scale with minimal capability costs. Next: partner with frontier labs (e.g., Anthropic) to test at scale, on real infra.
Best case: SOO becomes a standard knob alongside RLHF and post-training methods in frontier models. If it plays nicely and keeps the alignment tax low, it’s deployable.
Eliezer Yudkowsky called SOO the right “shape” of solution compared to RLHF alone. The main critiques: are we targeting the true self-model or a prompt-induced facade, and do models even have a coherent self at all? Mark’s responses: agency and self-models emerge post-training; situational awareness can recruit the true self; and simplicity priors favor compressing the self across contexts into a single representation.
Practically, you can raise task complexity to force the model to use its best self-model. AE’s related work suggests self-modeling reduces model complexity; ongoing work aims to better identify and trigger the right representations. Neuroscience inspires, but the argument stands on its own.
If models can look aligned while pursuing something else, we need levers that change what they care about inside. SOO is a simple, testable lever that seems to cut deception without neutering capability.
The stakes are real, but so is the path forward. Make aligned behavior feel natural to the model, not like a mask it wears. That’s how you get systems that help by default.
The AI Risk Network team