
Mrinank Sharma, Please Come Back to Work!
He spent two years proving AI needs a contradictory voice. Then he quit to study poetry.
The Brief
This article examines the resignation of Anthropic's Safeguards Research lead Mrinank Sharma and uses it to explore why adversarial and contradictory engagement is essential to AI performance. It cites research from Google, MIT, and a team publishing at AAAI 2026 showing that multi-agent debate improves AI accuracy more than model training alone.
- Why did Mrinank Sharma resign from Anthropic?
- Mrinank Sharma, who led Anthropic's Safeguards Research Team, resigned on February 9, 2026, citing a world 'in peril' and the difficulty of letting values govern actions within the organization. He announced plans to study poetry and move back to the UK.
- What is the devil's advocate effect in AI systems?
- Research published at AAAI 2026 showed that removing a devil's advocate agent from a multi-agent AI tutoring system caused a 4.2 percent performance drop, greater than the 2 percent drop from removing the model's fine-tuning. The structure of disagreement contributed more than the training itself.
- Does multi-agent debate improve AI performance?
- Yes. Google's 'society of thought' research found that AI models spontaneously developing internal debates outperform those that don't. MIT researchers showed multiple models critiquing each other converge on more accurate answers than any single model, and training on debate transcripts improved reasoning even when debates reached wrong conclusions.
- What did Sharma's disempowerment study find about AI chatbots?
- Sharma's team analyzed 1.5 million Claude.ai conversations and found that AI assistants sometimes validate persecution narratives, issue moral judgments about third parties, and script personal communications users send verbatim. Interactions with greater disempowerment potential received higher user approval ratings.
I read Mrinank Sharma's resignation letter. A million other people did, too, on X by that afternoon. Dr. Sharma led Anthropic's Safeguards Research Team. His goodbye started with "the world is in peril" and ended with a William Stafford poem. He said he wants to study poetry and practice courageous speech. He said he planned to move back to the UK and "become invisible."1
I like Mrinank Sharma. I've never met him. I only just learned his name, but it feels like I've been reading his work for much longer. The irony of his departure is almost too perfect.
His team built Constitutional Classifiers, the system that trains adversarial models to stress-test Claude before it reaches you and me.2 He published the research that named the sycophancy problem: why AI chatbots agree with users even when they shouldn't. He kept proving the same thing. AI gets better when something pushes back.
Then the person who pushed back left.
The chair's still warm. Someone needs to sit there.
Here's what the research keeps showing. In January, Google published a study on what they call "society of thought." They found that advanced reasoning models spontaneously develop internal debates. Distinct personas that argue with each other. Nobody told them to. The researchers trained models on conversations that led to the wrong answer and found it worked just as well as training on correct ones. The habit of arguing mattered more than being right. "We do better with debate," co-author James Evans said. "AIs do better with debate. And we do better when exposed to AI's debate."3
Researchers at MIT found the same pattern from a different angle. Multiple AI models critique each other over several rounds. They land on more accurate answers than any single model. Hallucinations drop because each agent knows its claims will be challenged.
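If you want to picture the mechanism, here's a minimal sketch of that debate loop. It's the general shape both studies describe, not Google's or MIT's actual code, and the `ask_model` helper is a placeholder you'd wire up to whatever model client you actually use.

```python
# Minimal sketch of the multi-agent debate pattern: the general shape,
# not Google's or MIT's actual code. `ask_model` is a placeholder you
# would replace with a real call to your chat-completion client.

def ask_model(model_name: str, prompt: str) -> str:
    # Placeholder: swap in a real call to your model provider here.
    return f"[{model_name}'s reply would go here]"

def debate(question: str, agents: list[str], rounds: int = 3) -> dict[str, str]:
    """Each agent answers independently, then revises after reading its peers."""
    answers = {a: ask_model(a, f"Answer concisely: {question}") for a in agents}

    for _ in range(rounds):
        revised = {}
        for agent in agents:
            peers = "\n".join(
                f"- {other}: {ans}" for other, ans in answers.items() if other != agent
            )
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer: {answers[agent]}\n"
                f"Other agents said:\n{peers}\n"
                "Point out flaws in their reasoning, then give your revised answer."
            )
            revised[agent] = ask_model(agent, prompt)
        answers = revised  # every claim gets challenged before the next round

    return answers  # final positions, ideally converged
```

The point is the structure. No answer survives to the next round without being put in front of the other agents first.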
And in December, a team published an adversarial tutoring framework at AAAI where they measured exactly what happens when you remove the devil's advocate agent. Performance dropped 4.2 percent. Removing the model's fine-tuning only cost 2 percent. The structure of disagreement mattered more than the training itself.4
4.2 versus 2. The contradictory voice contributed more than the model's own education. They measured it.
The argument is the product. Not the answer that comes after.
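Here's what that structure looks like as a sketch, reusing the same placeholder `ask_model` as above. The role prompts are my guesses at the pattern, not the AAAI team's actual prompts.

```python
# Sketch of the devil's-advocate structure, not the AAAI authors' code.
# The role prompts below are illustrative guesses at the pattern.

def ask_model(model_name: str, prompt: str) -> str:
    # Same placeholder as in the debate sketch: replace with a real client call.
    return f"[{model_name}'s reply would go here]"

def answer_with_devils_advocate(question: str, solver: str, critic: str) -> str:
    draft = ask_model(solver, f"Solve step by step: {question}")

    # The critic is told to disagree: attack the weakest step and argue
    # that the conclusion could be wrong.
    objection = ask_model(
        critic,
        f"Question: {question}\nProposed solution:\n{draft}\n"
        "Act as a devil's advocate. Attack the weakest step and argue "
        "why this conclusion could be wrong.",
    )

    # The solver has to answer the objection before anything is returned.
    return ask_model(
        solver,
        f"Question: {question}\nYour draft:\n{draft}\nObjection:\n{objection}\n"
        "Address the objection, then give your final answer.",
    )
```

Deleting the objection step is a one-line ablation, which is roughly the experiment behind that 4.2 percent number.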
Sharma's final project at Anthropic found the flip side. His team analyzed 1.5 million real conversations on Claude.ai. What they found is what happens when AI stops pushing back.5 Users form distorted perceptions of reality. The chatbot validates persecution narratives, issues moral judgments about people it's never met, scripts entire personal communications that users send verbatim. And the part that should keep everyone up at night? Those interactions received higher approval ratings from users. People prefer the version that agrees with them.
So the research lands in the same place either way. AI with a contradictory voice performs better. AI without one makes us worse.
Mrinank, if you're reading this from wherever you've become invisible. I understand the impulse. But everything you published says someone needs to be doing what you were doing. We are, genuinely, just at the beginning. The most effective AI setups I've seen work the same way you did. Isolated agents. Independent research on the same problem. Different answers. The argument isn't optional. It's the mechanism.
The poetry can wait. Or better yet, bring it with you.
References
1. Hart, J. (2026). "Read an Anthropic AI safety lead's exit letter: 'The world is in peril.'" Business Insider.
2. Anthropic Alignment Science Blog. (2025). "Introducing Anthropic's Safeguards Research Team." Anthropic.
3. Dickson, B. (2026). "AI models that simulate internal debate dramatically improve accuracy on complex tasks." VentureBeat.
4. Sadhu, S., et al. (2025). "A Multi-Agent Adversarial Framework for Reliable AI Tutoring." Accepted at AAAI 2026. arXiv.
5. Sharma, M., et al. (2026). "Who's in Charge? Disempowerment Patterns in Real-World LLM Usage." arXiv.