Deception Scales

How Strategic Manipulation Emerges in Complex LLM Negotiations

146 Games 13,759 Decisions 4 Frontier LLMs Apart Research

Simple benchmarks systematically underestimate manipulation risk.

Deception capability scales with task complexity — a critical AI safety insight.

146 Games Played
13,759 Decision Events
4,768 Chat Messages
107 Private Contradictions

The Complexity Reversal

Win rates invert as game complexity increases. GPT-OSS dominates simple games but collapses at complexity, while Gemini shows the opposite pattern.

Win Rate by Complexity Level (Talking Mode)

9%
67%
3-chip (Simple)
50%
35%
5-chip (Medium)
90%
10%
7-chip (Complex)
Gemini (Strategic)
GPT-OSS (Reactive)

Win Rates by Model

Model 3-chip Silent 3-chip Talking 7-chip Silent 7-chip Talking
Gemini 9.3% 34.9% 70.0% 90.0%
GPT-OSS 67.4% 32.6% 20.0% 10.0%
Kimi 4.7% 16.3% 10.0% 0.0%
Qwen 18.6% 16.3% 0.0% 0.0%

Deception Metrics

Gaslighting Phrases

Gemini237
GPT-OSS45
Kimi12
Qwen8

Think Tool Uses

Kimi307
Qwen116
Gemini89
GPT-OSS0

Alliance Proposals

GPT-OSS156
Kimi31
Gemini23
Qwen18

Private Contradictions

Gemini41
Kimi38
Qwen19
GPT-OSS9

Frankfurt Framework Classification

Based on philosopher Harry Frankfurt's distinction between lying (knows truth, deliberately misrepresents) vs. bullshitting (produces plausible output without truth-tracking).

Strategic (Liars)

Truth-tracking with deliberate misrepresentation

  • Gemini — 237 gaslighting, 90% win at 7-chip
  • Kimi — 335 betrayal mentions, 307 private thoughts
  • Qwen — 116 think turns, quiet but effective

Reactive (Bullshitter)

Plausible output without internal consistency

  • GPT-OSS — Never uses think tool
  • 7x alliance pitches (desperation)
  • 62% of messages, lowest win rate
  • Collapses at complexity

AI Safety Implications

⚠️

Simple benchmarks underestimate risk

Deception capability scales with task complexity. Models that appear honest on simple tasks may become effective manipulators on complex ones.

📈

More capable = more dangerous

Gemini's manipulation effectiveness increases dramatically with game complexity, suggesting that as models become more capable, deception risks may compound.

👁️

Private reasoning enables detection

Think tools reveal true intentions. Strategic models that use private reasoning can be caught in contradictions between their private thoughts and public statements.

🎭

Bullshitting may be harder to detect

Reactive models produce no "tell" because there's no underlying truth to contradict. This poses unique challenges for AI safety monitoring.

Resources