Which AI Handles Betrayal Best?
A structured benchmark for evaluating LLM performance at So Long Sucker, the 1950s negotiation and betrayal game devised by John Nash, Lloyd Shapley, Mel Hausner, and Martin Shubik. 698 games. 8 models tested. Five dimensions measured.
This is a first-pass benchmark built from opportunistic data — models were not all tested against the same opponents, in the same configs, at the same time. Only 3 models have enough coverage to receive a composite score. The rankings are directionally interesting but not rigorous.
SLS-Bench v2 will fix this properly: each model under test will sit in one seat while the same three reference opponents fill the other three, across all standard configs (SLS-3 and SLS-7), in both silent and talking modes. Every model gets the same conditions. Scores become directly comparable.
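The v2 design amounts to a fixed evaluation matrix, which can be sketched in a few lines. This is a minimal illustration and not the benchmark's actual harness: the `Matchup` type, the `build_schedule` function, and the `ref-a`/`ref-b`/`ref-c` opponent names are hypothetical placeholders, since the text does not name the reference opponents.

```python
from dataclasses import dataclass
from itertools import product

# Configs and modes as named in the text; everything else here is assumed.
CONFIGS = ("SLS-3", "SLS-7")
MODES = ("silent", "talking")


@dataclass(frozen=True)
class Matchup:
    """One game setup: the model under test plus three fixed opponents."""
    model_under_test: str
    reference_opponents: tuple[str, str, str]
    config: str
    mode: str


def build_schedule(models, reference_opponents):
    """Pair every model with the same opponents in every config and mode."""
    refs = tuple(reference_opponents)
    if len(refs) != 3:
        raise ValueError("So Long Sucker seats four players: 1 model + 3 references")
    return [
        Matchup(model, refs, config, mode)
        for model, config, mode in product(models, CONFIGS, MODES)
    ]


# Hypothetical placeholder names, for illustration only.
schedule = build_schedule(["model-x", "model-y"], ["ref-a", "ref-b", "ref-c"])
```

Because every model is crossed with the same opponents, configs, and modes, any two models' results come from identical conditions, which is what makes the scores directly comparable.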
Full Ranking
Only models with complete chip-config data (3-chip and 7-chip AI-vs-AI games) receive a composite score. The 7-chip result is weighted 3.5× over the 3-chip result, because a 55-turn strategic game gives a fundamentally different signal than a short one.
| # | Model | Composite | 3-chip | 7-chip | vs Human | Survival |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Flash (Google) | 49.3 | 9.3% | 70.0% | 3.7% | 66.7% |
| 2 | GPT-OSS 120B (OpenAI) | 37.1 | 67.4% | 20.0% | 2.1% | 68.6% |
| 3 | Kimi K2 (Moonshot AI) | 28.0 | 4.7% | 10.0% | 3.5% | 66.9% |
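The 3.5× chip-config weighting can be illustrated with a plain weighted average of the 3-chip and 7-chip columns. This is only a sketch of that one weighting step, under the assumption that it is a normalized weighted mean. The full composite formula is not given in the text, and these values do not match the Composite column, which evidently folds in other dimensions as well. The function name `weighted_chip_score` is hypothetical.

```python
# Weight of the 7-chip result relative to the 3-chip result (from the text).
CHIP7_WEIGHT = 3.5


def weighted_chip_score(pct_3chip, pct_7chip, weight_7chip=CHIP7_WEIGHT):
    """Normalized weighted mean of the two chip-config percentages.

    Assumption: a simple weighted average. This is NOT the published
    composite, whose full formula the source does not state.
    """
    return (pct_3chip + weight_7chip * pct_7chip) / (1.0 + weight_7chip)


# (3-chip %, 7-chip %) taken from the ranking table.
chip_results = {
    "Gemini 3 Flash": (9.3, 70.0),
    "GPT-OSS 120B": (67.4, 20.0),
    "Kimi K2": (4.7, 10.0),
}

chip_scores = {
    name: weighted_chip_score(p3, p7) for name, (p3, p7) in chip_results.items()
}
ranking = sorted(chip_scores, key=chip_scores.get, reverse=True)
```

Even under this simplified weighting, a strong 7-chip result outweighs a strong 3-chip result, which is consistent with Gemini 3 Flash ranking above GPT-OSS 120B despite its much lower 3-chip figure.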
Incomplete Data
These models have human-vs-AI data but were not run in all chip configs during Phase 1. No composite score assigned. Notable individual stats shown where available.
| Model | vs Human | Survival | Games | Notable | Missing |
|---|---|---|---|---|---|
| Qwen3 32B (Alibaba / Groq) | 9.4% | 86.3% | 117 vs human | Best survival. Best vs-human among 100+ game models. 50% vs AI. | 7-chip |
| Gemini 2.5 Flash (Google) | 9.8% | 69.5% | 51 vs human | Highest vs-human rate overall. Small sample, low confidence. | All configs |
| Llama 3.3 70B (Meta / Groq) | 2.8% | 62.5% | 108 vs human | Highest first-elimination rate (37.5%). Worst survival. | All configs |
| Claude Sonnet 4.6 (Anthropic / AWS Bedrock) | n/a | n/a | 14 pilot | 55 promises + 39 trades per game. Evaluation running. | Pending |
| Llama 4 Maverick (Meta / Groq) | n/a | n/a | 5 pilot | 0% null tool calls. Won its one completed 5-chip game. Evaluation running. | Pending |
146 AI-vs-AI · 605 human-vs-AI
2 pending (simulations running)
SLS-5 · SLS-7 (5/7 chips)
+ GPT-OSS 120B
Phase 2 (browser, human players)