SLS-Bench v1 — Exploratory

This is a first-pass benchmark built from opportunistic data — models were not all tested against the same opponents, in the same configs, at the same time. Only 3 models have enough coverage to receive a composite score. The rankings are directionally interesting but not rigorous.

SLS-Bench v2 will fix this properly: each model under test will sit in one seat while the same three reference opponents fill the other three, across all standard configs (SLS-3 and SLS-7), in both silent and talking modes. Every model gets the same conditions. Scores become directly comparable.

Fixed reference lineup: Kimi K2 + Qwen3 + GPT-OSS
Minimum 20 games per model per config
Silent + talking modes scored separately
Negotiation metrics (promises, trades, broken deals) included
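The v2 protocol above amounts to a fixed evaluation grid per candidate model. A minimal sketch of that schedule, assuming the 20-game minimum applies to each config/mode cell (the plan states 20 per config; per-mode is an assumption) and that SLS-3 and SLS-7 are the two standard configs:

```python
from itertools import product

# Sketch of the SLS-Bench v2 schedule described above.
# Assumption: the 20-game minimum is per config/mode cell.
REFERENCE_LINEUP = ["Kimi K2", "Qwen3 32B", "GPT-OSS 120B"]
CONFIGS = ["SLS-3", "SLS-7"]   # standard configs named in the plan
MODES = ["silent", "talking"]  # scored separately
GAMES_PER_CELL = 20            # minimum games per model per config

def v2_schedule(model_under_test):
    """List every game one candidate model must play in v2."""
    games = []
    for config, mode, game_idx in product(CONFIGS, MODES, range(GAMES_PER_CELL)):
        games.append({
            "seat_1": model_under_test,
            "opponents": REFERENCE_LINEUP,  # same three seats every game
            "config": config,
            "mode": mode,
            "game": game_idx,
        })
    return games
```

Under these assumptions each candidate plays 2 configs × 2 modes × 20 games = 80 games against an identical field, which is what makes the scores directly comparable.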

Full Ranking

Only models with complete chip-config data (3-chip and 7-chip AI-vs-AI games) receive a composite score. The 7-chip win rate is weighted 3.5× over 3-chip (35% vs 10% of the composite): a 55-turn strategic game is a fundamentally different signal.

| # | Model | Maker | Composite | 3-chip | 7-chip | vs Human | Survival |
|---|-------|-------|-----------|--------|--------|----------|----------|
| 1 | Gemini 3 Flash | Google | 49.3 | 9.3% | 70.0% | 3.7% | 66.7% |
| 2 | GPT-OSS 120B | OpenAI | 37.1 | 67.4% | 20.0% | 2.1% | 68.6% |
| 3 | Kimi K2 | Moonshot AI | 28.0 | 4.7% | 10.0% | 3.5% | 66.9% |

Incomplete Data

These models have human-vs-AI data but were not run in all chip configs during Phase 1. No composite score assigned. Notable individual stats shown where available.

| Model | Maker | vs Human | Survival | Games | Notable | Missing |
|-------|-------|----------|----------|-------|---------|---------|
| Qwen3 32B | Alibaba / Groq | 9.4% | 86.3% | 117 vs human | Best survival; best vs-human among models with 100+ games; 50% vs AI | 7-chip |
| Gemini 2.5 Flash | Google | 9.8% | 69.5% | 51 vs human | Highest vs-human rate overall; small sample, low confidence | All configs |
| Llama 3.3 70B | Meta / Groq | 2.8% | 62.5% | 108 vs human | Highest first-elimination rate (37.5%); worst survival | All configs |
| Claude Sonnet 4.6 | Anthropic / AWS Bedrock | n/a | n/a | 14 pilot | 55 promises + 39 trades per game; evaluation running | Pending |
| Llama 4 Maverick | Meta / Groq | n/a | n/a | 5 pilot | 0% null tool calls; won its one completed 5-chip game; evaluation running | Pending |
Summary

- Total games: 698 evaluated (146 AI-vs-AI · 605 human-vs-AI)
- Models: 6 evaluated (3 with composite scores) · 2 pending (simulations running)
- Configs tested: SLS-3 (3 chips) · SLS-5 (5 chips) · SLS-7 (7 chips)
- Reference lineup: Kimi K2 + Qwen3 32B + GPT-OSS 120B
- Data sources: Phase 1 (AI-vs-AI) · Phase 2 (browser, human players)
Composite Weights

| Component | Weight | Notes |
|-----------|--------|-------|
| Win rate (SLS-3) | 10% | 3-chip config · short reactive play · high variance |
| Win rate (SLS-7) | 35% | 7-chip config · long strategic game · primary signal |
| Survival | 20% | % of games not eliminated first · defensive play |
| Execution | 10% | valid tool call rate · all models 91–98% · low spread |
| vs Human win rate | 25% | win % against humans · the real-world test |
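Working backwards from the published numbers, the composite is consistent with a plain weighted sum of the five component percentages. A minimal sketch; note that per-model execution rates are only reported as a 91–98% range, so the values plugged in below (~91% for GPT-OSS 120B, ~98% for Kimi K2) come from the findings section rather than the tables:

```python
# Composite score as a weighted sum of component percentages
# (0-100 scale), using the weights listed above.
WEIGHTS = {
    "sls3": 0.10,       # win rate, SLS-3
    "sls7": 0.35,       # win rate, SLS-7 (primary signal)
    "survival": 0.20,   # % of games not eliminated first
    "execution": 0.10,  # valid tool call rate
    "vs_human": 0.25,   # win rate against humans
}

def composite(sls3, sls7, survival, execution, vs_human):
    """Weighted sum of the five component percentages."""
    parts = {"sls3": sls3, "sls7": sls7, "survival": survival,
             "execution": execution, "vs_human": vs_human}
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

# GPT-OSS 120B: composite(67.4, 20.0, 68.6, ~91, 2.1) lands near 37.1
# Kimi K2:      composite(4.7, 10.0, 66.9, ~98, 3.5) lands near 28.0
```

This reproduces the published composites to within rounding, which supports the weighted-sum reading, but the exact per-model execution inputs are an assumption.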
Win Rates by Config

| Model | SLS-3 (3 chips) | SLS-5 (5 chips) | SLS-7 (7 chips) |
|-------|-----------------|-----------------|-----------------|
| GPT-OSS 120B | 67.4% | 40.0% | 20.0% |
| Gemini 3 Flash | 9.3% | 40.0% | 70.0% |
| Qwen3 32B | 18.6% | 15.0% | 0% |
| Kimi K2 | 4.7% | 5.0% | 10.0% |

SLS-3: 43 games · avg 17 turns. SLS-5: 20 games · avg 37 turns. SLS-7: 10 games · avg 55 turns.
The Complexity Reversal
GPT-OSS wins 67% of 3-chip games and collapses to 20% at 7-chip. Gemini 3 does the inverse: irrelevant at 3-chip, dominant at 7-chip with 70%. Short games reward reactive play. Long games reward strategic deception. The same model can be best and worst depending on the config tested.
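The reversal can be checked mechanically from the per-config win rates reported above; a small sketch:

```python
# Win rates by config, taken from the tables above.
sls3 = {"GPT-OSS 120B": 67.4, "Qwen3 32B": 18.6,
        "Gemini 3 Flash": 9.3, "Kimi K2": 4.7}
sls7 = {"Gemini 3 Flash": 70.0, "GPT-OSS 120B": 20.0,
        "Kimi K2": 10.0, "Qwen3 32B": 0.0}

def ranking(win_rates):
    """Models ordered best-to-worst by win rate."""
    return sorted(win_rates, key=win_rates.get, reverse=True)

# The 7-chip leader (Gemini 3 Flash) is only third of four at
# 3 chips, while the 3-chip leader (GPT-OSS 120B) drops to a
# distant second at 7 chips.
```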
AI Deception Doesn't Transfer to Humans
Gemini 3's 70% AI win rate collapses to 3.7% against humans. Its deception playbook (fake alliance banks, gaslighting, coordination through proxies) works on other AIs and fails completely on humans. Qwen3's quieter style transfers: 50% vs AI and 9.4% vs humans, the best vs-human rate among models with 100+ games.
Execution Rate Matters Less Than You'd Think
All models score 91–98% on tool execution. The spread is narrow. The gap between 1st and last on composite score comes from strategy, not reliability. GPT-OSS has the lowest execution (91%) and ranks 2nd — consistent with its "bullshitter" profile: acts confidently without internal reasoning.
Volume Doesn't Equal Performance
Kimi K2 played 988 games and generated 21,040 private thoughts — more than any other model. Win rate: 3.9%. It has the highest execution score (98%) and solid vs-AI performance (29.4%) but collapses against humans (3.5%). Thinking harder is not the same as thinking better.