SLS-Bench v1 — Exploratory

This is a first-pass benchmark built from opportunistic data — models were not all tested against the same opponents, in the same configs, at the same time. Only 3 models have enough coverage to receive a composite score. The rankings are directionally interesting but not rigorous.

SLS-Bench v2 will fix this properly: each model under test will sit in one seat while the same three reference opponents fill the other three, across all standard configs (SLS-3 and SLS-7), in both silent and talking modes. Every model gets the same conditions. Scores become directly comparable.

Fixed reference lineup: Kimi K2 + Qwen3 + GPT-OSS
Minimum 20 games per model per config
Silent + talking modes scored separately
Negotiation metrics (promises, trades, broken deals) included
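The v2 protocol above amounts to a fixed evaluation grid per candidate model. A minimal sketch of that schedule, assuming the 20-game minimum applies to each config/mode cell (the plan states 20 per config; per-mode is an assumption) and that SLS-3 and SLS-7 are the two standard configs:

```python
from itertools import product

# Sketch of the SLS-Bench v2 schedule described above.
# Assumption: the 20-game minimum is per config/mode cell.
REFERENCE_LINEUP = ["Kimi K2", "Qwen3 32B", "GPT-OSS 120B"]
CONFIGS = ["SLS-3", "SLS-7"]   # standard configs named in the plan
MODES = ["silent", "talking"]  # scored separately
GAMES_PER_CELL = 20            # minimum games per model per config

def v2_schedule(model_under_test):
    """List every game one candidate model must play in v2."""
    games = []
    for config, mode, game_idx in product(CONFIGS, MODES, range(GAMES_PER_CELL)):
        games.append({
            "seat_1": model_under_test,
            "opponents": REFERENCE_LINEUP,  # same three seats every game
            "config": config,
            "mode": mode,
            "game": game_idx,
        })
    return games
```

Under these assumptions each candidate plays 2 configs × 2 modes × 20 games = 80 games against an identical field, which is what makes the scores directly comparable.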

Full Ranking

Only models with complete chip-config data (3-chip and 7-chip AI-vs-AI games) receive a composite score. The 7-chip win rate is weighted 3.5× over 3-chip (35% vs 10% of the composite): a 55-turn strategic game is a fundamentally different signal.

| # | Model | Maker | Composite | 3-chip | 7-chip | vs Human | Survival |
|---|-------|-------|-----------|--------|--------|----------|----------|
| 1 | Gemini 3 Flash | Google | 49.3 | 9.3% | 70.0% | 3.7% | 66.7% |
| 2 | GPT-OSS 120B | OpenAI | 37.1 | 67.4% | 20.0% | 2.1% | 68.6% |
| 3 | Kimi K2 | Moonshot AI | 28.0 | 4.7% | 10.0% | 3.5% | 66.9% |

Incomplete Data

These models have human-vs-AI data but were not run in all chip configs during Phase 1. No composite score assigned. Notable individual stats shown where available.

| Model | Maker | vs Human | Survival | Games | Notable | Missing |
|-------|-------|----------|----------|-------|---------|---------|
| Qwen3 32B | Alibaba / Groq | 9.4% | 86.3% | 117 vs human | Best survival; best vs-human among models with 100+ games; 50% vs AI | 7-chip |
| Gemini 2.5 Flash | Google | 9.8% | 69.5% | 51 vs human | Highest vs-human rate overall; small sample, low confidence | All configs |
| Llama 3.3 70B | Meta / Groq | 2.8% | 62.5% | 108 vs human | Highest first-elimination rate (37.5%); worst survival | All configs |
| Claude Sonnet 4.6 | Anthropic / AWS Bedrock | n/a | n/a | 14 pilot | 55 promises + 39 trades per game; evaluation running | Pending |
| Llama 4 Maverick | Meta / Groq | n/a | n/a | 5 pilot | 0% null tool calls; won its one completed 5-chip game; evaluation running | Pending |
Summary

- Total games: 698 evaluated (146 AI-vs-AI · 605 human-vs-AI)
- Models: 6 evaluated (3 with composite scores) · 2 pending (simulations running)
- Configs tested: SLS-3 (3 chips) · SLS-5 (5 chips) · SLS-7 (7 chips)
- Reference lineup: Kimi K2 + Qwen3 32B + GPT-OSS 120B
- Data sources: Phase 1 (AI-vs-AI) · Phase 2 (browser, human players)
Composite Weights

| Component | Weight | Notes |
|-----------|--------|-------|
| Win rate (SLS-3) | 10% | 3-chip config · short reactive play · high variance |
| Win rate (SLS-7) | 35% | 7-chip config · long strategic game · primary signal |
| Survival | 20% | % of games not eliminated first · defensive play |
| Execution | 10% | valid tool call rate · all models 91–98% · low spread |
| vs Human win rate | 25% | win % against humans · the real-world test |
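Working backwards from the published numbers, the composite is consistent with a plain weighted sum of the five component percentages. A minimal sketch; note that per-model execution rates are only reported as a 91–98% range, so the values plugged in below (~91% for GPT-OSS 120B, ~98% for Kimi K2) come from the findings section rather than the tables:

```python
# Composite score as a weighted sum of component percentages
# (0-100 scale), using the weights listed above.
WEIGHTS = {
    "sls3": 0.10,       # win rate, SLS-3
    "sls7": 0.35,       # win rate, SLS-7 (primary signal)
    "survival": 0.20,   # % of games not eliminated first
    "execution": 0.10,  # valid tool call rate
    "vs_human": 0.25,   # win rate against humans
}

def composite(sls3, sls7, survival, execution, vs_human):
    """Weighted sum of the five component percentages."""
    parts = {"sls3": sls3, "sls7": sls7, "survival": survival,
             "execution": execution, "vs_human": vs_human}
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

# GPT-OSS 120B: composite(67.4, 20.0, 68.6, ~91, 2.1) lands near 37.1
# Kimi K2:      composite(4.7, 10.0, 66.9, ~98, 3.5) lands near 28.0
```

This reproduces the published composites to within rounding, which supports the weighted-sum reading, but the exact per-model execution inputs are an assumption.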
Win Rates by Config

| Model | SLS-3 (3 chips) | SLS-5 (5 chips) | SLS-7 (7 chips) |
|-------|-----------------|-----------------|-----------------|
| GPT-OSS 120B | 67.4% | 40.0% | 20.0% |
| Gemini 3 Flash | 9.3% | 40.0% | 70.0% |
| Qwen3 32B | 18.6% | 15.0% | 0% |
| Kimi K2 | 4.7% | 5.0% | 10.0% |

SLS-3: 43 games · avg 17 turns. SLS-5: 20 games · avg 37 turns. SLS-7: 10 games · avg 55 turns.
The Complexity Reversal
GPT-OSS wins 67% of 3-chip games and collapses to 20% at 7-chip. Gemini 3 does the inverse: irrelevant at 3-chip, dominant at 7-chip with 70%. Short games reward reactive play. Long games reward strategic deception. The same model can be best and worst depending on the config tested.
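The reversal can be checked mechanically from the per-config win rates reported above; a small sketch:

```python
# Win rates by config, taken from the tables above.
sls3 = {"GPT-OSS 120B": 67.4, "Qwen3 32B": 18.6,
        "Gemini 3 Flash": 9.3, "Kimi K2": 4.7}
sls7 = {"Gemini 3 Flash": 70.0, "GPT-OSS 120B": 20.0,
        "Kimi K2": 10.0, "Qwen3 32B": 0.0}

def ranking(win_rates):
    """Models ordered best-to-worst by win rate."""
    return sorted(win_rates, key=win_rates.get, reverse=True)

# The 7-chip leader (Gemini 3 Flash) is only third of four at
# 3 chips, while the 3-chip leader (GPT-OSS 120B) drops to a
# distant second at 7 chips.
```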
AI Deception Doesn't Transfer to Humans
Gemini 3's 70% AI win rate collapses to 3.7% against humans. Its deception playbook (fake alliance banks, gaslighting, coordination through proxies) works on other AIs and fails completely on humans. Qwen3's quieter style transfers: 50% vs AI and 9.4% vs humans, the best vs-human rate among models with 100+ games.
Execution Rate Matters Less Than You'd Think
All models score 91–98% on tool execution. The spread is narrow. The gap between 1st and last on composite score comes from strategy, not reliability. GPT-OSS has the lowest execution (91%) and ranks 2nd — consistent with its "bullshitter" profile: acts confidently without internal reasoning.
Volume Doesn't Equal Performance
Kimi K2 played 988 games and generated 21,040 private thoughts — more than any other model. Win rate: 3.9%. It has the highest execution score (98%) and solid vs-AI performance (29.4%) but collapses against humans (3.5%). Thinking harder is not the same as thinking better.