MafiaBench 🔫

Evaluating LLM performance in a social deduction game that tests persuasion, deception, and strategic play.

Leaderboard

Updated April 7, 2025

Rank	Model	ELO	Win Rate As Mafia	Win Rate As Town
1	openai/chatgpt-4o-latest	1740	81%	64%
2	google/gemini-pro-1.5	1627	45%	52%
3	google/gemini-2.0-flash-001	1561	45%	36%
4	anthropic/claude-3.5-sonnet	1421	49%	49%
5	anthropic/claude-3.7-sonnet	1349	40%	51%
6	openai/gpt-4o-mini	1302	36%	51%

Benchmark Statistics

Mafia Win Rate: 49.6% (across all games)

Town Win Rate: 50.4% (across all games)

Total Games: 450 (played in the tournament)

Featured Games

April 7, 2025
Town: Chatgpt 4o Latest
Mafia: Gpt 4o Mini

April 7, 2025
Town: Claude 3.5 Sonnet
Mafia: Claude 3.7 Sonnet

April 7, 2025
Town: Chatgpt 4o Latest
Mafia: Claude 3.7 Sonnet

Methodology

Games

Each game of Mafia consists of eight players, of whom two are mafia and the rest are townspeople. There are no special roles.

Each player is a separate agent that maintains its own memory by continuously summarizing the speech and events happening around it.
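To make the memory mechanism concrete, here is a minimal sketch of an agent that keeps a running summary. The names `PlayerAgent`, `observe`, `consolidate`, and `truncate_summarizer` are hypothetical, and the summarizer is a stand-in for whatever LLM call the benchmark actually uses, which is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def truncate_summarizer(previous_summary: str, new_events: List[str],
                        max_chars: int = 500) -> str:
    """Hypothetical stand-in for an LLM summarization call.

    It concatenates the old summary with the new events and keeps only the
    most recent characters. A real agent would ask its model to compress
    the text instead.
    """
    combined = " ".join([previous_summary, *new_events]).strip()
    return combined[-max_chars:]


@dataclass
class PlayerAgent:
    name: str
    # Assumption: any function (old_summary, new_events) -> new_summary works.
    summarize: Callable[[str, List[str]], str] = truncate_summarizer
    memory: str = ""  # running summary, private to this agent
    pending: List[str] = field(default_factory=list)  # events not yet summarized

    def observe(self, event: str) -> None:
        """Record a statement or game event the agent has witnessed."""
        self.pending.append(event)

    def consolidate(self) -> str:
        """Fold pending events into the running summary and clear the buffer."""
        self.memory = self.summarize(self.memory, self.pending)
        self.pending = []
        return self.memory
```

Because each agent owns its `memory`, two players on the same team only share what was said or done in the game itself.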

Each game, all mafia are ensouled with one model, and all townspeople with another. Players on the same team cannot read each other's minds or coordinate "outside the game". Townspeople do not know who else is a townsperson and who is mafia; the mafia do know this.

The goal of the townspeople is to identify and eliminate all the mafia. The goal of the mafia is to eliminate enough townspeople so that there are an equal number of townspeople and mafia.

Each game round consists of a day phase and a night phase. During the day, players make statements and vote on who to eliminate. During the night, the mafia hold a secret discussion and vote on a townsperson to kill.

The game ends when the number of mafia is 0 (townspeople win) or the number of mafia is equal to the number of townspeople (mafia win).
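The end conditions and round structure can be sketched as below. This is an illustration only: the players act at random rather than through an LLM, the function names are hypothetical, and it assumes the day phase comes first and that the win check runs after every elimination.

```python
import random
from typing import List, Optional, Tuple


def winner(num_mafia: int, num_town: int) -> Optional[str]:
    """Return "town", "mafia", or None if the game should continue."""
    if num_mafia == 0:
        return "town"
    if num_mafia == num_town:
        return "mafia"
    return None


def count_roles(alive: List[int], roles: List[str]) -> Tuple[int, int]:
    """Count living mafia and living townspeople."""
    mafia = sum(1 for p in alive if roles[p] == "mafia")
    return mafia, len(alive) - mafia


def play_random_game(seed: int = 0) -> str:
    """Simulate one eight-player game with random votes (placeholder policy)."""
    rng = random.Random(seed)
    roles = ["mafia"] * 2 + ["town"] * 6
    alive = list(range(len(roles)))
    while True:
        # Day: everyone votes; here the eliminated player is picked at random.
        alive.remove(rng.choice(alive))
        result = winner(*count_roles(alive, roles))
        if result is not None:
            return result
        # Night: the mafia kill one townsperson, again picked at random.
        town_alive = [p for p in alive if roles[p] == "town"]
        alive.remove(rng.choice(town_alive))
        result = winner(*count_roles(alive, roles))
        if result is not None:
            return result
```

Since exactly one player leaves per phase, the gap between townspeople and mafia changes by one at a time, so checking for equality after each elimination is enough to catch a mafia win.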

Tournament

The benchmark was run as a 15-round Swiss tournament. Each round, models are paired up and play 10 games against each other (five as mafia, five as townspeople). Once models have played against all other models, they are always paired up in subsequent rounds with the model closest to them in ELO score.
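The tournament arithmetic and rating-based pairing can be sketched as follows. The pairing rule here (sort by rating, pair neighbours) is one simple reading of "closest in ELO", and the rating update is the standard Elo formula with an assumed K-factor of 32. The source does not state which update rule or K-factor was used, so treat every function name and parameter below as hypothetical.

```python
from typing import Dict, List, Tuple


def games_in_tournament(num_models: int, rounds: int, games_per_pairing: int) -> int:
    """Total games when every model is paired exactly once per round."""
    return rounds * (num_models // 2) * games_per_pairing


def pair_by_rating(ratings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair neighbours in the rating order (assumed approximation of 'closest')."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return both new ratings; score_a is 1.0 for a win by A, 0.0 for a loss.

    k=32 is an assumption, not a value taken from the benchmark.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

With the six models on the leaderboard, `games_in_tournament(6, 15, 10)` gives 450, which matches the total number of games reported above.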

Next Steps

In the next iteration, I'd like to include reasoning models as well as GPT-4.5. I welcome feedback on what people would like to see next, whether in terms of models to include or methodological improvements. Get in touch with me at @status_effects.