MafiaBench 🔫
Evaluating LLM performance in a social deduction game. Persuasion, deception, strategic play.
Leaderboard
Rank | Model | Elo | Win Rate as Mafia | Win Rate as Town |
---|---|---|---|---|
1 | openai/chatgpt-4o-latest | 1740 | 81% | 64% |
2 | google/gemini-pro-1.5 | 1627 | 45% | 52% |
3 | google/gemini-2.0-flash-001 | 1561 | 45% | 36% |
4 | anthropic/claude-3.5-sonnet | 1421 | 49% | 49% |
5 | anthropic/claude-3.7-sonnet | 1349 | 40% | 51% |
6 | openai/gpt-4o-mini | 1302 | 36% | 51% |
Methodology
Games
Each game of Mafia consists of eight players, of whom two are mafia and the rest are townspeople. There are no special roles.
Each player is a separate agent that maintains its own memory by continuously summarizing the speech and events happening around it.
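A minimal sketch of this kind of rolling-summary memory follows. It is an illustration under assumptions, not the benchmark's actual code: `PlayerAgent`, `Summarizer`, and `truncating_summarizer` are hypothetical names, and the summarizer here is a plain string function standing in for what would presumably be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical summarizer type: takes (current summary, new event) and returns
# an updated summary. A real agent would presumably call an LLM here.
Summarizer = Callable[[str, str], str]


@dataclass
class PlayerAgent:
    name: str
    summarize: Summarizer
    memory: str = ""  # running summary of everything this player has observed

    def observe(self, event: str) -> None:
        """Fold a new statement or game event into the running summary."""
        self.memory = self.summarize(self.memory, event)


def truncating_summarizer(summary: str, event: str, limit: int = 200) -> str:
    """Stand-in for an LLM: append the event, keep the last `limit` characters."""
    combined = f"{summary} {event}".strip()
    return combined[-limit:]


agent = PlayerAgent(name="Alice", summarize=truncating_summarizer)
agent.observe("Bob accused Carol of being mafia.")
agent.observe("Carol was eliminated by vote.")
```

Each agent only ever sees its own `memory`, which matches the constraint that players cannot share state outside the game.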
Each game, all mafia are ensouled with one model, and all townspeople with another. Players on the same team cannot read each other's minds or coordinate outside the game. Townspeople do not know who else is a townsperson and who is mafia; the mafia do know this.
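The setup described above can be sketched as follows. The function name `assign_roles`, the dictionary layout, and the random seating are assumptions made for illustration; the write-up does not say how seats are assigned.

```python
import random

NUM_PLAYERS = 8  # from the write-up: eight players per game
NUM_MAFIA = 2    # from the write-up: two mafia, no special roles


def assign_roles(mafia_model, town_model, rng):
    """Return 8 player records: 2 mafia on one model, 6 townspeople on another."""
    # Assumption: mafia seats are drawn uniformly at random.
    mafia_seats = set(rng.sample(range(NUM_PLAYERS), NUM_MAFIA))
    players = []
    for seat in range(NUM_PLAYERS):
        is_mafia = seat in mafia_seats
        players.append({
            "seat": seat,
            "role": "mafia" if is_mafia else "town",
            "model": mafia_model if is_mafia else town_model,
            # Mafia know who the mafia are; townspeople know nothing about roles.
            "known_mafia": sorted(mafia_seats) if is_mafia else [],
        })
    return players


players = assign_roles("model-a", "model-b", random.Random(0))
```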
The goal of the townspeople is to identify and eliminate all the mafia. The goal of the mafia is to eliminate enough townspeople so that there are an equal number of townspeople and mafia.
Each game round consists of a day phase and a night phase. During the day, players make statements and vote on who to eliminate. During the night, the mafia confer in secret and vote on a townsperson to kill.
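Both phases end in a vote, which could be tallied roughly as below. This is a sketch only: `tally_votes` is a hypothetical helper, and the tie rule (no elimination on a tie) is an assumption, since the write-up does not say how ties are resolved.

```python
from collections import Counter


def tally_votes(votes):
    """Return the target with the most votes, or None if there is no clear winner.

    `votes` maps each voter's name to the name of the player they voted for.
    Assumption: a tie for first place means nobody is eliminated.
    """
    if not votes:
        return None
    ranked = Counter(votes.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place
    return ranked[0][0]


day_votes = {"Alice": "Bob", "Bob": "Carol", "Carol": "Bob", "Dave": "Bob"}
eliminated = tally_votes(day_votes)  # "Bob" has three of the four votes
```

The same function would work for the night phase, with only the mafia voting and only townspeople as valid targets.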
The game ends when the number of mafia is 0 (townspeople win) or the number of mafia is equal to the number of townspeople (mafia win).
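The end condition translates directly into a small check. The function name `winner` and its return values are illustrative choices, not taken from the benchmark's code.

```python
def winner(num_mafia, num_town):
    """Return "town", "mafia", or None if the game should continue."""
    if num_mafia == 0:
        return "town"   # all mafia eliminated
    if num_mafia == num_town:
        return "mafia"  # mafia have reached parity with the townspeople
    return None


# Fastest possible mafia win from the 2-vs-6 start: every day vote and every
# night kill removes a townsperson, so the town count goes 6, 5, 4, 3, 2.
mafia, town = 2, 6
removed = 0
while winner(mafia, town) is None:
    town -= 1
    removed += 1
```

Because each elimination changes a count by exactly one, the mafia cannot outnumber the town without first reaching parity, so an equality test is enough.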
Tournament
The benchmark was run as a 15-round Swiss tournament. Each round, models are paired up and play 10 games against each other (five as mafia, five as townspeople). Once models have played against all other models, they are always paired up in subsequent rounds with the model closest to them in Elo rating.
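For readers unfamiliar with Elo, here is a sketch of the standard rating update and of one simple way to pair models by rating. Several things here are assumptions: the K-factor of 32, the way a result is scored (`score_a` as the fraction of games won), and the adjacent-pairs strategy in `pair_by_rating`. The write-up does not specify any of these.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after A scores `score_a` in [0, 1].

    Assumption: k=32 is a common default, not the benchmark's actual value.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def pair_by_rating(ratings):
    """Pair neighbours in the rating-sorted list (one possible 'closest Elo' rule)."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


ratings = {"model-a": 1500.0, "model-b": 1500.0, "model-c": 1400.0, "model-d": 1450.0}
pairs = pair_by_rating(ratings)
# Example: model-a wins 7 of its 10 games against model-b.
new_a, new_b = update_elo(ratings["model-a"], ratings["model-b"], score_a=0.7)
```

The update is zero-sum, so the points one model gains are exactly the points its opponent loses.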
Next Steps
In the next iteration, I'd like to include reasoning models, and also GPT-4.5. I welcome feedback on what people would like to see next, whether in terms of models to include or methodological improvements. Get in touch with me @status_effects