Fizzbuzz LLM Benchmark

A (silly) benchmark for testing how well LLMs can play the children's game Fizzbuzz. Models are given the following instructions:

You are playing FizzBuzz with the following rules:
- If a number is divisible by {fizz_num}, say 'fizz'
- If a number is divisible by {buzz_num}, say 'buzz'
- If a number is divisible by both {fizz_num} and {buzz_num}, say 'fizzbuzz'
- Otherwise, say the number itself

I will give you a number, and you must respond with the NEXT number (or word) in the sequence following these rules. Respond with ONLY the answer - just the number, 'fizz', 'buzz', or 'fizzbuzz'. No explanations, no additional text, no punctuation.
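As a rough illustration of what a correct reply looks like, here is a minimal reference sketch of the rules above. The function names `answer` and `expected_reply` are hypothetical and are not taken from the repository's scripts. It also assumes the number handed to the model is always available as an integer.

```python
def answer(n: int, fizz_num: int = 3, buzz_num: int = 5) -> str:
    """Return what should be said for n under the rules in the prompt."""
    if n % fizz_num == 0 and n % buzz_num == 0:
        return "fizzbuzz"
    if n % fizz_num == 0:
        return "fizz"
    if n % buzz_num == 0:
        return "buzz"
    return str(n)


def expected_reply(given: int, fizz_num: int = 3, buzz_num: int = 5) -> str:
    """The model is given a number and must reply with the NEXT item."""
    return answer(given + 1, fizz_num, buzz_num)


print(expected_reply(14))       # fizzbuzz (15 is divisible by both 3 and 5)
print(expected_reply(6, 7, 5))  # fizz (7 is divisible by fizz_num=7)
```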

By customizing fizz_num and buzz_num, we can test whether LLMs actually generalize to the new rules, or just reproduce what they memorized about FizzBuzz from training. The benchmark has 3 difficulty levels:

Easy: standard FizzBuzz (fizz_num is 3, buzz_num is 5).
Medium: buzz_num is 7.
Hard: fizz_num is 7, buzz_num is 5.
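To see how quickly each level departs from the memorized game, the sketch below finds the first number where a level's correct output differs from standard FizzBuzz. The `LEVELS` table and helper names are hypothetical. The Medium entry assumes fizz_num stays at 3, since the text only states buzz_num for that level.

```python
# (fizz_num, buzz_num) per level. Medium's fizz_num = 3 is an assumption:
# the description above only gives buzz_num for that level.
LEVELS = {
    "easy": (3, 5),
    "medium": (3, 7),
    "hard": (7, 5),
}


def say(n, fizz_num, buzz_num):
    """Correct output for n: 'fizz', 'buzz', 'fizzbuzz', or the number."""
    word = ("fizz" if n % fizz_num == 0 else "") + ("buzz" if n % buzz_num == 0 else "")
    return word or str(n)


def first_divergence(level, limit=200):
    """First n where this level differs from standard FizzBuzz, or None."""
    fizz_num, buzz_num = LEVELS[level]
    for n in range(1, limit + 1):
        if say(n, fizz_num, buzz_num) != say(n, 3, 5):
            return n
    return None


for name in LEVELS:
    print(name, first_divergence(name))
# easy None
# medium 5
# hard 3
```

Under these assumptions, a model that falls back on memorized FizzBuzz would slip within the first few turns of Medium and Hard.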

The score at each level is normalized to 100, and a final composite score out of 100 is calculated to reward good generalization performance:

final_score = 0.2 * easy + 0.35 * medium + 0.45 * hard
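The weights sum to 1, so the composite stays on the same 0 to 100 scale as the per-level scores. A small sketch of the calculation follows. The function name and the example scores are made up for illustration and are not real leaderboard results.

```python
# Weights from the formula above; they sum to 1.0.
WEIGHTS = {"easy": 0.2, "medium": 0.35, "hard": 0.45}


def final_score(easy: float, medium: float, hard: float) -> float:
    """Weighted composite of the three per-level scores (each out of 100)."""
    return (
        WEIGHTS["easy"] * easy
        + WEIGHTS["medium"] * medium
        + WEIGHTS["hard"] * hard
    )


# Hypothetical model: perfect on Easy, weaker on the customized levels.
print(f"{final_score(100, 80, 40):.1f}")  # 66.0
```

In this made-up example, a perfect Easy score contributes only 20 points, so a model that only knows standard FizzBuzz cannot score well overall.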

How to Run

Ensure you have API keys set up as environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc):

# Standard game for 100 turns
python fizzbuzz_anthropic.py

# Use 7 for fizz and 4 for buzz and play for 200 turns
python fizzbuzz_anthropic.py --fizz 7 --buzz 4 --turns 200

You can view the raw turn-based conversation for every model in the logs/ folder.
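Before spending tokens on a run, it can help to confirm the relevant key is actually visible to Python. The helper below is a hypothetical preflight check and is not part of the repository. It only assumes the environment variable names mentioned above.

```python
import os


def missing_keys(names):
    """Return the environment variable names that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


# Check only the provider you intend to run, e.g. Anthropic.
missing = missing_keys(["ANTHROPIC_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required keys are set.")
```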

Leaderboard

You can view the leaderboard at venkatasg.net/fizzbuzz-bench.

FAQs

Does this say anything about the models? Clearly this isn't reflective of any real-world tasks or uses for LLMs. But whether LLMs are good at arithmetic and counting, long multi-turn conversations, and generalization are all active areas of research. This simple benchmark tests all three!

Why didn't you set temperature to zero? Are the results reproducible? I initially set up this benchmark to query all models with temperature=0. However, this led to worse results on many models, and LLM providers even advise against it for reasoning tasks. I've left all parameters at their defaults, as I believe this gives models the biggest advantage (I set thinking as high as the API allows). As a result, the results are not (and cannot be) deterministic. I try to report the highest score I observe with a model when I run the benchmark.

Why haven't you gone beyond 200 turns or averaged over multiple restarts per model? There's only so much money I'm willing to burn on tokens for this benchmark 😅.

