BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
Abstract
We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses a gap in current evaluations, which often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests, annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
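The complexity framework itself is described in the paper; purely to illustrate the underlying idea (profile a function over growing input sizes, then fit candidate growth curves to the measurements), here is a minimal Python sketch. All names are made up and the candidate set is deliberately tiny; this is not BigO(Bench)'s actual inference code.

```python
import time
import numpy as np

# Illustrative subset of candidate complexity classes and their growth functions.
CANDIDATES = {
    "O(1)":       lambda n: np.ones_like(n, dtype=float),
    "O(log n)":   lambda n: np.log2(n),
    "O(n)":       lambda n: n.astype(float),
    "O(n log n)": lambda n: n * np.log2(n),
    "O(n^2)":     lambda n: n.astype(float) ** 2,
}

def measure_runtimes(func, make_input, sizes, repeats=3):
    """Best-of-`repeats` wall-clock runtime of `func` for each input size."""
    runtimes = []
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        runtimes.append(best)
    return np.array(runtimes)

def infer_time_complexity(func, make_input, sizes=tuple(2 ** k for k in range(4, 11))):
    """Fit t ~ c * g(n) for each candidate g by least squares; return the best fit."""
    sizes = np.array(sizes)
    runtimes = measure_runtimes(func, make_input, sizes)
    best_label, best_err, best_coef = None, float("inf"), None
    for label, growth in CANDIDATES.items():
        g = growth(sizes)
        coef = float(g @ runtimes / (g @ g))  # closed-form least-squares coefficient
        err = float(np.mean((runtimes - coef * g) ** 2))
        if err < best_err:
            best_label, best_err, best_coef = label, err, coef
    return best_label, best_coef

if __name__ == "__main__":
    # A deliberately quadratic toy function: sum over all ordered pairs.
    pairwise_sums = lambda xs: sum(a + b for a in xs for b in xs)
    print(infer_time_complexity(pairwise_sums, lambda n: list(range(n))))  # expect O(n^2)
```

A single least-squares coefficient per candidate is the crudest possible fit; the paper's framework additionally handles input generation for arbitrary problems and estimates curve coefficients, but fitting measurements against candidate growth curves is the gist.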
Community
Does your LLM truly comprehend the complexity of the code it generates? 🥰
Introducing our new non-saturated (for at least the coming week? 😉) benchmark:
✨BigO(Bench)✨ - Can LLMs Generate Code with Controlled Time and Space Complexity?
📸 Quick snapshot of our results! 🏅
BigO(Bench) evaluates high-level reasoning skills in coding, revealing that top-scoring models on Code Contests often struggle when required to both write and reason about their code.
Extra-Space Complexity seems particularly challenging!
Beyond generating code solutions, can LLMs answer the final Time/Space Complexity question of coding interviews? 👨🏫
We investigate the performance of LLMs on 3 tasks:
✅ Time/Space Complexity Prediction
✅ Time/Space Complexity Generation
✅ Time/Space Complexity Ranking
Our Complexity Framework can analyze arbitrary Python code snippets, employing input generation methods to empirically measure runtime and memory footprint, thereby inferring complexity classes and corresponding curve coefficients without reliance on oracle models.
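For the memory side, a rough way to approximate such a measurement in plain Python is to track peak allocations with tracemalloc while the function runs on generated inputs of increasing size. This is again only a hedged sketch with made-up names, not the framework shipped with BigO(Bench):

```python
import tracemalloc

def measure_peak_memory(func, make_input, sizes):
    """Peak extra memory (in bytes) allocated by `func` for each input size."""
    peaks = []
    for n in sizes:
        data = make_input(n)  # built before tracing starts, so only *extra* space is counted
        tracemalloc.start()
        func(data)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return peaks

# Example: building an explicit prefix-sum list needs O(n) extra space,
# whereas accumulating a single running total would be O(1).
def prefix_sums(xs):
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

if __name__ == "__main__":
    sizes = [2 ** k for k in range(10, 16)]
    for n, peak in zip(sizes, measure_peak_memory(prefix_sums, lambda m: list(range(m)), sizes)):
        print(f"n={n:6d}  peak={peak:9d} bytes")
```

The resulting peaks can then be fed to the same kind of curve fitting as the runtimes to classify extra-space complexity.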
LLMs struggle with Complexity Generation (generating code that meets specific complexity requirements), underperforming compared to Complexity Prediction (predicting the complexity of existing code) or to generating code alone.
Token-space reasoning models perform best!
Nevertheless, even reasoning models struggle with the ambiguity of higher-level reasoning tasks, especially when there is no explicit verifier they were reinforced on.
Do they really 'think' about the notions they 'know', or do they merely learn patterns of 'thought' by heart during training?
With newly released benchmarks quickly getting saturated, BigO(Bench) aims to evaluate high-level reasoning skills that remain out of reach of current LLMs and are hard to train or reinforce on, which keeps their scores low.
Thanks to everyone involved: Baptiste Roziere, Benoit Sagot and Gabriel Synnaeve!
🏡Homepage: http://facebookresearch.github.io/BigOBench
📚ArXiv: http://arxiv.org/abs/2503.15242
💻Github: http://github.com/facebookresearch/bigobench
🤗Huggingface: http://huggingface.co/datasets/facebook/BigOBench
🏆Leaderboard: http://facebookresearch.github.io/BigOBench/leaderboard.html
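If you want to poke at the data yourself, something along these lines should work with the 🤗 datasets library; the split name below is an assumption, so check the dataset card if the call needs adjusting:

```python
from datasets import load_dataset

# "train" is an assumed split name; see the dataset card for the actual configurations.
ds = load_dataset("facebook/BigOBench", split="train")
print(ds[0])
```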
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation (2025)
- CodeArena: A Collective Evaluation Platform for LLM Code Generation (2025)
- LLM4EFFI: Leveraging Large Language Models to Enhance Code Efficiency and Correctness (2025)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation (2025)
- A Survey On Large Language Models For Code Generation (2025)
- CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation (2025)
- Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval (2025)