BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
Abstract
We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses a gap in current evaluations, which often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests, annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
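The complexity framework itself is described in the paper; purely to illustrate the underlying idea (profile a function over growing input sizes, then fit candidate growth curves to the measurements), here is a minimal Python sketch. All names are made up and the candidate set is deliberately tiny; this is not BigO(Bench)'s actual inference code.

```python
import time
import numpy as np

# Illustrative subset of candidate complexity classes and their growth functions.
CANDIDATES = {
    "O(1)":       lambda n: np.ones_like(n, dtype=float),
    "O(log n)":   lambda n: np.log2(n),
    "O(n)":       lambda n: n.astype(float),
    "O(n log n)": lambda n: n * np.log2(n),
    "O(n^2)":     lambda n: n.astype(float) ** 2,
}

def measure_runtimes(func, make_input, sizes, repeats=3):
    """Best-of-`repeats` wall-clock runtime of `func` for each input size."""
    runtimes = []
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        runtimes.append(best)
    return np.array(runtimes)

def infer_time_complexity(func, make_input, sizes=tuple(2 ** k for k in range(4, 11))):
    """Fit t ~ c * g(n) for each candidate g by least squares; return the best fit."""
    sizes = np.array(sizes)
    runtimes = measure_runtimes(func, make_input, sizes)
    best_label, best_err, best_coef = None, float("inf"), None
    for label, growth in CANDIDATES.items():
        g = growth(sizes)
        coef = float(g @ runtimes / (g @ g))  # closed-form least-squares coefficient
        err = float(np.mean((runtimes - coef * g) ** 2))
        if err < best_err:
            best_label, best_err, best_coef = label, err, coef
    return best_label, best_coef

if __name__ == "__main__":
    # A deliberately quadratic toy function: sum over all ordered pairs.
    pairwise_sums = lambda xs: sum(a + b for a in xs for b in xs)
    print(infer_time_complexity(pairwise_sums, lambda n: list(range(n))))  # expect O(n^2)
```

A single least-squares coefficient per candidate is the crudest possible fit; the paper's framework additionally handles input generation for arbitrary problems and estimates curve coefficients, but fitting measurements against candidate growth curves is the gist.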
Community
Does your LLM truly comprehend the complexity of the code it generates? 🥰
Introducing our new non-saturated (for at least the coming week? 😉) benchmark:
✨BigO(Bench)✨ - Can LLMs Generate Code with Controlled Time and Space Complexity?
📸 Quick snapshot of our results! 🏅
BigO(Bench) evaluates high-level reasoning skills in coding, revealing that top-scoring models on Code Contests often struggle when required to both write and reason about their code.
Extra-Space Complexity seems particularly challenging!
Beyond generating code solutions, can LLMs answer the final Time/Space Complexity question of coding interviews? 👨🏫
We investigate the performance of LLMs on 3 tasks:
✅ Time/Space Complexity Prediction
✅ Time/Space Complexity Generation
✅ Time/Space Complexity Ranking
Our Complexity Framework can analyze arbitrary Python code snippets, employing input generation methods to empirically measure runtime and memory footprint, thereby inferring complexity classes and corresponding curve coefficients without reliance on oracle models.
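For the memory side, a rough way to approximate such a measurement in plain Python is to track peak allocations with tracemalloc while the function runs on generated inputs of increasing size. This is again only a hedged sketch with made-up names, not the framework shipped with BigO(Bench):

```python
import tracemalloc

def measure_peak_memory(func, make_input, sizes):
    """Peak extra memory (in bytes) allocated by `func` for each input size."""
    peaks = []
    for n in sizes:
        data = make_input(n)  # built before tracing starts, so only *extra* space is counted
        tracemalloc.start()
        func(data)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return peaks

# Example: building an explicit prefix-sum list needs O(n) extra space,
# whereas accumulating a single running total would be O(1).
def prefix_sums(xs):
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

if __name__ == "__main__":
    sizes = [2 ** k for k in range(10, 16)]
    for n, peak in zip(sizes, measure_peak_memory(prefix_sums, lambda m: list(range(m)), sizes)):
        print(f"n={n:6d}  peak={peak:9d} bytes")
```

The resulting peaks can then be fed to the same kind of curve fitting as the runtimes to classify extra-space complexity.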
LLMs struggle with Complexity Generation (generating code that meets specific complexity requirements), underperforming compared to Complexity Prediction (predicting the complexity of existing code) or to generating code alone.
Token-space reasoning models perform best!
Nevertheless, even reasoning models struggle with the ambiguity of higher-level reasoning tasks, especially when there is no explicit verifier they were reinforced on.
Do they really 'think' about the notions they 'know', or do they merely learn patterns of 'thought' by heart during training?
With newly released benchmarks quickly getting saturated, BigO(Bench) aims to evaluate high-level reasoning skills that remain out of reach of current LLMs and are hard to train or reinforce on, which keeps their scores low.
Thanks to everyone involved: Baptiste Roziere, Benoit Sagot and Gabriel Synnaeve!
🏡Homepage: http://facebookresearch.github.io/BigOBench
📚ArXiv: http://arxiv.org/abs/2503.15242
💻Github: http://github.com/facebookresearch/bigobench
🤗Huggingface: http://huggingface.co/datasets/facebook/BigOBench
🏆Leaderboard: http://facebookresearch.github.io/BigOBench/leaderboard.html
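If you want to poke at the data yourself, something along these lines should work with the 🤗 datasets library; the split name below is an assumption, so check the dataset card if the call needs adjusting:

```python
from datasets import load_dataset

# "train" is an assumed split name; see the dataset card for the actual configurations.
ds = load_dataset("facebook/BigOBench", split="train")
print(ds[0])
```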
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation (2025)
- CodeArena: A Collective Evaluation Platform for LLM Code Generation (2025)
- LLM4EFFI: Leveraging Large Language Models to Enhance Code Efficiency and Correctness (2025)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation (2025)
- A Survey On Large Language Models For Code Generation (2025)
- CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation (2025)
- Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval (2025)