Pierre Chambon
@pierrechambon.bsky.social
PhD at FAIR (Meta) and INRIA. Former researcher at Stanford University.
pierrechambon.bsky.social
As they use no reasoning tokens and leverage MoE with only 17B active parameters, both Maverick and Scout are much faster than reasoning models 🏎️🏁.

To generate ~50% of the Time Complexity Generation responses, QwQ takes ~30 min whereas Llama 4 needs only a few dozen seconds 🥳.
pierrechambon.bsky.social
Llama 4 results out on ✨BigO(Bench)✨!

Llama 4 Maverick is top 4 All@1 on Time Complexity Generation and top 2 🥈 coeffFull on Time Complexity Ranking (beating R1, though not using any reasoning tokens).

The model performs less well on Space Complexity.

👇All links below👇
pierrechambon.bsky.social
🤲OpenHands LM is not a reasoning model, which makes its inference cost far lower than that of the SOTA models on BigO(Bench).

It does best on Complexity Prediction tasks, where it even outperforms o1-mini! 🎉 But it falls behind on the Generation and Ranking tasks.
pierrechambon.bsky.social
🧑‍💻The DeepCoder model displays impressive performance, but it suffered from limited inference compute on BigO(Bench).

Though our inference budget is large (enough for reasoning models like QwQ, R1 or Nemotron-Ultra 🥵), DeepCoder's responses seemed to take even longer.
pierrechambon.bsky.social
🏆The NVIDIA Nemotron family includes an 8B, a 49B and a 253B model, the latter being the one benchmarked, with deep thinking on.

Nemotron-Ultra 253B displays high and consistent performance on BigO(Bench) (very often on the podium). It takes the lead on Space Complexity Generation and Ranking!🥳
pierrechambon.bsky.social
✨BigO(Bench)✨ Leaderboard Update!

3 models added to our benchmark:
🏆 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
🧑‍💻 agentica-org/DeepCoder-14B-Preview
🤲 all-hands/openhands-lm-32b-v0.1

Thanks @vllm_project and @huggingface for quickly supporting inference!

👇All links below👇
pierrechambon.bsky.social
🔥Very happy to introduce the BigO(Bench) dataset on @hf.co 🤗

✨3,105 coding problems and 1,190,250 solutions from CodeContests

✨Time/Space Complexity labels and curve coefficients

✨Up to 5k Runtime/Memory Footprint measures for each solution

huggingface.co/datasets/fac...
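
For anyone who wants to poke at the data, here is a minimal sketch using the 🤗 `datasets` library; the dataset id and field names below are assumptions, so check the dataset card at the link above.

```python
# Minimal sketch: load the BigO(Bench) data with the Hugging Face `datasets` library.
# The dataset id below is a guess; use the exact id from the dataset card linked above.
from datasets import load_dataset

DATASET_ID = "facebook/BigOBench"  # hypothetical id, replace with the real one

ds = load_dataset(DATASET_ID)   # downloads and caches the dataset locally
print(ds)                       # inspect the available splits/configurations

# Peek at one record to see which fields are present
# (problem statement, solution code, time/space complexity labels, measures, ...).
first_split = next(iter(ds.values()))
print(first_split[0].keys())
```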
pierrechambon.bsky.social
🤔Would you like to see any other model, newly released or more established in the LLM community, benchmarked on ✨BigO(Bench)✨?

👇Happy to provide details/help with any suggestion!

🧵5/6
pierrechambon.bsky.social
🥈DeepSeekV3-0324's performance is impressive given that it uses no reasoning tokens: it even outperforms DeepSeek on Time Complexity Generation by ~45% All@1.

These tasks usually require extensive reasoning skills; "thinking" steps easily take ~20k tokens.

🧵4/6
pierrechambon.bsky.social
🥇QwQ displays impressive performance, pushing the SOTA on Time and Space Complexity Generation by 100% All@1 and 50% All@1 respectively.

On Time Complexity Ranking, QwQ also beats DeepSeekR1 distilled models by ~30% coeffFull, while being on par with DeepSeek on Space.

🧵3/6
pierrechambon.bsky.social
📸Results snapshot! 🏅

All these models have a similar number of active parameters, DeepSeekV3-0324 being a MoE model with 37B active parameters.

Whereas DeepSeekR1 and QwQ use reasoning tokens (and therefore way more inference tokens), Gemma3 and DeepSeekV3-0324 directly output the result.

🧵2/6
pierrechambon.bsky.social
Limitations remain, notably in the complexity framework, which is prone to errors: for specific problems it can miss worst-case complexity edge cases. Its measurements also remain noisy, as they still rely on real CPU runtimes and statistical measuring tools.
pierrechambon.bsky.social
As newly released benchmarks quickly get saturated, BigO(Bench) aims to evaluate high-level reasoning skills that remain out of reach of current LLMs and are hard to train or reinforce on, which brings their performance down.
pierrechambon.bsky.social
Reasoning models struggle with the ambiguity of higher-level reasoning tasks, especially when there is no explicit verifier they were reinforced upon.

Do they really ‘think’ about notions they ‘know’, or do they merely memorize patterns of ‘thoughts’ during training?
pierrechambon.bsky.social
Models tend to under-perform on non-optimal complexity classes compared to the most optimized class of every problem. This seems counterintuitive for any human programmer, who is usually accustomed to easily finding non-optimized solutions but struggles to find the best ones.
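
A toy illustration of what targeting a non-optimal class means (my own example, not taken from the benchmark): both solutions below are trivial for a human to write, yet models do worse when asked to hit the deliberately slower one.

```python
# Toy example (not from the benchmark): two correct solutions to "sort a list",
# one in the optimal class and one deliberately non-optimal.

def sort_optimal(xs):
    """O(n log n) time: rely on the built-in Timsort."""
    return sorted(xs)

def sort_quadratic(xs):
    """O(n^2) time: selection sort, deliberately non-optimal."""
    xs = list(xs)
    for i in range(len(xs)):
        j_min = min(range(i, len(xs)), key=xs.__getitem__)  # index of the minimum of xs[i:]
        xs[i], xs[j_min] = xs[j_min], xs[i]
    return xs

assert sort_optimal([3, 1, 2]) == sort_quadratic([3, 1, 2]) == [1, 2, 3]
```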
pierrechambon.bsky.social
LLMs struggle with Complexity Generation (generating code that meets specific complexity requirements), underperforming compared to Complexity Prediction (predicting the complexity of existing code) or to generating code alone.

Token-space reasoning models perform best!
pierrechambon.bsky.social
Utilizing 3,105 coding problems and 1,190,250 solutions from Code Contests, we applied the Complexity Framework to derive two test sets: one for Time Complexity (311 problems) and one for Space Complexity (308 problems), each problem comprising multiple complexity classes.
pierrechambon.bsky.social
Our Complexity Framework can analyze arbitrary Python code snippets, employing input generation methods to empirically measure runtime and memory footprint, thereby inferring complexity classes and corresponding curve coefficients without reliance on oracle models.
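
A very rough sketch of that idea, as a simplification rather than the released framework: time the snippet on inputs of growing size and keep the complexity class whose curve best fits the measurements.

```python
# Rough sketch of dynamic complexity inference (my simplification, not the released
# framework): time a function on inputs of growing size and keep the complexity class
# whose curve best fits the measured runtimes (closed-form least squares on the scale).
import math
import random
import time

CLASSES = {
    "O(1)":       lambda n: 1.0,
    "O(log n)":   lambda n: math.log(n),
    "O(n)":       lambda n: float(n),
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)":     lambda n: float(n) ** 2,
}

def infer_time_complexity(fn, make_input, sizes=(128, 256, 512, 1024, 2048)):
    # Wall-clock timings are noisy, exactly the limitation mentioned in the thread.
    times = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)

    best_class, best_err = None, float("inf")
    for name, growth in CLASSES.items():
        feats = [growth(n) for n in sizes]
        # Best scale coefficient c minimizing sum((t - c*f)^2), in closed form.
        c = sum(t * f for t, f in zip(times, feats)) / sum(f * f for f in feats)
        err = sum((t - c * f) ** 2 for t, f in zip(times, feats))
        if err < best_err:
            best_class, best_err = name, err
    return best_class

# A quadratic snippet should come out close to O(n^2).
print(infer_time_complexity(
    lambda xs: [x for x in xs for _ in xs],               # builds n*n elements
    lambda n: [random.random() for _ in range(n)],
))
```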
pierrechambon.bsky.social
First, we developed a novel Dynamic Complexity Inference tool to measure Time/Space Complexity of code snippets 👉 Code is released!

The framework ran on ~1M Code Contests solutions 👉 Data is public too!

Lastly, we designed test sets and evaluated LLMs 👉 Leaderboard is out!
pierrechambon.bsky.social
Beyond generating code solutions, can LLMs answer the final Time/Space Complexity question of coding interviews? 👨‍🏫

We investigate the performance of LLMs on 3 tasks:

✅ Time/Space Complexity Prediction

✅ Time/Space Complexity Generation

✅ Time/Space Complexity Ranking
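
A toy paraphrase of what each task asks of the model (not the exact benchmark prompts or format):

```python
# Toy paraphrase of the three tasks (not the exact benchmark prompts or format).
snippet = "def solve(xs):\n    return sorted(xs)[0]"   # program under study

tasks = {
    # 1) Prediction: given existing code, state its complexity.
    "Prediction": f"What is the time complexity of this code?\n{snippet}",
    # 2) Generation: produce code that meets a required complexity class.
    "Generation": "Return the smallest element of a list, in O(n) time and O(1) extra space.",
    # 3) Ranking: order several solutions to the same problem by complexity.
    "Ranking": "Rank these solutions of the same problem from lowest to highest complexity: ...",
}

for name, prompt in tasks.items():
    print(f"== Time/Space Complexity {name} ==\n{prompt}\n")
```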