Pierre Chambon
@pierrechambon.bsky.social
PhD at FAIR (Meta) and INRIA. Former researcher at Stanford University.
pierrechambon.bsky.social
As they use no reasoning tokens and leverage MoE with only 17B active parameters, both Maverick and Scout are much faster than reasoning models 🏎️🏁.

To generate ~50% of the Time Complexity Generation responses, QwQ takes ~30 min whereas Llama 4 needs only a few dozen seconds 🥳.
pierrechambon.bsky.social
Llama 4 results out on ✨BigO(Bench)✨!

Llama 4 Maverick is top 4 All@1 on Time Complexity Generation and top 2 🥈 coeffFull on Time Complexity Ranking (beating R1, though not using any reasoning tokens).

The model performs less well on Space Complexity.

👇All links below👇
pierrechambon.bsky.social
🤲OpenHands LM is not a reasoning model, which makes its inference cost far lower than that of the SOTA models on BigO(Bench).

It does best on Complexity Prediction tasks, where it even outperforms o1-mini! 🎉 But it falls behind on the Generation and Ranking tasks.
pierrechambon.bsky.social
🧑‍💻The DeepCoder model displays impressive performance, but it suffered from limited inference compute on BigO(Bench).

Though our inference budget is large (enough for reasoning models like QwQ, R1 or Nemotron-Ultra 🥵), DeepCoder's responses seemed to take even longer.
pierrechambon.bsky.social
🏆The NVIDIA Nemotron family includes an 8B, a 49B and a 253B model, the latter being the one benchmarked, with deep thinking on.

Nemotron-Ultra 253B displays high and consistent performance on BigO(Bench) (very often on the podium). It takes the lead on Space Complexity Generation and Ranking!🥳
pierrechambon.bsky.social
✨BigO(Bench)✨ Leaderboard Update!

3 models added to our benchmark:
🏆 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
🧑‍💻 agentica-org/DeepCoder-14B-Preview
🤲 all-hands/openhands-lm-32b-v0.1

Thanks @vllm_project and @huggingface for quickly supporting inference!

👇All links below👇
pierrechambon.bsky.social
🔥Very happy to introduce the BigO(Bench) dataset on @hf.co 🤗

✨3,105 coding problems and 1,190,250 solutions from CodeContests

✨Time/Space Complexity labels and curve coefficients

✨Up to 5k Runtime/Memory Footprint measures for each solution

huggingface.co/datasets/fac...
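
For anyone who wants to poke at the data, here is a minimal sketch using the 🤗 `datasets` library; the dataset id and field names below are assumptions, so check the dataset card at the link above.

```python
# Minimal sketch: load the BigO(Bench) data with the Hugging Face `datasets` library.
# The dataset id below is a guess; use the exact id from the dataset card linked above.
from datasets import load_dataset

DATASET_ID = "facebook/BigOBench"  # hypothetical id, replace with the real one

ds = load_dataset(DATASET_ID)   # downloads and caches the dataset locally
print(ds)                       # inspect the available splits/configurations

# Peek at one record to see which fields are present
# (problem statement, solution code, time/space complexity labels, measures, ...).
first_split = next(iter(ds.values()))
print(first_split[0].keys())
```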
pierrechambon.bsky.social
🤔Would you like to see any other model, newly released or more established in the LLM community, benchmarked on ✨BigO(Bench)✨?

👇Happy to provide details/help with any suggestion!

🧵5/6
pierrechambon.bsky.social
🥈DeepSeekV3-0324's performance is impressive given that it uses no reasoning tokens: it even outperforms DeepSeek on Time Complexity Generation by ~45% All@1.

These tasks usually require extensive reasoning skills; "thinking" steps easily take ~20k tokens.

🧵4/6
pierrechambon.bsky.social
🥇QwQ displays impressive performance, pushing the SOTA on Time and Space Complexity Generation by 100% All@1 and 50% All@1 respectively.

On Time Complexity Ranking, QwQ also beats DeepSeekR1 distilled models by ~30% coeffFull, while being on par with DeepSeek on Space.

🧵3/6
pierrechambon.bsky.social
📸Results snapshot! 🏅

All these models have a similar number of active parameters, DeepSeekV3-0324 being a MoE model with 37B active parameters.

Whereas DeepSeekR1 and QwQ use reasoning tokens (and therefore way more inference tokens), Gemma3 and DeepSeekV3-0324 directly output the result.

🧵2/6
pierrechambon.bsky.social
Limitations remain, notably in the complexity framework, which is prone to errors: for specific problems it can miss worst-case complexity edge cases. Its measurements also remain noisy, as they still rely on real CPU runtimes and statistical measuring tools.
pierrechambon.bsky.social
As newly released benchmarks quickly get saturated, BigO(Bench) aims to evaluate high-level reasoning skills that remain out of reach of current LLMs and are hard to train or reinforce on, which brings their performance down.
pierrechambon.bsky.social
Reasoning models struggle with the ambiguity of higher-level reasoning tasks, especially when there is no explicit verifier they were reinforced upon.

Do they really ‘think’ about notions they ‘know’, or do they merely memorize patterns of ‘thoughts’ during training?
pierrechambon.bsky.social
Models tend to under-perform on non-optimal complexity classes compared to the most optimized class of every problem. This seems counterintuitive for any human programmer, who is usually accustomed to easily finding non-optimized solutions but struggles to find the best ones.
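
A toy illustration of what targeting a non-optimal class means (my own example, not taken from the benchmark): both solutions below are trivial for a human to write, yet models do worse when asked to hit the deliberately slower one.

```python
# Toy example (not from the benchmark): two correct solutions to "sort a list",
# one in the optimal class and one deliberately non-optimal.

def sort_optimal(xs):
    """O(n log n) time: rely on the built-in Timsort."""
    return sorted(xs)

def sort_quadratic(xs):
    """O(n^2) time: selection sort, deliberately non-optimal."""
    xs = list(xs)
    for i in range(len(xs)):
        j_min = min(range(i, len(xs)), key=xs.__getitem__)  # index of the minimum of xs[i:]
        xs[i], xs[j_min] = xs[j_min], xs[i]
    return xs

assert sort_optimal([3, 1, 2]) == sort_quadratic([3, 1, 2]) == [1, 2, 3]
```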
pierrechambon.bsky.social
LLMs struggle with Complexity Generation (generating code that meets specific complexity requirements), underperforming compared to Complexity Prediction (predicting the complexity of existing code) or to generating code alone.

Token-space reasoning models perform best!
pierrechambon.bsky.social
Utilizing 3,105 coding problems and 1,190,250 solutions from Code Contests, we applied the Complexity Framework to derive two test sets: one for Time Complexity (311 problems) and one for Space Complexity (308 problems), each problem comprising multiple complexity classes.
pierrechambon.bsky.social
Our Complexity Framework can analyze arbitrary Python code snippets, employing input generation methods to empirically measure runtime and memory footprint, thereby inferring complexity classes and corresponding curve coefficients without reliance on oracle models.
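
A very rough sketch of that idea, as a simplification rather than the released framework: time the snippet on inputs of growing size and keep the complexity class whose curve best fits the measurements.

```python
# Rough sketch of dynamic complexity inference (my simplification, not the released
# framework): time a function on inputs of growing size and keep the complexity class
# whose curve best fits the measured runtimes (closed-form least squares on the scale).
import math
import random
import time

CLASSES = {
    "O(1)":       lambda n: 1.0,
    "O(log n)":   lambda n: math.log(n),
    "O(n)":       lambda n: float(n),
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)":     lambda n: float(n) ** 2,
}

def infer_time_complexity(fn, make_input, sizes=(128, 256, 512, 1024, 2048)):
    # Wall-clock timings are noisy, exactly the limitation mentioned in the thread.
    times = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)

    best_class, best_err = None, float("inf")
    for name, growth in CLASSES.items():
        feats = [growth(n) for n in sizes]
        # Best scale coefficient c minimizing sum((t - c*f)^2), in closed form.
        c = sum(t * f for t, f in zip(times, feats)) / sum(f * f for f in feats)
        err = sum((t - c * f) ** 2 for t, f in zip(times, feats))
        if err < best_err:
            best_class, best_err = name, err
    return best_class

# A quadratic snippet should come out close to O(n^2).
print(infer_time_complexity(
    lambda xs: [x for x in xs for _ in xs],               # builds n*n elements
    lambda n: [random.random() for _ in range(n)],
))
```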
pierrechambon.bsky.social
First, we developed a novel Dynamic Complexity Inference tool to measure Time/Space Complexity of code snippets 👉 Code is released!

The framework ran on ~1M Code Contests solutions 👉 Data is public too!

Lastly, we designed test sets and evaluated LLMs 👉 Leaderboard is out!
pierrechambon.bsky.social
Beyond generating code solutions, can LLMs answer the final Time/Space Complexity question of coding interviews? 👨‍🏫

We investigate the performance of LLMs on 3 tasks:

✅ Time/Space Complexity Prediction

✅ Time/Space Complexity Generation

✅ Time/Space Complexity Ranking
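
A toy paraphrase of what each task asks of the model (not the exact benchmark prompts or format):

```python
# Toy paraphrase of the three tasks (not the exact benchmark prompts or format).
snippet = "def solve(xs):\n    return sorted(xs)[0]"   # program under study

tasks = {
    # 1) Prediction: given existing code, state its complexity.
    "Prediction": f"What is the time complexity of this code?\n{snippet}",
    # 2) Generation: produce code that meets a required complexity class.
    "Generation": "Return the smallest element of a list, in O(n) time and O(1) extra space.",
    # 3) Ranking: order several solutions to the same problem by complexity.
    "Ranking": "Rank these solutions of the same problem from lowest to highest complexity: ...",
}

for name, prompt in tasks.items():
    print(f"== Time/Space Complexity {name} ==\n{prompt}\n")
```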