Andreas Hochlehnert
@ahochlehnert.bsky.social
170 followers 81 following 17 posts
PhD student in ML at Tübingen AI Center & International Max-Planck Research School for Intelligent Systems
Pinned
ahochlehnert.bsky.social
🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.

📊 bethgelab.github.io/sober-reason...
📄 arxiv.org/abs/2504.07086
ahochlehnert.bsky.social
Presenting A Sober Look at Progress in LM Reasoning at @colmweb.org today 🇨🇦 #COLM2025

📅 Today
🕔 11:00 AM – 1:00 PM
📍 Room 710 - Poster #31

We find that many “reasoning” gains fall within variance and show how to make evaluation reproducible again.
📘 bethgelab.github.io/sober-reasoning
Reposted by Andreas Hochlehnert
andreasgeiger.bsky.social
Excited about this new work from @haoyuhe.bsky.social. TL;DR: Diffusion language models treat training and inference differently, which lowers performance. RL can be used to overcome this issue for certain problems.
haoyuhe.bsky.social
🚀 Introducing our new paper, MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models.

📄 Paper: www.scholar-inbox.com/papers/He202...
arxiv.org/pdf/2508.13148
💻 Code: github.com/autonomousvi...
🌐 Project Page: cli212.github.io/MDPO/
ahochlehnert.bsky.social
6/ Our recommendations:

– Evaluate with ≥10 seeds
– Tune decoding per model
– Use appropriate prompts/templates
– Standardize hardware/software (we use Docker)
– Open-source everything

📦 Code, prompts, outputs: github.com/bethgelab/so...
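
A minimal sketch of the aggregation step this protocol implies (not the repo's actual harness; `run_benchmark` is a hypothetical stand-in for one full evaluation run):

```python
import statistics
from typing import Callable

def multi_seed_eval(run_benchmark: Callable[[int], float], n_seeds: int = 10) -> dict:
    """Run one fixed benchmark configuration across >= 10 seeds.

    Everything except the seed (decoding params, prompt template,
    Docker image) should be held constant between calls.
    """
    scores = [run_benchmark(seed) for seed in range(n_seeds)]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "range": (min(scores), max(scores)),
    }

# Dummy runner for illustration; replace with a real harness call.
print(multi_seed_eval(lambda seed: 0.42 + 0.01 * (seed % 3)))
```

Reporting mean ± std rather than a single best run is what makes numbers comparable across papers.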
ahochlehnert.bsky.social
5/ What actually works?
🔹 RL methods? Often negligible gains over distilled baselines, and prone to overfitting.

🔹 Supervised finetuning (SFT) on reasoning traces? Stable & generalizable.
ahochlehnert.bsky.social
4/ Variance is everywhere:

– Random seed: swings Pass@1 by 5–15pp
– Temperature/top-p: another ±10pp
– Software & Hardware? Yes, even that changes scores

🎯 Single-seed results on small datasets are essentially noise.
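
A back-of-the-envelope illustration of that last point (simulated numbers, not from the paper): on a 30-question benchmark like AIME, binomial sampling noise alone produces swings of several points between seeds.

```python
import random
import statistics

def one_run(seed: int, n_questions: int = 30, true_acc: float = 0.5) -> float:
    """Toy benchmark run: fixed 'true' ability, per-seed Pass@1 still varies."""
    rng = random.Random(seed)
    return sum(rng.random() < true_acc for _ in range(n_questions)) / n_questions

scores = [one_run(seed) for seed in range(10)]
spread_pp = (max(scores) - min(scores)) * 100
print(f"mean={statistics.mean(scores):.2f}, min-max spread={spread_pp:.0f}pp")
```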
ahochlehnert.bsky.social
3/ We re-evaluated recent 1.5B and 7B reasoning models on 6 benchmarks under controlled settings.

➡️ Performance dropped by up to 17%
➡️ Improvements fall within variance range of the base model
➡️ Some models don’t beat the baseline!
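
One crude way to operationalize "falls within variance" (a sketch, not the paper's exact statistical procedure; the score lists below are made up):

```python
import statistics

def gain_exceeds_noise(base_scores: list[float],
                       model_scores: list[float],
                       k: float = 2.0) -> bool:
    """Does the mean improvement exceed k standard deviations of the
    base model's seed-to-seed variation?"""
    gain = statistics.mean(model_scores) - statistics.mean(base_scores)
    return gain > k * statistics.stdev(base_scores)

# A ~2pp "gain" sitting inside a noisy base model's spread: not an improvement.
print(gain_exceeds_noise([0.48, 0.53, 0.50, 0.45, 0.52],
                         [0.51, 0.55, 0.49, 0.53, 0.50]))  # -> False
```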
ahochlehnert.bsky.social
2/ Reasoning is the next frontier for LMs—but current evaluation practices often lack rigor.

We find that many celebrated gains from RL methods vanish once you:

✅ average over multiple seeds
✅ control decoding
✅ standardize prompt & infra
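
For "control decoding" concretely, here is what pinning the sampling knobs might look like with the Hugging Face transformers stack (example model id and hyperparameters, not the paper's exact setup; vary only the seed between repeat runs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

set_seed(0)  # the only thing that should change between repeat runs
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 23?"}],
    tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt")
# Set every sampling parameter explicitly instead of trusting library
# defaults, which differ across versions and inference frameworks.
out = model.generate(**inputs, do_sample=True, temperature=0.6,
                     top_p=0.95, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```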
Reposted by Andreas Hochlehnert
prasannamayil.bsky.social
New preprint out! 🎉

How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8
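
(A sketch of what "loss-to-loss scaling" refers to, with made-up placeholder numbers; see the project page for the actual fits. Downstream loss behaves roughly as an affine function of pretraining loss, and the claim is that data and tokenizer, not architecture, set that line.)

```python
import numpy as np

# Placeholder measurements across model scales; real values come from
# trained models, and slope/intercept shift with pretraining data/tokenizer.
pretrain_loss = np.array([3.2, 2.9, 2.7, 2.5, 2.3])
downstream_loss = np.array([4.1, 3.7, 3.4, 3.1, 2.8])
slope, intercept = np.polyfit(pretrain_loss, downstream_loss, 1)
print(f"downstream ~= {slope:.2f} * pretrain + {intercept:.2f}")
```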
ahochlehnert.bsky.social
We are just getting started! We're building better filters, aggregating released benchmarks (DataComp style), and developing fast, accurate OpenThinking models. Stay tuned! w/
@hrdkbhatnagar.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, Matthias Bethge [6/6]
ahochlehnert.bsky.social
These issues encourage shortcuts and flawed reasoning. If GRPO rewards bad logic, models reinforce errors instead of improving. Garbage In, Garbage Out 🚨 [5/6]
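
(Illustration, not from the thread: a minimal exact-match reward of the kind GRPO-style pipelines use. If the dataset's label is wrong or covers only one subquestion, the reward signal itself is garbage.)

```python
def exact_match_reward(model_answer: str, labeled_answer: str) -> float:
    """Toy verifiable reward: 1.0 on exact string match, else 0.0."""
    return 1.0 if model_answer.strip() == labeled_answer.strip() else 0.0

# A correct answer to an unlabeled subquestion still gets reward 0.0,
# so valid reasoning is penalized and errors can be reinforced instead.
print(exact_match_reward("x = 4", "y = 7"))  # -> 0.0
```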
ahochlehnert.bsky.social
🔸 Some questions reference figures that aren't included! Text-only models can't infer missing visuals. [4/6]
ahochlehnert.bsky.social
🔸 Mathematical proofs are a challenge. There's no automated way to verify them, and answers often only show an initial equation, leading to unreliable training signals. [3/6]
ahochlehnert.bsky.social
Blog (For Updates): huggingface.co/datasets/bet...

🔸 Some questions contain subquestions, but only one answer is labeled. The model may get penalized for "wrong" but valid reasoning. [2/6]
[Image: example of multiple questions asked in the analyzed datasets]
ahochlehnert.bsky.social
CuratedThoughts: Data Curation for RL Datasets 🚀

Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be eliminated during data curation.

Here's why 👇🧵
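
(As a concrete illustration of the kind of filtering the thread above describes, here is a hypothetical heuristic pass; the actual CuratedThoughts filters are more involved.)

```python
import re

FIGURE_REF = re.compile(r"\b(figure|fig|diagram)\b", re.IGNORECASE)
PROOF_REQ = re.compile(r"\b(prove|show that)\b", re.IGNORECASE)

def keep_example(question: str) -> bool:
    if FIGURE_REF.search(question):   # references a visual that isn't included
        return False
    if PROOF_REQ.search(question):    # proofs can't be auto-verified
        return False
    if question.count("?") > 1:       # multiple subquestions, single label
        return False
    return True

dataset = [
    "As shown in the figure, find the area of the triangle.",
    "Prove that sqrt(2) is irrational.",
    "What is the GCD of 12 and 18? What is the LCM?",
    "Compute 2^10 mod 7.",
]
print([q for q in dataset if keep_example(q)])  # only the last one survives
```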
Reposted by Andreas Hochlehnert
ofirpress.bsky.social
SWE-bench Multimodal evaluation code is out now!

SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).

www.swebench.com/sb-cli/
ahochlehnert.bsky.social
We are presenting CiteMe today at the 11AM poster session (East Exhibit Hall A-C, #3309)

CiteMe is a challenging benchmark for LM-based agents to find paper citations, moving beyond simple multiple-choice Q&A to real-world use cases.

Come by and say hi :)

citeme.ai
CiteME
CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts.
Reposted by Andreas Hochlehnert
dziadzio.bsky.social
Here's a fledgling starter pack for the AI community in Tübingen. Let me know if you'd like to be added!

go.bsky.app/NFbVzrA
Tübingen AI