Florian Dorner
@flodorner.bsky.social
PhD student in CS @ ETHZ / MPI-IS

Theory of ML evaluation https://flodorner.github.io/
In the second paper (arxiv.org/abs/2410.13341), we show that LLM judges weaker than the models they evaluate are of limited use for benchmarking, even when their judgments are processed in a statistically optimal way. Consequently, we cannot rely on LLM judges to evaluate frontier models. (A toy sketch of the setup follows below.)
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
arxiv.org
December 5, 2025 at 8:57 AM
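To make the post above more concrete, here is a minimal sketch of the kind of judge-based evaluation setup the paper studies: cheap judge verdicts on many prompts, corrected using human labels on a small subset. This is a simple difference estimator, not the paper's exact method, and all names and numbers are illustrative.

```python
import numpy as np

def debiased_accuracy(judge_all, judge_small, human_small):
    """Judge-based accuracy estimate on all prompts, corrected by the
    judge-vs-human gap measured on a small human-labeled subset."""
    bias_estimate = np.mean(judge_small) - np.mean(human_small)
    return np.mean(judge_all) - bias_estimate

# Toy data: a judge that agrees with human labels 80% of the time.
rng = np.random.default_rng(0)
human = rng.binomial(1, 0.6, size=5000)                     # ground-truth labels
judge = np.where(rng.random(5000) < 0.8, human, 1 - human)  # noisy judge verdicts
labeled = rng.choice(5000, size=200, replace=False)         # small labeled subset
print(debiased_accuracy(judge, judge[labeled], human[labeled]))
```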
In the first paper (arxiv.org/abs/2507.12399), we characterize how LLM judge errors affect test-time scaling via Best-of-N, based on the verifier's ROC curve. Our results point towards more efficient alternatives to Best-of-N and explain why scaling laws for test-time scaling are unreliable. (A toy simulation of the setting follows below.)
ROC-n-reroll: How verifier imperfection affects test-time scaling
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sam...
arxiv.org
December 5, 2025 at 8:57 AM
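For readers unfamiliar with the setting: Best-of-N samples N answers and returns one the verifier approves of. Treating the verifier as a binary classifier at a fixed ROC operating point (TPR/FPR) already shows how its imperfection caps the achievable accuracy. The simulation below is only a toy illustration of that point, not the paper's actual analysis.

```python
import numpy as np

def best_of_n_accuracy(p, tpr, fpr, n, trials=200_000, seed=0):
    """Toy Best-of-N: draw n answers (each correct w.p. p), score each with a
    binary verifier at ROC operating point (tpr, fpr), and return the fraction
    of trials where the chosen answer is correct. Picks the first approved
    answer, falling back to answer 0 if none is approved."""
    rng = np.random.default_rng(seed)
    correct = rng.random((trials, n)) < p
    approved = rng.random((trials, n)) < np.where(correct, tpr, fpr)
    pick = np.argmax(approved, axis=1)   # index of first approved answer (0 if none)
    return correct[np.arange(trials), pick].mean()

# Accuracy saturates well below 1 as n grows, limited by the verifier's ROC point.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_accuracy(p=0.3, tpr=0.8, fpr=0.2, n=n), 3))
```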
Also, from time to time, the incorrect proofs it suggests for more complicated statements seem to contain non-trivial insights and are "fixable".
October 25, 2025 at 3:41 PM
Not much of a step up compared to the o1/o3 "thinking" versions of GPT-4. But quite a big step compared to base GPT-4. It still makes a lot of mistakes, but often produces correct proofs for simple lemmata (not so much for more complicated stuff).
October 25, 2025 at 3:38 PM
Assuming all problems are actually solvable...
October 17, 2025 at 9:58 PM
Is that not trivially true, since LLMs assign nonzero probability to any possible string?
October 17, 2025 at 9:58 PM
Do you have a list of the best ones? I vaguely recall reading things in this direction, but cannot really remember specific titles.
September 21, 2025 at 8:11 PM
The focus on evaluating checkpoints during a training run rather than different trained models is super interesting!
September 17, 2025 at 5:16 AM
Interesting work! Can you comment a bit on what you do differently compared to previous IRT-based LLM evaluation methods?

We recently did some work confirming IRT's efficacy for in-distribution models, but also found it to be quite brittle when it comes to novel models: arxiv.org/abs/2506.07673 (A rough sketch of the benchmark-prediction setup follows below.)
How Benchmark Prediction from Fewer Data Misses the Mark
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM ev...
arxiv.org
September 17, 2025 at 5:11 AM
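For context, "benchmark prediction" here means estimating a model's full-benchmark score from its results on a small subset of items, using a predictor fit on previously evaluated models. The sketch below uses plain least squares on randomly chosen anchor items and fully synthetic data; actual methods (including the IRT-based ones discussed above) are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correctness matrix: rows = previously evaluated models, cols = items (0/1).
skill = rng.uniform(0.2, 0.9, size=(50, 1))
old_models = rng.binomial(1, skill, size=(50, 1000))

# Fit a linear predictor from a small "anchor" item subset to the full-benchmark score.
anchor = rng.choice(1000, size=20, replace=False)
X = np.column_stack([old_models[:, anchor], np.ones(50)])    # anchor results + intercept
coef, *_ = np.linalg.lstsq(X, old_models.mean(axis=1), rcond=None)

# Evaluate a new model on the anchor items only and predict its full score.
new_model = rng.binomial(1, 0.7, size=1000)
predicted = np.append(new_model[anchor], 1.0) @ coef
print(round(predicted, 3), new_model.mean())                 # predicted vs. true score
```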
I guess, in terms of the notation from Section 4 of the paper: does this plot the Type X risk, or the Type X Error Feasibility rate?
September 14, 2025 at 2:52 PM
, at least for large n. So I am trying to understand whether the asymptotics kick in a lot slower than I would have thought, or whether I am missing something else about the setup.
September 14, 2025 at 2:44 PM
Thank you! Do I understand correctly that these results are independent of/orthogonal to the success hacking ones? I guess my confusion stems from asymptotic theory for PPI (and by extension seemingly for DSL) suggesting that both type 1 and type 2 errors should be lower/at most very similar
September 14, 2025 at 2:44 PM
Are the reported errors for the case of selecting the model with the most significant results, post-hoc?
September 12, 2025 at 7:18 PM
Interesting work! Can you comment a bit more on the setup for the regression correction methods? As far as I understand, PPI++ (which should be quite similar to DSL) relatively reliably reduces variance compared to using ground-truth labels only, while remaining quite close to unbiased. (Rough sketch of the estimator below.)
September 12, 2025 at 7:18 PM
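For reference, a rough sketch (from memory, so check the PPI++ paper for the exact tuning) of the kind of regression-corrected estimator discussed above: the prediction mean on a large unlabeled set, plus a correction from the labeled set, with a coefficient chosen to shrink variance.

```python
import numpy as np

def ppi_mean(y_labeled, f_labeled, f_unlabeled):
    """PPI-style mean estimate: lambda * mean(predictions on unlabeled data)
    plus the mean labeled residual y - lambda * f, with lambda set to the
    regression coefficient of y on f (approximately the power-tuned choice)."""
    lam = np.cov(y_labeled, f_labeled)[0, 1] / np.var(f_labeled, ddof=1)
    return lam * np.mean(f_unlabeled) + np.mean(y_labeled - lam * f_labeled)

rng = np.random.default_rng(0)
truth = rng.normal(0.0, 1.0, size=10_000)                  # outcomes of interest
preds = 0.8 * truth + rng.normal(0.0, 0.5, size=10_000)    # imperfect model predictions
labeled = rng.choice(10_000, size=300, replace=False)      # small ground-truth subset
print(ppi_mean(truth[labeled], preds[labeled], preds), truth[labeled].mean())
```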
Super interesting field, but worth keeping in mind that this usually only buys you a relatively small fraction of "extra ground truth labels" (this does not cover active sampling strategies, but I have not seen them yield much larger improvements in practice, either): arxiv.org/abs/2410.13341 (Back-of-the-envelope numbers below.)
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
arxiv.org
July 23, 2025 at 1:28 PM
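Back-of-the-envelope numbers behind "a relatively small fraction": in a control-variate/PPI-style analysis (my rough framing, not the paper's exact bound), judge scores with correlation rho to the ground truth reduce variance by roughly a factor 1/(1 - rho^2), so moderately correlated judges translate into only a modest number of effective extra labels.

```python
# Effective-sample-size multiplier 1/(1 - rho^2) for a control-variate/PPI-style
# estimator, as a function of the judge-vs-ground-truth correlation rho.
for rho in (0.3, 0.5, 0.7):
    mult = 1.0 / (1.0 - rho**2)
    print(f"rho={rho:.1f}: {mult:.2f}x labels ({100 * (mult - 1):.0f}% extra)")
```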
Do you have a source re: attendance requirement? 👀
July 17, 2025 at 5:28 PM
Not sure this can ethically be done retroactively (due to participant consent). But given that 20% of the data is shared with model providers, privacy concerns with instead sharing this data publicly in the future seem surmountable.
May 10, 2025 at 8:59 AM