Aaron Roth
@aaroth.bsky.social
Professor at Penn, Amazon Scholar at AWS. Interested in machine learning, uncertainty quantification, game theory, privacy, fairness, and most of the intersections therein
Not sure of the details, but I believe it's related to the experiment that STOC ran giving feedback with a version of Gemini Deep Think, which got generally positive reviews for critiquing math: research.google/blog/gemini-...
Gemini provides automated feedback for theoretical computer scientists at STOC 2026
research.google
January 9, 2026 at 3:01 PM
What's wrong with providing access to a fancy LLM to give feedback to authors about their own papers?
January 9, 2026 at 2:35 PM
But we ended up showing that this is impossible in general. The results in the paper also lay out a slightly more nuanced landscape, and there remain some interesting open questions about the power of reductions from multicalibration to marginal calibration. Take a look!
January 9, 2026 at 1:21 PM
This was a fun project in part because I didn't know what the right answer was. I started out believing that there should be a rate-preserving reduction from multicalibration to marginal calibration, lifting the (unknown) minimax calibration rates to multicalibration.
January 9, 2026 at 1:21 PM
Informally, it's because you can define instances and groups/subsequences that punish the learner for any deviation from honest forecasting. The honest strategy works broadly; anything that deviates from it necessarily "overfits" to the weak marginal calibration metric.
January 9, 2026 at 1:21 PM
It could have been that the minimax rates for the two problems were identical, up to a roughly logarithmic term in the number of subsequences, which is what the upper bounds pay. What we show is that they are fundamentally different --- you can't beat the "honest" T^{2/3} rate.
January 9, 2026 at 1:21 PM
What about for multicalibration? The same kinds of techniques that get T^{2/3} rates for calibration also work for multicalibration --- Blackwell Approachability, multiobjective optimization, etc. Morally this is because the "honest" strategy also gets multicalibration.
January 9, 2026 at 1:21 PM
It is much less clear that there are strategies that let you do this profitably against a worst-case adversary --- but that's exactly what Dagan et al. showed recently to establish O(T^{2/3 - eps}) rates for marginal calibration arxiv.org/abs/2406.13668 --- that was super surprising.
Breaking the $T^{2/3}$ Barrier for Sequential Calibration
A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was mad...
arxiv.org
January 9, 2026 at 1:21 PM
Thinking about truthful forecasting is what gets you T^{2/3} rates. But maybe you could do better, by cleverly strategizing to arrange for cancellations of the random noise with intentional bias that you inject. It's easy to see that you can do this on particular sequences.
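A toy instance of that (my own illustration, not an example from the paper): the true rain probability is 0.7, but on a particular short sequence it happens to rain a bit more often, and a deliberately biased forecast beats the honest one on the marginal calibration metric.

```python
# Toy instance (illustrative, not from the paper): the true rain probability
# is 0.7, but over these 10 particular days it happens to rain 8 times.
# Injecting a little intentional bias cancels the sampling noise here.
outcomes = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]       # 8 rainy days out of 10

def bucket_error(p, outcomes):
    # every forecast equals p, so there is a single bucket: |#rainy - p * #days|
    return abs(sum(outcomes) - p * len(outcomes))

print(bucket_error(0.7, outcomes))   # honest forecast: error 1.0
print(bucket_error(0.8, outcomes))   # biased forecast: error 0.0 on this sequence
```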
January 9, 2026 at 1:21 PM
Suppose you knew the chance of rain. One strategy is just forecast it truthfully. The bias of your predictions would be 0, but there would be noise: sometimes when you predict a 70% chance of rain it doesn't rain. The noise is higher the less frequently you make a prediction.
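Here's a quick toy simulation of that point (my own sketch, not from the paper): forecasting a known probability p honestly has zero bias, but the empirical rain frequency among the days that get forecast p fluctuates on the order of 1/sqrt(n).

```python
import random

# Toy simulation (illustrative, not from the paper): forecast a known rain
# probability p honestly.  Bias is zero, but among the n days carrying that
# forecast the empirical rain frequency deviates from p by roughly 1/sqrt(n),
# so the noise is larger when the forecast is used on fewer days.
random.seed(0)
p, trials = 0.7, 1000
for n in (10, 100, 1000):
    avg_dev = sum(
        abs(sum(random.random() < p for _ in range(n)) / n - p)
        for _ in range(trials)
    ) / trials
    print(f"n={n:5d}   avg |empirical freq - p| = {avg_dev:.3f}   1/sqrt(n) = {1 / n ** 0.5:.3f}")
```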
January 9, 2026 at 1:21 PM
Calibration asks that forecasts behave like probabilities marginally over a sequence. Amongst all the days I predict a 70% chance of rain, it should rain 70% of the time, etc. Multicalibration asks for the same guarantee simultaneously on many pre-defined subsequences.
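In code, roughly (a minimal sketch with my own names and an unnormalized bucket error, not the exact definitions from the paper): calibration error aggregates over forecast values, and multicalibration error is the worst such error over the pre-defined subsequences.

```python
from collections import defaultdict

def calibration_error(forecasts, outcomes):
    """Sum over forecast values p of |#rainy days - p * (#days with forecast p)|."""
    buckets = defaultdict(lambda: [0, 0])        # forecast value -> [days, rainy days]
    for p, y in zip(forecasts, outcomes):
        buckets[p][0] += 1
        buckets[p][1] += y
    return sum(abs(rain - p * n) for p, (n, rain) in buckets.items())

def multicalibration_error(forecasts, outcomes, groups):
    """Worst calibration error over a family of pre-defined subsequences.
    `groups` maps a group name to a boolean indicator per timestep."""
    errs = {
        name: calibration_error(
            [p for p, m in zip(forecasts, mask) if m],
            [y for y, m in zip(outcomes, mask) if m],
        )
        for name, mask in groups.items()
    }
    return max(errs.values()), errs
```

Marginal calibration is just the special case where the only group is the whole sequence.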
January 9, 2026 at 1:21 PM
The paper is here: arxiv.org/abs/2601.05245 It's joint work with @ncollina.bsky.social, Jiuyao Lu, and George Noarov. Natalie and George are on the job market --- check them out. www.seas.upenn.edu/~ncollina/ noarov.com
January 9, 2026 at 1:21 PM
AI-assisted papers are very good at form. They are written in the voice of an experienced researcher, and so evade our old heuristics. We need to learn a new set of red flags. These include citation errors, vague gesturing to standard results, and other things that we will learn from experience.
December 30, 2025 at 3:01 PM
STOC ran an experiment in which authors were able to use a Gemini model to check papers for mathematical errors before submission. It got positive feedback: research.google/blog/gemini-... - it is quite good at catching mathematical errors. Obv not a replacement for peer review but a useful tool.
Gemini provides automated feedback for theoretical computer scientists at STOC 2026
research.google
December 29, 2025 at 12:20 AM
So, many things will change --- I'm convinced that AI will be transformative for mathematical research. I think the changes will go beyond the day-to-day, and will extend to how we train our students and how we disseminate our work. The future is exciting and uncertain.
December 21, 2025 at 7:01 PM
And we are already seeing that reducing the time and effort needed to produce "a paper" (not a -good- paper) is going to destabilize our existing institutions for peer review. We need to figure out how to manage researcher attention at scale and not be drowned in research slop.
December 21, 2025 at 7:01 PM
A world in which clever discoveries happen in data centers, and the role of the professional researcher is careful verification and due diligence is a world in which the job of researcher is much less fun. Many fewer people with choices would want this job, given the other costs.
December 21, 2025 at 7:01 PM
I also worry about removing joy. The academic bargain is that extremely talented researchers forgo high industry pay and location freedom in exchange for a really -fun- job --- solving research puzzles for a living. The joy of the work is an important part of the bargain.
December 21, 2025 at 7:01 PM
(Current) AI tools are much less useful without the right intuition about what can and cannot work. When they are able to lure you away from your expertise, they can easily tempt you to believe that fatally flawed constructions can be made to work. Formal verification would help.
December 21, 2025 at 7:01 PM
But AI tools are now very good at the "standard" calculations and proof techniques I struggled through. We will need to be more intentional about teaching these skills to young researchers, just as we teach arithmetic in school without the use of calculators.
December 21, 2025 at 7:01 PM