Mathieu Blondel
@mblondel.bsky.social
250 followers 58 following 20 posts
Research scientist, Google DeepMind
Reposted by Mathieu Blondel
arnosolin.bsky.social
📣 Please share: We invite submissions to the 29th International Conference on Artificial Intelligence and Statistics (#AISTATS 2026) and welcome paper submissions at the intersection of AI, machine learning, statistics, and related areas. [1/3]
Reposted by Mathieu Blondel
pierrealquier.bsky.social
I'm not on TV yet, but I'm on YouTube 😊 talking about research, ML, how I prepare talks and the difference between Bayesian and frequentist statistics.

Many thanks to Charles Riou, who has already posted many video interviews with ML & stats researchers on his YouTube channel "ML New Papers"!! 🙏
Reposted by Mathieu Blondel
gokul.dev
1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:
mblondel.bsky.social
Am I the only one who feels this is awful? If someone wants to remain anonymous, people should respect that...
mblondel.bsky.social
Schrödinger's snack
motonobu-kanagawa.bsky.social
There is a trace of someone's sadness.
mblondel.bsky.social
Cool work! We recently found that Tsallis q=1.5 (alpha=1.5 in our notation) seems to work really well across several datasets for language modeling arxiv.org/abs/2501.18537 It would be great to find some theoretical justification for why 1.5 seems to be a sweet spot.
Loss Functions and Operators Generated by f-Divergences
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language...
arxiv.org
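For reference, the quantity in question under one common normalization (the one used in the entmax / Fenchel-Young losses literature; the paper's own notation may differ slightly), with alpha playing the role of Tsallis's q:

```latex
% Tsallis \alpha-entropy, one common convention; \alpha = q = 1.5 in the post above.
\[
H^{\mathrm{T}}_{\alpha}(p) \;=\; \frac{1}{\alpha(\alpha-1)} \sum_{j} \bigl(p_j - p_j^{\alpha}\bigr),
\qquad \alpha \neq 1,
\]
% which recovers the Shannon entropy -\sum_j p_j \log p_j in the limit \alpha \to 1
% and the Gini entropy \tfrac{1}{2}(1 - \sum_j p_j^2), sparsemax's regularizer, at \alpha = 2.
```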
Reposted by Mathieu Blondel
han-b.bsky.social
🧗‍♂️Why does GD converge beyond [step size] < 2/[smoothness]? We investigate loss functions and identify their *separation margin* as an important factor. Surprisingly, Rényi 2-entropy yields a super fast rate T=Ω(ε^{-1/3})!
arxiv.org/abs/2502.04889
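Background on where the classical 2/[smoothness] threshold comes from (a standard fact, not a claim from the paper): for an L-smooth function, one gradient step is guaranteed to decrease the loss exactly when the step size is below 2/L, which is the regime the paper goes beyond.

```latex
% Descent lemma for an L-smooth function f and a gradient step with step size \eta:
\[
f\bigl(x - \eta \nabla f(x)\bigr)
\;\le\;
f(x) \;-\; \eta\Bigl(1 - \tfrac{\eta L}{2}\Bigr)\,\|\nabla f(x)\|^2 ,
\]
% so the right-hand side guarantees strict decrease if and only if 0 < \eta < 2/L;
% the post is about what happens once this classical guarantee no longer applies.
```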
Reposted by Mathieu Blondel
ramealexandre.bsky.social
Modern post-training is essentially distillation then RL. While reward hacking is well-known and feared, could there be such a thing as teacher hacking? Our latest paper confirms it. Fortunately, we also show how to mitigate it! The secret: diversity and onlineness! arxiv.org/abs/2502.02671
mblondel.bsky.social
The reason is that the usual duality theory still works when we operate in the spaces of functions and probability measures, whereas it breaks down if we work in the space of network parameters. We need to apply duality first and then parameterize, not the other way around!
mblondel.bsky.social
The EBM paper below parameterizes dual variables as neural nets. This idea (which has been used in other contexts such as OT or GANs) is very powerful and may be *the* way duality can be useful for neural nets (or rather, neural nets can be useful for duality!).
mblondel.bsky.social
Really proud of these two companion papers by our team at GDM:

1) Joint Learning of Energy-based Models and their Partition Function
arxiv.org/abs/2501.18528

2) Loss Functions and Operators Generated by f-Divergences
arxiv.org/abs/2501.18537

A thread.
mblondel.bsky.social
Surprisingly, we found that we still obtain good performance even if we use the classical softargmax at inference time and our losses at train time. This means that we can keep the inference code the same and just change the training code, which is useful e.g. for open-weight LMs.
mblondel.bsky.social
We obtain good performance across several language modeling tasks with the alpha-divergence, for alpha=1.5.
mblondel.bsky.social
The table below summarizes the link between some entropies and f-divergences.
mblondel.bsky.social
2) We instantiate Fenchel-Young losses with f-divergence regularization. This generalizes the cross-entropy loss in two directions: i) by replacing the KL with f-divergences and ii) by allowing non-uniform prior class weights. Each loss is associated with an f-softargmax operator.
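Spelling this out in symbols (a sketch in the notation of the Fenchel-Young losses framework; the paper's notation may differ): the loss generated by a convex regularizer Ω on the probability simplex and its associated prediction operator are

```latex
% Fenchel-Young loss and its regularized prediction map (standard definitions):
\[
L_{\Omega}(\theta; y) \;=\; \Omega^{*}(\theta) + \Omega(y) - \langle \theta, y \rangle,
\qquad
\widehat{y}_{\Omega}(\theta) \;=\; \nabla \Omega^{*}(\theta)
\;=\; \arg\max_{p \,\in\, \Delta} \;\langle \theta, p \rangle - \Omega(p).
\]
% Taking \Omega(p) = D_f(p \,\|\, q), an f-divergence to a prior q, gives the losses in the
% post: the f generating the KL with a uniform q recovers cross-entropy and the usual
% softargmax, and other choices of f and q give the corresponding f-softargmax operators.
```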
mblondel.bsky.social
Our approach naturally generalizes to Fenchel-Young losses, allowing us to obtain the first tractable approach for optimizing the sparsemax loss in general combinatorial spaces.
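As a concrete reference point for the simplest (simplex) case, here is a minimal JAX sketch of sparsemax, the standard sort-and-threshold Euclidean projection onto the probability simplex; the post is about going beyond this case to general combinatorial spaces, and this snippet is illustrative, not code from the paper.

```python
# Minimal sparsemax sketch (the alpha = 2 member of the entmax family): Euclidean projection
# of a score vector onto the probability simplex, via the standard sort-based threshold.
import jax.numpy as jnp


def sparsemax(z):
    """Project scores z onto the probability simplex; the output is typically sparse."""
    z_sorted = jnp.sort(z)[::-1]                 # scores in decreasing order
    k = jnp.arange(1, z.shape[0] + 1)            # 1, 2, ..., n
    cumsum = jnp.cumsum(z_sorted)                # running sums of the sorted scores
    support = 1.0 + k * z_sorted > cumsum        # which sorted coordinates stay positive
    k_z = jnp.sum(support)                       # support size
    tau = (cumsum[k_z - 1] - 1.0) / k_z          # threshold making the output sum to 1
    return jnp.maximum(z - tau, 0.0)


scores = jnp.array([1.0, 0.8, 0.1, -1.0])
print(sparsemax(scores))   # [0.6, 0.4, 0.0, 0.0]: exact zeros, unlike softargmax
```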
mblondel.bsky.social
We propose a new joint formulation for learning the EBM and the log-partition, and a MCMC-free doubly stochastic optimization scheme with unbiased gradients.
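One classical construction that conveys the flavor of this (hedged: it is not necessarily the paper's exact objective): the log-partition can be replaced by a scalar variable c via a simple convexity bound, and the resulting objective has unbiased gradients from data minibatches plus Monte Carlo samples from a fixed proposal, hence "doubly stochastic" and MCMC-free.

```latex
% With Z(\theta) = \int e^{-E_\theta(x)}\,dx, applying \log u \le u - 1 to u = Z(\theta) e^{-c}:
\[
\log Z(\theta) \;\le\; c + e^{-c} Z(\theta) - 1
\quad \text{for all } c,
\qquad \text{with equality at } c = \log Z(\theta).
\]
% Writing Z(\theta) = \mathbb{E}_{x \sim q}\!\bigl[e^{-E_\theta(x)} / q(x)\bigr] for a proposal q,
% the negative log-likelihood is upper-bounded by
\[
\mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[E_\theta(x)\bigr]
\;+\; c \;+\; e^{-c}\,\mathbb{E}_{x \sim q}\!\Bigl[\tfrac{e^{-E_\theta(x)}}{q(x)}\Bigr] \;-\; 1,
\]
% which can be minimized jointly over (\theta, c) with unbiased stochastic gradients.
```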
mblondel.bsky.social
Pushing this idea a little bit further, we can parameterize the log-partition as a separate neural network. This allows us to evaluate the *learned* log-partition on new data points.
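A hedged JAX sketch of what a separate log-partition network could look like in a conditional model p_θ(y | x): a small MLP c_φ(x) that outputs an estimate of log Z_θ(x). The architecture and names here are purely illustrative assumptions, not the paper's.

```python
# Illustrative only (not the paper's architecture): a small network c_phi(x) that predicts
# the log-partition log Z_theta(x) of a conditional EBM, so the *learned* normalizer can be
# evaluated on new inputs without summing or integrating over y.
import jax
import jax.numpy as jnp


def init_logz_net(key, dim, hidden=32):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (dim, hidden)) / jnp.sqrt(dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, 1)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(1),
    }


def logz_net(phi, x):
    """Scalar estimate of log Z_theta(x) for a single input x."""
    h = jnp.tanh(x @ phi["w1"] + phi["b1"])
    return (h @ phi["w2"] + phi["b2"])[0]


key = jax.random.PRNGKey(0)
phi = init_logz_net(key, dim=8)
x_new = jax.random.normal(jax.random.PRNGKey(1), (8,))
print(logz_net(phi, x_new))   # learned log-partition evaluated at a new data point
```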
mblondel.bsky.social
By treating the log-partition not as a quantity to compute but as a variable to optimize, we no longer need it to be exact (in machine learning we never look for exact solutions to optimization problems!).
mblondel.bsky.social
1) EBMs are generally challenging to train due to the partition function (normalization constant). At first, learning the partition function seems weird O_o But the log-partition exactly coincides with the Lagrange multiplier (dual variable) associated with equality constraints.
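The Lagrange-multiplier claim can be seen with a short maximum-entropy calculation (a standard derivation, sketched here with the normalization constraint only; the exact constant depends on the formulation):

```latex
% Maximize \sum_x p(x)(-E_\theta(x)) + H(p) subject to \sum_x p(x) = 1, with multiplier \lambda:
\[
\mathcal{L}(p, \lambda)
= \sum_x p(x)\bigl(-E_\theta(x)\bigr) - \sum_x p(x)\log p(x)
+ \lambda\Bigl(1 - \sum_x p(x)\Bigr).
\]
% Stationarity in p(x) gives -E_\theta(x) - \log p(x) - 1 - \lambda = 0, i.e.
% p(x) = \exp(-E_\theta(x) - 1 - \lambda), and enforcing normalization yields
\[
\lambda = \log \sum_x e^{-E_\theta(x)} - 1 = \log Z(\theta) - 1,
\]
% so the multiplier attached to the normalization constraint is the log-partition,
% up to an additive constant that depends on the exact formulation.
```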
Reposted by Mathieu Blondel
gdalle.bsky.social
Sparser, better, faster, stronger
adrianhill.de
You think Jacobian and Hessian matrices are prohibitively expensive to compute on your problem? Our latest preprint with @gdalle.bsky.social might change your mind!
arxiv.org/abs/2501.17737
🧵1/8
Figure comparing automatic differentiation (AD) and automatic sparse differentiation (ASD).

(a) Given a function f, AD backends return a function computing vector-Jacobian products (VJPs). (b) Standard AD computes Jacobians row-by-row by evaluating VJPs with all standard basis vectors. (c) ASD reduces the number of VJP evaluations by first detecting a sparsity pattern of non-zero values, coloring orthogonal rows in the pattern and simultaneously evaluating VJPs of orthogonal rows. The concepts shown in this figure directly translate to forward-mode, which computes Jacobians column-by-column instead of row-by-row.
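A tiny JAX illustration of the idea in panel (c) (the preprint's own tooling is in Julia; this toy only shows why structural orthogonality cuts the number of products): for an elementwise function the Jacobian is diagonal, so a single forward-mode product with an all-ones seed recovers every nonzero entry, instead of n products with the n basis vectors.

```python
# Toy sketch of automatic *sparse* differentiation's core trick (illustrative, not the
# preprint's Julia tooling): a diagonal Jacobian has structurally orthogonal columns,
# so one JVP with an all-ones "colored" seed recovers all n nonzeros at once.
import jax
import jax.numpy as jnp


def f(x):
    return jnp.sin(x) ** 2          # elementwise, hence a diagonal Jacobian


x = jnp.arange(4.0)

# Dense approach: materialize the full 4x4 Jacobian (built from n separate products).
dense_jac = jax.jacfwd(f)(x)

# Sparse approach: a single JVP with the all-ones seed returns the diagonal directly.
_, diag = jax.jvp(f, (x,), (jnp.ones_like(x),))

print(jnp.allclose(jnp.diag(dense_jac), diag))   # True
```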
mblondel.bsky.social
Former French minister of Education and "philosopher" Luc Ferry, who said a few years ago that maths was useless, wrote a book on artificial intelligence 😂