ML Research Template (https://github.com/CLAIRE-Labo/python-ml-research-template)
Apertus, trained on 15T tokens in 1,000+ languages, is built for transparency, responsibility & the public good.
Read more: actu.epfl.ch/news/apertus...
📰 Paper: arxiv.org/abs/2507.08068
Hidden gems and open questions in the 30+ page appendix💎
🧑💻 Code: github.com/CLAIRE-Labo/...
🌐 Blog: claire-labo.github.io/quantile-rewar
We show the equivalence of a family of transformations, allowing us to qualitatively interpret the quantile-reward optimal policy as a Best-of-N policy 🎯
Empirically, each transformation brings different dynamics, and it's an open question to compare all of them! 🕵️
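A rough sketch of the intuition (my notation, not the paper's): sampling N completions from π_ref and keeping the highest-reward one gives the Best-of-N density, which, like the quantile-reward optimum, tilts π_ref by an increasing function of the reward quantile q(y) alone.

```latex
% Best-of-N density for a continuous reward (standard order-statistics fact):
p_{\mathrm{BoN}}(y \mid x) \;=\; N \,\pi_{\mathrm{ref}}(y \mid x)\, q(y)^{N-1},
\qquad q(y) = F_{\mathrm{ref}}\!\big(r(x, y)\big).

% KL-regularized optimum for the quantile reward:
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(q(y)/\beta\big).

% Both reweight pi_ref by an increasing function of the quantile only,
% which is the sense in which the quantile-reward optimum behaves like Best-of-N.
```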
We derive a framework around QRPO for using transformations on top of the quantile reward.
Each transformation reshapes the reward distribution and changes the properties of the optimal policy while keeping the partition function tractable.
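A minimal sketch of why the partition function stays tractable (my notation, assuming a continuous reward so the quantile q(y) is Uniform(0,1) under π_ref, with φ a transformation applied on top of the quantile reward):

```latex
% Transformed reward: r_\varphi(x, y) = \varphi(q(y)), with q(y) \sim \mathrm{Uniform}(0, 1) under \pi_{\mathrm{ref}}.
Z_\varphi(x)
  \;=\; \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ e^{\varphi(q(y))/\beta} \right]
  \;=\; \int_{0}^{1} e^{\varphi(u)/\beta}\, du .

% A one-dimensional integral that no longer depends on the unknown reward distribution,
% so it can be evaluated in closed form or numerically for many choices of \varphi.
```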
This is simply because the target policy is much further away from the training support 🎯
Our understanding of the KL-regularized closed-form solution gives insights into the "DPO chosen probabilities decreasing" problem! 🤔
But when compressed into preference pairs for DPO and SimPO, it leads to the typical length-bias trend, despite the reduction in mean length.
* QRPO does not need many reference rewards to estimate quantiles: for high-quality offline datasets, 1-3 are enough (see the sketch after this list)!
* And you can scale this number for off-policy data generated from the reference model! 📈
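A minimal sketch of what such a quantile estimate could look like (the function name and exact estimator are my assumptions, not the released code): the quantile of a completion's reward is estimated from the rewards of a few reference-policy completions for the same prompt.

```python
import numpy as np

def estimate_quantile_reward(reward: float, ref_rewards: list[float]) -> float:
    """Estimate the quantile of `reward` under the reference policy's reward
    distribution for this prompt, using K rewards of reference completions.

    With K as small as 1-3 the estimate is coarse, but per the thread that is
    already enough for high-quality offline datasets, and K can be scaled up
    when off-policy data is generated from the reference model.
    """
    ref = np.asarray(ref_rewards, dtype=float)
    # Empirical CDF value: fraction of reference rewards below this reward.
    return float(np.mean(ref < reward))

# Example: one sample reward against two reference rewards.
print(estimate_quantile_reward(0.8, [0.3, 0.9]))  # -> 0.5
```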
🚀 The result: Quantile Reward Policy Optimization!
QRPO transforms rewards to quantile rewards for which we derive Z, and can then fit the closed-form optimal RL solution with a simple regression! 📉
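A minimal sketch of the resulting objective (a hypothetical qrpo_loss, not the released implementation): the KL-regularized optimum satisfies β·log(π*(y|x)/π_ref(y|x)) = r_q(x,y) − β·log Z(x), so the policy's scaled log-ratio can be regressed onto the shifted quantile reward; the closed-form Z below assumes the quantile reward is Uniform(0,1) under π_ref and a squared-error regression.

```python
import math
import torch

def qrpo_loss(logp_policy, logp_ref, quantile_reward, beta: float):
    """Regress the policy toward the closed-form KL-regularized optimum.

    Args:
        logp_policy: log pi_theta(y|x) summed over tokens, shape (batch,).
        logp_ref: log pi_ref(y|x) summed over tokens, shape (batch,).
        quantile_reward: quantile reward in [0, 1], shape (batch,).
        beta: KL-regularization strength.
    """
    # Partition function when the quantile reward is Uniform(0, 1) under pi_ref:
    # Z = E[exp(U / beta)] = beta * (exp(1 / beta) - 1).
    log_z = math.log(beta * (math.exp(1.0 / beta) - 1.0))
    # Target implied by beta * log(pi* / pi_ref) = r_q - beta * log Z.
    target = quantile_reward - beta * log_z
    pred = beta * (logp_policy - logp_ref)
    return ((pred - target) ** 2).mean()
```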
2️⃣ Knowing the reward distribution => knowing the MGF => knowing Z 🔐
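Sketch of this step (a standard identity, my notation): the partition function of the KL-regularized optimum is exactly the moment-generating function of the reward under π_ref evaluated at 1/β, and for the quantile reward that is the Uniform(0,1) MGF.

```latex
Z(x) \;=\; \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ e^{r(x,y)/\beta} \right]
      \;=\; M_{R \mid x}\!\left(\tfrac{1}{\beta}\right),
\qquad R = r(x, y),\ y \sim \pi_{\mathrm{ref}}(\cdot \mid x).

% For the quantile reward, R \sim \mathrm{Uniform}(0, 1), so
Z \;=\; \frac{e^{1/\beta} - 1}{1/\beta} \;=\; \beta\left(e^{1/\beta} - 1\right).
```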
This is the problem that limits DPO-like methods to pairwise data. We solve it thanks to 3 insights! 💡
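For context, a sketch of why pairwise data sidesteps the issue in DPO-style derivations: inverting the closed-form optimum expresses the reward through the policy plus β·log Z(x), and the Bradley-Terry likelihood only uses reward differences for the same prompt, so the intractable log Z(x) cancels; QRPO instead makes Z itself computable.

```latex
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

% The pairwise (Bradley-Terry) likelihood only involves differences for the same x:
r(x, y_w) - r(x, y_l)
  \;=\; \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  \;-\; \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
% so \log Z(x) cancels, hence the restriction to preference pairs.
```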
❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both!
🧵Here's how we do it:
She will be joining the University of Zurich as a professor this summer and will be hiring PhD students and postdocs. You should apply to her group!
Her website: koloskova.github.io
Make sure to check it out to learn why training with PPO for too long makes your agent collapse!
Jiaheng Hu of UTexas on Unsupervised Skill Discovery for HRL
@skandermoalla.bsky.social of EPFL: Representation and Trust in PPO
Adil Zouitine of IRT Saint Exupery/Hugging Face: Time-Constrained Robust MDPs
This will be the official account of the Eastern European Machine Learning (EEML) community.
Follow us for news regarding our summer schools, workshops, education/community initiatives, and more!
@caglarai.bsky.social
🧑💻 github.com/CLAIRE-Labo/...
📰 Paper: arxiv.org/abs/2405.00662
🧑💻 Code: github.com/CLAIRE-Labo/...
Wed 11 Dec 11 am - 2 pm PST
West Ballroom A-D #6403
@caglarai.bsky.social @andreamiele.bsky.social @razvan-pascanu.bsky.social