Aditi Mavalankar
@aditimavalankar.bsky.social
1K followers 120 following 14 posts
Research Scientist at DeepMind working on Gemini Thinking
Reposted by Aditi Mavalankar
schaul.bsky.social
Where do some of Reinforcement Learning's great thinkers stand today?

Find out! Keynotes of the RL Conference are online:
www.youtube.com/playlist?lis...

Wanting vs liking, Agent factories, Theoretical limit of LLMs, Pluralist value, RL teachers, Knowledge flywheels
(guess who talked about which!)
aditimavalankar.bsky.social
On my way to #ICML2025 to present our algorithm that scales strongly with inference compute in both performance and sample diversity! 🚀

Reach out if you’d like to chat more!
Reposted by Aditi Mavalankar
amoudgl.bsky.social
New side project!

assayer: A simple Python-RQ based tool to automatically monitor and evaluate ML model checkpoints offline during training.
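A minimal sketch of the kind of workflow such a tool enables, using the real python-rq Queue/enqueue API; assayer's actual interface may differ, and the Redis server, checkpoint layout, and evaluate_checkpoint function are assumptions for illustration.

```python
# Hypothetical sketch of an RQ-based checkpoint watcher; the real assayer
# interface may differ. Assumes a local Redis server and an importable
# evaluate_checkpoint(path) function defined in eval_tasks.py (hypothetical).
import time
from pathlib import Path

from redis import Redis
from rq import Queue

from eval_tasks import evaluate_checkpoint  # hypothetical offline eval function

queue = Queue("checkpoint-eval", connection=Redis())
seen = set()

def watch(checkpoint_dir: str, poll_seconds: int = 60) -> None:
    """Poll a directory and enqueue an offline eval job for each new checkpoint."""
    while True:
        for ckpt in sorted(Path(checkpoint_dir).glob("*.ckpt")):
            if ckpt not in seen:
                seen.add(ckpt)
                # Workers started with `rq worker checkpoint-eval` pick these up.
                queue.enqueue(evaluate_checkpoint, str(ckpt), job_timeout="2h")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch("checkpoints/")
```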
Reposted by Aditi Mavalankar
schaul.bsky.social
Ever thought of joining DeepMind's RL team? We're recruiting for a research engineering role in London:
job-boards.greenhouse.io/deepmind/job...
Please spread the word!
Research Engineer, Reinforcement Learning
London, UK
job-boards.greenhouse.io
Reposted by Aditi Mavalankar
schaul.bsky.social
When faced with a challenge (like debugging), it helps to think back to examples of how you've overcome challenges in the past. Same for LLMs!

The method we introduce in this paper is efficient because examples are chosen for their complementarity, leading to much steeper inference-time scaling! 🧪
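A rough sketch of what complementarity-driven selection can look like: greedily add the candidate pair with the largest marginal gain over problems not yet covered well. This is an illustrative facility-location-style greedy loop, not the paper's exact algorithm, and the precomputed `scores` table is an assumption.

```python
# Rough sketch of greedy, complementarity-driven example selection.
# Illustrative only, not the exact algorithm from the paper.
# scores[c][p] is assumed to be the (precomputed) quality of the repair the
# model produces for validation problem p when candidate pair c is in-context.
from typing import Dict, Hashable, List

def greedy_select(
    scores: Dict[Hashable, Dict[Hashable, float]],
    budget: int,
) -> List[Hashable]:
    """Pick pairs whose marginal coverage gain over the validation set is largest."""
    problems = {p for per_pair in scores.values() for p in per_pair}
    best_so_far = {p: 0.0 for p in problems}  # best score achieved per problem
    chosen: List[Hashable] = []
    for _ in range(budget):
        def marginal_gain(c):
            return sum(
                max(scores[c].get(p, 0.0) - best_so_far[p], 0.0) for p in problems
            )
        candidate = max((c for c in scores if c not in chosen),
                        key=marginal_gain, default=None)
        if candidate is None or marginal_gain(candidate) <= 0.0:
            break  # no remaining pair adds coverage, so stop early
        chosen.append(candidate)
        for p in problems:
            best_so_far[p] = max(best_so_far[p], scores[candidate].get(p, 0.0))
    return chosen
```

Because each step only rewards improvement over what the already-chosen pairs achieve, redundant examples get no credit, which is what makes the selected set complementary.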
aditimavalankar.bsky.social
This was a really fun collaboration with my brilliant collaborators Hassan Mansoor, Zita Marinho, Masha Samsikova, and @schaul.bsky.social!
aditimavalankar.bsky.social
In addition to this, AuPair has been shown to perform better than the baselines across CodeForces difficulty levels and to preserve coverage of the problem categories in the training data distribution (see the paper for more details).
aditimavalankar.bsky.social
4) for the more performant models, the responses produced have high diversity.
aditimavalankar.bsky.social
3) our approach exhibits strong scaling with inference-time compute, and even after 100+ LLM calls, we do not see plateauing in the scaling curve;
aditimavalankar.bsky.social
2) we observe strong generalisation across datasets and models, implying that the process of curating these examples can be performed once and the benefits in performance can be reaped multiple times;
aditimavalankar.bsky.social
Injecting different examples into the prompt has several benefits: 1) we see significant gains in performance compared to best-of-N and self-repair baselines on multiple model families: Gemini, Gemma, and GPT;
aditimavalankar.bsky.social
Fun fact: the title “AuPair” has multiple interpretations: at a higher level, it guides LLMs towards better behaviour with a predefined set of examples; it is also a combination of Au, the chemical symbol for gold, and pair, i.e. golden pairs!
aditimavalankar.bsky.social
For the coding domain, a golden example pair, or AuPair, contains the problem description, an incorrect guess, and a fix that improves the solution.
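A minimal sketch of what such a golden pair might look like as a data structure, and how it could be injected as a 1-shot in-context example; the field names and prompt template are illustrative assumptions, not the paper's exact format.

```python
# Sketch of a coding-domain golden pair and a 1-shot prompt built from it.
# Field names and the prompt wording are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class AuPair:
    problem: str          # problem description
    incorrect_guess: str  # a wrong attempt at a solution
    fix: str              # the improved / corrected solution

def build_prompt(example: AuPair, new_problem: str, new_guess: str) -> str:
    """Format one golden pair as a 1-shot example, then append the new task."""
    return (
        "Problem:\n" + example.problem + "\n"
        "Incorrect attempt:\n" + example.incorrect_guess + "\n"
        "Improved solution:\n" + example.fix + "\n\n"
        "Problem:\n" + new_problem + "\n"
        "Incorrect attempt:\n" + new_guess + "\n"
        "Improved solution:\n"
    )
```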
aditimavalankar.bsky.social
Our submodular approach yields a fixed ordered set of complementary and useful AuPairs. For a budget of N LLM calls, the model is given N different prompts to answer the same question, where each prompt contains a different golden example.
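A sketch of how inference under a budget of N LLM calls can work with that ordered set: the same question is asked N times, each prompt seeded with a different golden pair, and a scored candidate is kept. The `llm_call` and `score_solution` helpers, and the keep-the-best selection step, are assumptions for illustration rather than a released API.

```python
# Sketch of budgeted inference with a fixed ordered set of golden pairs.
# llm_call and score_solution are assumed helpers (e.g. score = fraction of
# unit tests passed); the best-candidate selection is an illustrative choice.
from typing import Callable, Sequence, Tuple

def repair_with_aupairs(
    aupairs: Sequence[str],             # fixed ordered set of formatted golden pairs
    problem: str,
    initial_guess: str,
    n_calls: int,
    llm_call: Callable[[str], str],     # prompt -> candidate fix
    score_solution: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the best candidate repair found within the call budget."""
    best_fix, best_score = initial_guess, score_solution(initial_guess)
    for golden in aupairs[:n_calls]:
        prompt = (
            golden + "\n\n"
            "Problem:\n" + problem + "\n"
            "Incorrect attempt:\n" + initial_guess + "\n"
            "Improved solution:\n"
        )
        candidate = llm_call(prompt)
        candidate_score = score_solution(candidate)
        if candidate_score > best_score:
            best_fix, best_score = candidate, candidate_score
    return best_fix, best_score
```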
aditimavalankar.bsky.social
The key idea underlying our approach is simple: we curate a fixed set of golden examples (AuPairs) that are provided as 1-shot in-context examples during inference. We show that using AuPairs significantly improves code repair performance and scales well with inference compute!
Reposted by Aditi Mavalankar
schaul.bsky.social
Are there limits to what you can learn in a closed system? Do we need human feedback in training? Is scale all we need? Should we play language games? What even is "recursive self-improvement"?

Thoughts about this and more here:
arxiv.org/abs/2411.16905
Boundless Socratic Learning with Language Games
An agent trained within a closed system can master any desired capability, as long as the following three conditions hold: (a) it receives sufficiently informative and aligned feedback, (b) its covera...
arxiv.org