Martin Klissarov
@martinklissarov.bsky.social
research @ Google DeepMind
Pinned
martinklissarov.bsky.social
As AI agents face increasingly long and complex tasks, decomposing them into subtasks becomes increasingly appealing.

But how do we discover such temporal structure?

Hierarchical RL provides a natural formalism, yet many questions remain open.

Here's our overview of the field🧵
martinklissarov.bsky.social
This work was done over the course of many friendly virtual calls with Akhil Bagaria and @ray-luo.bsky.social, and under the thoughtful guidance of researchers who have spent decades working on these problems, namely George Konidaris, Doina Precup and @marloscmachado.bsky.social.
martinklissarov.bsky.social
We hope this work provides a good introduction to the field.

Finding temporal structure is challenging. As such, we carefully laid out some of the most pressing questions in the field.

We also identified domains that are particularly promising, e.g. open-ended systems.
martinklissarov.bsky.social
We often get bogged down by differences in formalisms (goal-conditioned RL, options, feudal RL, skills, ...) -- we unite these core ideas through a single perspective.

We believe hierarchical RL is fundamentally about the algorithm through which we discover temporal structure.
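
As a rough illustration (not taken from the paper), the options framework of Sutton, Precup and Singh is one standard way to formalize such temporal structure. Here is a minimal Python sketch; the environment API and names are assumptions for illustration only:

```python
from dataclasses import dataclass
from typing import Callable
import random

@dataclass
class Option:
    """An option (temporally extended action), in the sense of Sutton, Precup & Singh (1999)."""
    initiation: Callable[[object], bool]    # I(s): can this option start in state s?
    policy: Callable[[object], int]         # pi(s): primitive action to take while the option runs
    termination: Callable[[object], float]  # beta(s): probability of stopping in state s

def run_option(env, state, option, max_steps=100):
    """Execute an option until it terminates; return the resulting state and accumulated reward."""
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = option.policy(state)
        state, reward, done = env.step(action)  # hypothetical env API
        total_reward += reward
        steps += 1
        if done or random.random() < option.termination(state):
            break
    return state, total_reward
```

Discovering temporal structure then amounts to learning which options (initiation sets, policies, terminations) are worth having in the first place.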
martinklissarov.bsky.social
We cover methods that learn:

(1) directly from experience, (2) through offline datasets and (3) with foundation models (LLMs).

We present each method through the fundamental challenges of decision making, namely:

(a) exploration, (b) credit assignment and (c) transferability
martinklissarov.bsky.social
In this 80+ page manuscript, we cover the rich, diverse, decades-old literature studying temporal structure discovery in AI.

When and in what way should we expect these methods to benefit agents? What are the trade-offs involved?
martinklissarov.bsky.social
Humans constantly leverage temporal structure: we actuate muscles each millisecond, yet our plans can span days, months and even years.

Computers are built on this same principle.

How will AI agents discover and use such structure? What is "good" structure in the first place?
Reposted by Martin Klissarov
egrefen.bsky.social
Our team in London is hiring a research scientist! If you want to come work with a wonderful group of researchers on investigating the frontiers of autonomous open-ended agents that help humans be better at doing things we love, come have a look. Link in post below 👇
Reposted by Martin Klissarov
upiter.bsky.social
Our paper showing that LMs benefit from human-like abstractions for code synthesis was accepted to ICLR! 🇸🇬

We show that order matters in code generation: casting code synthesis as a sequential edit problem by preprocessing examples in SFT data improves LM test-time scaling laws.
martinklissarov.bsky.social
This work was done with the amazing Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, @pierrelucbacon.bsky.social, Doina Precup, with equal supervision by @marloscmachado.bsky.social and @proceduralia.bsky.social.
martinklissarov.bsky.social
Finally, we analyze the choice of the LLM used to write code policies. We notice a scaling behaviour wherein only the largest LLM, Llama 3.1 405B, was able to define successful policies on all tasks.

With the advent of thinking models, it would be interesting to further investigate this.
martinklissarov.bsky.social
An interesting discovery was that the learned skills naturally emerged in the form of a curriculum: easier skills are the first to maximize their skill reward, paving the way for more complex skills to be learned.

TL;DR: Hierarchy affords learnability.
martinklissarov.bsky.social
A few years back, AI researchers (Heinrich Kuttler, @egrefen.bsky.social and @rockt.ai to name a few) foresaw the importance of such an environment and created the NetHack Learning Environment, which allows for experimenting with RL agents.
martinklissarov.bsky.social
Evaluation on such complex tasks is only possible thanks to the work of dedicated fans of NetHack, who have been building and upgrading the game since 1987 (it is still an actively maintained repository). We show in this figure some of the complexities of NetHack.
martinklissarov.bsky.social
We highlight the complexity of some of these tasks, which on average take more than a thousand steps to complete. Even methods trained specifically for each task are not able to make any kind of progress.
martinklissarov.bsky.social
Once the skill policies are learned, MaestroMotif can adapt, zero-shot, to new instructions and solve complex tasks simply by recombining skills, similarly to motifs in a composition. In other words, it writes a different code policy over the skills that achieves a completely different task.
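
To make the "code policy over skills" idea concrete, here is a hypothetical sketch; the skill names, observation fields and function signature are invented for illustration and are not taken from the MaestroMotif codebase:

```python
def policy_over_skills(obs) -> str:
    """Hypothetical LLM-written selector: decide which learned skill to run next."""
    # Example instruction: "find the stairs on the current level, then descend".
    if obs["hp_ratio"] < 0.3:     # invented observation field
        return "regain_health"    # invented skill name
    if obs["stairs_visible"]:
        return "go_to_stairs"
    return "explore_level"
```

Adapting to a new instruction then amounts to the LLM rewriting this small selector, while the underlying skill policies stay fixed.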
martinklissarov.bsky.social
MaestroMotif is a scalable and effective algorithm for AI-assisted skill design. It starts by leveraging an agent designer's prior knowledge about a domain: the designer defines a set of useful skills, or agents, each described at a high level in natural language.
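
For instance, the designer-provided specification could be as lightweight as a few natural-language descriptions handed to the LLM; the skill names and wording below are purely illustrative, not the actual MaestroMotif prompt:

```python
# Hypothetical skill specification for a NetHack-like domain.
SKILLS = {
    "explore_level": "Move around the current dungeon level to reveal unseen rooms and corridors.",
    "go_to_stairs": "Navigate to the staircase once it has been discovered and descend.",
    "regain_health": "When hit points are low, retreat and recover before continuing.",
}
```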
martinklissarov.bsky.social
MaestroMotif builds on our previous work, Motif, which pioneered learning RL policies from AI feedback. At the time, it set a new state-of-the-art on the open-ended domain of NetHack. With MaestroMotif, we improve on this performance by two orders of magnitude. But, how are these gains obtained?
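
At a very high level, learning from AI feedback means eliciting LLM preferences between pairs of observation descriptions and fitting a reward model to them, which then drives standard RL. A toy Bradley-Terry-style sketch, with all details (linear model, feature vectors) assumed rather than taken from Motif's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy linear reward model over hypothetical feature vectors of observation captions.
w = np.zeros(8)

def bt_update(x_pref, x_rej, lr=0.1):
    """One SGD step on the Bradley-Terry loss -log sigmoid(r(x_pref) - r(x_rej))."""
    global w
    diff = x_pref - x_rej
    p = sigmoid(w @ diff)        # probability the model currently agrees with the LLM's preference
    w += lr * (1.0 - p) * diff   # gradient ascent on the log-likelihood of that preference
```

The learned reward is then used as an intrinsic reward when training the skill policies with RL.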
martinklissarov.bsky.social
Can AI agents adapt, zero-shot, to complex multi-step language instructions in open-ended environments?

We present MaestroMotif, a method for skill design that produces highly capable and steerable hierarchical agents.

Paper: arxiv.org/abs/2412.08542
Code: github.com/mklissa/maestromotif
Reposted by Martin Klissarov
jparkerholder.bsky.social
Introducing 🧞Genie 2 🧞 - our most capable large-scale foundation world model, which can generate a diverse array of consistent worlds, playable for up to a minute. We believe Genie 2 could unlock the next wave of capabilities for embodied agents 🧠.