Gaspard Lambrechts
@gsprd.be
2.1K followers 650 following 26 posts
PhD Student doing RL in POMDP at the University of Liège - Intern at McGill - gsprd.be
gsprd.be
4) Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures.

With Adrien Bolland and Damien Ernst, we propose a new intrinsic reward. Instead of encouraging the agent to visit states uniformly, we encourage it to visit *future* states uniformly, from every state.
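As a rough sketch of the kind of objective this gives (assumed notation and a temperature λ, not taken from the paper), the intrinsic reward adds the entropy of the discounted visitation measure of *future* states and actions from each state:

```latex
% Hedged sketch (assumed notation, not the paper's exact objective):
% d^\pi_s is the discounted visitation measure of future state-action
% pairs when starting from s and following \pi, and H is its entropy.
J(\pi) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t
  \Big( r(s_t, a_t) + \lambda\, \mathcal{H}\big(d^\pi_{s_t}\big) \Big) \right],
\qquad
\mathcal{H}\big(d^\pi_{s}\big)
  = - \mathbb{E}_{(s',a') \sim d^\pi_{s}}\!\left[ \log d^\pi_{s}(s', a') \right].
```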
gsprd.be
3) Behind the Myth of Exploration in Policy Gradients.

With Adrien Bolland and Damien Ernst, we decided to frame the exploration problem for policy-gradient methods from the optimization point of view.
gsprd.be
2) Informed Asymmetric Actor-Critic: Theoretical Insights and Open Questions.

With Daniel Ebi and Damien Ernst, we looked for a reason why asymmetric actor-critic performs better, even when using RNN-based policies that take the full observation history as input (no aliasing).
gsprd.be
In AsymAC, while the policy maintains an agent state based on observations only, the critic also takes the state as input. Its better performance is linked to possible "aliasing" in the agent state, which hurts TD learning only in the symmetric case.

arxiv.org/abs/2501.19116
A Theoretical Justification for Asymmetric Actor-Critic Algorithms
gsprd.be
1) A Theoretical Justification for Asymmetric Actor-Critic Algorithms.

With Damien Ernst and Aditya Mahajan, we looked for a reason why asymmetric actor-critic algorithms perform better than their symmetric counterparts.
gsprd.be
At #EWRL, we presented 4 papers, which we summarize below.

- A Theoretical Justification for AsymAC Algorithms.
- Informed AsymAC: Theoretical Insights and Open Questions.
- Behind the Myth of Exploration in Policy Gradients.
- Off-Policy MaxEntRL with Future State-Action Visitation Measures.
Reposted by Gaspard Lambrechts
claireve.bsky.social
Such an inspiring talk by @arkrause.bsky.social at #ICML today. The role of efficient exploration in scientific discovery is fundamental, and I really like how Andreas connects the dots with RL (theory).
gsprd.be
At #ICML2025, we will present a theoretical justification for the benefits of « asymmetric actor-critic » algorithms (#W1008 Wednesday at 11am).

📝 Paper: hdl.handle.net/2268/326874
💻 Blog: damien-ernst.be/2025/06/10/a...
ICML poster of the paper « A Theoretical Justification for Asymmetric Actor-Critic Algorithms » by Gaspard Lambrechts, Damien Ernst and Aditya Mahajan.
Reposted by Gaspard Lambrechts
ricczamboni.bsky.social
🌟🌟Good news for the explorers🗺️!
Next week we will present our paper “Enhancing Diversity in Parallel Agents: A Maximum Exploration Story” with V. De Paola, @mircomutti.bsky.social and M. Restelli at @icmlconf.bsky.social!
(1/N)
gsprd.be
Last week, I gave an invited talk on "asymmetric reinforcement learning" at the BeNeRL workshop. I was happy to draw attention to this niche topic, which I think can be useful to any reinforcement learning researcher.

Slides: hdl.handle.net/2268/333931.
gsprd.be
Two months after my PhD defense on RL in POMDPs, I finally uploaded the final version of my thesis :)

You can find it here: hdl.handle.net/2268/328700 (manuscript and slides).

Many thanks to my advisors and to the jury members.
Cover page of the PhD thesis "Reinforcement Learning in Partially Observable Markov Decision Processes: Learning to Remember the Past by Learning to Predict the Future" by Gaspard Lambrechts
gsprd.be
While this work considers a fixed feature z = f(h) with linear approximators, we discuss possible generalizations in the conclusion.

Despite not matching the usual recurrent actor-critic setting, this analysis still provides insights into the effectiveness of asymmetric actor-critic algorithms.
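For concreteness, "linear approximators" can be read as critics that are linear in fixed feature maps (a sketch with assumed notation, not necessarily the paper's exact parameterization):

```latex
% Hedged sketch (assumed notation): critics linear in fixed features.
Q_\theta(s, z, a) = \theta^\top \phi(s, z, a) \quad \text{(asymmetric)},
\qquad
Q_\vartheta(z, a) = \vartheta^\top \psi(z, a) \quad \text{(symmetric)}.
```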
gsprd.be
The conclusion is that asymmetric learning is less sensitive to aliasing than symmetric learning.

Now, what is aliasing exactly?

The aliasing and inference terms arise from z = f(h) not being Markovian. They can be bounded by the difference between the approximate belief p(s|z) and the exact belief p(s|h).
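One way to picture this (a sketch with assumed notation, not the paper's exact statement) is a belief-mismatch quantity that vanishes when z = f(h) is a sufficient statistic of the history for the state:

```latex
% Hedged sketch (assumed notation, not the exact bound): both the
% aliasing and inference terms can be controlled by a quantity like
\varepsilon_{\mathrm{alias}}
  = \mathbb{E}_{h}\!\left[ \big\| p(\cdot \mid f(h)) - p(\cdot \mid h) \big\|_{\mathrm{TV}} \right],
% which is zero when p(s | z) = p(s | h) for all reachable histories h.
```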
gsprd.be
Now, as far as the actor suboptimality is concerned, we obtained the following finite-time bounds.

In addition to the average critic error, which carries over from the critic bound into the actor bound, the symmetric actor-critic algorithm suffers from an additional "inference term".
Theorem showing the finite-time suboptimality bound for the asymmetric and symmetric actor-critic algorithms. The asymmetric algorithm has four terms: the natural actor-critic term, the gradient estimation term, the residual gradient term, and the average critic error. The symmetric algorithm has an additional term: the inference term.
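Schematically, the structure of these bounds looks as follows (placeholder symbols for the terms named in the theorem, not the paper's exact constants or rates):

```latex
% Hedged sketch of the bound structure (placeholder terms only).
\text{asymmetric:}\quad
  J(\pi^\ast) - J(\pi_T) \;\lesssim\;
  \epsilon_{\mathrm{NAC}}(T) + \epsilon_{\mathrm{grad}}
  + \epsilon_{\mathrm{res}} + \bar{\epsilon}_{\mathrm{critic}},
\qquad
\text{symmetric:}\quad
  J(\pi^\ast) - J(\pi_T) \;\lesssim\;
  \epsilon_{\mathrm{NAC}}(T) + \epsilon_{\mathrm{grad}}
  + \epsilon_{\mathrm{res}} + \bar{\epsilon}_{\mathrm{critic}}
  + \epsilon_{\mathrm{inference}}.
```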
gsprd.be
By adapting the finite-time bound from the symmetric setting to the asymmetric setting, we obtain the following error bounds for the critic estimates.

The symmetric temporal difference learning algorithm has an additional "aliasing term".
Theorem showing the finite-time error bound for the asymmetric and symmetric temporal difference learning algorithms. The asymmetric algorithm has three terms: the temporal difference learning term, the function approximation term, and the bootstrapping shift term. The symmetric algorithm has an additional term: the aliasing term.
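Schematically (again with placeholder symbols, not the exact statement), the symmetric TD bound is the asymmetric one plus an aliasing term:

```latex
% Hedged sketch of the bound structure (placeholder terms only).
\text{asymmetric TD:}\quad
  \epsilon_{\mathrm{critic}} \;\lesssim\;
  \epsilon_{\mathrm{TD}}(T) + \epsilon_{\mathrm{approx}} + \epsilon_{\mathrm{shift}},
\qquad
\text{symmetric TD:}\quad
  \epsilon_{\mathrm{critic}} \;\lesssim\;
  \epsilon_{\mathrm{TD}}(T) + \epsilon_{\mathrm{approx}} + \epsilon_{\mathrm{shift}}
  + \epsilon_{\mathrm{alias}}.
```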
gsprd.be
While this algorithm is valid, in the sense of giving unbiased policy gradients (Baisero & Amato, 2022), a theoretical justification for its benefits is still missing.

Does it really learn faster than symmetric learning?

In this paper, we provide theoretical evidence for this, based on an adapted finite-time analysis (Cayci et al., 2024).
Title page of the paper "A Theoretical Justification for Asymmetric Actor-Critic Algorithms", written by Gaspard Lambrechts, Damien Ernst and Aditya Mahajan.
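The validity result referenced here can be sketched as follows (assumed notation; see Baisero & Amato, 2022 for the precise statement): the history Q-function is the belief-average of the asymmetric Q-function, so plugging the asymmetric critic into the policy gradient leaves it unbiased.

```latex
% Hedged sketch of the unbiasedness argument (assumed notation).
Q^\pi(h, a) = \mathbb{E}_{s \sim p(\cdot \mid h)}\!\left[ Q^\pi(s, h, a) \right]
\;\;\Longrightarrow\;\;
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid z)\, Q^\pi(h, a) \right]
  = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid z)\, Q^\pi(s, h, a) \right].
```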
gsprd.be
However, with actor-critic algorithms, note that the critic is not needed at execution!

As a result, the state can be an input of the critic, which becomes Q(s, z, a) in the asymmetric setting instead of Q(z, a) in the symmetric setting.
Figure showing the policy being passed the feature z = f(h) of the history, and the critic being passed both the feature z = f(h) of the history and the state s as input. The asymmetric critic and the policy are used together to form the sample policy-gradient expression.
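To make the input signatures concrete, here is a minimal PyTorch-style sketch (hypothetical module names and sizes, not the paper's implementation): the symmetric critic sees only the history feature z, while the asymmetric critic additionally sees the state s during training.

```python
# Minimal sketch (hypothetical names and sizes, not the paper's implementation).
import torch
import torch.nn as nn

class SymmetricCritic(nn.Module):
    """Q(z, a): only the history feature z is available."""
    def __init__(self, z_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

class AsymmetricCritic(nn.Module):
    """Q(s, z, a): the privileged state s is also used, but only during training."""
    def __init__(self, s_dim, z_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + z_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, s, z, a):
        return self.net(torch.cat([s, z, a], dim=-1))
```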
gsprd.be
In a POMDP, the goal is to find an optimal policy π(a|z) that maps a feature z = f(h) of the history h to an action a.

In a privileged POMDP, the state can be used to learn a policy π(a|z) faster.

But note that the state cannot be an input of the policy, since it is not available at execution.
Figure showing the history h being compressed into a feature z = f(h) for then being passed to a policy g(a | z).
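As a minimal sketch of this setup (hypothetical names and sizes), z can be the hidden state of a recurrent encoder over the observation history, with the policy mapping z to a distribution over actions; the state never enters the policy.

```python
# Minimal sketch (hypothetical names and sizes): z = f(h) via a recurrent
# encoder over observations, and a policy pi(a | z) on top of it.
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Compresses the observation history h into a feature z = f(h)."""
    def __init__(self, obs_dim, z_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, z_dim, batch_first=True)

    def forward(self, obs_history):      # (batch, time, obs_dim)
        _, z = self.rnn(obs_history)     # final hidden state
        return z.squeeze(0)              # (batch, z_dim)

class Policy(nn.Module):
    """pi(a | z): only history features, never the state, at execution."""
    def __init__(self, z_dim, n_actions):
        super().__init__()
        self.logits = nn.Linear(z_dim, n_actions)

    def forward(self, z):
        return torch.distributions.Categorical(logits=self.logits(z))
```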
gsprd.be
Typically, classical RL methods assume:
- MDP: full state observability (too optimistic),
- POMDP: partial state observability (too pessimistic).

Instead, asymmetric RL methods assume:
- Privileged POMDP: asymmetric state observability (full at training, partial at execution).
Table showing that the MDP assumes full state observability both during training and execution and that the POMDP assumes partial state observability both during training and execution, while the privileged POMDP assumes full state observability during training but partial state observability during execution.
gsprd.be
📝 Our paper "A Theoretical Justification for Asymmetric Actor-Critic Algorithms" was accepted at #ICML!

Never heard of "asymmetric actor-critic" algorithms? Yet, many successful #RL applications use them (see image).

But these algorithms are not fully understood. Below, we provide some insights.
Slide showing three recent successes of reinforcement learning that have used an asymmetric actor-critic algorithm:
 - Magnetic Control of Tokamak Plasma through Deep RL (Degrave et al., 2022).
 - Champion-Level Drone Racing using Deep RL (Kaufmann et al., 2023).
 - A Super-Human Vision-Based RL Agent in Gran Turismo (Vasco et al., 2024).