Daniel Palenicek
@daniel-palenicek.bsky.social
22 followers 24 following 17 posts
PhD Researcher in Robot #ReinforcementLearning 🤖🧠 at IAS TU Darmstadt and hessian.AI, advised by Jan Peters. Former intern at BCAI and Huawei R&D UK.
daniel-palenicek.bsky.social
Read the full preprint here:
👉 arxiv.org/pdf/2509.25174
Code coming soon.
We’d love feedback & discussion! 💬
daniel-palenicek.bsky.social
Key takeaway:
Well-conditioned optimization > raw scale.

XQC shows that principled architecture choices can outperform larger, more complex models.
daniel-palenicek.bsky.social
📊 Results across 70 tasks (55 proprioception + 15 vision-based):

⚡️ Matches or outperforms SimbaV2, BRO, BRC, MRQ, and DrQ-v2

🌿 ~4.5× fewer parameters and ~1/10 the compute (FLOPs) of SimbaV2

💪Especially strong on the hardest tasks: HumanoidBench, DMC Hard & DMC Humanoids from pixels
daniel-palenicek.bsky.social
This leads to XQC, a streamlined extension of Soft Actor-Critic with
✅ only 4 hidden layers
✅ BatchNorm (BN) after each linear layer
✅ a WeightNorm (WN) projection
✅ a cross-entropy (CE) critic loss

Simplicity + principled design = efficiency ⚡️
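For concreteness, here is a minimal sketch of what a critic in this style could look like in PyTorch. The hidden width, the number of value bins, and the use of PyTorch's built-in weight_norm for the projection are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a critic in this style: 4 hidden layers, BN after each linear
# layer, and a weight-normalized projection onto value bins for a CE loss.
# Widths and bin count are illustrative assumptions.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=512, num_bins=101):
        super().__init__()
        layers, in_dim = [], obs_dim + act_dim
        for _ in range(4):                      # only 4 hidden layers
            layers += [nn.Linear(in_dim, hidden),
                       nn.BatchNorm1d(hidden),  # BN after each linear layer
                       nn.ReLU()]
            in_dim = hidden
        self.trunk = nn.Sequential(*layers)
        # weight-normalized output projection onto value bins
        self.head = nn.utils.weight_norm(nn.Linear(hidden, num_bins))

    def forward(self, obs, act):
        # bin logits, to be trained with a cross-entropy critic loss
        return self.head(self.trunk(torch.cat([obs, act], dim=-1)))
```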
daniel-palenicek.bsky.social
🔑 Insight: A simple synergy of BatchNorm + WeightNorm + a cross-entropy loss makes critics dramatically better conditioned.

➡️ Result: stable effective learning rates and smoother optimization.
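One common way to realize such a cross-entropy critic loss is to discretize the TD target over a fixed value grid and train against a two-hot encoding. The sketch below assumes that formulation; the bin range and count are made up for illustration, and XQC's exact loss may differ.

```python
# Sketch of a two-hot cross-entropy critic loss over a fixed value grid.
# Bin range, bin count, and the two-hot encoding are assumptions.
import torch
import torch.nn.functional as F

def two_hot(target, v_min=-10.0, v_max=10.0, num_bins=101):
    """Project scalar TD targets of shape (B,) onto two adjacent bins."""
    target = target.clamp(v_min, v_max)
    pos = (target - v_min) / (v_max - v_min) * (num_bins - 1)
    lower, upper = pos.floor().long(), pos.ceil().long()
    w_upper = pos - lower.float()
    probs = torch.zeros(*target.shape, num_bins, device=target.device)
    probs.scatter_(-1, lower.unsqueeze(-1), (1 - w_upper).unsqueeze(-1))
    probs.scatter_add_(-1, upper.unsqueeze(-1), w_upper.unsqueeze(-1))
    return probs

def ce_critic_loss(logits, td_target):
    # cross-entropy between predicted bin logits and the two-hot target
    return -(two_hot(td_target) * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```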
daniel-palenicek.bsky.social
Instead of "bigger is better," we ask:
Can better conditioning beat scaling?

By analyzing the Hessian eigenspectrum of critic networks, we uncover how different architectural choices shape optimization landscapes.
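The thread doesn't spell out the analysis pipeline, but a standard way to probe conditioning is to estimate the largest Hessian eigenvalue via power iteration on Hessian-vector products. A minimal sketch under that assumption (function name and iteration count are illustrative):

```python
# Sketch: estimate the top Hessian eigenvalue of a critic loss via power
# iteration on Hessian-vector products (a standard conditioning probe;
# not necessarily the paper's exact analysis).
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product: differentiate g·v w.r.t. the parameters
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig

# usage: top_hessian_eigenvalue(critic_loss, critic.parameters())
```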
daniel-palenicek.bsky.social
🚀 New preprint! Introducing XQC, a simple, well-conditioned actor-critic that achieves SOTA sample efficiency in #RL
✅ ~4.5× fewer parameters than SimbaV2
✅ Scales to vision-based RL
👉 arxiv.org/pdf/2509.25174

Thanks to Florian Vogt, @joemwatson.bsky.social, and @jan-peters.bsky.social
daniel-palenicek.bsky.social
If you're working on RL stability, plasticity, or sample efficiency, this might be relevant for you.

We'd love to hear your thoughts and feedback!

Come talk to us at RLDM in June in Dublin (rldm.org)
daniel-palenicek.bsky.social
📚 TL;DR: We combine BN + WN in CrossQ for stable high-UTD training and SOTA performance on challenging RL benchmarks. No need for network resets, no critic ensembles, no other tricks... Simple regularization, big gains.

Paper: t.co/Z6QrMxZaPY
https://arxiv.org/abs/2502.07523v2
daniel-palenicek.bsky.social
⚖️ Simpler ≠ Weaker: Compared to SOTA baselines like BRO, our method:
✅ Needs 90% fewer parameters (~600K vs. 5M)
✅ Avoids parameter resets
✅ Scales stably with compute.

We also compare favorably against the concurrent SIMBA algorithm.

No tricks—just principled normalization. ✨
daniel-palenicek.bsky.social
🔬 The Result: CrossQ + WN scales reliably with increasing UTD—no more resets, no critic ensembles, no other tricks.
We match or outperform SOTA on 25 continuous control tasks from the DeepMind Control Suite & MyoSuite, including dog 🐕 and humanoid 🧍‍♂️ tasks, across UTDs.
daniel-palenicek.bsky.social
➡️ With growing weight norm, the effective learning rate decreases, and learning slows down/stops.

💡 Solution: after each gradient update, we rescale the parameters back onto the unit sphere, preserving plasticity and keeping the effective learning rate stable.
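A minimal sketch of that projection step, assuming the rescaling is applied to the hidden linear layers' weight matrices after every optimizer step; which layers and which norm are used here are assumptions:

```python
# Sketch: re-project weight matrices onto the unit sphere after each
# optimizer step to keep the effective learning rate stable.
# Projecting all Linear weights with a Frobenius norm is an assumption.
import torch
import torch.nn as nn

@torch.no_grad()
def project_to_unit_sphere(module: nn.Module):
    for m in module.modules():
        if isinstance(m, nn.Linear):
            m.weight.div_(m.weight.norm() + 1e-8)

# usage inside the update loop:
#   loss.backward()
#   optimizer.step()
#   project_to_unit_sphere(critic)
```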
daniel-palenicek.bsky.social
🧠 Key Idea: BN improves sample efficiency but fails to scale reliably to complex tasks & high UTDs due to growing weight norms.
However, BN-regularized networks are scale-invariant w.r.t. their weights, while the gradient scales inversely proportionally to the weight norm (Van Laarhoven, 2017).
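In formulas (following Van Laarhoven, 2017; the effective-learning-rate expression is the standard SGD approximation, not taken verbatim from the paper):

```latex
% Scale invariance of a BatchNorm'd layer w.r.t. its incoming weights W,
% and the resulting effective learning rate under SGD.
\begin{align*}
  f(\alpha W) &= f(W), &
  \nabla_{\alpha W} f &= \tfrac{1}{\alpha}\,\nabla_{W} f \qquad (\alpha > 0), \\[4pt]
  \eta_{\mathrm{eff}} &\propto \frac{\eta}{\lVert W \rVert^{2}}
  && \Rightarrow \text{a growing } \lVert W \rVert \text{ shrinks the effective step size.}
\end{align*}
```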
daniel-palenicek.bsky.social
🔍 Background: Off-policy RL methods like CrossQ (Bhatt* & Palenicek* et al. 2024) are sample-efficient but struggle to scale to high update-to-data (UTD) ratios.

We identify why scaling CrossQ fails—and fix it with a surprisingly effective tweak: Weight Normalization (WN). 🏋️
daniel-palenicek.bsky.social
🚀 New preprint "Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization"🤖

We propose CrossQ+WN, a simple yet powerful off-policy RL algorithm with improved sample efficiency and scalability to higher update-to-data ratios. 🧵 t.co/Z6QrMxZaPY

#RL @ias-tudarmstadt.bsky.social
https://arxiv.org/abs/2502.07523v2
daniel-palenicek.bsky.social
Check out our latest work, where we train an omnidirectional locomotion policy directly on a real quadruped robot in just a few minutes, building on our CrossQ RL algorithm 🚀
Shoutout to @nicobohlinger.bsky.social and Jonathan Kinzel.

@ias-tudarmstadt.bsky.social @hessianai.bsky.social
nicobohlinger.bsky.social
⚡️ Do you think training robot locomotion needs large scale simulation? Think again!

We train an omnidirectional locomotion policy directly on a real quadruped in just a few minutes 🚀
Top speeds of 0.85 m/s, two different control approaches, indoor and outdoor experiments, and more! 🤖🏃‍♂️