Wanyun Xie
@wanyunxie.bsky.social
wanyunxie.bsky.social
3/
Experiments show CHAMELEON:
- Improves generalization in pretraining (both perplexity and downstream tasks).
- Adapts seamlessly to new domains at minimal cost (only 1% retraining cost).
- Boosts performance in finetuning tasks.

Find our code at github.com/LIONS-EPFL/C...
GitHub - LIONS-EPFL/Chameleon: Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning, ICML 2025
wanyunxie.bsky.social
2/
From a data-centric perspective, CHAMELEON quantifies domain importance using Kernel Ridge Leverage Scores (KRLS) on learned domain embeddings. This allows us to directly adapt to new data without costly proxy retraining, drastically cutting compute!
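For intuition, here is a minimal sketch of a kernel ridge leverage score computation over domain embeddings. It is not the paper's exact recipe: the RBF kernel, the regularization value, and the normalization of scores into a sampling mixture are all assumptions of this sketch.

```python
import numpy as np

def kernel_ridge_leverage_scores(embeddings, lam=1e-3, gamma=1.0):
    """Ridge leverage score of each domain: diag(K (K + lam*I)^{-1})."""
    # RBF kernel over the learned domain embeddings (the kernel choice is an assumption).
    sq_dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    n = K.shape[0]
    # (K + lam*I)^{-1} K has the same diagonal as K (K + lam*I)^{-1}, since the two factors commute.
    return np.diag(np.linalg.solve(K + lam * np.eye(n), K))

# Hypothetical usage: normalize the scores into a sampling mixture over domains.
domain_embeddings = np.random.randn(8, 64)   # stand-in for learned embeddings
scores = kernel_ridge_leverage_scores(domain_embeddings)
mixture = scores / scores.sum()
```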
wanyunxie.bsky.social
1/
We believe an ideal data-mixing method should:
- 🚀 Improve universal generalization;
- 🔄 Adapt to domain modifications;
- ✨ Handle different training stages (pretraining & finetuning).
CHAMELEON achieves all three!
wanyunxie.bsky.social
We'll present our work, "CHAMELEON: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning," at #ICML2025! This is joint work with Francesco Tonin and @CevherLIONS.

📍 Find us at Poster E-2807 from 11 AM today. Excited to connect and discuss!
Reposted by Wanyun Xie
cevherlions.bsky.social
arxiv.org/abs/2502.07529
🚀 Key results:
- Based on conditional gradient method
- Beats Muon+Adam on NanoGPT (tested up to 3B params)
- Zero-shot learning rate transfer across model size
- Uses WAY less memory (just one set of params + half-precision grads)
- Provides explicit norm control
Hyper-parameter transfer on NanoGPT.
Reposted by Wanyun Xie
cevherlions.bsky.social
🔥 Want to train large neural networks WITHOUT Adam while using less memory and getting better results? ⚡
Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs): 🧵👇
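To illustrate the LMO idea, here is a toy conditional-gradient-style step built on an l_inf-ball LMO. This is only a sketch: SCION picks the norm (and hence the LMO) per layer geometry, and its actual update rule, radius, and momentum handling live in the paper and repo, so treat those choices below as assumptions.

```python
import torch

def linf_lmo(g, radius):
    """LMO over the l_inf ball: argmin_{||s||_inf <= radius} <g, s> = -radius * sign(g)."""
    return -radius * torch.sign(g)

@torch.no_grad()
def lmo_step(params, buffers, radius=1.0, lr=0.02, momentum=0.9):
    """One conditional-gradient-style update driven by a norm-constrained LMO (toy sketch)."""
    for p, buf in zip(params, buffers):
        if p.grad is None:
            continue
        buf.mul_(momentum).add_(p.grad, alpha=1 - momentum)  # momentum-averaged gradient
        s = linf_lmo(buf, radius)                            # extreme point of the norm ball
        p.mul_(1 - lr).add_(s, alpha=lr)                     # convex combination keeps ||p||_inf <= radius

# Usage sketch: buffers = [torch.zeros_like(p) for p in model.parameters()]
```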
wanyunxie.bsky.social
We use the optimal configuration from the 124M proxy model to scale up to 3B, increasing both width & depth. (UNCONSTRAINED) SCION outperforms Adam & Muon! 🚀
wanyunxie.bsky.social
We break SAM’s sequential gradient steps with SAMPa, which:
✨ Parallelizes gradients for efficiency.
✨ Adds optimistic gradient descent for stability.
✨ Ensures convergence with fixed perturbations.
SAMPa halves computation time and outperforms vanilla SAM in generalization.🚀
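For context, vanilla SAM needs two backward passes per step that cannot overlap, because the perturbation is built from the gradient computed in that same step. The sketch below shows this sequential bottleneck under a standard PyTorch setup (loss_fn, inputs, targets, base_opt, and rho are stand-ins); SAMPa's contribution is removing the dependency so the two gradients can be evaluated in parallel.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    """Vanilla SAM: pass 2 cannot start before pass 1 finishes (the bottleneck SAMPa removes)."""
    loss_fn(model(inputs), targets).backward()            # pass 1: gradient at w
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item() + 1e-12
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=rho / norm)                    # ascend to w + rho * g / ||g||
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()            # pass 2: gradient at the perturbed point
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=rho / norm)                    # restore w before the descent step
    base_opt.step()                                        # descend at w with the sharpness-aware gradient
    base_opt.zero_grad()
```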
wanyunxie.bsky.social
We'll present "SAMPa: Sharpness-Aware Minimization Parallelized" at #NeurIPS24 on Thursday! This is joint work with Thomas Pethick and Volkan Cevher.
📍 Find us at Poster #5904 from 16:30 in the West Ballroom.