Wanyun Xie
@wanyunxie.bsky.social
wanyunxie.bsky.social
3/
Experiments show CHAMELEON:
- Improves generalization in pretraining (both perplexity and downstream tasks).
- Adapts seamlessly to new domains at minimal cost (only 1% retraining cost).
- Boosts performance in finetuning tasks.

Find our code at github.com/LIONS-EPFL/C...
GitHub - LIONS-EPFL/Chameleon: Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning, ICML 2025
wanyunxie.bsky.social
2/
From a data-centric perspective, CHAMELEON quantifies domain importance using Kernel Ridge Leverage Scores (KRLS) on learned domain embeddings. This allows us to directly adapt to new data without costly proxy retraining, drastically cutting compute!
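For intuition, here is a minimal sketch of a kernel ridge leverage score computation over domain embeddings. It is not the paper's exact recipe: the RBF kernel, the regularization value, and the normalization of scores into a sampling mixture are all assumptions of this sketch.

```python
import numpy as np

def kernel_ridge_leverage_scores(embeddings, lam=1e-3, gamma=1.0):
    """Ridge leverage score of each domain: diag(K (K + lam*I)^{-1})."""
    # RBF kernel over the learned domain embeddings (the kernel choice is an assumption).
    sq_dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    n = K.shape[0]
    # (K + lam*I)^{-1} K has the same diagonal as K (K + lam*I)^{-1}, since the two factors commute.
    return np.diag(np.linalg.solve(K + lam * np.eye(n), K))

# Hypothetical usage: normalize the scores into a sampling mixture over domains.
domain_embeddings = np.random.randn(8, 64)   # stand-in for learned embeddings
scores = kernel_ridge_leverage_scores(domain_embeddings)
mixture = scores / scores.sum()
```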
wanyunxie.bsky.social
1/
We believe an ideal data-mixing method should:
- 🚀 Improve universal generalization;
- 🔄 Adapt to domain modifications;
- ✨ Handle different training stages (pretraining & finetuning).
CHAMELEON achieves all three!
wanyunxie.bsky.social
We'll present our work, "CHAMELEON: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning," at #ICML2025! This is joint work with Francesco Tonin and @CevherLIONS.

📍 Find us at Poster E-2807 from 11 AM today. Excited to connect and discuss!
Reposted by Wanyun Xie
cevherlions.bsky.social
arxiv.org/abs/2502.07529
🚀 Key results:
- Based on conditional gradient method
- Beats Muon+Adam on NanoGPT (tested up to 3B params)
- Zero-shot learning rate transfer across model size
- Uses WAY less memory (just one set of params + half-precision grads)
- Provides explicit norm control
Hyper-parameter transfer on NanoGPT.
Reposted by Wanyun Xie
cevherlions.bsky.social
🔥 Want to train large neural networks WITHOUT Adam while using less memory and getting better results? ⚡
Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs): 🧵👇
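To illustrate the LMO idea, here is a toy conditional-gradient-style step built on an l_inf-ball LMO. This is only a sketch: SCION picks the norm (and hence the LMO) per layer geometry, and its actual update rule, radius, and momentum handling live in the paper and repo, so treat those choices below as assumptions.

```python
import torch

def linf_lmo(g, radius):
    """LMO over the l_inf ball: argmin_{||s||_inf <= radius} <g, s> = -radius * sign(g)."""
    return -radius * torch.sign(g)

@torch.no_grad()
def lmo_step(params, buffers, radius=1.0, lr=0.02, momentum=0.9):
    """One conditional-gradient-style update driven by a norm-constrained LMO (toy sketch)."""
    for p, buf in zip(params, buffers):
        if p.grad is None:
            continue
        buf.mul_(momentum).add_(p.grad, alpha=1 - momentum)  # momentum-averaged gradient
        s = linf_lmo(buf, radius)                            # extreme point of the norm ball
        p.mul_(1 - lr).add_(s, alpha=lr)                     # convex combination keeps ||p||_inf <= radius

# Usage sketch: buffers = [torch.zeros_like(p) for p in model.parameters()]
```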
wanyunxie.bsky.social
We use the optimal configuration from the 124M proxy model to scale up to 3B, increasing both width & depth. (UNCONSTRAINED) SCION outperforms Adam & Muon! 🚀
wanyunxie.bsky.social
We break SAM’s sequential gradient steps with SAMPa, which:
✨ Parallelizes gradients for efficiency.
✨ Adds optimistic gradient descent for stability.
✨ Ensures convergence with fixed perturbations.
SAMPa halves computation time and outperforms vanilla SAM in generalization.🚀
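For context, vanilla SAM needs two backward passes per step that cannot overlap, because the perturbation is built from the gradient computed in that same step. The sketch below shows this sequential bottleneck under a standard PyTorch setup (loss_fn, inputs, targets, base_opt, and rho are stand-ins); SAMPa's contribution is removing the dependency so the two gradients can be evaluated in parallel.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    """Vanilla SAM: pass 2 cannot start before pass 1 finishes (the bottleneck SAMPa removes)."""
    loss_fn(model(inputs), targets).backward()            # pass 1: gradient at w
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item() + 1e-12
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=rho / norm)                    # ascend to w + rho * g / ||g||
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()            # pass 2: gradient at the perturbed point
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=rho / norm)                    # restore w before the descent step
    base_opt.step()                                        # descend at w with the sharpness-aware gradient
    base_opt.zero_grad()
```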
wanyunxie.bsky.social
We'll present "SAMPa: Sharpness-Aware Minimization Parallelized" at #NeurIPS24 on Thursday! This is joint work with Thomas Pethick and Volkan Cevher.
📍 Find us at Poster #5904 from 16:30 in the West Ballroom.