Volkan Cevher
@cevherlions.bsky.social
970 followers · 100 following · 12 posts
Associate Professor of Electrical Engineering, EPFL. Amazon Scholar (AGI Foundations). IEEE Fellow. ELLIS Fellow.
cevherlions.bsky.social
It turns out that the algorithm is closely related to the continuous greedy algorithm used in submodular optimization.
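For context, here is a minimal sketch of the continuous greedy update (the gradient oracle grad_F and the simple top-k polytope are illustrative assumptions, not from the paper): each step moves a fixed fraction of the way along an LMO output, never shrinking the iterate, which is the same shape as the conditional-gradient steps in the SCION thread below.

    import numpy as np

    def lmo_topk(grad, k):
        # LMO over the polytope {x in [0,1]^n : sum(x) <= k}:
        # put weight 1 on the k coordinates with the largest positive gradient entries.
        v = np.zeros_like(grad)
        idx = np.argsort(grad)[-k:]
        v[idx] = (grad[idx] > 0).astype(float)
        return v

    def continuous_greedy(grad_F, n, k, steps=100):
        # grad_F: gradient oracle of the objective (assumed given).
        # Continuous greedy: move a 1/steps fraction toward the LMO output at every step.
        x = np.zeros(n)
        for _ in range(steps):
            x = x + lmo_topk(grad_F(x), k) / steps
        return x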
Reposted by Volkan Cevher
tonysf.bsky.social
We also provide the first convergence rate analysis that I'm aware of for stochastic unconstrained Frank-Wolfe (i.e., without weight decay), which directly covers the Muon optimizer (and much more)!
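Roughly, as I read the constrained/unconstrained distinction (a sketch, with lr as a placeholder step size): the constrained conditional-gradient step averages the iterate toward the LMO output, and that (1 - lr) shrinkage is what plays the role of weight decay; the unconstrained variant simply drops it.

    def cg_step(w, g, lmo, lr):
        # Constrained conditional-gradient step: the (1 - lr) shrinkage toward zero
        # keeps the iterate inside the norm ball and acts like weight decay.
        return (1 - lr) * w + lr * lmo(g)

    def unconstrained_cg_step(w, g, lmo, lr):
        # "Unconstrained" variant: same LMO direction, no shrinkage, hence no weight decay.
        return w + lr * lmo(g)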
cevherlions.bsky.social
🔥 Want to train large neural networks WITHOUT Adam while using less memory and getting better results? ⚡
Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs): 🧵👇
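A minimal sketch of the idea in PyTorch (my own illustration, not the official SCION code; the sign-based LMO, step size, and momentum values are placeholder assumptions): keep a smoothed gradient per parameter, then move along the output of an LMO over the norm ball chosen for that layer.

    import torch

    def lmo_linf(g, radius=1.0):
        # LMO for the entrywise l_inf ball: argmin over ||s||_inf <= radius of <g, s>.
        return -radius * torch.sign(g)

    @torch.no_grad()
    def lmo_train_step(params, grad_avgs, lmos, lr=0.02, momentum=0.9):
        # One optimizer step: smooth each stochastic gradient, then add lr times
        # the LMO output for that parameter's norm ball.
        for p, m, lmo in zip(params, grad_avgs, lmos):
            m.mul_(momentum).add_(p.grad, alpha=1 - momentum)
            p.add_(lmo(m), alpha=lr)

    # Hypothetical wiring for a single linear layer (call after loss.backward()):
    # layer = torch.nn.Linear(1024, 1024)
    # bufs = [torch.zeros_like(p) for p in layer.parameters()]
    # lmo_train_step(list(layer.parameters()), bufs, [lmo_linf] * 2)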
cevherlions.bsky.social
This is joint work that I am very grateful to have done with the exceptionally talented team of Thomas Pethick, @wanyunxie.bsky.social, Kimon Antonakopoulos, and Zhenyu Zhu at LIONS@EPFL, and @tonysf.bsky.social from CentraleSupélec.
cevherlions.bsky.social
🧑‍🍳 We provide a complete cookbook for choosing the right LMO for your architecture: 📚
- Input layers (1-hot vs image)
- Hidden layers (spectral norms)
- Output layers (flexible norm choices)
All with explicit formulas and guidance for when to use each one.
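Two of those formulas, roughly (radii treated as plain tunable constants here; the per-layer scaling in the paper is more careful): the entrywise sign map is the LMO for an ℓ∞ ball, and the LMO for a spectral-norm ball comes from the reduced SVD of the gradient.

    import torch

    def lmo_sign_ball(g, radius=1.0):
        # argmin over ||S||_max <= radius of <G, S>  =  -radius * sign(G)
        return -radius * torch.sign(g)

    def lmo_spectral_ball(g, radius=1.0):
        # With G = U diag(s) V^T (reduced SVD),
        # argmin over ||S||_2 <= radius (operator norm) of <G, S>  =  -radius * U V^T.
        u, s, vh = torch.linalg.svd(g, full_matrices=False)
        return -radius * (u @ vh)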
cevherlions.bsky.social
🌟 It turns out many popular optimizers (SignSGD, Muon, etc.) are special cases of our framework - just with different norm choices.
Our unified analysis reveals deep connections between seemingly different approaches and provides new insights into why they work 🤔
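A toy illustration of that special-case claim (my own sketch, not code from the paper): plugging an ℓ∞-ball LMO into the generic step gives a SignSGD-style update, while a spectral-norm-ball LMO gives a Muon-style orthogonalized update.

    import torch

    def step(w, g, lmo, lr=0.02):
        # Generic LMO-based update: move along the LMO output for the chosen norm ball.
        return w + lr * lmo(g)

    g = torch.randn(64, 64)    # stand-in (smoothed) gradient for one weight matrix
    w = torch.zeros(64, 64)

    # l_inf ball  ->  SignSGD-like update: w - lr * sign(g)
    w_signsgd = step(w, g, lambda d: -torch.sign(d))

    # spectral-norm ball  ->  Muon-like update along -U V^T from the gradient's SVD
    def lmo_spectral(d):
        u, _, vh = torch.linalg.svd(d, full_matrices=False)
        return -(u @ vh)

    w_muonlike = step(w, g, lmo_spectral)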
cevherlions.bsky.social
📝 Check out the preprint: arxiv.org/abs/2502.07529
Worst-case convergence analysis with rates, guarantees for learning rate transfer, and practical advice on how to properly choose norms adapted to network geometry, backed by theory 🎯
cevherlions.bsky.social
🕵️ It’s “just” stochastic conditional gradient. The secret sauce? Don't treat your weight matrices like they're flat vectors! SCION adapts to the geometry of matrices using LMOs with respect to the correct norm: the induced operator norm.
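Concretely (a sketch; the radius is a placeholder): flattening a weight matrix and using an ℓ2 ball just rescales the raw gradient, whereas the induced operator-norm LMO equalizes all singular directions of the update.

    import torch

    def lmo_flat_l2(g, radius=1.0):
        # Treat the matrix as a flat vector: the l2-ball LMO is minus the normalized gradient.
        return -radius * g / g.norm()

    def lmo_operator_norm(g, radius=1.0):
        # Respect the matrix structure: the LMO for the induced (spectral) operator norm
        # replaces every singular value of the gradient by the radius.
        u, _, vh = torch.linalg.svd(g, full_matrices=False)
        return -radius * (u @ vh)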
cevherlions.bsky.social
arxiv.org/abs/2502.07529
🚀 Key results:
- Based on the conditional gradient method
- Beats Muon+Adam on NanoGPT (tested up to 3B params)
- Zero-shot learning rate transfer across model sizes
- Uses WAY less memory (just one set of params + half-precision grads)
- Provides explicit norm control
Hyper-parameter transfer on NanoGPT.
cevherlions.bsky.social
It was a fun panel. Quite informative.
epfl-ai-center.bsky.social
A thought-provoking panel with Scarlet of the EPFL AI Center, @cevherlions.bsky.social and Thomas Schneider from OFCOM - looking at the state of regulations, the business case for GenAI & the opportunities for Swiss research & innovation... a fine balance between talent, data and hardware. #AMLD
cevherlions.bsky.social
Timeo professores machinae discendi et dona ferentes. ("I fear machine learning professors, even bearing gifts.")
Reposted by Volkan Cevher
eugenevinitsky.bsky.social
An illustrated guide to never learning anything
Reposted by Volkan Cevher
wanyunxie.bsky.social
We'll present "SAMPa: Sharpness-Aware Minimization Parallelized" at #NeurIPS24 on Thursday! This is joint work with Thomas Pethick and Volkan Cevher.
📍 Find us at Poster #5904 from 16:30 in the West Ballroom.
Reposted by Volkan Cevher
mohaas.bsky.social
Stable model scaling with width-independent dynamics?

Thrilled to present 2 papers at #NeurIPS 🎉 that study width-scaling in Sharpness Aware Minimization (SAM) (Th 16:30, #2104) and in Mamba (Fr 11, #7110). Our scaling rules stabilize training and transfer optimal hyperparams across scales.

🧵 1/10
Reposted by Volkan Cevher
mohaas.bsky.social
This is joint work with wonderful collaborators @leenacvankadara.bsky.social , @cevherlions.bsky.social and Jin Xu during our time at Amazon.

🧵 10/10
Reposted by Volkan Cevher
docmilanfar.bsky.social
Reviewers take note:
57% of people rejected their own argument when they thought it was someone else's. So take it easy with the criticism.