Leena C Vankadara
@leenacvankadara.bsky.social
Lecturer @GatsbyUCL; Previously Applied Scientist @AmazonResearch; PhD @MPI-IS @UniTuebingen
This may explain the practical success of CE over MSE!

CE admits larger LRs → richer feature learning. MSE is restricted to the Lazy regime.

Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)
December 3, 2025 at 5:37 PM
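A minimal sketch of the kind of CE-vs-MSE comparison described above, under assumptions of my own: a standard-parameterized MLP with PyTorch's default Kaiming-style fan-in init, a single global SGD learning rate scaled as 1/√width (larger than the 1/width kernel prediction), and random data. The widths, data, and hyperparameters are hypothetical illustrations, not the paper's experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy comparison (illustrative only): same standard-parameterized MLP, same
# large global LR (∝ 1/sqrt(width)), trained once with CE and once with MSE.
def make_mlp(width, in_dim=32, num_classes=10):
    return nn.Sequential(
        nn.Linear(in_dim, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, num_classes),
    )

def train(loss_name, width=1024, steps=200):
    torch.manual_seed(0)
    x = torch.randn(512, 32)
    y = torch.randint(0, 10, (512,))
    model = make_mlp(width)
    lr = 1.0 / width ** 0.5          # above the 1/width "max stable" kernel scaling
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(x)
        if loss_name == "ce":
            loss = F.cross_entropy(logits, y)
        else:
            loss = F.mse_loss(logits, F.one_hot(y, 10).float())
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print("CE  final loss:", train("ce"))
print("MSE final loss:", train("mse"))
```

This toy run only shows the shape of the comparison (same architecture, same large LR, two losses); the thread's actual claim about CE tolerating larger LRs rests on the paper's analysis and experiments, not on this sketch.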
At the edge of this regime (where η ∝ 1/√m), there exists a well-defined infinite-width limit where feature learning persists in all hidden layers.

This Feature Learning Limit closely matches the behavior of optimally tuned finite-width networks under CE loss. (6/10)
December 3, 2025 at 5:37 PM
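A hedged sketch of one way to probe the Feature Learning Limit claim: track how much the last hidden layer's features move from initialization as width grows, under the η ∝ 1/m kernel scaling versus the η ∝ 1/√m edge-of-regime scaling. The model, data, widths, and step count below are hypothetical choices, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch: relative movement of last-hidden-layer features after a
# short CE training run, across widths, for LR ∝ 1/m vs. LR ∝ 1/sqrt(m).
def feature_movement(width, lr, steps=100):
    torch.manual_seed(0)
    x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
    body = nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU())
    head = nn.Linear(width, 10)
    feats0 = body(x).detach()                      # features at initialization
    opt = torch.optim.SGD(list(body.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(head(body(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()
    feats = body(x).detach()
    return ((feats - feats0).norm() / feats0.norm()).item()

for m in (128, 512, 2048):
    print(f"m={m:5d}  LR=1/m: {feature_movement(m, 1.0 / m):.3f}"
          f"   LR=1/sqrt(m): {feature_movement(m, 1.0 / m ** 0.5):.3f}")
# Under eta ~ 1/m the relative feature change should shrink with width (lazy),
# while under eta ~ 1/sqrt(m) it should stay of order one (feature learning).
```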
We resolve this via a fine-grained analysis of the regime previously considered unstable (and therefore uninteresting).

Under CE loss, we find this regime comprises two distinct sub-regimes: a Catastrophically Unstable regime and a benign Controlled Divergence regime. (4/10)
December 3, 2025 at 5:37 PM
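One way to see why CE can diverge benignly where MSE cannot: the gradient of CE with respect to the logits is softmax(z) − onehot(y), which stays bounded even as the logits blow up, whereas the MSE gradient grows linearly with the outputs. A minimal numerical sketch in PyTorch, illustrative only:

```python
import torch
import torch.nn.functional as F

# Compare gradient magnitudes of CE and MSE as the logits grow without bound
# (a proxy for "diverging" outputs).
y = torch.tensor([0])                              # target class index
onehot = F.one_hot(y, num_classes=3).float()       # MSE regression target

for scale in (1.0, 10.0, 100.0, 1000.0):
    z = torch.tensor([[2.0, -1.0, 0.5]]) * scale
    z_ce = z.clone().requires_grad_(True)
    z_mse = z.clone().requires_grad_(True)

    (g_ce,)  = torch.autograd.grad(F.cross_entropy(z_ce, y), z_ce)
    (g_mse,) = torch.autograd.grad(F.mse_loss(z_mse, onehot), z_mse)

    print(f"scale={scale:7.1f}  |grad CE|={g_ce.norm().item():.3f}"
          f"  |grad MSE|={g_mse.norm().item():.3f}")

# CE's logit gradient is softmax(z) - onehot(y), so its norm stays bounded no
# matter how large the logits get; MSE's gradient grows linearly with them.
```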
We find this discrepancy persists even after accounting for finite-width effects due to Catapult/EOS dynamics, large depth, and alignment violations.

In fact, infinite-width alignment predictions hold robustly when measured with sufficient granularity.

So what explains this discrepancy? (3/10)
December 3, 2025 at 5:37 PM
Most nets use He/LeCun init with a single LR η. As width m→∞, theory says:

η ∈ O(1/m) ⟹ Kernel regime; η ∈ ω(1/m) ⟹ Unstable.

Thus the max stable LR ∝ 1/m.

Practice violates this. Optimal LRs are larger (e.g. ∝ 1/√m) & models admit feature learning, contradicting kernel predictions. Why? (2/10)
December 3, 2025 at 5:37 PM
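A hedged sketch of how one might measure the "max stable LR vs. width" relationship empirically. The model, data, LR grid, and stability criterion below are all illustrative assumptions, not the thread's protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch: for each width m, find the largest single global LR at
# which a short SGD run with CE loss keeps the loss finite. The kernel-regime
# analysis predicts this threshold shrinks roughly like 1/m.
def runs_stably(width, lr, steps=100):
    torch.manual_seed(0)
    x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
    model = nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 10))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(model(x), y)
        if not torch.isfinite(loss):
            return False
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.isfinite(loss).item()

for width in (128, 512, 2048):
    lrs = [1e-4 * 10 ** (0.5 * k) for k in range(10)]        # log grid, 1e-4 to ~3
    max_stable = max((lr for lr in lrs if runs_stably(width, lr)), default=None)
    print(f"width={width:5d}  max stable LR ~ {max_stable}")
# Compare the measured threshold against the 1/m (kernel theory) and
# 1/sqrt(m) (practice) scalings described in the post above.
```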