Catch our Spotlight at #NeurIPS2025 Today!
📅 Wed Dec 3 🕟 4:30 - 7:30 PM 📍 Exhibit Hall C,D,E — Poster #3903
Huge thanks to my amazing collaborators: @mohaas.bsky.social @sbordt.bsky.social @ulrikeluxburg.bsky.social
This offers a new lens: empirical quirks (like aggressive LR scaling) are not mere finite-width artefacts; they faithfully reflect the true scaling limit. (9/10)
Caveat: Controlled Divergence can still cause overconfidence and floating-point instabilities (precision failure) at scale! (8/10)
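A minimal NumPy sketch of this failure mode (my illustration, not code from the paper): once the logit gap grows enough, float32 softmax rounds the top probability to exactly 1.0 and the other probabilities underflow to 0, so the model reports total confidence and the remaining log-probabilities become -inf.

```python
import numpy as np

def softmax32(z):
    z = z - z.max()                      # standard max-shift for stability
    e = np.exp(z)
    return e / e.sum()

# Logit gaps that keep growing, as in a "controlled divergence" run.
for gap in [5.0, 20.0, 50.0, 120.0]:
    logits = np.array([gap, 0.0, 0.0], dtype=np.float32)
    p = softmax32(logits)
    logp_other = float(np.log(p[1])) if p[1] > 0 else float("-inf")
    print(f"gap={gap:6.1f}  p_max={p[0]:.8f}  "
          f"p_other={p[1]:.3e}  log p_other={logp_other:.2f}")
# Around gap ~20 the top probability already rounds to exactly 1.0 in float32
# (overconfidence); around gap ~120 the other probabilities underflow to 0,
# so their log-probabilities are -inf (precision failure).
```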
CE admits larger LRs → richer feature learning. MSE is restricted to the lazy regime.
Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)
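For readers who want to see what "under µP" means operationally, here is a rough sketch using Microsoft's `mup` package; the API names follow its README, and the architecture, widths, and LR below are illustrative assumptions of mine, not the paper's setup. The point is that µP rescales per-layer learning rates so one LR tuned at small width transfers to large width, for CE and MSE alike.

```python
# Rough sketch of "training under µP" with Microsoft's `mup` package
# (https://github.com/microsoft/mup). API names follow its README; the
# architecture, widths, and LR here are illustrative assumptions only.
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuSGD

class MLP(nn.Module):
    def __init__(self, width: int, d_in: int = 3072, n_classes: int = 10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.head = MuReadout(width, n_classes)   # µP-aware output layer

    def forward(self, x):
        return self.head(self.body(x))

# Base/delta models tell mup which dimensions scale with width.
base, delta, model = MLP(width=64), MLP(width=128), MLP(width=4096)
set_base_shapes(model, base, delta=delta)
# (per the package docs, parameters should also be (re)initialized with mup's
# init utilities after this call; omitted in this sketch)

# One LR, tuned at small width, is reused at large width; MuSGD applies the
# µP per-layer rescaling. The same recipe applies whether the loss is CE or MSE.
opt = MuSGD(model.parameters(), lr=0.1)
```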
This Feature Learning Limit closely matches the behavior of optimally tuned finite-width networks under CE loss. (6/10)
This regime, however, does not exist under MSE. (5/10)
Under CE loss, we find this regime comprises two distinct sub-regimes: a Catastrophically Unstable regime and a benign Controlled Divergence regime. (4/10)
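A back-of-the-envelope way to see why divergence can be benign under CE but not under MSE (my illustration of the intuition, not the paper's formal argument): the CE gradient w.r.t. the logits is softmax(z) − y, which stays in [−1, 1] however large the logits get, whereas the MSE gradient on the outputs, 2(z − y)/k, grows with them.

```python
# Illustration only (not the paper's analysis): when the logits blow up by a
# factor s, the CE gradient w.r.t. the logits stays bounded while the MSE
# gradient grows linearly with them.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

y = np.array([1.0, 0.0, 0.0])             # one-hot target
z0 = np.array([2.0, -1.0, 0.5])           # some logits

for s in [1, 10, 100, 1000]:
    z = s * z0                             # diverging logits
    grad_ce = softmax(z) - y               # d(CE)/dz: every entry lies in [-1, 1]
    grad_mse = 2 * (z - y) / len(z)        # d(MSE on outputs)/dz: unbounded
    print(f"s={s:5d}  |grad_CE|={np.abs(grad_ce).max():.3f}  "
          f"|grad_MSE|={np.abs(grad_mse).max():.1f}")
# The softmax saturates, so huge logits need not mean huge updates under CE;
# under MSE the gradient scale tracks the logit scale, so divergence is fatal.
```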
In fact, infinite-width alignment predictions hold robustly when measured with sufficient granularity.
So what explains this discrepancy? (3/10)
η ∈ O(1/m) ⟹ Kernel regime; η ∈ ω(1/m) ⟹ Unstable.
Thus the max stable LR ∝ 1/m.
Practice violates this: optimal LRs are larger (e.g. ∝ 1/√m) and models admit feature learning, contradicting the kernel predictions. Why? (2/10)
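To make the 1/m ceiling concrete, here is a small NumPy sketch under assumptions of mine (two-layer ReLU net, standard parameterization, full-batch GD on MSE, not the paper's experiment): the empirical NTK's top eigenvalue grows roughly linearly in the width m, and since GD in the kernel regime is stable only for η below about 2/λ_max(NTK), the max stable LR shrinks like 1/m.

```python
# Illustration only (my setup, not the paper's experiment): two-layer ReLU net
# f(x) = v . relu(W x) under standard parameterization, W_ij ~ N(0, 1/d),
# v_i ~ N(0, 1/m). The empirical NTK's top eigenvalue grows ~linearly in the
# width m; for full-batch GD on MSE in the kernel regime, stability needs
# roughly eta < 2 / lambda_max(NTK), hence a max stable LR ~ 1/m.
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 64                                    # input dim, number of samples
X = rng.standard_normal((n, d))                  # O(1) input entries

def ntk_top_eigenvalue(m):
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    v = rng.standard_normal(m) / np.sqrt(m)
    H = X @ W.T                                  # (n, m) pre-activations
    A = np.maximum(H, 0.0)                       # df/dv_i   = relu(w_i . x)
    B = ((H > 0) * v)[:, :, None] * X[:, None, :]  # df/dW_ij = v_i relu'(w_i.x) x_j
    J = np.concatenate([A, B.reshape(n, m * d)], axis=1)  # per-sample Jacobian
    return np.linalg.eigvalsh(J @ J.T)[-1]       # NTK = J J^T

for m in [256, 1024, 4096]:
    lam = ntk_top_eigenvalue(m)
    print(f"m={m:5d}  lambda_max={lam:12.1f}  lambda_max/m={lam/m:7.2f}  "
          f"~max stable LR = 2/lambda_max = {2/lam:.2e}")
# lambda_max/m stays roughly constant across widths, so the stability
# threshold 2/lambda_max shrinks like 1/m -- the O(1/m) ceiling above.
```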