Eshaan Nichani
@eshaannichani.bsky.social
phd student @ princeton · deep learning theory eshaannichani.com
eshaannichani.bsky.social
Altogether, our work provides theoretical justification for the additive model hypothesis in gradient-based feature learning of shallow neural networks.

Check out our paper to learn more! (10/10)
eshaannichani.bsky.social
Compared to prior theory of neural scaling laws, we study the high-dimensional feature learning regime and don't assume a priori that the learning of different tasks can be decoupled.

Instead, the decoupling of different tasks (and thus emergence) arises from a "deflation" mechanism induced by SGD. (9/10)
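
A sketch of how I read this deflation picture (my interpretation of the thread, not the paper's proof): once the strongest skills are fitted by the student, they effectively cancel out of the prediction error, so the SGD signal is driven by the residual target

f^*(x) \;-\; \underbrace{\textstyle\sum_{p \le k} a_p\,\sigma(\langle w_p, x\rangle)}_{\text{already learned}} \;=\; \sum_{p > k} a_p\,\sigma(\langle w_p, x\rangle),

in which skill k+1 now carries the largest coefficient, so it is the next to escape its search phase. Repeating this argument skill by skill is what decouples the tasks.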
eshaannichani.bsky.social
Indeed, training two-layer nets in practice matches the theoretical scaling law: (8/10)
eshaannichani.bsky.social
As a corollary, when the a_p follow a power law, the population loss exhibits power-law decay in the runtime/sample size and the student width.

Matches the functional form of empirical neural scaling laws (e.g., Chinchilla)! (7/10)
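
A back-of-the-envelope version of this corollary (my own sketch; the exponents are illustrative, not the paper's): suppose a_p \asymp p^{-\alpha} with \alpha > 1/2, and the top P skills have been recovered. The remaining population loss is then governed by the tail of the squared coefficients,

\sum_{p > P} a_p^2 \;\asymp\; \sum_{p > P} p^{-2\alpha} \;\asymp\; P^{\,1-2\alpha},

a power law in the number of recovered skills P. Since the main theorem (next post below) ties P to the student width and to the runtime/sample size, the loss inherits a power-law dependence on both.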
eshaannichani.bsky.social
We train a 2-homogeneous two-layer student neural net via online SGD on the squared loss.

Main Theorem: to recover the top P ≤ P* = d^c directions, student width m = Θ(P*) and sample size poly(d, 1/a_{P*}, P) suffice.

Polynomial complexity with a single-stage algorithm! (6/10)
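
To make the setup concrete, here is a minimal NumPy sketch of this kind of experiment: a two-layer student trained by online SGD on the squared loss against an additive-model target. Every concrete choice (ReLU activations, the c_j · relu(⟨u_j, x⟩) parameterization standing in for the 2-homogeneous student, orthonormal teacher directions, power-law coefficients a_p, and all hyperparameters) is my own illustrative assumption, not the paper's exact setup.

import numpy as np

rng = np.random.default_rng(0)

d, P, m = 64, 16, 64          # input dim, number of skills, student width
alpha = 1.0                   # illustrative power-law exponent for skill strengths
a = np.arange(1, P + 1) ** (-alpha)                   # a_p = p^{-alpha}
W = np.linalg.qr(rng.standard_normal((d, P)))[0].T    # orthonormal teacher directions w_p

relu = lambda z: np.maximum(z, 0.0)

def target(x):
    # additive-model target: f*(x) = sum_p a_p * relu(<w_p, x>)
    return a @ relu(W @ x)

# student: f(x) = sum_j c_j * relu(<u_j, x>), 2-homogeneous in (c_j, u_j)
U = rng.standard_normal((m, d)) / np.sqrt(d)
c = rng.standard_normal(m) / np.sqrt(m)

lr, steps = 1e-2, 50_000
for t in range(steps):
    x = rng.standard_normal(d)        # fresh Gaussian sample each step -> online SGD
    pre = U @ x
    h = relu(pre)
    err = c @ h - target(x)
    grad_c = err * h                               # grad of 0.5*err^2 w.r.t. c
    grad_U = err * np.outer(c * (pre > 0), x)      # grad of 0.5*err^2 w.r.t. U
    c -= lr * grad_c
    U -= lr * grad_U
    if (t + 1) % 10_000 == 0:
        Xe = rng.standard_normal((2_000, d))       # Monte Carlo population-loss estimate
        loss = 0.5 * np.mean((relu(Xe @ U.T) @ c - relu(Xe @ W.T) @ a) ** 2)
        print(f"step {t + 1}: estimated population loss {loss:.4f}")

Watching the periodic loss prints, one would hope to see the plateau-and-drop behaviour discussed in the thread, though the small d, P, m here are chosen for speed rather than to match the theory's regime.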
eshaannichani.bsky.social
The additive-model target is thus a width-P two-layer neural network.
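
Written out (my notation, inferred from the thread): the target is a sum of P single-index skills, which is exactly a width-P two-layer network whose first-layer rows are the w_p and whose second-layer weights are the a_p,

f^*(x) \;=\; \sum_{p=1}^{P} a_p\,\sigma(\langle w_p, x\rangle), \qquad x \sim \mathcal{N}(0, I_d),

with the coefficients taken in decreasing order a_1 ≥ a_2 ≥ … ≥ a_P > 0, so that κ = a_1/a_P below is the ratio of the strongest to the weakest skill.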

Prior works either assume P = O(1) (multi-index model) or require complexity exponential in κ=a_1/a_P.

But to get a smooth scaling law, we need to handle many tasks (P→∞) with varying strengths (κ→∞). (5/10)
eshaannichani.bsky.social
We study an idealized setting where each “skill” is a Gaussian single-index model f*(x) = aσ(w•x).

Prior work (Ben Arous et al. ’21) shows that SGD exhibits emergence: a long “search phase” with a loss plateau is followed by a rapid “descent phase” where the loss converges. (4/10)
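
In overlap terms (a standard way to phrase this picture; the details are in Ben Arous et al., not spelled out in the thread, and the notation ρ_t is mine): writing ρ_t = ⟨w_t, w*⟩ for the alignment of the SGD iterate with the true direction, the search phase is the long stretch where ρ_t hovers near its random-initialization scale,

\rho_0 \approx d^{-1/2}, \qquad \text{loss} \approx \text{plateau value while } \rho_t = O(d^{-1/2}),

and the descent phase is the comparatively short window in which ρ_t escapes to constant order, converges to 1, and the loss drops; that sharp transition is the "emergence" of the single skill.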
eshaannichani.bsky.social
One explanation is the additive model hypothesis:
- The cumulative loss can be decomposed into many distinct skills, each of which individually exhibits emergence.
- The juxtaposition of many learning curves at varying timescales leads to a smooth power law in the loss. (3/10)
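
A toy way to see the second bullet (my own illustration, not the paper's argument): model skill p's learning curve as a step that contributes its full loss a_p^2 until its emergence time t_p and roughly zero afterwards, and assume stronger skills are learned earlier (t_p increasing in p). Then

L(t) \;\approx\; \sum_{p} a_p^2\,\mathbf{1}\{t < t_p\} \;=\; \sum_{p > P(t)} a_p^2, \qquad P(t) := \#\{p : t_p \le t\},

so each individual term is a sharp drop, but when the t_p are spread over a wide range of timescales and the a_p decay as a power law, the envelope L(t) decays smoothly, approximately as a power law in t.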
eshaannichani.bsky.social
LLMs demonstrate “emergent capabilities”: the acquisition of a single task/skill exhibits a sharp transition as compute increases.

Yet “neural scaling laws” posit that increasing compute leads to predictable power law decay in the loss.

How do we reconcile these two phenomena? (2/10)
eshaannichani.bsky.social
Excited to announce a new paper with Yunwei Ren, Denny Wu, and @jasondeanlee.bsky.social!

We prove a neural scaling law in the SGD learning of extensive-width two-layer neural networks.

arxiv.org/abs/2504.19983

🧵 below (1/10)