Deniz Bayazit
@bayazitdeniz.bsky.social
#NLProc PhD student @EPFL #interpretability
6/ Concurrently, recent work uses sparse crosscoders to show broad phases of concept evolution (statistical → feature learning); we track the causal dynamics of specific concepts over time and across languages with RelIE, giving a complementary, finer-grained view.

arxiv.org/abs/2509.17196
Evolution of Concepts in Language Model Pre-Training
5/ Looking closer, feature sharing has limits: in Hindi & Arabic, overlap stays low even at 341B tokens. This may be due to richer agreement systems (e.g., verbs agreeing w/ subjects & objects) forcing BLOOM to keep language-specific features—or simply data scarcity!
4/ In #multilingual models, cross-language feature overlap starts low and rises with training. At 6B tokens in BLOOM, most detectors are language-specific or for punctuation; by 341B tokens shared crosslingual features emerge, capturing syntactic abstractions over token patterns.
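The overlap trend in 4/ could be measured, for instance, as the Jaccard overlap between the sets of crosscoder features active on parallel text in two languages. This is a minimal illustrative sketch, not the paper's actual statistic; the feature IDs below are made up.

```python
def jaccard_overlap(active_a, active_b):
    """Fraction of features shared between two languages' active sets."""
    a, b = set(active_a), set(active_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Early training: detectors are language-specific, so the sets are disjoint.
early = jaccard_overlap({1, 2, 3}, {7, 8, 9})       # -> 0.0
# Late training: shared crosslingual features appear in both sets.
late = jaccard_overlap({1, 2, 3, 4}, {2, 3, 4, 5})  # -> 0.6
```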
3/ Which features matter early but fade, and which gain importance later? In Pythia, token-level detectors drop out, while higher-level grammatical features—like plural-noun detectors and nouns formed from verbs (e.g., runner from run)—strengthen by 286B tokens.
2/ We align critical checkpoints for a task with sparse crosscoders, measure each feature’s causal role, and introduce RelIE to compare their influence across checkpoints. This lets us trace how internal features shift—and when they matter—in models like Pythia, OLMo, and BLOOM.
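The pipeline in 2/ can be sketched as follows, assuming (this is my assumption, not the paper's stated formula) that each checkpoint's crosscoder yields an indirect effect (IE) per feature via ablation, and that RelIE normalizes a feature's IE at one checkpoint by its summed IE across all checkpoints, so rows compare *when* a feature matters rather than *how much* overall.

```python
import numpy as np

# Toy stand-in for ablation-based indirect effects: IE[f, t] is the
# (nonnegative) change in task loss when feature f is ablated at
# checkpoint t. Real values would come from the aligned crosscoders.
rng = np.random.default_rng(0)
n_features, n_checkpoints = 5, 4
ie = np.abs(rng.normal(size=(n_features, n_checkpoints)))

def relie(ie_matrix):
    """Hypothetical relative indirect effect: each feature's IE per
    checkpoint, normalized by its total IE over training, giving a
    per-feature distribution over checkpoints."""
    totals = ie_matrix.sum(axis=1, keepdims=True)
    return ie_matrix / totals

scores = relie(ie)  # each row sums to 1; peaks show when a feature matters
```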
1/🚨 New preprint

How do #LLMs’ inner features change as they train? Using #crosscoders + a new causal metric, we map when features appear, strengthen, or fade across checkpoints—opening a new lens on training dynamics beyond loss curves & benchmarks.

#interpretability