David Brandfonbrener
@brandfonbrener.bsky.social
Research scientist at Meta on the Llama team. Thinking about language models. Past: PhD at NYU, fellow at Harvard's Kempner Institute.
Reposted by David Brandfonbrener
blackhc.bsky.social
I want to reshare @brandfonbrener.bsky.social's @NeurIPSConf 2024 paper on CoLoR-Filter: A simple yet powerful method for selecting high-quality data for language model pre-training!

With @hlzhang109.bsky.social @schwarzjn.bsky.social @shamkakade.bsky.social
brandfonbrener.bsky.social
I’m heading to NeurIPS Wednesday through Sunday. DM me if you want to meet up!
brandfonbrener.bsky.social
Definitely one of my favorites too!
Reposted by David Brandfonbrener
kempnerinstitute.bsky.social
NEW: we have an exciting opportunity for a tenure-track professor at the #KempnerInstitute and the John A. Paulson School of Engineering and Applied Sciences (SEAS). Read the full description & apply today: academicpositions.harvard.edu/postings/14362
#ML #AI
brandfonbrener.bsky.social
Loss-to-loss prediction lets us do things like fit a scaling law on a new dataset from only 8 models by leveraging data from a prior run. Full results and more applications are in the paper!
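As a rough illustration of the idea (not the authors' code: all constants below are made up, and I assume a Chinchilla-style scaling law on dataset 0 plus an already-fitted shifted-power-law loss-to-loss map), translating a scaling law to a new dataset is just function composition:

```python
import numpy as np

# Hypothetical constants: a scaling law fitted on dataset 0 and a
# loss-to-loss map fitted from a handful of models on dataset 1.
A, alpha = 400.0, 0.34     # assumed model-size term on dataset 0
B, beta = 2.1e3, 0.28      # assumed data-size term on dataset 0
E0, E1 = 1.8, 1.5          # assumed irreducible losses on datasets 0 and 1
K, kappa = 0.9, 1.1        # assumed loss-to-loss fit (e.g. from ~8 models)

def loss0(N, D):
    """Chinchilla-style scaling law on dataset 0 (illustrative)."""
    return E0 + A / N**alpha + B / D**beta

def loss1(N, D):
    """Predicted scaling law on dataset 1 via the loss-to-loss map."""
    return E1 + K * (loss0(N, D) - E0) ** kappa
```

Since the map is monotone, the translated law inherits the shape of the dataset-0 law: loss still falls as parameters N or tokens D grow.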
brandfonbrener.bsky.social
Generalizing further, we can predict from test-to-test. These predictions pair up models trained on two different datasets with the same budget and then compare test loss on a third dataset.
brandfonbrener.bsky.social
Next, we can use a similar methodology to go from train-to-test. These predictions describe, with a single function, how performance transfers from the training loss to any test loss.
brandfonbrener.bsky.social
We can fit these curves to sets of 88 models of varying model and dataset sizes. In total we train over 500 models for these experiments, and we release all of them.

The fits extrapolate well to models with 20x more FLOPs.
brandfonbrener.bsky.social
First, we consider how to translate scaling laws from one dataset to another and from one loss to another.

We find that we can fit a curve to map loss on dataset 0 to loss on dataset 1 for N-parameter models on D tokens (where E_i is the estimated irreducible loss on dataset i)
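One way such a curve can be sketched (my own illustration, not the authors' code: the synthetic losses and the assumption that E_0 and E_1 are already known are made up here) is a shifted power law L_1 − E_1 = K(L_0 − E_0)^κ, which turns into a straight line after taking logs:

```python
import numpy as np

# Assumed irreducible losses and ground-truth map parameters
# (synthetic; the paper estimates these from real training runs).
E0, E1 = 1.8, 1.5
K_true, kappa_true = 0.9, 1.1

L0 = np.linspace(2.2, 3.5, 8)                  # losses of 8 models on dataset 0
L1 = E1 + K_true * (L0 - E0) ** kappa_true     # matching losses on dataset 1

# log(L1 - E1) = log K + kappa * log(L0 - E0), so a degree-1 polyfit
# in log space recovers the exponent and prefactor.
kappa_hat, logK_hat = np.polyfit(np.log(L0 - E0), np.log(L1 - E1), 1)
K_hat = np.exp(logK_hat)

def predict(l0):
    """Map a dataset-0 loss to a predicted dataset-1 loss."""
    return E1 + K_hat * (l0 - E0) ** kappa_hat
```

The log-space trick is just a convenience; with noisy losses one would instead fit the shifted power law directly with a nonlinear least-squares routine.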
brandfonbrener.bsky.social
How does test loss change as we change the training data? And how does this interact with scaling laws?

We propose a methodology to approach these questions by showing that we can predict the performance across datasets and losses with simple shifted power law fits.