Alex Wettig
@awettig.bsky.social
120 followers 84 following 9 posts
PhD@Princeton trying to make sense of language models and their training data
awettig.bsky.social
Presenting two posters at ICML over the next two days:
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)

Tomorrow: WebOrganizer w/ @soldaini.net & @kylelo.bsky.social
Thursday: MeCo by @gaotianyu1350.bsky.social
awettig.bsky.social
Our domains also shine a light on which types of content are implicitly upsampled when using quality filters!

💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
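To make "implicitly upsampled" concrete, here is a minimal sketch (not the paper's code): compare each topic's share of the corpus before and after keeping only the highest-scoring documents under a quality filter. The `docs`, `"topic"`, and `"score"` fields are illustrative assumptions, not the actual data format.

```python
# Minimal sketch of measuring which domains a quality filter implicitly upsamples.
# `docs` is a hypothetical list of dicts with a "topic" label (from a domain
# classifier) and a "score" from a quality filter such as DCLM-fasttext.
from collections import Counter

def topic_shares(docs):
    counts = Counter(d["topic"] for d in docs)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def implicit_upsampling(docs, keep_fraction=0.1):
    before = topic_shares(docs)
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    after = topic_shares(kept)
    # Ratio > 1 means the filter implicitly upsamples that topic.
    return {t: after.get(t, 0.0) / before[t] for t in before}
```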
awettig.bsky.social
Instead of sampling from the domains at random, we can also pick the best documents within each domain according to quality filters. This improves the overall performance of two strong quality filters.

✅ Domain mixing complements quality filtering by being able to calibrate the training distribution!
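A hedged sketch of how this combination could work: fill each domain's quota (set by the target mixture) with that domain's highest-scoring documents. The field names (`"topic"`, `"score"`, `"n_tokens"`) and the greedy selection are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: domain mixing + quality filtering. Each domain's token quota comes
# from the calibrated mixture; within a domain, take the best-scoring docs.
from collections import defaultdict

def select_with_mixture(docs, target_weights, token_budget):
    by_domain = defaultdict(list)
    for d in docs:
        by_domain[d["topic"]].append(d)

    selected = []
    for domain, weight in target_weights.items():
        quota = weight * token_budget
        used = 0
        # Best documents first, until the domain's token quota is filled.
        for d in sorted(by_domain[domain], key=lambda x: x["score"], reverse=True):
            if used >= quota:
                break
            selected.append(d)
            used += d["n_tokens"]
    return selected
```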
awettig.bsky.social
We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.

And we can combine the topic and format predictions to curate data with even better performance! 📈
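One simple way to picture combining the two taxonomies (an illustrative sketch, not necessarily how the paper does it): treat each (topic, format) pair as a joint domain and form joint weights from separately optimized topic and format weights.

```python
# Illustrative sketch: joint (topic, format) domains with weights formed as the
# product of the two marginal weight vectors. This product form is a simplifying
# assumption for illustration only.
def joint_domain_weights(topic_weights, format_weights):
    joint = {
        (t, f): wt * wf
        for t, wt in topic_weights.items()
        for f, wf in format_weights.items()
    }
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}
```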
awettig.bsky.social
How useful are these domains for data curation in practice?

We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality"

Prediction: Heavily upsample domains such as Science or Tutorials!
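For readers unfamiliar with RegMix, a rough sketch of the idea: train many small proxy models on random domain mixtures, fit a regressor from mixture weights to a downstream proxy metric, then search for the mixture the regressor predicts to be best. The regressor choice and the Dirichlet search below are simplifications, not the exact recipe.

```python
# Rough sketch of the RegMix-style reweighting used to get these predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def predict_best_mixture(mixtures, proxy_scores, n_candidates=100_000, seed=0):
    """mixtures: (n_runs, n_domains) weights used for the small proxy runs (rows sum to 1).
    proxy_scores: (n_runs,) downstream metric measured on those proxy models."""
    reg = GradientBoostingRegressor().fit(mixtures, proxy_scores)

    # Sample many candidate mixtures and keep the one with the best prediction.
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(mixtures.shape[1]), size=n_candidates)
    preds = reg.predict(candidates)
    return candidates[np.argmax(preds)]
```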
awettig.bsky.social
We distill the LLM outputs into small domain classifiers to annotate data at scale!

Interesting finding: our topics and formats co-occur almost independently!
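One way to check the "co-occur almost independently" observation (a sketch, not the paper's analysis): compare the empirical joint distribution of (topic, format) labels with the product of its marginals via mutual information. The `labels` input is a hypothetical list of (topic, format) pairs from the distilled classifiers.

```python
# Mutual information between topic and format labels; values near 0 mean the
# two taxonomies co-occur (almost) independently.
import math
from collections import Counter

def topic_format_mutual_information(labels):
    n = len(labels)
    joint = Counter(labels)
    topics = Counter(t for t, _ in labels)
    formats = Counter(f for _, f in labels)
    mi = 0.0
    for (t, f), c in joint.items():
        p_tf = c / n
        p_t, p_f = topics[t] / n, formats[f] / n
        mi += p_tf * math.log(p_tf / (p_t * p_f))
    return mi  # in nats
```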
awettig.bsky.social
Modern pre-training relies on crawling the web to collect trillions of tokens

We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages

🔍 Explore our domains and see examples at weborganizer.allen.ai
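A hedged sketch of what the annotation step could look like: build a prompt from the hand-written category descriptions and ask an LLM to pick exactly one category per page. The category names and descriptions below are illustrative placeholders (see weborganizer.allen.ai for the real taxonomy), and `call_llm` stands in for whatever chat-completion API is used.

```python
# Sketch of prompting an LLM with category descriptions to label a web page.
TOPIC_DESCRIPTIONS = {
    "Science": "Content about scientific research, methods, or findings.",
    "Entertainment": "Content about movies, music, games, or celebrities.",
    # ... remaining categories and their descriptions (illustrative only)
}

def build_topic_prompt(page_text: str) -> str:
    options = "\n".join(f"- {name}: {desc}" for name, desc in TOPIC_DESCRIPTIONS.items())
    return (
        "Classify the topic of the following web page into exactly one category.\n"
        f"Categories:\n{options}\n\n"
        f"Web page:\n{page_text[:4000]}\n\n"
        "Answer with the category name only."
    )

def annotate(page_text: str, call_llm) -> str:
    # `call_llm` is a placeholder for an LLM client call, not a real API.
    return call_llm(build_topic_prompt(page_text)).strip()
```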
awettig.bsky.social
🤔 Ever wondered how prevalent some type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
Reposted by Alex Wettig
liujch1998.bsky.social
Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy of OLMo 2 7B & 13B models on individual tasks within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
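A sketch of the two-stage ladder idea: fit a power law from model size and training tokens to task loss on small "ladder" models, then a sigmoidal map from task loss to task accuracy, and chain the two to predict a larger model's accuracy before training it. The functional forms and initial guesses below are common scaling-law choices, simplified from the paper.

```python
# Two-stage task scaling law sketch: (N, D) -> task loss -> task accuracy.
import numpy as np
from scipy.optimize import curve_fit

def task_loss(ND, A, B, alpha, beta, E):
    N, D = ND
    return A / N**alpha + B / D**beta + E  # Chinchilla-style form

def accuracy_from_loss(L, a, b, k, L0):
    return a / (1 + np.exp(-k * (L - L0))) + b  # sigmoid link

def fit_ladder(N, D, losses, accs):
    # Stage 1: fit task loss as a function of parameters N and tokens D.
    p_loss, _ = curve_fit(task_loss, (N, D), losses,
                          p0=[1e3, 1e3, 0.3, 0.3, 1.0], maxfev=20000)
    # Stage 2: fit accuracy as a function of task loss.
    p_acc, _ = curve_fit(accuracy_from_loss, losses, accs,
                         p0=[-0.5, 0.8, 5.0, 1.0], maxfev=20000)
    return lambda n, d: accuracy_from_loss(task_loss((n, d), *p_loss), *p_acc)
```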