Alex Wettig
@awettig.bsky.social
120 followers 84 following 9 posts
PhD@Princeton trying to make sense of language models and their training data
awettig.bsky.social
Presenting two posters at ICML over the next two days:
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)

Tomorrow: WebOrganizer w/ @soldaini.net & @kylelo.bsky.social
Thursday: MeCo by @gaotianyu1350.bsky.social
awettig.bsky.social
Our domains also shine a light on which types of content are implicitly upsampled when using quality filters!

💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
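To make "implicitly upsampled" concrete, here is a minimal sketch (not the paper's code): compare each topic's share of the corpus before and after keeping only the highest-scoring documents under a quality filter. The `docs`, `"topic"`, and `"score"` fields are illustrative assumptions, not the actual data format.

```python
# Minimal sketch of measuring which domains a quality filter implicitly upsamples.
# `docs` is a hypothetical list of dicts with a "topic" label (from a domain
# classifier) and a "score" from a quality filter such as DCLM-fasttext.
from collections import Counter

def topic_shares(docs):
    counts = Counter(d["topic"] for d in docs)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def implicit_upsampling(docs, keep_fraction=0.1):
    before = topic_shares(docs)
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    after = topic_shares(kept)
    # Ratio > 1 means the filter implicitly upsamples that topic.
    return {t: after.get(t, 0.0) / before[t] for t in before}
```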
awettig.bsky.social
Instead of sampling from the domains at random, we can also pick the best documents within each domain according to quality filters. This improves the overall performance of two strong quality filters.

✅ Domain mixing complements quality filtering by being able to calibrate the training distribution!
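A hedged sketch of how this combination could work: fill each domain's quota (set by the target mixture) with that domain's highest-scoring documents. The field names (`"topic"`, `"score"`, `"n_tokens"`) and the greedy selection are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: domain mixing + quality filtering. Each domain's token quota comes
# from the calibrated mixture; within a domain, take the best-scoring docs.
from collections import defaultdict

def select_with_mixture(docs, target_weights, token_budget):
    by_domain = defaultdict(list)
    for d in docs:
        by_domain[d["topic"]].append(d)

    selected = []
    for domain, weight in target_weights.items():
        quota = weight * token_budget
        used = 0
        # Best documents first, until the domain's token quota is filled.
        for d in sorted(by_domain[domain], key=lambda x: x["score"], reverse=True):
            if used >= quota:
                break
            selected.append(d)
            used += d["n_tokens"]
    return selected
```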
awettig.bsky.social
We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.

And we can combine the topic and format predictions to curate data with even better performance! 📈
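One simple way to picture combining the two taxonomies (an illustrative sketch, not necessarily how the paper does it): treat each (topic, format) pair as a joint domain and form joint weights from separately optimized topic and format weights.

```python
# Illustrative sketch: joint (topic, format) domains with weights formed as the
# product of the two marginal weight vectors. This product form is a simplifying
# assumption for illustration only.
def joint_domain_weights(topic_weights, format_weights):
    joint = {
        (t, f): wt * wf
        for t, wt in topic_weights.items()
        for f, wf in format_weights.items()
    }
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}
```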
awettig.bsky.social
How useful are these domains for data curation in practice?

We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality"

Prediction: Heavily upsample domains such as Science or Tutorials!
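For readers unfamiliar with RegMix, a rough sketch of the idea: train many small proxy models on random domain mixtures, fit a regressor from mixture weights to a downstream proxy metric, then search for the mixture the regressor predicts to be best. The regressor choice and the Dirichlet search below are simplifications, not the exact recipe.

```python
# Rough sketch of the RegMix-style reweighting used to get these predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def predict_best_mixture(mixtures, proxy_scores, n_candidates=100_000, seed=0):
    """mixtures: (n_runs, n_domains) weights used for the small proxy runs (rows sum to 1).
    proxy_scores: (n_runs,) downstream metric measured on those proxy models."""
    reg = GradientBoostingRegressor().fit(mixtures, proxy_scores)

    # Sample many candidate mixtures and keep the one with the best prediction.
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(mixtures.shape[1]), size=n_candidates)
    preds = reg.predict(candidates)
    return candidates[np.argmax(preds)]
```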
awettig.bsky.social
We distill the LLM outputs into small domain classifiers to annotate data at scale!

Interesting finding: our topics and formats co-occur almost independently!
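One way to check the "co-occur almost independently" observation (a sketch, not the paper's analysis): compare the empirical joint distribution of (topic, format) labels with the product of its marginals via mutual information. The `labels` input is a hypothetical list of (topic, format) pairs from the distilled classifiers.

```python
# Mutual information between topic and format labels; values near 0 mean the
# two taxonomies co-occur (almost) independently.
import math
from collections import Counter

def topic_format_mutual_information(labels):
    n = len(labels)
    joint = Counter(labels)
    topics = Counter(t for t, _ in labels)
    formats = Counter(f for _, f in labels)
    mi = 0.0
    for (t, f), c in joint.items():
        p_tf = c / n
        p_t, p_f = topics[t] / n, formats[f] / n
        mi += p_tf * math.log(p_tf / (p_t * p_f))
    return mi  # in nats
```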
awettig.bsky.social
Modern pre-training relies on crawling the web to collect trillions of tokens

We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages

🔍 Explore our domains and see examples at weborganizer.allen.ai
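A hedged sketch of what the annotation step could look like: build a prompt from the hand-written category descriptions and ask an LLM to pick exactly one category per page. The category names and descriptions below are illustrative placeholders (see weborganizer.allen.ai for the real taxonomy), and `call_llm` stands in for whatever chat-completion API is used.

```python
# Sketch of prompting an LLM with category descriptions to label a web page.
TOPIC_DESCRIPTIONS = {
    "Science": "Content about scientific research, methods, or findings.",
    "Entertainment": "Content about movies, music, games, or celebrities.",
    # ... remaining categories and their descriptions (illustrative only)
}

def build_topic_prompt(page_text: str) -> str:
    options = "\n".join(f"- {name}: {desc}" for name, desc in TOPIC_DESCRIPTIONS.items())
    return (
        "Classify the topic of the following web page into exactly one category.\n"
        f"Categories:\n{options}\n\n"
        f"Web page:\n{page_text[:4000]}\n\n"
        "Answer with the category name only."
    )

def annotate(page_text: str, call_llm) -> str:
    # `call_llm` is a placeholder for an LLM client call, not a real API.
    return call_llm(build_topic_prompt(page_text)).strip()
```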
awettig.bsky.social
🤔 Ever wondered how prevalent some type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
Reposted by Alex Wettig
liujch1998.bsky.social
Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy of OLMo 2 7B & 13B models on individual tasks within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
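A sketch of the two-stage ladder idea: fit a power law from model size and training tokens to task loss on small "ladder" models, then a sigmoidal map from task loss to task accuracy, and chain the two to predict a larger model's accuracy before training it. The functional forms and initial guesses below are common scaling-law choices, simplified from the paper.

```python
# Two-stage task scaling law sketch: (N, D) -> task loss -> task accuracy.
import numpy as np
from scipy.optimize import curve_fit

def task_loss(ND, A, B, alpha, beta, E):
    N, D = ND
    return A / N**alpha + B / D**beta + E  # Chinchilla-style form

def accuracy_from_loss(L, a, b, k, L0):
    return a / (1 + np.exp(-k * (L - L0))) + b  # sigmoid link

def fit_ladder(N, D, losses, accs):
    # Stage 1: fit task loss as a function of parameters N and tokens D.
    p_loss, _ = curve_fit(task_loss, (N, D), losses,
                          p0=[1e3, 1e3, 0.3, 0.3, 1.0], maxfev=20000)
    # Stage 2: fit accuracy as a function of task loss.
    p_acc, _ = curve_fit(accuracy_from_loss, losses, accs,
                         p0=[-0.5, 0.8, 5.0, 1.0], maxfev=20000)
    return lambda n, d: accuracy_from_loss(task_loss((n, d), *p_loss), *p_acc)
```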