Tyler Chang
@tylerachang.bsky.social
210 followers 65 following 12 posts
PhD student at UC San Diego. He/him/his. https://tylerachang.github.io/
Pinned
tylerachang.bsky.social
We scaled training data attribution (TDA) methods ~1000x to find influential pretraining examples for thousands of queries in an 8B-parameter LLM over the entire 160B-token C4 corpus!
medium.com/people-ai-re...
Reposted by Tyler Chang
catherinearnett.bsky.social
Did you know?

❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English

In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
Reposted by Tyler Chang
mrl-workshop.bsky.social
We have over 200 volunteers now for 90+ languages! We are hoping to expand the diversity of our language coverage and are still looking for participants who speak these languages. Check out how to get involved below, and please help us spread the word!
We are still actively looking for volunteers speaking the following languages (or other languages not listed):
Afrikaans, Aymara, Basque, Bosnian, Breton, Burmese, Cebuano, Guarani, Haitian Creole, Hmong, Hungarian, Icelandic, Inuktitut, Irish, Karakalpak, Khmer, Kirghiz, Lao, Latvian, Macedonian, Malagasy, Maltese, Maori, Mongolian, Nahuatl, Navajo/Diné, Norwegian Nynorsk, Quechua, Romanian, Samoan, Scottish Gaelic, Shona, Somali, Tatar, Tibetan, Tigrinya, Waray, Walloon, Welsh, Yiddish, Zulu.
Reposted by Tyler Chang
mrl-workshop.bsky.social
With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don’t see your language represented on the map, this is your sign to get involved!
tylerachang.bsky.social
We're organizing a shared task to develop a multilingual physical commonsense reasoning evaluation dataset! Details on how to submit are at: sigtyp.github.io/st2025-mrl.h...
tylerachang.bsky.social
Of course, there are some scenarios where you would really want to check all the training examples, e.g. for detecting data contamination or for rare facts.
tylerachang.bsky.social
I think you could still make interesting inferences about what *types* of training examples influence the target! You'd essentially be getting a sample of the actual top-k retrievals.
tylerachang.bsky.social
The biggest compute cost is computing gradients for every training example (~= cost of training) -- happy to chat more, especially if you know anyone interested in putting together an open-source implementation!
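A minimal sketch of that per-example gradient scoring, with a toy linear model, loss, and random data standing in for the real setup -- this is just the shape of the computation, not the TrackStar implementation:

```python
# Illustrative gradient-dot-product influence scoring. The linear model,
# MSE loss, and random data are stand-ins; the point is that every training
# example needs its own backward pass, which is roughly the cost of one
# pass of training over the corpus.
import torch
import torch.nn as nn


def flat_grad(model: nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Gradient of `loss` w.r.t. all trainable parameters, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_scores(model, loss_fn, query, train_examples):
    """Dot product between the query gradient and each training example's gradient."""
    query_grad = flat_grad(model, loss_fn(model, query))
    scores = []
    for example in train_examples:  # one backward pass per training example
        example_grad = flat_grad(model, loss_fn(model, example))
        scores.append(torch.dot(query_grad, example_grad).item())
    return scores


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(4, 1)
    mse = nn.MSELoss()
    loss_fn = lambda m, batch: mse(m(batch[0]), batch[1])
    query = (torch.randn(2, 4), torch.randn(2, 1))
    train = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(5)]
    print(influence_scores(model, loss_fn, query, train))
```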
tylerachang.bsky.social
Presenting our work on training data attribution for pretraining this morning: iclr.cc/virtual/2025... -- come stop by in Hall 2/3 #526 if you're here at ICLR!
tylerachang.bsky.social
Play with it yourself: see influential pretraining examples from our method for facts, factual errors, commonsense reasoning, arithmetic, and open-ended generation: github.com/PAIR-code/pr...
tylerachang.bsky.social
As models increase in size and pretraining tokens, "influence" more closely resembles "attribution". I.e. "better" models do seem to rely more on entailing examples.
tylerachang.bsky.social
Many influential examples do not entail a fact, but instead appear to reflect priors on common entities for certain relation types, or guesses based on first or last names.
tylerachang.bsky.social
In a fact tracing task, we find that classical retrieval methods (e.g. BM25) are still much better for retrieving examples that *entail* factual predictions (factual "attribution"), but TDA methods retrieve examples that have greater *influence* on model predictions.
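For context, a lexical baseline like BM25 ranks passages purely by term overlap with the query. A toy sketch using the off-the-shelf rank_bm25 package and a made-up corpus (not the paper's actual fact-tracing setup):

```python
# Toy BM25 retrieval over a handful of "pretraining passages" for a factual
# query. Corpus and query are made up; requires `pip install rank-bm25`.
import re

from rank_bm25 import BM25Okapi


def tokenize(text: str) -> list[str]:
    """Lowercase and strip punctuation for simple lexical matching."""
    return re.findall(r"[a-z]+", text.lower())


corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower attracts millions of visitors each year.",
    "Tokyo is the capital of Japan.",
    "France borders Spain, Italy, and Germany.",
]
bm25 = BM25Okapi([tokenize(doc) for doc in corpus])

query = "What is the capital of France?"
scores = bm25.get_scores(tokenize(query))

# Rank passages by lexical overlap with the query; the top hit here is the
# passage that actually entails the answer.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {doc}")
```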
tylerachang.bsky.social
Our method, TrackStar, refines existing gradient-based approaches to scale to much larger settings: over 100x more queries and a 30x larger retrieval corpus than previous work at this model size.
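One common trick for scaling gradient-based TDA is to project each per-example gradient to a low-dimensional sketch, so that scoring a query against the whole corpus becomes a single matrix multiply. A rough illustration with toy dimensions and a dense random projection (not the exact TrackStar recipe):

```python
# Illustrative only: compress per-example gradients with a fixed random
# projection so that scoring a query against the whole corpus is one matmul.
# Toy dimensions; a dense projection matrix over billions of parameters would
# not fit in memory, so real implementations use structured projections.
import torch

torch.manual_seed(0)
n_params, proj_dim = 10_000, 64               # toy sizes
projector = torch.randn(proj_dim, n_params) / proj_dim ** 0.5

train_grads = torch.randn(1_000, n_params)    # stand-ins for per-example gradients
query_grad = torch.randn(n_params)            # stand-in for a query gradient

train_proj = train_grads @ projector.T        # (1000, 64): computed once, then stored
query_proj = projector @ query_grad           # (64,)
scores = train_proj @ query_proj              # approximate influence scores
print(scores.topk(5).indices)                 # most influential toy examples
```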
tylerachang.bsky.social
We scaled training data attribution (TDA) methods ~1000x to find influential pretraining examples for thousands of queries in an 8B-parameter LLM over the entire 160B-token C4 corpus!
medium.com/people-ai-re...
Reposted by Tyler Chang
catherinearnett.bsky.social
The Goldfish models were trained on byte-premium-scaled dataset sizes: if a language needs more bytes to encode a given amount of information, we scaled up its dataset according to its byte premium. Read about how we (@tylerachang.bsky.social and I) trained the models: arxiv.org/pdf/2408.10441
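The scaling itself is simple arithmetic; a small sketch below with made-up byte premiums, not the measured values from the paper:

```python
# Scale a per-language dataset size (in bytes) by its byte premium, i.e. how
# many more UTF-8 bytes the language needs than the reference language to
# encode comparable content. The premiums below are made up for illustration.
BASE_BYTES = 1_000_000_000  # reference-language dataset size in bytes

byte_premiums = {  # hypothetical values, not the measured premiums
    "eng_latn": 1.0,
    "rus_cyrl": 1.8,
    "khm_khmr": 2.9,
}

for lang, premium in byte_premiums.items():
    scaled_bytes = int(BASE_BYTES * premium)
    print(f"{lang}: train on ~{scaled_bytes / 1e9:.1f} GB of text")
```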
Reposted by Tyler Chang
catherinearnett.bsky.social
Tyler Chang's and my paper was awarded an Outstanding Paper Award at #EMNLP2024! Thanks to the award committee for the recognition!