Clara Na
@clarana.bsky.social
2.2K followers 390 following 17 posts
PhD student @ CMU LTI. efficiency/data in NLP/ML
Pinned
clarana.bsky.social
Building/customizing your own LLM? You'll want to curate training data for it, but how do you know what makes the data good?
You can try out recipes👩‍🍳 iterate on ✨vibes✨ but we can't actually test all possible combos of tweaks,,, right?? 🙅‍♂️WRONG! arxiv.org/abs/2410.15661 (1/n) 🧵
clarana.bsky.social
Yes! tbh this method is probably much more immediately useful for helping one understand subtle differences between [models trained on] subtly different data subsets, vs a loftier goal of helping one find "the" best data mixture -- to anyone considering this method, please feel free to reach out :)
tedunderwood.com
The method in this paper was designed to find an optimal data mixture. But researchers in the human sciences who are training models *in order to understand the effect of the data* might also consider this as a clever way of evaluating hundreds of subsets without training hundreds of models. #MLSky
Figure showing a modular training strategy for evaluating domain importance in training data.
At the top, a question is posed: “Which domain is most beneficial to add to the training data?” Below, the left panel labeled Modular Training displays colored blocks representing separate models trained on distinct data partitions. Each block corresponds to a “base unit” of data, and blocks of different colors represent different domains. The right panel labeled Evaluation shows overlapping combinations of these trained models being evaluated together. The strategy allows for reuse of modularly trained models and performs evaluation on parameter averages, enabling efficient simulation of many data mixtures without retraining full models for each. A legend at the bottom explains that each block represents one model trained on x billion tokens, and each outlined group represents one evaluation.
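A minimal sketch of the parameter-averaging idea the figure describes (my illustration, not the paper's actual code): train one model per data partition once, then simulate any mixture by averaging the trained weights and evaluating the merged model. The checkpoint names are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several same-architecture checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical per-domain modules; any subset of domains can be "mixed"
# by averaging, so many candidate mixtures can be scored without retraining.
mixture = average_checkpoints(["web.pt", "code.pt", "papers.pt"])
```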
clarana.bsky.social
I almost never use these so I always thought that they were cute little things that let seatmates watch the same movie
clarana.bsky.social
Come through! #492 in Hall 2, 10am-12:30pm
Reposted by Clara Na
jacobcares.bsky.social
I'm in Singapore for @iclr-conf.bsky.social ! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too
Reposted by Clara Na
datarescueproject.org
We've received multiple notes that NOAA research services (Office of Oceanic and Atmospheric Research) may go offline at midnight. @safeguardingdata.bsky.social is working on web archiving, but if others want to nominate on this, that might be good: digital2.library.unt.edu/nomination/G...
Reposted by Clara Na
uhleeeeeeeshuh.bsky.social
How can we better think and talk about human-like qualities attributed to language technologies like LLMs? In our #CHI2025 paper, we taxonomize how text outputs from cases of user interactions with language technologies can contribute to anthropomorphism. arxiv.org/abs/2502.09870 1/n
Image of the first page of the CHI 2025 paper titled "A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies" by authors Alicia DeVrio, Myra Cheng, Lisa Egede, Alexandra Olteanu, & Su Lin Blodgett
Reposted by Clara Na
akhilayerukola.bsky.social
Did you know? Gestures used to express universal concepts—like wishing for luck—vary DRAMATICALLY across cultures?
🤞 means luck in the US but is deeply offensive in Vietnam 🚨

📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I handle such nonverbal behavior!

📜: arxiv.org/abs/2502.17710
Figure showing that interpretations of gestures vary dramatically across regions and cultures. ‘Crossing your fingers,’ commonly used in the US to wish for good luck, can be deeply offensive to female audiences in parts of Vietnam. Similarly, the 'fig gesture,' a playful 'got your nose' game with children in the US, carries strong sexual connotations in Japan and can be highly offensive.
Reposted by Clara Na
kylelo.bsky.social
the science of LMs should be fully open✨

today @akshitab.bsky.social @natolambert.bsky.social and I are giving our #neurips2024 tutorial on language model development.

everything from data, training, adaptation. published or not, no secrets 🫡

tues, 12/10, 9:30am PT ☕️

neurips.cc/virtual/2024...
NeurIPS Tutorial: Opening the Language Model Pipeline: A Tutorial on Data Preparation, Model Training, and Adaptation (NeurIPS 2024)
neurips.cc
Reposted by Clara Na
casilli.bsky.social
How open is “open” AI, really?
It isn’t just about making models reusable. If the origin of data is opaque, if labor is hidden & exploited, if frameworks are dominated by Big Tech, if computational power is controlled by an oligopoly…‘open’ is just a label.

Meredith Whittaker & friends in Nature.
Reposted by Clara Na
marcmarone.com
I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux

Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!
Reposted by Clara Na
lindiatjuatja.bsky.social
💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:
Screenshot of the paper title "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length"
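For context, a standard control of the kind the paper revisits is SLOR, which subtracts a unigram-frequency baseline from the sentence log-probability and divides by length. A minimal sketch with made-up numbers (my illustration of the existing metric, not the paper's proposal):

```python
def slor(sentence_logprob, unigram_logprobs, length):
    """Syntactic log-odds ratio: a frequency- and length-normalized LM score."""
    return (sentence_logprob - sum(unigram_logprobs)) / length

# Hypothetical values for a 5-token sentence.
print(slor(-18.2, [-6.1, -2.3, -4.0, -3.5, -5.9], 5))
```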
clarana.bsky.social
Hi, I am at 232 in the back of the Riverfront room!
clarana.bsky.social
I'm at EMNLP! Presenting the poster for this paper on Thursday morning (10:30-12), Session F, Riverfront Hall, come say hi :)
Reposted by Clara Na
lindiatjuatja.bsky.social
(Hehe first bsky post!) I'll be at #EMNLP2024 💃🌴! Happy to chat about (among other things):
✨linguistically+cognitively motivated evaluation
✨NLP for low-resource+endangered languages
✨figuring out what features of language data LMs are *actually* learning
I'll be presenting two posters 🧵:
clarana.bsky.social
scrolling,,, minimal doom ?!
Reposted by Clara Na
mariaa.bsky.social
A starter pack for #NLP #NLProc researchers! 🎉

go.bsky.app/SngwGeS
clarana.bsky.social
I'll be presenting our paper at #EMNLP2024 next week -- see y'all in Miami🌴! This was my Summer 2023 work @ai2.bsky.social. Grateful to my wonderful collaborators @ianmagnusson.bsky.social @ananyahjha93.bsky.social @tomsherborne.bsky.social & mentors @strubell.bsky.social, Jesse, and Pradeep (6/n)
clarana.bsky.social
We can even predict larger model perplexity scores w/ smaller model proxy evals, AND the relationship holds even when the actual ppl scores are high (4/n)
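A minimal sketch of this kind of proxy relationship (illustrative only, with made-up numbers, not the paper's procedure): fit a simple regression from small-model perplexities to large-model perplexities on a few subsets, then predict for a new subset.

```python
import numpy as np

# Hypothetical (small-model ppl, large-model ppl) pairs on held-out subsets.
small_ppl = np.array([12.1, 15.4, 22.8, 31.0])
large_ppl = np.array([8.3, 10.2, 14.9, 19.7])

# Least-squares linear fit: polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(small_ppl, large_ppl, deg=1)

# Predict the larger model's ppl for a new subset from its proxy score.
print(slope * 18.0 + intercept)
```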