Lindia Tjuatja
@lindiatjuatja.bsky.social
2.1K followers 430 following 49 posts
a natural language processor and “sensible linguist”. PhD-ing LTI@CMU, previously BS-ing Ling+ECE@UTAustin 🤠🤖📖 she/her lindiatjuatja.github.io
Pinned
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
Reposted by Lindia Tjuatja
aaronstevenwhite.io
I've found it kind of a pain to work with resources like VerbNet, FrameNet, PropBank (frame files), and WordNet using existing tools. Maybe you have too. Here's a little package that handles data management, loading, and cross-referencing via either a CLI or a python API.
GitHub - aaronstevenwhite/glazing: Unified data models and interfaces for syntactic and semantic frame ontologies.
github.com
Reposted by Lindia Tjuatja
hadaskotek.bsky.social
Good news (for me!) my gender bias paper from 2023 still replicates with GPT-5.
Bad news (for everyone!) my gender bias paper from 2023 still replicates with GPT-5.
arxiv.org/pdf/2308.14921
hkotek.com/blog/gender-...
lindiatjuatja.bsky.social
🇦🇹 I'll be at #ACL2025! Recently I've been thinking about:
✨linguistically + cognitively-motivated evals (as always!)
✨understanding multilingualism + representation learning (new!)

I'll also be presenting a poster for BehaviorBox on Wed @ Poster Session 4 (Hall 4/5, 10-11:30)!
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
Reposted by Lindia Tjuatja
strubell.bsky.social
I did an interview w/ Pittsburgh's NPR station to share some of my views on the topic of the McCormick/Trump AI & Energy summit at CMU tomorrow. Despite the event being hosted at the university, there will not be opportunities for our university experts to contribute viewpoints.
WESA @wesa.fm · Jul 14
President Donald Trump travels to Carnegie Mellon University Tuesday for a summit on energy and artificial intelligence. Leaders say Western Pennsylvania's universities and natural-gas deposits could be vital to both industries. But researchers are concerned about AI's energy demands.
With Trump set to attend AI & energy summit, CMU professor worries climate issues will be lost
Carnegie Mellon University professor Emma Strubell says that while AI is promising, the threat of climate change "does keep me up at night a lot"
www.wesa.fm
Reposted by Lindia Tjuatja
aolteanu.bsky.social
We have to talk about rigor in AI work and what it should entail. The reality is that impoverished notions of rigor don't just lead to one-off undesirable outcomes; they can have a deeply formative impact on the scientific integrity and quality of both AI research and practice 1/
Print screen of the first page of a paper pre-print titled "Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor" by Olteanu et al.  Paper abstract: "In AI research and practice, rigor remains largely understood in terms of methodological rigor -- such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception -- in addition to a more expansive understanding of (1) methodological rigor -- should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders."
Reposted by Lindia Tjuatja
gneubig.bsky.social
Where does one language model outperform the other?

We examine this from first principles, performing unsupervised discovery of "abilities" that one model has and the other does not.

Results show interesting differences between model classes, sizes and pre-/post-training.
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
lindiatjuatja.bsky.social
9/9 Finally, we found that we can use the discovered features to distinguish actual *generations* from these models, showing the connection between features from a predetermined corpus and the actual output behavior of the models!
lindiatjuatja.bsky.social
8/9 Furthermore, models that show a small diff in perplexity can have a large number of features where they differ!

Comparisons between Llama2 and OLMo2 models of the same size (which barely show a diff in perplexity) had the greatest number of discovered features.
lindiatjuatja.bsky.social
7/9 We apply BehaviorBox to models that differ in size, model-family, and post-training. We can find features showing that larger models are better at handling long-tail stylistic features (e.g. archaic spelling) and that RLHF-ed models are better at conversational expressions:
lindiatjuatja.bsky.social
6/9 We filter for features that show a median difference in probability between the two LMs greater than a cutoff value, then automatically label these features with a strong LLM by providing their representative examples.
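Roughly, the filtering and labeling could look like this (a sketch, not the paper's code: the cutoff, the statistic, and the example words are all placeholders, and the labeling call is left abstract):

```python
# Keep a feature only if, over its top-activating words, the median gap in
# probability between the two LMs exceeds a cutoff; then ask a strong LLM to
# describe what those words have in common.
import numpy as np

def keep_feature(prob_diffs: np.ndarray, cutoff: float = 0.1) -> bool:
    """prob_diffs: P_A(word) - P_B(word) for the feature's representative words."""
    return abs(np.median(prob_diffs)) > cutoff

diffs = np.array([0.18, 0.22, 0.05, 0.30, 0.12])   # toy numbers
if keep_feature(diffs):
    prompt = (
        "Model A assigns these words higher probability than model B, and they all "
        "activate the same feature:\n"
        "- 'thou' in 'thou art mistaken'\n"        # toy examples, not from the paper
        "- 'hast' in 'what hast thou done'\n"
        "Briefly describe what they have in common."
    )
    # label = strong_llm(prompt)  # labeling model left abstract
```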
lindiatjuatja.bsky.social
5/9 We then use the SAE to learn a higher-dim representation of these embeddings. Like previous work, we treat each learned dim as a feature, with the group of words leading to the highest activation of the feature as representative examples.
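A toy version of that step (a standard sparse-autoencoder recipe; the dimensions, activation, and sparsity penalty here are placeholders rather than the paper's settings):

```python
# Project (performance-aware) embeddings into a wider, sparse feature space;
# each feature's top-activating words serve as its representative examples.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_in: int, d_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_feat)
        self.dec = nn.Linear(d_feat, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))      # sparse feature activations
        return self.dec(z), z

sae = TinySAE(d_in=770, d_feat=4096)     # toy sizes
x = torch.randn(100, 770)                # 100 words' performance-aware embeddings
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity

# representative examples for one feature: the words that activate it most strongly
feature_id = 7
top_word_idx = z[:, feature_id].topk(k=5).indices
```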
lindiatjuatja.bsky.social
4/9 To find features that describe a performance difference between two LMs, we train a SAE on *performance-aware embeddings*: contextual word embeddings from a separate pre-trained LM, concatenated with probabilities of these words under the LMs being evaluated.
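A minimal sketch of how such an input could be assembled (an illustration, not the released BehaviorBox code; whether raw probabilities or log-probabilities are used, and how subwords are pooled into words, are details I'm not reproducing here):

```python
# Concatenate each word's contextual embedding (from a separate encoder) with
# its probability under each of the two LMs being compared.
import torch

def performance_aware_embeddings(word_emb: torch.Tensor,
                                 prob_model_a: torch.Tensor,
                                 prob_model_b: torch.Tensor) -> torch.Tensor:
    """word_emb: (n_words, d) contextual embeddings.
    prob_model_a / prob_model_b: (n_words,) per-word probabilities under each LM."""
    perf = torch.stack([prob_model_a, prob_model_b], dim=-1)  # (n_words, 2)
    return torch.cat([word_emb, perf], dim=-1)                # (n_words, d + 2)

# toy example with made-up numbers
emb = torch.randn(5, 768)
p_a = torch.tensor([0.31, 0.67, 0.05, 0.44, 0.12])
p_b = torch.tensor([0.35, 0.52, 0.21, 0.40, 0.09])
x = performance_aware_embeddings(emb, p_a, p_b)
print(x.shape)  # torch.Size([5, 770])
```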
lindiatjuatja.bsky.social
3/9 Our method BehaviorBox both 🔍finds and ✍️describes these fine-grained features at the word level. We use (*gasp*) SAEs as our method to find said features.

What makes our method distinct is the data we use as input to the SAE, which allows us to find *comparative features*.
lindiatjuatja.bsky.social
2/9 While corpus-level perplexity is a standard metric, it often hides fine-grained differences. Given a particular corpus, how can we find features of text that describe where model A > model B, and vice versa?
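A rough sketch of the kind of word-level comparison that corpus perplexity averages away (not the paper's code; the model pair is a placeholder, and the two models must share a tokenizer for the per-token alignment to work):

```python
# Score the same text under two LMs and look at where their per-token
# log-probabilities diverge, instead of averaging into one perplexity number.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model_name: str, text: str):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)   # prediction for each next token
    targets = ids[:, 1:]
    token_lp = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    return tok.convert_ids_to_tokens(targets[0].tolist()), token_lp.tolist()

text = "The cat sat on the mat."
toks, lp_a = token_logprobs("gpt2", text)          # placeholder model pair
_, lp_b = token_logprobs("gpt2-medium", text)
for t, a, b in zip(toks, lp_a, lp_b):
    print(f"{t:>10}  A={a:6.2f}  B={b:6.2f}  diff={a - b:+.2f}")
```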
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
lindiatjuatja.bsky.social
Hanging around NAACL and presenting this Thurs, 4:15 @ ling theories oral session (ballroom 🅱️). Come say hi, will also be eating many a sopapilla
lindiatjuatja.bsky.social
💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:
Screenshot of the paper title "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length"
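For context, the usual control being rethought here looks something like SLOR (Pauls & Klein 2012; Lau et al. 2017), which normalizes an LM's sentence log-probability by length and subtracts a unigram (frequency) term. A sketch of that standard recipe, not the paper's proposal:

```python
# SLOR: (log P_LM(s) - log P_unigram(s)) / |s|
# Higher values ~ more acceptable after "controlling" for frequency and length.
def slor(sentence_logprob: float, unigram_logprob: float, n_tokens: int) -> float:
    return (sentence_logprob - unigram_logprob) / n_tokens

# toy numbers
print(slor(sentence_logprob=-42.0, unigram_logprob=-55.0, n_tokens=10))  # 1.3
```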
lindiatjuatja.bsky.social
wow that sounds and looks delicious
Reposted by Lindia Tjuatja
tedunderwood.com
I wasn’t super excited by o1, but as reasoning models go open-weights I’m starting to see how they make this interesting again. The 2022-24 “just scale up” period was both very effective and very boring.
lindiatjuatja.bsky.social
Accepted to NAACL main! See yall in NM ☀️
lindiatjuatja.bsky.social
💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:
Screenshot of the paper title "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length"
lindiatjuatja.bsky.social
I am once again asking for {cafe, food, work spots, things to see and do} for a place I will be visiting: the baaaay 🌁

(My first time visiting NorCal *ever* so the regular tourist spots are welcome!)
lindiatjuatja.bsky.social
Paperlike! I’ve been using mine for years and I like it
Reposted by Lindia Tjuatja
joeystanley.com
I don't remember who created this, where I got it from, or how long I've had it, but I have it on my slides as students walk in the first time we talk about Labov's NYC study. And it makes me chuckle every time I see it for some reason.

"Very rhotic. Very stratified." 😆
The movie poster for "Love Actually" but changed to "Labov Actually" with his face pasted over everyone else's and fun changes throughout like "very romantic, very comedy" changed to "very rhotic, very stratified."