Lindia Tjuatja
@lindiatjuatja.bsky.social
2.1K followers 430 following 49 posts
a natural language processor and “sensible linguist”. PhD-ing LTI@CMU, previously BS-ing Ling+ECE@UTAustin 🤠🤖📖 she/her lindiatjuatja.github.io
Pinned
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
Reposted by Lindia Tjuatja
aaronstevenwhite.io
I've found it kind of a pain to work with resources like VerbNet, FrameNet, PropBank (frame files), and WordNet using existing tools. Maybe you have too. Here's a little package that handles data management, loading, and cross-referencing via either a CLI or a python API.
GitHub - aaronstevenwhite/glazing: Unified data models and interfaces for syntactic and semantic frame ontologies.
github.com
Reposted by Lindia Tjuatja
hadaskotek.bsky.social
Good news (for me!) my gender bias paper from 2023 still replicates with GPT-5.
Bad news (for everyone!) my gender bias paper from 2023 still replicates with GPT-5.
arxiv.org/pdf/2308.14921
hkotek.com/blog/gender-...
lindiatjuatja.bsky.social
🇦🇹 I'll be at #ACL2025! Recently I've been thinking about:
✨linguistically + cognitively-motivated evals (as always!)
✨understanding multilingualism + representation learning (new!)

I'll also be presenting a poster for BehaviorBox on Wed @ Poster Session 4 (Hall 4/5, 10-11:30)!
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
Reposted by Lindia Tjuatja
strubell.bsky.social
I did an interview w/ Pittsburgh's NPR station to share some of my views on the topic of the McCormick/Trump AI & Energy summit at CMU tomorrow. Despite the event being hosted at the university, there will not be opportunities for our university experts to contribute viewpoints.
WESA @wesa.fm · Jul 14
President Donald Trump travels to Carnegie Mellon University Tuesday for a summit on energy and artificial intelligence. Leaders say Western Pennsylvania's universities and natural-gas deposits could be vital to both industries. But researchers are concerned about AI's energy demands.
With Trump set to attend AI & energy summit, CMU professor worries climate issues will be lost
Carnegie Mellon University professor Emma Strubell says that while AI is promising, the threat of climate change "does keep me up at night a lot"
www.wesa.fm
Reposted by Lindia Tjuatja
aolteanu.bsky.social
We have to talk about rigor in AI work and what it should entail. The reality is that impoverished notions of rigor don't just lead to one-off undesirable outcomes; they can have a deeply formative impact on the scientific integrity and quality of both AI research and practice 1/
Print screen of the first page of a paper pre-print titled "Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor" by Olteanu et al.  Paper abstract: "In AI research and practice, rigor remains largely understood in terms of methodological rigor -- such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception -- in addition to a more expansive understanding of (1) methodological rigor -- should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders."
Reposted by Lindia Tjuatja
gneubig.bsky.social
Where does one language model outperform the other?

We examine this from first principles, performing unsupervised discovery of "abilities" that one model has and the other does not.

Results show interesting differences between model classes, sizes and pre-/post-training.
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
lindiatjuatja.bsky.social
9/9 Finally, we found that we can use the discovered features to distinguish actual *generations* from these models, showing the connection between features from a predetermined corpus and the actual output behavior of the models!
lindiatjuatja.bsky.social
8/9 Furthermore, models that show a small diff in perplexity can have a large number of features where they differ!

Comparisons between Llama2 and OLMo2 models of the same size (which barely show a diff in perplexity) had the greatest number of discovered features.
lindiatjuatja.bsky.social
7/9 We apply BehaviorBox to models that differ in size, model-family, and post-training. We can find features showing that larger models are better at handling long-tail stylistic features (e.g. archaic spelling) and that RLHF-ed models are better at conversational expressions:
lindiatjuatja.bsky.social
6/9 We filter for features that show a median difference in probability between the two LMs greater than a cutoff value, then automatically label these features with a strong LLM by providing their representative examples.
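Roughly, the filtering and labeling could look like this (a sketch, not the paper's code: the cutoff, the statistic, and the example words are all placeholders, and the labeling call is left abstract):

```python
# Keep a feature only if, over its top-activating words, the median gap in
# probability between the two LMs exceeds a cutoff; then ask a strong LLM to
# describe what those words have in common.
import numpy as np

def keep_feature(prob_diffs: np.ndarray, cutoff: float = 0.1) -> bool:
    """prob_diffs: P_A(word) - P_B(word) for the feature's representative words."""
    return abs(np.median(prob_diffs)) > cutoff

diffs = np.array([0.18, 0.22, 0.05, 0.30, 0.12])   # toy numbers
if keep_feature(diffs):
    prompt = (
        "Model A assigns these words higher probability than model B, and they all "
        "activate the same feature:\n"
        "- 'thou' in 'thou art mistaken'\n"        # toy examples, not from the paper
        "- 'hast' in 'what hast thou done'\n"
        "Briefly describe what they have in common."
    )
    # label = strong_llm(prompt)  # labeling model left abstract
```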
lindiatjuatja.bsky.social
5/9 We then use the SAE to learn a higher-dim representation of these embeddings. Like previous work, we treat each learned dim as a feature, with the group of words leading to the highest activation of the feature as representative examples.
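A toy version of that step (a standard sparse-autoencoder recipe; the dimensions, activation, and sparsity penalty here are placeholders rather than the paper's settings):

```python
# Project (performance-aware) embeddings into a wider, sparse feature space;
# each feature's top-activating words serve as its representative examples.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_in: int, d_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_feat)
        self.dec = nn.Linear(d_feat, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))      # sparse feature activations
        return self.dec(z), z

sae = TinySAE(d_in=770, d_feat=4096)     # toy sizes
x = torch.randn(100, 770)                # 100 words' performance-aware embeddings
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity

# representative examples for one feature: the words that activate it most strongly
feature_id = 7
top_word_idx = z[:, feature_id].topk(k=5).indices
```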
lindiatjuatja.bsky.social
4/9 To find features that describe a performance difference between two LMs, we train a SAE on *performance-aware embeddings*: contextual word embeddings from a separate pre-trained LM, concatenated with probabilities of these words under the LMs being evaluated.
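A minimal sketch of how such an input could be assembled (an illustration, not the released BehaviorBox code; whether raw probabilities or log-probabilities are used, and how subwords are pooled into words, are details I'm not reproducing here):

```python
# Concatenate each word's contextual embedding (from a separate encoder) with
# its probability under each of the two LMs being compared.
import torch

def performance_aware_embeddings(word_emb: torch.Tensor,
                                 prob_model_a: torch.Tensor,
                                 prob_model_b: torch.Tensor) -> torch.Tensor:
    """word_emb: (n_words, d) contextual embeddings.
    prob_model_a / prob_model_b: (n_words,) per-word probabilities under each LM."""
    perf = torch.stack([prob_model_a, prob_model_b], dim=-1)  # (n_words, 2)
    return torch.cat([word_emb, perf], dim=-1)                # (n_words, d + 2)

# toy example with made-up numbers
emb = torch.randn(5, 768)
p_a = torch.tensor([0.31, 0.67, 0.05, 0.44, 0.12])
p_b = torch.tensor([0.35, 0.52, 0.21, 0.40, 0.09])
x = performance_aware_embeddings(emb, p_a, p_b)
print(x.shape)  # torch.Size([5, 770])
```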
lindiatjuatja.bsky.social
3/9 Our method BehaviorBox both 🔍finds and ✍️describes these fine-grained features at the word level. We use (*gasp*) SAEs as our method to find said features.

What makes our method distinct is the data we use as input to the SAE, which allows us to find *comparative features*.
lindiatjuatja.bsky.social
2/9 While corpus-level perplexity is a standard metric, it often hides fine-grained differences. Given a particular corpus, how can we find features of text that describe where model A > model B, and vice versa?
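A rough sketch of the kind of word-level comparison that corpus perplexity averages away (not the paper's code; the model pair is a placeholder, and the two models must share a tokenizer for the per-token alignment to work):

```python
# Score the same text under two LMs and look at where their per-token
# log-probabilities diverge, instead of averaging into one perplexity number.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model_name: str, text: str):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)   # prediction for each next token
    targets = ids[:, 1:]
    token_lp = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    return tok.convert_ids_to_tokens(targets[0].tolist()), token_lp.tolist()

text = "The cat sat on the mat."
toks, lp_a = token_logprobs("gpt2", text)          # placeholder model pair
_, lp_b = token_logprobs("gpt2-medium", text)
for t, a, b in zip(toks, lp_a, lp_b):
    print(f"{t:>10}  A={a:6.2f}  B={b:6.2f}  diff={a - b:+.2f}")
```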
lindiatjuatja.bsky.social
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9
lindiatjuatja.bsky.social
Hanging around NAACL and presenting this Thurs, 4:15 @ ling theories oral session (ballroom 🅱️). Come say hi, will also be eating many a sopapilla
lindiatjuatja.bsky.social
💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:
Screenshot of the paper title "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length"
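For context, the usual control being rethought here looks something like SLOR (Pauls & Klein 2012; Lau et al. 2017), which normalizes an LM's sentence log-probability by length and subtracts a unigram (frequency) term. A sketch of that standard recipe, not the paper's proposal:

```python
# SLOR: (log P_LM(s) - log P_unigram(s)) / |s|
# Higher values ~ more acceptable after "controlling" for frequency and length.
def slor(sentence_logprob: float, unigram_logprob: float, n_tokens: int) -> float:
    return (sentence_logprob - unigram_logprob) / n_tokens

# toy numbers
print(slor(sentence_logprob=-42.0, unigram_logprob=-55.0, n_tokens=10))  # 1.3
```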
lindiatjuatja.bsky.social
wow that sounds and looks delicious
Reposted by Lindia Tjuatja
tedunderwood.com
I wasn’t super excited by o1, but as reasoning models go open-weights I’m starting to see how they make this interesting again. The 2022-24 “just scale up” period was both very effective and very boring.
lindiatjuatja.bsky.social
Accepted to NAACL main! See yall in NM ☀️
lindiatjuatja.bsky.social
💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length!
🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:
Screenshot of the paper title "What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length"
lindiatjuatja.bsky.social
I am once again asking for {cafe, food, work spots, things to see and do} for a place I will be visiting: the baaaay 🌁

(My first time visiting NorCal *ever* so the regular tourist spots are welcome!)
lindiatjuatja.bsky.social
Paperlike! I’ve been using mine for years and I like it
Reposted by Lindia Tjuatja
joeystanley.com
I don't remember who created this, where I got it from, or how long I've had it, but I have it on my slides as students walk in the first time we talk about Labov's NYC study. And it makes me chuckle every time I see it for some reason.

"Very rhotic. Very stratified." 😆
The movie poster for "Love Actually" but changed to "Labov Actually" with his face pasted over everyone else's and fun changes throughout like "very romantic, very comedy" changed to "very rhotic, very stratified."