Chau Minh Pham
@chautmpham.bsky.social
2.1K followers 560 following 33 posts
PhD student @umdcs | Long-form Narrative Generation & Analysis | Intern @AdobeResearch @MSFTResearch | https://chtmp223.github.io
Pinned
chautmpham.bsky.social
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
Reposted by Chau Minh Pham
mariaa.bsky.social
Keynote at #COLM2025: Nicholas Carlini from Anthropic

"Are language models worth it?"

He explains that the prior decade of his work on adversarial images, while it taught us a lot, isn't very applied; it's unlikely anyone is actually altering images of cats in scary ways.
Reposted by Chau Minh Pham
valentinhofmann.bsky.social
📢 New #COLM2025 paper 📢

Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴

Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.

🧵
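For intuition, here is a minimal sketch of one way capability-adaptive evaluation can work: pick each next question by its Fisher information under a 2PL item-response-theory model of the examinee's ability. The item parameters, update rule, and simulated answers below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of capability-adaptive evaluation with a 2PL IRT model.
# Item parameters and the grid-based ability update are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, 500)       # item discrimination
b = rng.normal(0.0, 1.0, 500)        # item difficulty
true_theta = 1.2                     # latent ability of the model under test

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta_grid = np.linspace(-4, 4, 401)
log_post = np.zeros_like(theta_grid)             # flat prior over ability
asked = np.zeros(len(a), dtype=bool)

for step in range(30):                           # 30 adaptive items instead of all 500
    theta_hat = theta_grid[np.argmax(log_post)]
    p = p_correct(theta_hat, a, b)
    info = a**2 * p * (1 - p)                    # Fisher information at current estimate
    info[asked] = -np.inf
    i = int(np.argmax(info))                     # most informative unasked item
    asked[i] = True
    correct = rng.random() < p_correct(true_theta, a[i], b[i])  # simulated answer
    p_grid = p_correct(theta_grid, a[i], b[i])
    log_post += np.log(p_grid if correct else 1 - p_grid)

print("estimated ability:", theta_grid[np.argmax(log_post)])
```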
Reposted by Chau Minh Pham
mariaa.bsky.social
What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?

Looking for practical methods for settings where human annotations are costly.

A few examples in thread ↴
Reposted by Chau Minh Pham
mariaa.bsky.social
I see this work as our answer to the "cultural alignment" and "cultural benchmarking" trends in NLP research. Instead of making decisions for people, we consider "culture" in a specific setting with specific people for a specific task, and we ask people directly about their cultural adaptations.
shaily99.bsky.social
🖋️ Curious how writing differs across (research) cultures?
🚩 Tired of “cultural” evals that don't consult people?

We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗

📜 arxiv.org/abs/2506.00784

[1/11]
An overview of the work “Research Borderlands: Analysing Writing Across Research Cultures” by Shaily Bhatt, Tal August, and Maria Antoniak. The overview describes how we survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). We then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).
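For flavor, here is a toy example of what operationalising writing norms with computational metrics can look like. The hedge list and the two metrics are invented stand-ins; the paper's actual metric suite is richer.

```python
# Toy sketch of surface-level writing metrics; the hedge list and the two
# metrics are illustrative stand-ins, not the paper's operationalisation.
import re

HEDGES = {"may", "might", "could", "suggest", "suggests", "possibly", "likely"}

def writing_metrics(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "hedge_density": sum(t in HEDGES for t in tokens) / max(len(tokens), 1),
    }

nlp_abstract = "We propose a model. Results suggest it may help on three tasks."
hci_abstract = "We interviewed twelve participants and report themes from the study."
print(writing_metrics(nlp_abstract))
print(writing_metrics(hci_abstract))
```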
chautmpham.bsky.social
We release code to facilitate future research on fine-grained detection of mixed-origin texts and human-AI co-writing.

Github: github.com/chtmp223/Fra...
Paper: arxiv.org/abs/2505.18128

Work done with @jennajrussell, @dzungvietpham, and @MohitIyyer!
GitHub - chtmp223/Frankentext: Frankentext: Stitching random text fragments into long-form narratives
chautmpham.bsky.social
Room for improvement:

🔧 Frankentexts struggle with smooth narrative transitions and grammar, as noted by human annotators.
🔩 Non-fiction versions are coherent and faithful but tend to be overly anecdotal and lack factual accuracy.
chautmpham.bsky.social
Takeaway 2: Our controllable generation process provides a sandbox for human-AI co-writing research, with adjustable proportion, length, and diversity of human excerpts.

👫 Models can follow copy constraints, which is a proxy for % of human writing in co-authored texts.
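One simple way to operationalise such a copy constraint (my sketch, not necessarily the paper's metric) is the fraction of generated tokens covered by long verbatim n-grams from the human source excerpts:

```python
# Rough sketch: fraction of generated tokens covered by verbatim n-grams
# that also appear in the human-written source excerpts (n=8 is arbitrary).
def copy_ratio(generated: str, sources: list[str], n: int = 8) -> float:
    src_ngrams = set()
    for s in sources:
        toks = s.split()
        src_ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    toks = generated.split()
    covered = [False] * len(toks)
    for i in range(len(toks) - n + 1):
        if tuple(toks[i:i + n]) in src_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / max(len(toks), 1)

# e.g. checking a 90% copy target: copy_ratio(draft, excerpts) >= 0.9
```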
chautmpham.bsky.social
Takeaway 1: Frankentexts don’t fit into the "AI vs. human" binary.

📉 Binary detectors misclassify them as human-written
👨‍👩‍👧 Humans can detect AI involvement more often
🔍 Mixed-authorship tools (Pangram) help, but still catch only 59%

We need better tools for this gray zone.
chautmpham.bsky.social
Automatic evaluation on 100 Frankentexts using LLM judges, text detectors, and a ROUGE-L-based metric shows that:

💪 Gemini-2.5-Pro, Claude-3.5-Sonnet, and R1 can generate Frankentexts that are up to 90% relevant, 70% coherent, and 75% traceable to the original human writings.
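A ROUGE-L-style traceability check can be approximated as follows: score each generated sentence by its longest common subsequence with the best-matching source paragraph. This is a sketch of the general idea; the paper's exact metric may differ.

```python
# Sketch of an LCS-based (ROUGE-L-like) traceability score: for each generated
# sentence, how much of it can be traced to its best-matching source paragraph.
def lcs_len(a: list[str], b: list[str]) -> int:
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def traceability(sentences: list[str], sources: list[str]) -> float:
    scores = []
    for sent in sentences:
        toks = sent.split()
        best = max(lcs_len(toks, src.split()) for src in sources)
        scores.append(best / max(len(toks), 1))   # LCS recall w.r.t. the sentence
    return sum(scores) / max(len(scores), 1)
```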
chautmpham.bsky.social
Frankentext generation presents an instruction-following task that challenges the limits of controllable generation, requiring each model to:

1️⃣ Produce a draft by selecting & combining human-written passages.
2️⃣ Iteratively revise the draft while maintaining a copy ratio.
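A minimal sketch of that two-stage loop is below. The LLM calls, prompts, and round limit are placeholders, not the released code; any copy-ratio metric (such as the sketch earlier in this thread) can be plugged in.

```python
# Hypothetical driver for the two-stage process described above; call_llm,
# copy_ratio, the prompts, and the 0.9 target are placeholders.
def generate_frankentext(paragraphs: list[str], prompt: str, call_llm, copy_ratio,
                         target: float = 0.9, max_rounds: int = 5) -> str:
    # Stage 1: draft by selecting and stitching human-written passages.
    draft = call_llm(f"{prompt}\n\nWrite a story reusing these passages:\n"
                     + "\n\n".join(paragraphs))
    # Stage 2: iteratively revise while keeping the copy ratio near the target.
    for _ in range(max_rounds):
        ratio = copy_ratio(draft, paragraphs)
        if ratio >= target:
            break
        draft = call_llm(f"Revise so at least {target:.0%} of the text is copied "
                         f"verbatim from the passages (currently {ratio:.0%}):\n{draft}")
    return draft
```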
chautmpham.bsky.social
We find that LLMs (e.g. GPT-4o, LLaMA-3.1) consistently recall book content across languages, even for texts without an official translation in their pre-training data!

Great work led by undergrads at UMass NLP 🥳
nhminhle.bsky.social
LLMs memorize novels 📚 in English. But what about existing translations? Or translations into new languages?

Our 🦉OWL dataset (31K/10 languages) shows GPT4o recognizes books:
92% English
83% official translations
69% unseen translations
75% as audio (EN)
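For a sense of how such recognition can be probed (my assumption of a generic protocol, not necessarily OWL's): show the model an excerpt and ask it to name the book, then compare accuracy across English originals, official translations, and unseen translations.

```python
# Hypothetical recognition probe; `ask_model` and the prompt are placeholders
# and may not match the OWL evaluation protocol.
def recognition_accuracy(excerpts, gold_titles, ask_model) -> float:
    correct = 0
    for excerpt, title in zip(excerpts, gold_titles):
        guess = ask_model("Which novel is this passage from? "
                          "Answer with the title only.\n\n" + excerpt)
        correct += int(title.lower() in guess.lower())
    return correct / max(len(gold_titles), 1)

# Run separately on English originals, official translations, and
# machine-translated (unseen) versions to compare recognition rates.
```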
Reposted by Chau Minh Pham
juand-r.bsky.social
One of the ways that LLMs can be inconsistent is the "generator-validator gap," where LLMs deem their own answers incorrect.

🎯 We demonstrate that ranking-based discriminator training can significantly reduce this gap, and improvements on one task often generalize to others!

🧵👇
A visualization of the generator-validator gap, where the LM likelihoods for the generator and discriminator forms of questions are poorly correlated. Aligning the validator and generator rankings can fix it!
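One simple way to quantify such a gap (an illustrative sketch, not the paper's exact procedure): compare the likelihood the model assigns to its own answer with the likelihood it assigns to judging that answer correct, and check how well the two rankings agree across questions.

```python
# Illustrative measurement of a generator-validator gap; `gen_logprob` and
# `val_logprob` stand in for model scoring calls and are not a specific API.
from scipy.stats import spearmanr

def gv_gap(questions, answers, gen_logprob, val_logprob) -> float:
    # Generator score: log-likelihood of producing the answer given the question.
    g = [gen_logprob(q, a) for q, a in zip(questions, answers)]
    # Validator score: log-likelihood of judging that same answer as correct.
    v = [val_logprob(f"Q: {q}\nA: {a}\nIs this answer correct?", "Yes")
         for q, a in zip(questions, answers)]
    rho, _ = spearmanr(g, v)
    return 1.0 - rho    # large when generator and validator rankings disagree

# Ranking-based discriminator training aims to push this gap toward zero.
```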
Reposted by Chau Minh Pham
natolambert.bsky.social
A very cool paper shows that you can use an RL loss to improve story generation via some clever setups for training on known texts (e.g., grounding predictions against a next chapter you already know). RL is starting to generalize already!
Learning to Reason for Long-Form Story Generation
Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to…
Reposted by Chau Minh Pham
markar.bsky.social
We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models including #Gemini 2.5 Pro which shows massive improvements over the previous version! Congrats to #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)
Leaderboard showing performance of language models on claim verification task over book-length input. o1-preview is the best model with 67.36% accuracy followed by Gemini 2.5 Pro with 64.17% accuracy.
Reposted by Chau Minh Pham
siangooding.bsky.social
New paper from our team @GoogleDeepMind!

🚨 We've put LLMs to the test as writing co-pilots – how good are they really at helping us write? LLMs are increasingly used for open-ended tasks like writing assistance, but how do we assess their effectiveness? 🤔

arxiv.org/pdf/2503.19711
Reposted by Chau Minh Pham
kennypeng.bsky.social
Our lab had a #dogathon 🐕 yesterday where we analyzed NYC Open Data on dog licenses. We learned a lot of dog facts, which I’ll share in this thread 🧵

1) Geospatial trends: Cavalier King Charles Spaniels are common in Manhattan; the opposite is true for Yorkshire Terriers.
Reposted by Chau Minh Pham
digthatdata.bsky.social
The high-effort solution is to use an LLM to make a browser extension that tracks your academic reading and logs every paper you interact with to GitHub, which builds and publishes a webapp to expose the data.

Which, clearly only a crazy weirdo would do.

dmarx.github.io/papers-feed/
ArXiv Paper Feed
Reposted by Chau Minh Pham
rajmovva.bsky.social
💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets.

Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/
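For intuition, here is a toy version of the general recipe, not the HypotheSAEs package itself: train a sparse autoencoder on text embeddings, then rank the latent features by how well they predict the target. The embeddings, target, dimensions, and sparsity weight below are all placeholders.

```python
# Toy version of the recipe (not the HypotheSAEs package): sparse autoencoder
# on text embeddings, then rank latent features by correlation with a target.
import numpy as np
import torch
import torch.nn as nn

emb = torch.randn(1000, 384)            # stand-in for sentence embeddings
target = np.random.rand(1000)           # e.g. engagement per headline

class SAE(nn.Module):
    def __init__(self, d_in=384, d_hidden=2048):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)
    def forward(self, x):
        z = torch.relu(self.enc(x))      # non-negative sparse codes
        return self.dec(z), z

sae = SAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    recon, z = sae(emb)
    loss = ((recon - emb) ** 2).mean() + 1e-3 * z.abs().mean()  # L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    _, z = sae(emb)
acts = z.numpy()
corrs = np.array([np.corrcoef(acts[:, j], target)[0, 1] if acts[:, j].std() > 0 else 0.0
                  for j in range(acts.shape[1])])
top = np.argsort(-np.abs(corrs))[:10]   # candidate features to interpret and label
print(top, corrs[top])
```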
chautmpham.bsky.social
Ask OpenAI Operator for bus routes from your home in Vietnam to a university and it likely fails because it refuses to use Google Maps! Our new BEARCUBS 🐻 benchmark shows computer-using (CU) agents still struggle with seemingly straightforward multimodal questions.
yixiaosong.bsky.social
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
Reposted by Chau Minh Pham
yekyung.bsky.social
Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇
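A tiny sketch of how such a probe can be built (prompt wording and insertion scheme are illustrative only): hide a "needle" sentence inside a long haystack, or ask about a needle that was never inserted and expect the model to answer "none".

```python
# Minimal sketch of a needle-in-a-haystack probe with a "nonexistent needle"
# condition; prompt wording and insertion scheme are illustrative only.
import random

def build_prompt(haystack_sentences, needle=None, seed=0):
    sents = list(haystack_sentences)
    if needle is not None:                       # standard NIAH: hide the needle
        random.Random(seed).shuffle(sents)
        sents.insert(len(sents) // 2, needle)
    question = ("What is the magic number mentioned in the text? "
                "If no magic number is mentioned, answer 'none'.")
    return " ".join(sents) + "\n\n" + question

filler = ["The weather was unremarkable that day."] * 5000
present = build_prompt(filler, needle="The magic number is 4217.")
absent = build_prompt(filler, needle=None)       # correct answer here is 'none'
```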
Reposted by Chau Minh Pham
mellymeldubs.bsky.social
Excited to share our preprint "Provocations from the Humanities for Generative AI Research"

We're open to feedback—read & share thoughts!

@laurenfklein.bsky.social @mmvty.bsky.social @docdre.distributedblackness.net @mariaa.bsky.social @jmjafrx.bsky.social @nolauren.bsky.social @dmimno.bsky.social
Screenshot of the first page of preprint, "Provocations from the Humanities for Generative AI Research," by Lauren Klein, Meredith Martin, Andre Brock, Maria Antoniak, Melanie Walsh, Jessica Marie Johnson, Lauren Tilton, and David Mimno
Reposted by Chau Minh Pham
nbalepur.bsky.social
🚨 New Position Paper 🚨

Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬

We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠

Here's why MCQA evals are broken, and how to fix them 🧵