Yapei Chang
@yapeichang.bsky.social
2.8K followers 650 following 34 posts
☁️ phd in progress @ UMD | 🔗 https://lilakk.github.io/
yapeichang.bsky.social
Paper: arxiv.org/pdf/2505.11080
Code: github.com/lilakk/BLEUB... (coming soon)

Work done with the amazing @yekyung.bsky.social from UMD, Michael Krumdick from Kensho, Amir Zadeh and Chuan Li from LambdaAI,
@chriswtanner.bsky.social from Kensho, and @miyyer.bsky.social from UMD
yapeichang.bsky.social
Beyond benchmarks, human annotators rate BLEUBERI outputs as comparable to those from GRPO-RM models.
yapeichang.bsky.social
Qualitatively, BLEUBERI models produce more factually grounded outputs, as measured by VeriScore on three diverse datasets. VeriScore extracts verifiable claims from responses and checks each one against Google Search.
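(For intuition, the rough shape of a VeriScore-style check is sketched below; `extract_claims` and `is_supported` are hypothetical stand-ins for the metric's LLM-based claim extractor and Google-Search-backed verifier, not the real implementation.)

```python
def extract_claims(response: str) -> list[str]:
    # hypothetical stand-in for VeriScore's LLM-based claim extractor;
    # here we naively treat each sentence as one claim
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim: str, evidence: list[str]) -> bool:
    # hypothetical stand-in for the search-backed verifier;
    # here, naive substring overlap with retrieved snippets
    return any(claim.lower() in snippet.lower() for snippet in evidence)

def veriscore_style(response: str, retrieve) -> float:
    """Rough shape of the metric: extract verifiable claims, check each
    against retrieved evidence, report the supported fraction."""
    claims = extract_claims(response)
    if not claims:
        return 0.0
    return sum(is_supported(c, retrieve(c)) for c in claims) / len(claims)
```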
yapeichang.bsky.social
The surprising effectiveness of BLEU extends to training. BLEUBERI first selects 5K low-BLEU examples, then trains LLMs with GRPO using BLEU as the reward. BLEUBERI models are as competitive as those trained with GRPO-RM (8B) and SFT across 4 benchmarks.
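To give a sense of how lightweight this reward is, here is a sketch of a BLEU reward function in the style TRL's GRPOTrainer expects (a callable returning one score per completion); the `reference` column name is an illustrative assumption, not necessarily our exact setup.

```python
import sacrebleu

def bleu_reward(completions, reference, **kwargs):
    """GRPO reward sketch: sentence-level BLEU (0-100 scale) of each
    sampled completion against the reference answer for its prompt."""
    return [
        sacrebleu.sentence_bleu(completion, [ref]).score
        for completion, ref in zip(completions, reference)
    ]
```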
yapeichang.bsky.social
When BLEU agrees with humans on a pair of model outputs, what n-grams contribute to this decision? Below is an example where it captures both format (the “Ukrainian” and “English” headers) and factuality (the number 6.1).
yapeichang.bsky.social
BLEU is often dismissed for weak human correlation in generation tasks. But on general instruction following, using BLEU to rank pairs of Chatbot Arena outputs—scored against references from strong LLMs—matches 8B & 27B reward models in human agreement, especially with more refs.
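Concretely, "using BLEU to rank" just means scoring both outputs against the same reference set and preferring the higher one. A minimal sketch with sacrebleu (the example strings are made up):

```python
import sacrebleu

def bleu_prefers(output_a: str, output_b: str, references: list[str]) -> str:
    """Pairwise judge: rank two model outputs by sentence-level BLEU
    against shared references; more references generally helps."""
    score_a = sacrebleu.sentence_bleu(output_a, references).score
    score_b = sacrebleu.sentence_bleu(output_b, references).score
    return "A" if score_a >= score_b else "B"

refs = ["Paris is the capital of France.", "The capital of France is Paris."]
print(bleu_prefers("Paris is France's capital.", "Maybe Lyon?", refs))  # -> "A"
```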
yapeichang.bsky.social
BLEU is widely used for machine translation (MT) eval. Given a reference and a generation, it computes modified n-gram precision (1–4 grams) and applies a brevity penalty to penalize short outputs. If given multiple references, it takes the max match per n-gram.
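Those mechanics fit in a few lines of plain Python; a toy sketch below (illustrative only; use a library like sacrebleu for real evals, which also handles tokenization and smoothing):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(hypothesis: str, references: list[str], max_n: int = 4) -> float:
    """Toy sentence BLEU: clipped (modified) n-gram precision for n=1..4,
    geometric mean, times a brevity penalty; multi-reference matching
    takes the max count per n-gram across references."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        max_ref = Counter()
        for ref in refs:
            for gram, c in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[gram]) for gram, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / sum(hyp_ngrams.values()))
    # brevity penalty uses the reference length closest to the hypothesis
    ref_len = min((len(r) for r in refs), key=lambda rl: abs(rl - len(hyp)))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_prec / max_n)

print(toy_bleu("a cat is on the mat", ["the cat is on the mat"]))  # ~0.76
```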
yapeichang.bsky.social
🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment?

🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and can even train LLMs with GRPO to be competitive with RM-trained models.

🫐 Introducing BLEUBERI:
yapeichang.bsky.social
🕵️‍♀️ agents are strong on many tasks, but are they good at interacting with the web? 🧸 our BEARCUBS benchmark shows that they struggle on interactive tasks that seem trivial to humans! 📄 check out the paper for how to build robust evaluations & directions for future agent research
yixiaosong.bsky.social
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
Reposted by Yapei Chang
yekyung.bsky.social
Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇
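(If you want the flavor of a "nonexistent needle" item: hide a fact in filler text, or don't, and the gold answer becomes "none" when it's absent. A toy sketch, monolingual and much simpler than ONERULER:)

```python
import random

def make_niah_item(n_filler: int, present: bool, seed: int = 0):
    """Toy NIAH item with a possibly-nonexistent needle: when the
    needle is absent, the correct answer is 'none'."""
    rng = random.Random(seed)
    haystack = ["The sky was gray that day."] * n_filler
    if present:
        haystack.insert(rng.randrange(len(haystack) + 1),
                        "The magic number is 7421.")
    question = ("What is the magic number mentioned in the text? "
                "If none is mentioned, answer 'none'.")
    return " ".join(haystack), question, ("7421" if present else "none")

context, question, gold = make_niah_item(n_filler=2000, present=False)
print(question, "->", gold)  # -> none
```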
yapeichang.bsky.social
current models struggle with complex long-range reasoning tasks 📚 how can we reliably create synthetic training data?

💽 check out CLIPPER, a pipeline that generates data by conditioning on compressed forms of long input documents!
chautmpham.bsky.social
⚠️Current methods for generating instruction-following data fall short for long-range reasoning tasks like narrative claim verification.

We present CLIPPER ✂️, a compression-based pipeline that produces grounded instructions for ~$0.50 each, 34x cheaper than human annotations.
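(Rough shape of the compression idea below; the `llm` callable and the prompts are hypothetical placeholders, not CLIPPER's actual pipeline or prompts:)

```python
def clipper_style(chapters: list[str], llm) -> list[tuple[str, str]]:
    """Sketch of compression-based generation: summarize the long input
    first, then generate grounded true/false claims from the compressed
    form instead of the full text (far fewer input tokens per claim)."""
    summaries = [llm(f"Summarize this chapter:\n{ch}") for ch in chapters]
    outline = "\n".join(summaries)
    true_claim = llm(f"Write a claim supported by this book outline:\n{outline}")
    false_claim = llm(f"Write a subtly false claim about this outline:\n{outline}")
    return [(true_claim, "TRUE"), (false_claim, "FALSE")]
```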
Reposted by Yapei Chang
jennarussell.bsky.social
People often claim they know when ChatGPT wrote something, but are they as accurate as they think?

Turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯
Reposted by Yapei Chang
mm-jj-nn.bsky.social
Great blog post (by a 15-author team!) on their release of ModernBERT, the continuing relevance of encoder-only models, and how they relate to, say, GPT-4/llama. Accessible enough that I might use this as an undergrad reading.
Finally, a Replacement for BERT: Introducing ModernBERT
Reposted by Yapei Chang
saxon.me
🚨I too am on the job market‼️🤯

I'm searching for faculty positions/postdocs in multilingual/multicultural NLP, vision+language models, and eval for genAI!

I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat there!

Papers in🧵, see more: saxon.me
yapeichang.bsky.social
🐠 what monday feels like..
😵 fish washed up on the shore of walden pond
yapeichang.bsky.social
private closed-source evals are the future 🫣
yapeichang.bsky.social
i knew something like this had to exist but why did i only discover it now?? no more suffering from looking at my 10+ open arxiv tabs not knowing which one is which...
arxiv-utils (Chrome Web Store)
Reposted by Yapei Chang
marcmarone.com
I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux

Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!
yapeichang.bsky.social
i also got 10/10! the ones that rhyme too well feel very AI to me..
yapeichang.bsky.social
such a creative way of using long-context models! this sounds like a super hard evaluation task, but gemini is already so good at it...
arnicas.bsky.social
Steve Johnson on using book text to build a text adventure in NotebookLM. “the game relies on three elements: the original text from my book; a large language model (… Gemini Pro 1.5); and a 400-word prompt that I wrote giving the model instructions on how to host the game” thelongcontext.com
You Exist In The Long Context
Thoughts on the quiet revolution of long-context AI models, from NotebookLM's Editorial Director Steven Johnson.
Reposted by Yapei Chang
mrdrozdov.com
Mat is not on 🦋—posting on his behalf!

It's time to revisit common assumptions in IR! Embeddings have improved drastically, but mainstream IR evals have stagnated since MSMARCO + BEIR.

We ask: on private or tricky IR tasks, are rerankers better? Surely, reranking many docs is best?
A plot showing that reranking improves recall as the number of reranked docs increases, but more docs bring diminishing returns and eventually a performance dip.
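(For reference, the retrieve-then-rerank recipe being stress-tested here, sketched with sentence-transformers; the model names are common defaults, not necessarily the paper's:)

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")                # bi-encoder retriever
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # cross-encoder reranker

def search(query: str, docs: list[str], retrieve_k: int = 100, final_k: int = 10):
    """Embed + retrieve top-k by cosine similarity, then rerank that
    candidate set with the slower but stronger cross-encoder."""
    doc_emb = encoder.encode(docs, normalize_embeddings=True)
    q_emb = encoder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(doc_emb @ q_emb))[:retrieve_k]
    scores = reranker.predict([(query, docs[i]) for i in top])
    keep = np.argsort(-scores)[:final_k]
    return [docs[top[i]] for i in keep]
```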