Ben Newman
@benn9.bsky.social
740 followers 120 following 6 posts
NLP research - PhD student at UW
Reposted by Ben Newman
taylor-sorensen.bsky.social
Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same joke. This is true across samples and even across model families!

Why does this happen? Can we improve it?
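A quick way to see this for yourself: sample the same prompt many times and check how concentrated the outputs are. A minimal Python sketch, where `sample_completion` is a hypothetical stand-in for whatever chat-completion API you use:

```python
# Minimal sketch: measure mode collapse by sampling one prompt repeatedly
# and checking what fraction of samples are the single most common output.
from collections import Counter

def sample_completion(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical stand-in: call your LLM API and return one completion."""
    raise NotImplementedError

def mode_collapse_rate(prompt: str, n: int = 100) -> float:
    """Fraction of n samples equal to the most common (normalized) completion."""
    samples = [sample_completion(prompt).strip().lower() for _ in range(n)]
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / n

# A severely collapsed model returns a value near 1.0:
# mode_collapse_rate("Tell me a joke.")
```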
Reposted by Ben Newman
kylelo.bsky.social
Excited to share OLMo 2!

🐟 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc
🐠 better architecture and recipe for training stability
🐡 staged training, with new data mix Dolmino🍕 added during annealing
🦈 state-of-the-art OLMo 2 Instruct models

#nlp #mlsky

links below👇
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.
Reposted by Ben Newman
mariaa.bsky.social
I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.

Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park 😎

Apply by December 15th!
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.
Reposted by Ben Newman
lasha.bsky.social
✨I am on the faculty job market in the 2024-2025 cycle!✨

My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.

If you have relevant positions, let me know: lasharavichander.github.io. Please share/RT!
Reposted by Ben Newman
valentinapy.bsky.social
Why and when do preference annotators disagree? And how do reward models + LLM-as-Judge evaluators handle disagreements?

Michael explored these questions in a new ✨preprint✨ from his @ai2.bsky.social internship with me!
benn9.bsky.social
We also find that providing more table context (captions, in-text references) to models leads to higher recall when generating columns but does not help when generating values.
Two plots of recall versus threshold for determining a match: one for GPT-3.5 Turbo and one for Mixtral 8x22B, each with five lines. Each line falls from the top left to the bottom right of the plot, with y-intercepts generally increasing in this order of context type: generated caption, baseline, gold caption, in-context examples, caption + in-text references.
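For illustration, here is one way the context conditions compared in the plot might be assembled into prompts. The templates below are assumptions for the sketch, not the paper's exact wording:

```python
# Sketch of prompt construction under different context conditions
# (baseline, gold caption, caption + in-text references).

def build_prompt(papers: list[str], caption: str | None = None,
                 in_text_refs: list[str] | None = None) -> str:
    parts = ["Generate the column headers of a literature review table "
             "comparing the following papers."]
    if caption:
        parts.append(f"Table caption: {caption}")
    if in_text_refs:
        parts.append("In-text references to the table:\n" + "\n".join(in_text_refs))
    parts.append("Papers:\n\n" + "\n\n".join(papers))
    return "\n\n".join(parts)

papers = ["<paper 1 text>", "<paper 2 text>"]
baseline = build_prompt(papers)
with_caption = build_prompt(papers, caption="<gold caption>")
with_refs = build_prompt(papers, caption="<gold caption>",
                         in_text_refs=["<sentence citing the table>"])
```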
benn9.bsky.social
We find that using decontextualization with SBERT leads to a better evaluator than Llama 3, which hallucinates alignments.
A plot of recall versus threshold for determining a match between column headers. Llama 3 shows the highest raw recall because it hallucinates matches; Sentence Transformers is the more reliable evaluator.
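For concreteness, a minimal sketch of this kind of embedding-based matcher using sentence-transformers. The model choice and the 0.7 threshold here are illustrative assumptions:

```python
# Minimal sketch: match generated column headers to gold headers by
# cosine similarity of sentence embeddings, then compute recall.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def column_recall(gold: list[str], generated: list[str],
                  threshold: float = 0.7) -> float:
    """Fraction of gold column headers matched by some generated header."""
    gold_emb = model.encode(gold, convert_to_tensor=True)
    gen_emb = model.encode(generated, convert_to_tensor=True)
    sims = util.cos_sim(gold_emb, gen_emb)          # |gold| x |generated|
    matched = sims.max(dim=1).values >= threshold   # best match per gold header
    return matched.float().mean().item()

print(column_recall(["Dataset", "Task"], ["Corpus used", "Evaluation task"]))
```

Sweeping `threshold` from 0 to 1 traces out recall-versus-threshold curves like the ones in the plots above.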
benn9.bsky.social
We propose a two-step procedure for generating tables given the input papers:
1️⃣ Generate the schemas (sets of columns)
2️⃣ Fill in the values.
A diagram showing two steps of table generation. There is text that says "Step 1: Schema Generation" with an arrow pointing to the column headers of a generated table. Under it, there is text that says "Step 2: Value Generation" with an arrow pointing to the body of the generated table.
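A minimal sketch of what this two-step pipeline might look like in code. Here `llm` is a hypothetical stand-in for any completion API, and the prompt wording is illustrative, not the paper's:

```python
# Sketch of the two-step table generation procedure:
# Step 1 proposes a schema (column headers); Step 2 fills in cell values.

def llm(prompt: str) -> str:
    """Hypothetical stand-in: return one completion from your model of choice."""
    raise NotImplementedError

def generate_table(papers: dict[str, str]) -> dict[str, dict[str, str]]:
    # Step 1: schema generation -- propose a set of columns for the table.
    schema_prompt = ("Propose column headers (one per line) for a table "
                     "comparing these papers:\n\n" + "\n\n".join(papers.values()))
    columns = [c.strip() for c in llm(schema_prompt).splitlines() if c.strip()]

    # Step 2: value generation -- fill in one cell per (paper, column) pair.
    table = {}
    for paper_id, paper_text in papers.items():
        table[paper_id] = {
            col: llm(f"From this paper, extract the value for '{col}':\n\n{paper_text}")
            for col in columns
        }
    return table
```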
benn9.bsky.social
This table generation task takes as input multiple papers, and synthesizes them into a single output table. We collect a dataset of such tables and associated papers, and augment the tables with additional context such as their captions and in-text references.
An example literature review table with four rows and four columns. Each row is a paper (labeled Paper 1, Paper 2, etc.). Each column is a different aspect: ("Dataset", "Size", "Task", and "Annotations").
benn9.bsky.social
✨EMNLP Paper! ✨
Have you ever constructed a table to organize your literature review process? Can we use LMs to generate these automatically?

We are excited to present ArxivDIGESTables 🍽️, a study of collecting, generating, and evaluating 🎓 scientific literature review tables 📃!
A screenshot of the first page of the paper discussed in the thread. Figure 1 contains a set of three cartoon papers with related text highlighted in three different colors. To its left, there's an arrow pointing to a cartoon table with a column corresponding to each color and a row corresponding to each paper.