Ben Newman
@benn9.bsky.social
740 followers 120 following 6 posts
NLP research - PhD student at UW
Reposted by Ben Newman
taylor-sorensen.bsky.social
Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same joke. This is true across samples and even across model families!

Why does this happen? Can we improve it?
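A quick way to see this for yourself: sample the same prompt many times and check how concentrated the outputs are. A minimal Python sketch, where `sample_completion` is a hypothetical stand-in for whatever chat-completion API you use:

```python
# Minimal sketch: measure mode collapse by sampling one prompt repeatedly
# and checking what fraction of samples are the single most common output.
from collections import Counter

def sample_completion(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical stand-in: call your LLM API and return one completion."""
    raise NotImplementedError

def mode_collapse_rate(prompt: str, n: int = 100) -> float:
    """Fraction of n samples equal to the most common (normalized) completion."""
    samples = [sample_completion(prompt).strip().lower() for _ in range(n)]
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / n

# A severely collapsed model returns a value near 1.0:
# mode_collapse_rate("Tell me a joke.")
```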
Reposted by Ben Newman
kylelo.bsky.social
Excited to share OLMo 2!

🐟 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc
🐠 better architecture and recipe for training stability
🐡 staged training, with new data mix Dolmino🍕 added during annealing
🦈 state-of-the-art OLMo 2 Instruct models

#nlp #mlsky

links below👇
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.
Reposted by Ben Newman
mariaa.bsky.social
I'm recruiting 1-2 PhD students to work with me at the University of Colorado Boulder! Looking for creative students with interests in #NLP and #CulturalAnalytics.

Boulder is a lovely college town 30 minutes from Denver and 1 hour from Rocky Mountain National Park 😎

Apply by December 15th!
A photo of Boulder, Colorado, shot from above the university campus and looking toward the Flatirons.
Reposted by Ben Newman
lasha.bsky.social
✨I am on the faculty job market in the 2024-2025 cycle!✨

My research centers on advancing Responsible AI, specifically enhancing factuality, robustness, and transparency in AI systems.

If you have relevant positions, let me know: lasharavichander.github.io. Please share/RT!
Reposted by Ben Newman
valentinapy.bsky.social
Why and when do preference annotators disagree? And how do reward models + LLM-as-Judge evaluators handle disagreements?

Michael explored these questions in a new ✨preprint✨ from his @ai2.bsky.social internship with me!
benn9.bsky.social
We also find that providing more table context (captions, in-text references) to models leads to higher recall when generating columns but does not help when generating values.
Two plots of recall versus threshold for determining a match: one for GPT-3.5 Turbo and one for Mixtral 8x22B, each with five lines. Each line falls from the top left to the bottom right of the plot, with y-intercepts generally increasing in this order of context type: generated caption, baseline, gold caption, in-context examples, caption + in-text references.
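For illustration, here is one way the context conditions compared in the plot might be assembled into prompts. The templates below are assumptions for the sketch, not the paper's exact wording:

```python
# Sketch of prompt construction under different context conditions
# (baseline, gold caption, caption + in-text references).

def build_prompt(papers: list[str], caption: str | None = None,
                 in_text_refs: list[str] | None = None) -> str:
    parts = ["Generate the column headers of a literature review table "
             "comparing the following papers."]
    if caption:
        parts.append(f"Table caption: {caption}")
    if in_text_refs:
        parts.append("In-text references to the table:\n" + "\n".join(in_text_refs))
    parts.append("Papers:\n\n" + "\n\n".join(papers))
    return "\n\n".join(parts)

papers = ["<paper 1 text>", "<paper 2 text>"]
baseline = build_prompt(papers)
with_caption = build_prompt(papers, caption="<gold caption>")
with_refs = build_prompt(papers, caption="<gold caption>",
                         in_text_refs=["<sentence citing the table>"])
```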
benn9.bsky.social
We find that using decontextualization with SBERT leads to a better evaluator than Llama 3, which hallucinates alignments.
A plot of recall versus threshold for determining a match between column headers. Llama 3 shows the highest raw recall because it hallucinates matches; Sentence Transformers is the more reliable evaluator.
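For concreteness, a minimal sketch of this kind of embedding-based matcher using sentence-transformers. The model choice and the 0.7 threshold here are illustrative assumptions:

```python
# Minimal sketch: match generated column headers to gold headers by
# cosine similarity of sentence embeddings, then compute recall.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def column_recall(gold: list[str], generated: list[str],
                  threshold: float = 0.7) -> float:
    """Fraction of gold column headers matched by some generated header."""
    gold_emb = model.encode(gold, convert_to_tensor=True)
    gen_emb = model.encode(generated, convert_to_tensor=True)
    sims = util.cos_sim(gold_emb, gen_emb)          # |gold| x |generated|
    matched = sims.max(dim=1).values >= threshold   # best match per gold header
    return matched.float().mean().item()

print(column_recall(["Dataset", "Task"], ["Corpus used", "Evaluation task"]))
```

Sweeping `threshold` from 0 to 1 traces out recall-versus-threshold curves like the ones in the plots above.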
benn9.bsky.social
We propose a two-step procedure for generating tables given the input papers:
1️⃣ Generate the schemas (sets of columns)
2️⃣ Fill in the values.
A diagram showing two steps of table generation. There is text that says "Step 1: Schema Generation" with an arrow pointing to the column headers of a generated table. Under it, there is text that says "Step 2: Value Generation" with an arrow pointing to the body of the generated table.
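A minimal sketch of what this two-step pipeline might look like in code. Here `llm` is a hypothetical stand-in for any completion API, and the prompt wording is illustrative, not the paper's:

```python
# Sketch of the two-step table generation procedure:
# Step 1 proposes a schema (column headers); Step 2 fills in cell values.

def llm(prompt: str) -> str:
    """Hypothetical stand-in: return one completion from your model of choice."""
    raise NotImplementedError

def generate_table(papers: dict[str, str]) -> dict[str, dict[str, str]]:
    # Step 1: schema generation -- propose a set of columns for the table.
    schema_prompt = ("Propose column headers (one per line) for a table "
                     "comparing these papers:\n\n" + "\n\n".join(papers.values()))
    columns = [c.strip() for c in llm(schema_prompt).splitlines() if c.strip()]

    # Step 2: value generation -- fill in one cell per (paper, column) pair.
    table = {}
    for paper_id, paper_text in papers.items():
        table[paper_id] = {
            col: llm(f"From this paper, extract the value for '{col}':\n\n{paper_text}")
            for col in columns
        }
    return table
```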
benn9.bsky.social
This table generation task takes as input multiple papers, and synthesizes them into a single output table. We collect a dataset of such tables and associated papers, and augment the tables with additional context such as their captions and in-text references.
An example literature review table with four rows and four columns. Each row is a paper (labeled Paper 1, Paper 2, etc.). Each column is a different aspect: ("Dataset", "Size", "Task", and "Annotations").
benn9.bsky.social
✨EMNLP Paper! ✨
Have you ever constructed a table to organize your literature review process? Can we use LMs to generate these automatically?

We are excited to present ArxivDIGESTables 🍽️, a study of collecting, generating, and evaluating 🎓 scientific literature review tables 📃!
A screenshot of the first page of the paper discussed in the thread. Figure 1 contains a set of three cartoon papers with related text highlighted in three different colors. To its left, there's an arrow pointing to a cartoon table with a column corresponding to each color and a row corresponding to each paper.