Amanda Bertsch
@abertsch.bsky.social
PhD student @ CMU LTI, working on text generation + long context

https://www.cs.cmu.edu/~abertsch/
Models show varying error patterns. Claude and some GPT-family models underperform on tasks that require outputting dates; Gemini and DeepSeek-R1 frequently over-reason and fail to return an answer at all on Oolong-synth, although Gemini is the best model on Oolong-real.
November 7, 2025 at 5:07 PM
Why is this so hard? Models must identify the relevant sections of the input, label or categorize those sections, and then accumulate that information to make distribution-level decisions. Adding labels in-context or specifying higher reasoning effort has limited benefit.
November 7, 2025 at 5:07 PM
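A minimal Python sketch of the implicit pipeline this post describes, with the three steps written out as explicit code. The line format ("example: ... | label: ...") and the helper name aggregate are illustrative assumptions, not Oolong's actual data format: the point is that a model must perform these steps internally over raw context, without any such structure handed to it.

```python
# Illustrative sketch (assumptions, not the Oolong pipeline): the three steps
# a model must perform implicitly, written out as explicit code.

def aggregate(context: str, target_label: str) -> int:
    """Count how many lines in a long context carry a given label.

    Hypothetical format: relevant lines look like
    "example: <text> | label: <label>"; everything else is distractor text.
    """
    count = 0
    for line in context.splitlines():
        # Step 1: identify the relevant sections of the input.
        if "| label:" not in line:
            continue
        # Step 2: read off (or, for the model, infer) the label.
        label = line.rsplit("| label:", 1)[1].strip()
        # Step 3: accumulate into a distribution-level answer.
        if label == target_label:
            count += 1
    return count

context = (
    "some distractor narration...\n"
    "example: great movie! | label: positive\n"
    "example: waste of time | label: negative\n"
    "example: loved it | label: positive\n"
)
print(aggregate(context, "positive"))  # -> 2
```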
Oolong has a synthetic setting that poses distributional questions over sets of classification examples and their metadata, and a realistic setting using conversational data from game transcripts. Both splits require counting, temporal reasoning, and multi-step entity resolution.
November 7, 2025 at 5:07 PM
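To make the synthetic setting concrete, here is a hypothetical toy instance in its spirit: classification examples with metadata, plus a distributional question whose gold answer is computed from labels the model never sees. Every field name, the question template, and the sizes are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical toy instance in the spirit of Oolong-synth (not the real schema):
# classification examples + metadata, and a distributional question over the set.
import random

random.seed(0)
LABELS = ["positive", "negative", "neutral"]

examples = [
    {
        "id": i,
        "text": f"review #{i} ...",                      # stand-in example text
        "label": random.choice(LABELS),                  # gold label, held out from the model
        "date": f"2024-{random.randint(1, 12):02d}-01",  # metadata field
    }
    for i in range(1000)
]

question = "How many of the reviews written before 2024-07-01 are negative?"
gold = sum(
    1 for ex in examples
    if ex["label"] == "negative" and ex["date"] < "2024-07-01"
)
print(question, "->", gold)
```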
Can LLMs accurately aggregate information over long, information-dense texts? Not yet…

We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K context length on Oolong!
November 7, 2025 at 5:07 PM
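"Simple to verify" suggests the answers are short values such as counts or dates, so scoring can reduce to normalized exact match. A rough sketch under that assumption; the extraction regex and normalization here are guesses, not Oolong's real scorer.

```python
# Minimal sketch of "simple to verify" scoring: extract a short answer
# (count or date) from the response and compare it to the gold value.
# The extraction and normalization rules are assumptions for illustration.
import re

def extract_answer(response: str) -> str:
    """Take the last date-like or number-like token from a response (assumption)."""
    matches = re.findall(r"\d{4}-\d{2}-\d{2}|\d+", response)
    return matches[-1] if matches else ""

def is_correct(response: str, gold: str) -> bool:
    return extract_answer(response) == gold.strip()

print(is_correct("Counting carefully, the answer is 42.", "42"))  # True
print(is_correct("I ran out of budget before answering.", "42"))  # False
```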