David Schlangen
@davidschlangen.bsky.social
360 followers 1.4K following 70 posts
Prof of Computational Linguistics / NLP @ Uni Potsdam, Germany. Working on embodied / multimodal / conversational AI. In a way. Also affiliated w/ DFKI Berlin (German Research Center for AI).
Pinned
davidschlangen.bsky.social
Do we do introductions here? Anyway, here is mine: I’m a Professor of Computational Linguistics at the University of Potsdam.

I am interested in understanding “understanding”, the process or activity by which an agent makes sense of its environment and, in interaction, of and with other agents.
davidschlangen.bsky.social
Bonus post advertising this other thread through the medium of "memes", which I've been told is what you have to do on social media.
Scene from the film "WarGames", with an added supertitle saying "How can we evaluate LLMs in interactions? Where do we get the interaction purposes from??", to which Matthew Broderick's character answers: "It's games". Another still from the film, supertitle: "But getting people to play games with the computer takes time!", to which our hero answers: "Is there any way to make it play itself?" The famous scene where the computer says "A strange game. The only winning move is" ... well, it now says "... to check out the clembench."
davidschlangen.bsky.social
(That animation in the first post? That's Claude trying, and failing, to fully explore a maze in the MapWorld game.)
davidschlangen.bsky.social
Thanks to a recent short-term grant, we've been able to focus on code quality and ease of use for benchmarking and extensibility. (Exploring new games is a fun programming lab activity, which we've run several times by now!) Here's a writeup of the current state: arxiv.org/abs/2507.08491
»
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine l...
arxiv.org
davidschlangen.bsky.social
It's great to see the idea of using games / interactions to evaluate LLMs gain traction, with textarena.ai and now ARC-AGI-3 being the latest entrants.
This is something we've been exploring since early 2023 with clembench ( clembench.github.io ), which we've been continuously maintaining & extending. »
davidschlangen.bsky.social
Ha, yes, I'm quite pleased as well with how that turned out. It's nothing fancy, just a nice font, colouring (obviously), fbox, and rotate.
davidschlangen.bsky.social
This was the outcome of a collaboration that started last year at an ELLIS workshop, and that has brought together many labs (and many master's and PhD students, and PIs).

Much more remains to be explored in "learning in interaction" -- maybe by you?

🤖🧠 #NLP #AI #LLM
The list of authors from the paper.
davidschlangen.bsky.social
We release the framework and the baseline training setups to foster research in the promising new direction of learning in (synthetic) interaction which we believe will provide more effective ways of post-training agentic conversational LLMs. github.com/lm-playpen/p...
GitHub - lm-playpen/playpen: All you need to get started with the LM Playpen Environment for Learning in Interaction.
All you need to get started with the LM Playpen Environment for Learning in Interaction. - lm-playpen/playpen
github.com
davidschlangen.bsky.social
We find that imitation learning through SFT improves performance on unseen game instances, but does not generalise to new games and negatively impacts other skills -- while interactive learning with GRPO shows balanced improvements without loss of skills.
Table 3 from the paper linked in a post below.
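The group-relative advantage idea behind GRPO can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each self-play rollout's verifiable game reward is normalised against the mean and standard deviation of its own group of rollouts.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: normalise each rollout's
    reward against the statistics of its own sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four self-play episodes of the same game instance, scored 0..1:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within a group, a win only earns a positive learning signal when other rollouts of the same game did worse, which is one intuition for why interactive learning stays balanced across games.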
davidschlangen.bsky.social
Playpen is a training environment for post-training LLMs through learning in interaction, by self-play of "dialogue games": goal-oriented language-based activities that generate verifiable rewards.
Diagram showing an interaction triangle "interlocutor A -- world -- interlocutor B", except that this is mediated by GM (the "Game Master"), and that A is a learner wrapped around an LLM, and B also is a wrapper around a (non-learning) LLM.
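The Game-Master pattern from the diagram can be sketched as a minimal toy. All names here (Player, GuessingGameMaster, the word-guessing rules) are hypothetical illustrations, not the actual Playpen API: the GM mediates between two players, enforces the game rules, and emits a verifiable reward at the end.

```python
class Player:
    """Stand-in for an LLM wrapper; here it just replays scripted moves."""
    def __init__(self, moves):
        self.moves = iter(moves)

    def __call__(self, observation):
        return next(self.moves)

class GuessingGameMaster:
    """GM for a toy word-guessing game: A describes a target word
    without saying it; B tries to guess it within max_turns."""
    def __init__(self, target, max_turns=3):
        self.target = target
        self.max_turns = max_turns

    def play(self, describer, guesser):
        for _ in range(self.max_turns):
            clue = describer(f"Describe '{self.target}' without using it.")
            if self.target.lower() in clue.lower():
                return 0.0  # rule violation: said the word, game lost
            guess = guesser(f"Clue: {clue}. Your guess?")
            if guess.strip().lower() == self.target.lower():
                return 1.0  # verifiable reward: correct guess
        return 0.0  # out of turns

gm = GuessingGameMaster("apple")
reward = gm.play(Player(["a red or green fruit"]), Player(["apple"]))
print(reward)  # → 1.0
```

The key design point is that neither player ever talks to the other directly: the GM owns the turn order and the rules, so the binary outcome is checkable without a human or a judge model in the loop.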
davidschlangen.bsky.social
🚨 New pre-print! (Well, new & much improved version in any case.) 🚨
If you're interested in LLM post-training techniques and in how to make LLMs better "language users", read this thread, introducing the "LM Playpen".
Title of the paper, with a colourful "playpen" logo
davidschlangen.bsky.social
The University of Potsdam invites applications for 5 postdoc positions, incl. in the Cognitive Sciences and in NLP (esp. cognitive).

These are fairly independent research positions that will allow the candidate to build their own profile. Deadline: June 2nd.

Details: tinyurl.com/pd-potsdam-2...

#NLProc #AI 🤖🧠
tinyurl.com
davidschlangen.bsky.social
There's indeed suddenly a bit of flexibility in a system that's not exactly known for that. If there's anyone (post-doc, tenure-track, or more senior) in the #NLP space currently in the US who'd like to explore possibilities in Potsdam, contact me.

🤖🧠

www.nytimes.com/2025/05/14/b...
The World Is Wooing U.S. Researchers Shunned by Trump
www.nytimes.com
davidschlangen.bsky.social
"We ablated both algorithm and hyperparameter choices [...]"

When did "to ablate" take on the meaning "to systematically vary"? I've noticed this only recently, but it seems to be super common now.
Reposted by David Schlangen
davidschlangen.bsky.social
Update 2: New pre-print! Outcome of an ELLIS workshop last year, & more than a year of discussions and work, across labs and countries: Meet the Playpen, an environment for exploring learning in dialogic interaction.

arxiv.org/abs/2504.08590

1/2
Titlepage of the paper linked in the post.
davidschlangen.bsky.social
[Sneak preview: If you're wondering where this is going, have a secret look at lm-playschool.github.io -- and stay tuned for more info!]

3/2
A Playschool for LLMs
lm-playschool.github.io
davidschlangen.bsky.social
Nice baseline results as well: learning via SFT from transcripts does a bit, but only "real"(-ish) learning in interaction (GRPO) generalises. (Basically, you want to see the whole row being green in this table.)

2/2
Table 1 from that paper.
davidschlangen.bsky.social
This is only a subset of the models on the leaderboard; visit the site to see all 32 models, and also the results for the multimodal version of the benchmark.
davidschlangen.bsky.social
Update 1: New models added to our dialogue game-based agentic LLM leaderboard. TL;DR: GPT-4.1 as good as 4o, but much cheaper. Llama4 indeed not very good (decisively worse than 3.2 70B!). OLMo decent, but there's still a secret sauce that only closed labs have.

clembench.github.io
Screenshot of leaderboard as linked in post.
Reposted by David Schlangen
arxiv-cs-cl.bsky.social
Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, ...
Playpen: An Environment for Exploring Learning Through Conversational Interaction
https://arxiv.org/abs/2504.08590