David Schlangen
@davidschlangen.bsky.social
360 followers 1.4K following 70 posts
Prof of Computational Linguistics / NLP @ Uni Potsdam, Germany. Working on embodied / multimodal / conversational AI. In a way. Also affiliated w/ DFKI Berlin (German Research Center for AI).
Pinned
davidschlangen.bsky.social
Do we do introductions here? Anyway, here is mine: I’m a Professor of Computational Linguistics at the University of Potsdam.

I am interested in understanding “understanding”, the process or activity by which an agent makes sense of its environment and, in interaction, of and with other agents.
davidschlangen.bsky.social
Bonus post advertising this other thread through the medium of "memes", which I've been told is what you have to do on social media.
Scene from the film "WarGames", with an added supertitle saying "How can we evaluate LLMs in interactions? Where do we get the interaction purposes from??", to which Matthew Broderick's character answers: "It's games". Another still from the film, supertitle: "But getting people to play games with the computer takes time!", to which our hero answers: "Is there any way to make it play itself?" The famous scene where the computer says "A strange game. The only winning move is" ... well, it now says "... to check out the clembench."
davidschlangen.bsky.social
(That animation in the first post? That's Claude trying, and failing, to fully explore a maze in the MapWorld game.)
davidschlangen.bsky.social
Thanks to a recent short-term grant, we've been able to focus on code quality and ease of use for benchmarking and extensibility. (Exploring new games is a fun programming lab activity, which we've run several times by now!) Here's a writeup of the current state: arxiv.org/abs/2507.08491
»
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine l...
arxiv.org
davidschlangen.bsky.social
It's great to see the idea of using games / interactions to evaluate LLMs gain traction, with textarena.ai and now ARC-AGI-3 being the latest entrants.
This is something we've been exploring since early 2023 with clembench ( clembench.github.io ), which we've been continuously maintaining & extending. »
davidschlangen.bsky.social
Ha, yes, I'm quite pleased as well with how that turned out. It's nothing fancy, just a nice font, colouring (obviously), fbox, and rotate.
davidschlangen.bsky.social
This was the outcome of a collaboration that started last year at an ELLIS workshop, and that has brought together many labs (and many master's and PhD students, and PIs).

Much more remains to be explored in "learning in interaction" -- maybe by you?

🤖🧠 #NLP #AI #LLM
The list of authors from the paper.
davidschlangen.bsky.social
We release the framework and the baseline training setups to foster research in the promising new direction of learning in (synthetic) interaction which we believe will provide more effective ways of post-training agentic conversational LLMs. github.com/lm-playpen/p...
GitHub - lm-playpen/playpen: All you need to get started with the LM Playpen Environment for Learning in Interaction.
All you need to get started with the LM Playpen Environment for Learning in Interaction. - lm-playpen/playpen
github.com
davidschlangen.bsky.social
We find that imitation learning through SFT improves performance on unseen game instances, but does not generalise to new games and negatively impacts other skills -- while interactive learning with GRPO shows balanced improvements without loss of skills.
Table 3 from the paper linked in a post below.
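The group-relative advantage idea behind GRPO can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each self-play rollout's verifiable game reward is normalised against the mean and standard deviation of its own group of rollouts.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: normalise each rollout's
    reward against the statistics of its own sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four self-play episodes of the same game instance, scored 0..1:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within a group, a win only earns a positive learning signal when other rollouts of the same game did worse, which is one intuition for why interactive learning stays balanced across games.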
davidschlangen.bsky.social
Playpen is a training environment for post-training LLMs through learning in interaction, by self-play of "dialogue games": goal-oriented language-based activities that generate verifiable rewards.
Diagram showing an interaction triangle "interlocutor A -- world -- interlocutor B", except that this is mediated by GM (the "Game Master"), and that A is a learner wrapped around an LLM, and B also is a wrapper around a (non-learning) LLM.
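The Game-Master pattern from the diagram can be sketched as a minimal toy. All names here (Player, GuessingGameMaster, the word-guessing rules) are hypothetical illustrations, not the actual Playpen API: the GM mediates between two players, enforces the game rules, and emits a verifiable reward at the end.

```python
class Player:
    """Stand-in for an LLM wrapper; here it just replays scripted moves."""
    def __init__(self, moves):
        self.moves = iter(moves)

    def __call__(self, observation):
        return next(self.moves)

class GuessingGameMaster:
    """GM for a toy word-guessing game: A describes a target word
    without saying it; B tries to guess it within max_turns."""
    def __init__(self, target, max_turns=3):
        self.target = target
        self.max_turns = max_turns

    def play(self, describer, guesser):
        for _ in range(self.max_turns):
            clue = describer(f"Describe '{self.target}' without using it.")
            if self.target.lower() in clue.lower():
                return 0.0  # rule violation: said the word, game lost
            guess = guesser(f"Clue: {clue}. Your guess?")
            if guess.strip().lower() == self.target.lower():
                return 1.0  # verifiable reward: correct guess
        return 0.0  # out of turns

gm = GuessingGameMaster("apple")
reward = gm.play(Player(["a red or green fruit"]), Player(["apple"]))
print(reward)  # → 1.0
```

The key design point is that neither player ever talks to the other directly: the GM owns the turn order and the rules, so the binary outcome is checkable without a human or a judge model in the loop.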
davidschlangen.bsky.social
🚨 New pre-print! (Well, new & much improved version in any case.) 🚨
If you're interested in LLM post-training techniques and in how to make LLMs better "language users", read this thread, introducing the "LM Playpen".
Title of the paper, with a colourful "playpen" logo
davidschlangen.bsky.social
The University of Potsdam invites applications for 5 postdoc positions, incl. in the Cognitive Sciences and in NLP (esp. cognitive).

These are fairly independent research positions that will allow the candidate to build their own profile. Deadline: June 2nd.

Details: tinyurl.com/pd-potsdam-2...

#NLProc #AI 🤖🧠
tinyurl.com
davidschlangen.bsky.social
There's indeed suddenly a bit of flexibility in a system that's not exactly known for that. If there's anyone (post-doc, tenure-track, or more senior) in the #NLP space currently in the US who'd like to explore possibilities in Potsdam, contact me.

🤖🧠

www.nytimes.com/2025/05/14/b...
The World Is Wooing U.S. Researchers Shunned by Trump
www.nytimes.com
davidschlangen.bsky.social
"We ablated both algorithm and hyperparameter choices [...]"

When did "to ablate" take on the meaning "to systematically vary"? I've noticed this only recently, but it seems to be super common now.
Reposted by David Schlangen
davidschlangen.bsky.social
Update 2: New pre-print! Outcome of an ELLIS workshop last year, & more than a year of discussions and work, across labs and countries: Meet the Playpen, an environment for exploring learning in dialogic interaction.

arxiv.org/abs/2504.08590

1/2
Titlepage of the paper linked in the post.
davidschlangen.bsky.social
[Sneak preview: If you're wondering where this is going, have a secret look at lm-playschool.github.io -- and stay tuned for more info!]

3/2
A Playschool for LLMs
lm-playschool.github.io
davidschlangen.bsky.social
Nice baseline results as well: learning via SFT from transcripts does a bit, but only "real"(-ish) learning in interaction (GRPO) generalises. (Basically, you want to see the whole row being green in this table.)

2/2
Table 1 from that paper.
davidschlangen.bsky.social
This is only a subset of the models on the leaderboard; visit the site to see all 32 models, and also the results for the multimodal version of the benchmark.
davidschlangen.bsky.social
Update 1: New models added to our dialogue game-based agentic LLM leaderboard. TL;DR: GPT-4.1 as good as 4o, but much cheaper. Llama4 indeed not very good (decisively worse than 3.2 70B!). OLMo decent, but there's still a secret sauce that only closed labs have.

clembench.github.io
Screenshot of leaderboard as linked in post.
Reposted by David Schlangen
arxiv-cs-cl.bsky.social
Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, ...
Playpen: An Environment for Exploring Learning Through Conversational Interaction
https://arxiv.org/abs/2504.08590