Lightnews — Scholar-powered news

Nick Tomlin

@nickatomlin.bsky.social

1.7K followers 110 following 9 posts

Incoming assistant professor at TTIC, and current PhD student at Berkeley. Natural language processing. He/him. 🌐 eecs.berkeley.edu/~nicholas_tomlin/


Pinned

Nick Tomlin @nickatomlin.bsky.social · Apr 15

Writing my first post here to announce that I've accepted an assistant professor job at TTIC! I'll be starting in Fall 2026, and recruiting students this upcoming cycle.

Until then, I'll be wrapping up the PhD at Berkeley, and this summer I'll join NYU as a CDS Faculty Fellow 🏙️

3 2 41

Reposted by Nick Tomlin

Ari @ari-holtzman.bsky.social · 1d

FYI that UChicago CS & Stats is hiring at all levels via the Data Science Institute:

Postdoc: uchicago.infoready4.com#freeformComp...
Assistant Professor: apply.interfolio.com/174766
Associate Professor: apply.interfolio.com/174768

3 8

Nick Tomlin @nickatomlin.bsky.social · 10d

What does it take to build a human-like user simulator? //

Jessy Lin and I wrote another blogpost on user simulators as a reward function for training interactive models, this time focused on methods + open questions:
jessylin.com/2025/09/25/u...

Reposted by Nick Tomlin

Eugene Vinitsky 🍒 @eugenevinitsky.bsky.social · Jul 27

Was talking to a student who wasn't sure about why one would get a PhD. So I wrote up a list of reasons!
www.eugenevinitsky.com/posts/reason...

7 11 51

Reposted by Nick Tomlin

Eugene Vinitsky 🍒 @eugenevinitsky.bsky.social · Jul 10

An excellent blog post about a gap that is still huge: models of humans you can actually use to study human-AI interaction: jessylin.com/2025/07/10/u...

User simulators bridge RL with real-world interaction

jessylin.com

1 2 12

Reposted by Nick Tomlin

TTIC @tticconnect.bsky.social · Jun 27

We’re proud to announce three new tenure-track assistant professors joining TTIC in Fall 2026: Yossi Gandelsman, Will Merrill, and Nick Tomlin (@nickatomlin.bsky.social). Meet them here: buff.ly/JH1DFtT

2 7

Nick Tomlin @nickatomlin.bsky.social · May 29

🤠🤓🙂

Robert Hawkins @rdhawkins.bsky.social · May 28

Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🎉
How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach?
🌐 sites.google.com/berkeley.edu/praglm/
📅 Submit by June 23rd

1 4

Nick Tomlin @nickatomlin.bsky.social · May 14

Haha main reason for using Gym was that we wanted a way to automatically evaluate models against trained RL agents. Doing the full arena-style evaluation on reasoning models gets really expensive

It also helps that current LLMs are really good at generating functional Gym code

1 1

Nick Tomlin @nickatomlin.bsky.social · May 14

I think in the short term that’s reasonable, e.g., current models can play chess but they definitely can’t understand chess variants

In the long term, I suspect there’s more risk of over-optimizing to those specific games, so the hope is that our approach is a bit more future-proof

Nick Tomlin @nickatomlin.bsky.social · May 13

For anyone interested in evaluating or expanding on this benchmark, we have a nice code release here: github.com/vivek3141/gg...

GitHub - vivek3141/gg-bench: Measuring General Intelligence With Generated Games (Preprint)

Nick Tomlin @nickatomlin.bsky.social · May 13

This is a difficult benchmark: the best non-reasoning LLMs score around 9%, while the best reasoning models score around 36%. In the future, as models get stronger, we anticipate that they'll also be able to generate harder games

Results table. The best model (o1) wins about 36% of games against the RL baselines.

1 1

Nick Tomlin @nickatomlin.bsky.social · May 13

We use o1 to generate natural language rulebooks for 1000 two-player games and then implement these games as Gym environments. For each game, we train baseline agents in self-play with RL and then evaluate whether LLMs can beat the RL baselines

Main paper figure showing a three-step pipeline of game description generation, implementation generation, and self-play training of RL agents

2 4
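The evaluation step described in the post above (a game exposed through a Gym-style interface, with a candidate player scored by win rate against a baseline) can be sketched roughly as follows. This is a minimal illustration, not the gg-bench code: `TakeAwayEnv`, `baseline_policy`, `random_policy`, `play_game`, and `win_rate` are all hypothetical names, the game is a hand-written take-away game rather than an LLM-generated one, the baseline is a fixed rule standing in for an RL agent trained in self-play, and the random policy stands in for an LLM player.

```python
import random


class TakeAwayEnv:
    """Hypothetical two-player game with a Gym-style reset/step interface.

    Players alternate removing 1..max_take stones; whoever takes the last
    stone wins. This is an illustrative stand-in for a generated game.
    """

    def __init__(self, start=10, max_take=3):
        self.start = start
        self.max_take = max_take

    def reset(self):
        self.stones = self.start
        self.current_player = 0
        return self.stones, {}

    def valid_moves(self):
        return list(range(1, min(self.max_take, self.stones) + 1))

    def step(self, action):
        if action not in self.valid_moves():
            raise ValueError(f"illegal move: {action}")
        self.stones -= action
        terminated = self.stones == 0
        # Reward goes to the player who just moved and took the last stone.
        reward = 1.0 if terminated else 0.0
        if not terminated:
            self.current_player = 1 - self.current_player
        return self.stones, reward, terminated, False, {}


def baseline_policy(env, rng):
    # Assumption: a fixed rule in place of a trained RL agent. It leaves the
    # opponent a multiple of (max_take + 1) stones whenever that is possible.
    target = env.stones % (env.max_take + 1)
    moves = env.valid_moves()
    return target if target in moves else rng.choice(moves)


def random_policy(env, rng):
    # Assumption: a uniformly random player in place of an LLM.
    return rng.choice(env.valid_moves())


def play_game(env, policies, rng):
    """Play one game and return the index (0 or 1) of the winning seat."""
    env.reset()
    while True:
        mover = env.current_player
        _, _, terminated, _, _ = env.step(policies[mover](env, rng))
        if terminated:
            return mover


def win_rate(candidate, baseline, n_games=200, seed=0):
    """Fraction of games the candidate wins, alternating who moves first."""
    rng = random.Random(seed)
    env = TakeAwayEnv()
    wins = 0
    for game in range(n_games):
        seat = game % 2
        policies = [baseline, baseline]
        policies[seat] = candidate
        wins += play_game(env, policies, rng) == seat
    return wins / n_games
```

Calling `win_rate(random_policy, baseline_policy)` gives the kind of per-game score that would then be averaged over many games in a benchmark of this shape. Alternating seats is one simple way to keep first-mover advantage from skewing the number.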

Nick Tomlin @nickatomlin.bsky.social · May 13

I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games

📄: arxiv.org/abs/2505.07215

Title and abstract of the paper, "Measuring General Intelligence with Generated Games"

3 9 33

Reposted by Nick Tomlin

Kyle Mahowald (COLM 2025) @kmahowald.bsky.social · Apr 21

I might be able to hire a postdoc for this fall in computational linguistics at UT Austin. Topics in the general LLM + cognitive space (particularly reasoning, chain of thought, LLMs + code) and LLM + linguistic space. If this could be of interest, feel free to get in touch!

31 60
