Marc Lanctot
@sharky6000.bsky.social
8.3K followers 410 following 1.5K posts
Research Scientist at Google DeepMind, interested in multiagent reinforcement learning, game theory, games, and search/planning. Lover of Linux 🐧, coffee ☕, and retro gaming. Big fan of open-source. #gohabsgo 🇨🇦 For more info: https://linktr.ee/sharky6000
sharky6000.bsky.social
Yup, I read it in February. Great book, lots of awesome color pics, learned a few things! 👍

Was cool to get a first-hand account of the social events around some of the fan art and games of those days.

I only caught the tail end of computer shows in the early '90s, and they were pretty cool.
sharky6000.bsky.social
Argghhh the asterisk on Daniel's name got dropped.

Co-first authors are Wolf, Daniel, and Miguel. Apologies!
Reposted by Marc Lanctot
2ne1.com
What happens when a pioneering artist hands creative control to an AI?

Announcing "Evolution and Foundation," a new London exhibition where Google's Gemini acts as the artist, directing the evolution of complex 3D art.

🗓️ Oct 17-26, 2025
📍 Oxo Gallery, London (@oxotowerwharf)
[Image: a highly detailed black-and-white line drawing of the head of a fantastical "spiky dinosaur" creature from the "Evolution and Foundation" exhibition, shown in profile with cell-like facial scales, a crest of bundled filaments, and a flowing mane of long, beaded spikes; an example of generative evolutionary art created through an algorithmic process.]
sharky6000.bsky.social
With special thanks to @sirbayes.bsky.social for this excellent summary thread of our work! 👍

23/23
sharky6000.bsky.social
Joint work with:

Wolfgang Lehrach*, Daniel Hennes, Miguel Lázaro-Gredilla*, Xinghua Lou, Carter Wendelken, @lizun.bsky.social, Antoine Dedieu, Jordi Grau-Moya, Atil Iscen, John Schultz, Marcus Chiam, @drimgemp.bsky.social, Piotr Zielinski, @satindersingh.bsky.social, @sirbayes.bsky.social 22/N
sharky6000.bsky.social
We even match the performance of MCTS using the ground truth world model on 4 out of 5 of the games (the exception being backgammon). Imperfect information games are harder to learn, and performance is not as good, but we still beat the common LLM-as-policy approach. 20/N
sharky6000.bsky.social
We see that we beat Gemini 2.5 on 4 of the 5 games, and tie it in the case of tic-tac-toe. (Interestingly, Gemini 2.5 often does not even play legal moves, and hence loses by forfeit, even for existing (non-novel) games such as Backgammon which are part of its training set.) 19/N
sharky6000.bsky.social
Below we evaluate our method (in the perfect information setting) when playing against 3 different kinds of opponents: Gemini 2.5 Pro as a policy (using the same data as our method), MCTS with the ground truth world model (an upper bound), and a random policy (a lower bound). 18/N
sharky6000.bsky.social
6 of the games are well-known (Tic-tac-toe, Backgammon, Connect Four; Bargaining, Leduc Poker, Gin Rummy), and 4 of the games are novel (Generalized Tic-Tac-Toe, Generalized Chess; Quadranto, Hand of War) and hence not in the training set of the LLM. 17/N
sharky6000.bsky.social
We apply our method to 10 different two-player games, 5 of which have perfect information (full observability of the world state), and 5 of which have imperfect information (partially observed state). 16/N
sharky6000.bsky.social
In the case of imperfect information games, we use Information Set MCTS (IS-MCTS), which samples states from the posterior using I, and then rolls out hypothetical futures for each action sequence, using V at the leaf nodes, before picking the best action at the root node. 15/N
sharky6000.bsky.social
In addition to learning the world model M=(T,O) and inference function I, we can also learn a value function V, which can be used to estimate the reward-to-go at the leaf nodes explored by the MCTS planner. 14/N
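As a very rough illustration of how I, T, and V fit together at decision time, here is a heavily simplified root-sampling sketch (this is not the full IS-MCTS algorithm, and all names are illustrative stand-ins, not the paper's API):

```python
from collections import defaultdict

def choose_action(obs_history, player, I, T, V, legal_actions, s0, num_samples=32):
    # Simplified root-sampling sketch (not full IS-MCTS): sample plausible current
    # states via I, score each legal action with the learned value function V one
    # step ahead, and pick the action with the highest total value.
    totals = defaultdict(float)
    for _ in range(num_samples):
        actions = I(obs_history, player)   # one imputed joint action history
        state = s0
        for a in actions:                  # deterministic replay yields one current state
            state = T(state, a)
        for a in legal_actions:
            totals[a] += V(T(state, a), player)
    return max(legal_actions, key=lambda a: totals[a])
```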
sharky6000.bsky.social
To test that the inference function is correct, we check that replaying the imputed actions through the composition of T and O reconstructs the observed trajectories, similar to an auto-encoder. 13/N
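A sketch of what such a reconstruction check could look like, assuming (purely for illustration) that O maps a state to a tuple of per-player observations:

```python
def inference_is_consistent(obs_history, player, I, T, O, s0):
    # Replay the imputed joint actions through T, decode with O, and check that the
    # player's original observations are reproduced. Here we assume obs_history[t] is
    # what the player observes after the (t+1)-th joint action; the paper's indexing
    # conventions may differ.
    imputed_actions = I(obs_history, player)
    state = s0
    for t, action in enumerate(imputed_actions):
        state = T(state, action)
        if O(state)[player] != obs_history[t]:
            return False
    return True
```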
sharky6000.bsky.social
Since the transition function T is deterministic, we can infer the current state s(t) by recursively applying s(t) = T(s(t-1), a(t, tau(t))), where tau(t) is the player whose turn occurs at step t. Thus a distribution over action trajectories (given by I) induces a distribution over states. 12/N
sharky6000.bsky.social
For imperfect-information games, the LLM must also synthesize an inference function I that maps the observation history for each player i, o(1:t, i), to a plausible sequence of actions taken by all the players, a(1:t,0:N), including the chance player (number 0). 11/N
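Stated as a type signature (purely illustrative, not the paper's API), I looks roughly like this:

```python
from typing import Callable, Sequence

Obs = str
Action = tuple[int, str]   # (acting player, action label); player 0 is the chance player

# Given player i's observation history o(1:t, i), return one plausible joint action
# sequence a(1:t, 0:N); repeated calls yield samples from a posterior over histories.
InferenceFn = Callable[[Sequence[Obs], int], list[Action]]
```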
sharky6000.bsky.social
Below we show an example of a test for T for tic-tac-toe, where the action corresponds to x-player choosing location (1,0). 10/N
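The attached image shows the actual test; as a rough sketch of the same idea under a made-up encoding (state = (3x3 board, player to move), action = (player, row, col), neither of which is necessarily the paper's):

```python
def T(state, action):
    # Illustrative tic-tac-toe transition; the synthesized code in the paper may differ.
    board, to_move = state
    player, row, col = action
    assert player == to_move and board[row][col] == "."
    new_board = [r.copy() for r in board]
    new_board[row][col] = player
    return (new_board, "o" if player == "x" else "x")

s = ([[".", ".", "."], [".", ".", "."], [".", ".", "."]], "x")
a = ("x", 1, 0)                                        # x-player chooses location (1, 0)
s_prime = ([[".", ".", "."], ["x", ".", "."], [".", ".", "."]], "o")
assert T(s, a) == s_prime                              # unit test of the form T(s, a) = s'
```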
sharky6000.bsky.social
This is represented as a set of unit-tests of the form T(s,a)=s', where (s,a,s') is a state-action-next-state tuple from the trajectory, and O(s)=o, where (s,o) is a state-observation tuple from the trajectory. 9/N
sharky6000.bsky.social
We use a code synthesis method based on LLM-powered code refinement combined with Thompson sampling over a tree of partial programs. For perfect-information games, the LLM must synthesize a deterministic transition function T and a deterministic observation function O that match the observed trajectories. 8/N
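A toy sketch of the Thompson-sampling part of that loop, over a flat set of candidates rather than the paper's tree of partial programs; `llm_refine` and `run_unit_tests` are placeholders for an LLM call and a test harness, not real APIs:

```python
import random

def llm_refine(program_text: str) -> str:
    return program_text + "\n# refined"            # stand-in for an LLM refinement call

def run_unit_tests(program_text: str) -> tuple[int, int]:
    return random.randint(0, 10), 10               # stand-in: (tests passed, tests total)

def thompson_synthesis(seed_program: str, iterations: int = 50) -> str:
    # Each candidate keeps Beta(passed + 1, failed + 1) pseudo-counts from its unit tests.
    stats = {seed_program: (1, 1)}
    best, best_rate = seed_program, 0.0
    for _ in range(iterations):
        # Thompson sampling: draw from each candidate's Beta posterior, refine the winner.
        parent = max(stats, key=lambda p: random.betavariate(*stats[p]))
        child = llm_refine(parent)
        passed, total = run_unit_tests(child)
        stats[child] = (passed + 1, total - passed + 1)
        if passed / total > best_rate:
            best, best_rate = child, passed / total
    return best
```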
sharky6000.bsky.social
In both cases, we supplement the trajectory data with some background information about the game, represented in natural language text. (This can be thought of as prior information, needed to compensate for the fact that we only observe 5 training trajectories.) 7/N
sharky6000.bsky.social
We consider two scenarios for learning the world model: open-deck, where the offline training trajectories consist of (state, observation, action) triples generated from an initial random exploratory policy, and closed-deck, where the training trajectories do not include the ground truth state. 6/N
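In terms of data, one step of a training trajectory might be represented like this (field names are illustrative, not the paper's):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Step:
    # One step of an offline training trajectory. In the open-deck setting every step
    # carries the ground-truth state; in the closed-deck setting the state is simply
    # unavailable (None) and only observations and actions remain.
    observation: Any
    action: Any
    state: Optional[Any] = None
```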
sharky6000.bsky.social
Each player sees its own private observation stream, but there can also be common knowledge (e.g., the initial world state, before chance deals the cards or rolls the dice). Below we show the observations available to player 1, just before its second turn, using circles with thick black boundaries. 5/N
sharky6000.bsky.social
The world model can be interpreted as a Partially Observed Stochastic Game, which is represented below as a causal graphical model. All transitions are deterministic, but stochasticity can be introduced via hidden random actions chosen by the chance player. 4/N
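A toy illustration of that convention, with a dice roll entering only through the chance player's action (the state and action encodings are made up for this sketch):

```python
import random

def T(state, action):
    # The transition itself is deterministic; randomness enters only via player 0 (chance).
    player, payload = action
    if player == 0:                                   # chance: payload is the dice outcome
        return {**state, "dice": payload}
    return {**state, "last_move": (player, payload)}  # a regular player's deterministic move

def sample_chance_action():
    return (0, (random.randint(1, 6), random.randint(1, 6)))

state = {"dice": None, "last_move": None}
state = T(state, sample_chance_action())              # stochasticity via the chance action
state = T(state, (1, "24/18"))                        # player 1 moves deterministically
```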
sharky6000.bsky.social
We apply our method to various two-player games (with both perfect and imperfect information), and show that it works much better than prompting the LLM to directly generate actions, especially for novel games. In particular, we beat Gemini 2.5 Pro in 7/10 games, and tie it in 2/10 games. 3/N
sharky6000.bsky.social
The key idea is to use LLM-powered code synthesis to learn a code world model from (observation, action) trajectories, plus some background information, and then to pass this induced world model, plus the observation history, to an existing solver (i.e., information-set MCTS) to choose the next action. 2/N
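In pseudocode-ish Python, the loop looks roughly like this (every name is an illustrative stand-in, not the paper's API):

```python
def play_game(env, synthesize_world_model, choose_action_with_ismcts,
              offline_trajectories, background_text):
    # 1) LLM-powered code synthesis: induce the world model (T, O, I) as executable code
    #    from a handful of (observation, action) trajectories plus background text.
    world_model = synthesize_world_model(offline_trajectories, background_text)

    # 2) At play time, hand the induced world model and the observation history so far
    #    to an existing solver (information-set MCTS) to pick each action.
    obs, done = env.reset(), False
    obs_history = [obs]
    while not done:
        action = choose_action_with_ismcts(world_model, obs_history)
        obs, done = env.step(action)
        obs_history.append(obs)
```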