anianruoss.bsky.social
As a sanity check, we also evaluate how well frontier models can replay the actions from a single demonstration episode (i.e., teacher-forcing; normally we evaluate dynamically, with the model acting in the environment).

Most models perform well, with the exception of o1-mini, which fails across most tasks.
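
A minimal sketch of the difference between the two evaluation modes (illustrative names, not the benchmark's actual API):

```python
# Illustrative sketch: `model`, `env`, `demos`, and `episode` are
# hypothetical stand-ins, not the benchmark's actual API.

def replay_accuracy(model, episode):
    """Teacher-forcing: the context always contains the *expert's*
    trajectory, so the model only has to copy the next expert action."""
    correct = 0
    for t, (obs, expert_action) in enumerate(episode):
        context = episode[:t] + [(obs, None)]  # expert history up to step t
        if model.predict_action(context) == expert_action:
            correct += 1
    return correct / len(episode)


def dynamic_return(model, env, demos):
    """Dynamic evaluation: the model acts in the environment, so its own
    (possibly wrong) actions feed back into the context."""
    obs, done, total_reward, history = env.reset(), False, 0.0, []
    while not done:
        action = model.predict_action(demos + history + [(obs, None)])
        history.append((obs, action))
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```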

5/N
anianruoss.bsky.social
We pressure-test frontier models' in-context imitation learning, using context sizes of up to 1M tokens and up to 10k output ("reasoning") tokens.

For o1-mini/o1-preview, performance crucially depends on having many (at least 8192) output tokens, even in simple decision-making tasks.
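
Roughly (a hypothetical sketch; `query_model` is an illustrative stand-in, not any specific provider's API):

```python
def query_model(prompt: str, max_output_tokens: int) -> str:
    ...  # hypothetical stand-in: call the model with a capped output budget

# Sweep the output ("reasoning") token budget for a fixed task prompt.
for budget in (1024, 2048, 4096, 8192):
    answer = query_model(prompt="<demos + current observation>",
                         max_output_tokens=budget)
    # For o1-mini/o1-preview, only budgets of at least 8192 yield good actions.
```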

4/N
anianruoss.bsky.social
We evaluate most tasks with different multimodal observation formats (e.g., ASCII, RGB images).

On some tasks, certain models show strong in-context imitation learning (e.g., Gemini 1.5 below). On others, performance does not improve with the number of expert demonstration episodes.
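
For intuition, here is an illustrative sketch of two formats for the same state (a tic-tac-toe board); the benchmark's actual renderers may differ:

```python
import numpy as np

board = [["x", ".", "o"],
         [".", "x", "."],
         ["o", ".", "x"]]

# Text observation: the board serialized as ASCII.
ascii_obs = "\n".join(" ".join(row) for row in board)

# Image observation: the same board rasterized into an RGB array,
# one colored pixel per square (a real renderer would upscale this).
colors = {"x": (255, 0, 0), "o": (0, 0, 255), ".": (255, 255, 255)}
rgb_obs = np.array([[colors[c] for c in row] for row in board],
                   dtype=np.uint8)  # shape (3, 3, 3)
```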

3/N
anianruoss.bsky.social
We evaluate
- Phoenix (Atari)
- chess vs the weakest version of Stockfish
- crosswords
- cheetah run (DM Control)
- grid world navigation
- tic-tac-toe vs a random opponent

We compare against a random baseline and an expert policy and use up to 512 expert demonstration episodes.
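
A rough sketch of how the demonstration episodes enter the model's context (formatting and names are illustrative, not the paper's exact protocol):

```python
def build_prompt(demos, current_history, current_obs, n_demos=512):
    """Illustrative stand-in: serialize expert demos, then the current
    episode so far, and ask for the next action."""
    parts = []
    for i, episode in enumerate(demos[:n_demos]):
        parts.append(f"Demonstration {i + 1}:")
        for obs, action in episode:
            parts.append(f"Observation:\n{obs}\nAction: {action}")
    parts.append("Current episode:")
    for obs, action in current_history:
        parts.append(f"Observation:\n{obs}\nAction: {action}")
    parts.append(f"Observation:\n{current_obs}\nAction:")
    return "\n".join(parts)
```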

2/N
anianruoss.bsky.social
Ever wonder how well frontier models (Claude 3.5 Sonnet, Gemini 1.5 Flash & Pro, GPT-4o, o1-mini & o1-preview) play Atari, chess, or tic-tac-toe?

We present LMAct, an in-context imitation learning benchmark with long multimodal demonstrations (arxiv.org/abs/2412.01441).

🧵 1/N