anianruoss.bsky.social
As a sanity check, we also evaluate how well frontier models can replay the actions from a single demonstration episode (i.e., teacher-forcing; normally we evaluate dynamically, with the model acting in the environment).

Most models perform well, with the exception of o1-mini, which fails across most tasks.
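
A minimal sketch of the difference between the two evaluation modes (illustrative names, not the benchmark's actual API):

```python
# Illustrative sketch: `model`, `env`, `demos`, and `episode` are
# hypothetical stand-ins, not the benchmark's actual API.

def replay_accuracy(model, episode):
    """Teacher-forcing: the context always contains the *expert's*
    trajectory, so the model only has to copy the next expert action."""
    correct = 0
    for t, (obs, expert_action) in enumerate(episode):
        context = episode[:t] + [(obs, None)]  # expert history up to step t
        if model.predict_action(context) == expert_action:
            correct += 1
    return correct / len(episode)


def dynamic_return(model, env, demos):
    """Dynamic evaluation: the model acts in the environment, so its own
    (possibly wrong) actions feed back into the context."""
    obs, done, total_reward, history = env.reset(), False, 0.0, []
    while not done:
        action = model.predict_action(demos + history + [(obs, None)])
        history.append((obs, action))
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```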

5/N
anianruoss.bsky.social
We pressure-test frontier models' in-context imitation learning, using context sizes of up to 1M tokens and up to 10k output ("reasoning") tokens.

For o1-mini/o1-preview, performance crucially depends on having many (at least 8192) output tokens, even in simple decision-making tasks.
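
Roughly (a hypothetical sketch; `query_model` is an illustrative stand-in, not any specific provider's API):

```python
def query_model(prompt: str, max_output_tokens: int) -> str:
    ...  # hypothetical stand-in: call the model with a capped output budget

# Sweep the output ("reasoning") token budget for a fixed task prompt.
for budget in (1024, 2048, 4096, 8192):
    answer = query_model(prompt="<demos + current observation>",
                         max_output_tokens=budget)
    # For o1-mini/o1-preview, only budgets of at least 8192 yield good actions.
```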

4/N
anianruoss.bsky.social
We evaluate most tasks with different multimodal observation formats (e.g., ASCII, RGB images).

On some tasks, certain models show strong in-context imitation learning (e.g., Gemini 1.5 below). On others, performance does not improve with the number of expert demonstration episodes.
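
For intuition, here is an illustrative sketch of two formats for the same state (a tic-tac-toe board); the benchmark's actual renderers may differ:

```python
import numpy as np

board = [["x", ".", "o"],
         [".", "x", "."],
         ["o", ".", "x"]]

# Text observation: the board serialized as ASCII.
ascii_obs = "\n".join(" ".join(row) for row in board)

# Image observation: the same board rasterized into an RGB array,
# one colored pixel per square (a real renderer would upscale this).
colors = {"x": (255, 0, 0), "o": (0, 0, 255), ".": (255, 255, 255)}
rgb_obs = np.array([[colors[c] for c in row] for row in board],
                   dtype=np.uint8)  # shape (3, 3, 3)
```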

3/N
anianruoss.bsky.social
We evaluate
- Phoenix (Atari)
- chess vs the weakest version of Stockfish
- crosswords
- cheetah run (DM Control)
- grid world navigation
- tic-tac-toe vs a random opponent

We compare against a random baseline and an expert policy and use up to 512 expert demonstration episodes.
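
A rough sketch of how the demonstration episodes enter the model's context (formatting and names are illustrative, not the paper's exact protocol):

```python
def build_prompt(demos, current_history, current_obs, n_demos=512):
    """Illustrative stand-in: serialize expert demos, then the current
    episode so far, and ask for the next action."""
    parts = []
    for i, episode in enumerate(demos[:n_demos]):
        parts.append(f"Demonstration {i + 1}:")
        for obs, action in episode:
            parts.append(f"Observation:\n{obs}\nAction: {action}")
    parts.append("Current episode:")
    for obs, action in current_history:
        parts.append(f"Observation:\n{obs}\nAction: {action}")
    parts.append(f"Observation:\n{current_obs}\nAction:")
    return "\n".join(parts)
```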

2/N
anianruoss.bsky.social
Ever wonder how well frontier models (Claude 3.5 Sonnet, Gemini 1.5 Flash & Pro, GPT-4o, o1-mini & o1-preview) play Atari, chess, or tic-tac-toe?

We present LMAct, an in-context imitation learning benchmark with long multimodal demonstrations (arxiv.org/abs/2412.01441).

🧵 1/N