On (2), that's an interesting question. My instinct: reasoning is just search with non-random guesses that are informed by the previous steps in the search.
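(To make that instinct concrete, here's a toy best-first search in Python. The number-line puzzle and the heuristic are invented purely for illustration; this is just a sketch of "search with non-random guesses informed by previous steps", not a claim about what the models do internally.)

```python
import heapq

def best_first_search(start, goal, neighbors, heuristic):
    # The frontier is ordered by the heuristic score, so expansion order is a
    # sequence of non-random guesses; the score can depend on the whole path
    # taken so far, i.e. the previous steps in the search.
    frontier = [(heuristic([start], goal), [start])]
    seen = {start}
    while frontier:
        _, path = heapq.heappop(frontier)
        state = path[-1]
        if state == goal:
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                new_path = path + [nxt]
                heapq.heappush(frontier, (heuristic(new_path, goal), new_path))
    return None

# Made-up toy puzzle: reach 17 from 0 using the moves +1, +3, or *2.
neighbors = lambda n: [n + 1, n + 3, n * 2]
# Guess quality: prefer states close to the goal, with a small penalty on
# path length so shorter solutions are tried first.
heuristic = lambda path, goal: abs(goal - path[-1]) + 0.1 * len(path)
print(best_first_search(0, 17, neighbors, heuristic))  # [0, 3, 6, 12, 15, 16, 17]
```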
What level of bizarre trip-up on some tasks is needed to invalidate performance on others?
Notably, some humans exhibit similar oddities (e.g. really good at advanced math, yet incapable at more basic tasks).
At a high level, that's also what humans do, only we use very different relationships to do the reasoning.
(1) These are interesting questions, but note that the OP wasn't asking for good data on how human-like the models' reasoning is.
(2) I don't agree that Gemini's solution doesn't count as abstract reasoning. It's not how humans do it, but it's still a form of reasoning.
I guess the question is whether you could train these models on broader tasks and still see some transfer to ARC.
arxiv.org/abs/2510.04871
And for some of these benchmarks, like ARC, the leaderboard is public - you can go compare all the models in a totally fair way!
There are lots of results for various LLMs on various benchmarks, e.g. for Gemini 3:
storage.googleapis.com/deepmind-med...
And I wouldn't say it's "cherry-picked": the results span several different domains.
What is missing for you?