Blake Richards
@tyrellturing.bsky.social
Researcher at Google and CIFAR Fellow, working on the intersection of machine learning and neuroscience in Montréal (academic affiliations: @mcgill.ca and @mila-quebec.bsky.social).
Yeah, on (1) I do think, per what @melaniemitchell.bsky.social says, it shows that ARC is not quite testing what it claims to be testing.

On (2), that's an interesting question. My instinct: reasoning is just search with non-random guesses that are informed by the previous steps in the search.
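To make that picture concrete, here's a minimal sketch of reasoning-as-informed-search: a best-first search where each guess is scored by a heuristic computed from the state the previous steps reached, rather than drawn at random. The task (reach a target integer with +3 and *2 moves) and every name here are invented for illustration.

```python
import heapq

# Toy operator set: each "step" in the search applies one of these.
OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2}

def informed_search(start: int, target: int, max_steps: int = 20):
    # Frontier ordered by a heuristic (distance to target) computed from the
    # state the previous steps produced - guesses are informed, not random.
    frontier = [(abs(target - start), start, [])]
    seen = {start}
    while frontier:
        _, state, path = heapq.heappop(frontier)
        if state == target:
            return path
        if len(path) >= max_steps:
            continue
        for name, op in OPS.items():
            nxt = op(state)
            if nxt not in seen and nxt <= 2 * target:
                seen.add(nxt)
                heapq.heappush(frontier, (abs(target - nxt), nxt, path + [name]))
    return None

print(informed_search(1, 26))  # prints a short op sequence reaching 26
```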
January 8, 2026 at 4:43 PM
Good point. Indeed, it shows that ARC is not achieving what it was originally claimed to achieve as a benchmark.
January 8, 2026 at 4:41 PM
I guess there's an interesting question here, though:

What level of bizarre trip-up on some other tasks is needed to invalidate performance on others?

Notably, some humans can sometimes exhibit such oddities (e.g. really good at advanced math, incapable of more basic tasks).
January 8, 2026 at 4:40 PM
I would call it "reasoning" because it notices a relationship that holds in the examples it was provided, then uses that relationship to get the correct answer in a roughly logical manner.

At a high level, that's also what humans do, only we use very different relationships to do the reasoning.
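A toy version of that loop (hypotheses and examples invented for the sketch): search a small space of candidate relationships for one that holds in every example, then use it on a new input.

```python
# Candidate relationships; in a real system this space would be vastly
# larger and learned, but the loop is the same: find a rule that fits
# all the provided examples, then apply it.
HYPOTHESES = {
    "double":  lambda x: 2 * x,
    "square":  lambda x: x * x,
    "add_ten": lambda x: x + 10,
}

def infer_rule(examples):
    """Return the first hypothesis consistent with every (input, output) pair."""
    for name, fn in HYPOTHESES.items():
        if all(fn(x) == y for x, y in examples):
            return name, fn
    return None, None

name, rule = infer_rule([(3, 9), (5, 25), (2, 4)])
print(name, rule(6))  # square 36
```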
January 7, 2026 at 10:25 PM
Two responses:

(1) These are interesting questions, but note that the OP wasn't asking for good data on how human-like the models' reasoning is.

(2) I don't agree that Gemini's solution doesn't count as abstract reasoning. It's not how humans do it, but it's still a form of reasoning.
January 7, 2026 at 10:14 PM
Well, I think the real point is that ARC tests a unique type of reasoning: inferring a pattern from a few examples. Note that though the models are trained on many examples of such patterns, each pattern is unique, so the model still has to learn to extrapolate from a few (~3-4) examples.
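For reference, an ARC task is structured roughly like this; the grids and the rule (a horizontal flip) are invented for the sketch, and real tasks use larger grids and much stranger rules.

```python
# Roughly the public ARC task format: a handful of train pairs plus a test
# input. Every task encodes a different rule, so the solver must extrapolate
# from the few demonstrations alone.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        {"input": [[3, 3], [0, 3]], "output": [[3, 3], [3, 0]]},
    ],
    "test": [{"input": [[4, 0], [4, 4]]}],
}

# The hidden rule in this toy task is a horizontal flip.
flip_h = lambda g: [row[::-1] for row in g]
assert all(flip_h(ex["input"]) == ex["output"] for ex in task["train"])
print(flip_h(task["test"][0]["input"]))  # [[0, 4], [4, 4]]
```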
January 7, 2026 at 10:09 PM
Yes, a reasonably objective evaluation of the models' abilities to learn to solve ARC tasks (and their ilk).
January 7, 2026 at 9:50 PM
Yeah, exactly. Like, is it a fair comparison to people? No, not really. But does it at least give a reasonably objective evaluation of the models' capabilities? I would say yes.
January 7, 2026 at 9:40 PM
I mean, yes, it's not *general*, but neither is the benchmark itself, really.

I guess the question is whether you could train these models on broader tasks and still see some transfer to ARC.
January 7, 2026 at 9:34 PM
Have they? I'm not sure that's true. The Pareto frontier on that benchmark has been moving forward via some pretty interesting strategies that work in several other domains, e.g., tiny recursive models:

arxiv.org/abs/2510.04871
Less is More: Recursive Reasoning with Tiny Networks
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard ...
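Going off the abstract above, a schematic of that two-frequency recursion might look like the following; the dimensions, update rules, and period k are assumptions for illustration, not the paper's actual architecture.

```python
# A schematic of the HRM-style idea from the abstract: two small networks
# refine a shared answer state at different frequencies (a fast low-level
# module every step, a slow high-level module every k steps).
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden size

# Two tiny "networks": single random linear layers with a tanh nonlinearity.
W_low  = rng.normal(scale=0.3, size=(2 * D, D))   # fast module
W_high = rng.normal(scale=0.3, size=(2 * D, D))   # slow module

def step(x, W):
    return np.tanh(x @ W)

def recurse(inp, n_steps=12, k=4):
    z_low  = np.zeros(D)  # fast, fine-grained state
    z_high = np.zeros(D)  # slow, abstract state
    for t in range(n_steps):
        # Low-level module updates every step, conditioned on the input
        # and the current high-level state.
        z_low = step(np.concatenate([inp + z_high, z_low]), W_low)
        # High-level module updates only every k steps, reading the
        # low-level result back in.
        if (t + 1) % k == 0:
            z_high = step(np.concatenate([z_high, z_low]), W_high)
    return z_high  # in a full model this would feed an output head

print(recurse(rng.normal(size=D)).shape)  # (16,)
```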
January 7, 2026 at 4:44 PM
No, I don't think so - e.g. you won't find any similar results from Meta or Apple as far as I know (because their models are not as performant as Google's, OpenAI's, etc.).

And for some of these benchmarks, like ARC, the leaderboard is public - you can go compare all the models in a totally fair way!
January 7, 2026 at 4:43 PM
What do you mean by "measuring performance", though?

There are lots of results for various LLMs on various benchmarks, e.g. for Gemini 3:

storage.googleapis.com/deepmind-med...

And I wouldn't say it's "cherry picked". It's from several different domains.

What is missing for you?
January 7, 2026 at 2:46 PM