On (2), that's an interesting question. My instinct: reasoning is just search with non-random guesses that are informed by the previous steps in the search.
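(To make that instinct concrete, here's a toy best-first search in Python. The number-line puzzle and the heuristic are invented purely for illustration; this is just a sketch of "search with non-random guesses informed by previous steps", not a claim about what the models do internally.)

```python
import heapq

def best_first_search(start, goal, neighbors, heuristic):
    # The frontier is ordered by the heuristic score, so expansion order is a
    # sequence of non-random guesses; the score can depend on the whole path
    # taken so far, i.e. the previous steps in the search.
    frontier = [(heuristic([start], goal), [start])]
    seen = {start}
    while frontier:
        _, path = heapq.heappop(frontier)
        state = path[-1]
        if state == goal:
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                new_path = path + [nxt]
                heapq.heappush(frontier, (heuristic(new_path, goal), new_path))
    return None

# Made-up toy puzzle: reach 17 from 0 using the moves +1, +3, or *2.
neighbors = lambda n: [n + 1, n + 3, n * 2]
# Guess quality: prefer states close to the goal, with a small penalty on
# path length so shorter solutions are tried first.
heuristic = lambda path, goal: abs(goal - path[-1]) + 0.1 * len(path)
print(best_first_search(0, 17, neighbors, heuristic))  # [0, 3, 6, 12, 15, 16, 17]
```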
What level of bizarre trip-up on some tasks is needed to invalidate performance on others?
Notably, some humans exhibit similar oddities (e.g. really good at advanced math, yet incapable at more basic tasks).
At a high level, that's also what humans do, only we use very different relationships to do the reasoning.
(1) These are interesting questions, but note that the OP wasn't asking for good data on how human-like the models' reasoning is.
(2) I don't agree that Gemini's solution doesn't count as abstract reasoning. It's not how humans do it, but it's still a form of reasoning.
I guess the question is whether you could train these models on broader tasks and still see some transfer to ARC.
arxiv.org/abs/2510.04871
And for some of these benchmarks, like ARC, the leaderboard is public - you can go compare all the models in a totally fair way!
There are lots of results for various LLMs on various benchmarks, e.g. for Gemini 3:
storage.googleapis.com/deepmind-med...
And I wouldn't say it's "cherry-picked": the results span several different domains.
What is missing for you?