Melanie Mitchell
@melaniemitchell.bsky.social
24K followers 400 following 540 posts
Professor, Santa Fe Institute. Research on AI, cognitive science, and complex systems. Website: https://melaniemitchell.me Substack: https://aiguide.substack.com/
melaniemitchell.bsky.social
Evaluation of reasoning, and reasoning about evaluation -- both understudied, imo
Reposted by Melanie Mitchell
sfiscience.bsky.social
Reserve your free tickets to SFI’s next Community Lecture with renowned developmental psychologist Alison Gopnik:

“Transmission Versus Truth: What Will It Take to Make an AI as Smart as a 4-Year-Old?”

October 21, 7:30 pm at The Lensic Performing Arts Center.

Tickets: lensic.org/events/aliso...
melaniemitchell.bsky.social
On the other hand, accuracy alone may be *underestimating* this ability in visual settings

It is essential to go beyond accuracy in evaluating such capabilities!

Paper: arxiv.org/abs/2510.02125

Blog post: aiguide.substack.com/p/do-ai-reas...

🧵 10/10
Do AI Reasoning Models Abstract and Reason Like Humans?
Going beyond simple accuracy for evaluating abstraction abilities
aiguide.substack.com
melaniemitchell.bsky.social
Conclusions: Evaluations like those of the ARC Prize, using accuracy alone, may be *overestimating* the abstract reasoning ability of these models in the textual setting.

🧵 9/10
melaniemitchell.bsky.social
With visual inputs, these models all do quite poorly at generating accurate grids. But they do manage to state the correct-as-intended rule considerably more often than they generate the correct grid.

🧵 8/10
melaniemitchell.bsky.social
We found that while reasoning models approach or exceed human accuracy on these tasks with textual inputs, they are substantially more prone than humans to use unintended “shortcuts” to solve the tasks.

🧵 7/10
melaniemitchell.bsky.social
We manually rate each generated rule as “correct as intended” (captures the intended abstractions), “correct but unintended” (works for the demonstrations but doesn’t capture the intended abstractions), or “incorrect” (does not work for the demonstrations).

🧵 6/10
melaniemitchell.bsky.social
We ask the models not only to generate an output grid, but also to state the natural-language transformation rule that describes the demonstrations and can be applied to the test input.

🧵 5/10
melaniemitchell.bsky.social
We investigated this question by running several reasoning models on the tasks in ConceptARC, a benchmark in the ARC domain whose tasks isolate various core concepts. We experimented with both textual inputs (like those used in the ARC Prize competition) and visual inputs.

🧵 4/10
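[For context on the “textual inputs” mentioned above, here is a minimal sketch of how an ARC-style task might be serialized into a text prompt that asks a model for both a transformation rule and an output grid. The integer-grid representation follows the public ARC format, but the helper names (grid_to_text, build_prompt), the toy task, and the prompt wording are illustrative assumptions, not the prompts used in the paper.]

```python
# Sketch only: serialize an ARC-style task into a textual prompt that asks
# for a natural-language rule plus an output grid. Grid cells are integers
# 0-9, as in the public ARC task format; everything else is assumed.
from typing import List, Tuple

Grid = List[List[int]]

def grid_to_text(grid: Grid) -> str:
    """Render a grid as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(demonstrations: List[Tuple[Grid, Grid]], test_input: Grid) -> str:
    """Assemble a prompt from demonstration input/output pairs and a test input."""
    parts = []
    for i, (inp, out) in enumerate(demonstrations, start=1):
        parts.append(f"Demonstration {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Demonstration {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append(
        "First state, in plain language, the transformation rule that maps "
        "each demonstration input to its output. Then apply that rule to the "
        "test input and give the resulting output grid."
    )
    return "\n\n".join(parts)

if __name__ == "__main__":
    # Toy task whose intended rule is "recolor every 1 to 2".
    demos = [
        ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
        ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
    ]
    print(build_prompt(demos, [[1, 0], [0, 1]]))
```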
melaniemitchell.bsky.social
AI reasoning models have exceeded average human accuracy on ARC-AGI-1!!

But are they getting the right answers for the “right” reasons – i.e., grasping the intended abstractions?

Or are they solving tasks using less abstract, less generalizable, unintended patterns (“shortcuts”)?

🧵 3/10
melaniemitchell.bsky.social
ARC-AGI-1 is meant to be a test of abstract reasoning, based on “core knowledge” priors such as objectness, along with spatial/geometric concepts (“inside vs. outside”, “top vs. bottom”), semantic concepts (“same vs. different”), and numerical concepts (“largest vs. smallest”).

🧵 2/10
Reposted by Melanie Mitchell
shannonvallor.bsky.social
Since it resonated with the audience, I’ll recap my main argument against AGI here. ‘General intelligence’ is like phlogiston, or the aether. It’s an outmoded scientific concept that does not refer to anything real. Any explanatory work it did can be done better by a richer scientific frame. 1/3
shannonvallor.bsky.social
This was a truly heartening day, with deeply thoughtful challenges to the dominant narrative framed around AGI, coming from across disciplines and perspectives. Felt like the tide might finally be turning a bit, at least among the scientific community. Thanks @royalsociety.org!
anilseth.bsky.social
1/2 I'm looking forward to taking part in a panel on AGI and the Turing Test, tomorrow afternoon (Thurs 2nd Oct) at the @royalsociety.org, w/ Dame Wendy Hall, Shannon Vallor, William Isaac, & Sir Nigel Shadbolt. royalsociety.org/science-even...
melaniemitchell.bsky.social
And over-generalized "computer" :-)
melaniemitchell.bsky.social
Doubting the usefulness of the TT is not "moving the goal posts"; it's progress in our understanding of what intelligence is.
melaniemitchell.bsky.social
Andrew Ng: "AI is the new electricity!"

Cory Doctorow: "AI is the asbestos we are shoveling into the walls of our society and our descendants will be digging it out for generations."

¯\_(ツ)_/¯

(pluralistic.net/2025/09/27/e...)
melaniemitchell.bsky.social
We desperately need Dem leaders like @chrismurphyct.bsky.social, @govpritzker.illinois.gov, @aoc.bsky.social and others who are able and willing to fight like hell. As opposed to the current leadership in Congress, who don't seem to appreciate the stakes.
melaniemitchell.bsky.social
⬇️⬇️
vancleve.theoretical.bio
Please everyone post comments on this rule!

This rule change will be devastating to our valued international students and scholars who contribute tremendously to US universities and other institutions with STEM research and development.
ianlmorgan.bsky.social
DHS is proposing a rule that would end "duration of status" for J, F and other visa holders. This would decimate the international graduate student and postdoc population, which are a crucial part of the United States biomedical workforce. There's still time for you to comment on the proposed rule.
melaniemitchell.bsky.social
Great thread -- and it makes me miss Portland, where I lived for almost 20 years. Keep federal troops OUT of Portland!
wyden.senate.gov
Portlanders — show off your corner of war-ravaged Portland below.
katzpeejays.bsky.social
Our corner of war-ravaged Portland.
melaniemitchell.bsky.social
In this vein, @mraginsky.bsky.social's "Artificial Intelligence as Sorcery" is also excellent: realizable.substack.com/p/artificial...