Melanie Mitchell
@melaniemitchell.bsky.social
24K followers 400 following 540 posts
Professor, Santa Fe Institute. Research on AI, cognitive science, and complex systems. Website: https://melaniemitchell.me Substack: https://aiguide.substack.com/
melaniemitchell.bsky.social
Evaluation of reasoning, and reasoning about evaluation -- both understudied, imo
Reposted by Melanie Mitchell
sfiscience.bsky.social
Reserve your free tickets to SFI’s next Community Lecture with renowned developmental psychologist Alison Gopnik:

“Transmission Versus Truth: What Will It Take to Make an AI as Smart as a 4-Year-Old?”

October 21, 7:30 pm at The Lensic Performing Arts Center.

Tickets: lensic.org/events/aliso...
melaniemitchell.bsky.social
On the other hand, accuracy alone may be *underestimating* this ability in visual settings

It is essential to go beyond accuracy in evaluating such capabilities!

Paper: arxiv.org/abs/2510.02125

Blog post: aiguide.substack.com/p/do-ai-reas...

🧵 10/10
Do AI Reasoning Models Abstract and Reason Like Humans?
Going beyond simple accuracy for evaluating abstraction abilities
aiguide.substack.com
melaniemitchell.bsky.social
Conclusions: Evaluations like those of the ARC Prize, using accuracy alone, may be *overestimating* the abstract reasoning ability of these models in the textual setting.

🧵 9/10
melaniemitchell.bsky.social
With visual inputs, these models all do quite poorly at generating accurate grids. But they do manage to state the correct-as-intended rule considerably more often than they generate the correct grid.

🧵 8/10
melaniemitchell.bsky.social
We found that while reasoning models approach or exceed human accuracy on these tasks with textual inputs, they are substantially more prone than humans to use unintended “shortcuts” to solve the tasks.

🧵 7/10
melaniemitchell.bsky.social
We manually rate each generated rule as “correct as intended” (captures the intended abstractions), “correct but unintended” (works for the demonstrations but doesn’t capture the intended abstractions), or “incorrect” (does not work for the demonstrations).

🧵 6/10
melaniemitchell.bsky.social
We ask the models not only to generate an output grid, but also to state the natural-language transformation rule that describes the demonstrations and can be applied to the test input.

🧵 5/10
melaniemitchell.bsky.social
We investigated this question by running several reasoning models on the tasks in ConceptARC, a benchmark in the ARC domain whose tasks isolate various core concepts. We experimented with both textual inputs (like those used in the ARC Prize competition) and visual inputs.

🧵 4/10
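[For context on the “textual inputs” mentioned above, here is a minimal sketch of how an ARC-style task might be serialized into a text prompt that asks a model for both a transformation rule and an output grid. The integer-grid representation follows the public ARC format, but the helper names (grid_to_text, build_prompt), the toy task, and the prompt wording are illustrative assumptions, not the prompts used in the paper.]

```python
# Sketch only: serialize an ARC-style task into a textual prompt that asks
# for a natural-language rule plus an output grid. Grid cells are integers
# 0-9, as in the public ARC task format; everything else is assumed.
from typing import List, Tuple

Grid = List[List[int]]

def grid_to_text(grid: Grid) -> str:
    """Render a grid as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(demonstrations: List[Tuple[Grid, Grid]], test_input: Grid) -> str:
    """Assemble a prompt from demonstration input/output pairs and a test input."""
    parts = []
    for i, (inp, out) in enumerate(demonstrations, start=1):
        parts.append(f"Demonstration {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Demonstration {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append(
        "First state, in plain language, the transformation rule that maps "
        "each demonstration input to its output. Then apply that rule to the "
        "test input and give the resulting output grid."
    )
    return "\n\n".join(parts)

if __name__ == "__main__":
    # Toy task whose intended rule is "recolor every 1 to 2".
    demos = [
        ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
        ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
    ]
    print(build_prompt(demos, [[1, 0], [0, 1]]))
```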
melaniemitchell.bsky.social
AI reasoning models have exceeded average human accuracy on ARC-AGI-1!!

But are they getting the right answers for the “right” reasons – i.e., grasping the intended abstractions?

Or are they solving tasks using less abstract, less generalizable, unintended patterns (“shortcuts”)?

🧵 3/10
melaniemitchell.bsky.social
ARC-AGI-1 is meant to be a test of abstract reasoning, based on “core knowledge” priors such as objectness, along with spatial/geometric concepts (“inside vs. outside”, “top vs. bottom”), semantic concepts (“same vs. different”), and numerical concepts (“largest vs. smallest”).

🧵 2/10
Reposted by Melanie Mitchell
shannonvallor.bsky.social
Since it resonated with the audience, I’ll recap my main argument against AGI here. ‘General intelligence’ is like phlogiston, or the aether. It’s an outmoded scientific concept that does not refer to anything real. Any explanatory work it did can be done better by a richer scientific frame. 1/3
shannonvallor.bsky.social
This was a truly heartening day, with deeply thoughtful challenges to the dominant narrative framed around AGI, coming from across disciplines and perspectives. Felt like the tide might finally be turning a bit, at least among the scientific community. Thanks @royalsociety.org!
anilseth.bsky.social
1/2 I'm looking forward to taking part in a panel on AGI and the Turing Test, tomorrow afternoon (Thurs 2nd Oct) at the @royalsociety.org, w/ Dame Wendy Hall, Shannon Vallor, William Isaac, & Sir Nigel Shadbolt. royalsociety.org/science-even...
melaniemitchell.bsky.social
And over-generalized "computer" :-)
melaniemitchell.bsky.social
Doubting the usefulness of the TT is not "moving the goal posts"; it's progress in our understanding of what intelligence is.
melaniemitchell.bsky.social
Andrew Ng: "AI is the new electricity!"

Cory Doctorow: "AI is the asbestos we are shoveling into the walls of our society and our descendants will be digging it out for generations."

¯\_(ツ)_/¯

(pluralistic.net/2025/09/27/e...)
melaniemitchell.bsky.social
We desperately need Dem leaders like @chrismurphyct.bsky.social, @govpritzker.illinois.gov, @aoc.bsky.social and others who are able and willing to fight like hell. As opposed to the current leadership in Congress, who don't seem to appreciate the stakes.
melaniemitchell.bsky.social
⬇️⬇️
vancleve.theoretical.bio
Please everyone post comments on this rule!

This rule change will be devastating to our valued international students and scholars who contribute tremendously to US universities and other institutions with STEM research and development.
ianlmorgan.bsky.social
DHS is proposing a rule that would end "duration of status" for J, F and other visa holders. This would decimate the international graduate student and postdoc population, which are a crucial part of the United States biomedical workforce. There's still time for you to comment on the proposed rule.
melaniemitchell.bsky.social
Great thread -- and it makes me miss Portland, where I lived for almost 20 years. Keep federal troops OUT of Portland!
wyden.senate.gov
Portlanders — show off your corner of war-ravaged Portland below.
katzpeejays.bsky.social
Our corner of war-ravaged Portland.
melaniemitchell.bsky.social
In this vein, @mraginsky.bsky.social's "Artificial Intelligence as Sorcery" is also excellent: realizable.substack.com/p/artificial...