Zory Zhang
@zoryzhang.bsky.social
13 followers 100 following 12 posts
Computational modeling of human learning: cognitive development, language acquisition, social learning, causal learning... Brown PhD student with ‪@daphnab.bsky.social‬
Pinned
zoryzhang.bsky.social
👁️ 𝐂𝐚𝐧 𝐕𝐢𝐬𝐢𝐨𝐧 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 (𝐕𝐋𝐌𝐬) 𝐈𝐧𝐟𝐞𝐫 𝐇𝐮𝐦𝐚𝐧 𝐆𝐚𝐳𝐞 𝐃𝐢𝐫𝐞𝐜𝐭𝐢𝐨𝐧?
Knowing where someone looks is key to a Theory of Mind. We test 111 VLMs and 65 humans to compare their inferences.
Project page: grow-ai-like-a-child.github.io/gaze/
🧵1/11
Reposted by Zory Zhang
hokin.bsky.social
#CoreCognition #LLM #multimodal #GrowAI We spent 3 years curating 1,503 classic experiments spanning 12 core concepts in human cognitive development, then evaluated 230 MLLMs with 11 different prompts, 5 times each, yielding over 3.8 million inference data points.

A thread (1/n) - #ICML2025
Reposted by Zory Zhang
lintonvision.bsky.social
Beautiful to see this initiative from a group of like-minded PhD students collaborating! 🚀
hokin.bsky.social
New Paper Alert ‼️ Current VLMs completely fail at human gaze understanding 🙀 and scaling does NOT help ‼️

However, humans, from an extremely early age 🧒, are extremely sensitive to other people's gaze 🙄 👀

No mentors, no labs, only pre-doc students, 111 VLMs, and we did it 😎
zoryzhang.bsky.social
With the amazing GrowAI team: Pinyuan Feng (equal contribution), Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, @hokin.bsky.social , Ziqiao Ma, Yijiang Li, & Dezhi Luo.

🧵11/11 🎉
zoryzhang.bsky.social
Beyond helping us understand VLMs, this explanation suggests that VLM training should include more embodied social interaction, so that natural human-AI interaction can emerge from next-token/frame-prediction training. We also recommend better learning-curriculum design 📚.
🧵9/11
zoryzhang.bsky.social
We leave this explanation open for further investigation. More broadly, this work shows how controlled studies can complement benchmarking: they surface patterns that any explanation must account for, constraining the hypothesis space and helping us better understand VLMs 🌟.
🧵8/11
zoryzhang.bsky.social
Surprisingly, their accuracy does not differ between front views and side views, whereas humans' does (p<0.001). VLMs may rely on 👺head orientation rather than 👀eye gaze direction, making them "robust" to side views, which increase the geometric ambiguity of eye direction.
🧵7/11
zoryzhang.bsky.social
On the other hand, the performance of Gemini 1.5 Pro, GPT-4o, InternLM, Qwen2.5, and GLM becomes closer to the chance level as difficulty increases (with increasing proximity and number of objects). They likely employ heuristics that break down under difficult conditions.
🧵6/11
zoryzhang.bsky.social
Before that, we need to establish baselines. 65 human participants were presented with multiple-choice (MC) questions like the one below. Their performance degrades 📉 with increasing proximity, an increasing number of objects, and when the camera view switches from front to side.
🧵5/11
zoryzhang.bsky.social
In addition to showing chance-level accuracy, VLMs chose every possible answer almost equally often. Are they random guessers? 🤡 Spoiler: top-tier VLMs are not, as our further analysis of how their performance varies with the controlled variables shows. 🤗
🧵4/11
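(Illustrative aside: one way this kind of response-distribution uniformity could be checked is a chi-square goodness-of-fit test. The counts and setup below are assumptions for illustration only, not the paper's analysis code.)

```python
from scipy.stats import chisquare

# Hypothetical answer counts for one VLM on 3-object trials (made-up numbers).
answer_counts = [101, 98, 103]  # how often each candidate object was chosen

# Test against a uniform expected distribution; a large p-value is consistent
# with the model spreading its answers roughly equally across the options.
stat, p = chisquare(answer_counts)
print(f"chi2={stat:.2f}, p={p:.3f}")
```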
zoryzhang.bsky.social
We found that humans excel at gaze inference (~91% accuracy), but 94 of 111 VLMs performed about as well as if they had guessed randomly without looking at the images (~42%) 😲. Even the best, like GPT-4o, hit only ~50%. Bigger (or newer) VLMs are not better. 🫤
🧵3/11
zoryzhang.bsky.social
We systematically manipulated variables across 900 evaluation stimuli: View (left/right/front), Proximity (1-3 scale), Number of objects (2-4), etc., and tested 65 human participants (45 stimuli per person) and 111 VLMs on them.
🧵2/11
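(Illustrative aside: a minimal sketch of how a controlled condition grid like this could be enumerated. The variable names and level codings are assumptions mirroring the factors described above, not the authors' actual stimulus-generation code.)

```python
from itertools import product

# Assumed factor levels mirroring the controlled variables described above.
views = ["left", "right", "front"]   # camera view
proximities = [1, 2, 3]              # proximity scale (coding assumed)
object_counts = [2, 3, 4]            # number of candidate objects

# Full factorial grid of condition cells; the actual 900 stimuli would add
# repetitions and other factors on top of a grid like this.
conditions = [
    {"view": v, "proximity": p, "n_objects": n}
    for v, p, n in product(views, proximities, object_counts)
]
print(len(conditions))  # 27 unique condition cells in this illustrative grid
```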
Reposted by Zory Zhang
hokin.bsky.social
Sam is 100% correct on this. Indeed, human babies have essential cognitive priors such as object permanence, continuity, and object boundaries, as well as a 3D Euclidean understanding of space.

We spent 2 years systematically examining and demonstrating the lack of such priors in MLLMs: arxiv.org/abs/2410.10855