A huge shoutout to my co-authors @lucasmgomez.bsky.social,
@taylorwwebb.bsky.social, and @bashivan.bsky.social!
Check out the full paper for the deep dive into VLM cognitive profiles at arxiv.org/abs/2505.21538
See you in San Diego! 🏔️ #AI #VLM #NeurIPS2025
Our work suggests that future VLM improvements shouldn't just focus on larger encoders, but on better Visual Chain-of-Thought and integration strategies to overcome the "Perception-Reasoning" disconnect.
Does this generalize? Yes.
Gains from fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊
We didn't stop there. We fine-tuned Qwen2.5-VL on our Composite Visual Reasoning (CVR) tasks (see the sketch below).
🔹 1k training samples yielded large gains.
🔹 100k samples pushed performance even higher.
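For context, here's a minimal sketch of what a chat-style SFT example for a CVR task could look like. The field names and file path are hypothetical illustrations, not the paper's released data schema.

```python
# Illustrative supervised fine-tuning record for a composite visual reasoning
# task. All field names and the file path are hypothetical; they do not
# reflect the paper's released dataset schema.
example = {
    "image": "cvr/sample_00042.png",  # rendered multi-object scene (placeholder path)
    "question": "Which shape is directly to the left of the red circle?",
    "answer": "The blue square.",
}

# Chat-style training target: the model is fine-tuned to produce `answer`
# given the image plus `question` as the user turn.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": example["image"]},
        {"type": "text", "text": example["question"]},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": example["answer"]},
    ]},
]
```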
This suggests a major bottleneck in current VLMs: Chain-of-Thought (CoT) needs to be better grounded in visual features.
Models are "Caught in the Middle"—they possess the visual info and the reasoning capacity, but fail to connect them without an explicit text bridge.
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): the model first describes the image, then answers the prompt using its own caption (sketch below).
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
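For the curious, here's a minimal sketch of the self-captioning setup, assuming a generic chat-style VLM helper `vlm_generate(messages)` (a hypothetical stand-in, not the paper's actual evaluation harness): caption first, then answer from the caption alone.

```python
# Minimal self-captioning (SC) sketch. `vlm_generate` is a hypothetical helper
# standing in for any chat-style VLM API; it is not the paper's eval harness.

def self_captioning_answer(vlm_generate, image, question):
    # Stage 1: the model describes the image in its own words.
    caption = vlm_generate(messages=[{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail, including "
                                     "where each object is located."},
        ],
    }])

    # Stage 2: the model answers from its own caption (the explicit text
    # bridge), without direct access to the image.
    answer = vlm_generate(messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": f"Image description: {caption}\n\n"
                    f"Question: {question}\n"
                    "Answer using only the description above.",
        }],
    }])
    return answer
```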
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper. We built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.