Zihan
@zhweng.bsky.social
PhD Student @mcgill.ca | {Biological,Artificial} Neural Networks
9/9
A huge shoutout to my co-authors @lucasmgomez.bsky.social,
@taylorwwebb.bsky.social, and @bashivan.bsky.social!
Check out the full paper for a deep dive into VLM cognitive profiles at arxiv.org/abs/2505.21538
See you in San Diego! 🏔️ #AI #VLM #NeurIPS2025
Caption This, Reason That: VLMs Caught in the Middle
arxiv.org
December 1, 2025 at 4:43 PM
8/9
Our work suggests that future VLM improvements shouldn't focus just on larger vision encoders, but on better Visual Chain-of-Thought and integration strategies that overcome the "Perception-Reasoning" disconnect.
7/9
Does this generalize? Yes.
Gains from fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊
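To make "correlated" concrete: a minimal sketch (not code from the paper) of how you could measure this, assuming you already have paired per-model accuracy gains on our tasks and on an external benchmark; transfer_correlation is a hypothetical helper name.

from statistics import correlation  # Pearson's r, Python 3.10+

def transfer_correlation(cognitive_gains, benchmark_gains):
    # Pearson r between per-model accuracy gains on the cognitive tasks
    # and gains on an external benchmark such as MMMU-Pro or VQAv2.
    return correlation(cognitive_gains, benchmark_gains)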
6/9
We didn't stop there. We fine-tuned Qwen2.5 on our Composite Visual Reasoning (CVR) tasks (setup sketched below).
🔹 1k training samples yielded large gains.
🔹 100k samples pushed performance even higher.
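Roughly, the setup looks like this (a hypothetical sketch, not our training code; the record layout follows common chat-style VLM SFT conventions, and the field names are assumptions):

import random

def make_sft_subset(examples, n_samples, seed=0):
    # examples: dicts with "image", "question", "answer" keys (assumed schema).
    # Draw n_samples items (e.g. 1_000 or 100_000) reproducibly.
    rng = random.Random(seed)
    subset = rng.sample(examples, k=min(n_samples, len(examples)))
    # Format each item as a single-turn chat record for supervised fine-tuning.
    return [
        {
            "images": [ex["image"]],
            "messages": [
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": ex["answer"]},
            ],
        }
        for ex in subset
    ]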
5/9
This suggests a major bottleneck in current VLMs: Chain-of-Thought (CoT) needs to be better grounded in visual features.
Models are "Caught in the Middle"—they possess the visual info and the reasoning capacity, but fail to connect them without an explicit text bridge.
4/9
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): The model describes the image, then answers the prompt using its own caption.
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
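In pseudocode, SC is just two passes (a minimal sketch, not the paper's code; vlm_generate stands in for whatever chat-VLM call you use, and the prompts are illustrative):

def self_caption_answer(vlm_generate, image, question):
    # Pass 1: describe the image in detail, independent of the question.
    caption = vlm_generate(
        image=image,
        prompt="Describe this image in detail: object identities, counts, "
               "and spatial relationships.",
    )
    # Pass 2: answer the original question from the caption alone,
    # i.e. route the reasoning through an explicit text bridge.
    return vlm_generate(
        image=None,
        prompt=f"Image description: {caption}\n\n"
               f"Question: {question}\nAnswer using only the description above.",
    )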
3/9
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.
2/9
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper. We built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.
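Scoring is done per ability rather than as one aggregate number; here's a rough sketch of that idea (assumed field names, not the released PAM evaluation code):

from collections import defaultdict

def cognitive_profile(model_answer, items):
    # items: dicts with "ability" (perception / attention / memory),
    # "image", "question", and a ground-truth "target" string.
    # model_answer(image, question) -> predicted answer string.
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model_answer(item["image"], item["question"])
        ability = item["ability"]
        total[ability] += 1
        correct[ability] += int(pred.strip().lower() == item["target"].strip().lower())
    # One accuracy per ability = the model's cognitive profile.
    return {ability: correct[ability] / total[ability] for ability in total}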