Zihan (@zhweng.bsky.social)
PhD Student @mcgill.ca | {Biological,Artificial} Neural Networks
7/9
Does this generalize? Yes.
Fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊
4/9
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): the model first describes the image, then answers the question using its own caption.
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
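If you want to try SC yourself, here is a minimal sketch of the two-stage prompting loop. `query_vlm` is a hypothetical stand-in for whatever VLM client you use (Qwen2.5-VL, GPT-4o, ...), and the prompt wording and the text-only second stage are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of Self-Captioning (SC) as two chained VLM calls.
# query_vlm() is a hypothetical wrapper: replace it with your own
# client call for Qwen2.5-VL, GPT-4o, or any other VLM.

def query_vlm(prompt: str, image_path: str | None = None) -> str:
    """Hypothetical helper: send a text prompt (plus optional image) to a VLM."""
    raise NotImplementedError("Wire this up to your VLM client of choice.")

def self_captioning_answer(image_path: str, question: str) -> str:
    # Stage 1: the model captions the image in its own words.
    caption = query_vlm(
        "Describe this image in detail: every object, its attributes, "
        "and where it is located.",
        image_path=image_path,
    )
    # Stage 2: the model answers from its own caption (text-only here;
    # an assumption -- the image could also be re-attached alongside it).
    return query_vlm(
        f"Image description: {caption}\n\n"
        f"Using this description, answer the question: {question}"
    )
```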
3/9
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.
2/9
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper. We built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.
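For a sense of what "isolating abilities" can look like in evaluation code, here is a rough sketch of per-ability accuracy scoring. The field names ("ability", "question", "answer") and exact-match scoring are assumptions for illustration, not the actual PAM schema or metric.

```python
# Illustrative per-ability scoring for a PAM-style benchmark.
# Field names and exact-match scoring are assumptions, not the real schema.

from collections import defaultdict
from typing import Callable, Iterable

def per_ability_accuracy(
    examples: Iterable[dict],
    predict: Callable[[str, str], str],  # (image_path, question) -> answer
) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ex in examples:
        pred = predict(ex["image"], ex["question"])
        ability = ex["ability"]  # e.g. "perception", "attention", "memory"
        total[ability] += 1
        correct[ability] += int(pred.strip().lower() == ex["answer"].strip().lower())
    # Accuracy per cognitive ability, so weaknesses don't hide in an average.
    return {a: correct[a] / total[a] for a in total}
```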
1/9
🚨 Thrilled to share "Caption This, Reason That", a #NeurIPS2025 Spotlight! 🔦
Meet us at #2112, Dec 3 at 11 a.m.
We analyze VLM limitations through the lens of Cognitive Science (Perception, Attention, Memory) and propose a simple "Self-Captioning" method that boosts spatial reasoning by ~18%.
🧵👇
December 1, 2025 at 4:43 PM