🚨Thrilled to share "Caption This, Reason That", a #NeurIPS2025 Spotlight! 🔦
Meet us at poster #2112 on 3 Dec, 11 a.m.
We analyze VLM limitations through the lens of Cognitive Science (Perception, Attention, Memory) and propose a simple "Self-Captioning" method that boosts spatial reasoning by ~18%.
🧵👇
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper: we built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.
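For intuition, here is a toy sketch of what an item isolating a single ability might look like (the field names, question, and format are hypothetical, not the actual PAM schema):

```python
# Hypothetical PAM-style item. Field names and values are illustrative
# placeholders, not the dataset's real schema.
item = {
    "ability": "perception",    # one of: perception, attention, memory
    "subskill": "spatial",      # e.g. category (what) vs. location (where)
    "image": "scene_0001.png",
    "question": "Is the red cube to the left or the right of the blue ball?",
    "choices": ["left", "right"],
    "answer": "left",
}
```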
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): the model first describes the image, then answers the prompt using its own caption.
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
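A minimal sketch of the two-stage idea, assuming a generic VLM call; `vlm_generate`, the prompts, and whether the image is re-shown at stage two are placeholders that may differ from the paper:

```python
# Two-stage Self-Captioning (SC) sketch. `vlm_generate` stands in for
# whatever VLM inference call you use (e.g. a Qwen2.5-VL or GPT-4o wrapper);
# the paper's exact prompts may differ.

def self_captioning_answer(vlm_generate, image, question: str) -> str:
    # Stage 1: the model describes the image in its own words.
    caption = vlm_generate(
        prompt="Describe this image in detail.",
        image=image,
    )
    # Stage 2: the model answers conditioned on its own caption, routing
    # the visual information through language. (A variant could also
    # re-show the image here.)
    return vlm_generate(
        prompt=f"Image description: {caption}\n\nQuestion: {question}\nAnswer:",
        image=None,
    )
```

The point of the experiment: if this detour through language beats answering directly, the encoder is extracting the spatial information; the bottleneck is in how the language side consumes the visual tokens.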
Does this generalize? Yes.
Fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊