Aarash Feizi
@aarashfeizi.bsky.social
Visiting Researcher at @ServiceNowRSRCH | PhD student in @mcgillu and @Mila_Quebec | Prev. @RecursionPharma https://aarashfeizi.github.io/
aarashfeizi.bsky.social
🚨 Excited to introduce PairBench! 🚨

💡 TL;DR: VLM-judges can fail at data comparison!

✅ PairBench helps you pick the right one by testing alignment, symmetry, smoothness & controllability—ensuring reliable auto-evaluation.

📄 Paper: arxiv.org/abs/2502.15210

🧵 Thread: 👇
aarashfeizi.bsky.social
🧵 6/7

✅ Beyond benchmarking, PairBench can be used during VLM training & fine-tuning to detect biases early and improve evaluation methods!

This could lead to more trustworthy, consistent AI systems for real-world tasks. 🚀
aarashfeizi.bsky.social
🧵 5/7

✅ PairBench correlates strongly with existing benchmarks, meaning it can serve as a low-cost alternative to expensive human-annotated benchmarks!

This makes it easier to compare and rank models efficiently—without excessive computational costs.
aarashfeizi.bsky.social
🧵 4/7

Instead of blindly picking a judge model, we should ask:
🔹 What task is being evaluated?
🔹 What metric matters most?

✅ PairBench helps match the right VLM to the right task, improving fairness & reliability in auto-evaluation.
aarashfeizi.bsky.social
🧵 3/7

🚨 No single VLM is the best! Models vary drastically across PairBench metrics.

Although some align well with human judgements, they may struggle with symmetry, smoothness, or controllability, which makes their scores unreliable!

📄 More failure cases in our paper’s appendix!
aarashfeizi.bsky.social
🧵 2/7

✅ Surprising (and concerning) result: Most VLMs lack symmetry! 🤯

In theory, sim(A, B) = sim(B, A)—but in practice? Many models fail!

For example, simply swapping the order of the input images makes GPT-4o and Gemini 1.5 Pro change their decision and scores drastically. 🔄
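
To make the symmetry property sim(A, B) = sim(B, A) concrete, here is a minimal sketch of how an order-sensitivity check might look. It is not PairBench's actual metric or code: `symmetry_gap`, `symmetric_judge`, and `order_biased_judge` are hypothetical names, and the toy judges compare strings rather than calling a real VLM.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: takes two items, returns a similarity score.
Judge = Callable[[str, str], float]


def symmetry_gap(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean absolute difference between judge(a, b) and judge(b, a).

    A perfectly symmetric judge yields 0.0; larger values mean the
    score depends more on the order in which the inputs are presented.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    gaps = [abs(judge(a, b) - judge(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)


def symmetric_judge(a: str, b: str) -> float:
    """Toy stand-in for a judge: Jaccard overlap of the character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def order_biased_judge(a: str, b: str) -> float:
    """Same toy judge, plus an artificial bonus when the first input is longer."""
    return symmetric_judge(a, b) + 0.1 * (len(a) > len(b))


# Illustrative data only; a real check would use image or text pairs.
pairs = [("cat", "cart"), ("dog", "dig")]

print(symmetry_gap(symmetric_judge, pairs))
print(symmetry_gap(order_biased_judge, pairs))
```

With a real model, `judge` would wrap the prompt-and-parse call, and the same function would report how much the score shifts when the two inputs are swapped.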
aarashfeizi.bsky.social
🧵 1/7

Vision language models (VLMs) are widely used as automated evaluators, but can they actually compare data reliably? 🤔

✅ PairBench systematically tests how well VLMs judge similarity across modalities, revealing key strengths & weaknesses in their decisions.