Bill Psomas
@billpsomas.bsky.social
MSCA Postdoctoral Fellow @ Visual Recognition Group, CTU in Prague. Deep Learning for Computer Vision. Former IARAI, Inria, Athena RC intern. Photographer. Crossfit freak.

📍Prague, CZ. 🔗 http://users.ntua.gr/psomasbill/
🚀New task: Instance-level Image+Text→Image Retrieval

🔎Given a query image + an edit (“during night”), retrieve the same specific instance after the change — not just any similar object.

🛢New dataset on HF: i-CIR huggingface.co/datasets/bil...

🔥Download, run, and share results!
January 6, 2026 at 8:00 PM
10/n Faster convergence🔥

REGLUE (SiT-B/2) achieves 12.9 and 28.7 FID at 400K iterations in conditional and unconditional generation, respectively, outperforming REPA, ReDi, and REG. REGLUE (SiT-XL/2) matches 1M-step SOTA performance in just 700K iterations (~30% fewer steps).
December 27, 2025 at 10:30 AM
7/n Semantic preservation under compression📉

Do compressed patch features retain VFM semantics?

Each point plots frozen compressed DINOv2 semantics (x-axis: ImageNet top-1 / Cityscapes mIoU) against the generation quality (y-axis: ImageNet FID) of a SiT-B trained on VAE latents + the compressed features.
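For context, semantic retention like the x-axis here is typically measured with a linear probe on frozen features; a minimal generic sketch (dimensions and protocol are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

# Hypothetical probe: frozen compressed DINOv2 patch features -> linear
# classifier for ImageNet top-1. Dims are illustrative, not the paper's.
feat_dim, num_classes = 32, 1000
probe = nn.Linear(feat_dim, num_classes)  # only the probe is trained
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(compressed_patches, labels):
    # compressed_patches: (B, N, feat_dim) frozen features; mean-pool patches
    pooled = compressed_patches.mean(dim=1)            # (B, feat_dim)
    logits = probe(pooled)                             # (B, num_classes)
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```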
December 27, 2025 at 10:29 AM
6/n Non-linear compression matters 💎

Linear compression with PCA (as in ReDi) can limit patch-level semantics. We introduce a lightweight non-linear semantic compressor that aggregates multi-layer VFM features into a compact, semantics-preserving space, boosting quality (21.4 → 13.3 FID).
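A minimal sketch of what such a compressor could look like (layer count and dimensions are assumptions, not the paper's config):

```python
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Toy non-linear compressor: concatenate patch features from several
    VFM layers and project them through a small MLP into a compact space.
    Layer choice and dims are assumptions, not the paper's exact config."""
    def __init__(self, vfm_dim=768, num_layers=4, out_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vfm_dim * num_layers, 512),
            nn.GELU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, layer_feats):
        # layer_feats: list of (B, N, vfm_dim) patch features taken from
        # num_layers intermediate DINOv2 blocks; concatenate and compress.
        x = torch.cat(layer_feats, dim=-1)   # (B, N, vfm_dim * num_layers)
        return self.mlp(x)                   # (B, N, out_dim)
```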
December 27, 2025 at 10:28 AM
5/n Our method 🧠

REGLUE puts these signals into one unified model and jointly models (a toy sketch of the combined objective follows this list):

1️⃣ VAE latents (pixels)
2️⃣ local semantics (compressed patch features)
3️⃣ global [CLS] (concept)
➕ an alignment loss as a complementary auxiliary signal.
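Read as pseudocode, the unified objective could look roughly like this (heads, targets, and weights are all illustrative assumptions, not REGLUE's exact formulation):

```python
import torch.nn.functional as F

def reglue_style_loss(model, x_t, t, cond, targets, lam_align=0.5):
    """Toy combined objective: one backbone predicts denoising targets for
    (1) VAE latents, (2) compressed patch semantics, (3) a global [CLS]
    token, plus an auxiliary alignment term. Heads/weights are assumptions."""
    pred_latent, pred_patch, pred_cls, hidden = model(x_t, t, cond)
    loss = (
        F.mse_loss(pred_latent, targets["vae_latent"])   # pixels
        + F.mse_loss(pred_patch, targets["patch_sem"])   # local semantics
        + F.mse_loss(pred_cls, targets["cls_token"])     # global concept
    )
    # REPA-style alignment: pull internal features toward frozen VFM features.
    align = 1 - F.cosine_similarity(hidden, targets["vfm_feat"], dim=-1).mean()
    return loss + lam_align * align
```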
December 27, 2025 at 10:28 AM
4/n Main insight 💡

Jointly modeling compressed patch-level semantics ➕ VAE latents provides spatial guidance and yields larger gains than alignment-only (REPA) or global-only (REG).

An alignment loss and a global [CLS] token remain complementary signals, orthogonal to this gain.
December 27, 2025 at 10:27 AM
1/n REGLUE Your Latents! 🚀

We introduce REGLUE: a unified framework that entangles VAE latents ➕ Global ➕ Local semantics for faster, higher-fidelity image generation.

Links (paper + code) at the end👇
December 27, 2025 at 10:26 AM
Heading to #NeurIPS2025?
Come by Poster Session 6, Fri 16:30, #4514 🧵
We present instance-level composed image retrieval, the new i-CIR dataset, and our training-free method BASIC.
Drop in and say hi!
December 4, 2025 at 5:31 PM
A method for i-CIR and CIR in general:

⚡BASIC: a training-free pipeline (centering, PCA projection, textual contextualization, Harris-style fusion) with strong results across i-CIR and class-level CIR benchmarks.
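One plausible toy reading of that pipeline in code. Every formula below, including reading "Harris-style fusion" by analogy with the Harris corner response det - k*trace^2, is an assumption rather than BASIC's actual recipe, and the textual contextualization step is omitted:

```python
import numpy as np

def basic_style_scores(q_img, q_txt, db, k=0.05, n_comp=128):
    """Toy composed-retrieval scorer: q_img/q_txt are (dim,) query image and
    text embeddings, db is a (N, dim) image-embedding database."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    # 1) centering: remove the database mean direction
    mean = db.mean(axis=0)
    db_c, qi = db - mean, q_img - mean
    # 2) PCA projection onto the top principal components of the database
    _, _, vt = np.linalg.svd(db_c, full_matrices=False)
    P = vt[:n_comp].T                      # (dim, n_comp)
    db_p, qi_p = norm(db_c @ P), norm(qi @ P)
    s_img = db_p @ qi_p                    # image cue, one score per item
    s_txt = norm(db) @ norm(q_txt)         # text cue (e.g., CLIP text encoder)
    # 3) Harris-style fusion: high only when BOTH cues are high,
    #    by analogy with the corner response det - k * trace^2.
    return s_img * s_txt - k * (s_img + s_txt) ** 2
```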
November 6, 2025 at 12:07 PM
Compact ⚖️ but hard 🔥:

📊~750K images, 202 instances, ~1,900 composed queries. Despite small per-query DBs (~3.7K images), i-CIR matches the difficulty of searching with >40M random distractors.
November 6, 2025 at 12:05 PM
How i-CIR is structured:

🗂️ Per instance we share a database and define (a toy annotation sketch follows the list below):

- composed positives (same object + modification)
- hard negatives:
  - visual (same/similar object, wrong text)
  - textual (right text, wrong instance)
  - composed (near-miss on both)
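For illustration, a hypothetical per-instance record matching this structure (field names invented, not the released schema):

```python
# Hypothetical per-instance annotation record; paths and keys are
# illustrative only, not the actual i-CIR release format.
instance_record = {
    "instance": "temple_of_poseidon",
    "query_image": "queries/temple_of_poseidon.jpg",
    "modification": "during sunset",
    "database": {
        "composed_positives": ["db/tp_sunset_01.jpg"],   # same object + mod
        "hard_negatives": {
            "visual":   ["db/tp_daylight_03.jpg"],       # right object, wrong text
            "textual":  ["db/other_temple_sunset.jpg"],  # right text, wrong instance
            "composed": ["db/similar_temple_dusk.jpg"],  # near-miss on both
        },
    },
}
```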
November 6, 2025 at 12:04 PM
Why this matters:

🔎 A gap in the community: existing CIR benchmarks are class-level and ambiguous, lack explicit hard negatives, and often reward text-only behaviour. We needed a dataset that truly requires both image and text, at the instance level. i-CIR fills that gap.
November 6, 2025 at 12:03 PM
🎉 Instance-level Composed Image Retrieval @ #NeurIPS2025

🎨 Task: given (image of an object instance) + (text modification), retrieve photos of that exact instance under the change.

E.g.: Temple of Poseidon 🏛️ ➕ during sunset 🌅

📦 Project page: vrg.fel.cvut.cz/icir/
November 6, 2025 at 12:02 PM
The Colloquium begins!
April 10, 2025 at 9:07 AM