Bill Psomas
@billpsomas.bsky.social
MSCA Postdoctoral Fellow @ Visual Recognition Group, CTU in Prague. Deep Learning for Computer Vision. Former IARAI, Inria, Athena RC intern. Photographer. CrossFit freak.

📍Prague, CZ. 🔗 http://users.ntua.gr/psomasbill/
Would love to try
January 13, 2026 at 6:33 PM
Best promo anyone could make for this position 👏🏾🏰 And, amazingly, everything said is true 🎆
January 9, 2026 at 5:36 AM
12/12 Joint work with Giorgos Petsangourakis, Christos Sgouropoulos, Theodoros Giannakopoulos, Giorgos Sfikas, @ikakogeorgiou.bsky.social.
December 27, 2025 at 10:32 AM
11/n Summary🏁

REGLUE shows that the way we leverage VFM semantics matters for diffusion. Combining compact local semantics with global context yields faster convergence and state-of-the-art image generation.

📄arXiv: arxiv.org/abs/2512.16636
💻Project: reglueyourlatents.github.io
REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion
December 27, 2025 at 10:30 AM
10/n Faster convergence🔥

REGLUE (SiT-B/2) achieves 12.9 and 28.7 FID at 400K iterations in conditional and unconditional generation, respectively, outperforming REPA, ReDi, and REG. REGLUE (SiT-XL/2) matches 1M-step SOTA performance in just 700K iterations (~30% fewer steps).
December 27, 2025 at 10:30 AM
9/n Alignment effects ⚓

External alignment complements joint modeling, but its benefits depend on the signal. Local alignment yields consistent gains, whereas global-only alignment can degrade performance. Spatial joint modeling remains the primary driver.
December 27, 2025 at 10:29 AM
8/n Local > Global Semantics🧩

Our analysis shows that joint modeling with patch-level semantics drives most gains. The global [CLS] helps, but fine-grained spatial features deliver a substantially larger FID improvement, highlighting the importance of local structure for diffusion.
December 27, 2025 at 10:29 AM
7/n Semantic preservation under compression📉

Do compressed patch features retain VFM semantics?

Each point plots frozen, compressed DINOv2 semantics (x: ImageNet top-1 / Cityscapes mIoU) against SiT-B generation quality (y: ImageNet FID) when trained on VAE latents + compressed features.
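A minimal sketch of the probing side of this plot, assuming a frozen extractor + compressor and placeholder dims and stand-in modules (the paper's actual evaluation protocol may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_VFM, D_COMP, NUM_CLASSES = 768, 64, 1000    # placeholder dims, not from the paper

dino = nn.Identity()            # stand-in for a frozen DINOv2 patch-feature extractor
compressor = nn.Sequential(     # stand-in for the frozen non-linear compressor (6/n)
    nn.Linear(D_VFM, D_COMP), nn.GELU(), nn.Linear(D_COMP, D_COMP))
probe = nn.Linear(D_COMP, NUM_CLASSES)        # only the linear probe is trained
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

# One probe step on dummy data; swap in real ImageNet features and labels in practice.
feats = torch.randn(8, 196, D_VFM)            # [batch, patches, dim]
labels = torch.randint(0, NUM_CLASSES, (8,))
with torch.no_grad():                          # extractor and compressor stay frozen
    z = compressor(dino(feats)).mean(dim=1)    # average-pool patches for classification
loss = F.cross_entropy(probe(z), labels)
opt.zero_grad(); loss.backward(); opt.step()
# Probe top-1 accuracy gives the x-axis; FID of the SiT-B trained on the
# same compressed features gives the y-axis.
```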
December 27, 2025 at 10:29 AM
6/n Non-linear compression matters 💎

Linear compression with PCA (e.g., in ReDi) can limit patch-level semantics. We introduce a lightweight non-linear semantic compressor that aggregates multi-layer VFM features into a compact, semantics-preserving space, boosting quality (21.4 → 13.3 FID).
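Roughly what such a compressor could look like; the layer count, dims, and concat-then-mix fusion below are illustrative guesses, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Illustrative non-linear compressor: fuses features from several VFM
    layers and maps them to a compact per-patch code. Dims are placeholders."""
    def __init__(self, n_layers=4, d_vfm=768, d_comp=64):
        super().__init__()
        self.fuse = nn.Linear(n_layers * d_vfm, d_vfm)   # mix across layers
        self.mlp = nn.Sequential(                        # non-linearity is the point vs. PCA
            nn.Linear(d_vfm, 4 * d_comp), nn.GELU(),
            nn.Linear(4 * d_comp, d_comp))

    def forward(self, layer_feats):                      # list of [B, N, d_vfm] tensors
        x = torch.cat(layer_feats, dim=-1)               # [B, N, n_layers * d_vfm]
        return self.mlp(self.fuse(x))                    # [B, N, d_comp]

# Dummy usage: 4 DINOv2-like layers, 196 patches each.
feats = [torch.randn(2, 196, 768) for _ in range(4)]
codes = SemanticCompressor()(feats)                      # [2, 196, 64]
```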
December 27, 2025 at 10:28 AM
5/n Our method 🧠

REGLUE combines all of these in one unified model, jointly modeling (sketch after the list):

1️⃣ VAE latents (pixels)
2️⃣ local semantics (compressed patch features)
3️⃣ global [CLS] (concept)
➕ alignment loss as a complementary auxiliary boost.
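A toy sketch of how the three streams and the auxiliary loss could fit together; the names, shapes, stand-in backbone, and creating [CLS] at the joint width are simplifying assumptions, not the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D = 2, 196, 68           # toy sizes: 4 VAE + 64 semantic channels per token
model = nn.Linear(D, D)        # stand-in for a SiT-style diffusion backbone
proj = nn.Linear(D, 768)       # projection head for REPA-style alignment
lam = 0.5                      # auxiliary loss weight (illustrative)

vae_lat = torch.randn(B, N, 4)          # 1) VAE latents (pixels)
local_sem = torch.randn(B, N, 64)       # 2) compressed patch semantics
cls_tok = torch.randn(B, 1, D)          # 3) global [CLS], at the joint width
vfm_feats = torch.randn(B, N + 1, 768)  # frozen VFM targets for alignment

# Entangle all three streams into one joint state and denoise it end-to-end.
x0 = torch.cat([torch.cat([vae_lat, local_sem], dim=-1), cls_tok], dim=1)  # [B, N+1, D]
t = torch.rand(B, 1, 1)
noise = torch.randn_like(x0)
xt = (1 - t) * x0 + t * noise           # flow-matching interpolation (SiT-style)
hidden = model(xt)                      # would also condition on t in practice
denoise_loss = F.mse_loss(hidden, noise - x0)

# Complementary auxiliary boost: align intermediate features to the frozen VFM.
align_loss = -F.cosine_similarity(proj(hidden), vfm_feats, dim=-1).mean()
loss = denoise_loss + lam * align_loss
```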
December 27, 2025 at 10:28 AM
4/n Main insight 💡

Jointly modeling compressed patch-level semantics ➕ VAE latents provides spatial guidance and yields larger gains than alignment-only (REPA) or global-only (REG).

Alignment loss and a global [CLS] token remain complementary, orthogonal signals.
December 27, 2025 at 10:27 AM
3/n Key design choice 🧩 Compact spatial semantics matter!

To leverage VFMs effectively, diffusion should jointly model VAE latents with multi-layer VFM spatial (patch-level) semantics, via a compact, non-linearly compressed representation.
December 27, 2025 at 10:27 AM
2/n More semantics are needed! ➕

Existing joint modeling and external alignment approaches (e.g., REPA, REG) inject only a “narrow slice” of VFM features into diffusion. We argue richer semantics are needed to unlock their full potential.
December 27, 2025 at 10:26 AM
⬇️ Grab i-CIR, run your method, tell us how it handles instance-level composed image retrieval.

📄 arxiv.org/abs/2510.25387
🧪 github.com/billpsomas/i...

George Retsinas, @nikos-efth.bsky.social, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, @gtolias.bsky.social.
Instance-Level Composed Image Retrieval
November 6, 2025 at 12:08 PM
A method for i-CIR and CIR in general:

⚡BASIC: training-free pipeline (centering, projection with PCA, textual contextualization, Harris-style fusion) with strong results across i-CIR and class-level CIR benchmarks.
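A loose sketch of those four ingredients on top of generic embeddings; the PCA rank, fusion constant, Harris-style score, and folding textual contextualization into a single text embedding are all my guesses, not the paper's recipe:

```python
import numpy as np

def basic_scores(q_img, q_txt, db, k=0.05):
    """Hypothetical BASIC-like training-free scoring. q_img / q_txt are query
    image/text embeddings, db is [M, D] database embeddings (e.g., CLIP)."""
    mu = db.mean(axis=0)
    # Projection with PCA: top components of the centered database.
    _, _, vt = np.linalg.svd(db - mu, full_matrices=False)
    P = vt[:64].T                                    # [D, 64]; rank is a guess

    def emb(x):                                      # center, project, re-normalize
        z = (x - mu) @ P
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    s_img = emb(db) @ emb(q_img)                     # visual-query similarity
    s_txt = emb(db) @ emb(q_txt)                     # text-conditioned similarity
    # Harris-style fusion: reward items where *both* responses are high,
    # analogous to det - k*trace^2 on the pair of scores.
    return s_img * s_txt - k * (s_img + s_txt) ** 2

# Dummy usage with random 512-D embeddings.
db = np.random.randn(1000, 512)
scores = basic_scores(np.random.randn(512), np.random.randn(512), db)
ranking = np.argsort(-scores)                        # best database items first
```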
November 6, 2025 at 12:07 PM
Compact ⚖️ but hard 🔥:

📊~750K images, 202 instances, ~1,900 composed queries. Despite small per-query DBs (~3.7K images), i-CIR matches the difficulty of searching with >40M random distractors.
November 6, 2025 at 12:05 PM