#SigLIP
A new study shows that human-aligned AI models like AligNet boost robustness for Vision Transformers, SigLIP, and DINOv2 across the THINGS and Levels datasets. Lukas Muttenthaler's findings could reshape reliability benchmarks. Dive in! #AligNet #VisionTransformers #THINGSdataset

🔗 aidailypost.com/news/human-a...
November 13, 2025 at 5:09 PM
TTM provides substantial improvements on top of SimpleMatch, without external supervision.

Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.

Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
October 31, 2025 at 6:03 PM
SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.
October 31, 2025 at 6:03 PM
Super excited to share Test-Time Matching (TTM), an iterative, self-improving algorithm that unlocks substantial compositional reasoning capabilities in multimodal models.

TTM enables SigLIP-B16 (~0.2B params) to outperform GPT-4.1 on MMVP-VLM, establishing a new SOTA.
October 31, 2025 at 6:03 PM
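A minimal sketch of the group-matching idea behind SimpleMatch/TTM as I read these posts: score a whole group of images and captions jointly with SigLIP and pick the globally best assignment, rather than scoring each pair independently. The checkpoint name is the public Hugging Face one; TTM's iterative self-training loop is not reproduced here.

```python
# Sketch only: group-level assignment with SigLIP scores (SimpleMatch-style,
# per my reading of the posts); the actual TTM algorithm also iterates and
# self-trains on confident matches, which is omitted here.
from itertools import permutations

import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def match_group(images, captions):
    """Return the caption permutation that maximizes total image-text similarity."""
    inputs = processor(text=captions, images=images,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        sim = model(**inputs).logits_per_image      # (n_images, n_captions)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(len(captions))):
        score = sum(sim[i, j].item() for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm  # best_perm[i] = caption index assigned to image i
```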
2510.11690
Latent generative modeling, in which a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT), yet the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder...
October 17, 2025 at 12:06 AM
[30/30] 132 Likes, 3 Comments, 1 Posts
2510.11690, cs.CV | cs.LG, 13 Oct 2025

🆕Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
October 17, 2025 at 12:05 AM
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models

Mercari fine-tunes SigLIP on product image-title pairs to get 9.1% offline improvement and 50% CTR increase in production for visual similarity-based recommendations.

📝 arxiv.org/abs/2510.13359
October 16, 2025 at 6:08 AM
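For context, a hedged sketch of what fine-tuning SigLIP on image-title pairs typically looks like: the pairwise sigmoid loss over a batch, after which the image tower's embeddings drive visual-similarity retrieval. The checkpoint name is the public Hugging Face one; Mercari's actual data pipeline and hyperparameters are not described in the post.

```python
# Illustrative sketch, not Mercari's code: fine-tune SigLIP on (image, title)
# pairs with the pairwise sigmoid loss, then use the image embeddings for
# visual-similarity recommendations.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sigmoid_loss(img_emb, txt_emb, scale, bias):
    """SigLIP loss: positive label on the diagonal, negative everywhere else."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * scale + bias
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).mean()

def train_step(product_images, product_titles):
    inputs = processor(text=product_titles, images=product_images,
                       padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"])
    loss = sigmoid_loss(img_emb, txt_emb, model.logit_scale.exp(), model.logit_bias)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```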
Replace the Variational Autoencoder (VAE) with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, which they term Representation Autoencoders (RAEs).
October 15, 2025 at 3:49 AM
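A rough sketch of that RAE idea under my own assumptions: the pretrained encoder is frozen and defines the latent space, and only a pixel decoder is trained; the decoder below is a deliberately tiny placeholder, not the paper's architecture.

```python
# Sketch of a Representation Autoencoder: a frozen pretrained encoder (DINOv2
# here) supplies the latents, a small trained decoder maps them back to pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RepresentationAutoencoder(nn.Module):
    def __init__(self, encoder_name="facebook/dinov2-base",
                 latent_dim=768, image_size=224, patch=14):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.encoder.requires_grad_(False)            # frozen representation encoder
        self.decoder = nn.Linear(latent_dim, patch * patch * 3)  # toy pixel decoder
        self.patch, self.grid = patch, image_size // patch

    def encode(self, pixel_values):
        with torch.no_grad():
            tokens = self.encoder(pixel_values=pixel_values).last_hidden_state
        return tokens[:, 1:]                          # drop CLS -> per-patch latents

    def decode(self, latents):
        b, p, g = latents.size(0), self.patch, self.grid
        patches = self.decoder(latents).view(b, g, g, p, p, 3)
        return patches.permute(0, 5, 1, 3, 2, 4).reshape(b, 3, g * p, g * p)

def reconstruction_loss(rae, pixel_values):
    # Only the decoder receives gradients; a DiT would then be trained
    # directly in the frozen encoder's latent space.
    return F.mse_loss(rae.decode(rae.encode(pixel_values)), pixel_values)
```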
Sharing new paper: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

We extend classical unimodal active learning to multimodal active learning with unaligned data, enabling data-efficient fine-tuning and pretraining of vision-language models such as CLIP and SigLIP.

1/3
October 10, 2025 at 6:03 PM
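The thread does not spell out the acquisition strategy, so the snippet below is only a generic illustration of what a multimodal active-learning selection step can look like (pick the image-text candidates the current model is least certain about and send them for pairing); it is not the paper's algorithm.

```python
# Generic illustration of a multimodal active-learning acquisition step, NOT
# the paper's method: rank unpaired images by how ambiguous their best text
# match is under the current CLIP/SigLIP-style model.
import torch

def select_for_annotation(image_embs, text_embs, k=64):
    """image_embs, text_embs: L2-normalized embeddings of the unpaired pools."""
    sims = image_embs @ text_embs.t()              # candidate match scores
    top2 = sims.topk(2, dim=1).values              # best and runner-up text per image
    margin = top2[:, 0] - top2[:, 1]               # small margin = ambiguous match
    return margin.argsort()[:k]                    # indices of most ambiguous images
```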
LucidFlux: Restore any image—no captions, no text.

Powered by Flux.1 diffusion transformer.
Dual-branch conditioning.
Adaptive modulation.
SigLIP semantic alignment.

Read more:

aiadoptionagency.com/lucidflux-ca...
https://aiadoptionagency.com/lucidflux-caption-free-universal-image-restoration-via-a-large-scale-diffusion-transformer/
October 6, 2025 at 8:11 PM
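As a rough reading of "SigLIP semantic alignment" plus "adaptive modulation": embed the degraded input with SigLIP and let that embedding modulate diffusion-transformer features, AdaLN-style. Module names and the modulation scheme below are my assumptions, not LucidFlux's actual design.

```python
# Assumed sketch (not LucidFlux's architecture): a SigLIP embedding of the
# degraded image conditions a DiT block via learned scale/shift modulation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoProcessor

siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

class SemanticModulation(nn.Module):
    """Maps a SigLIP image embedding to per-block scale/shift for DiT features."""
    def __init__(self, cond_dim=768, hidden_dim=1024):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden_states, cond):
        # hidden_states: (B, N_tokens, hidden_dim); cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return hidden_states * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

def semantic_condition(degraded_image):
    inputs = siglip_proc(images=degraded_image, return_tensors="pt")
    with torch.no_grad():
        return siglip.get_image_features(**inputs)  # (1, 768) semantic embedding
```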
[18/30] 182 Likes, 39 Comments, 2 Posts
2509.22414, cs.CV, 26 Sep 2025

🆕LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Song Fei, Tian Ye, Lujia Wang, Lei Zhu
October 5, 2025 at 12:06 AM
Pitted Google's SigLIP against Apple's MobileCLIP and the results are:
- if you prefer searching with danbooru-style tags, go with SigLIP
- if you prefer English sentences, go with MobileCLIP

On my Ryzen 7 7800X3D CPU, MobileCLIP is faster than SigLIP by 3-4 seconds.
October 2, 2025 at 12:34 AM
needed a better way to traverse my tens of thousands of reference images so testing out SigLIP-based semantic image search
September 28, 2025 at 6:31 AM
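A minimal sketch of what "SigLIP-based semantic image search" over a local reference folder can look like, assuming the public Hugging Face checkpoint; swapping in MobileCLIP is mostly a model/processor change.

```python
# Minimal semantic image search sketch with SigLIP: embed every image once,
# then rank by cosine similarity against an embedded text query.
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def index_images(folder):
    paths = sorted(Path(folder).glob("*.jpg"))
    embs = []
    for p in paths:
        inputs = processor(images=Image.open(p).convert("RGB"), return_tensors="pt")
        with torch.no_grad():
            embs.append(F.normalize(model.get_image_features(**inputs), dim=-1))
    return paths, torch.cat(embs)

def search(query, paths, embs, k=5):
    inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        q = F.normalize(model.get_text_features(**inputs), dim=-1)
    scores = (q @ embs.t()).squeeze(0)
    top = scores.topk(min(k, len(paths)))
    return [(paths[i], top.values[j].item()) for j, i in enumerate(top.indices.tolist())]
```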
SupCLAP introduces Support Vector Regularization (SVR) to control the perpendicular component in contrastive learning, mitigating trajectory drift with unsupervised radius modeling; it outperforms the InfoNCE and SigLIP losses.
SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
Jiehui Luo, Yuguo Yin, Yuxin Xie, Jinghan Ru, Xianwei Zhuang, Minghua He, Aofan Liu, Zihan Xiong, Dongchao Yang
arxiv.org
September 26, 2025 at 10:35 AM
We show that explaining vision–language interactions is essential to faithfully interpret models like OpenAI CLIP & Google SigLIP-2. FIxLIP is grounded in cooperative game theory, where we analyze its intriguing properties compared to prior art like Shapley values.
👇2/4
September 25, 2025 at 4:43 PM
Caption‑trained multimodal models miss details like broccoli’s yellow color. Reconstruction Alignment (RecA) adds CLIP and SigLIP embeddings to generation side, improving perception‑generation alignment. https://getnews.me/unified-multimodal-models-link-visual-understanding-and-generation/ #umm #rec
September 25, 2025 at 3:47 PM
In VLMs, a connector sits between the vision embeddings and the language embeddings, because the vision model (CLIP, SigLIP, ...) and the LM operate separately. The point is that a lot of information is lost at that connector.
September 23, 2025 at 2:45 AM
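For readers unfamiliar with that connector, a minimal LLaVA-style sketch of what it usually is: a small MLP projecting vision-encoder patch tokens into the LM's embedding space; the dimensions below are illustrative.

```python
# Typical VLM connector sketch: project vision tokens (CLIP/SigLIP, ~1024-1152
# dims) into the LM embedding space. Whatever the projection cannot carry is
# where the information loss mentioned above occurs.
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1152, lm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_tokens):
        # (B, N_patches, vision_dim) -> (B, N_patches, lm_dim); the projected
        # tokens are then concatenated with the text token embeddings.
        return self.proj(vision_tokens)
```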
• SigLIP2 → still the best for text-to-image retrieval.
– Unlike SigLIP models, PE shows a large gap between its image-to-image and text-to-image performance.
September 5, 2025 at 2:35 PM
SfM/SLAM follow an instance-level class definition. On the ILIAS benchmark, which evaluates instance-level recognition ability (it's not about geometry), SigLIP (1 & 2) is significantly better than DINOv2. Before this result I had a similar intuition to yours; not anymore.
vrg.fel.cvut.cz/ilias/
August 15, 2025 at 7:45 AM
Yay, DINOv3 is out!

SigLIP (VLMs) and DINO are two competing paradigms for image encoders.

My intuition is that joint vision-language modeling works great for semantic problems but may be too coarse for geometry problems like SfM or SLAM.

Most animals navigate 3D space perfectly without language.
August 14, 2025 at 5:59 PM
This visualization tells you that you have a lot of localized information. This is good for some tasks but not as good for others. There are tasks which SigLIP is good for which those "better" DINOv2 features are ineffective.
August 14, 2025 at 4:39 PM
SigLIP features objectively aren't "bad" though. SigLIP is tremendously effective. The "noise features" in that image are probably features that simply aren't localized, i.e. global semantics which look like noise because they are distributed over all of your tokens.
August 14, 2025 at 4:35 PM
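A minimal sketch of the kind of visualization this thread is about: PCA over a single image's patch tokens with the top three components mapped to RGB. Localized features produce crisp object colorings, while globally distributed features look noise-like under this view.

```python
# Patch-feature PCA visualization sketch (works for SigLIP or DINOv2 tokens):
# project each patch token onto the top-3 principal components and view as RGB.
import torch

def pca_rgb(patch_tokens, grid_h, grid_w):
    """patch_tokens: (N_patches, D) features of one image, N_patches = grid_h*grid_w."""
    x = patch_tokens - patch_tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)            # top-3 principal directions
    rgb = x @ v                                    # (N_patches, 3)
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
    return rgb.reshape(grid_h, grid_w, 3)          # display as an image
```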
UNITE detects deepfakes even without human faces, strengthening defenses against AI-generated video in elections and media.

#Deepfake #SigLIP #Transformer #UNITE
www.matricedigitale.it/2025/07/27/u...
July 27, 2025 at 5:35 PM
Beats or is competitive with SigLIP/2 and DINOv2 on linear evaluation, OOD detection, and linear segmentation.
July 23, 2025 at 12:17 PM
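For reference, "linear eval" here means fitting a single linear classifier on frozen encoder features; a minimal sketch follows, with the encoder and data loader as placeholders.

```python
# Linear-probe evaluation sketch: freeze the backbone, train only a linear head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(frozen_encoder, train_loader, num_classes, feat_dim, epochs=10):
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    frozen_encoder.eval()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = frozen_encoder(images)     # (B, feat_dim) frozen features
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head  # evaluate this head on the held-out split
```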
Enjoyed discussing how we used GPT and SigLIP to design search and access on digitaldocumerica.org @adho-org.bsky.social #DH2025
July 16, 2025 at 1:45 PM