#SigLIP
A new study shows that human-aligned AI models like AligNet boost robustness for Vision Transformers, SigLIP, and DINOv2 across the THINGS and Levels datasets. Lukas Muttenthaler's findings could reshape reliability benchmarks. Dive in! #AligNet #VisionTransformers #THINGSdataset

🔗 aidailypost.com/news/human-a...
November 13, 2025 at 5:09 PM
TTM provides substantial improvements on top of SimpleMatch, without external supervision.

Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.

Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
October 31, 2025 at 6:03 PM
SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.
October 31, 2025 at 6:03 PM
Super excited to share Test-Time Matching (TTM), an iterative, self-improving algorithm that unlocks substantial compositional reasoning capabilities in multimodal models.

TTM enables SigLIP-B16 (~0.2B params) to outperform GPT-4.1 on MMVP-VLM, establishing a new SOTA.
October 31, 2025 at 6:03 PM
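A minimal sketch of the group-matching idea behind SimpleMatch/TTM as I read these posts: score a whole group of images and captions jointly with SigLIP and pick the globally best assignment, rather than scoring each pair independently. The checkpoint name is the public Hugging Face one; TTM's iterative self-training loop is not reproduced here.

```python
# Sketch only: group-level assignment with SigLIP scores (SimpleMatch-style,
# per my reading of the posts); the actual TTM algorithm also iterates and
# self-trains on confident matches, which is omitted here.
from itertools import permutations

import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def match_group(images, captions):
    """Return the caption permutation that maximizes total image-text similarity."""
    inputs = processor(text=captions, images=images,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        sim = model(**inputs).logits_per_image      # (n_images, n_captions)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(len(captions))):
        score = sum(sim[i, j].item() for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm  # best_perm[i] = caption index assigned to image i
```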
2510.11690
Latent generative modeling, in which a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT), yet the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder...
October 17, 2025 at 12:06 AM
[30/30] 132 Likes, 3 Comments, 1 Posts
2510.11690, cs.CV | cs.LG, 13 Oct 2025

🆕Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
October 17, 2025 at 12:05 AM
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models

Mercari fine-tunes SigLIP on product image-title pairs to get 9.1% offline improvement and 50% CTR increase in production for visual similarity-based recommendations.

📝 arxiv.org/abs/2510.13359
October 16, 2025 at 6:08 AM
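For context, a hedged sketch of what fine-tuning SigLIP on image-title pairs typically looks like: the pairwise sigmoid loss over a batch, after which the image tower's embeddings drive visual-similarity retrieval. The checkpoint name is the public Hugging Face one; Mercari's actual data pipeline and hyperparameters are not described in the post.

```python
# Illustrative sketch, not Mercari's code: fine-tune SigLIP on (image, title)
# pairs with the pairwise sigmoid loss, then use the image embeddings for
# visual-similarity recommendations.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sigmoid_loss(img_emb, txt_emb, scale, bias):
    """SigLIP loss: positive label on the diagonal, negative everywhere else."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * scale + bias
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).mean()

def train_step(product_images, product_titles):
    inputs = processor(text=product_titles, images=product_images,
                       padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"])
    loss = sigmoid_loss(img_emb, txt_emb, model.logit_scale.exp(), model.logit_bias)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```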
Replace the Variational Autoencoder (VAE) with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, which they term Representation Autoencoders (RAEs).
October 15, 2025 at 3:49 AM
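A rough sketch of that RAE idea under my own assumptions: the pretrained encoder is frozen and defines the latent space, and only a pixel decoder is trained; the decoder below is a deliberately tiny placeholder, not the paper's architecture.

```python
# Sketch of a Representation Autoencoder: a frozen pretrained encoder (DINOv2
# here) supplies the latents, a small trained decoder maps them back to pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RepresentationAutoencoder(nn.Module):
    def __init__(self, encoder_name="facebook/dinov2-base",
                 latent_dim=768, image_size=224, patch=14):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.encoder.requires_grad_(False)            # frozen representation encoder
        self.decoder = nn.Linear(latent_dim, patch * patch * 3)  # toy pixel decoder
        self.patch, self.grid = patch, image_size // patch

    def encode(self, pixel_values):
        with torch.no_grad():
            tokens = self.encoder(pixel_values=pixel_values).last_hidden_state
        return tokens[:, 1:]                          # drop CLS -> per-patch latents

    def decode(self, latents):
        b, p, g = latents.size(0), self.patch, self.grid
        patches = self.decoder(latents).view(b, g, g, p, p, 3)
        return patches.permute(0, 5, 1, 3, 2, 4).reshape(b, 3, g * p, g * p)

def reconstruction_loss(rae, pixel_values):
    # Only the decoder receives gradients; a DiT would then be trained
    # directly in the frozen encoder's latent space.
    return F.mse_loss(rae.decode(rae.encode(pixel_values)), pixel_values)
```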
Sharing new paper: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

We extend classical unimodal active learning to multimodal active learning with unaligned data, enabling data-efficient fine-tuning and pretraining of vision-language models such as CLIP and SigLIP.

1/3
October 10, 2025 at 6:03 PM
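The thread does not spell out the acquisition strategy, so the snippet below is only a generic illustration of what a multimodal active-learning selection step can look like (pick the image-text candidates the current model is least certain about and send them for pairing); it is not the paper's algorithm.

```python
# Generic illustration of a multimodal active-learning acquisition step, NOT
# the paper's method: rank unpaired images by how ambiguous their best text
# match is under the current CLIP/SigLIP-style model.
import torch

def select_for_annotation(image_embs, text_embs, k=64):
    """image_embs, text_embs: L2-normalized embeddings of the unpaired pools."""
    sims = image_embs @ text_embs.t()              # candidate match scores
    top2 = sims.topk(2, dim=1).values              # best and runner-up text per image
    margin = top2[:, 0] - top2[:, 1]               # small margin = ambiguous match
    return margin.argsort()[:k]                    # indices of most ambiguous images
```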
LucidFlux: Restore any image—no captions, no text.

Powered by Flux.1 diffusion transformer.
Dual-branch conditioning.
Adaptive modulation.
SigLIP semantic alignment.

Read more:

aiadoptionagency.com/lucidflux-ca...
https://aiadoptionagency.com/lucidflux-caption-free-universal-image-restoration-via-a-large-scale-diffusion-transformer/
October 6, 2025 at 8:11 PM
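As a rough reading of "SigLIP semantic alignment" plus "adaptive modulation": embed the degraded input with SigLIP and let that embedding modulate diffusion-transformer features, AdaLN-style. Module names and the modulation scheme below are my assumptions, not LucidFlux's actual design.

```python
# Assumed sketch (not LucidFlux's architecture): a SigLIP embedding of the
# degraded image conditions a DiT block via learned scale/shift modulation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoProcessor

siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

class SemanticModulation(nn.Module):
    """Maps a SigLIP image embedding to per-block scale/shift for DiT features."""
    def __init__(self, cond_dim=768, hidden_dim=1024):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden_states, cond):
        # hidden_states: (B, N_tokens, hidden_dim); cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return hidden_states * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

def semantic_condition(degraded_image):
    inputs = siglip_proc(images=degraded_image, return_tensors="pt")
    with torch.no_grad():
        return siglip.get_image_features(**inputs)  # (1, 768) semantic embedding
```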
[18/30] 182 Likes, 39 Comments, 2 Posts
2509.22414, cs.CV, 26 Sep 2025

🆕LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Song Fei, Tian Ye, Lujia Wang, Lei Zhu
October 5, 2025 at 12:06 AM
Pitted Google's SigLIP against Apple's MobileCLIP and the results are:
- if you prefer searching with danbooru-style tags, go with SigLIP
- if you prefer English sentences, go with MobileCLIP

On my Ryzen 7 7800X3D CPU, MobileCLIP is faster than SigLIP by 3-4 seconds.
October 2, 2025 at 12:34 AM
needed a better way to traverse my tens of thousands of reference images so testing out SigLIP-based semantic image search
September 28, 2025 at 6:31 AM
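A minimal sketch of what "SigLIP-based semantic image search" over a local reference folder can look like, assuming the public Hugging Face checkpoint; swapping in MobileCLIP is mostly a model/processor change.

```python
# Minimal semantic image search sketch with SigLIP: embed every image once,
# then rank by cosine similarity against an embedded text query.
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def index_images(folder):
    paths = sorted(Path(folder).glob("*.jpg"))
    embs = []
    for p in paths:
        inputs = processor(images=Image.open(p).convert("RGB"), return_tensors="pt")
        with torch.no_grad():
            embs.append(F.normalize(model.get_image_features(**inputs), dim=-1))
    return paths, torch.cat(embs)

def search(query, paths, embs, k=5):
    inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        q = F.normalize(model.get_text_features(**inputs), dim=-1)
    scores = (q @ embs.t()).squeeze(0)
    top = scores.topk(min(k, len(paths)))
    return [(paths[i], top.values[j].item()) for j, i in enumerate(top.indices.tolist())]
```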
SupCLAP introduces Support Vector Regularization (SVR) to control the perpendicular component in contrastive learning, mitigating trajectory drift with unsupervised radius modeling; it outperforms the InfoNCE and SigLIP losses.
SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
Jiehui Luo, Yuguo Yin, Yuxin Xie, Jinghan Ru, Xianwei Zhuang, Minghua He, Aofan Liu, Zihan Xiong, Dongchao Yang
arxiv.org
September 26, 2025 at 10:35 AM
We show that explaining vision–language interactions is essential to faithfully interpret models like OpenAI CLIP & Google SigLIP-2. FIxLIP is grounded in cooperative game theory, where we analyze its intriguing properties compared to prior art like Shapley values.
👇2/4
September 25, 2025 at 4:43 PM
Caption‑trained multimodal models miss details like broccoli’s yellow color. Reconstruction Alignment (RecA) adds CLIP and SigLIP embeddings to generation side, improving perception‑generation alignment. https://getnews.me/unified-multimodal-models-link-visual-understanding-and-generation/ #umm #rec
September 25, 2025 at 3:47 PM
In VLMs, a connector sits between the vision embeddings and the language embeddings, because the vision model (CLIP, SigLIP, ...) and the LM operate separately. The point is that a lot of information is lost at that connector.
September 23, 2025 at 2:45 AM
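For readers unfamiliar with that connector, a minimal LLaVA-style sketch of what it usually is: a small MLP projecting vision-encoder patch tokens into the LM's embedding space; the dimensions below are illustrative.

```python
# Typical VLM connector sketch: project vision tokens (CLIP/SigLIP, ~1024-1152
# dims) into the LM embedding space. Whatever the projection cannot carry is
# where the information loss mentioned above occurs.
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1152, lm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_tokens):
        # (B, N_patches, vision_dim) -> (B, N_patches, lm_dim); the projected
        # tokens are then concatenated with the text token embeddings.
        return self.proj(vision_tokens)
```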
• SigLIP2 → still the best for text-to-image retrieval.
– Unlike SigLIP models, PE shows a large gap between its image-to-image and text-to-image performance.
September 5, 2025 at 2:35 PM
SfM/SLAM follow an instance-level class definition. On the ILIAS benchmark, which evaluates instance-level recognition ability (it's not about geometry), SigLIP (1 & 2) is significantly better than DINOv2. Before this result I had a similar intuition to yours; not anymore.
vrg.fel.cvut.cz/ilias/
August 15, 2025 at 7:45 AM
Yay, DINOv3 is out!

SigLIP (VLMs) and DINO are two competing paradigms for image encoders.

My intuition is that joint vision-language modeling works great for semantic problems but may be too coarse for geometry problems like SfM or SLAM.

Most animals navigate 3D space perfectly without language.
August 14, 2025 at 5:59 PM
This visualization tells you that you have a lot of localized information. This is good for some tasks but not as good for others. There are tasks which SigLIP is good for which those "better" DINOv2 features are ineffective.
August 14, 2025 at 4:39 PM
SigLIP features objectively aren't "bad" though. SigLIP is tremendously effective. The "noise features" in that image are probably features that simply aren't localized, i.e. global semantics which look like noise because they are distributed over all of your tokens.
August 14, 2025 at 4:35 PM
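A minimal sketch of the kind of visualization this thread is about: PCA over a single image's patch tokens with the top three components mapped to RGB. Localized features produce crisp object colorings, while globally distributed features look noise-like under this view.

```python
# Patch-feature PCA visualization sketch (works for SigLIP or DINOv2 tokens):
# project each patch token onto the top-3 principal components and view as RGB.
import torch

def pca_rgb(patch_tokens, grid_h, grid_w):
    """patch_tokens: (N_patches, D) features of one image, N_patches = grid_h*grid_w."""
    x = patch_tokens - patch_tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)            # top-3 principal directions
    rgb = x @ v                                    # (N_patches, 3)
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
    return rgb.reshape(grid_h, grid_w, 3)          # display as an image
```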
UNITE detects deepfakes even without human faces, strengthening defenses against AI-generated video in elections and media.

#Deepfake #SigLIP #Transformer #UNITE
www.matricedigitale.it/2025/07/27/u...
July 27, 2025 at 5:35 PM
Beats or is competitive with SigLIP/2 and DINOv2 on linear evaluation, OOD detection, and linear segmentation.
July 23, 2025 at 12:17 PM
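For reference, "linear eval" here means fitting a single linear classifier on frozen encoder features; a minimal sketch follows, with the encoder and data loader as placeholders.

```python
# Linear-probe evaluation sketch: freeze the backbone, train only a linear head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(frozen_encoder, train_loader, num_classes, feat_dim, epochs=10):
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    frozen_encoder.eval()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = frozen_encoder(images)     # (B, feat_dim) frozen features
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head  # evaluate this head on the held-out split
```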
Enjoyed discussing how we used GPT and SigLIP to design search and access on digitaldocumerica.org @adho-org.bsky.social #DH2025
July 16, 2025 at 1:45 PM