Can a simple inference-time approach unlock better Vision-Language Compositionality?🤯
Our latest paper shows how adding structure at inference time significantly boosts the performance of popular dual-encoder VLMs across multiple datasets.
Read more: arxiv.org/abs/2506.09691