The global TTM variant achieves up to 33.3% relative error reduction.
For 1-by-k groups, GroupMatch = GroupScore, so the change of metric brings no benefit. Yet TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.
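For intuition, here is a tiny self-contained check of that equivalence: with a single image and k candidate captions, both metrics reduce to "the ground-truth caption gets the highest score." The function names below are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def group_score_1xk(scores: np.ndarray, correct: int) -> bool:
    """GroupScore for a 1-by-k group: the lone image must rank its
    ground-truth caption above every distractor caption (the text-to-image
    direction is trivially satisfied with a single image)."""
    return int(scores.argmax()) == correct

def group_match_1xk(scores: np.ndarray, correct: int) -> bool:
    """GroupMatch for a 1-by-k group: the best image-caption assignment
    must pick the ground-truth caption. With one image, that assignment
    is just the argmax over captions, so it coincides with GroupScore."""
    return int(scores.argmax()) == correct

# Example: similarity scores of one image against k = 4 candidate captions.
scores = np.array([0.21, 0.34, 0.19, 0.26])
assert group_score_1xk(scores, correct=1) == group_match_1xk(scores, correct=1)
```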
Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.
Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold that gradually expands coverage across the test set (see the sketch below).
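A minimal sketch of how these two ingredients could fit together (placeholder API, not the authors' released implementation): each round, pseudo-label the groups whose best matching clears the current threshold, finetune on them, then decay the threshold so later rounds cover more of the test set.

```python
from scipy.optimize import linear_sum_assignment

def test_time_matching(score_fn, finetune_fn, groups, rounds=5, tau0=0.9, decay=0.8):
    """Sketch of a TTM-style loop. Assumed interfaces:
    score_fn(group) -> numpy image-caption similarity matrix;
    finetune_fn(score_fn, pseudo_labeled) -> updated score_fn."""
    tau = tau0
    for _ in range(rounds):
        pseudo_labeled = []
        for g in groups:
            s = score_fn(g)                               # similarity matrix for the group
            rows, cols = linear_sum_assignment(-s)        # best matching (GroupMatch-style)
            confidence = s[rows, cols].mean()             # simple confidence proxy
            if confidence >= tau:                         # keep only confident groups
                pseudo_labeled.append((g, cols))          # matching serves as pseudo-label
        score_fn = finetune_fn(score_fn, pseudo_labeled)  # stronger supervision from matches
        tau *= decay                                      # lower threshold -> wider coverage
    return score_fn
```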
TTM enables SigLIP-B16 (~0.2B params) to outperform GPT-4.1 on MMVP-VLM, establishing a new SOTA.
Online Finetuning Decision Transformers with Pure RL Gradients
RL drives reasoning in LLMs—but remains underexplored for online finetuning of Decision Transformers (DTs), where most methods still rely mainly on supervised objectives.
Why?
We extend classical unimodal active learning to the multimodal setting with unaligned data, enabling data-efficient finetuning and pretraining of vision-language models such as CLIP and SigLIP.
1/3
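Purely to illustrate the setup (a generic uncertainty-style acquisition loop, not the paper's algorithm; every name below is hypothetical): with unaligned pools, acquisition scores can be computed per modality, and the most uncertain items are sent for annotation before finetuning.

```python
def multimodal_active_learning(model, image_pool, text_pool, annotate, finetune,
                               budget=1024, batch=64):
    """Illustrative only. Assumed interfaces: model.image_uncertainty(x) and
    model.text_uncertainty(t) return scalar acquisition scores; annotate(kind, item)
    queries the oracle; finetune(model, labeled) returns an updated model."""
    labeled = []
    while len(labeled) < budget:
        # Score each modality separately, since the pools are unaligned.
        candidates = [(model.image_uncertainty(x), "image", x) for x in image_pool]
        candidates += [(model.text_uncertainty(t), "text", t) for t in text_pool]
        # Query annotations for the most uncertain items this round.
        queries = sorted(candidates, key=lambda c: c[0], reverse=True)[:batch]
        labeled += [annotate(kind, item) for _, kind, item in queries]
        model = finetune(model, labeled)
    return model
```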
We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically—just enough for easy queries and more for hard ones.
📊 Example (avg. budget = 32):
(2/3)
We turn test-time compute allocation into a bandit learning problem, achieving:
✅ +11.10% on MATH-500
✅ +7.41% on LiveCodeBench
Paper: arxiv.org/pdf/2506.12721
(1/3)
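For intuition only, here is a simple adaptive-allocation loop in the spirit described above (not the paper's bandit formulation; `sample_fn` and the agreement heuristic are assumptions): probe each query with a few samples, treat low answer agreement as a sign of difficulty, and spend the leftover budget on the hardest queries while keeping the per-query average near the target.

```python
from collections import Counter

def adaptive_allocate(queries, sample_fn, avg_budget=32, min_s=4, max_s=128):
    """Illustrative sketch. sample_fn(query) -> one sampled answer string."""
    total = avg_budget * len(queries)
    # Probe phase: a small fixed budget per query.
    answers = {q: [sample_fn(q) for _ in range(min_s)] for q in queries}
    spent = min_s * len(queries)

    def agreement(ans):
        # Fraction of samples agreeing with the most common answer.
        return Counter(ans).most_common(1)[0][1] / len(ans)

    # Greedily spend the remaining budget on the lowest-agreement (hardest) queries.
    while spent < total:
        eligible = [q for q in queries if len(answers[q]) < max_s]
        if not eligible:
            break
        hardest = min(eligible, key=lambda q: agreement(answers[q]))
        answers[hardest].append(sample_fn(hardest))
        spent += 1

    # Final answer per query via majority vote over its samples.
    return {q: Counter(a).most_common(1)[0][0] for q, a in answers.items()}
```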
Please visit yinglunz.com for details on research directions and instructions on how to get in touch.