Martina Vilas
@martinagvilas.bsky.social
Computer Science PhD student | AI interpretability | Vision + Language | Cognitive Science. Prev. intern @MicrosoftResearch.
https://martinagvilas.github.io/
Working on this project was a great experience during my internship at @msftresearch.bsky.social 💙
Learned so much from this amazing team! Huge thanks to my coauthors: @vidhishab.bsky.social, Safoora Yousefi, @besmiranushi.bsky.social, @erichorvitz.bsky.social
October 22, 2025 at 3:38 PM
We also found that these signals emerge EARLY in reasoning! At just 4k tokens, we can predict solution quality with ROC-AUC > 0.6.
This enables early path selection during parallel generation and ~60% token savings with +2.1% accuracy gains 🚀
October 22, 2025 at 3:38 PM
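A rough sketch of how that early selection could work, assuming a trained probe `score_fn` that maps the hidden states of a ~4k-token prefix to a predicted-correctness score (the function name and interface are my own illustration, not the paper's code):

```python
def select_paths_early(prefix_hidden_states, score_fn, keep=1):
    """Score each partially generated path at the ~4k-token mark and
    return the indices of the paths worth continuing; the rest can be
    stopped early, which is where the token savings come from."""
    scores = [score_fn(h) for h in prefix_hidden_states]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:keep]
```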
Using latent-trajectory (LT) signals for answer selection in multi-sample inference leads to:
⚡ 48% average token reduction (up to 70%!)
📈 +2.6% accuracy improvement over majority voting
🎯 Works by identifying correct paths even when the majority is wrong
October 22, 2025 at 3:38 PM
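As a toy illustration of the selection rule (all scores here are made up, and the probe is assumed, not the paper's code):

```python
from collections import Counter

def majority_vote(answers):
    """Standard baseline: pick the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

def lt_select(answers, lt_scores):
    """Pick the answer from the trace the probe rates most likely correct."""
    return answers[max(range(len(answers)), key=lt_scores.__getitem__)]

answers = ["42", "42", "17"]          # the majority answer is wrong here
lt_scores = [0.31, 0.28, 0.92]        # hypothetical probe scores per trace
print(majority_vote(answers))         # -> "42"
print(lt_select(answers, lt_scores))  # -> "17": a correct minority path wins
```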
Hidden states have distinctive temporal patterns for correct paths. They show:
✴️ Larger overall representational change (Net ↑)
✴️ Less wandering in latent space (Cumulative ↓)
✴️ More direct progress toward final state (Aligned ↑)
October 22, 2025 at 3:38 PM
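A small synthetic demo of the "wandering" pattern (the numbers are made up, not the paper's data): two trajectories share the same endpoints, so their net change is identical, but the noisy one accumulates far more total movement:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 40
goal = rng.normal(size=d)

# Direct path: steady interpolation from the origin to the final state.
direct = np.linspace(np.zeros(d), goal, T)
# Wandering path: same start and end, with large random detours in between.
noise = rng.normal(scale=2.0, size=(T, d))
noise[0] = noise[-1] = 0.0
wander = direct + noise

for name, h in [("direct", direct), ("wandering", wander)]:
    net = np.linalg.norm(h[-1] - h[0])
    cumulative = np.linalg.norm(np.diff(h, axis=0), axis=1).sum()
    print(f"{name}: net={net:.1f}, cumulative={cumulative:.1f}")
```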
Across 3 reasoning models (DeepSeek-R1, Phi-4-Reasoning-Plus, Qwen3) and diverse domains (GPQA, AIME, TSP), LT signals:
✅ Significantly predict correctness
✅ Outperform output-based confidence measures and cross-layer signals
October 22, 2025 at 3:38 PM
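A minimal sketch of how such a predictor can be evaluated, assuming the three LT signals have already been extracted per trace (the file names, the logistic-regression probe, and the split are my own placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.load("lt_signals.npy")   # (n_traces, 3): net, cumulative, aligned change
y = np.load("is_correct.npy")   # (n_traces,): 1 if the trace's answer is correct

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(f"correctness ROC-AUC: {auc:.3f}")
```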
We track how representations evolve through the trace and extract 3 complementary signals:
📊 Net Change: Overall shift (start → end)
🔄 Cumulative Change: Total movement
🎯 Aligned Change: Progress toward final state
October 22, 2025 at 3:38 PM
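In code, one plausible reading of the three signals over a trace's per-step hidden states (the exact definitions, especially for Aligned Change, are my guess at the formulation, not the paper's implementation):

```python
import numpy as np

def lt_signals(h):
    """h: (T, d) array of hidden states along a reasoning trace."""
    h = np.asarray(h, dtype=np.float64)
    steps = np.diff(h, axis=0)                        # per-step displacement
    net = np.linalg.norm(h[-1] - h[0])                # Net Change: start -> end shift
    cumulative = np.linalg.norm(steps, axis=1).sum()  # Cumulative Change: path length
    aligned = 0.0                                     # Aligned Change: movement
    for t in range(len(steps)):                       # projected toward the final state
        to_final = h[-1] - h[t]
        dist = np.linalg.norm(to_final)
        if dist > 1e-8:
            aligned += steps[t] @ (to_final / dist)
    return {"net": net, "cumulative": cumulative, "aligned": aligned}
```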
Identifying trace quality is critical: it enables more reliable predictions, improves efficiency by avoiding wasted compute, and can be used to guide models toward productive reasoning strategies.
Our solution: Look inside the temporal evolution of the model's latent space! 🔍
October 22, 2025 at 3:38 PM
But not all reasoning traces are equal ⚖️ → some contain productive steps that lead to correct solutions ✅, while others deviate into overthinking, fail to converge, or exhibit inconsistent reasoning patterns ❌
October 22, 2025 at 3:38 PM
Modern LLMs use chain-of-thought reasoning to solve complex problems, generating step-by-step solutions that can span thousands of tokens.
📈 Scaling this inference-time compute (longer traces, multiple samples) significantly improves performance across reasoning tasks.
October 22, 2025 at 3:38 PM
👋 I also work in this field (examples on my profile). Would love to be added!
November 19, 2024 at 9:42 AM