Emily Xiao
@emilyxiao.bsky.social
Student @ CMU
Some insights we found:
- Preceding context + an attention sink are both critical for making block-sparse attention work without additional training.
- Grouping examples for encoding & retrieval also boosts performance over purely individual retrieval.
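A minimal sketch of the first insight (not the authors' code): a streaming block-sparse attention mask that keeps a few sink blocks at the start of the sequence plus the blocks immediately preceding each query block. Block size, number of sink blocks, and window size here are illustrative choices.

```python
import torch

def streaming_block_sparse_mask(seq_len: int,
                                block_size: int = 64,
                                num_sink_blocks: int = 1,
                                num_local_blocks: int = 4) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True = attend, False = masked out."""
    num_blocks = (seq_len + block_size - 1) // block_size
    block_mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
    for q in range(num_blocks):
        block_mask[q, :num_sink_blocks] = True      # attention sink blocks
        lo = max(0, q - num_local_blocks + 1)
        block_mask[q, lo:q + 1] = True              # preceding local context
    # Expand the block-level mask to token level and keep it causal.
    mask = block_mask.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
    mask = mask[:seq_len, :seq_len]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return mask & causal

# Example: pass as attn_mask to torch.nn.functional.scaled_dot_product_attention.
mask = streaming_block_sparse_mask(seq_len=512)
```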

[5/n]
March 18, 2025 at 3:48 PM
Results:
We evaluate DBSA with Llama models at context lengths up to 90k. DBSA achieves per-request latency comparable to fine-tuning while maintaining, on average, >95% of the best accuracy.

[3/n]
March 18, 2025 at 3:45 PM
Method:
- DBSA pre-encodes the many-shot examples with streaming block-sparse attention, so encoding time stays constant as new demos are added.
- During inference, it dynamically selects the relevant KV chunks for each test query, and works with any retrieval method.
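A rough sketch of the two phases under stated assumptions (this is not the paper's implementation): demos are pre-encoded once into per-chunk KV caches plus a cheap retrieval key per chunk (here, a mean-pooled embedding stands in for any retrieval method), and at inference only the top-k chunks' KV tensors are assembled as the prefix cache.

```python
import torch

class ChunkStore:
    """Holds pre-encoded KV caches for demo chunks (hypothetical helper)."""

    def __init__(self):
        self.keys, self.values, self.retrieval_keys = [], [], []

    def add(self, k: torch.Tensor, v: torch.Tensor, retrieval_key: torch.Tensor):
        # k, v: (num_layers, num_heads, chunk_len, head_dim) for one demo chunk,
        # produced once by the model's forward pass with the sparse mask.
        self.keys.append(k)
        self.values.append(v)
        self.retrieval_keys.append(retrieval_key)   # (emb_dim,) per chunk

    def select(self, query_emb: torch.Tensor, top_k: int = 8):
        """Score chunks against the test query and return a concatenated KV prefix."""
        sims = torch.stack(self.retrieval_keys) @ query_emb     # (num_chunks,)
        idx = sims.topk(min(top_k, len(self.keys))).indices.tolist()
        idx.sort()  # keep the original demo order when concatenating
        k = torch.cat([self.keys[i] for i in idx], dim=-2)
        v = torch.cat([self.values[i] for i in idx], dim=-2)
        return k, v  # feed as the past-KV prefix when decoding the query
```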

[2/n]
March 18, 2025 at 3:44 PM
Many-shot ICL (thousands of examples or more) can match fine-tuning on many tasks, but its high inference cost makes deployment impractical.

We introduce DBSA, a training-free framework that achieves the best efficiency even under high request volumes, while maintaining strong accuracy 🧵
March 18, 2025 at 3:43 PM