Emily Xiao
@emilyxiao.bsky.social
Student @ CMU
Some insights we found:
- preceding context + an attention sink are both critical for making block-sparse attention work without additional training (see the mask sketch below this post).
- grouping examples for encoding & retrieval also boosts performance vs. purely individual retrieval.

[5/n]
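Not from the paper, just a minimal illustration of what a "sink + preceding context" block-sparse mask can look like; the block size and window below are made-up parameters, not the values used in the work:

```python
import numpy as np

def streaming_block_sparse_mask(n_tokens, block_size=64, sink_blocks=1, local_blocks=4):
    """Boolean attention mask (True = attend) for block-sparse encoding.

    Each query block attends to the first `sink_blocks` blocks (attention sink)
    and to the `local_blocks` most recent blocks (preceding context); all other
    cross-block attention is dropped. Parameter values are illustrative only.
    """
    n_blocks = (n_tokens + block_size - 1) // block_size
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for qb in range(n_blocks):
        q_lo, q_hi = qb * block_size, min((qb + 1) * block_size, n_tokens)
        keep = set(range(min(sink_blocks, qb + 1)))                 # attention sink
        keep |= set(range(max(0, qb - local_blocks + 1), qb + 1))   # preceding blocks
        for kb in keep:
            k_lo, k_hi = kb * block_size, min((kb + 1) * block_size, n_tokens)
            mask[q_lo:q_hi, k_lo:k_hi] = True
    return np.tril(mask)  # preserve causality inside the allowed blocks
```

Because each new demo only attends to the sink and a bounded window of preceding blocks, encoding cost per added demo stays constant, which is the property the method post below relies on.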
Storage Cost?
Yes, the cache for thousands of examples can be large. However, it is also easy to re-compute if needed, unlike fine-tuned parameters, which likewise require substantial storage across a large number of tasks and are often stored indefinitely.

[4/n]
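For a rough sense of scale (my own back-of-the-envelope numbers, not figures from the thread), assuming a Llama-3-8B-style model and an fp16 KV cache:

```python
# Back-of-the-envelope KV-cache size for a many-shot demonstration cache.
# Architecture numbers assume a Llama-3-8B-style model (32 layers, 8 KV heads
# via GQA, head dim 128) with an fp16 cache; adjust for your actual model.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2            # fp16
context_tokens = 90_000        # roughly the longest context mentioned in the thread

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
cache_gb = context_tokens * bytes_per_token / 1e9
print(f"{bytes_per_token / 1024:.0f} KiB per token, ~{cache_gb:.1f} GB total")
# -> 128 KiB per token, ~11.8 GB total
```

Sizeable, but as the post notes, it can be regenerated on demand rather than kept around per task.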
Results:
We evaluate DBSA with Llama models at context lengths up to 90k. DBSA achieves per-request latency comparable to fine-tuning while maintaining, on average, >95% of the best accuracy.

[3/n]
Method:
- DBSA pre-encodes the many-shot examples with streaming block-sparse attention, so encoding time stays constant as new demos are added.
- At inference, it dynamically selects the relevant KV chunks for each test query, and works with any retrieval method (rough selection sketch below this post).

[2/n]
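To make the selection step concrete, here is a hypothetical sketch of per-query chunk retrieval (my reading of the post, not the released code); `query_emb`, `chunk_embs`, `chunk_kv`, and `top_k` are assumed names, and cosine similarity stands in for whatever retrieval method you plug in:

```python
import numpy as np

def select_demo_chunks(query_emb, chunk_embs, chunk_kv, top_k=8):
    """Pick which pre-encoded demo chunks to feed to the model for one query.

    query_emb:  (d,) embedding of the test query
    chunk_embs: (n_chunks, d) embeddings of the cached demo chunks
    chunk_kv:   list mapping chunk index -> its pre-encoded KV blocks
    All names and the value of top_k are illustrative, not from the paper.
    """
    sims = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    chosen = np.argsort(-sims)[:top_k]
    # Keep the original demo order so positions stay consistent with encoding.
    chosen = np.sort(chosen)
    return [chunk_kv[i] for i in chosen]
```

The selected KV chunks would then be passed to decoding without re-encoding the demonstrations, which, as I read it, is where the per-request latency savings come from.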
Many-shot ICL (thousands of examples or more) can match fine-tuning on many tasks, but its high inference cost makes deployment impractical.

We introduce DBSA, a training-free framework that achieves the best efficiency even under high request volumes, while maintaining strong accuracy 🧵