Nathan Godey
@nthngdy.bsky.social
Looking to start a post-doc in early 2025! Working on LM representations and pretraining methods @InriaParis https://nathangodey.github.io
Reposted by Nathan Godey
inriaparisnlp.bsky.social
🏆🤩 We are excited to share the news that @nthngdy.bsky.social, supervised by @bensagot.bsky.social and Éric de la Clergerie, has received the 2025 ATALA Best PhD Dissertation Prize!

You can read his PhD thesis online here: hal.science/tel-04994414/
Nathan Godey receiving the 2025 ATALA best thesis prize at CORIA-TALN 2025.
nthngdy.bsky.social
PS: I am looking for an academic post-doc position on related topics (efficiency, sparsity, sequence compression, spectral analysis of LLMs, among others), feel free to reach out if you are interested :)
nthngdy.bsky.social
This work was the finishing touch to my PhD at @inriaparisnlp.bsky.social and was just accepted to the SLLM workshop at ICLR 2025 (sparsellm.org) 🎉
SLLM@ICLR 2025
Workshop Summary
sparsellm.org
nthngdy.bsky.social
Thanks a lot to all my amazing co-authors @alessiodevoto.bsky.social @sscardapane.bsky.social @yuzhaouoe.bsky.social @neuralnoise.com Eric de la Clergerie @bensagot.bsky.social

And a special thanks to @edoardo-ponti.bsky.social for the academic visit that made this work possible!
nthngdy.bsky.social
Our method is also competitive in the prompt-compression setup, especially on synthetic token-retrieval tasks such as needle-in-a-haystack or variable tracking, where it maintains reasonable error rates at compression ratios of up to 32x:
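For the curious, the prompt-compression setup boils down to a single scoring-and-pruning pass over the prefilled cache. A minimal sketch, assuming `u` is a precomputed unit Q-Filter direction for one head (all names here are illustrative, not the released implementation):

```python
import torch

def compress_prompt(keys: torch.Tensor, values: torch.Tensor,
                    u: torch.Tensor, ratio: int = 32):
    """Keep 1/ratio of the prefilled KV cache for one attention head.

    keys, values: (seq_len, head_dim); u: unit Q-Filter direction (head_dim,).
    """
    budget = max(1, keys.shape[0] // ratio)
    # Keys with a strong component along u are predicted to be ignored,
    # so a low projection makes a KV pair worth keeping.
    scores = -(keys @ u)
    idx = scores.topk(budget).indices.sort().values  # restore token order
    return keys[idx], values[idx]
```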
nthngdy.bsky.social
This Q-Filter direction is context-agnostic, which means it can be pre-computed once per attention head of a given model and reused for any input.

We release a collection of pre-computed Q-Filters for various models ranging from 1.5B to 405B parameters:
huggingface.co/collections/...
Q-Filters - a nthngdy Collection: Pre-computed Q-Filters for efficient KV cache compression.
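As a rough sketch of how a precomputed filter could be fetched from the Hub (the repo id and filename below are placeholders, not the actual paths in the collection):

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical repo id and filename, for illustration only.
path = hf_hub_download(repo_id="nthngdy/llama-q-filters", filename="q_filters.pt")
q_filters = torch.load(path)  # e.g. one unit direction per layer and head
```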
nthngdy.bsky.social
The projection of a key vector onto this direction correlates strongly with the average attention weight that key receives during generation, yielding a finer-grained KV-pair ranking than previous work (arxiv.org/abs/2406.11430):
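In code, this ranking is a single mat-vec per head. A minimal sketch, assuming `u` is the precomputed unit direction (its estimation is sketched after the next post) and a sign convention where a strong positive component along `u` predicts low attention:

```python
import torch

def qfilter_scores(keys: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Predicted attention ranking for each cached key (higher = keep)."""
    # Negated projection: keys aligned with u are predicted to be ignored.
    return -(keys @ u)
```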
nthngdy.bsky.social
Based on arxiv.org/pdf/2401.12143, we find that query and key vectors share a single biased direction that encodes a selection mechanism in self-attention: key vectors with a strong component along this direction are ignored by the model.
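A minimal sketch of how such a direction could be estimated from query vectors gathered on a small calibration set; the SVD and the sign convention here are assumptions made to match the description above, not necessarily the exact released procedure:

```python
import torch

def estimate_q_filter(queries: torch.Tensor) -> torch.Tensor:
    """Estimate the biased direction for one head.

    queries: (num_samples, head_dim) query vectors from a calibration run.
    """
    # The top right-singular vector captures the dominant shared direction.
    _, _, vh = torch.linalg.svd(queries.float(), full_matrices=False)
    u = vh[0]
    # Orient u so that keys with a strong positive component along it get
    # low attention logits, i.e. the mean query projection is negative.
    if (queries.float() @ u).mean() > 0:
        u = -u
    return u / u.norm()
```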
nthngdy.bsky.social
We substantially improve over comparable methods in the compress-as-you-generate scenario, reaching similar generation throughput while reducing the perplexity gap by up to 65% in the case of Llama-70B!
nthngdy.bsky.social
Q-Filters is very efficient, which allows streaming compression at virtually no latency cost, just like Streaming-LLM...

...but it is also much better at retaining relevant KV pairs than other fast alternatives (and it can even beat slower algorithms such as SnapKV)
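An illustrative sketch of the compress-as-you-generate loop, reusing the same projection-based scoring; since eviction is one mat-vec and a top-k per head, it can run at every decoding step with negligible overhead (budget and names are illustrative):

```python
import torch

def step_compress(keys: torch.Tensor, values: torch.Tensor,
                  u: torch.Tensor, budget: int = 1024):
    """Call after each decoding step to keep the per-head cache within budget."""
    if keys.shape[0] <= budget:
        return keys, values
    scores = -(keys @ u)                              # projection-based ranking
    idx = scores.topk(budget).indices.sort().values   # preserve token order
    return keys[idx], values[idx]
```

Unlike a fixed sink-plus-window rule, the kept set here adapts to which keys the filter predicts the model will actually attend to.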
nthngdy.bsky.social
🚀 New Paper Alert! 🚀

We introduce Q-Filters, a training-free method for efficient KV Cache compression!

It is compatible with FlashAttention and can compress the cache as it is generated, which is particularly useful for reasoning models ⚡

TLDR: we make Streaming-LLM smarter using the geometry of attention