Nathan Godey
@nthngdy.bsky.social
Looking to start a post-doc in early 2025! Working on LM representations and pretraining methods @InriaParis https://nathangodey.github.io
Reposted by Nathan Godey
inriaparisnlp.bsky.social
🏆🤩 We are excited to share the news that @nthngdy.bsky.social, supervised by @bensagot.bsky.social and Éric de la Clergerie, has received the 2025 ATALA Best PhD Dissertation Prize!

You can read his PhD thesis online here: hal.science/tel-04994414/
Nathan Godey receiving the 2025 ATALA best thesis prize at CORIA-TALN 2025.
nthngdy.bsky.social
PS: I am looking for an academic post-doc position on related topics (efficiency, sparsity, sequence compression, spectral analysis of LLMs, among others), feel free to reach out if you are interested :)
nthngdy.bsky.social
This work was the finishing touch to my PhD at @inriaparisnlp.bsky.social and was just accepted to the SLLM workshop at ICLR 2025 (sparsellm.org) 🎉
SLLM@ICLR 2025
Workshop Summary
sparsellm.org
nthngdy.bsky.social
Thanks a lot to all my amazing co-authors @alessiodevoto.bsky.social @sscardapane.bsky.social @yuzhaouoe.bsky.social @neuralnoise.com Eric de la Clergerie @bensagot.bsky.social

And a special thanks to @edoardo-ponti.bsky.social for the academic visit that made this work possible!
nthngdy.bsky.social
Our method is also competitive in the prompt-compression setup, especially on synthetic token-retrieval tasks such as needle-in-a-haystack or variable tracking, where it maintains reasonable error rates at compression ratios of up to 32x:
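For the curious, the prompt-compression setup boils down to a single scoring-and-pruning pass over the prefilled cache. A minimal sketch, assuming `u` is a precomputed unit Q-Filter direction for one head (all names here are illustrative, not the released implementation):

```python
import torch

def compress_prompt(keys: torch.Tensor, values: torch.Tensor,
                    u: torch.Tensor, ratio: int = 32):
    """Keep 1/ratio of the prefilled KV cache for one attention head.

    keys, values: (seq_len, head_dim); u: unit Q-Filter direction (head_dim,).
    """
    budget = max(1, keys.shape[0] // ratio)
    # Keys with a strong component along u are predicted to be ignored,
    # so a low projection makes a KV pair worth keeping.
    scores = -(keys @ u)
    idx = scores.topk(budget).indices.sort().values  # restore token order
    return keys[idx], values[idx]
```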
nthngdy.bsky.social
This Q-Filter direction is context-agnostic, which means it can be pre-computed once per attention head of a given model and reused for any input.

We release a collection of pre-computed Q-Filters for various models ranging from 1.5B to 405B parameters:
huggingface.co/collections/...
Q-Filters - a nthngdy Collection: Pre-computed Q-Filters for efficient KV cache compression.
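As a rough sketch of how a precomputed filter could be fetched from the Hub (the repo id and filename below are placeholders, not the actual paths in the collection):

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical repo id and filename, for illustration only.
path = hf_hub_download(repo_id="nthngdy/llama-q-filters", filename="q_filters.pt")
q_filters = torch.load(path)  # e.g. one unit direction per layer and head
```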
nthngdy.bsky.social
The projection of a key vector onto this direction correlates strongly with the average attention weight that key receives during generation, yielding a finer-grained KV-pair ranking than previous work (arxiv.org/abs/2406.11430):
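In code, this ranking is a single mat-vec per head. A minimal sketch, assuming `u` is the precomputed unit direction (its estimation is sketched after the next post) and a sign convention where a strong positive component along `u` predicts low attention:

```python
import torch

def qfilter_scores(keys: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Predicted attention ranking for each cached key (higher = keep)."""
    # Negated projection: keys aligned with u are predicted to be ignored.
    return -(keys @ u)
```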
nthngdy.bsky.social
Based on arxiv.org/pdf/2401.12143, we find that query and key vectors share a single biased direction that encodes a selection mechanism in self-attention: key vectors with a strong component along this direction are ignored by the model.
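A minimal sketch of how such a direction could be estimated from query vectors gathered on a small calibration set; the SVD and the sign convention here are assumptions made to match the description above, not necessarily the exact released procedure:

```python
import torch

def estimate_q_filter(queries: torch.Tensor) -> torch.Tensor:
    """Estimate the biased direction for one head.

    queries: (num_samples, head_dim) query vectors from a calibration run.
    """
    # The top right-singular vector captures the dominant shared direction.
    _, _, vh = torch.linalg.svd(queries.float(), full_matrices=False)
    u = vh[0]
    # Orient u so that keys with a strong positive component along it get
    # low attention logits, i.e. the mean query projection is negative.
    if (queries.float() @ u).mean() > 0:
        u = -u
    return u / u.norm()
```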
nthngdy.bsky.social
We substantially improve over comparable methods in the compress-as-you-generate scenario, reaching similar generation throughput while reducing the perplexity gap by up to 65% in the case of Llama-70B!
nthngdy.bsky.social
Q-Filters is very efficient, which allows streaming compression at virtually no latency cost, just like Streaming-LLM...

...but it is also much better at retaining relevant KV pairs than other fast alternatives (and it can even beat slower algorithms such as SnapKV)
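An illustrative sketch of the compress-as-you-generate loop, reusing the same projection-based scoring; since eviction is one mat-vec and a top-k per head, it can run at every decoding step with negligible overhead (budget and names are illustrative):

```python
import torch

def step_compress(keys: torch.Tensor, values: torch.Tensor,
                  u: torch.Tensor, budget: int = 1024):
    """Call after each decoding step to keep the per-head cache within budget."""
    if keys.shape[0] <= budget:
        return keys, values
    scores = -(keys @ u)                              # projection-based ranking
    idx = scores.topk(budget).indices.sort().values   # preserve token order
    return keys[idx], values[idx]
```

Unlike a fixed sink-plus-window rule, the kept set here adapts to which keys the filter predicts the model will actually attend to.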
nthngdy.bsky.social
🚀 New Paper Alert! 🚀

We introduce Q-Filters, a training-free method for efficient KV Cache compression!

It is compatible with FlashAttention and can compress the cache as it is generated, which is particularly useful for reasoning models ⚡

TLDR: we make Streaming-LLM smarter using the geometry of attention