François Fleuret
@francois.fleuret.org
5.5K followers 230 following 410 posts
Research Scientist Meta/FAIR, Prof. University of Geneva, co-founder Neural Concept SA. I like reality. https://fleuret.org
Pinned
francois.fleuret.org
My deep learning course at the University of Geneva is available on-line. 1000+ slides, ~20h of screen-casts. Full of examples in PyTorch.

fleuret.org/dlc/

And my "Little Book of Deep Learning" is available as a phone-formatted pdf (nearing 700k downloads!)

fleuret.org/lbdl/
francois.fleuret.org
The vocabulary corresponding to the logits
francois.fleuret.org
- Ring Attention: takes advantage of multi-node hardware to scale the computation according to the sequence length

- Speculative decoding: a cheaper model generates tokens, and a rejection process corrects this generation to match the full-model distribution.
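Here is a minimal sketch of the rejection step at the core of speculative decoding, assuming p and q are the full-model and draft-model next-token distributions (the function name is illustrative, not any particular library's API):

```python
import torch

def accept_or_resample(p, q, x):
    # p, q: full-model and draft-model next-token distributions, shape (vocab,)
    # x: token index proposed by the draft model
    # Accept x with probability min(1, p[x] / q[x]); this keeps the final
    # samples distributed exactly according to the full-model distribution p.
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x
    # Otherwise resample from the residual distribution max(p - q, 0), renormalized
    residual = torch.clamp(p - q, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```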
francois.fleuret.org
- Multi-token prediction: sums the training loss over multiple future tokens, possibly with additional readout heads.

- FlashAttention: computes the attention on the fly, avoiding a memory footprint O(T^2) (+ optimizes very carefully for the GPU!)
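A minimal sketch of multi-token prediction with extra readout heads, assuming a backbone that produces per-position hidden states (module and argument names are illustrative):

```python
import torch, torch.nn as nn, torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    def __init__(self, dim, vocab_size, nb_future=4):
        super().__init__()
        # One readout head per future offset t+1, ..., t+nb_future
        self.heads = nn.ModuleList([nn.Linear(dim, vocab_size) for _ in range(nb_future)])

    def forward(self, hidden, tokens):
        # hidden: (B, T, dim) backbone states, tokens: (B, T) input token ids
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # predict the token at position t + k
            targets = tokens[:, k:]
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return loss
```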
francois.fleuret.org
- Warmup: very short ramping-up of the learning rate, starting from 0

- Cosine schedule: the learning rate varies less at the beginning and end of the schedule

- AdamW: decouples the weight decay from Adam's gradient-based update
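A minimal sketch combining warmup, a cosine schedule, and AdamW, with placeholder step counts and hyper-parameters:

```python
import math, torch

model = torch.nn.Linear(16, 16)   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

nb_warmup, nb_total = 1_000, 100_000   # illustrative step counts

def lr_factor(step):
    if step < nb_warmup:
        # very short linear ramp-up of the learning rate, starting from 0
        return step / nb_warmup
    # cosine decay over the remaining steps, nearly flat at both ends
    progress = (step - nb_warmup) / (nb_total - nb_warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```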
francois.fleuret.org
- RoPE (Rotary Positional Embedding): makes the attention depend only on the relative Q/K positions

- MoE (Mixture of Experts): The FFN block is implemented with multiple MLPs and a gating mechanism selects which ones process each token.
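A minimal sketch of RoPE applied to the queries or keys, assuming a (batch, heads, seq, head_dim) layout with an even head dimension:

```python
import torch

def rope(x, base=10000.0):
    # x: (B, H, T, D) queries or keys, D even
    B, H, T, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, device=x.device) / D)   # (D/2,)
    angles = torch.arange(T, device=x.device)[:, None] * inv_freq      # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) pair by a position-dependent angle, so that the
    # dot product between rotated Q and K depends only on their relative position
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```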
francois.fleuret.org
- RMSNorm instead of LayerNorm: normalizes only the scaling (no mean centering)

- MLA (Multi-head Latent Attention): stores a low-rank projection of the attention block input and computes the KV from it

- SwiGLU: non-linearity for the FFN block with per-component gating
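Minimal sketches of RMSNorm and a SwiGLU FFN block, with illustrative dimensions and layer names:

```python
import torch, torch.nn as nn, torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Rescale only, no mean subtraction as in LayerNorm
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Per-component gating: silu(gate) * up, then project back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```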
francois.fleuret.org
- Prenorm: normalization applied inside the residual blocks, before the attention operation and before the FFN

- GQA (Grouped-Query Attention): more Q heads than (K, V) heads, each (K, V) head being shared by a group of query heads
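A minimal sketch of a pre-norm residual block, and of GQA implemented by repeating each K/V head across its group of query heads (shapes and names are illustrative):

```python
import torch, torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim, attn, ffn):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        # Normalization *before* attention and FFN, inside the residual connections
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))

def gqa(q, k, v, nb_groups):
    # q: (B, Hq, T, D), k/v: (B, Hkv, T, D) with Hq = nb_groups * Hkv,
    # i.e. each (K, V) head is shared by a group of nb_groups query heads
    k = k.repeat_interleave(nb_groups, dim=1)
    v = v.repeat_interleave(nb_groups, dim=1)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```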
francois.fleuret.org
I asked "on the other platform" what were the most important improvements to the original 2017 transformer.

That was quite popular and here is a synthesis of the responses:
francois.fleuret.org
"You are in Paris, enjoy the city, stop obsessing with AI"

Paris:
francois.fleuret.org
Yes, it's awesome. The kind of work that opens up a whole new and important field.
francois.fleuret.org
If your task is not resolution-agnostic, do not use a normalized positional encoding.

All this being said, using both normalized and non-normalized cannot hurt, methinks.
francois.fleuret.org
You cannot be better off without p-e.
francois.fleuret.org
Why not a normalized positional encoding?
francois.fleuret.org
After a long lecture, I recommend a coffee, a pain au chocolat, and leave-me-the-fuck-alone time.
francois.fleuret.org
Maybe the wall was the friends we made during that journey, Ted.
tedunderwood.com
The year is 2435. Human beings — now sentient spheres of glowing gas — finally understand why matter exists. Our knowledge of the world is complete, and can go no farther!

On the telepresence screen, a simulation of a 21c internet pundit pops up: "Told you deep learning was hitting a wall!"
francois.fleuret.org
I asked this because, even though I am interested in the topic, I have so far not come across any "foundational" theory regarding the future of society with AI.

Someone linked this paper which is exactly the sort of thing I was looking for:

arxiv.org/abs/2502.12102
Relational Norms for Human-AI Cooperation
How we should design and interact with social artificial intelligence depends on the socio-relational role the AI is meant to emulate or occupy. In human society, relationships such as teacher-student...
arxiv.org
Reposted by François Fleuret
noctrog.bsky.social
What is the true depth of an LLM?

Together with @danielepal.bsky.social, @matpagliardini.bsky.social, M. Jaggi and @francois.fleuret.org we show that LLMs have a smaller effective depth than their nominal one, which can be exploited to increase inference speed in multi-GPU settings!

arxiv.org/abs/2502.02790
(1/N)
francois.fleuret.org
We can't complain, can we?
francois.fleuret.org
To do so, you concatenate all the sequences to make a batch containing a single long sequence, and carve the attention matrix into a block-diagonal one (possibly with a causal structure in each block) so that sequences cannot look at each other.

Magic!

3/3
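A minimal sketch of this carving with flex attention, assuming a GPU and illustrative sequence lengths:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

device = "cuda"   # flex attention generates a fused GPU kernel on the fly

# Illustrative lengths of the sequences concatenated into one "batch of a single sequence"
lengths = torch.tensor([5, 3, 8], device=device)
doc_id = torch.repeat_interleave(torch.arange(len(lengths), device=device), lengths)
T = doc_id.numel()

def doc_causal(b, h, q_idx, kv_idx):
    # Block-diagonal mask: tokens attend only within their own sequence,
    # with a causal structure inside each block
    return (doc_id[q_idx] == doc_id[kv_idx]) & (q_idx >= kv_idx)

block_mask = create_block_mask(doc_causal, B=None, H=None, Q_LEN=T, KV_LEN=T, device=device)

q = k = v = torch.randn(1, 4, T, 64, device=device)   # (batch=1, heads, T, head_dim)
# In practice, wrap flex_attention with torch.compile to get the optimized kernel
y = flex_attention(q, k, v, block_mask=block_mask)
```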
francois.fleuret.org
It does this by generating an optimized CUDA kernel on the fly.

So it's cool for causal masks, but it also allows an amazing trick to deal with batches of sequences of various lengths *without padding*!

2/3
francois.fleuret.org
It is hard to overstate how cool and powerful flex attention is. @chhillee.bsky.social

pytorch.org/blog/flexatten…

TL;DR: it is an implementation of the attention operator in PyTorch that, in particular, makes it possible to efficiently "carve" the attention matrix.

1/3
francois.fleuret.org
I have to admit I am more on the other platform.