Carl Allen
@carl-allen.bsky.social
2.2K followers 440 following 44 posts
Laplace Junior Chair, Machine Learning ENS Paris. (prev ETH Zurich, Edinburgh, Oxford..) Working on mathematical foundations/probabilistic interpretability of ML (what NNs learn🤷‍♂️, disentanglement🤔, king-man+woman=queen?👌…)
Pinned
carl-allen.bsky.social
Machine learning has made incredible breakthroughs, but our theoretical understanding lags behind.

We take a step towards unravelling its mystery by explaining why the phenomenon of disentanglement arises in generative latent variable models.

Blog post: carl-allen.github.io/theory/2024/...
Reposted by Carl Allen
vcastin.bsky.social
How do tokens evolve as they are processed by a deep Transformer?

With José A. Carrillo, @gabrielpeyre.bsky.social and @pierreablin.bsky.social, we tackle this in our new preprint: A Unified Perspective on the Dynamics of Deep Transformers arxiv.org/abs/2501.18322

ML and PDE lovers, check it out!
carl-allen.bsky.social
Softmax is also the exact formula for the label distribution p(y|x) under Bayes rule if the class distributions p(x|y) have exponential family form (with equal variance if Gaussian), so it can have a deeper rationale in a probabilistic model of the data (rather than just being a relaxation of one-hot argmax).
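To spell that out (a sketch in my own notation, not from the post): if each class-conditional is exponential family, p(x|y=k) = h(x) exp(η_k^T T(x) − A(η_k)), with prior π_k, then Bayes rule gives exactly a softmax over classes:

```latex
p(y{=}k \mid x)
  = \frac{\pi_k\, p(x \mid y{=}k)}{\sum_j \pi_j\, p(x \mid y{=}j)}
  = \frac{\exp\!\big(\eta_k^\top T(x) - A(\eta_k) + \log \pi_k\big)}
         {\sum_j \exp\!\big(\eta_j^\top T(x) - A(\eta_j) + \log \pi_j\big)}
  = \mathrm{softmax}_k\big(z(x)\big),
\qquad z_k(x) = \eta_k^\top T(x) - A(\eta_k) + \log \pi_k .
```

For Gaussian class-conditionals with a shared covariance, the logits z_k(x) are linear in x, matching the usual linear-layer-plus-softmax classifier.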
carl-allen.bsky.social
Sorry, more a question re the OP. Just looking to understand the context.
carl-allen.bsky.social
Can you give some examples of the kind of papers you’re referring to?
carl-allen.bsky.social
And of course this all builds on the seminal work of @wellingmax.bsky.social, @dpkingma.bsky.social, Irina Higgins, Chris Burgess et al.
carl-allen.bsky.social
Any constructive feedback, discussion or future collaboration more than welcome!

Full paper: arxiv.org/pdf/2410.22559
arxiv.org
carl-allen.bsky.social
Building on this, we clarify the connection between diagonal covariance and Jacobian orthogonality and explain how disentanglement follows, ultimately defining disentanglement as factorising the data distribution into statistically independent components.
carl-allen.bsky.social
We focus on VAEs, used as building blocks of SOTA diffusion models. Recent works by Rolinek et al. and Kumar & @benmpoole.bsky.social suggest that disentanglement arises because diagonal posterior covariance matrices promote column-orthogonality in the decoder’s Jacobian matrix.
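Not from either paper, just a toy illustration of what “column-orthogonality of the decoder’s Jacobian” means; the decoder architecture and dimensions here are made up:

```python
# Toy check: how column-orthogonal is a decoder's Jacobian at a latent point z?
# Columns J[:, i] = d g(z) / d z_i ; column-orthogonality means J^T J is ~diagonal.
import torch

latent_dim, data_dim = 8, 64
decoder = torch.nn.Sequential(            # stand-in for a VAE decoder g: z -> x
    torch.nn.Linear(latent_dim, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, data_dim),
)

z = torch.randn(latent_dim)
J = torch.autograd.functional.jacobian(decoder, z)   # shape (data_dim, latent_dim)

gram = J.T @ J
off_diag = gram - torch.diag(torch.diag(gram))
print("off-diagonal mass of J^T J:", off_diag.abs().sum().item())
```

The cited works roughly say that training with diagonal posterior covariances pushes this off-diagonal mass down, so each latent coordinate moves the output along (locally) orthogonal directions.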
carl-allen.bsky.social
While disentanglement is often linked to different models whose popularity may ebb & flow, we show that the phenomenon itself relates to the data’s latent structure and is more fundamental than any model that may expose it.
carl-allen.bsky.social
Maybe give it time. Rome, a day, etc..
carl-allen.bsky.social
Yup sure, the curve has to kick in at some point. I guess “law” sounds cooler than linear-ish graph. Maybe it started out as an acronym “Linear for A While”.. 🤷‍♂️
carl-allen.bsky.social
I guess as complexity increases (math -> phys -> chem -> bio -> …), it’s inevitable that “theory-driven” tends towards “theory-inspired”. ML seems a bit tangential tho, since experimenting is relatively consequence-free and you don’t need to theorise deeply, more just iterate. So theory is deprioritised and lags, for now
carl-allen.bsky.social
But doesn’t theory follow empirics in all of science.. until it doesn’t? Except that in most sciences you can’t endlessly experiment for cost/risk/melting your face off reasons. But ML keeps going, making it a tricky moving/expanding target to try to explain/get ahead of.. I think it’ll happen tho.
carl-allen.bsky.social
The last KL is nice as it’s clear that the objective is optimised when the model and posteriors match as well as possible. The earlier KL is nice as it contains the data distribution and all explicitly modelled distributions, so maximising ELBO can be seen intuitively as bringing them all “in line”.
carl-allen.bsky.social
I think an intuitive view is that:
- max likelihood minimises
KL[p(x)||p’(x)] (p’(x)=model)

- max ELBO minimises
KL[p(x)q(z|x) || p’(x|z)p’(z)]
So brings together 2 models of the joint (where p’(x) = \int p’(x|z)p’(z) dz)

Can rearrange in diff ways, eg as
KL[p(x)q(z|x) || p’(x)p’(z|x)]
(or as in VAE)
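A quick derivation of why max ELBO minimises that joint KL (my notation; H(p) is the data entropy, a constant):

```latex
\mathbb{E}_{p(x)}\big[\mathrm{ELBO}(x)\big]
  = \mathbb{E}_{p(x)}\Big[\log p'(x) - \mathrm{KL}\big[q(z|x)\,\|\,p'(z|x)\big]\Big]
  = -\,H\big(p(x)\big) \;-\; \mathrm{KL}\big[\,p(x)\,q(z|x)\;\big\|\;p'(x)\,p'(z|x)\,\big],
```

and since p’(x)p’(z|x) = p’(x|z)p’(z) = p’(x,z), maximising the ELBO (in expectation over the data) minimises the KL between the two joints p(x)q(z|x) and p’(x,z), which is the rearrangement above.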
carl-allen.bsky.social
Ha me too, exactly that..
carl-allen.bsky.social
In the binary case, both look the same: sigmoid might be a good model of how y becomes more likely (in future) as x increases. But sigmoid is also 2-case softmax, so models Bayes rule for 2 classes of (exp-fam) x|y. The causality between x and y is very different in the two cases, which "p(y|x)" doesn't capture.
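Spelling out the binary case (my notation): with class priors π₀, π₁, Bayes rule gives

```latex
p(y{=}1 \mid x)
  = \frac{\pi_1\, p(x \mid y{=}1)}{\pi_1\, p(x \mid y{=}1) + \pi_0\, p(x \mid y{=}0)}
  = \frac{1}{1 + e^{-a(x)}}
  = \sigma\big(a(x)\big),
\qquad a(x) = \log \frac{\pi_1\, p(x \mid y{=}1)}{\pi_0\, p(x \mid y{=}0)},
```

i.e. a sigmoid of the log odds; for exponential-family x|y the log odds a(x) are linear in the sufficient statistics, so the same formula appears whether p(y|x) is a regression of y on x or Bayes rule over classes of x.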
carl-allen.bsky.social
I think this comes down to the model behind p(x,y). If features of x cause y, e.g. aspects of a website (x) -> clicks (y); age/health -> disease, then p(y|x) is a (regression) fn of x. But if x|y is a distribution over x for each class y (e.g. cat images), then p(y|x) is given by Bayes rule (squint at softmax).
carl-allen.bsky.social
Pls add me thanks!
carl-allen.bsky.social
If few-shot transfer is ur thing!
ibalazevic.bsky.social
We maintain strong zero-shot transfer of CLIP / SigLIP across model size and data scale, while achieving up to 4x few-shot sample efficiency and up to +16% performance gains!

Fun project with @confusezius.bsky.social, @zeynepakata.bsky.social, @dimadamen.bsky.social and
@olivierhenaff.bsky.social.
confusezius.bsky.social
🤔 Can you turn your vision-language model from a great zero-shot model into a great-at-any-shot generalist?

Turns out you can, and here is how: arxiv.org/abs/2411.15099

Really excited to share this work on multimodal pretraining as my first bluesky entry!

🧵 A short and hopefully informative thread:
carl-allen.bsky.social
Could you pls add me? Thanks!
carl-allen.bsky.social
Yep, could maybe work. The accepted-to-RR bar would need to be high to maintain value, but the “shininess” test could be deferred. Think there’s still a separate issue of “highly irresponsible” reviews that needs addressing either way (as at #CVPR2025). We can’t just whinge & do absolutely nothing!