Alexander Kolesnikov
@kolesnikov.ch
790 followers 71 following 20 posts
Reposted by Alexander Kolesnikov
andreaspsteiner.bsky.social
Looking for a small or medium-sized VLM? PaliGemma 2 spans a compute range of more than 150x!

Not sure yet if you want to invest the time 🪄finetuning🪄 on your data? Give it a try with our ready-to-use "mix" checkpoints:

🤗 huggingface.co/blog/paligem...
🎤 developers.googleblog.com/en/introduci...
kolesnikov.ch
Also check out this concurrent work, which proposes autoregressive ViT-powered normalizing flows (NFs) and is very similar in spirit to Jet and JetFormer: x.com/zhaisf/statu...
kolesnikov.ch
Final note: we see the Jet model as a powerful tool and a building block for advanced generative models, like JetFormer bsky.app/profile/mtsc..., and not as a standalone competitive generative model.
mtschannen.bsky.social
Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)?

We have been pondering this during summer and developed a new model: JetFormer 🌊🤖

arxiv.org/abs/2411.19722

A thread 👇

1/
kolesnikov.ch
Check out the paper for more juicy details: arxiv.org/abs/2412.15129.

My favorite mini-insight is how implicit half-precision matrix multiplications (with float32 accumulation) can 'eat' entropy and lead to an overly optimistic, flawed objective and evaluations.
Jet: A Modern Transformer-Based Normalizing Flow
In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute...
arxiv.org
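Here is a minimal numpy illustration of the flavor of that effect (my own toy, not the paper's setup): once a linear map's outputs are rounded to half precision, it is no longer exactly invertible, so the analytic log-det in the change-of-variables NLL credits the model with bits that the rounding actually destroyed.

# Toy illustration only, not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal((1024, d)).astype(np.float32)
w = (rng.standard_normal((d, d)) / np.sqrt(d)).astype(np.float32)

# Exact float32 forward pass, inverted in float32: near-perfect round trip.
x_rec32 = (x @ w) @ np.linalg.inv(w)

# Forward pass with inputs/outputs rounded to float16, inverted in float32:
# the round trip no longer closes, i.e. some entropy has been "eaten".
y16 = (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)
x_rec16 = y16 @ np.linalg.inv(w)

print("float32 round-trip error:", np.abs(x - x_rec32).max())
print("float16 round-trip error:", np.abs(x - x_rec16).max())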
kolesnikov.ch
When the model is trained on 'small' data, such as ImageNet-1k, it overfits.

Another contribution is a demonstration that transfer learning is effective in mitigating this overfitting. The recipe: pretrain on a large image database and then fine-tune on a small dataset, e.g., CIFAR-10.
kolesnikov.ch
We observe robust performance improvements with compute scaling, showing behavior similar to classical scaling laws.

These are the results of varying the Jet model size when training on ImageNet-21k images:
kolesnikov.ch
Our main contribution is a very straightforward design: Jet is just repeated affine coupling layers with ViT inside. We show that many standard components are not needed with our simple design:
❌ invertible dense layer
❌ ActNorm layer
❌ multiscale latents
❌ dequant. noise
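To make the core idea concrete, here is a minimal numpy sketch of a single affine coupling layer (my own stand-in, not the released code; in Jet the scale/shift predictor is a ViT over patch tokens, here it is a tiny MLP):

import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # toy dimensionality
w1 = rng.standard_normal((d // 2, 32)) * 0.1
w2 = rng.standard_normal((32, d)) * 0.1    # predicts [log_scale, shift]

def predictor(x_a):
    # Stand-in for the ViT: maps the untouched half to (log_scale, shift).
    h = np.tanh(x_a @ w1)
    out = h @ w2
    return out[..., : d // 2], out[..., d // 2 :]

def coupling_forward(x):
    x_a, x_b = x[..., : d // 2], x[..., d // 2 :]
    log_s, t = predictor(x_a)
    z_b = x_b * np.exp(log_s) + t          # affine transform of the other half
    logdet = log_s.sum(axis=-1)            # exact log-determinant, for the NLL
    return np.concatenate([x_a, z_b], axis=-1), logdet

def coupling_inverse(z):
    z_a, z_b = z[..., : d // 2], z[..., d // 2 :]
    log_s, t = predictor(z_a)
    x_b = (z_b - t) * np.exp(-log_s)       # exact inverse, no optimization needed
    return np.concatenate([z_a, x_b], axis=-1)

x = rng.standard_normal((4, d))
z, logdet = coupling_forward(x)
print(np.allclose(coupling_inverse(z), x))  # True: invertible by construction

The full flow stacks many such layers, changing which subset of dimensions passes through untouched from layer to layer.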
kolesnikov.ch
With some delay, JetFormer's *prequel* paper is finally out on arXiv: a radically simple ViT-based normalizing flow (NF) model that achieves SOTA results in its class.

Jet is one of the key components of JetFormer and deserves a standalone report. Let's unpack: 🧵⬇️
kolesnikov.ch
PaliGemma 2 is out! Bigger models, better results. For the best experience, do not forget to finetune.

Congrats, PaliGemma 2 team!
andreaspsteiner.bsky.social
🚀🚀PaliGemma 2 is our updated and improved PaliGemma release using the Gemma 2 models and providing new pre-trained checkpoints for the full cross product of {224px,448px,896px} resolutions and {3B,10B,28B} model sizes.

1/7
kolesnikov.ch
Ok, it is yesterday's news already, but a good night's sleep is important.

After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @xzhai.bsky.social and @giffmana.ai, we will establish OpenAI Zurich office. Proud of our past work and looking forward to the future.
Reposted by Alexander Kolesnikov
sedielem.bsky.social
In arxiv.org/abs/2303.00848, @dpkingma.bsky.social and @ruiqigao.bsky.social had suggested that noise augmentation could be used to make other likelihood-based models optimise perceptually weighted losses, like diffusion models do. So cool to see this working well in practice!
kolesnikov.ch
The answer has just dropped: bsky.app/profile/kole...
jbhuang0604.bsky.social
2021: Replace every CNN with a Transformer

2022: Replace every GAN with diffusion models

2023: Replace every NeRF with 3DGS

2024: Replace every diffusion model with Flow Matching

2025: ???
kolesnikov.ch
JetFormer is the product of endless and heated (but friendly) arguments and discussions with @mtschannen.bsky.social
and @asusanopinto.bsky.social.

Very excited about this model due to its potential to unify multimodal learning with a simple and universal end-to-end approach.
kolesnikov.ch
We evaluate JetFormer's potential to model large-scale multimodal image+text data and to perform image-to-text, text-to-image, and VQA tasks, with rather encouraging results.
kolesnikov.ch
We also present a novel data augmentation: the "noise curriculum". It helps a pure NLL model to focus on high-level image details.

Even though it is inspired by diffusion, it is very different: it only affects training and does not require iterative denoising during inference.
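A hedged sketch of how I read the idea (the schedule and constants below are my own placeholders, not the paper's recipe): Gaussian noise is added to the training images, with its strength annealed towards zero over training; inference is completely unchanged.

import numpy as np

def noise_level(step, total_steps, sigma_max=64.0):
    # Placeholder cosine curriculum: sigma decays from sigma_max to 0.
    progress = min(step / total_steps, 1.0)
    return sigma_max * 0.5 * (1.0 + np.cos(np.pi * progress))

def augment(images, step, total_steps, rng):
    # Training-time only: the NLL sees noisy pixels early on and
    # (nearly) clean pixels towards the end of training.
    sigma = noise_level(step, total_steps)
    return images + sigma * rng.standard_normal(images.shape)

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(2, 32, 32, 3))  # toy stand-in for a batch
print([round(noise_level(s, 1000), 1) for s in (0, 500, 1000)])  # 64.0, 32.0, 0.0
print(augment(images, step=0, total_steps=1000, rng=rng).shape)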
kolesnikov.ch
JetFormer is just an autoregressive transformer, trained end-to-end in one go, with no pretrained image encoders/quantizers.

There is a small twist though. An image input is re-encoded with a normalizing flow model, which is trained jointly with the main transformer model.
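A deliberately tiny toy of how the two pieces fit together in the objective (every component below is my simplistic stand-in, not JetFormer's actual flow or transformer): the flow's log-det enters the likelihood via the change of variables, so minimizing the autoregressive NLL on the latents trains the flow and the transformer jointly.

import numpy as np

rng = np.random.default_rng(0)
d = 6
scale = np.exp(0.1 * rng.standard_normal(d))  # toy "flow": elementwise affine map
shift = rng.standard_normal(d)

def flow(x):
    z = x * scale + shift               # invertible re-encoding of the pixels
    logdet = np.log(scale).sum()        # change-of-variables correction
    return z, logdet

def ar_nll(z):
    # Stand-in for the autoregressive transformer: each dimension of z is
    # predicted from the previous one with a unit-variance Gaussian.
    mean = np.concatenate([np.zeros((z.shape[0], 1)), z[:, :-1]], axis=1)
    return 0.5 * ((z - mean) ** 2 + np.log(2 * np.pi)).sum(axis=1)

x = rng.standard_normal((4, d))         # toy stand-in for image pixels
z, logdet = flow(x)
nll_pixels = ar_nll(z) - logdet         # log p(x) = log p(z) + log|det|
print(nll_pixels)                       # one scalar objective trains both parts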
kolesnikov.ch
I always dreamed of a model that simultaneously

1. optimizes NLL of raw pixel data,
2. generates competitive high-res. natural images,
3. is practical.

But it seemed too good to be true. Until today!

Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes.

🧵