Alexander Kolesnikov
kolesnikov.ch
When Jet is trained on 'small' data, such as ImageNet-1k, it overfits.

Another contribution is a demonstration that transfer learning effectively mitigates this overfitting. The recipe: pretrain on a large image database, then fine-tune on a small dataset, e.g., CIFAR-10.
December 20, 2024 at 2:39 PM
We observe robust performance improvements with compute scaling, showing behavior similar to classical scaling laws.

These are the results of varying the Jet model size when training on ImageNet-21k images:
December 20, 2024 at 2:39 PM
Our main contribution is a very straightforward design: Jet is just repeated affine coupling layers with a ViT inside. We show that many standard components are not needed with our simple design:
❌ invertible dense layer
❌ ActNorm layer
❌ multiscale latents
❌ dequant. noise
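To make the affine coupling idea concrete, here is a minimal NumPy sketch of one coupling step. The `predict_scale_shift` callable is a hypothetical stand-in for the ViT block; the split-in-half scheme and the log-determinant bookkeeping are the standard coupling-layer construction, not the paper's exact code:

```python
import numpy as np

def affine_coupling(x, predict_scale_shift, invert=False):
    """One affine coupling step: split features in half, transform one
    half conditioned on the other. predict_scale_shift is a placeholder
    for the ViT that predicts (scale, shift) from the untouched half."""
    x1, x2 = np.split(x, 2, axis=-1)
    scale, shift = predict_scale_shift(x1)
    if not invert:
        y2 = x2 * np.exp(scale) + shift   # forward transform
        logdet = scale.sum()              # log|det Jacobian|
    else:
        y2 = (x2 - shift) * np.exp(-scale)  # exact inverse
        logdet = -scale.sum()
    return np.concatenate([x1, y2], axis=-1), logdet
```

Because only half of the features are transformed at each step, the inverse and the Jacobian log-determinant are available in closed form; stacking many such layers (with the halves swapped between layers) is all the model needs.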
December 20, 2024 at 2:39 PM
With some delay, JetFormer's *prequel* paper is finally out on arXiv: a radically simple ViT-based normalizing flow (NF) model that achieves SOTA results in its class.

Jet is one of the key components of JetFormer, deserving a standalone report. Let's unpack: 🧵⬇️
December 20, 2024 at 2:39 PM
We evaluate JetFormer's potential to model large-scale multimodal image+text data on image-to-text, text-to-image, and VQA tasks, and get rather encouraging results.
December 2, 2024 at 5:19 PM
We also present a novel data augmentation: the "noise curriculum". It helps a pure NLL model focus on high-level image structure.

Even though it is inspired by diffusion, it is very different: it only affects training and does not require iterative denoising during inference.
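A minimal sketch of the idea, with the caveat that the linear schedule and `sigma_max` here are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def noise_curriculum(image, step, total_steps, sigma_max=1.0, rng=None):
    """Noise-curriculum augmentation (illustrative sketch): add Gaussian
    noise whose scale decays over training, so early on the model sees
    mostly coarse structure and fine detail is learned late. Training-only:
    inference uses the clean model with no iterative denoising."""
    rng = rng or np.random.default_rng(0)
    sigma = sigma_max * (1.0 - step / total_steps)  # decays to 0
    return image + sigma * rng.standard_normal(image.shape)
```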
December 2, 2024 at 5:19 PM
JetFormer is just an autoregressive transformer, trained end-to-end in one go, with no pretrained image encoders/quantizers.

There is a small twist though. An image input is re-encoded with a normalizing flow model, which is trained jointly with the main transformer model.
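Because the re-encoding is invertible, the whole pipeline can still be trained as a single likelihood on raw pixels via the change-of-variables formula. A sketch of the joint objective (the function names are placeholders, not the paper's API):

```python
import numpy as np

def jetformer_nll(image, flow, transformer_nll):
    """Joint NLL under change of variables (illustrative sketch):
    the flow re-encodes the image into latents z, the autoregressive
    transformer scores z, and subtracting log|det J| keeps the
    objective a valid likelihood on raw pixels. Both parts are
    trained end-to-end on this single loss."""
    z, logdet = flow(image)          # invertible re-encoding
    return transformer_nll(z) - logdet
```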
December 2, 2024 at 5:19 PM
I always dreamed of a model that simultaneously

1. optimizes NLL of raw pixel data,
2. generates competitive high-res. natural images,
3. is practical.

But it seemed too good to be true. Until today!

Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes.

🧵
December 2, 2024 at 5:19 PM