Alaa El-Nouby
@alaaelnouby.bsky.social
350 followers 60 following 8 posts
Research Scientist at @Apple. Previous: @Meta (FAIR), @Inria, @MSFTResearch, @VectorInst and @UofG . Egyptian 🇪🇬
Reposted by Alaa El-Nouby
kindsuss.bsky.social
Check out our Apple research work on scaling laws for native multimodal models! Combined with mixtures of experts, native models develop both specialized and multimodal representations! Lots of rich findings and opportunities for follow-up research!
cscv-bot.bsky.social
Shukor, Fini, da Costa, Cord, Susskind, El-Nouby: Scaling Laws for Native Multimodal Models https://arxiv.org/abs/2504.07951 https://arxiv.org/pdf/2504.07951 https://arxiv.org/html/2504.07951
Reposted by Alaa El-Nouby
samiraabnar.bsky.social
🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute?

We explored this through the lens of MoEs:
alaaelnouby.bsky.social
Could you clarify for what task you tested the checkpoints and which checkpoint in particular you used? Thanks!
alaaelnouby.bsky.social
Hey Johan, for AIMv2 please use the last-layer features, typically after the post-trunk layer normalization.
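A minimal sketch of what that looks like with Hugging Face transformers; the checkpoint id and the trust_remote_code requirement are assumptions, so adjust to whichever AIMv2 checkpoint you actually use:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "apple/aimv2-large-patch14-224"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state is assumed to hold the last-layer patch features
# (after the post-trunk layer norm); pool them for downstream use.
patch_features = outputs.last_hidden_state  # (1, num_patches, dim)
pooled = patch_features.mean(dim=1)
print(pooled.shape)
```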
Reposted by Alaa El-Nouby
merve.bsky.social
Apple released AIMv2 🍏 a family of state-of-the-art open-set vision encoders
> like CLIP, but add a decoder and train on autoregression 🤯
> 19 open models come in 300M, 600M, 1.2B, 2.7B with resolutions of 224, 336, 448
> Loadable and usable with 🤗 transformers huggingface.co/collections/...
Reposted by Alaa El-Nouby
abursuc.bsky.social
The return of the Autoregressive Image Model: AIMv2 now going multimodal.
Excellent work by @alaaelnouby.bsky.social & team with code and checkpoints already up:

arxiv.org/abs/2411.14402
Reposted by Alaa El-Nouby
ducha-aiki.bsky.social
Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini et 15 al

tl;dr: in title. Scaling laws and ablations.
they claim to be better than SigLIP and DINOv2 for semantic tasks. I would be interested in monodepth performance though.

arxiv.org/abs/2411.14402
alaaelnouby.bsky.social
It has been an absolute pleasure working with Enrico, Mustafa, and the whole AIMv2 team over the past few months. We look forward to seeing our models prove useful to the community.

For many more results, insights and analysis please check our preprint. arxiv.org/abs/2411.14402
Multimodal Autoregressive Pre-training of Large Vision Encoders
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal s...
arxiv.org
alaaelnouby.bsky.social
The open-sourced AIMv2 checkpoints support a number of fixed resolutions (224px, 336px, and 448px), in addition to a native-resolution checkpoint that accepts images of variable resolutions and aspect ratios.
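A hedged sketch of feeding a non-square image to the native-resolution variant; the checkpoint id below is an assumption inferred from the naming of the fixed-resolution checkpoints:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "apple/aimv2-large-patch14-native"  # assumed id for the native variant
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)
model.eval()

# A 640x480 image: the native checkpoint is expected to process it close to
# its own resolution and aspect ratio rather than a fixed square crop.
image = Image.new("RGB", (640, 480))
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state

# The number of patch tokens now depends on the input resolution.
print(features.shape)
```

With the native checkpoint the number of patch tokens varies with the input size, so downstream code should not assume a fixed sequence length.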
alaaelnouby.bsky.social
AIMv2 provides strong off-the-shelf recognition performance, with AIMv2-3B achieving 89.5% on ImageNet with a frozen trunk. We also observe consistent improvements in performance as the AIMv2 parameter count scales (see Section 3 in the preprint).
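For illustration, a simplified frozen-trunk probe in PyTorch; the paper's evaluation protocol may differ (e.g. attentive probing), and the encoder's output convention here is an assumption:

```python
import torch
import torch.nn as nn

class FrozenTrunkProbe(nn.Module):
    """Linear classifier on pooled features from a frozen vision encoder."""

    def __init__(self, encoder, feature_dim, num_classes=1000):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # trunk stays frozen
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, pixel_values):
        with torch.no_grad():
            feats = self.encoder(pixel_values=pixel_values).last_hidden_state
        return self.head(feats.mean(dim=1))  # average-pool patch tokens
```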
alaaelnouby.bsky.social
AIMv2 is pre-trained in a manner similar to modern VLMs; therefore, it can be integrated seamlessly as a vision encoder for multimodal LLMs, with even our smallest backbone (i.e., AIMv2-L) outperforming popular backbones such as OpenAI CLIP and SigLIP on multimodal understanding benchmarks.
alaaelnouby.bsky.social
AIMv2 is pre-trained to autoregressively generate image patches and text tokens. It is easy to implement and train, and it scales trivially to billions of parameters. We are sharing checkpoints ranging from 300M to 3B params, available in PyTorch, JAX, and MLX on 🤗
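A conceptual sketch of that objective, not the released training code: a single causal sequence of image patches followed by text tokens, with patch regression and next-token cross-entropy (function name, tensor shapes, and loss weights are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def multimodal_autoregressive_loss(patch_preds, patch_targets,
                                   text_logits, text_targets,
                                   patch_weight=1.0, text_weight=1.0):
    """patch_preds/patch_targets: (B, N_patches, patch_dim) next-patch pairs.
    text_logits: (B, N_text, vocab); text_targets: (B, N_text) next tokens."""
    # Regression loss for next-patch prediction on the image part.
    patch_loss = F.mse_loss(patch_preds, patch_targets)
    # Standard next-token cross-entropy on the text part of the sequence.
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    return patch_weight * patch_loss + text_weight * text_loss
```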
alaaelnouby.bsky.social
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...