Alaa El-Nouby
@alaaelnouby.bsky.social
350 followers 60 following 8 posts
Research Scientist at @Apple. Previous: @Meta (FAIR), @Inria, @MSFTResearch, @VectorInst and @UofG . Egyptian 🇪🇬
Reposted by Alaa El-Nouby
kindsuss.bsky.social
Check out our Apple research work on scaling laws for native multimodal models! Combined with mixtures of experts, native models develop both specialized and multimodal representations! Lots of rich findings and opportunities for follow-up research!
cscv-bot.bsky.social
Shukor, Fini, da Costa, Cord, Susskind, El-Nouby: Scaling Laws for Native Multimodal Models https://arxiv.org/abs/2504.07951 https://arxiv.org/pdf/2504.07951 https://arxiv.org/html/2504.07951
Reposted by Alaa El-Nouby
samiraabnar.bsky.social
🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute?

We explored this through the lens of MoEs:
alaaelnouby.bsky.social
Could you clarify for what task you tested the checkpoints and which checkpoint in particular you used? Thanks!
alaaelnouby.bsky.social
Hey Johan, for AIMv2 please use the last-layer features, typically after the post-trunk layer normalization.
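A minimal sketch of what that looks like with Hugging Face transformers; the checkpoint id and the trust_remote_code requirement are assumptions, so adjust to whichever AIMv2 checkpoint you actually use:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "apple/aimv2-large-patch14-224"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state is assumed to hold the last-layer patch features
# (after the post-trunk layer norm); pool them for downstream use.
patch_features = outputs.last_hidden_state  # (1, num_patches, dim)
pooled = patch_features.mean(dim=1)
print(pooled.shape)
```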
Reposted by Alaa El-Nouby
merve.bsky.social
Apple released AIMv2 🍏 a family of state-of-the-art open-set vision encoders
> like CLIP, but add a decoder and train on autoregression 🤯
> 19 open models come in 300M, 600M, 1.2B, 2.7B with resolutions of 224, 336, 448
> Loadable and usable with 🤗 transformers huggingface.co/collections/...
Reposted by Alaa El-Nouby
abursuc.bsky.social
The return of the Autoregressive Image Model: AIMv2 now going multimodal.
Excellent work by @alaaelnouby.bsky.social & team with code and checkpoints already up:

arxiv.org/abs/2411.14402
Reposted by Alaa El-Nouby
ducha-aiki.bsky.social
Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini et 15 al

tl;dr: in title. Scaling laws and ablations.
they claim to be better than SigLIP and DINOv2 for semantic tasks. I would be interested in monodepth performance though.

arxiv.org/abs/2411.14402
alaaelnouby.bsky.social
It has been an absolute pleasure working with Enrico, Mustafa, and the whole AIMv2 team over the past few months. We look forward to seeing our models prove useful to the community.

For many more results, insights and analysis please check our preprint. arxiv.org/abs/2411.14402
Multimodal Autoregressive Pre-training of Large Vision Encoders
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal s...
arxiv.org
alaaelnouby.bsky.social
The open-sourced AIMv2 checkpoints support a number of fixed resolutions (224px, 336px, and 448px), in addition to a native-resolution checkpoint that accepts images of variable resolutions and aspect ratios.
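A hedged sketch of feeding a non-square image to the native-resolution variant; the checkpoint id below is an assumption inferred from the naming of the fixed-resolution checkpoints:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "apple/aimv2-large-patch14-native"  # assumed id for the native variant
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)
model.eval()

# A 640x480 image: the native checkpoint is expected to process it close to
# its own resolution and aspect ratio rather than a fixed square crop.
image = Image.new("RGB", (640, 480))
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state

# The number of patch tokens now depends on the input resolution.
print(features.shape)
```

With the native checkpoint the number of patch tokens varies with the input size, so downstream code should not assume a fixed sequence length.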
alaaelnouby.bsky.social
AIMv2 provides strong off-the-shelf recognition performance, with AIMv2-3B achieving 89.5% on ImageNet with a frozen trunk. We also observe consistent improvements in performance as the AIMv2 parameter count scales (see Section 3 in the preprint).
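For illustration, a simplified frozen-trunk probe in PyTorch; the paper's evaluation protocol may differ (e.g. attentive probing), and the encoder's output convention here is an assumption:

```python
import torch
import torch.nn as nn

class FrozenTrunkProbe(nn.Module):
    """Linear classifier on pooled features from a frozen vision encoder."""

    def __init__(self, encoder, feature_dim, num_classes=1000):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # trunk stays frozen
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, pixel_values):
        with torch.no_grad():
            feats = self.encoder(pixel_values=pixel_values).last_hidden_state
        return self.head(feats.mean(dim=1))  # average-pool patch tokens
```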
alaaelnouby.bsky.social
AIMv2 is pre-trained in a manner similar to modern VLMs; therefore, it can be integrated seamlessly as a vision encoder for multimodal LLMs, with even our smallest backbone (i.e., AIMv2-L) outperforming popular backbones such as OpenAI CLIP and SigLIP on multimodal understanding benchmarks.
alaaelnouby.bsky.social
AIMv2 is pre-trained to autoregressively generate image patches and text tokens. It is easy to implement and train, and it scales trivially to billions of parameters. We are sharing checkpoints ranging from 300M to 3B params, available in PyTorch, JAX, and MLX on 🤗
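A conceptual sketch of that objective, not the released training code: a single causal sequence of image patches followed by text tokens, with patch regression and next-token cross-entropy (function name, tensor shapes, and loss weights are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def multimodal_autoregressive_loss(patch_preds, patch_targets,
                                   text_logits, text_targets,
                                   patch_weight=1.0, text_weight=1.0):
    """patch_preds/patch_targets: (B, N_patches, patch_dim) next-patch pairs.
    text_logits: (B, N_text, vocab); text_targets: (B, N_text) next tokens."""
    # Regression loss for next-patch prediction on the image part.
    patch_loss = F.mse_loss(patch_preds, patch_targets)
    # Standard next-token cross-entropy on the text part of the sequence.
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    return patch_weight * patch_loss + text_weight * text_loss
```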
alaaelnouby.bsky.social
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...