Ivana Balazevic
@ibalazevic.bsky.social
920 followers 130 following 4 posts
Senior Research Scientist at Google DeepMind, working on Gemini. PhD from University of Edinburgh. ibalazevic.github.io
ibalazevic.bsky.social
Disentanglement is an intriguing phenomenon that arises in generative latent variable models for reasons that are not fully understood.

If you’re interested in learning why, I highly recommend giving Carl’s blog a read!
carl-allen.bsky.social
Machine learning has made incredible breakthroughs, but our theoretical understanding lags behind.

We take a step towards unravelling its mystery by explaining why the phenomenon of disentanglement arises in generative latent variable models.

Blog post: carl-allen.github.io/theory/2024/...
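[Added context, not part of the original post: disentanglement is most often studied in VAE-style latent variable models, where the β-VAE objective simply reweights the KL term of the standard ELBO. A minimal statement of that objective, for readers new to the setting:]

```latex
% beta-VAE objective: beta = 1 recovers the standard ELBO; beta > 1 is
% empirically associated with more disentangled latents z.
\mathcal{L}(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]
  - \beta\,\mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p(z)\big)
```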
Reposted by Ivana Balazevic
aidanematzadeh.bsky.social
I am hiring for RS/RE positions! If you are interested in language-flavored multimodal learning, evaluation, or post-training apply here 🦎 boards.greenhouse.io/deepmind/job...

I will also be at #NeurIPS2024, so come say hi! (Please email me to find time to chat)
Research Scientist, Language
London, UK
boards.greenhouse.io
Reposted by Ivana Balazevic
giffmana.ai
Our big_vision codebase is really good! And it's *the* reference for ViT, SigLIP, PaliGemma, JetFormer, ... including fine-tuning them.

However, it's criminally undocumented. I tried using it outside Google to fine-tune PaliGemma and SigLIP on GPUs, and wrote a tutorial: lb.eyer.be/a/bv_tuto.html
Reposted by Ivana Balazevic
carl-allen.bsky.social
I think this comes down to the model behind p(x,y). If features of x cause y, e.g. aspects of a website (x) -> clicks (y), or age/health -> disease, then p(y|x) is a (regression) function of x. But if x|y is a distribution over different x's for a given y (e.g. images of cats), then p(y|x) is given by Bayes' rule (squint at softmax).
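[The "squint at softmax" remark refers to a standard identity, spelled out here as added context: writing Bayes' rule in terms of log-scores makes the softmax form explicit.]

```latex
p(y\mid x)
  = \frac{p(x\mid y)\,p(y)}{\sum_{y'} p(x\mid y')\,p(y')}
  = \frac{\exp\big(\log p(x\mid y)+\log p(y)\big)}
         {\sum_{y'}\exp\big(\log p(x\mid y')+\log p(y')\big)}
  = \operatorname{softmax}_y\big(s_y(x)\big),
\qquad s_y(x) = \log p(x\mid y) + \log p(y)
```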
Reposted by Ivana Balazevic
dimadamen.bsky.social
Read our paper:
Context-Aware Multimodal Pretraining

Now on ArXiv

Can you turn vision-language models into strong any-shot models?

Go beyond zero-shot performance in SigLixP (x for context)

Read @confusezius.bsky.social's thread below…

And follow Karsten … a rising star!
confusezius.bsky.social
🤔 Can you turn your vision-language model from a great zero-shot model into a great-at-any-shot generalist?

Turns out you can, and here is how: arxiv.org/abs/2411.15099

Really excited to share this work on multimodal pretraining as my first Bluesky entry!

🧵 A short and hopefully informative thread:
ibalazevic.bsky.social
We maintain strong zero-shot transfer of CLIP / SigLIP across model size and data scale, while achieving up to 4x few-shot sample efficiency and up to +16% performance gains!

Fun project with @confusezius.bsky.social, @zeynepakata.bsky.social, @dimadamen.bsky.social and
@olivierhenaff.bsky.social.
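[Added illustration, not the method of arxiv.org/abs/2411.15099: a minimal sketch of what "any-shot" evaluation of a CLIP/SigLIP-style model means, where the same frozen encoders are used zero-shot (text prompts) or few-shot (class prototypes from a handful of labelled images). `image_embed` and `text_embed` are hypothetical stand-ins for the model's encoders.]

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-8):
    # Normalise embeddings to unit length so dot products are cosine similarities.
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def zero_shot_weights(text_embed, class_prompts):
    # One normalised text embedding per class, e.g. "a photo of a {class}".
    return l2_normalize(np.stack([text_embed(p) for p in class_prompts]))

def few_shot_weights(image_embed, support_images, support_labels, num_classes):
    # Class prototypes: mean embedding of the few labelled support images per class.
    embs = l2_normalize(np.stack([image_embed(x) for x in support_images]))
    labels = np.array(support_labels)
    protos = np.stack([embs[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)

def classify(image_embed, query_image, class_weights):
    # Predict the class whose (text or prototype) weight is most similar to the query.
    q = l2_normalize(image_embed(query_image))
    return int(np.argmax(class_weights @ q))
```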
Reposted by Ivana Balazevic
sharky6000.bsky.social
Just a heads up to everyone: @deep-mind.bsky.social is unfortunately a fake account and has been reported. Please do not follow it nor repost anything from it.
ibalazevic.bsky.social
Could you add me please? :)