Antoine Chaffin
@nohtow.bsky.social
570 followers 87 following 38 posts
27, French CS Engineer 💻, PhD in ML 🎓🤖 — Guiding generative models for better synthetic data and building multimodal representations @LightOn
Reposted by Antoine Chaffin
tomaarsen.com
ColBERT (a.k.a. multi-vector, late-interaction) models are extremely strong search models, often outperforming dense embedding models. And @lightonai.bsky.social just released a new state-of-the-art one: GTE-ModernColBERT-v1!

Details in 🧵
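(For context, "late interaction" refers to ColBERT-style MaxSim scoring between token embeddings. A minimal, illustrative sketch of that operation, not LightOn's exact implementation:)

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction relevance score.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim),
    both assumed L2-normalized so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T  # token-to-token similarity matrix
    # MaxSim: each query token keeps its best-matching document token,
    # then the per-query-token maxima are summed.
    return sim.max(dim=1).values.sum()

# Toy usage with random embeddings
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(200, 128), dim=-1)
print(maxsim_score(q, d))
```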
Reposted by Antoine Chaffin
tomaarsen.com
I'm a big fan of the PyLate project for ColBERT models, and I'm glad to see these strong models coming out. Very nice work by the @lightonai.bsky.social folks, especially @nohtow.bsky.social.

Learn more about PyLate here: lightonai.github.io/pylate/
nohtow.bsky.social
As per usual, thanks to my dear co-maintainer @raphaelsty.bsky.social for helping me make PyLate what it is 🫶
nohtow.bsky.social
In addition to knowledge distillation, we recently added features for large-scale contrastive pre-training. This model was released by popular demand, but we are currently running heavier trainings, so stay tuned!
nohtow.bsky.social
PyLate makes downstream usage easy, but it also facilitates training!
You can reproduce this SOTA training in fewer than 80 lines of code and about 2 hours of training; it will run NanoBEIR during training, report the results to W&B and create an informative model card!
Link to the gist: gist.github.com/NohTow/3030f...
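A minimal sketch of what such a distillation run looks like with PyLate; the dataset id, base checkpoint and hyperparameters below are illustrative assumptions, and the linked gist is the authoritative script (it also wires up the NanoBEIR evaluation during training):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

# Knowledge-distillation data: queries, documents and teacher scores
train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")

# Attach the query/document texts to each training row on the fly
train.set_transform(utils.KDProcessing(queries=queries, documents=documents).transform)

# Start from a strong dense encoder checkpoint (illustrative choice)
model = models.ColBERT(model_name_or_path="Alibaba-NLP/gte-modernbert-base")

args = SentenceTransformerTrainingArguments(
    output_dir="output/gte-moderncolbert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    bf16=True,              # adjust to your hardware
    logging_steps=50,
    report_to=["wandb"],    # W&B logging, if installed and configured
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train,
    loss=losses.Distillation(model=model),
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```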
nohtow.bsky.social
Besides, it also comes with the 8k context window of ModernBERT, which is very useful given that late-interaction models generalize very well to longer contexts, as highlighted in the ModernBERT paper
It is thus well suited to handle your very long documents!
nohtow.bsky.social
It is also the first model to outperform ColBERT-small on BEIR
While it is bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!
Also, it has only been trained on MS MARCO (for late interaction) and should thus generalize pretty well!
nohtow.bsky.social
Model link: huggingface.co/lightonai/GT...
GTE-ModernColBERT is trained on top of the GTE-ModernBERT base model using knowledge distillation on the MS MARCO dataset and is the first SOTA model trained using PyLate!
Get started with PyLate using the documentation:
lightonai.github.io/pylate/
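A small indexing + retrieval sketch, roughly following the PyLate documentation (the toy documents and ids are made up):

```python
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# HNSW-backed index storing one embedding per document token
index = indexes.Voyager(index_folder="pylate-index", index_name="docs", override=True)
retriever = retrieve.ColBERT(index=index)

documents = [
    "PyLate is a library to train and use late-interaction models.",
    "ModernBERT natively handles sequences of up to 8192 tokens.",
]
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=["doc-1", "doc-2"], documents_embeddings=documents_embeddings)

queries_embeddings = model.encode(["What is the ModernBERT context length?"], is_query=True)
results = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(results)  # ranked hits (document id + MaxSim score) per query
```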
lightonai/GTE-ModernColBERT-v1 · Hugging Face
nohtow.bsky.social
Among all those LLM releases, here is an important retrieval release:
To overcome the limitations of the awesome ModernBERT-based dense models, today @lightonai.bsky.social is releasing GTE-ModernColBERT, the very first state-of-the-art late-interaction (multi-vector) model trained using PyLate 🚀
nohtow.bsky.social
When I saw the release of ModernBERT-embed during the holidays, I knew I had to build the large variant, so I wanted to thank Zach Nussbaum from Nomic AI for building and sharing it (as well as all the nomic-embed tools and data) and bearing with me during the training!
nohtow.bsky.social
ModernBERT-embed-large not only enables usage of ModernBERT-large out-of-the-box, but it should also be a very good starting point for strong fine-tunings on various tasks, so I can't wait to see what the community will build on top of it!
nohtow.bsky.social
Obviously, it comes at a slightly higher cost, but it is also trained with Matryoshka capabilities to reduce the footprint of the embeddings
Notably, the performance at dimension 256 is only slightly below that of the base version at its full dimension of 768
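If you want the smaller footprint, here is a minimal sketch using sentence-transformers' truncate_dim; the model id and the search_query:/search_document: prefixes are assumptions carried over from ModernBERT-embed-base's recipe, so double-check the model card:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 Matryoshka dimensions of every embedding
model = SentenceTransformer("lightonai/modernbert-embed-large", truncate_dim=256)

sentences = [
    "search_query: What is Matryoshka representation learning?",
    "search_document: Matryoshka training keeps truncated embeddings useful.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 256) instead of the model's full dimensionality
```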
nohtow.bsky.social
Model link: huggingface.co/lightonai/mo...
ModernBERT-embed-large is trained with the same (two-stage) recipe as its smaller sibling and, as expected, improves performance, gaining +1.22 on the MTEB average
nohtow.bsky.social
ModernBERT-embed-base is awesome because it lets you use ModernBERT-base for various tasks out of the box
But the large variant of ModernBERT is also awesome...
So today, @lightonai.bsky.social is releasing ModernBERT-embed-large, the larger and more capable iteration of ModernBERT-embed!
nohtow.bsky.social
A multilingual version is not planned yet, but there has been some work on adapting ModernBERT to other languages:
www.linkedin.com/posts/fremyc...

In the meantime, you could give mGTE a shot (using xformers) or recent language-specific iterations of BERT such as CamemBERTv2!
Reposted by Antoine Chaffin
lightonai.bsky.social
Today, LightOn releases ModernBERT, a SOTA model for retrieval and classification.

This work was performed in collaboration with Answer.ai and the model was trained on Orange Business Cloud Avenue infrastructure.

www.lighton.ai/lighton-blog...
Reposted by Antoine Chaffin
benjaminwarner.dev
This week we released ModernBERT, the first encoder to reach SOTA on most common benchmarks across language understanding, retrieval, and code, while running twice as fast as DeBERTaV3 on short context and three times faster than NomicBERT & GTE on long context.
Reposted by Antoine Chaffin
calvinmccarter.bsky.social
When one evaluates log-likelihood of a sequence of length L via the chain rule of probability, the first term has missingness fraction of 1, the second has missingness of (L-1)/L, etc. So the inference-time masking rate is ~ Uniform[0, 1].
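Restating the argument in formulas (a paraphrase, not the poster's own notation):

```latex
% Autoregressive scoring of a length-L sequence:
\log p(x_{1:L}) \;=\; \sum_{t=1}^{L} \log p\!\left(x_t \mid x_{<t}\right)
% When predicting x_t, only t-1 of the L tokens are observed, so the fraction of the
% sequence still missing is (L - t + 1)/L; over t = 1, \dots, L this sweeps the values
% 1, (L-1)/L, \dots, 1/L, i.e. the effective masking rate at inference is spread
% approximately as Uniform[0, 1].
```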
nohtow.bsky.social
Oh, I see!
Having also worked a lot on causal models, I had never thought of this kind of modelling because I always contrasted MLM with open-ended generation
I guess with papers such as this one arxiv.org/pdf/2406.04823, I should think about it more!
Very interesting perspective, thanks!
nohtow.bsky.social
Could you elaborate?
Or give me pointers?
Is it because having a fixed value biases the learning w.r.t. the way we will sample downstream? (Like not masking 30% of the target?)
nohtow.bsky.social
But there is definitely some digging to be done to find an optimal strategy in this regard
To me, the logic would be to ramp the ratio up: start easy to get a kick-off signal, then make it harder and harder; but papers seem to say otherwise
Maybe random is the optimal solution!
nohtow.bsky.social
Not really, we considered ramping the masking ratio up/down, but the findings from the literature (at least what we read at the time) seemed counter-intuitive and far from a consensus
We ended up not digging much into this particular aspect, again because we had so much else to explore
nohtow.bsky.social
Definitely!
Again, the original goal of the project (besides cool models) was to convince some researchers to spend a bit of their GPU hours on encoder pre-training again!

Hopefully we nailed it and will have the answers to a lot of questions in the future!