Tom Aarsen
@tomaarsen.com
2.5K followers 200 following 310 posts
Sentence Transformers, SetFit & NLTK maintainer | Machine Learning Engineer at 🤗 Hugging Face
tomaarsen.com
Hahaha, or somewhat less accidental overfitting 😉
tomaarsen.com
Check out the leaderboard here: mteb-leaderboard.hf.space?benchmark_na...

I'm very proud of everyone who worked on this. It's been a nice collaboration between Voyage AI by @mongodb.bsky.social and the core MTEB team.
tomaarsen.com
The benchmark is multilingual (20 languages) and covers various domains (general, legal, healthcare, code, etc.), and it's already available on MTEB right now.

There's also an English-only version available.
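A minimal sketch of running a model on it with the `mteb` package; the exact benchmark name string ("RTEB(beta)") is my assumption, check the blogpost/leaderboard for the exact identifier:

```python
# Sketch: evaluate a Sentence Transformers model on the RTEB tasks via `mteb`.
# The benchmark name "RTEB(beta)" is an assumption; the private datasets are
# only evaluated by the MTEB team themselves.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

benchmark = mteb.get_benchmark("RTEB(beta)")  # assumed name string
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results")
```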

🧵
tomaarsen.com
With RTEB, we can see the difference between performance on public and private datasets, displayed in the figure here.

This gives an indication of whether a model is capable of generalizing well.

🧵
tomaarsen.com
In short: RTEB uses a hybrid approach with both open and private datasets to measure generalization, preventing overfitting to test sets.

The picture at the top of this thread is what we commonly see on MTEB: models with a lower zero-shot percentage score higher, but generalize worse.

🧵
tomaarsen.com
We're announcing a new update to MTEB: RTEB

It's a new multilingual text embedding retrieval benchmark with private (!) datasets, to ensure that we measure true generalization and avoid (accidental) overfitting.

Details in our blogpost below 🧵
tomaarsen.com
- Add FLOPS calculation to SparseEncoder evaluators for determining a performance/speed tradeoff
- Add support for Knowledgeable Passage Retriever (KPR) models
- Multi-GPU processing with `model.encode()` now works with `convert_to_tensor`
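A quick sketch of that last fix, assuming the v5-style multi-device `device=[...]` interface and two CUDA GPUs:

```python
# Sketch: multi-GPU encoding combined with convert_to_tensor=True.
# Assumes two CUDA GPUs are available.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The weather is lovely today.", "It's so sunny outside!"] * 1000

# Passing a list of devices spreads the encoding over a multi-process pool;
# the embeddings are now correctly returned as a single torch.Tensor.
embeddings = model.encode(sentences, device=["cuda:0", "cuda:1"], convert_to_tensor=True)
print(embeddings.shape)
```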

🧵
tomaarsen.com
- `model.encode()` now throws an error if an unused keyword argument is passed
- a new `model.get_model_kwargs()` method for checking which custom model-specific keyword arguments a model supports
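A small sketch of both changes:

```python
# Sketch: stricter keyword argument handling in model.encode().
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Which custom model-specific keyword arguments does this model accept?
print(model.get_model_kwargs())

try:
    # A typo'd/unsupported argument is no longer silently ignored:
    model.encode(["Hello there!"], normalize_embedings=True)  # note the typo
except Exception as exc:  # exact exception type not asserted here
    print(exc)
```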

🧵
tomaarsen.com
🐛 I've just released Sentence Transformers v5.1.1!

It's a small patch release that makes the project more explicit about incorrect arguments and introduces fixes for multi-GPU processing, evaluators, and hard negative mining.

Details in 🧵
tomaarsen.com
Sounds like a great initiative. I'm looking forward to seeing it develop
tomaarsen.com
I'm very much looking forward to seeing embedding models based on mmBERT!
I already trained a basic Sentence Transformer model myself as I was too curious 👀
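Roughly along these lines; this is a sketch, not the exact recipe, and both the `jhu-clsp/mmBERT-base` model id and the training dataset are my assumptions:

```python
# Sketch: train a basic Sentence Transformer on top of mmBERT.
# The base model id and the dataset are assumptions, not the exact recipe used above.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Loading a plain encoder adds mean pooling on top by default
model = SentenceTransformer("jhu-clsp/mmBERT-base")

# Any (anchor, positive) pair dataset works with this loss
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```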

🧵
tomaarsen.com
Based on this, mmBERT should be the new go-to multilingual encoder base model at <=300M parameters.

Note: mmBERT models are "base" models: they're currently only trained for Mask Filling. They need to be finetuned for tasks like semantic search, classification, clustering, etc.
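For example, out of the box they can only fill in masks (the model id is again an assumption on my end):

```python
# Sketch: the base checkpoints only do mask filling out of the box.
# The model id "jhu-clsp/mmBERT-base" is an assumption.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
prompt = f"Paris is the capital of {fill_mask.tokenizer.mask_token}."
print(fill_mask(prompt))
```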

🧵
tomaarsen.com
On top of all that, both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released!

🧵
tomaarsen.com
Additionally, the ModernBERT-based mmBERT is much faster than the alternatives thanks to its architectural improvements: easily up to 2x the throughput in common scenarios.

🧵
tomaarsen.com
In short: the models beat commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.

🧵
tomaarsen.com
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)

E.g. see the picture for MTEB v2 Multilingual performance.
🧵
tomaarsen.com
Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)

E.g. see the picture for MTEB v2 English performance.

🧵