Tom Aarsen
@tomaarsen.com
2.5K followers 200 following 310 posts
Sentence Transformers, SetFit & NLTK maintainer | Machine Learning Engineer at 🤗 Hugging Face
tomaarsen.com
Hahaha, or somewhat less accidental overfitting 😉
tomaarsen.com
Check out the leaderboard here: mteb-leaderboard.hf.space?benchmark_na...

I'm very proud of everyone who worked on this. It's been a nice collaboration between Voyage AI by @mongodb.bsky.social and the core MTEB team.
tomaarsen.com
The benchmark is multilingual (20 languages) and covers various domains (general, legal, healthcare, code, etc.), and it's already available on MTEB right now.

There's also an English-only version available.
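A minimal sketch of running a model on it with the `mteb` package; the exact benchmark name string ("RTEB(beta)") is my assumption, check the blogpost/leaderboard for the exact identifier:

```python
# Sketch: evaluate a Sentence Transformers model on the RTEB tasks via `mteb`.
# The benchmark name "RTEB(beta)" is an assumption; the private datasets are
# only evaluated by the MTEB team themselves.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

benchmark = mteb.get_benchmark("RTEB(beta)")  # assumed name string
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results")
```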

🧵
tomaarsen.com
With RTEB, we can see the difference between performance on public and private datasets, displayed in the figure here.

This gives an indication of whether a model is capable of generalizing well.

🧵
tomaarsen.com
In short: RTEB uses a hybrid approach with both open and private datasets to measure generalization, preventing overfitting to test sets.

The picture at the top of this thread is what we commonly see on MTEB: models with a lower zero-shot percentage score higher, but generalize worse.

🧵
tomaarsen.com
We're announcing a new update to MTEB: RTEB

It's a new multilingual text embedding retrieval benchmark with private (!) datasets, to ensure that we measure true generalization and avoid (accidental) overfitting.

Details in our blogpost below 🧵
tomaarsen.com
- Add FLOPS calculation to SparseEncoder evaluators for determining a performance/speed tradeoff
- Add support for Knowledgeable Passage Retriever (KPR) models
- Multi-GPU processing with `model.encode()` now works with `convert_to_tensor`
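A quick sketch of that last fix, assuming the v5-style multi-device `device=[...]` interface and two CUDA GPUs:

```python
# Sketch: multi-GPU encoding combined with convert_to_tensor=True.
# Assumes two CUDA GPUs are available.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The weather is lovely today.", "It's so sunny outside!"] * 1000

# Passing a list of devices spreads the encoding over a multi-process pool;
# the embeddings are now correctly returned as a single torch.Tensor.
embeddings = model.encode(sentences, device=["cuda:0", "cuda:1"], convert_to_tensor=True)
print(embeddings.shape)
```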

🧵
tomaarsen.com
- `model.encode()` now throws an error if an unused keyword argument is passed
- a new `model.get_model_kwargs()` method for checking which custom model-specific keyword arguments a model supports
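A small sketch of both changes:

```python
# Sketch: stricter keyword argument handling in model.encode().
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Which custom model-specific keyword arguments does this model accept?
print(model.get_model_kwargs())

try:
    # A typo'd/unsupported argument is no longer silently ignored:
    model.encode(["Hello there!"], normalize_embedings=True)  # note the typo
except Exception as exc:  # exact exception type not asserted here
    print(exc)
```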

🧵
tomaarsen.com
🐛 I've just released Sentence Transformers v5.1.1!

It's a small patch release that makes the project more explicit about incorrect arguments and introduces fixes for multi-GPU processing, evaluators, and hard negative mining.

Details in 🧵
tomaarsen.com
Sounds like a great initiative. I'm looking forward to seeing it develop
tomaarsen.com
I'm very much looking forward to seeing embedding models based on mmBERT!
I already trained a basic Sentence Transformer model myself as I was too curious 👀
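Roughly along these lines; this is a sketch, not the exact recipe, and both the `jhu-clsp/mmBERT-base` model id and the training dataset are my assumptions:

```python
# Sketch: train a basic Sentence Transformer on top of mmBERT.
# The base model id and the dataset are assumptions, not the exact recipe used above.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Loading a plain encoder adds mean pooling on top by default
model = SentenceTransformer("jhu-clsp/mmBERT-base")

# Any (anchor, positive) pair dataset works with this loss
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```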

🧵
tomaarsen.com
Based on this, mmBERT should be the new go-to multilingual encoder base model at <=300M parameters.

Note: mmBERT models are "base" models: they're currently only trained for Mask Filling. They need to be finetuned for tasks like semantic search, classification, clustering, etc.
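For example, out of the box they can only fill in masks (the model id is again an assumption on my end):

```python
# Sketch: the base checkpoints only do mask filling out of the box.
# The model id "jhu-clsp/mmBERT-base" is an assumption.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
prompt = f"Paris is the capital of {fill_mask.tokenizer.mask_token}."
print(fill_mask(prompt))
```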

🧵
tomaarsen.com
On top of all that, both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released!

🧵
tomaarsen.com
Additionally, the ModernBERT-based mmBERT is much faster than the alternatives thanks to its architectural improvements: easily up to 2x the throughput in common scenarios.

🧵
tomaarsen.com
In short: the models beat commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.

🧵
tomaarsen.com
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)

E.g. see the picture for MTEB v2 Multilingual performance.
🧵
tomaarsen.com
Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)

E.g. see the picture for MTEB v2 English performance.

🧵