Machine Learning Engineer at 🤗 Hugging Face
🧵
5. Rescore the top 40 documents using the fp32 query embedding and the 40 int8 embeddings
6. Sort the 40 documents based on the new scores, grab the top 10
7. Load the titles/texts of the top 10 documents
🧵
1. Embed your query using a dense embedding model into a 'standard' fp32 embedding
2. Quantize the fp32 embedding to binary: 32x smaller
3. Use an approximate (or exact) binary index to retrieve e.g. 40 documents (~20x faster than an fp32 index); see the sketch below
🧵
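Here's a minimal, self-contained sketch of steps 1-3 and 5-7 above, done by hand with numpy instead of a real vector database; the model name, toy corpus, and the naive int8 scaling are assumptions for illustration only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any dense embedding model works

corpus = [
    "Binary quantization shrinks embeddings 32x.",
    "The weather will be sunny tomorrow.",
    "int8 rescoring recovers most of the retrieval accuracy.",
]
corpus_fp32 = model.encode(corpus, normalize_embeddings=True)

# Offline: a binary index (32x smaller) plus int8 embeddings (4x smaller) for rescoring
corpus_bin = np.packbits(corpus_fp32 > 0, axis=1)                    # 1 bit per dimension
corpus_int8 = np.clip(corpus_fp32 * 127, -128, 127).astype(np.int8)  # naive int8 scaling (assumption)

# 1. Embed the query in fp32; 2. quantize it to binary
query_fp32 = model.encode("how does binary quantization work?", normalize_embeddings=True)
query_bin = np.packbits(query_fp32 > 0)

# 3. Retrieve e.g. 40 candidates via Hamming distance on the binary index
# (a real setup would use an approximate binary index, e.g. FAISS, instead of brute force)
hamming = np.unpackbits(corpus_bin ^ query_bin, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:40]

# 5. Rescore those candidates with the fp32 query against their int8 embeddings
rescored = corpus_int8[candidates].astype(np.float32) @ query_fp32

# 6. Sort by the new scores; 7. load the top 10 documents
top10 = candidates[np.argsort(-rescored)[:10]]
print([corpus[i] for i in top10])
```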
The trick: Binary search with int8 rescoring.
I'll show you a demo & how it works in the 🧵:
Now that Python 3.9 has lost security support, Sentence Transformers no longer supports it.
🧵
This release works with both Transformers v4 and the upcoming v5. In the future, Sentence Transformers will only work with Transformers v5, but not yet!
Even my tests run on both Transformers v4 and v5.
🧵
When mining for hard negatives to create a strong training dataset, you can now pass `output_scores=True` to get similarity scores returned. This can be useful for some distillation losses!
🧵
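A minimal sketch of how that might look, with a toy query/answer dataset; the model name, column names, and data are placeholders, and `output_scores=True` is the new argument mentioned above.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any embedding model

# Toy (query, answer) pairs; a real dataset would have thousands of rows
dataset = Dataset.from_dict({
    "query": [
        "What is the capital of France?",
        "Who wrote Hamlet?",
        "How tall is Mount Everest?",
        "When did Apollo 11 land?",
    ],
    "answer": [
        "Paris is the capital of France.",
        "Hamlet was written by William Shakespeare.",
        "Mount Everest is 8,849 metres tall.",
        "Apollo 11 landed on the Moon in July 1969.",
    ],
})

# output_scores=True also returns the similarity scores of the mined pairs,
# which some distillation losses need as a teacher signal
mined = mine_hard_negatives(
    dataset,
    model,
    num_negatives=2,
    output_scores=True,
)
print(mined)
```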
You can now use community translations of the tiny NanoBEIR retrieval benchmark instead of only the English one, by passing `dataset_id`, e.g. `dataset_id="lightonai/NanoBEIR-de"` for the German benchmark.
🧵
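A minimal sketch, assuming a multilingual embedding model (the model name is just an example); `dataset_id` is the new argument from the post above.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer("intfloat/multilingual-e5-small")  # assumption: any multilingual model

# Point the evaluator at a community-translated NanoBEIR collection on the Hub,
# here the German translation mentioned above
evaluator = NanoBEIREvaluator(dataset_id="lightonai/NanoBEIR-de")
results = evaluator(model)
print(results)
```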
Similar to SentenceTransformer and SparseEncoder, you can now use multi-processing with CrossEncoder rerankers. Useful for multi-GPU and CPU settings, and simple to configure:
just `device=["cuda:0", "cuda:1"]` or `device=["cpu"]*4` on the `predict`/`rank` calls.
🧵
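A minimal sketch, assuming a machine with two GPUs (swap in `device=["cpu"] * 4` otherwise); the reranker model name is just an example.

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")  # assumption: any reranker

query = "How can I make my embeddings smaller?"
documents = [
    "Binary quantization shrinks embeddings 32x.",
    "The weather will be sunny tomorrow.",
    "int8 rescoring recovers most of the retrieval accuracy.",
]

# Passing a list of devices to predict/rank spreads the work over multiple processes
scores = model.predict([(query, doc) for doc in documents], device=["cuda:0", "cuda:1"])
ranking = model.rank(query, documents, device=["cuda:0", "cuda:1"])
print(scores)
print(ranking)
```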
It introduces multi-processing for CrossEncoder (rerankers), multilingual NanoBEIR evaluators, similarity score outputs in `mine_hard_negatives`, Transformers v5 support and more.
Details in 🧵
🧵
This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!
Details in 🧵
Their blogpost covers all changes, including easier evaluation, multimodal support, rerankers, new interfaces, documentation, dataset statistics, a migration guide, etc.
🧵
There's also an English-only version available.
🧵
This would indicate whether the model is able to generalize well.
🧵
It's a new multilingual text embedding retrieval benchmark with private (!) datasets, to ensure that we measure true generalization and avoid (accidental) overfitting.
Details in our blogpost below 🧵
- Add support for Knowledgeable Passage Retriever (KPR) models
- Multi-GPU processing with `model.encode()` now works with `convert_to_tensor` (see the sketch below)
🧵
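A minimal sketch of that second point, assuming two GPUs (use `device=["cpu"] * 2` on a CPU-only machine); the model name is an example.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any embedding model
sentences = ["first sentence", "second sentence", "third sentence", "fourth sentence"]

# convert_to_tensor=True now also works when encode() fans out over multiple devices
embeddings = model.encode(sentences, device=["cuda:0", "cuda:1"], convert_to_tensor=True)
print(embeddings.shape)
```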
- A new `model.get_model_kwargs()` method for checking which custom, model-specific keyword arguments a model supports (see the sketch below)
🧵
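A minimal sketch; the model name is an example, and what gets returned depends on the model.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any model

# Returns the custom, model-specific keyword arguments this model's encode() accepts
print(model.get_model_kwargs())
```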
It's a small patch release that makes the project more explicit with incorrect arguments and introduces some fixes for multi-GPU processing, evaluators, and hard negatives mining.
Details in 🧵
I already trained a basic Sentence Transformer model myself as I was too curious 👀
🧵
🧵
E.g. see the picture for MTEB v2 Multilingual performance.
🧵
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
E.g. see the picture for MTEB v2 English performance.
🧵
- Trained on 1833 languages incl. DCLM, FineWeb2, etc.
- 3 training phases: 2.3T tokens on 60 languages, 600B tokens on 110 languages, and 100B tokens on all 1833 languages.
- Also uses model merging and clever transitions between the three training phases.
🧵
- 2 model sizes: 42M non-embed (140M total) and 110M non-embed (307M total)
- Uses the ModernBERT architecture + Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, sequence packing, etc.)
- Max. seq. length of 8192 tokens
🧵