Michael Günther
@michael-g-u.bsky.social
ML @jina-ai.bsky.social
https://github.com/guenthermi
I attended SIGIR this year with @bowang0911.bsky.social. Together with @scottmartens.bsky.social, we wrote a blog post with our highlights and summaries of the AI and neural search papers we found interesting at the conference
jina.ai/news/what-we...
What We Learned at SIGIR 2025
Sharing what we saw and learned at SIGIR 2025, feat. CLIP-AdaM, RE-AdaptIR and evaluations for LLM-based retrieval systems.
jina.ai
August 12, 2025 at 10:08 AM
Image resolution matters for embeddings, especially for visual document retrieval. jina-embeddings-v4 supports image inputs of 16+ megapixels (the default is much lower). We wrote a blog post about how resolution affects performance across benchmarks
jina.ai/news/how-ima...
How Image Resolution Impacts Visual Document Retrieval
Image resolution is crucial for embedding visually rich documents. Too small and models miss key details; too large and they can't connect the parts.
jina.ai
July 31, 2025 at 7:59 AM
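Since the budget is in pixels rather than a fixed width/height, an easy knob is resizing by megapixel count before embedding. A minimal sketch with Pillow; the helper name and the 16 MP target (taken from the post above) are illustrative:

```python
# Scale an image to a target megapixel budget before embedding it.
from PIL import Image

def resize_to_megapixels(img: Image.Image, target_mp: float = 16.0) -> Image.Image:
    current_mp = (img.width * img.height) / 1e6
    scale = (target_mp / current_mp) ** 0.5  # preserves aspect ratio
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)

page = resize_to_megapixels(Image.open("scanned_page.png"))
```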
We created a new benchmark for visual document retrieval with diverse visually rich documents (going beyond linear, paginated PDFs) and more query types than just questions
👨‍💻 github.com/jina-ai/jina...
📑 jina.ai/news/jinavdr...
JinaVDR: New Visual Document Retrieval Benchmark with 95 Tasks in 20 Languages
JinaVDR is a new benchmark spanning 95 tasks across 20 languages for visual document retrieval, soon on MTEB.
jina.ai
July 26, 2025 at 10:27 AM
Our paper "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" has been accepted at the Robust IR Workshop @ SIGIR 2025! 🌠

📅 I'll present it on July 17th

📝 Pre-print: arxiv.org/abs/2409.04701
🔗 Workshop: sigir-2025-workshop-on-robust-ir.github.io
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compre...
arxiv.org
July 8, 2025 at 8:35 AM
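For readers new to the idea: instead of embedding each chunk in isolation, late chunking encodes the whole document once and then pools token embeddings per chunk, so every chunk vector carries document-wide context. A minimal sketch, assuming an encoder with a long context window and a fast tokenizer (the mean pooling and the helper function are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any long-context encoder works; this one supports 8192 tokens.
MODEL = "jinaai/jina-embeddings-v2-small-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(document: str, chunk_char_spans):
    # Tokenize once, keeping character offsets to map chunk boundaries
    # into token positions.
    enc = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    chunk_embs = []
    for start, end in chunk_char_spans:
        # All tokens whose character span overlaps this chunk.
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        chunk_embs.append(token_embs[mask].mean(dim=0))
    return torch.stack(chunk_embs)
```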
Reposted by Michael Günther
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, improvements to the encode methods, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance!

Details in 🧵
July 1, 2025 at 2:00 PM
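A sketch of what the sparse + dense combination can look like. Caveat: the `SparseEncoder` usage is based on the v5 release notes and is an assumption on my part; the reciprocal-rank-fusion helper is mine, not part of the library:

```python
from sentence_transformers import SentenceTransformer, SparseEncoder  # v5+

dense = SentenceTransformer("all-MiniLM-L6-v2")
sparse = SparseEncoder("naver/splade-cocondenser-ensembledistil")

docs = ["A fluffy cat", "A guide to SPLADE", "Dense vs. sparse retrieval"]
query = "sparse retrieval models"

def ranking(model, query, docs):
    # Rank document indices by similarity to the query.
    sims = model.similarity(model.encode([query]), model.encode(docs))[0]
    return sorted(range(len(docs)), key=lambda i: -float(sims[i]))

def rrf(rankings, k=60):
    # Reciprocal rank fusion: merge rankings without tuning weights.
    scores = {}
    for r in rankings:
        for rank, doc_id in enumerate(r):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf([ranking(dense, query, docs), ranking(sparse, query, docs)]))
```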
We are currently working on quantization-aware training to speed up retrieval. My colleagues Andrej, @scottmartens.bsky.social, and @bowang0911.bsky.social have published a blog post about the first results - more is on the way!
jina.ai/news/quantiz...
Quantization-Aware Training of jina-embeddings-v4
Quantization gives smaller embeddings. We show how fine-tuned quantization can give you lossless embeddings.
jina.ai
July 1, 2025 at 6:33 AM
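For context, here is the inference-side payoff that QAT optimizes for: once embeddings are binarized, similarity search collapses to Hamming distance on packed bits. A toy sketch on synthetic vectors (the training itself is what makes real models robust to this step):

```python
import numpy as np

def binarize(embs: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension, packing 8 dims per byte.
    return np.packbits(embs > 0, axis=-1)

def hamming_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    # Fewer differing bits = more similar, so negate the bit count.
    return -np.unpackbits(query ^ docs, axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 2048)).astype(np.float32)
query = docs[42] + 0.1 * rng.normal(size=2048).astype(np.float32)
scores = hamming_scores(binarize(query[None]), binarize(docs))
print(scores.argmax())  # 42: the perturbed source doc still ranks first
```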
We released a new model: jina-embeddings-v4
- multilingual text-to-text and text-to-image search w/o modality gap
- also visual documents (e.g. PDFs, maps), trained on a wider scope than DSE, ColPali, etc.
+ MRL, late interaction, etc. (usage sketch below)
🤗 huggingface.co/jinaai/jina-...
📄 arxiv.org/abs/2506.18902
jinaai/jina-embeddings-v4 · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
June 25, 2025 at 2:53 PM
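A usage sketch. The `encode_text`/`encode_image` helpers and the `task` argument are assumptions based on my reading of the model card; check huggingface.co/jinaai/jina-embeddings-v4 for the authoritative call:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4", trust_remote_code=True
)
# Method names and arguments below are assumptions; see the model card.
text_embs = model.encode_text(texts=["What is the modality gap?"], task="retrieval")
image_embs = model.encode_image(images=["page_scan.png"], task="retrieval")
print(text_embs[0] @ image_embs[0])  # one shared space, so dot products work
```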
Interesting blog post by my colleague @scottmartens.bsky.social on how text length influences embedding similarity: longer queries produce higher scores, so comparing the scores of two docs against the same query works, but scores for different queries are not comparable
jina.ai/news/on-the-...
April 23, 2025 at 11:12 AM
New Multi-Modal Reranking Model (e.g. for text-to-image retrieval): jina.ai/news/jina-re...

Supports Multiple Languages and Dynamic Resolution (up to 4K)

🤗 huggingface.co/jinaai/jina-...
jina-reranker-m0: Multilingual Multimodal Document Reranker
Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
jina.ai
April 8, 2025 at 2:23 PM
This Thursday in our paper talks event series, Bowen Jin will present his research on using reinforcement learning to train LLMs (R1-style) to generate search queries and use search engines more effectively.

Online Event: lu.ma/j8g0wnit
Paper: arxiv.org/abs/2503.09516
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning · Zoom · Luma
Join us for an insightful discussion of the groundbreaking Search-R1 framework, presented by Bowen Jin, a fourth-year Ph.D. student in Computer Science at the…
lu.ma
March 26, 2025 at 10:10 AM
Reposted by Michael Günther
Embedding models become "blind" beyond 4K tokens in context length. Building on the NoLIMA paper, our experiments show that for needle-in-a-haystack tasks, performance of embedding models drops to near-random chance with long contexts—even with exact keyword matches 🤔 🧵
March 7, 2025 at 9:28 AM
I applied LLMs to query expansion and we wrote this article:
It seems to work out of the box and generally boosts the performance of embedding models. However, it adds latency. Would be interesting to see more work on this. (A minimal sketch of the recipe follows after this post.)
📃: jina.ai/news/query-e...
🛠️: github.com/jina-ai/llm-...
Query Expansion with LLMs: Searching Better by Saying More
Search has changed a lot since embedding models were introduced. Is there still a role for lexical techniques like query expansion in AI? We think so.
jina.ai
February 18, 2025 at 8:29 AM
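The recipe itself is compact enough to sketch: ask an LLM for related terms and embed the expanded query. The prompt and model choice here are mine, not the article's exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def expand_query(query: str, n_terms: int = 5) -> str:
    prompt = (
        f"List {n_terms} search terms or short phrases closely related to "
        f"this query, comma-separated, with no explanations:\n{query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Embed the original query together with its expansion terms.
    return query + " " + resp.choices[0].message.content

print(expand_query("treatments for type 2 diabetes"))
```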
Reposted by Michael Günther
When it rains, it pours.

Baichuan releases Baichuan-Omni-1.5

Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs.

Both model ( huggingface.co/baichuan-inc... ) and base ( huggingface.co/baichuan-inc... ).
January 26, 2025 at 9:14 PM
It seems like LLM APIs are cheaper and more versatile for translation than dedicated translation APIs like Google Translate, as they allow customized instructions. I created a small tool to experiment with LLM translation and translation comparison.
github.com/guenthermi/t...
GitHub - guenthermi/translation-align: LLM-based translation and translation comparison
LLM-based translation and translation comparison. Contribute to guenthermi/translation-align development by creating an account on GitHub.
github.com
January 26, 2025 at 4:28 PM
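The "customized instructions" point is the whole trick: an LLM lets you pin terminology, register, or formatting in the prompt, which a dedicated translation API won't. A minimal sketch (the prompt and model are illustrative, not what the tool ships):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def translate(text: str, target: str = "German") -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Translate into {target}. Keep a formal register and "
                f"leave code identifiers untranslated:\n{text}"
            ),
        }],
    )
    return resp.choices[0].message.content

print(translate("Call `fit()` before `predict()`."))
```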
An interesting blog post from my colleague that compares ModernBERT to RoBERTa and to Jina-XLM-RoBERTa, the backbone of our jina-embeddings-v3 model, in the context of embedding training
jina.ai/news/what-sh...
What Should We Learn From ModernBERT?
Bigger training data, efficient parameter sizing, and a deep-but-thin architecture, ModernBERT sets a direction for future BERT-like models.
jina.ai
January 22, 2025 at 4:57 PM
Many search and data analysis use cases require extracting information from the HTML code of websites. To make this easier, the new ReaderLM-v2 model can effectively convert HTML to Markdown and extract information into JSON following a given JSON schema.
jina.ai/news/readerl...
ReaderLM v2: Frontier Small Language Model for HTML to Markdown and JSON
ReaderLM-v2 is a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional accuracy.
jina.ai
January 15, 2025 at 4:34 PM
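Because ReaderLM-v2 is an ordinary causal LM, conversion is just text generation. A hedged sketch; the prompt wording is my approximation, so check the model card for the official template:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="jinaai/ReaderLM-v2")

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
prompt = (
    "Extract the main content from the given HTML and convert it "
    f"to Markdown format.\n\n{html}"
)
out = generator(prompt, max_new_tokens=256, return_full_text=False)
print(out[0]["generated_text"])
```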
Our submission to ECIR 2025 on jina-embeddings-v3 has been accepted! 🎉
At the ECIR Industry Day, my colleague @str-saba.bsky.social will present how we trained the latest version of our text embedding model.
More details on ECIR: ecir2025.eu
More details about the model: arxiv.org/abs/2409.10173
47th EUROPEAN CONFERENCE ON INFORMATION RETRIEVAL
ecir2025.eu
December 16, 2024 at 4:18 PM
Reposted by Michael Günther
I’m releasing a series of experiments to enhance retrieval-augmented generation using attention scores. colab.research.google.com/drive/1HEUqy... The basic idea is to leverage the model's internal reading process, as it goes back and forth over the sources to find information and potential quotes.
December 15, 2024 at 2:35 PM
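The mechanism is easy to poke at yourself: have a HF model return attention weights during generation and sum, per prompt token, how much the answer attended to it. A rough sketch; the model choice and the last-layer/mean-over-heads aggregation are my simplifications:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # any small causal LM works
tok = AutoTokenizer.from_pretrained(NAME)
# Eager attention is needed so generate() can return attention weights.
model = AutoModelForCausalLM.from_pretrained(NAME, attn_implementation="eager")

prompt = "Source: The Eiffel Tower is 330 m tall.\nQuestion: How tall is it?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10,
                     output_attentions=True, return_dict_in_generate=True)

n_prompt = inputs["input_ids"].shape[1]
scores = torch.zeros(n_prompt)
# One attentions entry per generated token; skip the prompt pass itself.
for step_attn in out.attentions[1:]:
    last_layer = step_attn[-1][0]             # (heads, 1, key_len)
    scores += last_layer.mean(0)[0, :n_prompt]
top = scores.topk(5).indices
print(tok.convert_ids_to_tokens(inputs["input_ids"][0][top].tolist()))
```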
Interesting article by Han Xiao about how to better utilize embedding models for classification tasks when the embedding model doesn't know much about your classes. It proposes a method that decomposes the classification into multiple simpler classification problems
jina.ai/news/scaling...
Scaling Test-Time Compute For Embedding Models
Better results scale with compute—more on learning, more on search. A good pretrained model takes you far, but test-time compute takes you further. It's time to recognize this paradigm of test-time co...
jina.ai
December 13, 2024 at 4:02 PM
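One way to read "decompose into simpler problems" (a hedged sketch, not necessarily the article's exact scheme): spend extra test-time compute on several LLM-generated paraphrases of each class description and let each paraphrase cast a vote:

```python
import numpy as np

def classify(item_emb: np.ndarray, class_embs: dict) -> str:
    # class_embs maps a class name to an (n_paraphrases, dim) matrix of
    # embeddings of rephrased descriptions of that class.
    def mean_sim(matrix: np.ndarray) -> float:
        sims = matrix @ item_emb / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(item_emb)
        )
        return float(sims.mean())  # each paraphrase casts a soft vote
    return max(class_embs, key=lambda name: mean_sim(class_embs[name]))
```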
One year ago, we released the first open-source embedding model supporting 8192 tokens. Many suspected it wouldn't be useful and that chunking would beat a single long vector. I ran many experiments to explain when to use what, and we summarized the findings in this article
https://jina.ai/news/still-need-chunking-when-long-context-models-can-do-it-all/
December 5, 2024 at 8:49 AM
Reposted by Michael Günther
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base huggingface.co/collections/...
November 26, 2024 at 4:04 PM
Reposted by Michael Günther
Follow the official @jina-ai.bsky.social account and our team here:

go.bsky.app/99FgER
Jina AI
Join the conversation
go.bsky.app
November 26, 2024 at 9:29 AM
Reposted by Michael Günther
Jina-CLIP-v2: a 0.9B multilingual multimodal embedding model that supports 89 languages, 512x512 image resolution, 8192-token text length, and Matryoshka representations down to 64 dimensions for both images and text. jina.ai/news/jina-cl... And of course, strong performance on retrieval & classification tasks.
Jina CLIP v2: Multilingual Multimodal Embeddings for Text and Images
Jina-CLIP v2, a 0.9B multimodal embedding model with multilingual support of 89 languages, high image resolution at 512x512, and Matryoshka representations.
jina.ai
November 26, 2024 at 8:56 AM
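The Matryoshka part in practice: with an MRL-trained model you can keep just the first k dimensions and re-normalize, trading a little accuracy for much smaller indexes. A minimal sketch (the helper is illustrative):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, k: int = 64) -> np.ndarray:
    # MRL-trained models pack most of the semantics into the leading dims.
    small = emb[:k]
    return small / np.linalg.norm(small)
```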