Orion Weller
orionweller.bsky.social
PhD Student at Johns Hopkins University. Previously: Allen Institute for AI, Apple, Samaya AI. Research for #NLProc #IR
Ever wonder how test-time compute would do in retrieval? 🤔

introducing ✨rank1✨

rank1 is distilled from R1 & designed for reranking.

rank1 is state-of-the-art at complex reranking tasks in reasoning, instruction-following, and general semantics (often 2x RankLlama 🤯)

🧵
February 26, 2025 at 2:57 PM
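The core idea of a pointwise reranker like rank1 can be sketched as: score each (query, document) pair independently, then sort by score. A minimal sketch, assuming a hypothetical `score_fn` stand-in — rank1 itself would first generate a reasoning chain with test-time compute and then read off a relevance judgment, which the toy word-overlap scorer below only imitates:

```python
def rerank(query, docs, score_fn):
    """Pointwise reranking: score every (query, doc) pair, sort descending."""
    scored = [(score_fn(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]

def toy_score(query, doc):
    # Stand-in for the model's relevance score. rank1 would instead
    # generate reasoning tokens and output a true/false relevance call;
    # here we just count shared words for illustration.
    overlap = set(query.lower().split()) & set(doc.lower().split())
    return len(overlap)

ranked = rerank("do cats purr", ["dogs bark loudly", "cats purr and sleep"], toy_score)
```

The reasoning step is what makes test-time compute pay off: harder queries get longer chains before the final judgment.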
Why is this the case? Weaker IR models benefit from the additional information provided by LLMs (e.g. recall) but strong IR models (like large rerankers) lose information they need when ranking the top documents (e.g. precision).
November 18, 2024 at 10:30 AM
We show that this finding holds both in-domain and for 4 different types of distribution shift (domain, relevance, long queries, short docs) across 12 datasets.

Interestingly, these effects are the least strong on long query shift (e.g. paragraph+ sized queries, a la ArguAna).
November 18, 2024 at 10:30 AM
We conduct a comprehensive evaluation on when you should use LLM-based query and doc expansion.

It turns out there's a strong and consistent negative correlation between model performance and gains from using expansion. And it holds for all 20+ rankers we tested!
November 18, 2024 at 10:30 AM
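The negative-correlation finding can be checked with a plain Pearson correlation between each ranker's baseline score and its gain from expansion. A minimal sketch with illustrative numbers only (not figures from the paper):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative synthetic values: baseline nDCG@10 per ranker,
# and the delta from adding LLM expansion. The pattern (strong
# rankers lose, weak rankers gain) mirrors the paper's claim.
baseline = [0.30, 0.40, 0.50, 0.60, 0.70]
gain = [0.05, 0.03, 0.01, -0.02, -0.04]
r = pearson(baseline, gain)  # strongly negative
```

With the trend above, `r` comes out close to -1, which is the shape of the result reported across the 20+ rankers.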
Using LLMs for query or document expansion in retrieval (e.g. HyDE and Doc2Query) has scores going 📈

But do these approaches work for all IR models and for different types of distribution shift? Turns out it's actually more 📉 🚨

📝 (arxiv soon): orionweller.github.io/assets/pdf/L...
November 18, 2024 at 10:30 AM
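For context, the two expansion styles being evaluated work like this: HyDE expands the query at search time, while Doc2Query expands documents at index time. A minimal sketch, assuming hypothetical `generate` / `generate_queries` callables standing in for an LLM:

```python
def expand_query_hyde(query, generate):
    """HyDE-style query expansion: generate a hypothetical answer
    document and search with it alongside the original query."""
    hypothetical = generate(query)
    return f"{query} {hypothetical}"

def expand_doc_doc2query(doc, generate_queries):
    """Doc2Query-style document expansion: append predicted queries
    to the document text before indexing it."""
    return doc + " " + " ".join(generate_queries(doc))

# Stub "LLMs" for illustration only.
expanded_q = expand_query_hyde(
    "what causes tides", lambda q: "Tides are caused by the moon's gravity."
)
expanded_d = expand_doc_doc2query(
    "The moon's gravity pulls on the ocean.", lambda d: ["what causes tides"]
)
```

The thread's finding is that these expansions help weaker retrievers (more recall) but can hurt strong rerankers (lost precision), so whether to apply them depends on the base model.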