Orion Weller
@orionweller.bsky.social
PhD Student at Johns Hopkins University. Previously: Allen Institute for AI, Apple, Samaya AI. Research for #NLProc #IR
Thanks as always to my advisors/coauthors at @jhuclsp.bsky.social including @vandurme.bsky.social Dawn Lawrie, Eugene Yang, Kathryn Ricci, and Andrew Yates!
February 26, 2025 at 2:57 PM
Now, if you were asking about test-time compute **scaling**:

sadly, we didn't see any gains from adding more tokens. Perhaps that's because reranking isn't as hard as other reasoning tasks and doesn't need many tokens. Or maybe we didn't use the right incantation 🤷‍♂️

Try it out yourself!
February 26, 2025 at 2:57 PM
I'm really excited about test-time compute in IR:
- it's simple to train
- it doesn't require much data (and isn't overfit)
- it generalizes insanely well
- it just thinks differently than all other rerankers

There's so much you could do, we've just barely started!
February 26, 2025 at 2:57 PM
I was also able to do something I've been wanting to do for a while: quantize an IR model!

We quantized each of our models: rank1-32b can fit on a 24GB GPU while maintaining nearly all of its performance 🚀 🚀
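
A minimal sketch of what that 4-bit loading can look like with transformers + bitsandbytes (the repo id and exact config below are illustrative assumptions, not necessarily our setup):

```python
# A sketch of loading a ~32B reranker in 4-bit so it fits on a 24GB GPU.
# Repo id is a hypothetical placeholder; NF4 quantization via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 keeps quality close to bf16
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/quality
)
model = AutoModelForCausalLM.from_pretrained(
    "orionweller/rank1-32b",                # hypothetical repo id
    quantization_config=bnb,
    device_map="auto",
)
```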
February 26, 2025 at 2:57 PM
In IR, we typically evaluate on older TREC data, where relevance annotations were gathered by pooling older systems. But when we eval'd on DL19, our model surfaced 374% more unjudged documents in its top results than other rerankers ‼️

Turns out, nearly all of them were relevant docs - other systems just missed them!
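
A minimal sketch of that unjudged-documents analysis, assuming standard TREC-format run and qrels files (the file paths are hypothetical):

```python
# Count how many of a reranker's top-k documents have no relevance judgment
# in the TREC qrels (i.e., were never seen by the original annotation pools).
from collections import defaultdict

def load_qrels(path):
    """qrels lines: qid  iter  docid  rel"""
    judged = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, docid, _ = line.split()
            judged[qid].add(docid)
    return judged

def unjudged_at_k(run_path, judged, k=10):
    """run lines: qid  Q0  docid  rank  score  tag"""
    top = defaultdict(list)
    with open(run_path) as f:
        for line in f:
            qid, _, docid, rank, _, _ = line.split()
            top[qid].append((int(rank), docid))
    missing = total = 0
    for qid, ranked in top.items():
        for _, docid in sorted(ranked)[:k]:
            total += 1
            missing += docid not in judged[qid]
    return missing / total  # fraction of top-k docs with no judgment

judged = load_qrels("dl19-qrels.txt")            # hypothetical path
print(unjudged_at_k("rank1-dl19.run", judged))   # hypothetical run file
```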
February 26, 2025 at 2:57 PM
Despite training on English-only data and starting from base LMs (no instruction tuning), rank1 models excel at instruction following AND are inherently promptable.

They're even SOTA at multilingual instruction following, despite using no multilingual IR data 🤯
February 26, 2025 at 2:57 PM
rank1 generates a reasoning chain before the final answer (usually ~250 tokens). Try a live demo: huggingface.co/spaces/orion...

Our data (600k examples) and models are open source, check them out: huggingface.co/collections/...

📝: arxiv.org/abs/2502.18418

Keep reading to see what surprised us 😮
Rank1 Demo -- Test Time Compute in Reranking - a Hugging Face Space by orionweller
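
A minimal sketch of what pointwise reranking with a reasoning model looks like; the repo id, prompt template, and true/false output format below are illustrative assumptions rather than the exact rank1 interface (see the collection above for the real one):

```python
# Score one query-document pair: the model "thinks" first, then answers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "orionweller/rank1-7b"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

query = "what causes rainbows"
doc = "Rainbows form when sunlight is refracted and reflected in water droplets."
prompt = (
    "Determine if the passage is relevant to the query.\n"
    f"Query: {query}\nPassage: {doc}\n"
    "Think step by step, then answer true or false.\n<think>"
)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text is the reasoning chain followed by the relevance verdict.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```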
February 26, 2025 at 2:57 PM
Reposted by Orion Weller
We use this collection of tasks to propose multiple benchmarks: multilingual, code, European and Indic languages, and many more.

We find that smaller multilingual models (~500M) outperform notably larger 7B models, likely due to the latter's limited multilingual pre-training.
February 20, 2025 at 9:57 AM
Still lots of areas to improve (multilingual data anyone 👀) but really happy with how successful this was!

I've even been looking at how it works for instruction-based retrieval and turns out that having modern data helps a lot 🔥

Excited to see what you do with it!
December 19, 2024 at 9:28 PM
Dang, killer name idea @lateinteraction.bsky.social

Gonna have to up my game 💪
November 25, 2024 at 1:48 AM
It was a year-old Bluesky post, my first one 🙌

But thanks for the shoutout @mrdrozdov.com! Definitely still relevant.

There was a cool follow-up from Google as well (don't know if the authors are on 🦋): arxiv.org/pdf/2311.09175
November 25, 2024 at 1:42 AM
Ah missed this, thanks @din0s.me!!
November 23, 2024 at 9:59 PM
Also sharing one for IR/RAG!

If people want to be added DM me!

go.bsky.app/88ULgwY
November 23, 2024 at 9:18 PM
I'm super grateful to have worked with awesome internship mentors from AI2 Semantic Scholar, including @soldni.bsky.social, @kylelo.bsky.social, @armancohan.bsky.social, David Wadden, and my advisors at JHU, Ben Van Durme and Dawn Lawrie
November 18, 2024 at 10:30 AM
So when should you use LLM-based query and document expansion? Use it when your model is relatively weak or when your dataset has long queries; otherwise, it's best to just let the strong models do their thing!
November 18, 2024 at 10:30 AM
Why is this the case? Weaker IR models benefit from the additional information LLMs provide (boosting recall), but strong IR models (like large rerankers) lose information they need when ranking the top documents (hurting precision).
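
A minimal sketch of the expansion technique under discussion (query2doc/HyDE-style query expansion); the model and prompt are assumptions, and any instruction-tuned LLM would do:

```python
# Expand a query with an LLM-generated pseudo-passage, then hand the
# expanded string to the retriever in place of the raw query.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed model

def expand_query(query: str) -> str:
    prompt = f"Write a short passage that answers the query.\nQuery: {query}\nPassage:"
    out = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    passage = out[len(prompt):].strip()  # keep only the generated continuation
    return f"{query} {passage}"          # original query + pseudo-passage

print(expand_query("what causes rainbows"))
```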
November 18, 2024 at 10:30 AM
We show that this finding holds both in-domain and for 4 different types of distribution shift (domain, relevance, long queries, short docs) across 12 datasets.

Interestingly, these effects are weakest under long-query shift (e.g. paragraph+ sized queries, a la ArguAna).
November 18, 2024 at 10:30 AM
We conduct a comprehensive evaluation on when you should use LLM-based query and doc expansion.

It turns out there's a strong and consistent negative correlation between a model's baseline performance and its gains from expansion. And it holds for all 20+ rankers we tested!
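
A minimal sketch of the correlation analysis itself; the numbers below are made up purely to show the computation, not our results:

```python
# For each ranker: baseline score vs. the gain it gets from expansion.
from scipy.stats import pearsonr

baseline = [0.32, 0.41, 0.48, 0.55, 0.63, 0.71]  # e.g. nDCG@10, no expansion (fake data)
with_exp = [0.38, 0.45, 0.50, 0.55, 0.61, 0.66]  # same rankers with expansion (fake data)
gains = [w - b for b, w in zip(baseline, with_exp)]

r, p = pearsonr(baseline, gains)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")      # negative r: stronger models gain less
```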
November 18, 2024 at 10:30 AM