Orion Weller
@orionweller.bsky.social
PhD Student at Johns Hopkins University. Previously: Allen Institute for AI, Apple, Samaya AI. Research for #NLProc #IR
Thanks as always to my advisors/coauthors at @jhuclsp.bsky.social including @vandurme.bsky.social Dawn Lawrie, Eugene Yang, Kathryn Ricci, and Andrew Yates!
February 26, 2025 at 2:57 PM
Now, if you were asking about test-time compute **scaling**:

sadly, we didn't see any gains from adding more tokens. Perhaps that's because reranking isn't as hard as other reasoning tasks and doesn't need many tokens. Or maybe we didn't use the right incantation 🤷‍♂️

Try it out yourself!
February 26, 2025 at 2:57 PM
I'm really excited about test-time compute in IR:
- it's simple to train
- it doesn't require much data (and isn't overfit)
- it generalizes insanely well
- it just thinks differently than all other rerankers

There's so much you could do, we've just barely started!
February 26, 2025 at 2:57 PM
I was also able to do something I've been wanting to do for a while: quantize an IR model!

We quantized each of our models: rank1-32b can fit on a 24GB GPU while maintaining nearly all of its performance 🚀 🚀
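
A minimal sketch of what that 4-bit loading can look like with transformers + bitsandbytes (the repo id and exact config below are illustrative assumptions, not necessarily our setup):

```python
# A sketch of loading a ~32B reranker in 4-bit so it fits on a 24GB GPU.
# Repo id is a hypothetical placeholder; NF4 quantization via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 keeps quality close to bf16
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/quality
)
model = AutoModelForCausalLM.from_pretrained(
    "orionweller/rank1-32b",                # hypothetical repo id
    quantization_config=bnb,
    device_map="auto",
)
```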
February 26, 2025 at 2:57 PM
In IR, we typically evaluate on older TREC data, where relevance annotations were gathered by pooling older systems. But when we eval'd on DL19, our model surfaced 374% more unjudged documents in its top results than other rerankers ‼️

Turns out, nearly all of them were relevant docs - other systems just missed them!
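
A minimal sketch of that unjudged-documents analysis, assuming standard TREC-format run and qrels files (the file paths are hypothetical):

```python
# Count how many of a reranker's top-k documents have no relevance judgment
# in the TREC qrels (i.e., were never seen by the original annotation pools).
from collections import defaultdict

def load_qrels(path):
    """qrels lines: qid  iter  docid  rel"""
    judged = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, docid, _ = line.split()
            judged[qid].add(docid)
    return judged

def unjudged_at_k(run_path, judged, k=10):
    """run lines: qid  Q0  docid  rank  score  tag"""
    top = defaultdict(list)
    with open(run_path) as f:
        for line in f:
            qid, _, docid, rank, _, _ = line.split()
            top[qid].append((int(rank), docid))
    missing = total = 0
    for qid, ranked in top.items():
        for _, docid in sorted(ranked)[:k]:
            total += 1
            missing += docid not in judged[qid]
    return missing / total  # fraction of top-k docs with no judgment

judged = load_qrels("dl19-qrels.txt")            # hypothetical path
print(unjudged_at_k("rank1-dl19.run", judged))   # hypothetical run file
```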
February 26, 2025 at 2:57 PM
Despite training on English-only data and starting from base LMs (no instruction tuning), rank1 models excel at instruction following AND are inherently promptable.

They're even SOTA at multilingual instruction following, despite using no multilingual IR data 🤯
February 26, 2025 at 2:57 PM
rank1 generates a reasoning chain before the final answer (usually ~250 tokens). Try a live demo: huggingface.co/spaces/orion...

Our data (600k examples) and models are open source, check them out: huggingface.co/collections/...

📝: arxiv.org/abs/2502.18418

Keep reading to see what surprised us 😮
Rank1 Demo -- Test Time Compute in Reranking - a Hugging Face Space by orionweller
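
A minimal sketch of what pointwise reranking with a reasoning model looks like; the repo id, prompt template, and true/false output format below are illustrative assumptions rather than the exact rank1 interface (see the collection above for the real one):

```python
# Score one query-document pair: the model "thinks" first, then answers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "orionweller/rank1-7b"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

query = "what causes rainbows"
doc = "Rainbows form when sunlight is refracted and reflected in water droplets."
prompt = (
    "Determine if the passage is relevant to the query.\n"
    f"Query: {query}\nPassage: {doc}\n"
    "Think step by step, then answer true or false.\n<think>"
)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text is the reasoning chain followed by the relevance verdict.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```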
February 26, 2025 at 2:57 PM
Reposted by Orion Weller
We use this collection of tasks to propose multiple benchmarks: multilingual, code, European and Indic languages, and many more.

We find that smaller multilingual models (~500M) outperform notably larger 7B models, likely due to the latter's limited multilingual pre-training.
February 20, 2025 at 9:57 AM
Still lots of areas to improve (multilingual data anyone 👀) but really happy with how successful this was!

I've even been looking at how it works for instruction-based retrieval and turns out that having modern data helps a lot 🔥

Excited to see what you do with it!
December 19, 2024 at 9:28 PM
Dang, killer name idea @lateinteraction.bsky.social

Gonna have to up my game 💪
November 25, 2024 at 1:48 AM
It was a year-old Bluesky post, my first one 🙌

But thanks for the shoutout @mrdrozdov.com! Definitely still relevant.

There was a cool follow-up from Google as well (don't know if the authors are on 🦋): arxiv.org/pdf/2311.09175
November 25, 2024 at 1:42 AM
Ah missed this, thanks @din0s.me!!
November 23, 2024 at 9:59 PM
Also sharing one for IR/RAG!

If people want to be added DM me!

go.bsky.app/88ULgwY
November 23, 2024 at 9:18 PM
I'm super grateful to have worked with awesome internship mentors from AI2 Semantic Scholar, including @soldni.bsky.social, @kylelo.bsky.social, @armancohan.bsky.social, David Wadden, and my advisors at JHU, Ben Van Durme and Dawn Lawrie
November 18, 2024 at 10:30 AM
So when should you use LLM-based query and document expansion? Use it when your model is relatively weak or when your dataset has long queries; otherwise, it's best to just let the strong models do their thing!
November 18, 2024 at 10:30 AM
Why is this the case? Weaker IR models benefit from the additional information LLMs provide (boosting recall), but strong IR models (like large rerankers) lose information they need when ranking the top documents (hurting precision).
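
A minimal sketch of the expansion technique under discussion (query2doc/HyDE-style query expansion); the model and prompt are assumptions, and any instruction-tuned LLM would do:

```python
# Expand a query with an LLM-generated pseudo-passage, then hand the
# expanded string to the retriever in place of the raw query.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed model

def expand_query(query: str) -> str:
    prompt = f"Write a short passage that answers the query.\nQuery: {query}\nPassage:"
    out = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    passage = out[len(prompt):].strip()  # keep only the generated continuation
    return f"{query} {passage}"          # original query + pseudo-passage

print(expand_query("what causes rainbows"))
```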
November 18, 2024 at 10:30 AM
We show that this finding holds both in-domain and for 4 different types of distribution shift (domain, relevance, long queries, short docs) across 12 datasets.

Interestingly, these effects are weakest under long-query shift (e.g. paragraph+ sized queries, a la ArguAna).
November 18, 2024 at 10:30 AM
We conduct a comprehensive evaluation on when you should use LLM-based query and doc expansion.

It turns out there's a strong and consistent negative correlation between a model's baseline performance and its gains from expansion. And it holds for all 20+ rankers we tested!
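
A minimal sketch of the correlation analysis itself; the numbers below are made up purely to show the computation, not our results:

```python
# For each ranker: baseline score vs. the gain it gets from expansion.
from scipy.stats import pearsonr

baseline = [0.32, 0.41, 0.48, 0.55, 0.63, 0.71]  # e.g. nDCG@10, no expansion (fake data)
with_exp = [0.38, 0.45, 0.50, 0.55, 0.61, 0.66]  # same rankers with expansion (fake data)
gains = [w - b for b, w in zip(baseline, with_exp)]

r, p = pearsonr(baseline, gains)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")      # negative r: stronger models gain less
```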
November 18, 2024 at 10:30 AM