Andreas Chari
@andreaschari.bsky.social
@irglasgow.bsky.social PhD Student, University of Glasgow. Researching multilingual NLP & IR. Supervisors: @macavaney.bsky.social & @iadhounis.bsky.social. 🇨🇾 Views my own.
Check out all the details and more findings here:
arxiv.org/abs/2503.22508

I will also present this at #IR4GOOD at #ECIR2025 next week, alongside many other contributions from
@irglasgow.bsky.social. Looking forward to continuing these discussions at #SIGIR2025.
Improving Low-Resource Retrieval Effectiveness using Zero-Shot Linguistic Similarity Transfer
Globalisation and colonisation have led the vast majority of the world to use only a fraction of languages, such as English and French, to communicate, excluding many others. This has severely affecte...
arxiv.org
April 2, 2025 at 8:27 AM
Then, we try zero-shotting these fine-tuned models on other language pairs. Some are related to the French-Catalan pair, such as Occitan, and some are entirely unrelated, such as Mandarin.

We see that it does transfer to Occitan-French pairs, and also to Cantonese-Mandarin pairs (more in the paper).
April 2, 2025 at 8:27 AM
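The zero-shot step described above can be sketched as follows. This is an illustrative stand-in, not the paper's code: `evaluate_zero_shot` and the toy token-overlap "ranker" are placeholders for the actual fine-tuned BGE-M3 / ColBERT-XM models and mMARCO test sets.

```python
def evaluate_zero_shot(ranker, test_sets):
    """Apply one fixed, already fine-tuned ranker to several
    (query language, document language) test sets, with no further
    training, returning an effectiveness score per language pair."""
    return {
        pair: ranker(queries, docs)
        for pair, (queries, docs) in test_sets.items()
    }

def overlap_ranker(queries, docs):
    """Toy scorer: mean token overlap between each query and its paired
    document. Only meant to show the evaluation loop's shape."""
    scores = []
    for q, d in zip(queries, docs):
        q_tokens = set(q.lower().split())
        d_tokens = set(d.lower().split())
        scores.append(len(q_tokens & d_tokens) / max(len(q_tokens), 1))
    return sum(scores) / len(scores)
```

With a real ranker plugged in, the same loop would cover related pairs (Occitan-French) and unrelated ones (Mandarin) without retraining.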
We fine-tuned them on Catalan queries and French documents and found that we can regularise the models to be more robust on Catalan (with some gains on French too!)
April 2, 2025 at 8:27 AM
What happens if you fine-tune neural rankers such as BGE-M3 and ColBERT-XM on low-resource queries and high-resource documents from two different (albeit related) languages?

Will it regularise the rankers to exploit these linguistic similarities?
April 2, 2025 at 8:27 AM
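A minimal sketch of the training-data construction this question implies, under stated assumptions: the function name and the toy dicts are illustrative, and the (query, positive, negative) triple is the usual contrastive fine-tuning unit, not necessarily the paper's exact setup.

```python
def build_crosslingual_triples(queries_lr, qrels, docs_hr, negatives):
    """Pair each low-resource-language query (e.g. Catalan) with one
    relevant and one non-relevant high-resource-language document
    (e.g. French), yielding (query, positive, negative) triples for
    contrastively fine-tuning a neural ranker."""
    triples = []
    for qid, query in queries_lr.items():
        for pos_id in qrels.get(qid, []):
            for neg_id in negatives.get(qid, []):
                triples.append((query, docs_hr[pos_id], docs_hr[neg_id]))
    return triples
```

Feeding such cross-language triples to a contrastive loss is one way a ranker could be pushed to align low-resource query forms with high-resource document forms.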
We translated five collections of mMARCO into similar languages and evaluated how well retrieval methods hold up when the queries are expressed in a similar low-resource language while the documents stay in the high-resource one.

It turns out they do not perform very well (an understatement).
April 2, 2025 at 8:27 AM
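The experiment in this post amounts to scoring the same system under matched and mismatched query languages. A sketch, where the `evaluate` callback and the toy data are placeholders for the real mMARCO-based runs and effectiveness measures:

```python
def mismatch_effect(queries_hr, queries_lr, docs, qrels, evaluate):
    """Score one retrieval system twice over an identical collection and
    relevance judgments: once with the original high-resource queries,
    once with their translations into a related low-resource language.
    Returns both scores and the effectiveness drop."""
    score_hr = evaluate(queries_hr, docs, qrels)
    score_lr = evaluate(queries_lr, docs, qrels)
    return score_hr, score_lr, score_hr - score_lr
```

Holding documents and judgments fixed isolates the query-language mismatch as the only variable, which is what makes the drop attributable to the low-resource queries.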