Jaap Jumelet
@jumelet.bsky.social
700 followers 270 following 32 posts
Postdoc @rug.nl with Arianna Bisazza. Interested in NLP, interpretability, syntax, language acquisition and typology.
Pinned
jumelet.bsky.social
✨New paper ✨

Introducing 🌍MultiBLiMP 1.0: A Massively Multilingual Benchmark of Minimal Pairs for Subject-Verb Agreement, covering 101 languages!

We present over 125,000 minimal pairs and evaluate 17 LLMs, finding that support is still lacking for many languages.

🧵⬇️
jumelet.bsky.social
As kids (in Breda) we often played "1 keer tets", where you were allowed to let a football bounce at most once; I also had no idea that was a Brabantian word.
jumelet.bsky.social
Happening now at the SIGTYP poster session! Come talk to Leonie and me about MultiBLiMP!
Reposted by Jaap Jumelet
wzuidema.bsky.social
I'll be in Vienna only from tomorrow, but today my star PhD student Marianne is already presenting some of our work:

BLIMP-NL, in which we create a large new dataset for syntactic evaluation of Dutch LLMs, and learn a lot about dataset creation, LLM evaluation and grammatical abilities on the way.
mdhk.net
Next week I’ll be in Vienna for my first *ACL conference! 🇦🇹✨

I will present our new BLiMP-NL dataset for evaluating language models on Dutch syntactic minimal pairs and human acceptability judgments ⬇️

🗓️ Tuesday, July 29th, 16:00-17:30, Hall X4 / X5 (Austria Center Vienna)
The BLiMP-NL dataset consists of 84 Dutch minimal pair paradigms covering 22 syntactic phenomena, and comes with graded human acceptability ratings & self-paced reading times. 

An example minimal pair:
A. Ik bekijk de foto van mezelf in de kamer (I watch the photograph of myself in the room; grammatical)
B. Wij bekijken de foto van mezelf in de kamer (We watch the photograph of myself in the room; ungrammatical)

Differences in human acceptability ratings between sentences correlate with differences in model syntactic log-odds ratio scores.
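A minimal sketch of the kind of score behind that correlation, assuming the standard SLOR formulation (sentence log-probability corrected for a unigram baseline and length); the numbers below are illustrative, not from the paper:

```python
def slor(model_logprob, unigram_logprob, n_tokens):
    """Syntactic log-odds ratio: the model's log-probability of a
    sentence, normalized by a unigram baseline and sentence length."""
    return (model_logprob - unigram_logprob) / n_tokens

# Illustrative numbers: the grammatical member of a minimal pair
# typically receives a higher SLOR than the ungrammatical member.
good = slor(model_logprob=-18.0, unigram_logprob=-30.0, n_tokens=9)
bad = slor(model_logprob=-24.0, unigram_logprob=-30.0, n_tokens=9)
delta = good - bad  # this difference is what correlates with human ratings
```

The unigram correction matters because raw log-probabilities penalize sentences with rare words, which would otherwise swamp the syntactic signal.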
jumelet.bsky.social
Congrats and good luck in Canada!
Reposted by Jaap Jumelet
casperalbers.nl
I don't understand why there isn't more of an uproar about this:

Bringing in American scientists is being paid for by denying Dutch academics an inflation correction on their salaries.

1/2
jumelet.bsky.social
Ohh cool! Nice to see the interactions-as-structure idea I had back in 2021 is still being explored!
Reposted by Jaap Jumelet
catherinearnett.bsky.social
My paper with @tylerachang.bsky.social and @jamichaelov.bsky.social will appear at #ACL2025NLP! The updated preprint is available on arxiv. I look forward to chatting about bilingual models in Vienna!
catherinearnett.bsky.social
✨New pre-print✨ Crosslingual transfer allows models to leverage their representations for one language to improve performance on another language. We characterize the acquisition of shared representations in order to better understand how and when crosslingual transfer happens.
Reposted by Jaap Jumelet
mdlhx.bsky.social
Interested in multilingual tokenization in #NLP? Lisa Beinborn and I are hiring!

PhD candidate position in Göttingen, Germany: www.uni-goettingen.de/de/644546.ht...

PostDoc position in Leuven, Belgium:
www.kuleuven.be/personeel/jo...

Deadline 6th of June
Reposted by Jaap Jumelet
blackboxnlp.bsky.social
BlackboxNLP, the leading workshop on interpretability and analysis of language models, will be co-located with EMNLP 2025 in Suzhou this November! 📆

This edition will feature a new shared task on circuits/causal variable localization in LMs, details here: blackboxnlp.github.io/2025/task
Reposted by Jaap Jumelet
lchoshen.bsky.social
Close your books, test time!
The evaluation pipelines are out, baselines are released & the challenge is on

There is still time to join and
We are excited to learn from you on pretraining and human-model gaps

*Don't forget to fastEval on checkpoints
github.com/babylm/evalu...
📈🤖🧠
#AI #LLMS
jumelet.bsky.social
Sharply written and I fully agree, but it's a bit ironic that the message sits behind a 450-euro paywall :') (thanks for the screenshots!)
Reposted by Jaap Jumelet
jiruiqi.bsky.social
✨ New Paper ✨
[1/] Retrieving passages from many languages can boost retrieval augmented generation (RAG) performance, but how good are LLMs at dealing with multilingual contexts in the prompt?

📄 Check it out: arxiv.org/abs/2504.00597
(w/ @arianna-bis.bsky.social @Raquel_Fernández)

#NLProc
jumelet.bsky.social
That is definitely possible indeed, and a potential confounding factor. In RuBLiMP, a Russian benchmark, they defined a way to validate this based on LM probabilities, but we left that open for future work. The poor performance on low-resource languages shows they're definitely not trained on all of UD though!
Reposted by Jaap Jumelet
arianna-bis.bsky.social
Modern LLMs "speak" hundreds of languages... but do they really?
Multilinguality claims are often based on downstream tasks like QA & MT, while *formal* linguistic competence remains hard to gauge in lots of languages

Meet MultiBLiMP!
(joint work w/ @jumelet.bsky.social & @weissweiler.bsky.social)
jumelet.bsky.social
Person agreement is easier to model than Gender or Number. Sentences with higher overall perplexity lead to less accurate judgements, and models are more likely to pick the wrong inflection if it is split into more tokens. Surprisingly, subject-verb distance has no effect.
jumelet.bsky.social
We find that boosting specific languages works, but only if you pre-train, not post-train: EuroLLM outperforms same-size Llama3 on its target languages, but Aya is not significantly better. Neither of them significantly outperforms Llama3 on languages not intentionally included.
jumelet.bsky.social
We evaluate 17 Language Models, among them Llama 3, Aya, and Gemma 3.
Overall, Llama3 70B and Gemma 27B perform best, but the monolingual 500M Goldfish models significantly outperform them in 14 languages!

Base models consistently outperform their instruction-tuned counterparts.
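The evaluation itself boils down to a forced choice between the two members of each pair. A minimal sketch, with a toy dictionary standing in for a real LM scorer (the sentences and scores here are illustrative):

```python
def minimal_pair_accuracy(pairs, logprob):
    """Share of pairs where the model assigns a higher log-probability
    to the grammatical sentence than to its ungrammatical twin."""
    correct = sum(logprob(good) > logprob(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy stand-in for a real LM scorer (the actual models are Llama 3 etc.).
toy_scores = {
    "Ik bekijk de foto van mezelf": -20.0,
    "Wij bekijken de foto van mezelf": -26.0,
    "She walks home": -12.0,
    "She walk home": -11.0,  # a toy failure case: the bad form scores higher
}
pairs = [
    ("Ik bekijk de foto van mezelf", "Wij bekijken de foto van mezelf"),
    ("She walks home", "She walk home"),
]
acc = minimal_pair_accuracy(pairs, toy_scores.get)
```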
jumelet.bsky.social
We create 125,000 pairs for 101 languages and six types of agreement, resulting in high diversity across phenomena, typological families, geography, amount of resources available, sentence length, and word frequencies. 43 of our languages are not Indo-European.
jumelet.bsky.social
MultiBLiMP is created automatically using Universal Dependencies and Universal Morphology.

We search for subject-verb or -participle pairs with our target features Number, Person, and Gender in UD, then insert the word with the opposite feature value to form a minimal pair.
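A hypothetical sketch of that pipeline on a tiny CoNLL-U sentence, with a toy UniMorph-style inflection table; the column layout follows the CoNLL-U spec (FORM=1, LEMMA=2, FEATS=5, HEAD=6, DEPREL=7), but the helper names and example data are illustrative, not the paper's actual code:

```python
def parse_conllu(block):
    """Parse a CoNLL-U sentence block into token dicts."""
    rows = []
    for line in block.strip().split("\n"):
        cols = line.split("\t")
        feats = (dict(f.split("=") for f in cols[5].split("|"))
                 if cols[5] != "_" else {})
        rows.append({"id": int(cols[0]), "form": cols[1], "lemma": cols[2],
                     "feats": feats, "head": int(cols[6]), "deprel": cols[7]})
    return rows

def make_minimal_pair(rows, inflections, feature="Number"):
    """Find a verb with an nsubj dependent, then swap in the form with
    the opposite feature value to create the ungrammatical sentence."""
    for tok in rows:
        if tok["deprel"] == "nsubj":
            verb = next(t for t in rows if t["id"] == tok["head"])
            value = verb["feats"].get(feature)
            if value in ("Sing", "Plur"):
                flipped = "Plur" if value == "Sing" else "Sing"
                bad_form = inflections[(verb["lemma"], flipped)]
                good = " ".join(t["form"] for t in rows)
                bad = " ".join(bad_form if t is verb else t["form"]
                               for t in rows)
                return good, bad

sentence = (
    "1\tShe\tshe\tPRON\t_\tNumber=Sing|Person=3\t2\tnsubj\t_\t_\n"
    "2\twalks\twalk\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_"
)
inflections = {("walk", "Plur"): "walk"}  # toy UniMorph-style lookup
pair = make_minimal_pair(parse_conllu(sentence), inflections)
```

The same flip generalizes to Person and Gender by changing the `feature` argument, given an inflection table covering those values.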