Arianna Bisazza
@arianna-bis.bsky.social
Associate Professor at GroNLP (@gronlp.bsky.social) #NLP | Multilingualism | Interpretability | Language Learning in Humans vs Neural Nets | Mum^2 | Head of the InClow research group: https://inclow-lm.github.io/
Reposted by Arianna Bisazza
veraneplenbroek.bsky.social
Delighted to share that our paper "Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization" (joint work with @arianna-bis.bsky.social and Raquel Fernández) got accepted to the main conference of #EMNLP

Can't wait to discuss our work at #EMNLP2025 in Suzhou this November!
veraneplenbroek.bsky.social
Do LLMs assume demographic information based on stereotypes?

We (@arianna-bis.bsky.social, Raquel Fernández and I) answered this question in our new paper: "Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization".

🧵

arxiv.org/abs/2505.16467
arianna-bis.bsky.social
We hope our work will advance the evaluation of LLMs in Turkish and, in general, encourage more research on the robustness of modern language technologies to typological diversity.
arianna-bis.bsky.social
Finally, our experimental paradigms reveal that even LLMs excelling on general minimal pairs can be brittle to variations in word orders & subordination strategies, unlike human speakers.

See paper for results with 13 LLMs, including mono- and multilingual models of different sizes!
arianna-bis.bsky.social
We also collect human acceptability judgements & show that phenomena which are harder for LLMs are, *overall*, also harder for people, though with some notable exceptions.
arianna-bis.bsky.social
TurBLiMP expands the shortlist of existing language-specific BLiMPs with a language showing 2 important properties: high word order freedom & agglutination.

To study LLMs' robustness to these properties, we create experimental paradigms testing syntactic skills w/ different word orders & subordination strategies:
arianna-bis.bsky.social
This is hard, slow-paced work going well beyond benchmark translation (let alone LLM-assisted benchmark generation!). It requires real *linguistic* expertise & long discussions on what makes a phenomenon representative of a language. Here's our proposal, inspired by the English BLiMP w/ major adaptations:
arianna-bis.bsky.social
Grammatical benchmarks are essential to drive progress in truly multilingual Language Modeling & to overcome the linguistic biases we inherit from the English-centeredness of our field.

I'm particularly happy to contribute to this for a language I spent years learning and still find fascinating!
arianna-bis.bsky.social
Happy to hear you find the analysis useful, Marco! If you have any extra questions, don’t hesitate to contact @jiruiqi.bsky.social
arianna-bis.bsky.social
One step further in our quest to bring interpretability techniques to the service of MT end users: Are uncertainty- & model-internals-based metrics a viable alternative to supervised word-level quality estimation?

New paper w/ @gsarti.com
@zouharvi.bsky.social @malvinanissim.bsky.social
gsarti.com
📢 New paper: Can unsupervised metrics extracted from MT models detect their translation errors reliably? Do annotators even *agree* on what constitutes an error? 🧐

We compare uncertainty- and interp-based WQE metrics across 12 directions, with some surprising findings!

🧵 1/
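The unsupervised WQE idea above can be reduced to a very simple baseline: flag translation tokens the model was unsure about as likely errors, then compare the flags to gold error annotations. A minimal sketch under that assumption (the per-token probabilities, threshold, and labels below are made up for illustration; the paper's actual metrics are more sophisticated):

```python
# Hedged sketch of an unsupervised word-level quality estimation baseline:
# tokens with low model confidence are flagged as likely translation errors.

def flag_errors(token_probs, threshold=0.3):
    """1 = flagged as a likely error (low confidence), 0 = kept."""
    return [1 if p < threshold else 0 for p in token_probs]

def f1(pred, gold):
    """F1 of predicted error flags against gold error labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

probs = [0.9, 0.8, 0.1, 0.95, 0.2]  # hypothetical per-token probabilities
gold  = [0,   0,   1,   0,    1]    # hypothetical gold error labels
pred = flag_errors(probs)
print(f1(pred, gold))  # → 1.0
```

In practice the per-token probabilities would come from the MT model itself (or from attention/attribution scores, for the interp-based variants), and the threshold would need calibration per language pair.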
arianna-bis.bsky.social
Large Reasoning Models are raising the bar for answer accuracy & transparency, but how does that work in multilingual settings? Can LRMs reason in your language, and what does that entail?

New preprint led by @jiruiqi.bsky.social and @shan23chen.bsky.social!
jiruiqi.bsky.social
[1/]💡New Paper
Large reasoning models (LRMs) are strong in English — but how well do they reason in your language?

Our latest work uncovers their limitations and a clear trade-off:
Controlling Thinking Trace Language Comes at the Cost of Accuracy

📄Link: arxiv.org/abs/2505.22888
arianna-bis.bsky.social
Proud to share the first key output of my Vidi project team w/ @frap98.bsky.social @jumelet.bsky.social @yevgenm.bsky.social, who all took this topic to heart, as evidenced by the many overtime discussions at lunchtime 😉

See Francesca’s thread & arXiv link below
arianna-bis.bsky.social
Excited to see how the BabyLM community will take on this challenge @alexwarstadt.bsky.social @lchoshen.bsky.social @tallinzen.bsky.social @fourtassi.bsky.social and many more
arianna-bis.bsky.social
While disappointing, this result makes us reflect once again on the many non-human-like aspects of current LMs. It also prompts us to keep searching for more sophisticated ways to solve the puzzle of efficient language learning, which makes children such a fascinating object of study.
arianna-bis.bsky.social
Following the success story of BabyBERTa, I & many other NLPers have turned to language acquisition for inspiration. In this new paper we show that using Child-Directed Language as training data is unfortunately *not* beneficial for syntax learning, at least not in the traditional LM training regime.
arianna-bis.bsky.social
Think your LLM treats you just like an average user? Think again!
@veraneplenbroek.bsky.social's analysis shows LLMs behave differently according to your gender, race & more. Implicit personalization is always at work & is strongly based on your conversation topics.
Great collab w/ Raquel Fernández ⤵️
veraneplenbroek.bsky.social
Do LLMs assume demographic information based on stereotypes?

We (@arianna-bis.bsky.social, Raquel Fernández and I) answered this question in our new paper: "Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization".

🧵

arxiv.org/abs/2505.16467
arianna-bis.bsky.social
Happy to be part of this collaboration on personalizing translation style in the literary domain. Besides classical multi-shot prompting, various steering techniques show promising results & bring new insights! See thread ⤵️

W/ @danielsc4.it @gsarti.com Elisabetta Fersini, @malvinanissim.bsky.social
Reposted by Arianna Bisazza
veraneplenbroek.bsky.social
Excited to share that "Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation" arxiv.org/abs/2412.14050 got accepted to ACL Findings! 🎉 #ACL2025 Big thanks to my supervisors Raquel Fernández and @arianna-bis.bsky.social for their guidance and support!
arianna-bis.bsky.social
RAG is a powerful way to improve LLMs' answering abilities across many languages. But how do LLMs deal with multilingual contexts? Do they answer consistently when the retrieved info is provided to them in different languages?

Joint work w/ @jiruiqi.bsky.social & Raquel Fernández
See thread! ⤵️
jiruiqi.bsky.social
✨ New Paper ✨
[1/] Retrieving passages from many languages can boost retrieval augmented generation (RAG) performance, but how good are LLMs at dealing with multilingual contexts in the prompt?

📄 Check it out: arxiv.org/abs/2504.00597
(w/ @arianna-bis.bsky.social, Raquel Fernández)

#NLProc
arianna-bis.bsky.social
Importantly, MultiBLiMP is also a pipeline to construct minimal pairs automatically from Universal Dependencies treebanks, which we hope to extend to many more syntactic phenomena in future collaborative efforts (reach out if interested in this!)
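The pipeline idea above can be illustrated on a toy scale: read a CoNLL-U sentence, locate a verb carrying a Number feature, and swap in a wrongly-agreeing form to get the ungrammatical half of the pair. Everything here is a hypothetical sketch (the sentence, the `ALT_FORM` table, and the helper name are mine); a real pipeline like MultiBLiMP's would draw alternative inflections from proper morphological resources.

```python
# Hedged sketch: building a subject-verb agreement minimal pair from a
# CoNLL-U sentence by replacing the agreeing verb with a hypothetical
# non-agreeing form.

CONLLU = (
    "1\tthe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
)

ALT_FORM = {"sleeps": "sleep"}  # hypothetical table of non-agreeing forms

def make_pair(conllu):
    """Return (grammatical, ungrammatical) sentences, or None if no target verb."""
    rows = [line.split("\t") for line in conllu.strip().splitlines()]
    tokens = [r[1] for r in rows]           # column 2 = word form
    grammatical = " ".join(tokens)
    for r in rows:
        # column 4 = UPOS, column 6 = morphological features
        if r[3] == "VERB" and "Number=" in r[5] and r[1] in ALT_FORM:
            bad = tokens.copy()
            bad[int(r[0]) - 1] = ALT_FORM[r[1]]  # column 1 = token id
            return grammatical, " ".join(bad)
    return None

print(make_pair(CONLLU))  # → ('the cat sleeps', 'the cat sleep')
```

Because UD treebanks mark both the dependency relation (`nsubj`) and the agreement features, the same recipe scales to any language with an annotated treebank, which is what makes the 101-language coverage feasible.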
arianna-bis.bsky.social
To scale up current syntactic evaluation practices, we introduce a massively multilingual (n=101) benchmark of Minimal Pairs for subject-verb agreement, going well beyond the breadth of existing cross-lingual benchmarks of this kind (e.g. CLAMS @amuuueller.bsky.social @tallinzen.bsky.social)
arianna-bis.bsky.social
Modern LLMs "speak" hundreds of languages... but do they really?
Multilinguality claims are often based on downstream tasks like QA & MT, while *formal* linguistic competence remains hard to gauge in lots of languages

Meet MultiBLiMP!
(joint work w/ @jumelet.bsky.social & @weissweiler.bsky.social)
jumelet.bsky.social
✨New paper ✨

Introducing 🌍MultiBLiMP 1.0: A Massively Multilingual Benchmark of Minimal Pairs for Subject-Verb Agreement, covering 101 languages!

We present over 125,000 minimal pairs and evaluate 17 LLMs, finding that support is still lacking for many languages.

🧵⬇️
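Minimal-pair evaluation as described in the thread above boils down to checking whether a model scores the grammatical sentence of each pair higher than its ungrammatical twin. A minimal sketch of that scoring loop, with a made-up unigram scorer standing in for a real LLM (the helper names and frequencies are hypothetical; in practice `logprob_fn` would sum token log-probs from a causal LM):

```python
import math

def prefers_grammatical(logprob_fn, grammatical, ungrammatical):
    """True if the model assigns the grammatical sentence a higher log-probability."""
    return logprob_fn(grammatical) > logprob_fn(ungrammatical)

def accuracy(logprob_fn, pairs):
    """Fraction of minimal pairs where the grammatical variant wins."""
    return sum(prefers_grammatical(logprob_fn, g, u) for g, u in pairs) / len(pairs)

# Toy stand-in for a real LM: a unigram model over hypothetical frequencies.
FREQ = {"the": 0.2, "cat": 0.05, "sleeps": 0.03, "sleep": 0.01}

def toy_logprob(sentence):
    return sum(math.log(FREQ.get(w, 1e-9)) for w in sentence.lower().split())

pairs = [("The cat sleeps", "The cat sleep")]
print(accuracy(toy_logprob, pairs))  # → 1.0
```

The benchmark's per-language accuracy is just this number computed over all pairs for that language, which is how the 17 LLMs in the paper can be compared on equal footing.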
arianna-bis.bsky.social
If you'd like to know how our framework can be used to simulate the emergence of these & other language universals w/ small neural nets + communication games + artificial languages, see our latest paper presented at CoNLL last year!

aclanthology.org/2024.conll-1...
(w/ Yuchen Lian & Tessa Verhoef)
NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication
Yuchen Lian, Tessa Verhoef, Arianna Bisazza. Proceedings of the 28th Conference on Computational Natural Language Learning. 2024.