Arianna Bisazza
@arianna-bis.bsky.social
Associate Professor at GroNLP (@gronlp.bsky.social) #NLP | Multilingualism | Interpretability | Language Learning in Humans vs Neural Nets | Mum^2 | Head of the InClow research group: https://inclow-lm.github.io/
Reposted by Arianna Bisazza
veraneplenbroek.bsky.social
Delighted to share that our paper "Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization" (joint work with @arianna-bis.bsky.social and Raquel Fernández) got accepted to the main conference of #EMNLP

Can't wait to discuss our work at #EMNLP2025 in Suzhou this November!
veraneplenbroek.bsky.social
Do LLMs assume demographic information based on stereotypes?

We (@arianna-bis.bsky.social, Raquel Fernández and I) answered this question in our new paper: "Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization".

🧵

arxiv.org/abs/2505.16467
arianna-bis.bsky.social
We hope our work will advance the evaluation of LLMs in Turkish and, in general, encourage more research on the robustness of modern language technologies to typological diversity.
arianna-bis.bsky.social
Finally, our experimental paradigms reveal that even LLMs excelling on general minimal pairs can be brittle to variations in word orders & subordination strategies, unlike human speakers.

See paper for results with 13 LLMs, including mono- and multilingual models of different sizes!
arianna-bis.bsky.social
We also collect human acceptability judgements & show that phenomena which are harder for LLMs are, *overall*, also harder for people, though with some notable exceptions.
arianna-bis.bsky.social
TurBLiMP expands the shortlist of existing language-specific BLiMPs with a language showing 2 important properties: high word order freedom & agglutination.

To study LLMs' robustness to these properties, we create experimental paradigms testing syntactic skills w/ different word orders & subordination strategies:
arianna-bis.bsky.social
This is hard, slow-paced work going well beyond benchmark translation (let alone LLM-assisted benchmark generation!). It requires real *linguistic* expertise & long discussions on what makes a phenomenon representative of a language. Here's our proposal, inspired by the English BLiMP w/ major adaptations:
arianna-bis.bsky.social
Grammatical benchmarks are essential to drive progress in truly multilingual Language Modeling & to overcome the linguistic biases we inherit from the English-centeredness of our field.

I'm particularly happy to contribute to this for a language I spent years learning and still find fascinating!
arianna-bis.bsky.social
Happy to hear you find the analysis useful, Marco! If you have any extra questions, don’t hesitate to contact @jiruiqi.bsky.social
arianna-bis.bsky.social
One step further in our quest to bring interpretability techniques to the service of MT end users: Are uncertainty- & model-internals-based metrics a viable alternative to supervised word-level quality estimation?

New paper w/ @gsarti.com
@zouharvi.bsky.social @malvinanissim.bsky.social
gsarti.com
📢 New paper: Can unsupervised metrics extracted from MT models detect their translation errors reliably? Do annotators even *agree* on what constitutes an error? 🧐

We compare uncertainty- and interp-based WQE metrics across 12 directions, with some surprising findings!

🧵 1/
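The unsupervised WQE idea above can be reduced to a very simple baseline: flag translation tokens the model was unsure about as likely errors, then compare the flags to gold error annotations. A minimal sketch under that assumption (the per-token probabilities, threshold, and labels below are made up for illustration; the paper's actual metrics are more sophisticated):

```python
# Hedged sketch of an unsupervised word-level quality estimation baseline:
# tokens with low model confidence are flagged as likely translation errors.

def flag_errors(token_probs, threshold=0.3):
    """1 = flagged as a likely error (low confidence), 0 = kept."""
    return [1 if p < threshold else 0 for p in token_probs]

def f1(pred, gold):
    """F1 of predicted error flags against gold error labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

probs = [0.9, 0.8, 0.1, 0.95, 0.2]  # hypothetical per-token probabilities
gold  = [0,   0,   1,   0,    1]    # hypothetical gold error labels
pred = flag_errors(probs)
print(f1(pred, gold))  # → 1.0
```

In practice the per-token probabilities would come from the MT model itself (or from attention/attribution scores, for the interp-based variants), and the threshold would need calibration per language pair.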
arianna-bis.bsky.social
Large Reasoning Models are raising the bar for answer accuracy & transparency, but how does that work in multilingual settings? Can LRMs reason in your language, and what does that entail?

New preprint led by @jiruiqi.bsky.social and @shan23chen.bsky.social!
jiruiqi.bsky.social
[1/]💡New Paper
Large reasoning models (LRMs) are strong in English — but how well do they reason in your language?

Our latest work uncovers their limitations and a clear trade-off:
Controlling Thinking Trace Language Comes at the Cost of Accuracy

📄Link: arxiv.org/abs/2505.22888
arianna-bis.bsky.social
Proud to share the first key output of my Vidi project team w/ @frap98.bsky.social @jumelet.bsky.social @yevgenm.bsky.social, who all took this topic to heart, as evidenced by the many overtime discussions at lunchtime 😉

See Francesca’s thread & arXiv link below
arianna-bis.bsky.social
Excited to see how the BabyLM community will take on this challenge @alexwarstadt.bsky.social @lchoshen.bsky.social @tallinzen.bsky.social @fourtassi.bsky.social and many more
arianna-bis.bsky.social
While disappointing, this result makes us reflect once again on the many non-human-like aspects of current LMs. It also prompts us to keep searching for more sophisticated ways to solve the puzzle of efficient language learning, which makes children such a fascinating object of study.
arianna-bis.bsky.social
Following the success story of BabyBERTa, I & many other NLPers have turned to language acquisition for inspiration. In this new paper we show that using Child-Directed Language as training data is unfortunately *not* beneficial for syntax learning, at least not in the traditional LM training regime.
arianna-bis.bsky.social
Think your LLM treats you just like an average user? Think again!
@veraneplenbroek.bsky.social's analysis shows LLMs behave differently according to your gender, race & more. Implicit personalization is always at work & is strongly based on your conversation topics.
Great collab w/ Raquel Fernández ⤵️
veraneplenbroek.bsky.social
Do LLMs assume demographic information based on stereotypes?

We (@arianna-bis.bsky.social, Raquel Fernández and I) answered this question in our new paper: "Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization".

🧵

arxiv.org/abs/2505.16467
arianna-bis.bsky.social
Happy to be part of this collaboration on personalizing translation style in the literary domain. Besides classical multi-shot prompting, various steering techniques show promising results & bring new insights! See thread ⤵️

W/ @danielsc4.it @gsarti.com Elisabetta Fersini, @malvinanissim.bsky.social
Reposted by Arianna Bisazza
veraneplenbroek.bsky.social
Excited to share that "Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation" arxiv.org/abs/2412.14050 got accepted to ACL Findings! 🎉 #ACL2025 Big thanks to my supervisors Raquel Fernández and @arianna-bis.bsky.social for their guidance and support!
arianna-bis.bsky.social
RAG is a powerful way to improve LLMs' answering abilities across many languages. But how do LLMs deal with multilingual contexts? Do they answer consistently when the retrieved info is provided to them in different languages?

Joint work w/ @jiruiqi.bsky.social & Raquel Fernández
See thread! ⤵️
jiruiqi.bsky.social
✨ New Paper ✨
[1/] Retrieving passages from many languages can boost retrieval augmented generation (RAG) performance, but how good are LLMs at dealing with multilingual contexts in the prompt?

📄 Check it out: arxiv.org/abs/2504.00597
(w/ @arianna-bis.bsky.social, Raquel Fernández)

#NLProc
arianna-bis.bsky.social
Importantly, MultiBLiMP is also a pipeline to construct minimal pairs automatically from Universal Dependencies treebanks, which we hope to extend to many more syntactic phenomena in future collaborative efforts (reach out if interested in this!)
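The pipeline idea above can be illustrated on a toy scale: read a CoNLL-U sentence, locate a verb carrying a Number feature, and swap in a wrongly-agreeing form to get the ungrammatical half of the pair. Everything here is a hypothetical sketch (the sentence, the `ALT_FORM` table, and the helper name are mine); a real pipeline like MultiBLiMP's would draw alternative inflections from proper morphological resources.

```python
# Hedged sketch: building a subject-verb agreement minimal pair from a
# CoNLL-U sentence by replacing the agreeing verb with a hypothetical
# non-agreeing form.

CONLLU = (
    "1\tthe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
)

ALT_FORM = {"sleeps": "sleep"}  # hypothetical table of non-agreeing forms

def make_pair(conllu):
    """Return (grammatical, ungrammatical) sentences, or None if no target verb."""
    rows = [line.split("\t") for line in conllu.strip().splitlines()]
    tokens = [r[1] for r in rows]           # column 2 = word form
    grammatical = " ".join(tokens)
    for r in rows:
        # column 4 = UPOS, column 6 = morphological features
        if r[3] == "VERB" and "Number=" in r[5] and r[1] in ALT_FORM:
            bad = tokens.copy()
            bad[int(r[0]) - 1] = ALT_FORM[r[1]]  # column 1 = token id
            return grammatical, " ".join(bad)
    return None

print(make_pair(CONLLU))  # → ('the cat sleeps', 'the cat sleep')
```

Because UD treebanks mark both the dependency relation (`nsubj`) and the agreement features, the same recipe scales to any language with an annotated treebank, which is what makes the 101-language coverage feasible.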
arianna-bis.bsky.social
To scale up current syntactic evaluation practices, we introduce a massively multilingual (n=101) benchmark of Minimal Pairs for subject-verb agreement, going well beyond the breadth of existing cross-lingual benchmarks of this kind (e.g. CLAMS @amuuueller.bsky.social @tallinzen.bsky.social)
arianna-bis.bsky.social
Modern LLMs "speak" hundreds of languages... but do they really?
Multilinguality claims are often based on downstream tasks like QA & MT, while *formal* linguistic competence remains hard to gauge in lots of languages

Meet MultiBLiMP!
(joint work w/ @jumelet.bsky.social & @weissweiler.bsky.social)
jumelet.bsky.social
✨New paper ✨

Introducing 🌍MultiBLiMP 1.0: A Massively Multilingual Benchmark of Minimal Pairs for Subject-Verb Agreement, covering 101 languages!

We present over 125,000 minimal pairs and evaluate 17 LLMs, finding that support is still lacking for many languages.

🧵⬇️
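Minimal-pair evaluation as described in the thread above boils down to checking whether a model scores the grammatical sentence of each pair higher than its ungrammatical twin. A minimal sketch of that scoring loop, with a made-up unigram scorer standing in for a real LLM (the helper names and frequencies are hypothetical; in practice `logprob_fn` would sum token log-probs from a causal LM):

```python
import math

def prefers_grammatical(logprob_fn, grammatical, ungrammatical):
    """True if the model assigns the grammatical sentence a higher log-probability."""
    return logprob_fn(grammatical) > logprob_fn(ungrammatical)

def accuracy(logprob_fn, pairs):
    """Fraction of minimal pairs where the grammatical variant wins."""
    return sum(prefers_grammatical(logprob_fn, g, u) for g, u in pairs) / len(pairs)

# Toy stand-in for a real LM: a unigram model over hypothetical frequencies.
FREQ = {"the": 0.2, "cat": 0.05, "sleeps": 0.03, "sleep": 0.01}

def toy_logprob(sentence):
    return sum(math.log(FREQ.get(w, 1e-9)) for w in sentence.lower().split())

pairs = [("The cat sleeps", "The cat sleep")]
print(accuracy(toy_logprob, pairs))  # → 1.0
```

The benchmark's per-language accuracy is just this number computed over all pairs for that language, which is how the 17 LLMs in the paper can be compared on equal footing.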
arianna-bis.bsky.social
If you'd like to know how our framework can be used to simulate the emergence of these & other language universals w/ small neural nets + communication games + artificial languages, see our latest paper presented at CoNLL last year!

aclanthology.org/2024.conll-1...
(w/ Yuchen Lian & Tessa Verhoef)
NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication
Yuchen Lian, Tessa Verhoef, Arianna Bisazza. Proceedings of the 28th Conference on Computational Natural Language Learning. 2024.