Lightnews — Scholar-powered news

Jindřich Libovický @jlibovicky.bsky.social · Sep 1

Most vision-language models only work in English. We explore how different parallel data types (machine-translated vs authentic captions) affect cross-lingual transfer. Key finding: authentic data can outperform machine translation, and multilingual training beats bilingual approaches. #NLP

Jindřich Libovický @jlibovicky.bsky.social · Sep 1

So proud of my PhD student @andrei-a-manea.bsky.social for his first first-author publication! 🎉 He presented this work last week at TSD.
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders: arxiv.org/pdf/2504.21681

Jindřich Libovický @jlibovicky.bsky.social · Aug 25

For evaluation researchers: Simple string-overlap metrics (BLEU, chrF) work surprisingly well for factual QA. 🤔 When answers are mostly named entities, exact matches matter more than we thought.

LLM-as-judge 🦙🧑‍⚖️ correlates best with human judgment, though.

Jindřich Libovický @jlibovicky.bsky.social · Aug 25

The results are... humbling 😅
Even the best models:

>40% accuracy on textual questions
<30% on visual questions
Often perform better in English than the local language (!!)

Visual QA with regional images is especially challenging.

Jindřich Libovický @jlibovicky.bsky.social · Aug 25

The problem: Most QA benchmarks focus on globally known facts. But real users ask about local geography, culture, and history.

We collected questions from native speakers in Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦 about facts locals know but outsiders don't.

Jindřich Libovický @jlibovicky.bsky.social · Aug 25

🧵 We're releasing CUS-QA - a new benchmark for testing LLMs on regional knowledge!
Find out what your model knows about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦!
👉 Textual and visual questions, answers, and human judgment on model outputs!
huggingface.co/datasets/ufa...
www.arxiv.org/abs/2507.22752

ufal/cus-qa · Datasets at Hugging Face

huggingface.co

Jindřich Libovický @jlibovicky.bsky.social · Aug 1

Stay tuned, we will release the dataset soon...

Institute of Formal and Applied Linguistics @ufal.mff.cuni.cz · Jul 31

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset arxiv.org/abs/2507.22752
by @jlibovicky.bsky.social, @jindrahelcl.bsky.social, @andrei-a-manea.bsky.social
Questions that foreigners don't know the answer to + human judgment on question generation

Reposted by Jindřich Libovický

Jindra Helcl @jindrahelcl.bsky.social · Jul 29

We need to have poster fights at the end of every conference.

Jindřich Libovický @jlibovicky.bsky.social · Jul 28

Just presented MAGBIG, a new dataset and evaluation methodology for gender bias in multilingual text-to-image generation. Grammatical gender matters when studying these biases across languages!
Thanks to Felix Friedrich, @kathaem.bsky.social and all co-authors - it was fun to work on this together!

Institute of Formal and Applied Linguistics @ufal.mff.cuni.cz · Jul 28

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
aclanthology.org/2025.acl-lon...
by Felix Friedrich, @kathaem.bsky.social, Patrick Schramowski, @mbrackaiml.bsky.social, @jlibovicky.bsky.social, @kerstingaiml.bsky.social, Alex Fraser

Jindřich Libovický @jlibovicky.bsky.social · Jul 27

This week I am at #ACL2025NLP in Vienna 🎡🇦🇹. Find me 🕵️ or message 💌 me if you want to chat about multilinguality or tokenization. Stop 🛑 by our poster on gender bias in text-to-image generation on Monday aclanthology.org/2025.acl-lon...

Reposted by Jindřich Libovický

Tokenization Workshop (TokShop) @ICML2025 @tokshop.bsky.social · Jun 2

TokShop @ #ICML2025 got way more submissions than expected! 📈 We could really use a few more reviewers to help out. If you have the capacity to review a #tokenization paper by Saturday, please fill out this form: forms.gle/32A6sQHQrMSb... 🙏

TokShop 2025

Registering interest in all things tokenization at TokShop @ ICML 2025 (July 18). Consider joining the Google group for future updates! https://groups.google.com/g/tokshop

forms.gle

Reposted by Jindřich Libovický

Tokenization Workshop (TokShop) @ICML2025 @tokshop.bsky.social · May 14

📣 Call for Papers Alert: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.

ICML 2025 Workshop TokShop

Welcome to the OpenReview homepage for ICML 2025 Workshop TokShop

openreview.net

Reposted by Jindřich Libovický

Tokenization Workshop (TokShop) @ICML2025 @tokshop.bsky.social · May 4

Got a tokenization paper that just didn't make the cut for ICML? Submit it to the Tokenization Workshop TokShop at #ICML2025 -- we'd love to see it there!
tokenization-workshop.github.io

Jindřich Libovický @jlibovicky.bsky.social · Apr 30

If you are attending the virtual NAACL day on May 6, 5 pm Central European Time, don't miss @kathaem.bsky.social presenting our work on the importance of semantic token overlap in multilingual language models. aclanthology.org/2025.naacl-s...

Beyond Literal Token Overlap: Token Alignability for Multilinguality

Katharina Hämmerl, Tomasz Limisiewicz, Jindřich Libovický, Alexander Fraser. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics:...

aclanthology.org

Jindřich Libovický @jlibovicky.bsky.social · Apr 30

Attending #NAACL2025 virtually. Since 2022, I've been training a classifier on papers I read to tackle the arXiv madness. Ran it on the NAACL proceedings for my personalized watch list. 🤓📺 However, it's far from perfect: Multilingual cultural awareness is great, but where is tokenization? 🤷

Jindřich Libovický @jlibovicky.bsky.social · Apr 15

We're organizing ✨Tokenization Workshop✨ TokShop❗ Join us at @icmlconf.bsky.social in July in Vancouver 🇨🇦. Follow @tokshop.bsky.social for updates! Submit your paper by May 30.

Tokenization Workshop (TokShop) @ICML2025 @tokshop.bsky.social · Apr 15

🚨 NEW WORKSHOP ALERT 🚨

We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025 @icmlconf.bsky.social! 🎉

Submissions are open for work on tokenization across all areas of machine learning.

📅 Submission deadline: May 30, 2025
🔗 tokenization-workshop.github.io

Jindřich Libovický @jlibovicky.bsky.social · Apr 4

Random take on the #TuringTest: Rather than testing machine intelligence, it can be a measure of societal awareness about #AI capabilities. The real objective isn't creating a machine that passes but educating people to think critically and avoid being deceived, so the machines do not pass the test.

Jindřich Libovický @jlibovicky.bsky.social · Apr 2

Summaries of pre-prints that I noticed and liked on arXiv in March are now on my blog jlibovicky.github.io//2025/04/02/...

Highlights from Machine Translation and Multilinguality in March 2025

EuroBERT: Scaling Multilingual Encoders for European Languages

jlibovicky.github.io

Jindřich Libovický @jlibovicky.bsky.social · Mar 10

Our paper 'Beyond Literal Token Overlap: Token Alignability for Multilinguality' will be at #NAACL2025! We show that token alignability is a stronger predictor of cross-lingual transfer than literal token overlap.

Read it here: arxiv.org/abs/2502.06468

Jindřich Libovický @jlibovicky.bsky.social · Feb 7

Short notes about what pre-prints I noticed in December and January are now on my blog: jlibovicky.github.io/2025/02/07/M...

Highlights from Machine Translation and Multilinguality in December 2024 and January 2025

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

jlibovicky.github.io

Jindřich Libovický @jlibovicky.bsky.social · Jan 14

Join Mu-SHROOM 🍄, a SemEval 2025 shared task on detecting hallucination spans in multilingual LLM outputs! 🌍 Includes Czech with regional Czech questions 🇨🇿. Do you think you can spot when something isn’t true? 🤔 Try it out! 👉 helsinki-nlp.github.io/shroom #SemEval2025 #NLP

Welcome to SemEval-2025 Task-3 — Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

helsinki-nlp.github.io

Jindřich Libovický @jlibovicky.bsky.social · Dec 24

Happy holidays! 🎄🎅🤩🎁

Jindřich Libovický @jlibovicky.bsky.social · Dec 6

Highlights from multilingual #NLP and machine translation papers I found on arXiv in November are now on my blog: jlibovicky.github.io/2024/12/05/M...

Highlights from Machine Translation and Multilinguality in November 2024

Mitigating Metric Bias in Minimum Bayes Risk Decoding

jlibovicky.github.io

Jindřich Libovický @jlibovicky.bsky.social · Dec 3

This is going to be fun! 🤓 We have three years to spend 6.5M CZK on improving multilingual tokenization. The goal is to make subwords more alignable across languages and help languages that suffer from over-segmentation with current models.

Institute of Formal and Applied Linguistics @ufal.mff.cuni.cz · Dec 3

Good news! 🥳 GAČR will fund two of our projects:
👉 @jlibovicky.bsky.social proposes to improve tokenization for #LLMs and machine translation
👉 Veronika Kolářová will study syntactic features of Czech non-verbal predicates
➕ Dominik Macháček receives Postdoc Individual Fellowship! 💪

Jindřich Libovický @jlibovicky.bsky.social · Nov 21

Just shared my takeaways from #EMNLP2024 on my blog: jlibovicky.github.io//2024/11/21/...

Notes from EMNLP 2024

Last week, I was at EMNLP in Miami, and here are a few notes about what I saw at the conference.

jlibovicky.github.io
