Catherine Arnett @ 🍁COLM🍁
@catherinearnett.bsky.social
3.9K followers 570 following 98 posts
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics. Previously PleIAs, Edinburgh University. Interested in multilingual NLP, tokenizers, open science. 📍Boston. She/her. https://catherinearnett.github.io/
Pinned
catherinearnett.bsky.social
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
catherinearnett.bsky.social
I’m in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
Name tag with “Anti Anti Tokenizer Club” pin on lanyard
catherinearnett.bsky.social
Yeah, I think the models generally capture this well, and with a lot of flexibility. When people have done morphological tokenization, it tends to be really rigid and fragile to anything out-of-distribution (OOD).
catherinearnett.bsky.social
I guess the idea is basically to map strings of text to some kind of abstract representation of meaning and grammar? Maybe the closest thing is morphological tokenization. But to do this fully, you would kind of need to solve Language first.
catherinearnett.bsky.social
Did you know?

❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with language tags are trained on English

In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
Reposted by Catherine Arnett @ 🍁COLM🍁
mrl-workshop.bsky.social
We are in need of some emergency reviewers for MRL. If you are available, please fill out this form!
Reposted by Catherine Arnett @ 🍁COLM🍁
mrl-workshop.bsky.social
We extended the deadline by one day, so you have until the end of today (Aug 24) AoE to submit! Good luck!
mrl-workshop.bsky.social
The deadline for MRL at #EMNLP2025 is next week!

⏰ Submission Deadline: August 23rd (AoE)

🔗 CfP: sigtyp.github.io/ws2025-mrl.h...
mrl-workshop.bsky.social
The submission deadline for the 5th Workshop on Multilingual Representation Learning is coming up! See details below!
Reposted by Catherine Arnett @ 🍁COLM🍁
mrl-workshop.bsky.social
We have over 200 volunteers now for 90+ languages! We are hoping to expand the diversity of our language coverage and are still looking for participants who speak these languages. Check out how to get involved below, and please help us spread the word!
We are still actively looking for volunteers speaking the following languages (or other languages not listed):
Afrikaans, Aymara, Basque, Bosnian, Breton, Burmese, Cebuano, Guarani, Haitian Creole, Hmong, Hungarian, Icelandic, Inuktitut, Irish, Karakalpak, Khmer, Kirghiz, Lao, Latvian, Macedonian, Malagasy, Maltese, Maori, Mongolian, Nahuatl, Navajo/Diné, Norwegian Nynorsk, Quechua, Romanian, Samoan, Scottish Gaelic, Shona, Somali, Tatar, Tibetan, Tigrinya, Waray, Walloon, Welsh, Yiddish, Zulu.
Reposted by Catherine Arnett @ 🍁COLM🍁
mrl-workshop.bsky.social
With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don’t see your language represented on the map, this is your sign to get involved!
catherinearnett.bsky.social
I’m in Vienna all week for @aclmeeting.bsky.social and I’ll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!
catherinearnett.bsky.social
✨New pre-print✨ Crosslingual transfer allows models to leverage their representations for one language to improve performance on another language. We characterize the acquisition of shared representations in order to better understand how and when crosslingual transfer happens.
Reposted by Catherine Arnett @ 🍁COLM🍁
pjox.bsky.social
If you want to help us improve language and cultural coverage, and build an open-source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰
catherinearnett.bsky.social
Really grateful to the organizers for the recognition of our work!
tokshop.bsky.social
🏆 Announcing our Best Paper Awards!
🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO7...
🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4...
Congrats! 🎉
catherinearnett.bsky.social
What if we didn't use UTF-8 as a starting point for tokenization? In UTF-8, different scripts need different numbers of bytes per character, and tokenizers can create merges that lead to stranded bytes and undecodable sequences. Sander Land and I propose a novel encoding strategy that solves those problems!
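A toy illustration of the two problems the post names (this is not the paper's proposed encoding, just standard-library Python showing the UTF-8 behavior):

```python
# UTF-8 assigns different byte lengths to different scripts:
# 1 byte per Latin character, 2 for Cyrillic, 3 for Devanagari.
samples = {
    "Latin": "hello",
    "Cyrillic": "привет",
    "Devanagari": "नमस्ते",
}
for script, text in samples.items():
    print(f"{script}: {len(text)} chars -> {len(text.encode('utf-8'))} bytes")

# A byte-level merge can also strand part of a multi-byte character,
# leaving a sequence that cannot be decoded on its own:
two_bytes = "é".encode("utf-8")  # b'\xc3\xa9'
try:
    two_bytes[:1].decode("utf-8")
except UnicodeDecodeError as err:
    print("stranded byte is undecodable:", err)
```

A byte-level BPE tokenizer that merges over raw UTF-8 can end up with vocabulary items like that stranded `\xc3`, which is why decoding individual tokens back to text is not always possible.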
catherinearnett.bsky.social
MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new pre-print out and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social
catherinearnett.bsky.social
I'll be at ICML next week for the Tokenization Workshop @tokshop.bsky.social presenting two papers:
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
catherinearnett.bsky.social
We replicate the findings from the COLING paper and find that higher morphological alignment scores do not correlate with better performance. In fact, they’re predictive of slightly *worse* performance across multiple tasks and models.
catherinearnett.bsky.social
MorphScore v2 allows for flexible evaluation. You can decide whether to weight different words by their frequency and whether to include single-token words in the analysis. We also kept morphological tags, sentential context, and part-of-speech information to enable further analyses.
catherinearnett.bsky.social
The original version of MorphScore, which we introduced earlier this year in this COLING paper, evaluates the extent to which tokenizers split words into morphemic tokens. In addition to expanding the language coverage, we address some of its limitations: aclanthology.org/2025.coling-...
Why do language models perform worse for morphologically complex languages?
Catherine Arnett, Benjamin Bergen. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
aclanthology.org
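To make the idea concrete, here is a minimal sketch of what "splitting words into morphemic tokens" means. This is an illustrative toy, not the actual MorphScore implementation; the function names are made up for this example:

```python
def boundary_offsets(pieces):
    """Character offsets where the pieces place boundaries inside the word."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets

def morph_aligned(token_pieces, gold_morphemes):
    """True if every tokenizer boundary falls on a gold morpheme boundary."""
    return boundary_offsets(token_pieces) <= boundary_offsets(gold_morphemes)

# "unhappiness" = un + happi + ness
print(morph_aligned(["un", "happi", "ness"], ["un", "happi", "ness"]))  # True
print(morph_aligned(["unh", "appiness"], ["un", "happi", "ness"]))      # False
```

A score over a corpus would then aggregate this kind of per-word check, optionally weighting by word frequency as the post describes.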
catherinearnett.bsky.social
Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!
catherinearnett.bsky.social
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc
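For readers unfamiliar with language identification, here is a deliberately tiny sketch of the classic character-n-gram approach (this is not the shared task's system; the two-language profiles are made up for illustration):

```python
from collections import Counter

# Toy per-language "training data"; real systems use large corpora.
PROFILES = {
    "eng": "the quick brown fox jumps over the lazy dog",
    "fra": "le vif renard brun saute par-dessus le chien paresseux",
}

def trigrams(text):
    """Character trigram counts, with padding so word edges are captured."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def identify(text):
    """Pick the language whose profile shares the most trigrams with the input."""
    query = trigrams(text)
    def overlap(lang):
        return sum((query & trigrams(PROFILES[lang])).values())
    return max(PROFILES, key=overlap)

print(identify("the dog jumps"))   # eng
print(identify("le chien saute"))  # fra
```

Web-scale LangID is much harder than this sketch suggests (code-switching, short texts, closely related languages), which is exactly the gap the shared task targets.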