Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
📍Boston. She/her.
https://catherinearnett.github.io/
🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)
📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025
Details below! ⬇️
evalevalai.com/events/works...
❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English
In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
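For context, "tagged for a language" refers to the language metadata on a model's Hub page. A minimal sketch of how one might check this with the huggingface_hub API is below; it is not the methodology from the blog post, and the abbreviated ISO 639-1 code set is an illustrative assumption.

```python
# Minimal sketch (not the authors' methodology): sample models from the
# Hugging Face Hub and count how many carry no language tag.
from huggingface_hub import HfApi

# Hypothetical, abbreviated set of ISO 639-1 codes used only for illustration;
# a real analysis would use a full language-code list.
ISO_639_1 = {"en", "fr", "de", "es", "zh", "ar", "hi", "ru", "pt", "ja"}

api = HfApi()
models = api.list_models(filter="text-generation", limit=100)

total = 0
untagged = 0
for model in models:
    total += 1
    tags = set(model.tags or [])
    # A model counts as "tagged for a language" if any of its tags is a language code.
    if not tags & ISO_639_1:
        untagged += 1

print(f"{untagged}/{total} sampled models have no language tag")
```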
⏰ Submission Deadline: August 23rd (AoE)
🔗 CfP: sigtyp.github.io/ws2025-mrl.h...
Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/
Deadline: July 23, 2025 (AoE) ⏰
🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO7...
🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4...
Congrats! 🎉
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
Our first talk is by @catherinearnett.bsky.social on tokenizers, their limitations, and how to improve them.