Pietro Lesci
@pietrolesci.bsky.social
250 followers 1K following 23 posts
PhD student at Cambridge University. Causality & language models. Passionate musician, professional debugger. pietrolesci.github.io
pietrolesci.bsky.social
Had a really great and fun time with @yanai.bsky.social, Niloofar Mireshghallah, and Reza Shokri discussing memorisation at the @l2m2workshop.bsky.social panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
yanai.bsky.social
I had a lot of fun contemplating memorization questions at the @l2m2workshop.bsky.social panel yesterday together with Niloofar Mireshghallah and Reza Shokri, moderated by
@pietrolesci.bsky.social who did a fantastic job!
#ACL2025
Reposted by Pietro Lesci
tpimentel.bsky.social
@philipwitti.bsky.social will be presenting our paper "Tokenisation is NP-Complete" at #ACL2025 😁 Come to the language modelling 2 session (Wednesday morning, 9h~10h30) to learn more about how challenging tokenisation can be!
Reposted by Pietro Lesci
deboranozza.bsky.social
Just arrived in Vienna for ACL 2025 🇦🇹 Excited to be here and to finally meet so many people in person!

We have several papers this year and many from @milanlp.bsky.social are around, come say hi!

Here are all the works I'm involved in ⤵️

#ACL2025 #ACL2025NLP
milanlp.bsky.social
🎉 The @milanlp.bsky.social lab is excited to present 15 papers and 1 tutorial at #ACL2025 & workshops! Grateful to all our amazing collaborators, see everyone in Vienna! 🚀
pietrolesci.bsky.social
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
pietrolesci.bsky.social
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
pietrolesci.bsky.social
All modern LLMs run on top of a tokeniser, an often overlooked “preprocessing detail”. But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias.

Let’s talk about it and why it matters👇
@aclmeeting.bsky.social #ACL2025 #NLProc
pietrolesci.bsky.social
Also, we find that:
– Tokenisation bias appears early in training
– It doesn’t go away as models improve or with scale

We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
pietrolesci.bsky.social
As our main result, we find that when a token is in a model’s vocabulary—i.e., when its characters are tokenised as a single symbol—the model may assign it up to 17x more probability than if it had been split into two tokens instead
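For scale: a 17x gap in probability corresponds to a log-probability gap of \log 17 \approx 2.83 nats (a back-of-the-envelope conversion, not a figure from the paper).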
pietrolesci.bsky.social
The trick: tokenisers build vocabs incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocab appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in the 📄!
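A minimal sketch of the regression-discontinuity idea, assuming one row per candidate token with its merge rank and the log-probability the model assigns to its characters; the cutoff, bandwidth, and local linear fits below are illustrative choices, not the paper's estimator.

import numpy as np

CUTOFF = 32_000    # hypothetical vocabulary size
BANDWIDTH = 2_000  # hypothetical window around the cutoff

def rdd_estimate(rank, logprob):
    # rank: each candidate token's position in the tokeniser's merge ordering
    # logprob: log-probability the model assigns to that token's characters
    rank = np.asarray(rank, dtype=float)
    logprob = np.asarray(logprob, dtype=float)
    near = np.abs(rank - CUTOFF) < BANDWIDTH
    x, y = rank[near] - CUTOFF, logprob[near]
    in_vocab = x < 0
    # local linear fit on each side of the cutoff; the jump between the two
    # intercepts at 0 is the estimated tokenisation bias (in log-probability)
    left = np.polyfit(x[in_vocab], y[in_vocab], 1)
    right = np.polyfit(x[~in_vocab], y[~in_vocab], 1)
    return np.polyval(left, 0.0) - np.polyval(right, 0.0)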
pietrolesci.bsky.social
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
pietrolesci.bsky.social
While intuitive, this question is tricky. We can’t just compare
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura") as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩) as the model only sees one during training
pietrolesci.bsky.social
In our paper, we estimate a specific type of tokenisation bias: What’s the effect of including a token (e.g., ⟨hello⟩) in the tokeniser’s vocabulary on the log-probability this model assigns to its characters (“hello”)?
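In do-notation, that estimand can be paraphrased roughly as follows (the symbols chars(t) and V are illustrative, not necessarily the paper's notation):

\mathrm{bias}(t) \;=\; \mathbb{E}\big[\log p_\theta(\mathrm{chars}(t)) \mid \mathrm{do}(t \in \mathcal{V})\big] \;-\; \mathbb{E}\big[\log p_\theta(\mathrm{chars}(t)) \mid \mathrm{do}(t \notin \mathcal{V})\big]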
pietrolesci.bsky.social
Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models’ outputs. In practice, they do. We call this difference **tokenisation bias**
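A toy sketch of this difference, assuming a Hugging Face causal LM (gpt2 is used purely as an example) and the transformers API; the prefix, the word, and the alternative split are all illustrative, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def score(token_ids, prefix="My favourite greeting is"):
    # log p(token_ids | prefix) under the model, summing next-token log-probs
    prefix_ids = tok(prefix, return_tensors="pt").input_ids[0]
    ids = torch.cat([prefix_ids, torch.tensor(token_ids)])
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids.unsqueeze(0)).logits[0], dim=-1)
    return sum(logprobs[pos - 1, ids[pos]].item()
               for pos in range(len(prefix_ids), len(ids)))

one = tok.encode(" hello")                   # typically a single token
two = tok.encode(" he") + tok.encode("llo")  # an alternative split of the same characters
print(score(one), score(two))                # the two scores generally differ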
Reposted by Pietro Lesci
tpimentel.bsky.social
A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩)—by an LM trained from scratch in each situation! Our new ACL paper proposes an observational method to estimate this causal effect! Longer thread soon!
Title of paper "Causal Estimation of Tokenisation Bias" and schematic of how we define tokenisation bias, which is the causal effect we are interested in.
Reposted by Pietro Lesci
tpimentel.bsky.social
If you're finishing your camera-ready for ACL or ICML and want to cite co-first authors more fairly, I just made a simple fix to do this! Just add $^*$ to the authors' names in your bibtex, and the citations should change :)

github.com/tpimentelms/...
Inline citations with only first author name, or first two co-first author names.
Reposted by Pietro Lesci
l2m2workshop.bsky.social
📢 @aclmeeting.bsky.social notifications have been sent out, making this the perfect time to finalize your commitment. Don't miss the opportunity to be part of the L2M2 workshop!

🔗 Commit here: openreview.net/group?id=acl...

🗓️ Deadline: May 20, 2025 (AoE)

#ACL2025 #NLProc
pietrolesci.bsky.social
I'm truly honoured that our paper "Causal Estimation of Memorisation Profiles" has been selected as the Paper of the Year by @cst.cam.ac.uk 🎉

I thank my amazing co-authors Clara Meister, Thomas Hofmann, @tpimentel.bsky.social, and my great advisor and co-author @andreasvlachos.bsky.social!
cst.cam.ac.uk
🎉 Congratulations @pietrolesci.bsky.social, Clara Meister, Thomas Hofmann, @andreasvlachos.bsky.social & Tiago Pimentel! They won Publication of the Year at our annual Hall of Fame awards last week for their paper on 'Causal Estimation of Memorisation Profiles'. www.cst.cam.ac.uk/announcing-w...
Andreas Vlachos, our Professor of Natural Language Processing and Machine Learning, collected the award from Head of Department Alastair Beresford at our Hall of Fame Awards ceremony on 23 April 2025.
pietrolesci.bsky.social
Big thanks to my co-authors: @ovdw.bsky.social, Max Müller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social
pietrolesci.bsky.social
Come find us at the poster session:
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📌 Hall 3 + Hall 2B, Poster n. 259
pietrolesci.bsky.social
We find that:
📈 Language modelling is stable: consistent scaling laws for performance and info content.
📚 Steps 1k–10k form the core of linguistic structure; 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early
pietrolesci.bsky.social
We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?
pietrolesci.bsky.social
✈️ Headed to @iclr-conf.bsky.social — whether you’ll be there in person or tuning in remotely, I’d love to connect!

We’ll be presenting our paper on pre-training stability in language models and the PolyPythias 🧵

🔗 ArXiv: arxiv.org/abs/2503.09543
🤗 PolyPythias: huggingface.co/collections/...
pietrolesci.bsky.social
The First Workshop on Large Language Model Memorization will be co-located with @aclmeeting.bsky.social in Vienna. Help us spread the word!
l2m2workshop.bsky.social
📢 The First Workshop on Large Language Model Memorization (L2M2) will be co-located with
@aclmeeting.bsky.social in Vienna 🎉

💡 L2M2 brings together researchers to explore memorization from multiple angles. Whether it's text-only LLMs or vision-language models, we want to hear from you! 🌍
Reposted by Pietro Lesci
alexxthiery.bsky.social
This year, when students in my optimization class asked for references on forward- and backward-mode autodiff, I didn't suggest books or articles: the #JAX documentation was actually the best thing I've found! What's your go-to reference for this?