Pietro Lesci
@pietrolesci.bsky.social
250 followers 1K following 23 posts
PhD student at Cambridge University. Causality & language models. Passionate musician, professional debugger. pietrolesci.github.io
pietrolesci.bsky.social
Had a really great and fun time with @yanai.bsky.social, Niloofar Mireshghallah, and Reza Shokri discussing memorisation at the @l2m2workshop.bsky.social panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
yanai.bsky.social
I had a lot of fun contemplating memorization questions at the @l2m2workshop.bsky.social panel yesterday together with Niloofar Mireshghallah and Reza Shokri, moderated by
@pietrolesci.bsky.social who did a fantastic job!
#ACL2025
Reposted by Pietro Lesci
tpimentel.bsky.social
@philipwitti.bsky.social will be presenting our paper "Tokenisation is NP-Complete" at #ACL2025 😁 Come to the language modelling 2 session (Wednesday morning, 9h~10h30) to learn more about how challenging tokenisation can be!
Reposted by Pietro Lesci
deboranozza.bsky.social
Just arrived in Vienna for ACL 2025 🇦🇹 Excited to be here and to finally meet so many people in person!

We have several papers this year and many from @milanlp.bsky.social are around, come say hi!

Here are all the works I'm involved in ⤵️

#ACL2025 #ACL2025NLP
milanlp.bsky.social
🎉 The @milanlp.bsky.social lab is excited to present 15 papers and 1 tutorial at #ACL2025 & workshops! Grateful to all our amazing collaborators, see everyone in Vienna! 🚀
pietrolesci.bsky.social
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
pietrolesci.bsky.social
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
pietrolesci.bsky.social
All modern LLMs run on top of a tokeniser, an often overlooked “preprocessing detail”. But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias.

Let’s talk about it and why it matters👇
@aclmeeting.bsky.social #ACL2025 #NLProc
pietrolesci.bsky.social
Also, we find that:
– Tokenisation bias appears early in training
– It doesn’t go away as models improve or with scale

We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
pietrolesci.bsky.social
As our main result, we find that when a token is in a model’s vocabulary—i.e., when its characters are tokenised as a single symbol—the model may assign it up to 17x more probability than if it had been split into two tokens instead
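For scale: a 17x gap in probability corresponds to a log-probability gap of \log 17 \approx 2.83 nats (a back-of-the-envelope conversion, not a figure from the paper).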
pietrolesci.bsky.social
The trick: tokenisers build vocabs incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocab appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in the 📄!
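A minimal sketch of the regression-discontinuity idea, assuming one row per candidate token with its merge rank and the log-probability the model assigns to its characters; the cutoff, bandwidth, and local linear fits below are illustrative choices, not the paper's estimator.

import numpy as np

CUTOFF = 32_000    # hypothetical vocabulary size
BANDWIDTH = 2_000  # hypothetical window around the cutoff

def rdd_estimate(rank, logprob):
    # rank: each candidate token's position in the tokeniser's merge ordering
    # logprob: log-probability the model assigns to that token's characters
    rank = np.asarray(rank, dtype=float)
    logprob = np.asarray(logprob, dtype=float)
    near = np.abs(rank - CUTOFF) < BANDWIDTH
    x, y = rank[near] - CUTOFF, logprob[near]
    in_vocab = x < 0
    # local linear fit on each side of the cutoff; the jump between the two
    # intercepts at 0 is the estimated tokenisation bias (in log-probability)
    left = np.polyfit(x[in_vocab], y[in_vocab], 1)
    right = np.polyfit(x[~in_vocab], y[~in_vocab], 1)
    return np.polyval(left, 0.0) - np.polyval(right, 0.0)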
pietrolesci.bsky.social
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
pietrolesci.bsky.social
While intuitive, this question is tricky. We can’t just compare
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura") as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩) as the model only sees one during training
pietrolesci.bsky.social
In our paper, we estimate a specific type of tokenisation bias: What’s the effect of including a token (e.g., ⟨hello⟩) in the tokeniser’s vocabulary on the log-probability this model assigns to its characters (“hello”)?
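In do-notation, that estimand can be paraphrased roughly as follows (the symbols chars(t) and V are illustrative, not necessarily the paper's notation):

\mathrm{bias}(t) \;=\; \mathbb{E}\big[\log p_\theta(\mathrm{chars}(t)) \mid \mathrm{do}(t \in \mathcal{V})\big] \;-\; \mathbb{E}\big[\log p_\theta(\mathrm{chars}(t)) \mid \mathrm{do}(t \notin \mathcal{V})\big]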
pietrolesci.bsky.social
Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models’ outputs. In practice, they do. We call this difference **tokenisation bias**
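A toy sketch of this difference, assuming a Hugging Face causal LM (gpt2 is used purely as an example) and the transformers API; the prefix, the word, and the alternative split are all illustrative, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def score(token_ids, prefix="My favourite greeting is"):
    # log p(token_ids | prefix) under the model, summing next-token log-probs
    prefix_ids = tok(prefix, return_tensors="pt").input_ids[0]
    ids = torch.cat([prefix_ids, torch.tensor(token_ids)])
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids.unsqueeze(0)).logits[0], dim=-1)
    return sum(logprobs[pos - 1, ids[pos]].item()
               for pos in range(len(prefix_ids), len(ids)))

one = tok.encode(" hello")                   # typically a single token
two = tok.encode(" he") + tok.encode("llo")  # an alternative split of the same characters
print(score(one), score(two))                # the two scores generally differ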
Reposted by Pietro Lesci
tpimentel.bsky.social
A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩)—by an LM trained from scratch in each situation! Our new ACL paper proposes an observational method to estimate this causal effect! Longer thread soon!
Title of paper "Causal Estimation of Tokenisation Bias" and schematic of how we define tokenisation bias, which is the causal effect we are interested in.
Reposted by Pietro Lesci
tpimentel.bsky.social
If you're finishing your camera-ready for ACL or ICML and want to cite co-first authors more fairly, I just made a simple fix to do this! Just add $^*$ to the authors' names in your bibtex, and the citations should change :)

github.com/tpimentelms/...
Inline citations with only first author name, or first two co-first author names.
Reposted by Pietro Lesci
l2m2workshop.bsky.social
📢 @aclmeeting.bsky.social notifications have been sent out, making this the perfect time to finalize your commitment. Don't miss the opportunity to be part of the L2M2 workshop!

🔗 Commit here: openreview.net/group?id=acl...

🗓️ Deadline: May 20, 2025 (AoE)

#ACL2025 #NLProc
pietrolesci.bsky.social
I'm truly honoured that our paper "Causal Estimation of Memorisation Profiles" has been selected as the Paper of the Year by @cst.cam.ac.uk 🎉

I thank my amazing co-authors Clara Meister, Thomas Hofmann, @tpimentel.bsky.social, and my great advisor and co-author @andreasvlachos.bsky.social!
cst.cam.ac.uk
🎉 Congratulations @pietrolesci.bsky.social, Clara Meister, Thomas Hofmann, @andreasvlachos.bsky.social & Tiago Pimentel! They won Publication of the Year at our annual Hall of Fame awards last week for their paper on 'Causal Estimation of Memorisation Profiles'. www.cst.cam.ac.uk/announcing-w...
Andreas Vlachos, our Professor of Natural Language Processing and Machine Learning, collected the award from Head of Department Alastair Beresford at our Hall of Fame Awards ceremony on 23 April 2025.
pietrolesci.bsky.social
Big thanks to my co-authors: @ovdw.bsky.social, Max Müller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social
pietrolesci.bsky.social
Come find us at the poster session:
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📌 Hall 3 + Hall 2B, Poster n. 259
pietrolesci.bsky.social
We find that:
📈 Language modelling is stable: consistent scaling laws for performance and info content.
📚 Steps 1k–10k form the core of linguistic structure; 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early
pietrolesci.bsky.social
We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?
pietrolesci.bsky.social
✈️ Headed to @iclr-conf.bsky.social — whether you’ll be there in person or tuning in remotely, I’d love to connect!

We’ll be presenting our paper on pre-training stability in language models and the PolyPythias 🧵

🔗 ArXiv: arxiv.org/abs/2503.09543
🤗 PolyPythias: huggingface.co/collections/...
pietrolesci.bsky.social
The First Workshop on Large Language Model Memorization will be co-located with @aclmeeting.bsky.social in Vienna. Help us spread the word!
l2m2workshop.bsky.social
📢 The First Workshop on Large Language Model Memorization (L2M2) will be co-located with
@aclmeeting.bsky.social in Vienna 🎉

💡 L2M2 brings together researchers to explore memorization from multiple angles. Whether it's text-only LLMs or vision-language models, we want to hear from you! 🌍
Reposted by Pietro Lesci
alexxthiery.bsky.social
This year, when students in my optimization class asked for references on forward- and backward-mode autodiff, I didn't suggest books or articles: the #JAX documentation was actually the best thing I've found! What's your go-to reference for this?