Debjit Paul
debjit-paul.bsky.social
NLP Researcher
Reposted by Debjit Paul
1/ 🌍 How does mixing data from hundreds of languages affect LLM training?
In our new paper "Revisiting Multilingual Data Mixtures in Language Model Pretraining" we revisit core assumptions about multilinguality using 1.1B-3B models trained on up to 400 languages.
🧵👇
December 15, 2025 at 6:18 PM
Reposted by Debjit Paul
🚨New Preprint!

In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs. Our Parity-aware BPE algorithm is a step toward addressing this issue: 🧵
August 11, 2025 at 12:28 PM
Super excited to share that our paper "A Logical Fallacy-Informed Framework for Argument Generation" has received the Outstanding Paper Award 🎉🎉 at NAACL 2025!

Paper: aclanthology.org/2025.naacl-l...
Code: github.com/lucamouchel/...

#NAACL2025
May 1, 2025 at 1:41 PM
Reposted by Debjit Paul
Lots of great news out of the EPFL NLP lab these last few weeks. We'll be at @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April / May to present some of our work in training dynamics, model representations, reasoning, and AI democratization. Come chat with us during the conference!
February 25, 2025 at 9:18 AM
Reposted by Debjit Paul
Translating MMLU is great, but global users of multilingual #LLMs don't care all that much about an LLM's understanding of US Law!

Our new #NLProc work centers multilingual #LLM evaluations toward regional knowledge in 44 languages.
🚀 Introducing INCLUDE 🌍: A multilingual LLM evaluation benchmark spanning 44 languages!

Contains *newly-collected* data, prioritizing *regional knowledge*.
Setting the stage for truly global AI evaluation.
Ready to see how your model measures up?
#AI #Multilingual #LLM #NLProc
December 2, 2024 at 4:26 PM
Reposted by Debjit Paul
1/ 📘 Could ChatGPT get an engineering degree? Spoiler, yes! In our new @pnas.org article, we explore how AI assistants like GPT-4 perform in STEM university courses — and on average they pass a staggering 91.7% of core courses. 🧵 #AI #HigherEd #STEM #LLMs #NLProc
December 4, 2024 at 2:53 PM