Craig Schmidt
@craigschmidt.com
Interested in ML, AI, and NLP. Particularly interested in tokenization. Live in the Boston area and work in R&D at Kensho Technologies.
craigschmidt.com
I’m at @colmweb.org this week in Montreal. Come see our BoundlessBPE paper in the Wed morning poster session. Love to talk to anyone else here, especially about tokenization. #COLM2025
craigschmidt.com
I believe he’s talking about Olin College of Engineering. It was created from scratch as an undergraduate-only school, with the first class in 2002. Kind of a Harvey Mudd of the East. The campus is near me, and they seem to attract great students.
craigschmidt.com
There are two different ways that the Hugging Face WordPiece implementation can produce [UNK] tokens, even with ByteLevel pretokenization. A nice blog post from Stéphan Tulkens, written in response to a question of mine, talks about how to fix one of them.
stephantul.github.io/blog/better-...
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem
Stéphan Tulkens' Blog
stephantul.github.io
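As a rough illustration of the [UNK] problem named in the linked post's title, here is a minimal pure-Python sketch of greedy longest-match WordPiece. It is not the Hugging Face implementation, and it does not claim to show the two specific cases the post refers to. The function name `wordpiece`, the `max_chars` limit, and the toy vocabulary are all assumptions made for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece over one pretokenized word (toy sketch)."""
    # Path 1 (assumed length cap): overly long words map straight to the unknown token.
    if len(word) > max_chars:
        return [unk]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry a prefix
            if cand in vocab:
                match = cand
                break
            end -= 1
        # Path 2: no piece fits at this position, so the whole word becomes unknown.
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: every character exists word-initially,
# but "a" has no "##a" continuation form.
VOCAB = {"a", "b", "c", "ab", "##b", "##c"}

print(wordpiece("abc", VOCAB))      # ['ab', '##c']
print(wordpiece("ca", VOCAB))       # ['[UNK]']
print(wordpiece("ab" * 60, VOCAB))  # ['[UNK]']
```

In this sketch, "ca" collapses to a single [UNK] even though both of its characters are in the vocabulary, because the greedy match needs a "##a" continuation piece that does not exist.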
craigschmidt.com
I've been using GPT-5 on my phone (since it isn't on my web account yet). I've had several bad responses with logical inconsistencies. My hot take: what if GPT-5 is mostly about saving OpenAI money on inference, which is why they are deprecating all the other models so quickly?
Reposted by Craig Schmidt
daviddarmofal.bsky.social
@crampell.bsky.social’s post got me to thinking and…yes…Trump has apparently canceled the research grant of Judea Pearl, who is one of the world’s leading scholars, is Jewish, Israeli-American, & is vocally opposed to antisemitism, & is the father of Daniel Pearl.
www.science.org/content/arti...
Reposted by Craig Schmidt
mdlhx.bsky.social
Interested in multilingual tokenization in #NLP? Lisa Beinborn and I are hiring!

PhD candidate position in Göttingen, Germany: www.uni-goettingen.de/de/644546.ht...

PostDoc position in Leuven, Belgium:
www.kuleuven.be/personeel/jo...

Deadline 6th of June
Job openings (OBP) - Georg-August-Universität Göttingen
Web pages of the Georg-August-Universität Göttingen
www.uni-goettingen.de
craigschmidt.com
And of course I missed some tokenization-related papers at #ACL2025 in my previous post. Any more I should add?