Craig Schmidt
@craigschmidt.com
Interested in ML, AI, and NLP. Particularly interested in tokenization. Live in the Boston area and work in R&D at Kensho Technologies.
Craig Schmidt
@craigschmidt.com
· Aug 10
Reposted by Craig Schmidt
Craig Schmidt
@craigschmidt.com
· Jul 30
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens, David Kaczér, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
aclanthology.org
Craig Schmidt
@craigschmidt.com
· Jul 30
Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages
Georgy Andryushchenko, Vladimir V. Ivanov. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 2025.
aclanthology.org
Craig Schmidt
@craigschmidt.com
· Jul 30
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
aclanthology.org
Craig Schmidt
@craigschmidt.com
· Jul 30
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025.
aclanthology.org
Craig Schmidt
@craigschmidt.com
· Jul 30
Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
aclanthology.org
Craig Schmidt
@craigschmidt.com
· Jul 30
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...
aclanthology.org
Craig Schmidt
@craigschmidt.com
· Jul 30
Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini...
aclanthology.org