David Smith
dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Reposted by David Smith
(2/2) Morphology-aware tokenization improves Latin LM performance on four downstream tasks, including gains for out-of-domain texts and rare words.

📄 arxiv.org/abs/2511.09709
Contextual morphologically-guided tokenization for Latin encoder models
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than...
arxiv.org
November 14, 2025 at 8:02 PM
Reposted by David Smith
(2) The prediction view: the cost of processing each word in a sentence can be fully reduced to the word’s contextual predictability (i.e. surprisal). Predicting the next word is exactly what LLMs are trained to do, so they’re a great tool for evaluating this view. (3/n)
November 14, 2025 at 7:19 PM
Reposted by David Smith
We conducted a high-powered (n=368) eye-tracking-while-reading study to test two competing views:
(1) The structural processing view: eye movements reflect the cost of mentally assembling the words of a sentence into a larger meaning. (2/n)
November 14, 2025 at 7:19 PM
Reposted by David Smith
We present alongside the paper:
1. ‘NewsWords’ - unigrams from the entire digitised collection, github.com/Living-with-...
2. Newspaper metadata, openhumanitiesdata.metajnl.com/articles/10....
3. Mitchell's Press Directories, bl.iro.bl.uk/concern/data...
3/7
GitHub - Living-with-machines/newswords: Code for the counts data derived from historical newspapers
github.com
November 11, 2025 at 4:06 PM
Reposted by David Smith
All 4 positions are open rank & all could result in multiple hires. Folks from the DH/book history/bibliography worlds might look especially at the "Information, Culture, & Society" & open information sciences positions to see if anything resonates—happy to offer what insight I can
November 6, 2025 at 3:55 PM