taka-yamakoshi.bsky.social
@taka-yamakoshi.bsky.social
PhD student at Stanford | Interested in language processing by humans and machines | prev. UTokyo Medicine | UTokyo-Princeton Exchange '19-'20 | https://taka-yamakoshi.github.io/
Overall, our work highlights the potential of knowledge distillation as a probe to study inductive biases. Our results suggest syntactic knowledge may be distributed throughout the network rather than localized in attention patterns.
November 7, 2025 at 9:17 AM
This result holds across a variety of syntactic phenomena, with the interesting exception of ellipsis, where KD via attention was competitive with KD via logits.
November 7, 2025 at 9:17 AM
Here’s what we found: (1) Logit-based KD drastically improved data efficiency, reaching teacher-level syntax performance with just 500K sentences. But (2) attention-based KD surprisingly provided only limited benefits.
November 7, 2025 at 9:17 AM
We used a pretrained GPT-2 as the teacher and trained a student GPT-2 with the same architecture, on datasets ranging from 10K to 5M sentences.
November 7, 2025 at 9:17 AM
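For readers who want to tinker, here is a minimal sketch of that teacher–student setup using the HuggingFace transformers library. The checkpoint name, dummy batch, and configuration choices are illustrative assumptions, not the paper's exact code.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Frozen, pretrained teacher (illustrative checkpoint; the paper's exact setup may differ)
teacher = GPT2LMHeadModel.from_pretrained("gpt2")
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

# Student with the same architecture as the teacher, initialized from scratch
student = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2"))

# Forward passes; output_attentions=True exposes the per-layer attention
# matrices that the attention-based KD variant would try to match.
input_ids = torch.randint(0, teacher.config.vocab_size, (2, 16))  # dummy batch
with torch.no_grad():
    teacher_out = teacher(input_ids, output_attentions=True)
student_out = student(input_ids, output_attentions=True)
```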
We used knowledge distillation (KD) as a way to impart such a soft bias.
In particular, since attention matrices have been shown to encode syntactic information, we hypothesized that attention-based KD could selectively improve the student model’s syntax performance, more so than conventional KD via logits.
November 7, 2025 at 9:17 AM
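For concreteness, here is a minimal sketch of the two distillation objectives contrasted in the post above: logit-based KD (matching the teacher's temperature-softened next-token distribution) and attention-based KD (matching per-layer attention distributions). The temperature, loss weighting, and choice of divergence are my assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=2.0):
    """Hinton-style distillation: KL divergence between temperature-softened
    teacher and student next-token distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T ** 2)

def attention_kd_loss(student_attns, teacher_attns, eps=1e-9):
    """One plausible attention-matching objective: KL divergence between the
    teacher's and student's attention distributions, averaged over layers.
    Each element of *_attns has shape (batch, heads, seq, seq)."""
    loss = 0.0
    for sa, ta in zip(student_attns, teacher_attns):
        loss = loss + F.kl_div(torch.log(sa + eps), ta, reduction="batchmean")
    return loss / len(student_attns)

# Combined training objective (weights alpha and beta are illustrative):
# total = lm_loss + alpha * logit_kd_loss(...) + beta * attention_kd_loss(...)
```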
Past work shows that explicitly imparting syntactic knowledge helps language models train with limited data. But this requires predefined grammatical rules. Can we instead use a *softer* bias that does not assume a specific grammar?
November 7, 2025 at 9:17 AM
I’m excited to share our Findings of EMNLP paper w/ @cocoscilab.bsky.social , @rtommccoy.bsky.social, and @rdhawkins.bsky.social !

Language models, unlike humans, require large amounts of data, which suggests the need for an inductive bias.
But what kind of inductive bias do we need?
November 7, 2025 at 9:17 AM