taka-yamakoshi.bsky.social
@taka-yamakoshi.bsky.social
PhD student at Stanford | Interested in language processing by humans and machines | prev. UTokyo Medicine | UTokyo-Princeton Exchange '19-'20 | https://taka-yamakoshi.github.io/
Overall, our work highlights the potential of knowledge distillation as a probe to study inductive biases. Our results suggest syntactic knowledge may be distributed throughout the network rather than localized in attention patterns.
November 7, 2025 at 9:17 AM
This result holds across a variety of syntactic phenomena, with the interesting exception of ellipsis, where KD via attention was competitive with KD via logits.
November 7, 2025 at 9:17 AM
Here’s what we found: (1) Logit-based KD drastically improved data efficiency, reaching teacher-level syntax performance with just 500K sentences. But (2) attention-based KD surprisingly provided only limited benefits.
November 7, 2025 at 9:17 AM
We used a pretrained GPT-2 as the teacher and trained a student GPT-2 with the same architecture, on datasets ranging from 10K to 5M sentences.
November 7, 2025 at 9:17 AM
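For readers who want to tinker, here is a minimal sketch of that teacher–student setup using the HuggingFace transformers library. The checkpoint name, dummy batch, and configuration choices are illustrative assumptions, not the paper's exact code.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Frozen, pretrained teacher (illustrative checkpoint; the paper's exact setup may differ)
teacher = GPT2LMHeadModel.from_pretrained("gpt2")
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

# Student with the same architecture as the teacher, initialized from scratch
student = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2"))

# Forward passes; output_attentions=True exposes the per-layer attention
# matrices that the attention-based KD variant would try to match.
input_ids = torch.randint(0, teacher.config.vocab_size, (2, 16))  # dummy batch
with torch.no_grad():
    teacher_out = teacher(input_ids, output_attentions=True)
student_out = student(input_ids, output_attentions=True)
```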
We used knowledge distillation (KD) as a way to impart such a soft bias.
In particular, since attention matrices have been shown to encode syntactic information, we hypothesized that attention-based KD could selectively improve the student model’s syntax performance, more so than conventional KD via logits.
November 7, 2025 at 9:17 AM
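For concreteness, here is a minimal sketch of the two distillation objectives contrasted in the post above: logit-based KD (matching the teacher's temperature-softened next-token distribution) and attention-based KD (matching per-layer attention distributions). The temperature, loss weighting, and choice of divergence are my assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=2.0):
    """Hinton-style distillation: KL divergence between temperature-softened
    teacher and student next-token distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T ** 2)

def attention_kd_loss(student_attns, teacher_attns, eps=1e-9):
    """One plausible attention-matching objective: KL divergence between the
    teacher's and student's attention distributions, averaged over layers.
    Each element of *_attns has shape (batch, heads, seq, seq)."""
    loss = 0.0
    for sa, ta in zip(student_attns, teacher_attns):
        loss = loss + F.kl_div(torch.log(sa + eps), ta, reduction="batchmean")
    return loss / len(student_attns)

# Combined training objective (weights alpha and beta are illustrative):
# total = lm_loss + alpha * logit_kd_loss(...) + beta * attention_kd_loss(...)
```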
Past work shows that explicitly imparting syntactic knowledge helps language models train with limited data. But this requires predefined grammatical rules. Can we instead use a *softer* bias that does not assume a specific grammar?
November 7, 2025 at 9:17 AM
I’m excited to share our Findings of EMNLP paper w/ @cocoscilab.bsky.social , @rtommccoy.bsky.social, and @rdhawkins.bsky.social !

Language models, unlike humans, require large amounts of data, which suggests the need for an inductive bias.
But what kind of inductive bias do we need?
November 7, 2025 at 9:17 AM