aclanthology.org/2025.finding...
In particular, since attention matrices have been shown to encode syntactic information, we hypothesized that attention-based KD could selectively improve the student model’s syntactic performance more than conventional KD via logits.
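To make the contrast between the two distillation signals concrete, here is a minimal sketch in Python with NumPy. The passage does not give the loss functions, so both forms are assumptions chosen for illustration: a temperature-softened KL divergence for logit-based KD and a mean squared error between attention matrices for attention-based KD. The function names, tensor shapes, and the temperature value are hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed logit-based KD loss: KL(teacher || student) on softened outputs.

    Both inputs have shape (positions, vocab). The temperature**2 factor is the
    usual rescaling that keeps gradient magnitudes comparable across temperatures.
    """
    log_p_teacher = log_softmax(teacher_logits / temperature)
    log_p_student = log_softmax(student_logits / temperature)
    p_teacher = np.exp(log_p_teacher)
    kl_per_position = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl_per_position.mean() * temperature ** 2)


def attention_kd_loss(student_attn, teacher_attn):
    """Assumed attention-based KD loss: MSE between attention matrices.

    Both inputs have shape (heads, seq_len, seq_len), with each row a
    probability distribution over the attended positions.
    """
    return float(np.mean((student_attn - teacher_attn) ** 2))


# Toy data (hypothetical sizes): 5 positions, vocabulary of 11, 2 heads.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(5, 11))
student_logits = rng.normal(size=(5, 11))
teacher_attn = np.exp(log_softmax(rng.normal(size=(2, 5, 5))))
student_attn = np.exp(log_softmax(rng.normal(size=(2, 5, 5))))

logit_loss = logit_kd_loss(student_logits, teacher_logits)
attn_loss = attention_kd_loss(student_attn, teacher_attn)
print("logit KD loss:", logit_loss)
print("attention KD loss:", attn_loss)
```

The logit loss only constrains the student's output distribution, whereas the attention loss constrains an internal representation directly, which is why the latter might plausibly transfer syntactic structure more selectively.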
Language models, unlike humans, require large amounts of data, which suggests the need for an inductive bias.
But what kind of inductive biases do we need?