Niyati Bafna
@niyatibafna.bsky.social
PhD student @jhuclsp. Previously @AIatMeta, @InriaParisNLP, @EM_LCT | #NLProc
Thanks! Yeah, that idea's definitely still around :) Although "language-agnostic" to a large extent seems to be "English" (arxiv.org/pdf/2402.18815, aclanthology.org/2024.acl-lon...)
July 10, 2025 at 2:51 AM
This work was done with my amazing collaborators Tianjian Li, @kentonmurray.bsky.social, @davidrmortensen.bsky.social, David Yarowsky, Hale Sirin, and @danielkhashabi.bsky.social, at @jhuclsp.bsky.social.
July 4, 2025 at 5:05 PM
If you can’t decide whether to go end-to-end or MT cascade for your next multilingual experiments, or you want to build alternative architectures or adapters for multilingual LLMs, or you want to know why (God, why) LLMs can't solve tasks in other languages, this paper is for you.
July 4, 2025 at 5:05 PM
Main takeaway: Translation failure is an important failure mode! Your model may be having wise and intelligent thoughts right up to its last couple of layers, and then failing to communicate them in Telugu because (like me) it has tried but failed to learn Telugu.
July 4, 2025 at 5:05 PM
We break down the patterns in the above figure by source and target language, talk about what makes the neat pipeline picture a little more complicated, and show briefly what happens with a bigger model (spoiler: things improve but not too much). See paper for details!
July 4, 2025 at 5:05 PM
In general, intermediate accuracy stays high even for LRL targets, but final accuracy drops off quickly, so TLP is high (>50%) for most target languages. The exception is low-resource *source* languages, where task-solving fails before we even get to translation.
July 4, 2025 at 5:05 PM
We then quantify the *translation loss proportion* (TLP): the proportion of failure cases that had successful task-solving but failed translation (see paper for less hand-waviness). We look at intermediate task-solving accuracy (over all layers), final accuracy, and TLP.
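To make the bookkeeping concrete, here's a minimal sketch of how TLP can be computed from per-example records of intermediate and final correctness (not the paper's code; the field names are hypothetical):

```python
# Minimal sketch (hypothetical field names): computing translation loss proportion (TLP).
# Each record says whether any intermediate layer decoded the correct answer in *some*
# language, and whether the final output was correct in the target language.

def translation_loss_proportion(records):
    """records: list of dicts with keys 'intermediate_correct' and 'final_correct'."""
    failures = [r for r in records if not r["final_correct"]]
    if not failures:
        return 0.0
    # Failure cases where the task was solved internally but translation failed.
    translation_failures = [r for r in failures if r["intermediate_correct"]]
    return len(translation_failures) / len(failures)

records = [
    {"intermediate_correct": True,  "final_correct": True},   # solved and articulated
    {"intermediate_correct": True,  "final_correct": False},  # solved, translation failed
    {"intermediate_correct": False, "final_correct": False},  # task-solving failed
]
print(translation_loss_proportion(records))  # 0.5
```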
July 4, 2025 at 5:05 PM
What languages does task-solving occur in? We look at the distribution over languages of correct intermediate outputs and see that 1) English dominates, but 2) other supported HRLs have a considerable combined presence! Also, this mix looks largely the same regardless of target language.
July 4, 2025 at 5:05 PM
We visualize the task-solving→translation pipeline, showing that intermediate layers have high *off-target* accuracy (task-solving), which gets converted (via translation) to *on-target* accuracy near the final layers, for HRL targets. For LRL target languages, translation fails, resulting in bad outputs.
July 4, 2025 at 5:05 PM
We look at a word translation task for 108 language pairs, and use logit lens to trace *task-solving accuracy* (correct semantics regardless of language) and *on-targetness* (correct target language) over model layers.
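If you haven't played with logit lens before, here's a minimal sketch with a HuggingFace causal LM (the model, prompt, and decoding details are placeholders, not our setup): push each layer's hidden state through the final layer norm and the unembedding matrix, and read off the top token per layer.

```python
# Minimal logit-lens sketch ("gpt2" and the prompt are placeholders, not the paper's setup):
# decode the top token at every layer by projecting hidden states into vocabulary space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = 'French: "chien" - Spanish: "'  # hypothetical word-translation prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    last = model.transformer.ln_f(h[:, -1, :])   # final layer norm (GPT-2 specific)
    logits = model.lm_head(last)                 # project into vocabulary space
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(-1))!r}")

# Per layer, one would then check whether the decoded word is semantically correct
# (task-solving accuracy) and whether it is in the target language (on-targetness).
```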
July 4, 2025 at 5:05 PM
This hypothesis says that 1) multilingual generation uses a model-internal task-solving→translation cascade, and 2) failure of the translation stage *despite task-solving success* is a large part of the problem. That is, the model often solves the task but fails to articulate the answer.
July 4, 2025 at 5:05 PM
This work was done in (a super fun) collaboration with Matthew Wiesner, at the HLTCOE and @jhuclsp.bsky.social.
June 7, 2025 at 5:27 PM
Apparently the ECAPA-TDNN model thinks I'm speaking Bengali when I read out Wordsworth to it. I wish I spoke Bengali. I wish Wordsworth spoke Bengali. But the cold harsh truth: SOTA LID should be better.
June 7, 2025 at 5:27 PM
This module by itself shows very little accent-language confusion. In combination with the ECAPA-TDNN model, it shows large improvements on LID for L2-accented speech in English, French, and German, and minimal degradation on mainstream accented speech.
June 7, 2025 at 5:27 PM
Okay, so how do we fix this problem? We investigate using a module that incorporates long-range information to help out. We look at two representations of the input: as a sequence of phones and a sequence of discretised SSL representations. And we put a classifier on top.
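As a rough illustration only (not our actual architecture; the sizes and encoder choice here are made up), the module amounts to a sequence classifier over discrete unit IDs, whether phones or k-means-discretised SSL frames:

```python
# Toy sketch of the general idea (not the paper's architecture): a small classifier
# over a sequence of discrete units (phone IDs or discretised SSL representation IDs),
# so that long-range context can inform the language decision.
import torch
import torch.nn as nn

class UnitSequenceLID(nn.Module):
    def __init__(self, num_units=500, num_languages=107, dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * dim, num_languages)

    def forward(self, unit_ids):                 # unit_ids: (batch, seq_len) of ints
        x = self.embed(unit_ids)
        x, _ = self.encoder(x)
        return self.classifier(x.mean(dim=1))    # pool over time, predict language

model = UnitSequenceLID()
fake_units = torch.randint(0, 500, (2, 120))     # e.g. discretised SSL units
print(model(fake_units).shape)                   # torch.Size([2, 107])
```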
June 7, 2025 at 5:27 PM
This suggests that language identification models behave like accent identification models under the hood, largely relying on short-range phonotactics. When the accent-language association is broken, e.g. for L2-accented speech, LID models break. Badly!
June 7, 2025 at 5:27 PM
Models that show less block permutation invariance, such as the GEO model (aclanthology.org/2024.naacl-l...), also appear more robust to L2 accents.
June 7, 2025 at 5:27 PM
To test this, we look at *block permutation invariance*, i.e., the length of ordered (as well as unordered) input features that SOTA models actually rely on; our experiments indicate that they use features describing only about 1-2 phones.
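Roughly, the probe looks like the sketch below (placeholder details and a hypothetical lid_model.predict interface, not our exact protocol): shuffle fixed-length blocks of the frame sequence and check how much LID accuracy moves as the block length shrinks.

```python
# Sketch of a block-permutation probe (hypothetical interface, not the paper's exact
# protocol): shuffle fixed-length blocks of the frame sequence; if accuracy barely
# changes even for short blocks, the model is only using short-range features.
import random
import numpy as np

def permute_blocks(frames, block_len, rng):
    """frames: (T, feat_dim) array; returns frames with blocks of block_len shuffled."""
    n_blocks = len(frames) // block_len
    blocks = [frames[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]
    rng.shuffle(blocks)
    blocks.append(frames[n_blocks * block_len:])   # keep the leftover tail at the end
    return np.concatenate(blocks)

def accuracy_under_permutation(lid_model, dataset, block_len, seed=0):
    """dataset: list of (frames, language) pairs; lid_model.predict(frames) -> language."""
    rng = random.Random(seed)
    correct = sum(
        lid_model.predict(permute_blocks(frames, block_len, rng)) == lang
        for frames, lang in dataset
    )
    return correct / len(dataset)
```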
June 7, 2025 at 5:27 PM
Our hypothesis: this is caused by the model relying on features that are too short. The intuition is that accents are characterised by short-range phone-usage features, while languages are characterised by vocabulary and syntax. L2-accented speech imposes the former over the latter, causing confusion when models are short-sighted.
June 7, 2025 at 5:27 PM
Accent-language confusion: the misrecognition of L2-accented speech as the speaker's L1 substrate or a related language, for example, when Indonesian-accented English is classified as Indonesian, Malay, etc. A large part of model error on L2-accented speech follows this pattern!
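In code, the pattern we count looks roughly like this (a hypothetical record layout and a hand-written confusable-language map, just for illustration):

```python
# Sketch (hypothetical data layout): what share of LID errors on L2-accented speech are
# accent-language confusion, i.e. the prediction is the speaker's L1 substrate or a
# language related to it.
CONFUSABLE = {
    "indonesian-accented-english": {"indonesian", "malay"},  # example mapping
}

def accent_language_confusion_rate(records):
    """records: list of dicts with 'accent', 'true_lang', 'pred_lang'."""
    errors = [r for r in records if r["pred_lang"] != r["true_lang"]]
    if not errors:
        return 0.0
    confused = [r for r in errors if r["pred_lang"] in CONFUSABLE.get(r["accent"], set())]
    return len(confused) / len(errors)
```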
June 7, 2025 at 5:27 PM