Kwanghee Choi
@juice500ml.bsky.social
Master's student @ltiatcmu.bsky.social, working on speech AI with @shinjiw.bsky.social
Reposted by Kwanghee Choi
mdhk.net
Had such a great time presenting our tutorial on Interpretability Techniques for Speech Models at #Interspeech2025! 🔍

For anyone looking for an introduction to the topic, we've now uploaded all materials to the website: interpretingdl.github.io/speech-inter...
juice500ml.bsky.social
This wouldn't have been possible without my awesome co-first-author @mmiagshatoy.bsky.social and wonderful supervisors @shinjiw.bsky.social and @strubell.bsky.social!
I'll see you in Rotterdam, Wed 17:00-17:20, Area8-Oral4 (Streaming ASR)! (10/10)
juice500ml.bsky.social
There's also a bunch of engineering tricks that can improve performance. We provide a Pareto-optimal baseline after applying all the available tricks, positioning our work as a foundation for future work in this direction. github.com/Masao-Someki... (9/n)
juice500ml.bsky.social
We also verified that DSUs are learnable with fewer weights (fewer layers), i.e., more lightweight! This implies that we're using self-supervised models inefficiently when extracting DSUs. (8/n)
juice500ml.bsky.social
We verified that DSUs are learnable with a limited attention window (window size), i.e., streamable! This implies that DSUs are temporally "local". (7/n)
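Roughly what a limited attention window means in code: each frame can only attend to a bounded amount of past context, so no future frames are needed. A toy sketch (the window size and shapes here are illustrative, not the paper's exact setup):

```python
import torch

def local_causal_mask(n_frames: int, window: int) -> torch.Tensor:
    # True = attention blocked. Frame t may only attend to frames
    # [t-window+1, t]: no future context, bounded latency.
    i = torch.arange(n_frames).unsqueeze(1)
    j = torch.arange(n_frames).unsqueeze(0)
    return (j > i) | (j < i - window + 1)

# Illustrative: 6 frames, window of 3 (pass as attn_mask to attention).
print(local_causal_mask(n_frames=6, window=3).int())
```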
juice500ml.bsky.social
After modifying the architecture, we fine-tune it on the DSUs extracted from the original full model. In other words, we now treat DSUs as "ground truth" for smaller models. (6/n)
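Conceptually, this fine-tuning is just cross-entropy against the teacher's cluster indices. A toy sketch (the student module, input features, and all shapes are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

batch, frames, n_clusters = 4, 100, 500

# Teacher DSUs, precomputed offline with the full S3M + k-means
# (random placeholders here).
targets = torch.randint(0, n_clusters, (batch, frames))

# Placeholder student: any smaller/streamable encoder mapping per-frame
# inputs (e.g., 80-dim filterbanks, assumed) to DSU logits.
student = torch.nn.Sequential(
    torch.nn.Linear(80, 256), torch.nn.ReLU(), torch.nn.Linear(256, n_clusters)
)

feats = torch.randn(batch, frames, 80)
logits = student(feats)                                  # (B, T, n_clusters)
loss = F.cross_entropy(logits.transpose(1, 2), targets)  # DSUs as ground truth
loss.backward()
```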
juice500ml.bsky.social
However, the underlying Transformer model is heavy and non-streamable. We make the model more lightweight (by reducing the number of layers) and streamable (by using a streaming window). (5/n)
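Reducing the number of layers can be as simple as keeping only the first few transformer blocks of the pretrained encoder before fine-tuning. A minimal sketch, with torchaudio's HuBERT as a stand-in S3M and an assumed depth of 3:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()

# Keep only the first 3 of 12 transformer layers (depth is an assumption).
model.encoder.transformer.layers = model.encoder.transformer.layers[:3]

wav = torch.randn(1, 16000)  # 1 s of dummy audio at 16 kHz
with torch.inference_mode():
    feats, _ = model(wav)    # features from the truncated encoder
print(feats.shape)
```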
juice500ml.bsky.social
Why DSUs?
(1) High transmission efficiency of ~0.6kbps (.wav files are around 512kbps, almost 3 orders of magnitude bigger!)
(2) Easy integration with LLMs (we can say DSUs are "tokenized speech")
(3) DSUs somewhat "act" like phonemes (4/n)
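The bitrate claim is easy to sanity-check with back-of-the-envelope arithmetic, assuming a typical 50 Hz S3M frame rate and a k-means vocabulary of, say, 2000 clusters (illustrative numbers, not necessarily the paper's):

```python
import math

frame_rate = 50      # S3M frames per second (typical; an assumption)
n_clusters = 2000    # k-means vocabulary size (assumed)

bits_per_frame = math.log2(n_clusters)             # ~11 bits per DSU
bitrate_kbps = frame_rate * bits_per_frame / 1000
print(f"{bitrate_kbps:.2f} kbps")                  # ~0.55 kbps vs ~512 kbps raw .wav
```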
juice500ml.bsky.social
A whirlwind overview of discrete speech units (DSUs): we first train a Transformer model with self-supervision (i.e., self-supervised speech models, S3Ms). Then we simply apply k-means on top of its representations, and the k-means cluster indices become the DSUs! (3/n)
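A minimal sketch of that pipeline, with torchaudio's HuBERT and scikit-learn's k-means standing in for whatever S3M and clustering setup a given paper uses (the layer index, cluster count, and input file are assumptions):

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

wav, sr = torchaudio.load("utterance.wav")  # hypothetical input file
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.inference_mode():
    feats, _ = model.extract_features(wav)  # per-layer feature list
    layer9 = feats[8].squeeze(0).numpy()    # (frames, dim); layer choice assumed

# Fit k-means on the features (in practice over a large corpus, not one file).
kmeans = KMeans(n_clusters=500, n_init=10).fit(layer9)
dsus = kmeans.predict(layer9)  # cluster indices = the DSU sequence
print(dsus[:20])
```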
juice500ml.bsky.social
In short, yes!
(1) We are using self-supervised models inefficiently when extracting discrete speech units (DSUs), hence they can be made more lightweight.
(2) DSUs do not require the full temporal receptive field, hence they are streamable. (2/n)
juice500ml.bsky.social
Can we make discrete speech units lightweight🪶 and streamable🏎? Excited to share our new #Interspeech2025 paper: On-device Streaming Discrete Speech Units arxiv.org/abs/2506.01845 (1/n)
juice500ml.bsky.social
www.nature.com/articles/350...
Ted Chiang. Catching crumbs from the table. Nature 405, 517 (2000). My favorite sci-fi short, which summarizes surprisingly well what I actually do nowadays. I bet self-supervised speech models contain undiscovered theories of phonetics and phonology.
juice500ml.bsky.social
Check out my presentation and poster for more details. I'll see you at NAACL, 4/30 14:00-15:30 Poster Session C! youtu.be/ZRF4u1eThJM (9/9)
juice500ml.bsky.social
We provide an extensive benchmark covering both pathological and non-native speech, with 8 different methods and 4 different speech features. It measures how accurately each speech feature models each phoneme. (7/n)
juice500ml.bsky.social
Based on this observation, we found that k-means + Gaussian Mixture Models (GMMs) are actually quite effective for modeling sound distributions.
It's different from classifiers! Classifiers model P(phoneme|sound), whereas ours models P(sound|phoneme). (6/n)
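A minimal sketch of that generative view: fit one GMM per phoneme on S3M frames, then score log P(sound|phoneme). Everything here (dimensions, counts, random data) is a placeholder, not the paper's setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder: S3M frames (768-dim) grouped by phoneme label.
frames_by_phoneme = {
    "t":  rng.normal(0.0, 1.0, (200, 768)),
    "th": rng.normal(0.5, 1.0, (200, 768)),
}

# One GMM per phoneme: models P(sound | phoneme),
# unlike a classifier's P(phoneme | sound).
gmms = {ph: GaussianMixture(n_components=4, covariance_type="diag").fit(X)
        for ph, X in frames_by_phoneme.items()}

test_frame = rng.normal(0.0, 1.0, (1, 768))
scores = {ph: gmm.score(test_frame) for ph, gmm in gmms.items()}
print(scores)  # low likelihood under every phoneme suggests atypical pronunciation
```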
juice500ml.bsky.social
So, why is allophony important? We have to model each phoneme accurately for the atypical speech assessment task. It has direct applications to non-native and pathological speech assessment. (5/n)
juice500ml.bsky.social
Compared to traditional speech features like MFCCs or Mel spectrograms, self-supervised features are far superior at capturing allophony. (4/n)
juice500ml.bsky.social
A quick background on linguistics: this is supposed to happen! A single phoneme may have multiple realizations. For example, English /t/ is pronounced differently per context: [tʰ] in tap, [t] in stop, [ɾ] in butter, and [ʔ] in kitten. (3/n)
juice500ml.bsky.social
In short, yes! Even though self-supervised speech models are trained only on raw speech, their representations cluster by allophonic variation, i.e., by the surrounding phonetic environment. (2/n)
juice500ml.bsky.social
Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)
Reposted by Kwanghee Choi
siddhant-arora.bsky.social
New #NAACL2025 demo! Excited to introduce ESPnet-SDS, a new open-source toolkit for building unified web interfaces for both cascaded & end-to-end spoken dialogue systems, providing real-time evaluation, and more!
📜: arxiv.org/abs/2503.08533
Live Demo: huggingface.co/spaces/Siddh...
Reposted by Kwanghee Choi
davelevitan.bsky.social
More from inside NIH:

Per a source with knowledge, for all internal research (of which there is like $10 billion worth or so), ALL purchasing shut down as of yesterday.

That means gloves, reagents, anything involved with lab work, which means a lot of that work will stop.
Reposted by Kwanghee Choi
ltiatcmu.bsky.social
Are you a pre-doctoral student interested in language technologies, especially focusing on safe, fair and inclusive AI? Our Summer 2025 Language Technology for All Internship could be a great fit. See the link below for more info, and to apply:
lti.cs.cmu.edu/news-and-eve...