Beomseok Lee
beomseok-lee.bsky.social
PhD student @uniTrento. Affiliated with @naverlabseurope and @fbk_mt. Ex-research engineer @samsungresearch
Can we make Speech LLMs actually think as they listen? 👂💭
This fascinating work applies chain-of-thought (CoT) reasoning inspired by human “thinking while listening”, training models to find the inflection point at which reasoning should start.
📄 arxiv.org/abs/2510.07497
Can Speech LLMs Think while Listening?
Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (Co...
arxiv.org
October 29, 2025 at 12:48 PM
🤔 Ever wondered how discrete tokens vs. continuous features behave in SpeechLLMs?
This new work dives into 6 SLU tasks and reveals some interesting takeaways!
arxiv.org/abs/2508.17863
Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong c...
arxiv.org
August 28, 2025 at 9:02 AM
Speech-language models show promise in multimodal tasks—but how well are speech & text actually aligned? 🤔

This paper arxiv.org/abs/2505.19937 proposes a new metric to measure layer-wise correlation between the two, with a focus on SLU tasks. 🔍🗣️📄
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key c...
arxiv.org
June 11, 2025 at 12:53 PM
Should speech come before the instruction text, or should the instruction text come first in a speech-language model?
Find out the best positioning for speech and text—and the novel adapter that aligns speech and text modalities!
arxiv.org/abs/2412.01145
AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
Integrating speech into LLM (speech-LLM) has been gaining increased attention recently. The mainstream solution is to connect a well-trained speech encoder and LLM with a neural adapter. However, the lengt...
arxiv.org
April 3, 2025 at 10:42 AM