Hanbo Xie
@psychboyh.bsky.social
81 followers 130 following 27 posts
Third-year PhD student in the NRD Lab at Gatech. Interested in using Large Language Models to understand human decision-making and learning, and the core of human intelligence.
psychboyh.bsky.social
I am excited to share that our paper has been accepted at @neuripsconf.bsky.social. This is interesting work that uses tools and insights from computational cognitive neuroscience to understand LLMs. Nice work by @louannapan.bsky.social and @doctor-bob.bsky.social.
psychboyh.bsky.social
This result already shows that the processes revealed by think-aloud are generative and generalize to behavior. The LLM can also learn to reason like humans when given think-aloud examples, which means LLMs learn not just superficial behavioral claims from the think-aloud data, but also the underlying processes.
psychboyh.bsky.social
Regarding your question about the introspective process, I am afraid I cannot agree that people don't have introspective access to their computations. What we validated in the paper is that their think-aloud reports are predictive of their behavior on the current trial as well as on other trials.
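A minimal sketch of that validation idea (not our actual pipeline): fit a simple classifier on features coded from the think-aloud transcripts and check whether it predicts choices both in-sample and on held-out trials. The feature set and the toy data below are illustrative assumptions.

```python
# Toy sketch: do think-aloud-derived features predict choices, both on the
# current trial (in-sample) and on other trials (cross-validated)?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials = 200
# Per-trial features coded from transcripts (e.g. mentions of expected value,
# loss aversion, probability weighting) -- random stand-ins here.
X = rng.normal(size=(n_trials, 3))
# 1 = chose the risky option; generated from the features for this toy example.
y = (X @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.5, size=n_trials)) > 0

model = LogisticRegression(max_iter=1000)
in_sample = model.fit(X, y).score(X, y)                  # current-trial fit
held_out = cross_val_score(model, X, y, cv=5).mean()     # other-trial generalization
print(f"in-sample accuracy {in_sample:.2f}, cross-validated {held_out:.2f}")
```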
psychboyh.bsky.social
Thanks for your question, Harrison! The process-level insight can be descriptive, algorithmic, or computational. For example, you can code the think-aloud of a human participant in a mental arithmetic game, which reveals the search process underlying a single move; a toy sketch of such LLM-based coding follows below the link.

arxiv.org/abs/2505.23931
Scaling up the think-aloud method
The think-aloud method, where participants voice their thoughts as they solve a task, is a valuable source of rich data about human reasoning processes. Yet, it has declined in popularity in contempor...
arxiv.org
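Here is a minimal sketch of what LLM-based coding of a transcript could look like. The model name, prompt wording, and label set are assumptions for illustration, not the coding scheme from the paper.

```python
# Hypothetical sketch: coding a think-aloud transcript with an LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODING_PROMPT = (
    "You will read a think-aloud transcript from a mental arithmetic game.\n"
    "Label each reasoning step with one of: 'retrieve_fact', 'decompose',\n"
    "'estimate', 'check'. Return one label per line."
)

def code_transcript(transcript: str, model: str = "gpt-4o") -> list[str]:
    """Ask an LLM to annotate the search process revealed in a transcript."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CODING_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content.splitlines()

labels = code_transcript("Hmm, 17 times 6... 17 times 5 is 85, plus 17 is 102.")
print(labels)  # e.g. ['decompose', 'retrieve_fact', 'check'] (illustrative)
```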
psychboyh.bsky.social
This work was done jointly by @huadongxiong and @doctor-bob.bsky.social.
psychboyh.bsky.social
Can Think-Aloud really be useful for understanding human minds? Building on our previous work, we formally propose reopening this old debate with one of the largest Think-Aloud datasets, "RiskyThought44K," and LLM analysis, showing that Think-Aloud can complement comp cogsci.
psychboyh.bsky.social
You are welcome to comment below, share our work, or request code and data to replicate or extend it!
psychboyh.bsky.social
In sum, our work uses an open-ended task to evaluate LLMs' open-ended exploration capacity and suggests important differences between traditional LLMs and reasoning models in their cognitive capacities. We want to thank @frabraendle.bsky.social, @candemircan.bsky.social, and Huadong Xiong for their help!
psychboyh.bsky.social
These attempts provide evidence that staying within traditional LLMs' standard inference paradigm is not a useful way to solve open-ended exploration problems. Instead, test-time compute scaling in reasoning models (or, put simply, 'spending more time to think') can work.
psychboyh.bsky.social
In our discussion, we mention that we attempted multiple approaches, including prompt engineering, interventions, and alternative models. None of these changed the situation, except for using the reasoning model DeepSeek-R1.
psychboyh.bsky.social
This suggests the models are 'thinking too fast': their choices are dominated by early representations in the model and do not 'wait' until the model has effectively integrated empowerment information, which keeps traditional LLMs from performing better in this task.
psychboyh.bsky.social
The SAE results suggest that the empowerment and uncertainty strategies are represented in LLaMA-3.1 70B, with relatively strong correlations with latent neurons. However, choices and uncertainty are most strongly correlated in early transformer blocks, while empowerment is represented in later blocks.
psychboyh.bsky.social
There are two possibilities: either traditional LLMs do not know about 'empowerment,' or they know it but overlook it during information processing! To test these hypotheses, we used Sparse Autoencoders (SAEs) to probe whether and where the models represent those strategies.
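A rough sketch of the probing logic, on toy data rather than our actual analysis: encode each block's hidden states with a (pretrained; here random) sparse autoencoder and ask how strongly any latent neuron correlates with the per-trial uncertainty and empowerment values. All sizes and arrays below are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, d_model, d_latent, n_blocks = 300, 64, 256, 4   # toy sizes

def sae_encode(hidden, W_enc, b_enc):
    """SAE encoder: sparse latents via ReLU(h @ W_enc + b_enc)."""
    return np.maximum(hidden @ W_enc + b_enc, 0.0)

def best_latent_correlation(latents, strategy_values):
    """Strongest |Pearson r| between any latent neuron and a strategy value."""
    z = (latents - latents.mean(0)) / (latents.std(0) + 1e-8)
    s = (strategy_values - strategy_values.mean()) / (strategy_values.std() + 1e-8)
    return np.max(np.abs(z.T @ s / len(s)))

uncertainty = rng.normal(size=n_trials)   # per-trial strategy values (stand-ins)
empowerment = rng.normal(size=n_trials)

for block in range(n_blocks):
    hidden = rng.normal(size=(n_trials, d_model))          # stand-in activations
    W_enc, b_enc = rng.normal(size=(d_model, d_latent)), np.zeros(d_latent)
    latents = sae_encode(hidden, W_enc, b_enc)
    r_unc = best_latent_correlation(latents, uncertainty)
    r_emp = best_latent_correlation(latents, empowerment)
    print(f"block {block}: uncertainty r={r_unc:.2f}, empowerment r={r_emp:.2f}")
```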
psychboyh.bsky.social
Strategy usage explains well why traditional LLMs are worse than humans and why o1 can surpass human performance (effective strategy use!). We were then curious why traditional LLMs cannot balance those strategies well.
psychboyh.bsky.social
Our results suggest that humans balance these two strategies fairly well, while traditional LLMs mainly use uncertainty-driven strategies rather than empowerment, which yields only short-term competence while the action space is small. o1 uses both strategies more than humans do.
psychboyh.bsky.social
There are two possible strategies for this task: one is uncertainty-driven, where agents explore based on the uncertainty of elements; the other is empowerment, where agents explore based on an understanding of the task structure (the inventory tree).
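To make the distinction concrete, here is a toy illustration of the two strategies. The formal definitions in the paper are model-based; the value functions and the tiny recipe book below are illustrative assumptions only.

```python
import itertools
import math

def uncertainty_value(pair, try_counts):
    """Uncertainty-driven: prefer combinations that have been tried less often."""
    return 1.0 / (1.0 + try_counts.get(pair, 0))

def empowerment_value(pair, recipe_book):
    """Empowerment: prefer pairs whose products feed into many later recipes,
    i.e. pairs that open up more of the inventory tree."""
    products = set(recipe_book.get(pair, []))
    return sum(1 for ingredients in recipe_book if products & set(ingredients))

def choose_combination(inventory, try_counts, recipe_book, w_unc=0.5, w_emp=0.5):
    """Score every pair of known elements with a weighted mix of both strategies."""
    best_pair, best_score = None, -math.inf
    for pair in itertools.combinations(sorted(inventory), 2):
        score = (w_unc * uncertainty_value(pair, try_counts)
                 + w_emp * empowerment_value(pair, recipe_book))
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair

recipe_book = {("earth", "water"): ["mud"], ("air", "fire"): ["energy"],
               ("fire", "mud"): ["brick"]}
print(choose_combination({"air", "earth", "fire", "water"}, {}, recipe_book))
```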
psychboyh.bsky.social
The result is intriguing. Traditional LLMs (GPT-4o, LLaMA-3.1 8B and 70B) perform far worse than humans, while reasoning models like o1 and the popular DeepSeek-R1 (see appendix) can reach or surpass human-level performance.
psychboyh.bsky.social
Therefore, we borrowed a paradigm with human data from a game-like experiment, 'Little Alchemy 2,' where agents combine known elements to invent novel ones. We ask: (1) can LLMs do better than humans? (2) What strategies do they use? And (3) what mechanisms explain the performance?
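A rough sketch of what the evaluation loop for such a task could look like. The prompt wording and the `query_llm` / `combine` callables are hypothetical stand-ins (e.g. an API call and the game's recipe table), not the paper's exact setup.

```python
def run_episode(query_llm, combine, n_trials=500):
    """query_llm(prompt) -> model reply; combine(a, b) -> new element or None."""
    inventory = {"water", "fire", "earth", "air"}      # the four starting elements
    for _ in range(n_trials):
        prompt = (
            f"You have discovered these elements: {sorted(inventory)}.\n"
            "Choose two elements to combine, answering exactly as: A + B"
        )
        reply = query_llm(prompt)
        if "+" not in reply:
            continue                                   # skip malformed replies
        a, b = [x.strip() for x in reply.split("+", 1)]
        result = combine(a, b)
        if result is not None:
            inventory.add(result)                      # a successful discovery
    return inventory
```

Performance can then be summarized as the number of novel elements discovered within the trial budget, which makes the comparison with human players straightforward.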
psychboyh.bsky.social
Exploration is an important capacity of both natural and artificial intelligence, but people rarely discuss how well LLMs can explore. Previous studies mainly focus on bandit tasks, which are closed-form problems. However, exploration also exists in open-ended environments.
psychboyh.bsky.social
Large Language Models can do a lot of things. But did you know they cannot explore effectively, especially in open-ended tasks? Recently, Lan Pan and I dropped a preprint investigating how LLMs explore in an open-ended task.
arxiv.org/abs/2501.18009
Large Language Models Think Too Fast To Explore Effectively
Large Language Models have emerged many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacit...
arxiv.org