Haokun Liu
@haokunliu.bsky.social
87 followers 53 following 27 posts
Ph.D. Student at the University of Chicago | Chicago Human + AI Lab haokunliu.com
Reposted by Haokun Liu
chenhaotan.bsky.social
🚀 We’re thrilled to announce the upcoming AI & Scientific Discovery online seminar! We have an amazing lineup of speakers.

This series will dive into how AI is accelerating research, enabling breakthroughs, and shaping the future of science across disciplines.

ai-scientific-discovery.github.io
Reposted by Haokun Liu
itea1001.bsky.social
#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!
Reposted by Haokun Liu
chenhaotan.bsky.social
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.

This is holding us back. 🧵and new paper with @ari-holtzman.bsky.social .
Reposted by Haokun Liu
chenhaotan.bsky.social
It predicts pretty well—not just shifts in the last week, but also:

1. Who’s working an overnight shift (in our data + external validation in MIMIC)

2. Who’s working on a disruptive circadian schedule

3. How many patients the doc has seen *on the current shift*
Reposted by Haokun Liu
elenal3ai.bsky.social
🚨 New paper alert 🚨

Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️

1/n 🧵
Reposted by Haokun Liu
itea1001.bsky.social
1/n 🚀🚀🚀 Thrilled to share our latest work🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊
There’s a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability — let’s dive in! 🌊
Reposted by Haokun Liu
mheddaya.bsky.social
🧑‍⚖️How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?

Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!
Reposted by Haokun Liu
chenhaotan.bsky.social
Although I cannot make #NAACL2025, @chicagohai.bsky.social will be there. Please say hi!

@chachachen.bsky.social GPT ❌ x-rays (Friday 9-10:30)
@mheddaya.bsky.social CaseSumm and LLM 🧑‍⚖️ (Thursday 2-3:30)
@haokunliu.bsky.social @qiaoyu-rosa.bsky.social hypothesis generation 🔬 (Saturday at 4pm)
haokunliu.bsky.social
13/ Lastly, many thanks to my wonderful collaborators Sicong Huang, Jingyu Hu, @qiaoyu-rosa.bsky.social , and my advisor @chenhaotan.bsky.social !
haokunliu.bsky.social
11/ Why HypoBench matters: it establishes a structured way to advance AI's role in scientific discovery and everyday reasoning, highlighting both current capabilities and significant open challenges.
haokunliu.bsky.social
10/ Model priors matter: the models have different priors, which lead to varying behavior across tasks—generating good hypotheses is harder when prior knowledge is not helpful.
haokunliu.bsky.social
9/ And it gets worse in counterintuitive settings: models perform significantly worse when the underlying hypotheses are counterintuitive.
haokunliu.bsky.social
8/ 💡 Synthetic dataset results show: LLMs handle simple interactions well but struggle with increased noise, distractors, or subtleties in text—highlighting significant room for improvement.
haokunliu.bsky.social
7/ Qualitative insights: methods that balance novelty and plausibility are rare; iterative refinement boosts novelty but risks plausibility, while literature-driven hypotheses excel in plausibility but lack novelty.
haokunliu.bsky.social
6/ 🌍 Real-world implications: methods that integrate literature insights with data outperform simple zero/few-shot inference, and Qwen excels at generating generalizable hypotheses.
haokunliu.bsky.social
5/ 🚨 But… even top models and methods struggle significantly as task complexity rises. At base difficulty, the best model recovered 93.8% of the ground-truth hypotheses; this dropped sharply to 38.8% as complexity increased.
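For intuition, here is a minimal sketch of how a discovery-rate-style metric (the fraction of ground-truth hypotheses a method recovers) could be computed. The string-similarity matcher and all names below are illustrative assumptions, not HypoBench's actual matching procedure.

```python
# Hedged sketch: discovery rate = fraction of ground-truth hypotheses that
# at least one generated hypothesis matches. The crude string-similarity
# matcher is an assumption for illustration; the benchmark's own matching
# (e.g., LLM-judged semantic equivalence) may differ.
from difflib import SequenceMatcher

def matches(generated: str, truth: str, threshold: float = 0.8) -> bool:
    """Loose textual match between a generated and a ground-truth hypothesis."""
    return SequenceMatcher(None, generated.lower(), truth.lower()).ratio() >= threshold

def discovery_rate(generated: list[str], ground_truth: list[str]) -> float:
    """Fraction of ground-truth hypotheses recovered by the generated set."""
    recovered = sum(any(matches(g, t) for g in generated) for t in ground_truth)
    return recovered / len(ground_truth)
```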
haokunliu.bsky.social
4/ Yes, LLMs can generate effective hypotheses: we tested 4 state-of-the-art models—GPT, Qwen, Llama, and DeepSeek—with 6 existing hypothesis generation methods. We found that using Qwen and integrating literature with data (LITERATURE + DATA) yields the best results.
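As a rough illustration of what a literature-plus-data approach might look like, here is a hedged sketch that conditions an LLM on both paper excerpts and labeled examples before asking for hypotheses. The prompt wording, helper function, and model choice are assumptions for illustration, not the exact pipelines evaluated in the paper.

```python
# Hedged sketch of a LITERATURE + DATA style prompt: condition the model on
# both literature excerpts and labeled data examples. The prompt template is
# an illustrative assumption, not the paper's exact method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_hypotheses(literature: list[str], examples: list[tuple[str, str]], n: int = 5) -> str:
    lit_block = "\n".join(f"- {excerpt}" for excerpt in literature)
    data_block = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    prompt = (
        "You are generating scientific hypotheses.\n\n"
        f"Relevant literature:\n{lit_block}\n\n"
        f"Observed data:\n{data_block}\n\n"
        f"Propose {n} natural-language hypotheses that explain the observations."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper also evaluates Qwen, Llama, and DeepSeek
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```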
haokunliu.bsky.social
3/ 📊 Introducing HypoBench: Our novel benchmark spans 194 datasets across 7 real-world and 5 synthetic tasks, testing key hypothesis generation capabilities like explanatory power, generalizability, and discovery rate.
haokunliu.bsky.social
2/ 🤔 What makes a good hypothesis? Generating one requires three key skills: inductive reasoning, abstraction, and synthesis. Above all, good hypotheses should have strong explanatory power and be interesting to researchers.
haokunliu.bsky.social
1/ What is hypothesis generation? We define it clearly: a hypothesis is a natural-language explanation of observed phenomena—critical for both science and everyday reasoning.
haokunliu.bsky.social
🚀🚀🚀Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!

There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing. Let's dive into HypoBench together.