Haokun Liu
@haokunliu.bsky.social
87 followers 53 following 27 posts
Ph.D. Student at the University of Chicago | Chicago Human + AI Lab haokunliu.com
Reposted by Haokun Liu
chenhaotan.bsky.social
🚀 We’re thrilled to announce the upcoming AI & Scientific Discovery online seminar! We have an amazing lineup of speakers.

This series will dive into how AI is accelerating research, enabling breakthroughs, and shaping the future of science across disciplines.

ai-scientific-discovery.github.io
Reposted by Haokun Liu
itea1001.bsky.social
#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!
Reposted by Haokun Liu
chenhaotan.bsky.social
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.

This is holding us back. 🧵and new paper with @ari-holtzman.bsky.social .
Reposted by Haokun Liu
chenhaotan.bsky.social
It predicts pretty well—not just shifts in the last week, but also:

1. Who’s working an overnight shift (in our data + external validation in MIMIC)

2. Who’s working on a disruptive circadian schedule

3. How many patients the doc has seen *on the current shift*
Reposted by Haokun Liu
elenal3ai.bsky.social
🚨 New paper alert 🚨

Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️

1/n 🧵
Reposted by Haokun Liu
itea1001.bsky.social
1/n 🚀🚀🚀 Thrilled to share our latest work🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊
There’s a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability — let’s dive in! 🌊
Reposted by Haokun Liu
mheddaya.bsky.social
🧑‍⚖️How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?

Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!
Reposted by Haokun Liu
chenhaotan.bsky.social
Although I cannot make #NAACL2025, @chicagohai.bsky.social will be there. Please say hi!

@chachachen.bsky.social GPT ❌ x-rays (Friday 9-10:30)
@mheddaya.bsky.social CaseSumm and LLM 🧑‍⚖️ (Thursday 2-3:30)
@haokunliu.bsky.social @qiaoyu-rosa.bsky.social hypothesis generation 🔬 (Saturday at 4pm)
haokunliu.bsky.social
13/ Lastly, many thanks to my wonderful collaborators Sicong Huang, Jingyu Hu, @qiaoyu-rosa.bsky.social , and my advisor @chenhaotan.bsky.social !
haokunliu.bsky.social
11/ Why HypoBench matters: it establishes a structured way to advance AI's role in scientific discovery and everyday reasoning, highlighting both current capabilities and significant open challenges.
haokunliu.bsky.social
10/ Model priors matter: the models have different priors, which lead to varying behavior across tasks—generating good hypotheses is harder when prior knowledge is not helpful.
haokunliu.bsky.social
9/ And it gets worse in counterintuitive settings: models perform significantly worse when the underlying hypotheses are counterintuitive.
haokunliu.bsky.social
8/ 💡 Synthetic dataset results show: LLMs handle simple interactions well but struggle with increased noise, distractors, or subtleties in text—highlighting significant room for improvement.
haokunliu.bsky.social
7/ Qualitative insights: methods that balance novelty and plausibility are rare; iterative refinement boosts novelty but risks plausibility, while literature-driven hypotheses excel in plausibility but lack novelty.
haokunliu.bsky.social
6/ 🌍 Real-world implications: methods that integrate literature insights with data outperform simple zero/few-shot inference, and Qwen excels at generating generalizable hypotheses.
haokunliu.bsky.social
5/ 🚨 But… even top models and methods struggle significantly as task complexity rises. At base difficulty, the best model recovered 93.8% of the ground-truth hypotheses; this dropped sharply to 38.8% as complexity increased.
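For intuition, here is a minimal sketch of how a discovery-rate-style metric (the fraction of ground-truth hypotheses a method recovers) could be computed. The string-similarity matcher and all names below are illustrative assumptions, not HypoBench's actual matching procedure.

```python
# Hedged sketch: discovery rate = fraction of ground-truth hypotheses that
# at least one generated hypothesis matches. The crude string-similarity
# matcher is an assumption for illustration; the benchmark's own matching
# (e.g., LLM-judged semantic equivalence) may differ.
from difflib import SequenceMatcher

def matches(generated: str, truth: str, threshold: float = 0.8) -> bool:
    """Loose textual match between a generated and a ground-truth hypothesis."""
    return SequenceMatcher(None, generated.lower(), truth.lower()).ratio() >= threshold

def discovery_rate(generated: list[str], ground_truth: list[str]) -> float:
    """Fraction of ground-truth hypotheses recovered by the generated set."""
    recovered = sum(any(matches(g, t) for g in generated) for t in ground_truth)
    return recovered / len(ground_truth)
```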
haokunliu.bsky.social
4/ Yes, LLMs can generate effective hypotheses: we tested 4 state-of-the-art models—GPT, Qwen, Llama, and DeepSeek—with 6 existing hypothesis generation methods. We found that using Qwen and integrating literature with data (LITERATURE + DATA) yields the best results.
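As a rough illustration of what a literature-plus-data approach might look like, here is a hedged sketch that conditions an LLM on both paper excerpts and labeled examples before asking for hypotheses. The prompt wording, helper function, and model choice are assumptions for illustration, not the exact pipelines evaluated in the paper.

```python
# Hedged sketch of a LITERATURE + DATA style prompt: condition the model on
# both literature excerpts and labeled data examples. The prompt template is
# an illustrative assumption, not the paper's exact method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_hypotheses(literature: list[str], examples: list[tuple[str, str]], n: int = 5) -> str:
    lit_block = "\n".join(f"- {excerpt}" for excerpt in literature)
    data_block = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    prompt = (
        "You are generating scientific hypotheses.\n\n"
        f"Relevant literature:\n{lit_block}\n\n"
        f"Observed data:\n{data_block}\n\n"
        f"Propose {n} natural-language hypotheses that explain the observations."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper also evaluates Qwen, Llama, and DeepSeek
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```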
haokunliu.bsky.social
3/ 📊 Introducing HypoBench: Our novel benchmark spans 194 datasets across 7 real-world and 5 synthetic tasks, testing key hypothesis generation capabilities like explanatory power, generalizability, and discovery rate.
haokunliu.bsky.social
2/ 🤔 What makes a good hypothesis? Generating one requires three key skills: inductive reasoning, abstraction, and synthesis. Above all, good hypotheses should have strong explanatory power and be interesting to researchers.
haokunliu.bsky.social
1/ What is hypothesis generation? We define it clearly: a hypothesis is a natural-language explanation of observed phenomena—critical for both science and everyday reasoning.
haokunliu.bsky.social
🚀🚀🚀Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!

There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing. Let's dive into HypoBench together.